Friday, December 10, 2010

Six Data Management Predictions for 2011

This time of year everyone makes prognostications about the state of the data management field for 2011. I thought I’d take my turn by offering my predictions for the coming year.

Data will become more open
In the old days, good quality reference data was an asset kept in the corporate lockbox. If you had a good reference table for common misspellings of parts, cities, or names, for example, the mindset was to keep it close and out of the wrong hands. The data might have been sold for profit or simply not made available. Today, there really are no “wrong hands”. Governments and corporations alike are seeing the societal benefits of sharing information. More reference data is there for the taking on the internet. That trend will continue in 2011. Perhaps we’ll even see some of the bigger players make announcements as to the availability of their data. Are you listening, Google?

Business and IT will become blurry
It’s becoming harder and harder to tell an IT guy from the head of marketing. That’s because in order to succeed, the IT folks need to become more like the marketer and vice versa. In the coming year, the difference will be less noticeable as business people get more and more involved in using data to their benefit. Newsflash One: If you’re in IT, you need marketing skills to pitch your projects and get funding. Newsflash Two: If you’re in business, you need to know enough about data management practices to succeed.

Tools will become easier to use
As the business users come into the picture, they will need access to the tools to manage data.  Vendors must respond to this new marketplace or die.

Tools will do less heavy lifting
Despite the improvements in the tools, corporations will turn to improving processes and reporting in order to achieve better data management. Dwindling are the days when we’re dealing with data that is so poorly managed that it requires overly complicated data quality tools. We’re getting better at the data management process and therefore, the burden on the tools becomes less. Future tools will focus on supporting process improvement with workflow features, reporting and better graphical user interfaces.

CEOs and Government Officials will gain enlightenment
Feeding off the success of a few pioneers in data governance as well as failures of IT projects in our past, CEOs and governments will gain enlightenment about managing their data and put teams in place to handle it.  It has taken decades of our sweet-talk and cajoling for government and CEOs to achieve enlightenment, but I believe it is practically here.

We will become more reliant on data
Ten years ago, it was difficult to imagine us where we are today with respect to our data addiction. Today, data is a pervasive part of our internet-connected society, living in our PCs, our TVs, our mobile phones and many other devices. It’s a huge part of our daily lives. As I’ve said in past posts, the world is addicted to data and that bodes well for anyone who helps the world manage it. In 2011, no matter if the economy turns up or down, our industry will continue to feed the addiction to good, clean data.

Tuesday, November 30, 2010

Match Mitigation: When Algorithms Aren’t Enough

I’d like to get a little technical on this post. I try to keep my posts business-friendly, but sometimes there's importance in detail. If none of this post makes any sense to you, I wrote a sort of primer on how matching works in many data quality tools, which you can get here.

Matching Algorithms
When you use a data quality tool, you’re often using matching algorithms and rules to make decisions on whether records match or not. You might be using deterministic algorithms like Jaro, SoundEx and Metaphone. You might also be using probabilistic matching algorithms.

In many tools, you can set the rules to be tight where the software uses tougher criteria to determine a match, or loose where the software is not so particular. Tight and loose matches are important because you may have strict rules for putting records together, like customers of a bank, or not so strict rules, like when you’re putting together a customer list for marketing purposes.

What to do with Matches
Once data has been processed through the matcher, there are several possible outcomes. Between any two given records, the matcher may find:

  • No relationship
  • Match – the matcher found a definite match based on the criteria given
  • Suspect – the matcher thinks it found a match but is not confident. The results should be manually reviewed.
It’s that last category that is the tough one. Mitigating the suspect matches is the most time-consuming follow-up task after the matching is complete. Envision a million-record database where you have 20,000 suspect matches. That’s still going to take you some time to review.
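
To make the three outcomes concrete, here is a minimal sketch of how a matcher might sort a pair of values into match, suspect or no relationship. It is only an illustration: the similarity score comes from Python's standard difflib module rather than Jaro or any vendor algorithm, and the function names and the two threshold values are assumptions I made up for the example, not settings from any particular tool.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds for illustration; real tools expose their own tuning knobs.
MATCH_THRESHOLD = 0.90    # at or above this score, call it a definite match
SUSPECT_THRESHOLD = 0.75  # between the two thresholds, send to manual review


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score, ignoring case and surrounding whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def classify(a: str, b: str,
             match_at: float = MATCH_THRESHOLD,
             suspect_at: float = SUSPECT_THRESHOLD) -> str:
    """Sort a pair of values into one of the three matcher outcomes."""
    score = similarity(a, b)
    if score >= match_at:
        return "match"
    if score >= suspect_at:
        return "suspect"
    return "no relationship"


if __name__ == "__main__":
    # A loose rule accepts the pair; a tighter rule pushes it to manual review.
    print(classify("Jon Smith", "John Smith"))
    print(classify("Jon Smith", "John Smith", match_at=0.95))
```

Raising match_at is the "tight" setting and lowering it is the "loose" one. The wider the gap between the two thresholds, the more pairs land in the suspect pile that someone has to review by hand.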

Some of the newer (and cooler) tools offer strategies for dealing with suspect matches. The tools will present the suspect matches in a graphical user interface and allow users to pick which relationships are accurate and which are not. For example, Talend now offers a data stewardship console that lets you pick and choose records and attributes that will make up a best of breed record.

The goal, of course, is to not have suspect matches, so tuning the matches and limiting the suspect matches is the ultimate aim. The newest tools will make this easy. Some of the legacy tools make this hard.

Match mitigation is perhaps one of the most often overlooked processes of data quality. Don’t overlook it in your planning and processes.

Tuesday, November 16, 2010

Ideas Having Sex: The Path to Innovation in Data Management

I read a recent analyst report on the data quality market and “enterprise-class” data quality solutions. Per usual, the open source solutions were mentioned in passing while the data quality solutions of the past were given high marks. Some of the solutions picked in the top originated from days when mainframe was king. Some of the top contenders still contained cobbled-together applications from ill-conceived acquisitions. It got me thinking about the way we do business today and how so much of it is changing.

Back in the 1990’s or earlier, if you had an idea for a new product, you’d work with an internal team of engineers and build the individual parts.  This innovation took time, as you might not always have exactly the right people working on the job.  It was slow and tedious. The product was always confined by its own lineage.

The Android phone market is a perfect example of the modern way to innovate. Today, when you want to build something groundbreaking like an Android, you pull in expertise from all around the world. Sure, Samsung might make the CPU and video processing chips, but Primax Electronics in Taiwan might make the digital camera and Broadcom in the US makes the touch screen, plus many others. Software vendors push the platform further with their cool apps. Innovation happens at break-neck speed because the Android is a collection of ideas that have sex and produce incredible offspring.

Isn’t that really the model of a modern company? You have ideas getting together and making new ideas. When you have free exchange between people, there is no need to re-invent something that has already been invented. See the TED talk for more on this concept, where British author Matt Ridley argues that, through history, the engine of human progress and prosperity is “ideas having sex.”

The business model behind open source has a similar mission.  Open source simply creates better software. Everyone collaborates, not just within one company, but among an Internet-connected, worldwide community. As a result, the open source model often builds higher quality, more secure, more easily integrated software. It does so at a vastly accelerated pace and often at a lower cost.

So why do some industry analysts ignore it? There’s no denying that there are capitalist and financial reasons. I think if an industry analyst were to actually come out and say that the open source solution is the best, it would be career suicide. The old school would shun the analyst, making him less relevant. The link between the way the industry pays and promotes analysts and vice versa seems to favor enterprise application vendors.

Yet the open source community along with Talend has developed a very strong data management offering that should be considered in the top of its class. The solution leverages other cutting edge solutions. To name just a few examples:
  • if you want to scale up, you can use distributed platform technology from Hadoop, which enables it to work with thousands of nodes and petabytes of data.
  • very strong enterprise class data profiling.  
  • matching that users can actually use and tune without having to jump between multiple applications.
  • a platform that grows with your data management strategy so that if your future is MDM, you can seamlessly move there without having to learn a new GUI.
The way we do business today has changed. Innovation can only happen when ideas have sex, as Matt Ridley puts it. As long as we’re engaged in exchange and specialization, we will achieve those new levels of innovation.

Saturday, October 16, 2010

Is 99.8 % data accuracy enough?

Ripped from recent headlines, we see how even a .2% failure can have a big impact.

WASHINGTON (AP) ― More than 89,000 stimulus payments of $250 each went to people who were either dead or in prison, a government investigator says in a new report.

Let’s take a good, hard look at this story. It begins with the US economy slumping. The president proposes and passes through Congress one of the biggest stimulus packages ever. The idea is sound to many: get America working by offering jobs in green energy and shovel-ready infrastructure projects. Among other actions, the plan is to give lower income people some government money so they can stimulate the economy.

I’m not really here to praise or zing the wisdom of this. I’m just here to give the facts. In hindsight, it appears as though it hasn’t stimulated the economy as many had hoped, but that’s beside the point.

Continuing on, the government issues 52 million people on social security a check for $250. It turns out of that number nearly 100,000 people were in prison or dead, roughly 0.2% of the checks. Some checks are returned, some are cashed. Ultimately, the government loses $22.3 million on the 0.2% error.
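
For anyone who wants to check the math, here is a quick sketch of the arithmetic using only the figures quoted above. The AP story says "more than 89,000" payments, so treat the results as approximate; the variable names are mine.

```python
# Figures quoted in the story above; 89,000 is a lower bound ("more than 89,000").
payments_in_error = 89_000
payment_amount = 250           # dollars per check
total_payments = 52_000_000    # checks issued

loss = payments_in_error * payment_amount
error_rate = payments_in_error / total_payments
accuracy = 1 - error_rate

print(f"Loss: ${loss:,}")
print(f"Error rate: {error_rate:.2%}")
print(f"Accuracy: {accuracy:.2%}")
```

The loss works out to $22,250,000, which rounds to the $22.3 million cited, and the error rate comes to a little under 0.2%.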

While $22.3 million is a HUGE number, 0.2% is a tiny number. It strikes at the heart of why data quality is so important. Social Security spokesman Mark Lassiter said, "…Each year we make payments to a small number of deceased recipients usually because we have not yet received reports of their deaths."

There is strong evidence that the SSA is hooked up to the right commercial data feeds and has the processes in place to use them. It seems as though the Social Security Administration is quite proactive in its search for the dead and imprisoned, but people die and go to prison all the time. They also move, get married and become independent of their parents.

If we try to imagine what it would take to achieve closer to 100% accuracy, it would take up-to-the-minute reference data. It seems that the only real solution is to put forth legislation that requires reporting any of these life-changing events to the federal government. Should we mandate the bereaved or perhaps funeral directors to report the death immediately in a central database? Even with such a law, there still would be a small percentage of checks that would be issued while the recipient was alive and delivered after the recipient is dead. We’d have better accuracy for this issue, but not 100%.

While this story takes a poke at the SSA for sending checks to dead people, I have to applaud their achievement of 99.8% accuracy. It could be a lot worse, America. A lot worse.

Saturday, August 28, 2010

ERP and SCM Data Profiling Techniques

In this YouTube tutorial for Talend, I walk through some techniques for profiling ERP, SCM and materials master data using Talend Open Profiler. In addition to basic profiling, the correlation analysis feature can be used to identify relationships between part numbers and descriptions.

Monday, August 16, 2010

Data Governance and Data Quality Insider 100th

I have reached my 100th post milestone.  I hope you won't mind if I get a little introspective here and tell you a little about my social media journey over these past three years.

How did I get started?  One day back in 2007, I disagreed with Vince McBurney’s post (topic unimportant now).  I responded and Vince politely told me to shut up and if I really wanted to have an opinion to write my own blog.  I did.  Thanks for the kick in the pants, Vince.

Some of my most popular posts over these past three years have been:

  • Probabilistic Matching: Sounds like a good idea, but…
    Here, I take a swipe at the sanctity of probabilistic matching. I probably have received the most hate-mail from this post. My stance still is that a hybrid approach to matching, using both probabilistic and deterministic methods, is key to getting good match results. Probabilistic alone is not the solution.
  • Data Governance and the Coke Machine Syndrome
    I recount a parable given to me by a well-respected boss in my past about meeting management. Meetings can take unexpected turns where huge issues can be settled in minutes, while insignificant ones can eat up the resources of your company. I probably wrote it just after a meeting.
  • Data Quality Project Selection
    A posting about picking the right data quality projects to work on.
  • The “Do Nothing” Option
    A posting that recounts a lesson I learned about selling the power of data quality to management.
Somewhere around my 50th post, I was contacted by a small publishing firm in the UK about publishing a book on data governance. They liked what they saw in the blog. I published the Data Governance Imperative in 2009. I drew upon my experiences with some of the people I met while working in the industry. It's thanks to some of you that the book is a reality.

Blogging has not always been easy. I’ve met some opposition along the way. There were times when my blogging was perceived as somehow threatening to corporate. At the time, blogging was new and corporations didn't know how to handle it. More companies now have definitive blogging policies and realize the positive impact it has.

What about the people I’ve met? I’ve gained a lot of friendships along the way with people I’ve yet to meet face-to-face. We’re able to build a community here in cyberspace – a data geek community that I am very fond of.  I’m hesitant to write a list because I don’t want to leave anyone out, but you know who you are.

If you're thinking of blogging, please, find something you’re passionate about and write.  You’ll have a great time!

Thursday, August 12, 2010

Change Management and Data Governance

Years ago, I worked for a large company that spent time and effort on change management. It has been popular with corporations that plan significant changes as they grow or down-size. Companies, particularly high-tech companies, use change management to be more agile and respond to rapid changes in the market.

As I read through the large amount of information on change management, I’m struck by the parallels between change management and data governance. The focus is on processes. It ensures that no matter what changes happen in a corporation, whether it’s downsizing or rapid growth, significant changes are implemented in an orderly fashion and make everyone more effective.

On the other hand, humans are resistant to change. Change management aims to gain buy-in from management to achieve the organization's goal of an orderly and effective transformation. Sound familiar? Data governance speaks to this ability to manage data properly, no matter what growth spurts, mergers or downsizing occur. It is about changing the hearts and minds of individuals to better manage data and achieve more success while doing so.

Change Management Models
As you examine data governance models, look toward change management models that have been developed by vendors and analysts in the change management space.  One that struck my attention was the ADKAR model developed by a company called Prosci. In this model, there are five specific stages that must be realized in order for an organization to successfully change. They include:
  • Awareness - An organization must know why a specific change is necessary.
  • Desire - The organization must have the motivation and desire to participate in the call for change.
  • Knowledge – The organization must know how to change. Knowing why you must change is not enough.
  • Ability - Every individual in the company must implement new skills and processes to make the necessary changes happen.
  • Reinforcement - Individuals must sustain the changes, making them the new behavior, averting the tendency to revert back to their old processes.
These same factors can be applied when assessing how to change our own teams to manage data more effectively.  Positive change will only come if you work on all of these factors.

I often talk about business users and IT working together to solve the data governance problem. By looking at the extensive information available on change management, you can learn a lot about making changes for data governance.

Monday, August 9, 2010

Data Quality Pro Discussion

Last week I sat down with Dylan Jones of Data Quality Pro to talk about data governance. Here is the replay. We discussed a range of topics including organic governance approaches, challenges of defining data governance, industry adoption trends, policy enforcement vs legislature and much more.


Friday, July 30, 2010

Deterministic and Probabilistic Matching White Paper

I’ve been busy this summer working on a white paper on record matching, the result of which is available on the Talend web site here.

The white paper is sort of a primer containing elementary principles of record matching. As the description says, it outlines the basic theories and strategies of record matching. It describes the nuances of deterministic and probabilistic matching and the algorithms used to identify relationships within records. It covers the processes to employ in conjunction with matching technology to transform raw data into powerful information that drives success in enterprise applications like CRM, data warehouse and ERP.

Wednesday, July 28, 2010

DGDQI Viewer Mail

From time to time, people read my blog or book and contact me to chat about data governance and data quality. I welcome it. It’s great to talk to people in the industry and hear their concerns.

Occasionally, I see things in my in-box that bother me, though.  Here is one item that I’ll address in a post. The names have been changed to protect the innocent.

A public relations firm asked:

Hi Steve,
I wonder if you could answer these questions for me.
- What are the key business drivers for the advent of data governance software solutions?
- What industries can best take advantage of data governance software solutions?
- Do you see cloud computing-based data governance solutions developing?

I couldn’t answer these questions, because they all pre-supposed that data governance is a software solution.  It made me wonder if I have made myself clear enough on the fact that data governance is mostly about changing the hearts and minds of your colleagues to re-think their opinion of data and its importance.  Data governance is a company’s mindful decision that information is important and they’re going to start leveraging it. Yes, technology can help, but a complete data governance software solution would have more features than a Workchamp XL Swiss Army Knife. It would have to include data profiling, data quality, data integration, business process management, master data management, wikis, a messaging platform, a toothpick and a nail file in order to be complete. 

Can you put all this on the cloud?  Yes.  Can you put the hearts and minds of your company on a cloud?  If only it were that easy...

Wednesday, July 21, 2010

Lemonade Stand Data Quality

My children expressed interest in opening up a lemonade stand this weekend. I’m not sure if it’s done worldwide, but here in America every kid between the age of five and twelve tries their hand at earning extra money during the summer months. Most parents in America indulge this because the whole point of a lemonade stand is really to learn about capitalism. You figure out your costs, how much the lemonade, ice and cups cost, then you charge a little more than what it costs you. At the end of the day, you can hope to show a little profit.

I couldn’t help but think there are lessons we can learn from the lemonade stand that apply to the way we manage our own data quality initiatives.  Data governance programs and data quality projects are still driven by capitalism and lemonade stand fundamentals.

  • Concept – Just as the lemonade stand requires your audience to have a clear understanding of the product and the price, so does data quality. In the data world, profiling can help you create an accurate assessment of your data and tell the world exactly what it is and how much it’s going to cost.
  • Marketing – My kids proved that more people will come to your lemonade stand if you shout out “Ice Cold Lemonade” and put a few flyers around the neighborhood. Likewise you need to tell management, business people and anyone who will listen about data quality – it’s ice cold and delicious.
  • Pricing – A lemonade stand works by setting the right price. Too little and the profit will be too low, too high and no one will buy. In the data quality world, a project scoped with the proper amount of spend and the right return on investment will be successful.
  • Location – While a busy street and a hot day make a profitable lemonade stand, data quality project managers know that you begin by picking the projects with the least effort and highest potential ROI. In turn, you get to open more lemonade stands and build your data quality projects into a data governance program.

When it comes down to it, data quality projects are a form of capitalism; you need to sell the customers a refreshing glass and keep them coming back for more.

Friday, July 9, 2010

The Book of Life

I found a very interesting paper on record matching in my research for my own upcoming white paper. The paper was written by the Chief of the National Office of Vital Statistics for the U. S. Public Health Service. The paper describes, in almost poetic fashion, a person’s “book of life”. It describes how a person leaves a data trail as they go through life. It describes how hard it is to put together the pages of your book of life as you get born, get married, change homes, earn degrees and certifications.

Naturally, there are benefits to society for each person having their own book of life. In the case of the bureau chief, he cited the need to understand what factors influence health and longevity. The tricky part, he said, was to “bind the book of life” despite its tendency to be misaligned, non-standard and incoherent.

It sounds like the good Doctor is describing record matching and data cleansing, and to some degree a national ID. But the most interesting and amazing thing about this is that the paper was written in 1946. Even back then, there were smart people who knew what we had to do to bring benefit to society.

Thursday, May 13, 2010

Three Conversations to Have with an Executive - the Only Three

If you’re reading this, you’re most likely in the business of data management. In many companies, particularly large ones, the folks who manage data don’t much talk to the executives. But every so often, there is that luncheon, a chance meeting in the elevator, or even a break from a larger meeting where you and an executive are standing face to face. (S)he asks what you’re working on. Like a boy scout, be prepared. Keep your response to one of these three things:

  1. Revenue – How has your team increased revenue for the corporation?
  2. Efficiency – How has your team lowered costs by improving efficiency for the corporation?
  3. Risk – How have you and your team lowered the risk to the corporation with better compliance to corporate regulations?

The executive doesn’t want to hear about schemas, transformations or even data quality. Some examples of appropriate responses might include:

  • We work on making the CRM/ERP system more efficient by keeping an eye on the information within it. My people ensure that the reports are accurate and complete so you have the tools to make the right decisions.
  • We’re doing things like making sure we’re in compliance with [HIPAA/Solvency II/Basel II/Antispam] so no one runs afoul of the law.
  • We’re speeding up the time it takes to get valuable information to the [marketing/sales/business development] team so they can react quickly to sales opportunities.
  • We’re fixing [business problem] to [company benefit].

When you talk to your CEO, it’s your opportunity to get him/her in the mindset that your team is beneficial, so when it comes to funding, it will be something they remember. It’s your chance to get out of the weeds and elevate the conversation. Let the sales guys talk about deals. Let the marketing people talk about the market forces or campaigns. As data champions, we also need to be prepared to talk about the value we bring to the game.

Thursday, May 6, 2010

Open Source Data Profiler Demo

Here I am giving a demo on Talend Open Profiler. Demo and commentary included.

You can download Talend Open Profiler here.

Tuesday, May 4, 2010

Are we ready for a National ID card in America?

I read with interest the story about the National ID card.  Although I don’t like to link myself to one political party or another, I applaud the effort of trying to get a system in place for national ID. I like efficient data.

However, I have my doubts that a group of Senators can really understand the enormous challenges of such a project.  The issue is a politically charged one for certain, so that will be the focus. The details, which we all know contain the devil, will likely be forgotten.

I recall just a short time ago the US government’s Cash for Clunkers program. The program involved turning in your old “clunker” when buying a new, fuel-efficient car. The idea was to support the auto industry and get the gas guzzlers off the road. The devil was in the details, however. Rather than a secure web site with sufficient backbone to properly serve car dealerships, the program required the dealers to complete pages and pages of paperwork… real paper paperwork… and fax it into the newly formed government agency for approval. Then they hired workers on the other end to enter the data. It was a business process that would have been appropriate for 1975, not 2010.

ACLU legislative counsel Christopher Calabrese said of this National ID program that “all of this will come with a new federal bureaucracy — one that combines the worst elements of the DMV and the TSA”. Based on recent history it’s an accurate description of what will likely happen.

If the government wants to do this thing, they need to bring in a dream team of database experts. Guys like Dr. Ralph Kimball or Bill Inmon, both of whom are world-renowned for data modeling, should contribute if they are willing. They should ask Dr. Rich Wang from MIT’s IQ program to be in charge of information quality issues. They should invite guys like Jim Harris to communicate the complex issues to the public. Also, they need to bring in folks with practical experience, like a Jill Dyche or Gwen Thomas. There are probably some others that I haven’t mentioned. Security experts, hardware scalability experts and business process experts need to be part of the mix to protect the citizenry of the United States. They would need to make a plan without bias toward any district or political action committee. That’s why a national database won’t happen.

Don’t get me wrong, if we do so, we could come up with much more efficient systems for checking backgrounds, I-9 job verification, international travel, and more. Identity theft is a big problem here and everywhere, but with a central citizen repository, the US could legislate a notification system when new bank accounts are opened in your name. The census would always show a more accurate number and wouldn't cost us billions and billions of dollars every ten years. Let's face it, the business process of the census, mailing paper forms and personal door to door interviews, is outdated.

Let’s start this by making it voluntary. If you want to be in the database and avoid long lines at the airport, fine.  If you want to be anonymous and wait, that’s fine, too.  We’ll get the kinks worked out with the early adopters and roll it out to the laggards later.

What we’re really talking about here is a personal primary key.  That data already exists in multiple linkable systems with your name and addresses (past and present) linking it.  We as data professionals spend a lot of time and effort working with data to try to find these links. So why not have a primary key to link your personal data instead? Are you really giving up anything that  DBAs haven't already figured out?

For those of you against a national database, I don’t think you have anything to fear.  Call me a skeptic, but given the political divide between groups, it’s unlikely that any national database of citizens will be done within this decade. But if you’re listening Senators and you decide to move forward, make sure you have the right people, processes and technology in place to do it right.

Friday, April 9, 2010

Links from my eLearning Webinar

I recently delivered a webinar on the Secrets of Affordable Data Governance. In the webinar, I promised to deliver links for lowering the costs of data management.  Here are those links:

  • Talend Open Source - Download free data profiling, data integration and MDM software.
  • US Census - Download census data for cleansing of city name and state with latitude and longitude appends.
  • Data available from the US government.
  • Geonames - Postal codes and other location reference data for almost every country in the world.
  • GRC Data - A source of low-cost customer reference data, including names, addresses, salutations, and more.
  • Regular Expressions - Check the shape of data in profiling software or within your database application.
If you search on the term "download reference data", you will find many other sources.
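
As a quick illustration of what "checking the shape of data" with regular expressions can look like, here is a minimal sketch. The function names, the shape notation (9 for a digit, A for a letter) and the sample values are my own assumptions for the example, not the output of any particular profiling product.

```python
import re
from collections import Counter

# Pattern for a US ZIP code: five digits, optionally followed by a four-digit extension.
ZIP_PATTERN = re.compile(r"\d{5}(-\d{4})?")


def shape(value: str) -> str:
    """Reduce a value to its shape: letters become 'A', digits become '9'."""
    value = re.sub(r"[A-Za-z]", "A", value)
    return re.sub(r"[0-9]", "9", value)


def profile_shapes(values):
    """Count how often each shape occurs in a column of values."""
    return Counter(shape(v) for v in values)


def is_valid_zip(value: str) -> bool:
    """True when the whole value has the shape of a US ZIP or ZIP+4 code."""
    return ZIP_PATTERN.fullmatch(value) is not None


if __name__ == "__main__":
    column = ["02026", "02026-1234", "2026", "O2026"]
    for pattern, count in profile_shapes(column).items():
        print(pattern, count)
    print([v for v in column if not is_valid_zip(v)])
```

Rare shapes in the frequency count are usually where the problems hide, such as the dropped leading zero or the letter O typed in place of a zero in the sample column.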

Friday, April 2, 2010

Donating the Data Quality Asset

If you believe like I do that proper data management can change the world, then you have to start wondering if it’s time for all us data quality professionals to stand up and start changing it.

It’s clear that every organization, no matter what the size or influence, can benefit from properly managing their data. Even charitable organizations can benefit with a cleaner customer list to get the word out when they need donations. Non-profits who handle charitable goods can benefit from better data in their inventory management. If food banks had a better way of managing data and soliciting volunteers, wouldn’t more people be fed? If churches kept better records of their members, would their positive influence be more widespread? If organizations who accept goods in donation kept a better inventory system, wouldn’t more people benefit? The data asset is not limited to Fortune 1000 companies, but until recently, solutions to manage data properly were only available to the elite.

Open source is coming on strong and makes it easier for us to donate data quality.  In the past, it may have been a challenge to get mega-vendors to donate high-end solutions, but these days we can make significant progress on the data quality problem with little or no solution cost. Solutions like Talend Open Profiler, Talend Open Studio, Pentaho and DataCleaner offer data integration and data profiling.

In my last post, I discussed the reference data that is now available for download.  Reference data used to be proprietary and costly. It’s a new world – a better one for low-cost data management solutions.

Can we save the world through data quality?  If we can help good people spread more goodness, then we can. Let’s give it a try.

Monday, February 22, 2010

Referential Treatment - The Open Source Reference Data Trend

Reference data can be used in a huge number of data quality and data enrichment processes.  The simplest example is a table that contains cities and their associated postal codes – you can use an ETL process to make sure that all your customer records that contain 02026 for a postal code always refer to the standardized “Dedham, MA” for the city and state, not variations like “Deadham Mass”  or “Dedam, Massachusetts”.
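
To make the postal code example concrete, here is a minimal Python sketch of the lookup step such an ETL process might perform. The one-row reference table and the record field names (postal_code, city, state) are assumptions for illustration; a real job would load the reference table from a downloaded file.

```python
# Hypothetical reference table: postal code -> (city, state).
REFERENCE = {"02026": ("Dedham", "MA")}

def standardize(record, reference=REFERENCE):
    """Return a copy of the record with city and state taken from the
    reference data whenever the postal code is known."""
    match = reference.get(record.get("postal_code", "").strip())
    if match is None:
        # Unknown postal code: leave the record as it was.
        return dict(record)
    city, state = match
    return {**record, "city": city, "state": state}
```

Records with a postal code that is not in the reference table pass through unchanged, so they can be reviewed separately rather than silently altered.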

Reference data is not limited to customer address, however. If everyone were to use the same reference data for parts, you could easily exchange procurement data between partners.  If only certain values are allowed in any given table, validation becomes straightforward.  With standards for procurement, supply chain, finance and accounting data, processes become more efficient.  Organizations like the ISO and ECCMA are working on that.

Availability of Reference Data
In the past, it was difficult to get your hands on reference data. Long ago, no one wanted to share reference data with you - you had to send your customer data to a service provider and get the enriched data back.  Others struggled to develop reference data on their own. Lately I’m seeing more and more high quality reference data available for free on the Internet.   For data jockeys, these are good times.

A good example of this is GeoNames.  The GeoNames geographical database is available for download free of charge under a Creative Commons Attribution license. According to the web site, it “aggregates over 100 different data sets to build a list containing over eight million geographical names and consists of 7 million unique features whereof 2.6 million populated places and 2.8 million alternate names. The data is accessible free of charge through a number of web services and a daily database export.”

GeoNames combines geographical data such as names of places in various languages, elevation, population and others from various sources. All lat/long coordinates are in WGS84 (World Geodetic System 1984). Like Wikipedia, users may manually edit, correct and add new names.

US Census Data
Another rich set of reference data is the US Census “Gazetteer” data. Courtesy of the US government, you can download a database with the following fields:
  • Field 1 - State Fips Code
  • Field 2 - 5-digit Zipcode
  • Field 3 - State Abbreviation
  • Field 4 - Zipcode Name
  • Field 5 - Longitude in Decimal Degrees (West is assumed, no minus sign)
  • Field 6 - Latitude in Decimal Degrees (North is assumed, no plus sign)
  • Field 7 - 2000 Population (100%)
  • Field 8 - Allocation Factor (decimal portion of state within zipcode)
So, our Dedham, MA entry includes this data:
  • "25","02026","MA","DEDHAM",71.163741,42.243685,23782,0.003953
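
Here is a minimal Python sketch of reading a record like the one above into named fields. It assumes the comma-separated, quoted layout shown in the Dedham example, and the field names are my own labels for the eight fields listed. Because the file assumes West longitude with no minus sign, the sketch negates the longitude to get a conventional signed coordinate.

```python
import csv
from io import StringIO

# Labels for the eight fields described above (names are assumptions).
FIELDS = ["state_fips", "zipcode", "state", "name",
          "longitude", "latitude", "population", "allocation_factor"]

def parse_gazetteer_line(line):
    """Parse one comma-separated Gazetteer record into a dict."""
    row = next(csv.reader(StringIO(line)))
    rec = dict(zip(FIELDS, row))
    # West is assumed in the file, so make the longitude negative.
    rec["longitude"] = -float(rec["longitude"])
    rec["latitude"] = float(rec["latitude"])
    rec["population"] = int(rec["population"])
    rec["allocation_factor"] = float(rec["allocation_factor"])
    return rec
```

The FIPS code and zipcode stay as strings so that leading zeros, as in 02026, are not lost.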
It’s Really Exciting!
When I talk about reference data at parties, I immediately see eyes glaze over and it’s clear that my fellow party-goers want to escape my enthusiasm for it.  But this availability of reference data is really great news! Together with the open source data integration tools like Talend Open Studio, we’re starting to see what I like to call “open source reference data” becoming available. It all makes the price of improving data quality much lower and our future much brighter.

There’s so much to talk about with regard to reference data and so many good sources.  I plan to make more posts on this topic, but feel free to post your beloved reference data sources here in the comments section.

Tuesday, February 16, 2010

The Secret Ingredient in Major IT Initiatives

One of my first jobs was that of assistant cook at a summer camp.  (In this case, the term ‘cook’ was loosely applied; it meant scrubbing pots and pans for the head cook.) It was there I learned that most cooks have ingredients that they tend to use more often.  The cook at Camp Marlin tended to use honey where applicable.  Food TV star Emeril likes to use garlic and pork fat.  Some cooks add a little hot pepper to their chocolate recipes – it is said to bring out the flavor of the chocolate.  Definitely a secret ingredient.
For head chefs taking on major IT initiatives the secret ingredient is always data quality technology. Attention to data quality doesn’t make the recipe of an IT initiative alone so much as it makes an IT initiative better.  Let’s take a look at how this happens.

No matter what the project, data profiling provides a complete understanding of the data before the project team attempts to migrate it. This can help the project team create a more accurate plan for integration.  On the other hand, it is ill-advised to migrate data to your new solution as-is, as it can lead to major cost overruns and project delays as you have to load and reload it.
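
For readers who have not seen profiling in practice, here is a minimal Python sketch of the kind of per-column summary a profiling step produces. It is only an illustration under my own assumptions about which measures to report, not how any particular profiling product works.

```python
from collections import Counter

def profile_column(values):
    """Summarize one column: row count, blanks, distinct values, top value."""
    # Treat None and whitespace-only strings as blank; trim the rest.
    cleaned = [v.strip() for v in values if v is not None and v.strip()]
    counts = Counter(cleaned)
    return {
        "rows": len(values),
        "blank": len(values) - len(cleaned),
        "distinct": len(counts),
        "most_common": counts.most_common(1)[0] if counts else None,
    }
```

Even a summary this small can show whether a field is mostly empty or full of unexpected variations before you plan the migration.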

Customer Relationship Management (CRM)
By using data quality technology in CRM, the organization will benefit from a cleaner customer list with fewer duplicate records. Data quality technology can work as a real-time process, limiting the number of typos and duplicates in the system, thus leading to improved call center efficiency.  Data profiling can also help an organization understand and monitor the quality of a purchased list before integration, which helps avoid issues with third-party data.

Enterprise Resource Planning (ERP) and Supply Chain Management (SCM)

If data is accurate, you will have a more complete picture of the supply chain. Data quality technology can be used to more accurately report inventory levels, lowering inventory costs. When you make it part of your ERP project, you may also be able to improve bargaining power with suppliers by gaining better intelligence about your own corporate buying power.

Data Warehouse and Business Intelligence
Data quality helps disparate data sources act as one when migrated to a data warehouse; standardizing that disparate data is what makes the data warehouse possible. You will be able to generate more accurate reports when trying to understand sales patterns, revenue, customer demographics and more.

Master Data Management (MDM)
Data quality is a key component of master data management.  An integral part of making applications communicate and share data is to have standardized data.  MDM enhances the basic premise of data quality with additional features like persistent keys, a graphical user interface to manage matching, the ability to publish and subscribe to enterprise applications, and more.

So keep in mind, when you decide to improve data quality, it is often because of your need to make a major IT initiative even stronger.  In most projects, data quality is the secret ingredient to make your IT projects extraordinary.  Share the recipe.

Monday, February 1, 2010

A Data Governance Mission Statement

Every organization, including your data governance team, has a purpose and a mission. It can be very effective to communicate your mission in a mission statement to show the company that you mean business.  When you show the value of your team, it can change your relationship with management for the better.

The mission statement should pay tribute to the mission of the organization with regard to values, while defining why the data governance organization exists and setting a big picture goal for the future.
The data governance mission statement could revolve around any of the following key components:

  • increasing revenue
  • lowering costs
  • reducing risks (compliance)
  • meeting any of the organization’s other policies such as being green or socially responsible

The most popular format seems to follow:
Our mission is to [purpose] by doing [high level initiatives] to achieve [business benefits]

So, let’s try one:
Our mission is to ensure that the highest quality data is delivered via a company-wide data governance strategy for the purpose of improving the efficiency, increasing the profitability and lowering the risk of the business units we serve.
Flopped around:
Our mission is to improve the efficiency, increase the profitability and lower the business risks of Acme’s business units by ensuring that the highest quality data is delivered via a company-wide data governance strategy.
Not bad, but a mission statement should be inspiring to the team and to management. Since the passions of the company described above are unknown, it’s difficult for a generic mission statement to be inspirational about the data governance program. That’s up to you.
Goals & Objectives
There are mission statements and there are objectives. While every mission statement should say who you are and why you exist, every objective should specify what you’re going to do and the results you expect.  Objectives include activities that can be easily tracked, measured, achieved and, of course, support the mission.  When you start data governance projects, you can look back to the mission statement to make sure you’re on track. Are you using your people and technology in a way that will benefit the company?

Staying On Mission
When you take on a new project, the mission statement can help protect you and ensure that the project is worthwhile for both the team and the company. The mission statement should be considered as a way to block busy-work and unimportant projects.  In our mission statement example above, if the project doesn’t improve efficiency, increase profitability or lower business risk, it should not be considered.

In this case, you can clearly map three projects to the mission, but the fourth project is not as clear.  Dig deeper into the mainframe project to see if any efficiency will come out of the migration.  Is the data being used by anyone for a business purpose?

A Mission Never Ends
A mission statement is a written declaration of a data governance team's purpose and focus. This focus normally remains steady, while objectives may change often to adapt to changes in the business environment. A properly crafted mission statement will serve as a filter to separate what is important from what is not and to communicate your value to the entire organization.


Thursday, January 21, 2010

ETL, Data Quality and MDM for Mid-sized Business

Is data quality a luxury that only large companies should be able to afford?  Of course the answer is no. Your company should be paying attention to data quality whether you are a Fortune 1000 company or a startup. Like a toothache, poor data quality will never get better on its own.

As a company naturally grows, the effects of poor data quality multiply.  When a small company expands, it naturally develops new IT systems. Mergers often bring in new IT systems, too. The impact of poor data quality slowly invades and hinders the company’s ability to service customers, keep the supply chain efficient and understand its own business. Paying attention to data quality early and often is a winning strategy for even the small and medium-sized enterprise (SME).

However, SMEs have challenges with the investment needed in enterprise-level software. While it’s true that the benefit often outweighs the costs, it is difficult for the typical SME to invest in the license, maintenance and services needed to implement a major data integration, data quality or MDM solution.

At the beginning of this year, I started with a new employer, Talend. I became interested in them because they were offering something completely different in our world – open source data integration, data quality and MDM.  If you go to the Talend Web site, you can download some amazing free software, like:
  • a fully functional, very cool data integration package (ETL) called Talend Open Studio
  • a data profiling tool, called Talend Open Profiler, providing charts and graphs and some very useful analytics on your data
The two packages sit on top of a database, typically MySQL – also an open source success.

For these solutions, Talend uses a business model similar to what my friend Jim Harris has just blogged about – Freemium. Under this new model, free open source content is made available to everyone—providing the opportunity to “up-sell” premium content to a percentage of the audience. Talend works like this.  You can enhance your experience from Talend Open Studio by purchasing Talend Integration Suite (in various flavors).  You can take your data quality initiative to the next level by upgrading Talend Open Profiler to Talend Data Quality.

If you want to take the combined data integration and data quality to an even higher level, Talend just announced a complete Master Data Management (MDM) solution, which you can use in a more enterprise-wide approach to data governance. There’s a very inexpensive place to start and an evolutionary path your company can take as it matures its data management strategy.

The solutions have been made possible by the combined efforts of the open source community and Talend, the corporation. If you’d like, you can take a peek at some source code, use the basic software and try your hand at coding an enhancement. Sharing that enhancement with the community will only lead to a world full of better data, and that’s a very good thing.

Disclaimer: The opinions expressed here are my own and don't necessarily reflect the opinion of my employer. The material written here is copyright (c) 2010 by Steve Sarsfield. To request permission to reuse, please e-mail me.