Friday, July 30, 2010

Deterministic and Probabilistic Matching White Paper

I’ve been busy this summer working on a white paper on record matching, the result of which is available on the Talend web site here.

The white paper is a primer on the elementary principles of record matching. As the description says, it outlines the basic theories and strategies of record matching. It describes the nuances of deterministic and probabilistic matching and the algorithms used to identify relationships within records. It covers the processes to employ in conjunction with matching technology to transform raw data into powerful information that drives success in enterprise applications like CRM, data warehouse and ERP.
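For readers who want a feel for the difference before opening the paper, here is a minimal Python sketch. It is my own illustration, not code from the white paper or from any Talend product. The field names, weights and the 0.85 threshold are all assumptions, and a simple weighted string-similarity score stands in for a true probabilistic model, which would estimate match weights from the data itself.

```python
from difflib import SequenceMatcher


def normalize(value):
    # Lowercase and collapse whitespace so trivial formatting differences are ignored.
    return " ".join(value.lower().split())


def deterministic_match(a, b, keys=("name", "zip")):
    # Deterministic: every key field must agree exactly after normalization.
    return all(normalize(a[k]) == normalize(b[k]) for k in keys)


def similarity_score(a, b, weights):
    # Simplified stand-in for probabilistic matching: a weighted average of
    # per-field string similarity, yielding a score between 0 and 1.
    total = sum(weights.values())
    score = sum(
        w * SequenceMatcher(None, normalize(a[f]), normalize(b[f])).ratio()
        for f, w in weights.items()
    )
    return score / total


# Hypothetical records: same person, one name misspelled.
rec_a = {"name": "Jon Smith", "zip": "02139"}
rec_b = {"name": "John  Smith", "zip": "02139"}

WEIGHTS = {"name": 0.6, "zip": 0.4}  # assumed weights, chosen for illustration
THRESHOLD = 0.85                     # assumed cutoff for calling two records a match

exact = deterministic_match(rec_a, rec_b)
score = similarity_score(rec_a, rec_b, WEIGHTS)
print("deterministic match:", exact)
print("similarity score above threshold:", score >= THRESHOLD)
```

In this example the exact comparison rejects the pair because of the one-letter difference in the name, while the scored comparison clears the assumed threshold. That trade-off between strict rules and tolerant scoring is the kind of nuance the paper discusses.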

Wednesday, July 28, 2010

DGDQI Viewer Mail

From time to time, people read my blog or book and contact me to chat about data governance and data quality. I welcome it. It’s great to talk to people in the industry and hear their concerns.

Occasionally, I see things in my in-box that bother me, though.  Here is one item that I’ll address in a post. The names have been changed to protect the innocent.

A public relations firm asked:

Hi Steve,
I wonder if you could answer these questions for me.
- What are the key business drivers for the advent of data governance software solutions?
- What industries can best take advantage of data governance software solutions?
- Do you see cloud computing-based data governance solutions developing?

I couldn’t answer these questions, because they all pre-supposed that data governance is a software solution.  It made me wonder if I have made myself clear enough on the fact that data governance is mostly about changing the hearts and minds of your colleagues to re-think their opinion of data and its importance.  Data governance is a company’s mindful decision that information is important and they’re going to start leveraging it. Yes, technology can help, but a complete data governance software solution would have more features than a Workchamp XL Swiss Army Knife. It would have to include data profiling, data quality, data integration, business process management, master data management, wikis, a messaging platform, a toothpick and a nail file in order to be complete. 

Can you put all this on the cloud?  Yes.  Can you put the hearts and minds of your company on a cloud?  If only it were that easy...

Wednesday, July 21, 2010

Lemonade Stand Data Quality

My children expressed interest in opening up a lemonade stand this weekend. I’m not sure if it’s done worldwide, but here in America every kid between the ages of five and twelve tries their hand at earning extra money during the summer months. Most parents in America indulge this because the whole point of a lemonade stand is really to learn about capitalism. You figure out your costs (how much the lemonade, ice and cups cost), then you charge a little more than what it costs you. At the end of the day, you can hope to show a little profit.

I couldn’t help but think there are lessons we can learn from the lemonade stand that apply to the way we manage our own data quality initiatives.  Data governance programs and data quality projects are still driven by capitalism and lemonade stand fundamentals.

  • Concept – Just as the lemonade stand requires your audience to have a clear understanding of the product and the price, so does data quality.  In the data world, profiling can help you create an accurate assessment of your data and tell the world exactly what condition it is in and how much it’s going to cost to fix.
  • Marketing – My kids proved that more people will come to your lemonade stand if you shout out “Ice Cold Lemonade” and put a few flyers around the neighborhood. Likewise you need to tell management, business people and anyone who will listen about data quality – it’s ice cold and delicious.
  • Pricing – A lemonade stand works by setting the right price. Too low and the profit will be too small; too high and no one will buy. In the data quality world, success comes from setting a scope that balances the amount of spend against the expected return on investment.
  • Location – While a busy street and a hot day make a profitable lemonade stand, data quality project managers know that you begin by picking the projects with the least effort and highest potential ROI. In turn, you get to open more lemonade stands and build your data quality projects into a data governance program.

When it comes down to it, data quality projects are a form of capitalism; you need to sell the customers a refreshing glass and keep them coming back for more.

Friday, July 9, 2010

The Book of Life

I found a very interesting paper on record matching in my research for my own upcoming white paper. The paper was written by the Chief of the National Office of Vital Statistics for the U.S. Public Health Service. The paper describes, in almost poetic fashion, a person’s “book of life”.  It describes how a person leaves a data trail as they go through life.  It describes how hard it is to put together the pages of your book of life as you are born, get married, change homes, and earn degrees and certifications.

Naturally, there are benefits to society for each person having their own book of life. In the case of the bureau chief, he cited the need to understand what factors influence health and longevity. The tricky part, he said, was to “bind the book of life” despite its tendency to be misaligned, non-standard and incoherent.

It sounds like the good Doctor is describing record matching and data cleansing, and to some degree a national ID. But the most interesting and amazing thing about this is that the paper was written in 1946. Even back then, there were smart people who knew what we had to do to bring benefit to society.

Disclaimer: The opinions expressed here are my own and don't necessarily reflect the opinion of my employer. The material written here is copyright (c) 2010 by Steve Sarsfield. To request permission to reuse, please e-mail me.