Wednesday, July 21, 2010

Lemonade Stand Data Quality


My children expressed interest in opening up a lemonade stand this weekend. I’m not sure if it’s done worldwide, but here in America every kid between the ages of five and twelve tries their hand at earning extra money during the summer months. Most parents in America indulge this because the whole point of a lemonade stand is really to learn about capitalism. You figure out your costs (how much the lemonade, ice and cups cost), then you charge a little more than what it costs you. At the end of the day, you can hope to show a little profit.

I couldn’t help but think there are lessons we can learn from the lemonade stand that apply to the way we manage our own data quality initiatives.  Data governance programs and data quality projects are still driven by capitalism and lemonade stand fundamentals.

  • Concept – A lemonade stand requires your audience to have a clear understanding of the product and the price, and so does a data quality project.  In the data world, profiling can help you create an accurate assessment of your data, so you can tell the world exactly what the project is and how much it’s going to cost.
  • Marketing – My kids proved that more people will come to your lemonade stand if you shout out “Ice Cold Lemonade” and put a few flyers around the neighborhood. Likewise you need to tell management, business people and anyone who will listen about data quality – it’s ice cold and delicious.
  • Pricing – A lemonade stand works by setting the right price. Too low and the profit will be too small; too high and no one will buy. In the data quality world, a project succeeds when you set the scope so that the spend is appropriate and the return on investment is worth it.
  • Location – Just as a busy street and a hot day make a profitable lemonade stand, data quality project managers know that you begin by picking the projects with the least effort and highest potential ROI. In turn, you get to open more lemonade stands and build your data quality projects into a data governance program.

When it comes down to it, data quality projects are a form of capitalism; you need to sell the customers a refreshing glass and keep them coming back for more.

Friday, July 9, 2010

The Book of Life

I found a very interesting paper on record matching in my research for my own upcoming white paper. The paper was written by the Chief of the National Office of Vital Statistics for the U.S. Public Health Service. The paper describes, in almost poetic fashion, a person’s “book of life”.  It describes how a person leaves a data trail as they go through life.  It describes how hard it is to put together the pages of your book of life as you get born, get married, change homes, and earn degrees and certifications.

Naturally, there are benefits to society for each person having their own book of life. In the case of the bureau chief, he cited the need to understand what factors influence health and longevity. The tricky part, he said, was to “bind the book of life” despite its tendency to be misaligned, non-standard and incoherent.

It sounds like the good Doctor is describing record matching and data cleansing, and to some degree a national ID. But the most interesting and amazing thing about this is that the paper was written in 1946. Even back then, there were smart people who knew what we had to do to bring benefit to society.

Thursday, May 13, 2010

Three Conversations to Have with an Executive - the Only Three

If you’re reading this, you’re most likely in the business of data management. In many companies, particularly large ones, the folks who manage data don’t talk much to the executives. But every so often, there is that luncheon, a chance meeting in the elevator, or even a break from a larger meeting where you and an executive are standing face to face.  (S)he asks what you’re working on. Like a boy scout, be prepared.  Keep your response to one of these three things:

  1. Revenue – How has your team increased revenue for the corporation?
  2. Efficiency – How has your team lowered costs by improving efficiency for the corporation?
  3. Risk – How have you and your team lowered the risk to the corporation with better compliance with corporate regulations?

The executive doesn’t want to hear about schemas, transformations or even data quality. Some examples of appropriate responses might include:

  • We work on making the CRM/ERP system more efficient by keeping an eye on the information within it. My people ensure that the reports are accurate and complete so you have the tools to make the right decisions.
  • We’re doing things like making sure we’re in compliance with [HIPAA/Solvency II/Basel II/Antispam] so no one runs afoul of the law.
  • We’re speeding up the time it takes to get valuable information to the [marketing/sales/business development] team so they can react quickly to sales opportunities.
  • We’re fixing [business problem] to [company benefit].

When you talk to your CEO, it’s your opportunity to get him/her in the mindset that your team is beneficial, so when it comes to funding, it will be something they remember. It’s your chance to get out of the weeds and elevate the conversation.  Let the sales guys talk about deals. Let the marketing people talk about the market forces or campaigns. As data champions, we also need to be prepared to talk about the value we bring to the game.

Thursday, May 6, 2010

Open Source Data Profiler Demo

Here I am giving a demo on Talend Open Profiler. Demo and commentary included.


You can download Talend Open Profiler here.

Tuesday, May 4, 2010

Are we ready for a National ID card in America?

I read with interest the story about the National ID card.  Although I don’t like to link myself to one political party or another, I applaud the effort of trying to get a system in place for national ID. I like efficient data.

However, I have my doubts that a group of Senators can really understand the enormous challenges of such a project.  The issue is a politically charged one for certain, so that will be the focus. The details, which we all know contain the devil, will likely be forgotten.

I recall just a short time ago the US government’s Cash for Clunkers program. The program involved turning in your old “clunker” when buying a new fuel-efficient car.  The idea was to support the auto industry and get the gas guzzlers off the road. The devil was in the details, however. Rather than a secure web site with sufficient backbone to properly serve car dealerships, the program required the dealers to complete pages and pages of paperwork… real paper paperwork…  and fax it in to the newly formed government agency for approval. Then they hired workers on the other end to enter the data. It was a business process that would have been appropriate for 1975, not 2010.

ACLU legislative counsel Christopher Calabrese said of this National ID program that “all of this will come with a new federal bureaucracy — one that combines the worst elements of the DMV and the TSA”. Based on recent history it’s an accurate description of what will likely happen.

If the government wants to do this thing, they need to bring in a dream team of database experts. Guys like Dr. Ralph Kimball or Bill Inmon, both of whom are world-renowned for data modeling, should contribute if they are willing.  They should ask Dr. Rich Wang from MIT’s IQ program to be in charge of information quality issues.  They should invite guys like Jim Harris to communicate the complex issues to the public.  Also, they need to bring in folks with practical experience, like a Jill Dyche or Gwen Thomas.  There are probably some others that I haven’t mentioned. Security experts, hardware scalability experts and business process experts need to be part of the mix to protect the citizenry of the United States. They would need to make a plan without bias toward any district or political action committee.  That’s why a national database won’t happen.

Don’t get me wrong, if we did so, we could come up with much more efficient systems for checking backgrounds, I-9 job verification, international travel, and more. Identity theft is a big problem here and everywhere, but with a central citizen repository, the US could legislate a notification system for when new bank accounts are opened in your name.  The census would always show a more accurate number and wouldn't cost us billions and billions of dollars every ten years. Let's face it, the business process of the census, mailing paper forms and personal door-to-door interviews, is outdated.

Let’s start this by making it voluntary. If you want to be in the database and avoid long lines at the airport, fine.  If you want to be anonymous and wait, that’s fine, too.  We’ll get the kinks worked out with the early adopters and roll it out to the laggards later.

What we’re really talking about here is a personal primary key.  That data already exists in multiple linkable systems, with your name and addresses (past and present) linking it.  We as data professionals spend a lot of time and effort working with data to try to find these links. So why not have a primary key to link your personal data instead? Are you really giving up anything that DBAs haven't already figured out?
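To make the contrast concrete, here is a minimal sketch in Python of the two ways to link a person's records. Everything in it is an assumption for illustration: the `person_id` field, the sample records, and the crude `normalize` rule are hypothetical, and real record matching uses far more forgiving comparisons than exact equality on a cleaned-up string.

```python
import re

def normalize(name, address):
    """Build a crude match key: lowercase, drop punctuation, collapse spaces."""
    text = f"{name} {address}".lower()
    text = re.sub(r"[^a-z0-9 ]", "", text)
    return " ".join(text.split())

def link_by_name_address(left, right):
    """Link records whose normalized name and address agree exactly."""
    index = {normalize(r["name"], r["address"]): r for r in right}
    links = []
    for rec in left:
        key = normalize(rec["name"], rec["address"])
        if key in index:
            links.append((rec, index[key]))
    return links

def link_by_key(left, right, key="person_id"):
    """Link records that share the same (hypothetical) primary key."""
    index = {r[key]: r for r in right}
    return [(rec, index[rec[key]]) for rec in left if rec[key] in index]

# Hypothetical records for the same person in two systems.
bank = [{"person_id": "P-001", "name": "Jane Q. Doe", "address": "12 Elm St."}]
dmv_same = [{"person_id": "P-001", "name": "jane q doe", "address": "12 Elm St"}]
dmv_moved = [{"person_id": "P-001", "name": "Jane Doe", "address": "40 Oak Avenue"}]

print(len(link_by_name_address(bank, dmv_same)))   # formatting differences only
print(len(link_by_name_address(bank, dmv_moved)))  # name variant and new address
print(len(link_by_key(bank, dmv_moved)))           # the key still links the records
```

The name-and-address link survives punctuation and capitalization differences but breaks as soon as the person moves or drops a middle initial, which is where all that matching effort goes. The key-based link does not care.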

For those of you against a national database, I don’t think you have anything to fear.  Call me a skeptic, but given the political divide between groups, it’s unlikely that any national database of citizens will be done within this decade. But if you’re listening, Senators, and you decide to move forward, make sure you have the right people, processes and technology in place to do it right.

Friday, April 9, 2010

Links from my eLearning Webinar

I recently delivered a webinar on the Secrets of Affordable Data Governance. In the webinar, I promised to deliver links for lowering the costs of data management.  Here are those links:

  • Talend Open Source - Download free data profiling, data integration and MDM software.
  • US Census - Download census data for cleansing of city name and state with latitude and longitude appends.
  • Data.gov - The data available from the US government.
  • Geonames - Postal codes and other location reference data for almost every country in the world.
  • GRC Data - A source of low-cost customer reference data, including names, addresses, salutations, and more.
  • Regular Expressions - Check the shape of data in profiling software or within your database application.
If you search on the term "download reference data", you will find many other sources.
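
As a quick illustration of the regular expression item above, here is a minimal sketch in Python that checks the shape of values in a column. The rule names and patterns are my own assumptions for illustration, not a complete standard for US ZIP codes or phone numbers.

```python
import re

# Hypothetical shape rules; adjust the patterns to your own data.
SHAPES = {
    "us_zip": re.compile(r"\d{5}(-\d{4})?"),           # 12345 or 12345-6789
    "us_phone": re.compile(r"\(\d{3}\) \d{3}-\d{4}"),  # (617) 555-0100
}

def shape_violations(values, rule):
    """Return the values that do not fully match the named shape rule."""
    pattern = SHAPES[rule]
    return [v for v in values if not pattern.fullmatch(v.strip())]

zips = ["02139", "02139-4307", "2139", "0213A"]
print(shape_violations(zips, "us_zip"))  # prints ['2139', '0213A']
```

The same patterns can usually be pasted into profiling software or a database's regular expression function, though the exact syntax varies from tool to tool.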

Friday, April 2, 2010

Donating the Data Quality Asset

If you believe as I do that proper data management can change the world, then you have to start wondering if it’s time for all of us data quality professionals to stand up and start changing it.

It’s clear that every organization, no matter what the size or influence, can benefit from properly managing their data. Even charitable organizations can benefit with a cleaner customer list to get the word out when they need donations.  Non-profits who handle charitable goods can benefit from better data in their inventory management.  If food banks had a better way of managing data and soliciting volunteers, wouldn’t more people be fed? If churches kept better records of their members, would their positive influence be more widespread?  If organizations who accept goods in donation kept a better inventory system, wouldn’t more people benefit? The data asset is not limited to Fortune 1000 companies, but until recently, solutions to manage data properly were only available to the elite.

Open source is coming on strong, and it makes it easier for us to donate data quality.  In the past, it may have been a challenge to get mega-vendors to donate high-end solutions, but these days we can make significant progress on the data quality problem at little or no solution cost. Solutions like Talend Open Profiler, Talend Open Studio, Pentaho and DataCleaner offer data integration and data profiling.

In my last post, I discussed the reference data that is now available for download.  Reference data used to be proprietary and costly. It’s a new world – a better one for low-cost data management solutions.

Can we save the world through data quality?  If we can help good people spread more goodness, then we can. Let’s give it a try.

Disclaimer: The opinions expressed here are my own and don't necessarily reflect the opinion of my employer. The material written here is copyright (c) 2010 by Steve Sarsfield. To request permission to reuse, please e-mail me.