Monday, May 9, 2011

MIT Information Quality Symposium

This year I’m planning to attend the MIT IQ Symposium again, and I’m also one of the vice chairs of the event. The symposium, held each July in Boston, brings practitioners and academics together to discuss and exchange ideas about data quality.

I return to this conference and participate in the planning every year because I think it’s one of the most important data quality events. The people here really do change the course of information management. On these hot summer days in Boston, government, healthcare and general business professionals collaborate on the latest developments in data quality. This event has the potential to dramatically change how people, organizations, and governments manage data. I’ve grown to really enjoy the combination of ground-breaking presentations, high-ranking government officials, sharp consultants and MIT hallway chat that you find here.

If you have some travel budget, please consider joining me for this event.

Friday, April 29, 2011

Open Source and Data Quality

Here’s my latest video on the Talend Channel, about data quality and open source.


This was filmed in the Paris office in January. I can get excited in any time zone when it comes to data quality.

Monday, April 25, 2011

Data Quality Scorecard: Making Data Quality Relevant

Most data governance practitioners agree that a data quality scorecard is an important tool in any data governance program. It provides comprehensive information about the quality of the data in a database and, perhaps even more importantly, allows business users and technical users to collaborate on quality issues.

However, there are multiple levels of metrics that you should consider:

1. Metrics that the technologists use to fix data quality problems. Examples: 7% of the e-mail attribute is blank; 12% of the e-mail attribute does not follow the standard e-mail syntax; 13% of our US mail addresses fail address validation.

2. Metrics that business people use to make decisions about the data. Examples: 9% of my contacts have invalid e-mails; 3% have both invalid e-mails and invalid addresses.

3. Metrics that managers use to get the big picture. Example: “This customer data is good enough to use for a campaign.”

All levels are important for the various members of the data governance team. Level one shows the steps you need to take to fix the data. Level two gives context to the task at hand. Level three tells the uninformed about the business issue without making them dig into the details.

So, when you’re building your DQ metrics, remember to roll the attribute-level results up into progressively higher-level metrics, as the sketch below illustrates. Design the scorecards to meet the interests of the different audiences, from technical through business and up to executive. At the base of a data quality scorecard is information about the quality of individual data attributes; this is the default information that most profilers deliver out of the box. In the middle are various score sets that allow your company to analyze and summarize data quality from different perspectives. As you aggregate scores, the high-level measures of data quality become more meaningful. If you define the objective of a data quality assessment project as calculating these different aggregations, you will have a much easier time maturing your data governance program. The business users and the C-level will begin to pay attention.
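To make the roll-up concrete, here is a minimal sketch in Python (not tied to any particular profiler or tool) that computes level-one attribute metrics over a handful of hypothetical contact records, rolls them up into a level-two contact metric, and reduces that to a level-three statement a manager can act on. The field names, the e-mail pattern and the 10% cutoff are all assumptions for illustration.

```python
import re

# Hypothetical contact records; the field names are illustrative only.
contacts = [
    {"email": "jane@example.com",   "address_valid": True},
    {"email": "bob(at)example.com", "address_valid": False},
    {"email": "",                   "address_valid": True},
]

EMAIL_SYNTAX = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
total = len(contacts)

# Level 1: attribute-level metrics the technologists use to fix problems.
blank_email = sum(1 for c in contacts if not c["email"]) / total
bad_syntax  = sum(1 for c in contacts
                  if c["email"] and not EMAIL_SYNTAX.match(c["email"])) / total
bad_address = sum(1 for c in contacts if not c["address_valid"]) / total

# Level 2: a record-level metric business people use to make decisions.
invalid_contacts = sum(
    1 for c in contacts
    if not EMAIL_SYNTAX.match(c["email"]) or not c["address_valid"]
) / total

# Level 3: the big-picture statement for managers (the 10% cutoff is an assumption).
fit_for_campaign = invalid_contacts <= 0.10

print(f"Level 1: {blank_email:.0%} blank e-mails, {bad_syntax:.0%} bad syntax, "
      f"{bad_address:.0%} failed addresses")
print(f"Level 2: {invalid_contacts:.0%} of contacts are unusable for outreach")
print(f"Level 3: this customer data is "
      f"{'good enough' if fit_for_campaign else 'not good enough'} for a campaign")
```

In practice these numbers come out of your profiler rather than hand-rolled code; the point is simply that each level is a straightforward aggregation of the one below it, and that the scorecard can present all three at once.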

Tuesday, March 15, 2011

Open Source Data Management or Do-it-Yourself

With the tough economy, companies are still cutting back on spending. There is a sense of urgency to just get things done, and sometimes that can lead to hand-coding your own data integration, data quality or MDM functions. When you begin to develop your plans and strategies for data management, you have to think about all the hidden costs of getting a solution out of the box versus building your own.

Reusability is one key consideration. Using data management technologies that only plug into one system just doesn’t make sense. It’s difficult to get that reusability with custom code, unless your programmers have high visibility into other projects. On the other hand, all tool vendors, even open source ones, are under pressure from their clients to support multiple databases and business solutions. Open source solutions are built to work in a wider variety of architectures. You can move your data management processes between JD Edwards, SAP and SalesForce, for example, with relative ease.

Indemnity is another consideration. What if something goes wrong with your home-grown solution after the chief architect leaves his job? Who are you going to call? If something goes wrong with your open source solution, you can turn to the community or call the vendor for support.

Long-term costs are yet another issue. Home-grown solutions have a tendency to start cheap and get more expensive as time goes on. It’s difficult to maintain custom code, especially if it is poorly documented, so you end up hiring consultants to manage it. Eventually, you have to rip and replace, and that can be costly.

You should consider your human resources, too. Does it make sense to have a team hand-coding database extractions and transformations, or would the total cost/benefit be better if you used an open source data integration tool? It might just free up some of your programmers to pursue more important, ROI-centric ventures.

If you’re thinking of cooking up your own technical solutions for data management, hoping to just get it done, think again. Your most economical solution might just be to leverage the community of experts and go with open source.

Thursday, March 10, 2011

My Interview in the Talend Newsletter

Q. Some people would say that data quality technology is mature and that the topic is sort of stale. Are there major changes happening in the data quality world today?
A. Probably the biggest over-arching change we see today is that the distinction between those managing data from the business standpoint and those managing the technical aspects of data quality is getting more and more blurry. It used to be that data quality was... read more

Friday, December 10, 2010

Six Data Management Predictions for 2011

This time of year everyone makes prognostications about the state of the data management field for 2011. I thought I’d take my turn by offering my predictions for the coming year.

Data will become more open
In the old days, good-quality reference data was an asset kept in the corporate lockbox. If you had a good reference table for common misspellings of parts, cities, or names, for example, the mindset was to keep it close so it wouldn’t fall into the wrong hands. The data might have been sold for profit or simply not made available. Today, there really are no “wrong hands”. Governments and corporations alike are seeing the societal benefits of sharing information. More reference data is there for the taking on the internet from sites like data.gov and geonames.org. That trend will continue in 2011. Perhaps we’ll even see some of the bigger players make announcements about the availability of their data. Are you listening, Google?

Business and IT will become blurry
It’s becoming harder and harder to tell an IT guy from the head of marketing. That’s because in order to succeed, the IT folks need to become more like marketers and vice versa. In the coming year, the difference will become even less noticeable as business people get more and more involved in using data to their benefit. Newsflash One: if you’re in IT, you need marketing skills to pitch your projects and get funding. Newsflash Two: if you’re in business, you need to know enough about data management practices to succeed.

Tools will become easier to use
As the business users come into the picture, they will need access to the tools to manage data.  Vendors must respond to this new marketplace or die.

Tools will do less heavy lifting
Despite the improvements in the tools, corporations will turn to improving processes and reporting in order to achieve better data management. Dwindling are the days when we’re dealing with data so poorly managed that it requires overly complicated data quality tools. We’re getting better at the data management process, and therefore the burden on the tools becomes less. Future tools will focus on supporting process improvement with workflow features, reporting and better graphical user interfaces.

CEOs and Government Officials will gain enlightenment
Feeding off the success of a few pioneers in data governance, as well as the failures of IT projects in our past, CEOs and governments will gain enlightenment about managing their data and put teams in place to handle it. It has taken decades of sweet-talking and cajoling for governments and CEOs to reach this point, but I believe it is practically here.

We will become more reliant on data
Ten years ago, it was difficult to imagine where we would be today with respect to our data addiction. Today, data is a pervasive part of our internet-connected society, living in our PCs, our TVs, our mobile phones and many other devices. It’s a huge part of our daily lives. As I’ve said in past posts, the world is addicted to data, and that bodes well for anyone who helps the world manage it. In 2011, no matter whether the economy turns up or down, our industry will continue to feed the addiction to good, clean data.

Tuesday, November 30, 2010

Match Mitigation: When Algorithms Aren’t Enough

I’d like to get a little technical in this post. I try to keep my posts business-friendly, but sometimes the details are important. If none of this post makes sense to you, I wrote a sort of primer on how matching works in many data quality tools, which you can get here.

Matching Algorithms
When you use a data quality tool, you’re often using matching algorithms and rules to decide whether records match or not. You might be using deterministic algorithms like Jaro, Soundex and Metaphone. You might also be using probabilistic matching algorithms.
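To give a feel for the deterministic side (a rough sketch, not how any particular vendor implements it), the code below compares names with a simplified Soundex-style phonetic key and with a generic similarity ratio. Real tools would use full Soundex or Metaphone rules and Jaro or Jaro-Winkler scoring; Python’s standard library has neither, so difflib stands in for the similarity measure here.

```python
from difflib import SequenceMatcher

def simple_soundex(name: str) -> str:
    """A simplified Soundex-style key (full Soundex has extra rules for h, w and vowels)."""
    codes = {}
    for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in letters:
            codes[ch] = digit
    name = name.lower()
    key, prev = name[0].upper(), codes.get(name[0], "")
    for ch in name[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:
            key += digit
        prev = digit
    return (key + "000")[:4]

# Phonetic comparison: the names differ textually but share a key.
print(simple_soundex("Smith"), simple_soundex("Smyth"))   # S530 S530

# Edit-distance-style similarity (a stand-in for Jaro/Jaro-Winkler).
print(round(SequenceMatcher(None, "jonathan", "johnathan").ratio(), 2))
```

The phonetic key is good at catching misspellings that sound alike, while the similarity score is better at catching typos; most matchers let you combine several measures like these in a single rule.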

In many tools, you can set the rules to be tight, where the software uses tougher criteria to determine a match, or loose, where the software is not so particular. Tight and loose matching is important because you may have strict rules for putting records together, as with the customers of a bank, or not-so-strict rules, as when you’re putting together a customer list for marketing purposes.

What to do with Matches
Once data has been processed through the matcher, there are several possible outcomes. Between any two given records, the matcher may find:

  • No relationship
  • Match – the matcher found a definite match based on the criteria given
  • Suspect – the matcher thinks it found a match but is not confident. The results should be manually reviewed.
It’s that last category that’s the tough one. Mitigating the suspect matches is the most time-consuming follow-up task after the matching is complete. Envision a million-record database where you have 20,000 suspect matches. That’s still going to take you some time to review.
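Here’s a minimal sketch of how those three buckets typically fall out of match scoring (not any vendor’s actual engine): each pair of records gets a similarity score, anything above a tight threshold is an automatic match, anything below a loose threshold is no relationship, and the band in between becomes the suspect pile that has to be reviewed by hand. The two threshold values and the use of difflib as the scorer are assumptions for illustration; a real tool would use Jaro-Winkler or a probabilistic score.

```python
from difflib import SequenceMatcher

TIGHT = 0.92   # above this: automatic match (assumed value)
LOOSE = 0.80   # below this: no relationship (assumed value)

def classify(a: str, b: str) -> str:
    """Score a pair of names and route it to match, suspect or no relationship."""
    score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    if score >= TIGHT:
        return "match"
    if score >= LOOSE:
        return "suspect"   # queued for manual review in a stewardship console
    return "no relationship"

pairs = [
    ("Steve Sarsfield", "Steve Sarsfield"),    # identical -> match
    ("Steve Sarsfield", "Stephen Sarsfield"),  # close -> suspect, needs review
    ("Steve Sarsfield", "Maria Gonzales"),     # unrelated -> no relationship
]
for a, b in pairs:
    print(f"{a} vs {b}: {classify(a, b)}")
```

Tightening or loosening those two thresholds is exactly the tuning at issue here: push them together and fewer pairs land in the suspect pile, but you also accept more automatic decisions that nobody reviews.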

Some of the newer (and cooler) tools offer strategies for dealing with suspect matches. They present the suspect matches in a graphical user interface and allow users to pick which relationships are accurate and which are not. For example, Talend now offers a data stewardship console that lets you pick and choose the records and attributes that will make up a best-of-breed record.

The goal, of course, is to have no suspect matches, so tuning the match rules and limiting the suspect matches is the ultimate aim. The newest tools make this easy. Some of the legacy tools make it hard.

Match mitigation is perhaps one of the most often overlooked processes of data quality. Don’t overlook it in your planning and processes.

Disclaimer: The opinions expressed here are my own and don't necessarily reflect the opinion of my employer. The material written here is copyright (c) 2010 by Steve Sarsfield. To request permission to reuse, please e-mail me.