Wednesday, August 24, 2011

Top Ten Root Causes of Data Quality Problems: Part One

Part 1 of 5: The Basics
We all know data quality problems when we see them. They can undermine your organization’s ability to work efficiently, comply with government regulations, and generate revenue. The specific technical problems include missing data, misfielded attributes, duplicate records, and broken data models, to name just a few.
Rather than merely patching up bad data, most experts agree that the best strategy for fighting data quality issues is to understand their root causes and put new processes in place to prevent them. This five-part blog series discusses the top ten root causes of data quality problems and suggests steps the business can take to prevent them.
In this first blog post, we'll confront some of the more obvious root causes of data quality problems.

Root Cause Number One: Typographical Errors and Non-Conforming Data
Despite a lot of automation in our data architecture these days, data is still typed into Web forms and other user interfaces by people. A common source of data inaccuracy is that the person manually entering the data just makes a mistake. People mistype. They choose the wrong entry from a list. They enter the right data value into the wrong box.

Given complete freedom in a data field, those who enter data have to go from memory. Is the vendor named Grainger, WW Granger, or W. W. Grainger? Ideally, there should be a corporate-wide set of reference data so that forms help users find the right vendor, customer name, city, part number, and so on.
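
To make the reference-data idea concrete, here is a minimal Python sketch of matching a typed vendor name against a canonical list. It is only an illustration; the CANONICAL_VENDORS list and the normalize and match_vendor helpers are hypothetical names, not part of any particular product.

```python
import re

# Hypothetical corporate reference list of canonical vendor names.
CANONICAL_VENDORS = ["W. W. Grainger", "Acme Corporation", "Globex Industries"]

def normalize(name: str) -> str:
    """Lowercase and strip punctuation/whitespace so spelling variants compare equal."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

def match_vendor(entered: str):
    """Return the canonical vendor whose normalized form matches the typed entry, else None."""
    key = normalize(entered)
    for canonical in CANONICAL_VENDORS:
        if normalize(canonical) == key:
            return canonical
    return None

print(match_vendor("WW Grainger"))    # -> "W. W. Grainger"
print(match_vendor("w.w. grainger"))  # -> "W. W. Grainger"
print(match_vendor("Granger"))        # -> None; needs human review or fuzzy matching
```

A real form would go further with fuzzy matching or type-ahead suggestions, but even this level of lookup takes the guesswork out of data entry.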

Root Cause Attack Plan
  • Training – Make sure that those people who enter data know the impact they have on downstream applications.
  • Metadata Definitions – By locking down exactly what people can enter into a field using a definitive list, many problems can be alleviated. This metadata (for vendor names, part numbers, and so on) can become part of data quality in data integration, business applications, and other solutions.
  • Monitoring – Make public the results of poorly entered data and praise those who enter data correctly. You can keep track of this with data monitoring software such as the Talend Data Quality Portal.
  • Real-time Validation – In addition to forms validation, data quality tools can be implemented to validate addresses, e-mail addresses, and other important information as it is entered; a minimal sketch follows this list. Ensure that your data quality solution provides the ability to deploy data quality in application server environments, in the cloud or in an enterprise service bus (ESB).
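
As a rough sketch of what real-time validation can look like behind a Web form, a single check routine might resemble the Python below. This is a simplified illustration, not the Talend tooling itself; validate_record, ALLOWED_COUNTRIES, and EMAIL_PATTERN are hypothetical names invented for the example.

```python
import re

# Hypothetical controlled vocabulary drawn from the corporate reference data.
ALLOWED_COUNTRIES = {"US", "CA", "MX", "GB", "DE", "FR"}

# Deliberately simple e-mail pattern; production systems typically rely on a
# validation library or an address-verification service rather than one regex.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_record(record: dict) -> list:
    """Return human-readable problems found in one form submission."""
    problems = []
    if not EMAIL_PATTERN.match(record.get("email", "")):
        problems.append("e-mail does not look like a valid address")
    if record.get("country") not in ALLOWED_COUNTRIES:
        problems.append("country is not in the approved reference list")
    if not record.get("name", "").strip():
        problems.append("name is empty")
    return problems

print(validate_record({"name": "Ann", "email": "ann@example.com", "country": "US"}))  # []
print(validate_record({"name": "", "email": "not-an-address", "country": "ZZ"}))      # three problems
```

The same routine can run inside the form, on an ESB, or in a batch job, which is why deploying validation as a shared service pays off.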

Root Cause Number Two: Information Obfuscation
Not every data entry error is an honest mistake. How often do people give incomplete or incorrect information to safeguard their privacy? If those who enter data have nothing at stake, there will be a tendency to fudge.

Even if the people entering data want to do the right thing, sometimes they cannot. If a field is not available, an alternate field is often used. This can lead to such data quality issues as having Tax ID numbers in the name field or contact information in the comments field.
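
A simple profiling pass can flag this kind of misfielded data after the fact. The sketch below is only illustrative; the patterns and the flag_misfielded helper are hypothetical, and real profiling tools are far more thorough.

```python
import re

# Hypothetical patterns for values that do not belong in free-text fields.
TAX_ID_PATTERN = re.compile(r"\b\d{2}-?\d{7}\b|\b\d{3}-\d{2}-\d{4}\b")  # EIN / SSN style
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def flag_misfielded(rows):
    """Report rows where identifiers or contact details leak into the wrong fields."""
    findings = []
    for i, row in enumerate(rows):
        if TAX_ID_PATTERN.search(row.get("name", "")):
            findings.append((i, "possible Tax ID in the name field"))
        if EMAIL_PATTERN.search(row.get("comments", "")):
            findings.append((i, "contact information in the comments field"))
    return findings

rows = [
    {"name": "Jane Smith", "comments": "reach me at jane@example.com"},
    {"name": "12-3456789", "comments": ""},
]
print(flag_misfielded(rows))
# [(0, 'contact information in the comments field'), (1, 'possible Tax ID in the name field')]
```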

Root Cause Attack Plan
  • Reward – Offer an incentive for those who enter personal data correctly. This should be focused on those who enter data from the outside, like those using Web forms. Employees should not need a reward to do their job. The type of reward will depend upon how important it is to have the correct information.
  • Accessibility – As a technologist in charge of data stewardship, be open and accessible to criticism from users. Give them a voice when process changes require technology changes. If you’re not accessible, users will look for quiet ways around your forms validation.
  • Real-time Validation – In addition to forms validation, data quality tools can be implemented to validate addresses, e-mail addresses, and other important information as it is entered.
This post is an excerpt from a white paper available here. More to come on this subject in the days ahead.

4 comments:

Rich said...

The foundations of data quality are weak. They were derived from the manufacturing quality arena without sufficient consideration of the distinctive nature of data. Data has no physical properties, it is temporal, and data quality measures are subjective. Next, estimates of the impacts of data quality were imagined and exaggerated, ranging from statements that poor data quality results in $600B a year in additional costs to claims of “death by data”. And finally, the deployment of data quality did not consider the long-term implications, impacts, or benefits. It was assumed that data quality was the right thing to do.

Current data quality practices are remedial in nature. Once deployed, a data quality program yields diminishing returns: once the low-hanging-fruit problems are resolved, there is little more to do. The root causes of egregious data quality problems can be found in the policies of the organization. These policies are typically sacrosanct, and changing them is unlikely unless a crisis occurs; even then, the policies may be so institutionalized that they remain. One concept from manufacturing quality has been ignored in data quality: quality has to be designed in. For data quality to be effective, the design of applications, databases, and business processes all have to be subject to quality control. These are the machinery that produces the data, and if they are flawed, the result is bad data.

Current data quality practices are like polishing a rusty car: it shines but continues to deteriorate with time. There are few preventative practices in data quality, and the impacts to the business are limited.

Those who recognize this suggest developing business cases, ROI analyses, and now data governance programs, which take data quality to another level of abstraction and obfuscation.

Data quality remains a practice of fixing data. Simplification of the problem is not a solution. Quality is a systemic problem; if an organization is interested in addressing quality, it has to be addressed at the systemic level.

Data quality is but a minor factor and is a distraction from the genuine “root” causes.

Steve Sarsfield said...

In other words, preventing data from getting 'bad' in the first place is preferred. I agree. We'll cover some of the areas that tools can't really help much in part two and beyond.
This series is mostly about changing processes and the hearts and minds of people, not tools.

Ramon Andrews said...

A very informative read on the causes of data quality problems. Thank you for sharing these helpful insights.

Clarissa Lucas said...

Thank you for posting this article on the causes of data quality problems. The information is really helpful. Cheers!

Disclaimer: The opinions expressed here are my own and don't necessarily reflect the opinion of my employer. The material written here is copyright (c) 2010 by Steve Sarsfield. To request permission to reuse, please e-mail me.