Thursday, August 25, 2011

Top Ten Root Causes of Data Quality Problems: Part Two

Part 2 of 5: Renegades and Pirates
In this continuing series, we're looking at the root causes of data quality problems and the business processes you can put in place to solve them. In part two, we examine IT renegades and corporate pirates as two more of those root causes.

Root Cause Number Three: Renegade IT and Spreadmarts
A renegade is a person who deserts and betrays an organizational set of principles. That’s exactly what some impatient business owners unknowingly do by moving data in and out of business solutions, databases and the like. Rather than wait for professional help from IT, eager business units may decide to create their own local applications without IT’s knowledge. While such an application may meet the immediate departmental need, it is unlikely to adhere to corporate standards for data, data models or interfaces. It might start as a copy of a sanctioned database loaded into a local application on team desktops. So-called “spreadmarts” – important pieces of data stored in Excel spreadsheets – are easily replicated across team desktops. In this scenario, you lose control of versions as well as standards: there are no backups, no version control and no business rules.

Root Cause Attack Plan
  • Corporate Culture – There should be a consequence for renegade data, making it more difficult for the renegades to create local data applications.
  • Communication – Educate and train your employees on the negative impact of renegade data.
  • Sandbox – Tools that help business users and IT professionals experiment with the data in a safe environment are crucial. A sandbox, where users experiment on data subsets and copies of production data, has proven successful for many organizations in limiting renegade IT (see the sketch after this list).
  • Locking Down the Data – The goal is a culture in which creating unsanctioned spreadmarts is shunned. Some organizations have also found success in locking down the data to make it more difficult to export.
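
As a rough illustration of the sandbox approach, the sketch below copies a small random subset of a production table into a separate sandbox database where users can experiment freely. The SQLite files and the customers table are hypothetical assumptions – substitute whatever sanctioned source and schema you actually have.

import sqlite3

# Hypothetical databases and schema, for illustration only.
prod = sqlite3.connect("production.db")
sandbox = sqlite3.connect("sandbox.db")

# Pull a small random subset of production data for safe experimentation.
rows = prod.execute(
    "SELECT id, name, email, city FROM customers ORDER BY RANDOM() LIMIT 1000"
).fetchall()

# Recreate the structure in the sandbox and load the subset.
sandbox.execute(
    "CREATE TABLE IF NOT EXISTS customers (id INTEGER, name TEXT, email TEXT, city TEXT)"
)
sandbox.executemany("INSERT INTO customers VALUES (?, ?, ?, ?)", rows)
sandbox.commit()

prod.close()
sandbox.close()

Because the sandbox is a disposable copy, experiments never touch the sanctioned data, and anything that proves useful can be promoted properly through IT.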

Root Cause Number Four: Corporate Mergers

Corporate mergers increase the likelihood of data quality errors because they usually happen fast and are unforeseen by IT departments. Almost immediately, there is pressure to consolidate and to take shortcuts on proper planning. The consolidation will likely require sharing data among a varied set of disjointed applications. Many shortcuts are taken to “make it happen,” often with known or unknown risks to data quality.
On top of the quick schedule, merging IT departments may encounter culture clashes and differing definitions of the truth. Additionally, mergers can result in a loss of expertise when key people leave midway through the project to seek new ventures.

Root Cause Attack Plan
  • Corporate Awareness – Whenever possible, a civil division of labor should be mandated by management to avoid culture clashes and data grabs by the power-hungry.
  • Document – Your IT initiative should survive even if the entire team leaves, disbands or gets hit by a bus while crossing the street. You can achieve this with proper documentation of the infrastructure.
  • Third-party Consultants – Management should be aware that there is extra work to do and that conflicts can arise after a merger. Consultants can provide the continuity needed to get through the transition.
  • Agile Data Management – Choose solutions and strategies that will keep your organization agile, giving you the ability to divide and conquer the workload without expensive licensing of commercial applications.
This post is an excerpt from a white paper available here. More to come on this subject in the days ahead.

Wednesday, August 24, 2011

Top Ten Root Causes of Data Quality Problems: Part One

Part 1 of 5: The Basics
We all know data quality problems when we see them. They can undermine your organization’s ability to work efficiently, comply with government regulations and generate revenue. The specific technical problems include missing data, misfielded attributes, duplicate records and broken data models, to name just a few.
But rather than merely patching up bad data, most experts agree that the best strategy for fighting data quality issues is to understand the root causes and put new processes in place to prevent them. This five-part blog series discusses the top ten root causes of data quality problems and suggests steps the business can implement to address them.
In this first blog post, we'll confront some of the more obvious root causes of data quality problems.

Root Cause Number One: Typographical Errors and Non-Conforming Data
Despite a lot of automation in our data architecture these days, data is still typed into Web forms and other user interfaces by people. A common source of data inaccuracy is that the person manually entering the data just makes a mistake. People mistype. They choose the wrong entry from a list. They enter the right data value into the wrong box.

Given complete freedom on a data field, those who enter data have to go from memory.  Is the vendor named Grainger, WW Granger, or W. W. Grainger? Ideally, there should be a corporate-wide set of reference data so that forms help users find the right vendor, customer name, city, part number, and so on.

Root Cause Attack Plan
  • Training – Make sure that those people who enter data know the impact they have on downstream applications.
  • Metadata Definitions – By locking down exactly what people can enter into a field using a definitive list, many problems can be alleviated. This metadata (for vendor names, part numbers, and so on) can become part of data quality in data integration, business applications and other solutions.
  • Monitoring – Make public the results of poorly entered data and praise those who enter data correctly. You can keep track of this with data monitoring software such as the Talend Data Quality Portal.
  • Real-time Validation – In addition to forms validation, data quality tools can be implemented to validate addresses, e-mail addresses and other important information as it is entered (see the sketch after this list). Ensure that your data quality solution provides the ability to deploy data quality in application server environments, in the cloud or on an enterprise service bus (ESB).
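
The sketch below is a minimal illustration of reference-data and syntax checks at entry time; the approved vendor list, field names and e-mail pattern are assumptions for illustration, not part of any particular product.

import re

# Hypothetical reference data; in practice this would come from a governed,
# corporate-wide list rather than a hard-coded set.
APPROVED_VENDORS = {"W. W. Grainger", "Acme Corp", "Globex"}
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_entry(vendor, email):
    """Return a list of problems found in one form submission."""
    problems = []
    if vendor not in APPROVED_VENDORS:
        problems.append("vendor '%s' is not on the approved reference list" % vendor)
    if not EMAIL_PATTERN.match(email):
        problems.append("'%s' does not look like a valid e-mail address" % email)
    return problems

# Reject the entry (or prompt the user) before it reaches downstream systems.
print(validate_entry("WW Granger", "jane.doe@example"))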

Root Cause Number Two: Information Obfuscation
Data entry errors are not always accidental. How often do people give incomplete or incorrect information to safeguard their privacy? If there is nothing at stake for those who enter data, there will be a tendency to fudge.

Even if the people entering data want to do the right thing, sometimes they cannot. If a field is not available, an alternate field is often used. This can lead to such data quality issues as having Tax ID numbers in the name field or contact information in the comments field.
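
A simple profiling pass can flag this kind of misfielded data before it spreads. The sketch below is illustrative only; the patterns and the sample records are assumptions, not a complete rule set.

import re

TAX_ID = re.compile(r"^\d{2}-?\d{7}$")          # looks like a US EIN
PHONE = re.compile(r"\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}")

# Made-up records standing in for real profiler input.
records = [
    {"name": "12-3456789", "comments": "prefers e-mail"},
    {"name": "Jane Doe",   "comments": "call 555-867-5309 after 5pm"},
]

for i, rec in enumerate(records):
    if TAX_ID.match(rec["name"]):
        print("record %d: name field appears to contain a Tax ID" % i)
    if PHONE.search(rec["comments"]):
        print("record %d: comments field appears to contain contact information" % i)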

Root Cause Attack Plan
  • Reward – Offer an incentive for those who enter personal data correctly. Focus this on those who enter data from the outside, such as users of Web forms; employees should not need a reward to do their job. The type of reward will depend on how important it is to have the correct information.
  • Accessibility – As the technologist in charge of data stewardship, be open to criticism from users. Give them a voice when process changes require technology changes. If you’re not accessible, users will look for quiet ways around your forms validation.
  • Real-time Validation – In addition to forms validation, data quality tools can be implemented to validate addresses, e-mail addresses and other important information as it is entered.
This post is an excerpt from a white paper available here. More to come on this subject in the days ahead.

Monday, June 13, 2011

The Differences Between Small and Big Data

There is a lot of buzz today about big data and about companies stepping up to meet the challenge of ever-increasing data volumes. At the center of it all are Hadoop and the cloud. Hadoop intelligently manages the distribution of processing and of your files: it provides the infrastructure needed to break big data down into more manageable chunks for processing by multiple servers. Likewise, a cloud strategy can take data management outside the walls of a corporation into a highly scalable infrastructure.
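
As a toy illustration of that split-and-process idea – this is plain Python multiprocessing on a single machine, not Hadoop itself, and the records are made up:

from multiprocessing import Pool

def count_bad_emails(chunk):
    # The "map" step: each worker counts problem records in its own chunk.
    return sum(1 for rec in chunk if "@" not in rec)

if __name__ == "__main__":
    records = ["a@example.com", "missing-at-sign", "b@example.com"] * 100_000
    chunk_size = 25_000
    chunks = [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

    with Pool() as pool:
        partial_counts = pool.map(count_bad_emails, chunks)

    # The "reduce" step: combine the partial results into one answer.
    print(sum(partial_counts))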

Do you have big data? It’s difficult to know precisely, because big data is vaguely defined. You may qualify for big data technology if you face hundreds of gigabytes of data, or it may take hundreds or thousands of terabytes. The classification of “big data” is not strictly defined by data size; other business factors come into play, too. Your data management infrastructure needs to take into account future data volumes, peaks and lulls in demand, business requirements and much more.

Small and Medium-Sized Data

What about “small” and medium-sized data? For example, data from spreadsheets, the occasional flat file, leads from a trade show, and catalog data from vendors may be vital to your business processes. With a new industry focus on transparency, business-user involvement and sharing of data, small data is a constant issue. Spreadsheets and flat files are the preferred way to share data today because most companies have some process for handling them. When you get these small to medium-sized data sets, it is still necessary to do the following (a minimal sketch follows the list):
  • profile them
  • integrate them into your relational database
  • aggregate data from these sources, or extract only the vital parts
  • apply data quality standards when necessary
  • use them as part of a master data management (MDM) initiative
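
Here is a minimal sketch of that kind of small-data handling using only the Python standard library; the file name leads.csv and its columns are assumptions for illustration.

import csv
import sqlite3

# Hypothetical trade-show file with name, company and email columns.
with open("leads.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Profile: how complete is each column?
for col in rows[0].keys():
    filled = sum(1 for r in rows if (r[col] or "").strip())
    print("%s: %.0f%% populated" % (col, 100.0 * filled / len(rows)))

# Integrate: load the file into a relational table for downstream use.
db = sqlite3.connect("staging.db")
db.execute("CREATE TABLE IF NOT EXISTS leads (name TEXT, company TEXT, email TEXT)")
db.executemany("INSERT INTO leads VALUES (:name, :company, :email)", rows)
db.commit()
db.close()

The same few steps scale down nicely to the one-off spreadsheet or flat file without any license fees.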

The Differing Goals of Big Data and Little Data
With big data, the concern is usually whether your data management technology can handle massive quantities and still provide meaningful aggregates. You need solutions that will scale to meet your data management needs. Handling small and medium data sets, however, is more about short- and long-term costs: how can you quickly and easily integrate data without a lot of red tape, big license fees, pain and suffering?

Think about it. When you need to handle small and medium data, you have options:
  • Hand-coding: Hand-coding is sometimes faster than any tool, and it may still be fine for ad hoc, one-off data integration. Once you find yourself hand-coding again and again, though, you’ll rethink that strategy: eventually managing all that code will waste time and cost you a bundle. If your data volumes grow, hand-coded solutions quickly become obsolete because they do not scale. Hand-coding gets high marks on speed to value, but falters in sustainability and long-term costs.
  • Open Source: Open source data management tools provide a quick way to get started, low overall costs and high sustainability.  By just downloading and learning the tools, you’re on your way to getting data management done.  The open source solutions may have some limitations on scalability, but most open source providers have low-cost commercial upgrades that meet these needs.  In other words, it's easy to start today and leverage Hadoop and the Cloud if you need it later. Open source gets high marks on speed to value, sustainability and costs.
  • Traditional Data Management Vendors: Small data is a tough issue for the mega-vendors. Even for 50K-100K records, the license cost in both the short term and the long term can be prohibitive. The mega-vendor solutions do tend to scale well, making them sustainable at a cost. However, mergers in the data management business do happen, and the sustainability of a product can be affected by them. Commercial vendors get respectable marks in speed to value and sustainability, but falter in high up-front costs and maintenance fees.
I've heard it a million times in this business: start small and fast, with technology that gives you a quick success but also scales to future tasks.

Monday, May 16, 2011

The Butterfly Effect and Data Quality

I just wrote a paper for Talend called the ‘Butterfly Effect’ of poor data quality.

The term butterfly effect refers to the way a minor event – like the movement of a butterfly’s wing – can have a major impact on a complex system – like the weather. The movement of the butterfly wing represents a small change in the initial condition of the system, but it starts a chain of events: moving pollen through the air, which causes a gazelle to sneeze, which triggers a stampede of gazelles, which raises a cloud of dust, which partially blocks the sun, which alters the atmospheric temperature, which ultimately alters the path of a tornado on the other side of the world.

Enterprise data is equally susceptible to the butterfly effect. When poor-quality data enters the complex system of enterprise data, even a small error – transposed letters in a street address or part number – can lead to 1) revenue loss, 2) process inefficiency and 3) failure to comply with industry and government regulations. Organizations depend on the movement and sharing of data throughout the organization, so the impact of data quality errors is costly and far-reaching. Data issues often begin with a tiny mistake in one part of the organization, but the butterfly effect can produce far-reaching results.

The Pervasiveness of Data
When data enters the corporate ecosystem, it rarely stays in one place. Data is pervasive. As it moves throughout a corporation, data impacts systems and business processes. The negative impact of poor data quality reverberates as it crosses departments, business units and cross-functional systems.
  • Customer Relationship Management (CRM) – By standardizing customer data, you will be able to offer better, more personalized customer service. And you will be better able to contact your customers and prospects for cross-sell, up-sell, notification and services.
  • ERP / Supply Chain Data – If you have clean data in your supply chain, you can achieve some tangible benefits. First, the company will have a clear picture of delivery times on orders because of a completely transparent supply chain. Next, you will avoid unnecessary warehouse costs by holding the right amount of inventory in stock. Finally, you will be able to see buying patterns and use that information when negotiating supply contracts.
  • Orders / Billing System – If you have clean data in your billing systems, you can achieve the tangible benefits of more accurate financial reporting and correct invoices that reach the customer in a timely manner. An accurate bill not only builds trust within the billing department, but customer attrition rates will also be lower when invoices are delivered accurately and on time.
  • Data Warehouse – If you have standardized the data feeding into your data warehouse, you can dramatically improve business intelligence. Employees can access the data warehouse and be assured that the data they use for reports, analysis and decision making is accurate. Using the clean data in a warehouse can help you find trends, see relationships between data, and understand the competition in a new light.
To read more about the butterfly effect of data quality, download the paper from the Talend site.

Monday, May 9, 2011

MIT Information Quality Symposium

This year I’m planning to attend the MIT IQ symposium again. I’m also one of the vice chairs of the event. The symposium, held each July in Boston, brings practitioners and academics together to discuss and exchange ideas about data quality.

I return to this conference and participate in the planning every year because I think it’s one of the most important data quality events. The people here really do change the course of information management. On these hot summer days in Boston, government, healthcare and general business professionals collaborate on the latest developments in data quality. This event has the potential to dramatically change the world – the people, organizations, and governments who manage data. I’ve grown to really enjoy the combination of ground-breaking presentations, high-ranking government officials, sharp consultants and MIT hallway chat that you find here.

If you have some travel budget, please consider joining me at this event.

Friday, April 29, 2011

Open Source and Data Quality

My latest video on the Talend Channel is about data quality and open source.

This was filmed in the Paris office in January. I can get excited in any time zone when it comes to data quality.

Monday, April 25, 2011

Data Quality Scorecard: Making Data Quality Relevant

Most data governance practitioners agree that a data quality scorecard is an important tool in any data governance program. It provides comprehensive information about the quality of data in a database and, perhaps even more importantly, allows business users and technical users to collaborate on quality issues.

However, there are multiple levels of metrics that you should consider:

  • Level 1 – Metrics that technologists use to fix data quality problems. Examples: 7% of the e-mail attribute is blank; 12% of the e-mail attribute does not follow standard e-mail syntax; 13% of our US mail addresses fail address validation.
  • Level 2 – Metrics business people use to make decisions about the data. Examples: 9% of my contacts have invalid e-mails; 3% have both invalid e-mails and invalid addresses.
  • Level 3 – Metrics managers use to get the big picture. Example: this customer data is good enough to use for a campaign.

All levels are important for the various members of the data governance team. Level one shows the steps you need to take to fix the data. Level two gives context to the task at hand. Level three tells the uninformed about the business issue without making them dig into the details.

So, when you’re building your DQ metrics, remember to roll the detailed data up into progressively higher-level metrics. Design the scorecards to meet the needs of the different audiences, from technical through business and up to executive. At the bottom of a data quality scorecard is information about the data quality of individual attributes – the default information that most profilers deliver out of the box. As you aggregate scores, the higher-level measures of data quality become more meaningful. In the middle are various score sets that allow your company to analyze and summarize data quality from different perspectives. If you define the objective of a data quality assessment project as calculating these different aggregations, you will have a much easier time maturing your data governance program, and the business users and C-level will begin to pay attention.
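
As a minimal sketch of how attribute-level (level one) metrics can roll up into a record-level (level two) measure – the contact records and rules here are made up for illustration:

import re

EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Made-up contact records standing in for profiler output.
contacts = [
    {"email": "ann@example.com", "address": "12 Main St, Boston MA"},
    {"email": "",                "address": "45 Oak Ave, Salem MA"},
    {"email": "bob@invalid",     "address": ""},
]
n = len(contacts)

# Level one: attribute-level metrics the technologists act on.
blank_email = sum(1 for c in contacts if not c["email"]) / n
bad_syntax = sum(1 for c in contacts if c["email"] and not EMAIL.match(c["email"])) / n
print("blank e-mails: %.0f%%, bad e-mail syntax: %.0f%%" % (100 * blank_email, 100 * bad_syntax))

# Level two: roll-up into a measure a business user can act on.
unreachable = sum(1 for c in contacts
                  if not EMAIL.match(c["email"]) and not c["address"]) / n
print("contacts with neither a usable e-mail nor an address: %.0f%%" % (100 * unreachable))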

Disclaimer: The opinions expressed here are my own and don't necessarily reflect the opinion of my employer. The material written here is copyright (c) 2010 by Steve Sarsfield. To request permission to reuse, please e-mail me.