
Wednesday, August 31, 2011

Top Ten Root Causes of Data Quality Problems: Part Five

Part 5 of 5: People Issues
In this continuing series, we're looking at root causes of data quality problems and the business processes you can put in place to solve them.  Companies rely on data to make significant decisions that can affect customer service, regulatory compliance, supply chain and many other areas. As you collect more and more information about customers, products, suppliers, transactions and billing, you must attack the root causes of data quality problems.

Root Cause Number Nine: Defining Data Quality

More and more companies recognize the need for data quality, but there are different ways to   clean data and improve data quality.   You can:
  • Write some code and cleanse manually
  • Handle data quality within the source application
  • Buy tools to cleanse data
However, consider what happens when you have two or more of these types of data quality processes adjusting and massaging the data. Sales has one definition of customer, while billing has another.  Due to differing processes, they don't agree on whether two records are duplicates.

Root Cause Attack Plan
  • Standardize Tools – Whenever possible, choose tools that aren't tied to a particular application. Having data quality only in SAP, for example, won't help your Oracle, Salesforce and MySQL data sets.  When picking a solution, select one that is capable of accessing any data, anywhere, at any time.  It shouldn't cost you a bundle to leverage a common solution across multiple platforms and applications.
  • Data Governance – By setting up a cross-functional data governance team, you will have the people in place to define a common data model.

Root Cause Number Ten: Loss of Expertise

On almost every data intensive project, there is one person whose legacy data expertise is outstanding. These are the folks who understand why some employee date of hire information is stored in the date of birth field and why some of the name attributes also contain tax ID numbers. 
Data might be a kind of historical record for an organization. It might have come from legacy systems. In some cases, the same value in the same field will mean a totally different thing in different records. Knowledge of these anomalies allows experts to use the data properly.
If you encounter this situation, there are some business processes you can follow.

Root Cause Attack Plan
  • Profile and Monitor – Profiling the data will help you identify most of these types of issues.  For example, if you have a tax ID number embedded in the name field, analysis will let you quickly spot it (a minimal sketch follows this list). Monitoring will prevent a recurrence.
  • Document – Although they may be reluctant to do so for fear of losing job security, make sure experts document all of the anomalies and transformations that need to happen every time the data is moved.
  • Use Consultants – Expert employees may be so valuable and busy that there is no time to document the legacy anomalies. Outside consulting firms are usually very good at documenting issues and providing continuity between legacy and new employees.
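
To make the profiling step concrete, here is a minimal sketch in Python (not any particular tool); the sample records and the nine-digit tax ID pattern are assumptions you would adapt to your own data:

    import re

    # Hypothetical sample of legacy name values; in practice you would
    # profile the real name column from the source system.
    names = [
        "ACME TOOL AND DIE",
        "W. W. Grainger 36-1150280",   # tax ID embedded in the name field
        "Smith, John",
    ]

    # An EIN-style pattern: two digits, an optional dash, then seven digits.
    tax_id_pattern = re.compile(r"\b\d{2}-?\d{7}\b")

    # Simple profiling pass: flag any name value that matches the pattern.
    for value in names:
        if tax_id_pattern.search(value):
            print(f"Possible tax ID embedded in name field: {value!r}")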

This post is an excerpt from a white paper available here. More to come on this subject in the days ahead.

Tuesday, August 30, 2011

Top Ten Root Causes of Data Quality Problems: Part Four

Part 4 of 5: Data Flow
In this continuing series, we're looking at root causes of data quality problems and the business processes you can put in place to solve them.  In part four, we examine some of the areas involving the pervasive nature of data and how it flows to and fro within an organization.

Root Cause Number Seven: Transaction Transition

More and more data is exchanged between systems through real-time (or near real-time) interfaces. As soon as the data enters one database, it triggers procedures necessary to send transactions to other downstream databases. The advantage is immediate propagation of data to all relevant databases.

However, what happens when transactions go awry? A malfunctioning system could cause problems with downstream business applications.  In fact, even a small data model change could cause issues.

Root Cause Attack Plan
  • Schema Checks – Employ schema checks in your job streams to make sure your real-time applications are producing consistent data.  Schema checks will do basic testing to make sure your data is complete and formatted correctly before loading.
  • Real-time Data Monitoring – One level beyond schema checks is to proactively monitor data with profiling and data monitoring tools.  Tools like the Talend Data Quality Portal and others will ensure the data contains the right kind of information.  For example, if your part numbers are always a certain shape and length, and contain a finite set of values, any variation on that attribute can be monitored (a minimal sketch follows this list). When variations occur, the monitoring software can notify you.
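
As an illustration of the part-number example, here is a minimal monitoring sketch in Python. The prefix set and the two-letters-dash-five-digits format are invented for the example; a monitoring tool would let you watch this kind of rule continuously rather than in a one-off script:

    import re

    # Assumed rule: part numbers are a known two-letter prefix, a dash,
    # and five digits (e.g. "PN-12345").
    valid_prefixes = {"PN", "SK", "AS"}
    part_pattern = re.compile(r"^([A-Z]{2})-(\d{5})$")

    def check_part_number(value: str) -> bool:
        """Return True if the value matches the expected shape and domain."""
        match = part_pattern.match(value)
        return bool(match) and match.group(1) in valid_prefixes

    incoming = ["PN-12345", "pn-12345", "XX-99999", "PN-1234"]
    for value in incoming:
        if not check_part_number(value):
            print(f"Alert: part number {value!r} violates the expected format")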

Root Cause Number Eight: Metadata Metamorphosis

A metadata repository should be shared across multiple projects, with an audit trail maintained on usage and access.  For example, your company might have part numbers and descriptions that are universal to CRM, billing, ERP systems, and so on.  When a part number becomes obsolete in the ERP system, the CRM system should know. Metadata changes and needs to be shared.

In theory, documenting the complete picture of what is going on in the database and how various processes are interrelated would allow you to completely mitigate the problem. Sharing the descriptions and part numbers among all applicable applications needs to happen. Once that picture exists, you could analyze the data quality implications of any changes in code, processes, data structure, or data collection procedures and thus eliminate unexpected data errors. In practice, this is a huge task.

Root Cause Attack Plan
  • Predefined Data Models – Many industries now have basic definitions of what should be in any given set of data.  For example, the automotive industry follows certain ISO 8000 standards.  The energy industry follows Petroleum Industry Data Exchange standards or PIDX.  Look for a data model in your industry to help.
  • Agile Data Management – Data governance is achieved by starting small and building out a process that first fixes the most important problems from a business perspective. You can leverage agile solutions to share metadata and set up optional processes across the enterprise.

This post is an excerpt from a white paper available here. My final post on this subject in the days ahead.

Thursday, August 25, 2011

Top Ten Root Causes of Data Quality Problems: Part Two

Part 2 of 5: Renegades and Pirates
In this continuing series, we're looking at root causes of data quality problems and the business processes you can put in place to solve them.  In part two, we examine IT renegades and corporate pirates as two of the root causes for data quality problems.

Root Cause Number Three: Renegade IT and Spreadmarts
A renegade is a person who deserts and betrays an organizational set of principles. That's exactly what some impatient business owners unknowingly do by moving data in and out of business solutions, databases and the like. Rather than wait for professional help from IT, eager business units may decide to create their own set of local applications without the knowledge of IT. While the application may meet the immediate departmental need, it is unlikely to adhere to standards of data, data model or interfaces. The effort might start with a copy of a sanctioned database loaded into a local application on team desktops. So-called "spreadmarts," which are important pieces of data stored in Excel spreadsheets, are easily replicated to team desktops. In this scenario, you lose control of versions as well as standards. There are no backups, versioning or business rules.

Root Cause Attack Plan
  • Corporate Culture – There should be a consequence for renegade data, making it more difficult for the renegades to create local data applications.
  • Communication – Educate and train your employees on the negative impact of renegade data.
  • Sandbox – Having tools that can help business users and IT professionals experiment with the data in a safe environment is crucial. A sandbox, where users experiment on data subsets and copies of production data, has proven successful for many organizations in limiting renegade IT.
  • Locking Down the Data – The goal is a culture where creating unsanctioned spreadmarts is shunned.  Some organizations have found success in locking down the data to make it more difficult to export.

Root Cause Number Four: Corporate Mergers

Corporate mergers increase the likelihood for data quality errors because they usually happen fast and are unforeseen by IT departments. Almost immediately, there is pressure to consolidate and take shortcuts on proper planning. The consolidation will likely include the need to share data among a varied set of disjointed applications. Many shortcuts are taken to “make it happen,” often involving known or unknown risks to the data quality.
On top of the quick schedule, merging IT departments may encounter culture clash and a different definition of truth.  Additionally, mergers can result in a loss of expertise when key people leave midway through the project to seek new ventures.

Root Cause Attack Plan
  • Corporate Awareness – Whenever possible, a civil division of labor should be mandated by management to avoid culture clashes and data grabs by the power hungry.
  • Document – Your IT initiative should survive even if the entire team leaves, disbands or gets hit by a bus when crossing the street.  You can do this with proper documentation of the infrastructure.
  • Third-party Consultants – Management should be aware that there is extra work to do and that conflicts can arise after a merger. Consultants can provide the continuity needed to get through the transition.
  • Agile Data Management – Choose solutions and strategies that will keep your organization agile, giving you the ability to divide and conquer the workload without expensive licensing of commercial applications.
This post is an excerpt from a white paper available here. More to come on this subject in the days ahead.

Wednesday, August 24, 2011

Top Ten Root Causes of Data Quality Problems: Part One

Part 1 of 5: The Basics
We all know data quality problems when we see them.  They can undermine your organization's ability to work efficiently, comply with government regulations and generate revenue. The specific technical problems include missing data, misfielded attributes, duplicate records and broken data models, to name just a few.
But rather than merely patching up bad data, most experts agree that the best strategy for fighting data quality issues is to understand the root causes and put new processes in place to prevent them.  This five-part blog series discusses the top ten root causes of data quality problems and suggests steps the business can implement to prevent them.
In this first blog post, we'll confront some of the more obvious root causes of data quality problems.

Root Cause Number One: Typographical Errors and Non-Conforming Data
Despite a lot of automation in our data architecture these days, data is still typed into Web forms and other user interfaces by people. A common source of data inaccuracy is that the person manually entering the data just makes a mistake. People mistype. They choose the wrong entry from a list. They enter the right data value into the wrong box.

Given complete freedom on a data field, those who enter data have to go from memory.  Is the vendor named Grainger, WW Granger, or W. W. Grainger? Ideally, there should be a corporate-wide set of reference data so that forms help users find the right vendor, customer name, city, part number, and so on.
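
As a rough sketch of that idea (the lookup table here is a tiny, hypothetical stand-in for a corporate-wide reference data set), a form or batch process could standardize vendor names like this:

    # Hypothetical reference table mapping common variants to the sanctioned name.
    vendor_reference = {
        "grainger": "W. W. Grainger",
        "ww granger": "W. W. Grainger",
        "w. w. grainger": "W. W. Grainger",
    }

    def standardize_vendor(entered: str) -> str:
        """Return the sanctioned vendor name, or the input unchanged if unknown."""
        return vendor_reference.get(entered.strip().lower(), entered)

    print(standardize_vendor("WW Granger"))   # -> W. W. Grainger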

Root Cause Attack Plan
  • Training – Make sure that those people who enter data know the impact they have on downstream applications.
  • Metadata Definitions – By locking down exactly what people can enter into a field using a definitive list, many problems can be alleviated. This metadata (for vendor names, part numbers, and so on) can become part of data quality in data integration, business applications and other solutions.
  • Monitoring – Make public the results of poorly entered data and praise those who enter data correctly. You can keep track of this with data monitoring software such as the Talend Data Quality Portal.
  • Real-time Validation – In addition to forms validation, data quality tools can be implemented to validate addresses, e-mail addresses and other important information as it is entered (a minimal sketch follows this list). Ensure that your data quality solution provides the ability to deploy data quality in application server environments, in the cloud or in an enterprise service bus (ESB).
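
To illustrate the real-time validation point, here is a minimal sketch of a field-level check that could sit behind a web form; the pattern is deliberately simple and only an assumption about what counts as valid for your forms:

    import re

    # A pragmatic (not RFC-complete) e-mail pattern for form-level screening.
    email_pattern = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

    def validate_email(value: str) -> bool:
        """Reject obviously malformed addresses before they reach downstream systems."""
        return bool(email_pattern.match(value.strip()))

    for candidate in ["jane.doe@example.com", "not-an-email", "missing@dot"]:
        print(candidate, "->", "ok" if validate_email(candidate) else "rejected")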

Root Cause Number Two: Information Obfuscation
Data entry errors might not be completely by mistake. How often do people give incomplete or incorrect information to safeguard their privacy?  If there is nothing at stake for those who enter data, there will be a tendency to fudge.

Even if the people entering data want to do the right thing, sometimes they cannot. If a field is not available, an alternate field is often used. This can lead to such data quality issues as having Tax ID numbers in the name field or contact information in the comments field.

Root Cause Attack Plan
  • Reward – Offer an incentive for those who enter personal data correctly. This should be focused on those who enter data from the outside, like those using Web forms. Employees should not need a reward to do their job. The type of reward will depend upon how important it is to have the correct information.
  • Accessibility – As a technologist in charge of data stewardship, be open and accessible to criticism from users. Give them a voice when process changes require technology changes.  If you're not accessible, users will look for quiet ways around your forms validation.
  • Real-time Validation – In addition to forms validation, data quality tools can be implemented to validate addresses, e-mail addresses and other important information as it is entered.
This post is an excerpt from a white paper available here. More to come on this subject in the days ahead.

Monday, June 13, 2011

The Differences Between Small and Big Data

There is a lot of buzz today about big data and companies stepping up to meet the challenge of ever increasing data volumes. In the center of it all are Hadoop and the Cloud.  Hadoop can intelligently manage the distribution of processing and your files. It manages the infrastructure needed to break down big data into more manageable chunks for processing by multiple servers. Likewise, a cloud strategy can take data management outside the walls of a corporation into a highly scalable infrastructure.

Do you have big data?  It's difficult to know precisely whether you do because big data is vaguely defined. You may qualify for big data technology if you face hundreds of gigabytes of data, or it may be hundreds or thousands of terabytes. The classification of "big data" is not strictly defined by data size; other business considerations matter, too. Your data management infrastructure needs to take into account factors like future data volumes, peaks and lulls in requirements, business requirements and much more.

Small and Medium-Sized Data

What about "small" and medium-sized data? For example, data from spreadsheets, the occasional flat file, leads from a trade show, and catalog data from vendors may be vital to your business processes. With a new industry focus on transparency, business user involvement and sharing of data, small data is a constant issue.  Spreadsheets and flat files are the preferred method to share data today because most companies have some process for handling them. When you get these small to medium-sized data sets, it is still necessary to (a minimal sketch of the first two steps follows this list):
  • profile them
  • integrate them into your relational database
  • aggregate data from these sources, or extract only the vital parts
  • apply data quality standards when necessary
  • use them as part of a master data management (MDM) initiative
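
As a minimal sketch of the first two steps in that list (assuming a hypothetical leads.csv from a trade show with name and email columns, and a local SQLite database standing in for your relational target), the workflow can be as lightweight as this:

    import csv
    import sqlite3

    # Profile: count rows and missing e-mail values in the incoming file.
    rows, missing_email = [], 0
    with open("leads.csv", newline="") as f:
        for row in csv.DictReader(f):
            rows.append(row)
            if not (row.get("email") or "").strip():
                missing_email += 1
    print(f"{len(rows)} leads read, {missing_email} missing an email address")

    # Integrate: load the rows into a relational table.
    conn = sqlite3.connect("crm.db")
    conn.execute("CREATE TABLE IF NOT EXISTS leads (name TEXT, email TEXT)")
    conn.executemany(
        "INSERT INTO leads (name, email) VALUES (?, ?)",
        [(r.get("name", ""), r.get("email", "")) for r in rows],
    )
    conn.commit()
    conn.close()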

The Different Goals of Big Data and Little Data
With big data, the concern is usually about your data management technology's ability to handle massive quantities in order to provide you aggregates that are meaningful.  You need solutions that will scale to meet your data management needs.  However, handling small and medium data sets is more about short and long term costs.  How can you quickly and easily integrate data without a lot of red tape, big license fees, pain and suffering?

Think about it. When you need to handle small and medium data, you have options:
  • Hand-coding: Using hand-coding is sometimes faster than any solution and it still may be OK for ad-hoc, one-off data integration.  Once you find yourself hand-coding again and again, you'll find yourself rethinking that strategy. Eventually managing all that code will waste time and cost you a bundle. If your data volumes grow, hand-coded solutions quickly become obsolete due to lack of scalability. Hand-coding gets high marks on speed to value, but falters in sustainability and long-term costs.
  • Open Source: Open source data management tools provide a quick way to get started, low overall costs and high sustainability.  By just downloading and learning the tools, you’re on your way to getting data management done.  The open source solutions may have some limitations on scalability, but most open source providers have low-cost commercial upgrades that meet these needs.  In other words, it's easy to start today and leverage Hadoop and the Cloud if you need it later. Open source gets high marks on speed to value, sustainability and costs.
  • Traditional Data Management Vendors: Small data is a tough issue for the mega-vendors. Even for 50K-100K records, the license cost in both the short term and long term could be prohibitive.  The mega-vendor solutions do tend to scale well, making them sustainable at a cost. However, mergers in the data management business do happen, and the sustainability of a product can be affected by these mergers.  Commercial vendors get respectable marks in speed to value and sustainability, but falter in high up-front costs and maintenance fees.
I've heard it a million times in this business - start small and fast with technology that gives you a fast success but also scales to future tasks.

    Monday, May 16, 2011

    The Butterfly Effect and Data Quality

    I just wrote a paper called the ‘Butterfly Effect’ of poor data quality for Talend.

    The term butterfly effect refers to the way a minor event – like the movement of a butterfly’s wing – can have a major impact on a complex system – like the weather. The movement of the butterfly wing represents a small change in the initial condition of the system, but it starts a chain of events: moving pollen through the air, which causes a gazelle to sneeze, which triggers a stampede of gazelles, which raises a cloud of dust, which partially blocks the sun, which alters the atmospheric temperature, which ultimately alters the path of a tornado on the other side of the world.

    Enterprise data is equally susceptible to the butterfly effect.  When poor quality data enters the complex system of enterprise data, even a small error – the transposed letters in a street address or part number – can lead to 1) revenue loss; 2) process inefficiency; and 3) failure to comply with industry and government regulations. Organizations depend on the movement and sharing of data throughout the organization, so the impact of data quality errors is costly and far reaching. Data issues often begin with a tiny mistake in one part of the organization, but the butterfly effect can produce far reaching results.

    The Pervasiveness of Data
    When data enters the corporate ecosystem, it rarely stays in one place.  Data is pervasive. As it moves throughout a corporation, data impacts systems and business processes. The negative impact of poor data quality reverberates as it crosses departments, business units and cross-functional systems.
    • Customer Relationship Management (CRM) - By standardizing customer data, you will be able to offer better, more personalized customer service.  And you will be better able to contact your customers and prospects for cross-sell, up-sell, notification and services.
    • ERP / Supply Chain Data- If you have clean data in your supply chain, you can achieve some tangible benefits.  First, the company will have a clear picture about delivery times on orders because of a completely transparent supply chain. Next, you will avoid unnecessary warehouse costs by holding the right amount of inventory in stock.  Finally, you will be able to see all the buying patterns and use that information when negotiating supply contracts.
    • Orders / Billing System - If you have clean data in your billing systems, you can achieve the tangible benefits of more accurate financial reporting and correct invoices that reach the customer in a timely manner.  An accurate bill not only leads to trust among workers in the billing department, but customer attrition rates will be lower if invoices are delivered accurately and on time.
    • Data Warehouse - If you have standardized the data feeding into your data warehouse, you can dramatically improve business intelligence. Employees can access the data warehouse and be assured that the data they use for reports, analysis and decision making is accurate. Using the clean data in a warehouse can help you find trends, see relationships between data, and understand the competition in a new light.
    To read more about the butterfly effect of data quality, download the paper from the Talend site.

    Monday, May 9, 2011

    MIT Information Quality Symposium

    This year I’m planning to attend the MIT IQ symposium again.  I’m also one of the vice chairs of the event. The symposium is a July event in Boston that is a discussion and exchange of ideas about data quality between practitioners and academicians.

    I return to this conference and participate in the planning every year because I think it’s one of the most important data quality events.  The people here really do change the course of information management.  On these hot summer days in Boston, government, healthcare and general business professionals collaborate on the latest updates about data quality.  This event has the potential to dramatically change the world – the people, organizations, and governments who manage data. I’ve grown to really enjoy the combination of ground-breaking presentations, high ranking government officials, sharp consultants and MIT hallway chat that you find here.

    If you have some travel budget, please consider joining me for this event.

    Tuesday, March 15, 2011

    Open Source Data Management or Do-it-Yourself

    With the tough economy people are still cutting back on corporate spending.  There is a sense of urgency to just get things done, and sometimes that can lead to hand-coding your own data integration, data quality or MDM functions. When you begin to develop your plan and strategies for data management, you have to think about all the hidden costs of getting solutions out-of-the-box versus building on your own.

    Reusability is one key consideration. Using data management technologies that only plug into one system just doesn't make sense.  It's difficult to get that re-usability with custom code, unless your programmers have high visibility into other projects. On the other hand, all tool vendors, even open source ones, have pressure from their clients to support multiple databases and business solutions.  Open source solutions are built to work in a wider variety of architectures. You can move your data management processes between JD Edwards and SAP and SalesForce, for example, with relative ease.

    Indemnity is another consideration. What if something goes wrong with your home-grown solution after the chief architect leaves his job? Who are you going to call? If something goes wrong with your open source solution, you can turn to the community or call the vendor for support.

    Long-term costs are yet another issue.  Home-grown solutions have the tendency to start cheap and get more expensive as time goes on.  It’s difficult to manage custom code, especially if it is poorly documented. You hire consultants to manage code.  Eventually, you have to rip and replace and that can be costly.

    You should consider your human resources, too. Does it make sense to have a team work on hand-coding database extractions and transformations, or would the total cost/benefit be better if you used an open source data integration tool? It might just free up some of your programmers to pursue more important, ROI-centric ventures.

    If you’re thinking of cooking up your own technical solutions for data management, hoping to just get it done, think again. Your most economical solution might just be to leverage the community of experts and go with open source.

    Thursday, March 10, 2011

    My Interview in the Talend Newsletter

    Q. Some people would say that data quality technology is mature and that the topic is sort of stale. Are there major changes happening in the data quality world today?
    A. Probably the biggest over-arching change we see today is that the distinction between those managing data from the business standpoint and those managing the technical aspects of data quality is getting more and more blurry. It used to be that data quality was... read more

    Friday, December 10, 2010

    Six Data Management Predictions for 2011

    This time of year everyone makes prognostications about the state of the data management field for 2011. I thought I’d take my turn by offering my predictions for the coming year.

    Data will become more open
    In the old days good quality reference data was an asset kept in the corporate lockbox. If you had a good reference table for common misspellings of parts, cities, or names for example, the mindset was to keep it close and away from falling into the wrong hands.  The data might have been sold for profit or simply not made available.  Today, there really is no "wrong hands".  Governments and corporations alike are seeing the societal benefits of sharing information. More reference data is there for the taking on the internet from sites like data.gov and geonames.org.  That trend will continue in 2011.  Perhaps we'll even see some of the bigger players make announcements as to the availability of their data. Are you listening, Google?

    Business and IT will become blurry
    It's becoming harder and harder to tell an IT guy from the head of marketing. That's because in order to succeed, the IT folks need to become more like the marketer and vice versa.  In the coming year, the difference will be less noticeable as business people get more and more involved in using data to their benefit.  Newsflash One: If you're in IT, you need marketing skills to pitch your projects and get funding.  Newsflash Two: If you're in business, you need to know enough about data management practices to succeed.

    Tools will become easier to use
    As the business users come into the picture, they will need access to the tools to manage data.  Vendors must respond to this new marketplace or die.

    Tools will do less heavy lifting
    Despite the improvements in the tools, corporations will turn to improving processes and reporting in order to achieve better data management. Dwindling are the days when we're dealing with data that is so poorly managed that it requires overly complicated data quality tools.  We're getting better at the data management process and therefore, the burden on the tools becomes less. Future tools will focus on supporting the process improvement with workflow features, reporting and better graphical user interfaces.

    CEOs and Government Officials will gain enlightenment
    Feeding off the success of a few pioneers in data governance as well as failures of IT projects in our past, CEOs and governments will gain enlightenment about managing their data and put teams in place to handle it.  It has taken decades of our sweet-talk and cajoling for government and CEOs to achieve enlightenment, but I believe it is practically here.

    We will become more reliant on data
    Ten years ago, it was difficult to imagine we would be where we are today with respect to our data addiction. Today, data is a pervasive part of our internet-connected society, living in our PCs, our TVs, our mobile phones and many other devices. It's a huge part of our daily lives. As I've said in past posts, the world is addicted to data and that bodes well for anyone who helps the world manage it. In 2011, no matter if the economy turns up or down, our industry will continue to feed the addiction to good, clean data.

    Tuesday, November 16, 2010

    Ideas Having Sex: The Path to Innovation in Data Management

    I read a recent analyst report on the data quality market and "enterprise-class" data quality solutions. Per usual, the open source solutions were mentioned in passing while the data quality solutions of the past were given high marks. Some of the top-ranked solutions originated in the days when the mainframe was king. Some of the top contenders still contained cobbled-together applications from ill-conceived acquisitions. It got me thinking about the way we do business today and how so much of it is changing.

    Back in the 1990’s or earlier, if you had an idea for a new product, you’d work with an internal team of engineers and build the individual parts.  This innovation took time, as you might not always have exactly the right people working on the job.  It was slow and tedious. The product was always confined by its own lineage.

    The Android phone market is a perfect example of the modern way to innovate.  Today, when you want to build something groundbreaking like an Android, you pull in expertise from all around the world. Sure, Samsung might make the CPU and video processing chips, but Primax Electronics in Taiwan might make the digital camera and Broadcomm in the US makes the touch screen, plus many others. Software vendors push the platform further with their cool apps. Innovation happens at break-neck speed because the Android is a collection of ideas that have sex and produce incredible offspring.

    Isn't that really the model of a modern company?  You have ideas getting together and making new ideas. When you have free exchange between people, there is no need to re-invent something that has already been invented. See the TED talk on this concept, in which British author Matt Ridley argues that, through history, the engine of human progress and prosperity has been "ideas having sex."

    The business model behind open source has a similar mission.  Open source simply creates better software. Everyone collaborates, not just within one company, but among an Internet-connected, worldwide community. As a result, the open source model often builds higher quality, more secure, more easily integrated software. It does so at a vastly accelerated pace and often at a lower cost.

    So why do some industry analysts ignore it? There's no denying that there are capitalist and financial reasons.  I think if an industry analyst were to actually come out and say that the open source solution is the best, it would be career suicide. The old school would shun the analyst, making him less relevant. The link between the way the industry pays and promotes analysts and vice versa seems to favor enterprise application vendors.

    Yet the open source community, along with Talend, has developed a very strong data management offering that should be considered among the top of its class. The solution leverages other cutting edge solutions. To name just a few examples:
    • if you want to scale up, you can use distributed platform technology from Hadoop, which enables the solution to work with thousands of nodes and petabytes of data
    • very strong, enterprise-class data profiling
    • matching that users can actually use and tune without having to jump between multiple applications
    • a platform that grows with your data management strategy, so that if your future is MDM, you can seamlessly move there without having to learn a new GUI
    The way we do business today has changed. Innovation can only happen when ideas have sex, as Matt Ridley puts it. As long as we’re engaged in exchange and specialization, we will achieve those new levels of innovation.

    Friday, July 30, 2010

    Deterministic and Probabilistic Matching White Paper

    I’ve been busy this summer working on a white paper on record matching, the result of which is available on the Talend web site here.

    The white paper is sort of a primer containing the elementary principles of record matching.  As the description says, it outlines the basic theories and strategies of record matching. It describes the nuances of deterministic and probabilistic matching and the algorithms used to identify relationships within records. It covers the processes to employ in conjunction with matching technology to transform raw data into powerful information that drives success in enterprise applications like CRM, data warehouse and ERP.
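
    The paper covers the real algorithms; purely as a rough sketch of the distinction (using difflib from the Python standard library as an illustrative stand-in, not anything the paper prescribes), deterministic matching demands exact agreement on chosen fields, while probabilistic matching scores similarity against a threshold:

        from difflib import SequenceMatcher

        rec_a = {"name": "W. W. Grainger", "zip": "60045"}
        rec_b = {"name": "WW Grainger", "zip": "60045"}

        # Deterministic: the chosen key fields must agree exactly.
        deterministic = rec_a["name"] == rec_b["name"] and rec_a["zip"] == rec_b["zip"]

        # Probabilistic: weight field similarities and compare to a threshold.
        name_score = SequenceMatcher(None, rec_a["name"].lower(), rec_b["name"].lower()).ratio()
        zip_score = 1.0 if rec_a["zip"] == rec_b["zip"] else 0.0
        probabilistic = (0.7 * name_score + 0.3 * zip_score) > 0.8

        print("deterministic match:", deterministic)    # False
        print("probabilistic match:", probabilistic)    # True for this pair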

    Wednesday, July 28, 2010

    DGDQI Viewer Mail

    From time to time, people read my blog or book and contact me to chat about data governance and data quality. I welcome it. It’s great to talk to people in the industry and hear their concerns.

    Occasionally, I see things in my in-box that bother me, though.  Here is one item that I’ll address in a post. The names have been changed to protect the innocent.

    A public relations firm asked:

    Hi Steve,
    I wonder if you could answer these questions for me.
    - What are the key business drivers for the advent of data governance software solutions?
    - What industries can best take advantage of data governance software solutions?
    - Do you see cloud computing-based data governance solutions developing?

    I couldn't answer these questions, because they all pre-supposed that data governance is a software solution.  It made me wonder if I have made myself clear enough on the fact that data governance is mostly about changing the hearts and minds of your colleagues to re-think their opinion of data and its importance.  Data governance is a company's mindful decision that information is important and that it is going to start leveraging it. Yes, technology can help, but a complete data governance software solution would have more features than a Workchamp XL Swiss Army Knife. It would have to include data profiling, data quality, data integration, business process management, master data management, wikis, a messaging platform, a toothpick and a nail file in order to be complete.

    Can you put all this on the cloud?  Yes.  Can you put the hearts and minds of your company on a cloud?  If only it were that easy...

    Wednesday, July 21, 2010

    Lemonade Stand Data Quality


    My children expressed interest in opening up a lemonade stand this weekend. I’m not sure if it’s done worldwide, but here in America every kid between the age of five and twelve tries their hand at earning extra money during the summer months. Most parents in America indulge this because the whole point of a lemonade stand is really to learn about capitalism. You figure out your costs, how much the lemonade, ice and cups cost, then you charge a little more than what it costs you. At the end of the day, you can hope to show a little profit.

    I couldn’t help but think there are lessons we can learn from the lemonade stand that apply to the way we manage our own data quality initiatives.  Data governance programs and data quality projects are still driven by capitalism and lemonade stand fundamentals.

    • Concept – While the lemonade stand requires your audience to have a clear understanding of the product and the price, so does data quality.  In the data world, profiling can help you create an accurate assessment of it and tell the world exactly what it is and how much it’s going to cost.
    • Marketing – My kids proved that more people will come to your lemonade stand if you shout out “Ice Cold Lemonade” and put a few flyers around the neighborhood. Likewise you need to tell management, business people and anyone who will listen about data quality – it’s ice cold and delicious.
    • Pricing – A lemonade stand works by setting the right price. Too little and the profit will be too low; too high and no one will buy. In the data quality world, setting the scope with the proper amount of spend and the right amount of return on investment will make the project successful.
    • Location – While a busy street and a hot day make a profitable lemonade stand, data quality project managers know that you begin by picking the projects with the least effort and highest potential ROI. In turn, you get to open more lemonade stands and build your data quality projects into a data governance program.

    When it comes down to it, data quality projects are a form of capitalism; you need to sell the customers a refreshing glass and keep them coming back for more.

    Friday, April 2, 2010

    Donating the Data Quality Asset

    If you believe like I do that proper data management can change the world, then you have to start wondering if it’s time for all us data quality professionals to stand up and start changing it.

    It's clear that every organization, no matter what the size or influence, can benefit from properly managing its data. Even charitable organizations can benefit from a cleaner customer list to get the word out when they need donations.  Non-profits that handle charitable goods can benefit from better data in their inventory management.  If food banks had a better way of managing data and soliciting volunteers, wouldn't more people be fed? If churches kept better records of their members, would their positive influence be more widespread?  If organizations that accept goods in donation kept a better inventory system, wouldn't more people benefit? The data asset is not limited to Fortune 1000 companies, but until recently, solutions to manage data properly were only available to the elite.

    Open source is coming on strong and makes it easier for us to donate data quality. In the past, it may have been a challenge to get mega-vendors to donate high-end solutions, but we can make significant progress on the data quality problem with little or no software cost these days. Solutions like Talend Open Profiler, Talend Open Studio, Pentaho and DataCleaner offer data integration and data profiling.

    In my last post, I discussed the reference data that is now available for download.  Reference data used to be proprietary and costly. It’s a new world – a better one for low-cost data management solutions.

    Can we save the world through data quality?  If we can help good people spread more goodness, then we can. Let’s give it a try.

    Tuesday, February 16, 2010

    The Secret Ingredient in Major IT Initiatives

    One of my first jobs was that of assistant cook at a summer camp.  (In this case, the term ‘cook’ was loosely applied meaning to scrub pots and pans for the head cook.) It was there I learned that most cooks have ingredients that they tend to use more often.  The cook at Camp Marlin tended to use honey where applicable.  Food TV star Emeril likes to use garlic and pork fat.  Some cooks add a little hot pepper to their chocolate recipes – it is said to bring out the flavor of the chocolate.  Definitely a secret ingredient.
    For head chefs taking on major IT initiatives, the secret ingredient is always data quality technology. Attention to data quality doesn't make the recipe of an IT initiative on its own so much as it makes the whole initiative better.  Let's take a look at how this happens.

    Profiling
    No matter what the project, data profiling provides a complete understanding of the data before the project team attempts to migrate it. This can help the project team create a more accurate plan for integration.  On the other hand, it is ill-advised to migrate data to your new solution as-is, as it can lead to major cost over-runs and project delays as you have to load and reload it.

    Customer Relationship Management (CRM)
    By using data quality technology in CRM, the organization will benefit from a cleaner customer list with fewer duplicate records. Data quality technology can work as a real-time process, limiting the amount of typos and duplicates in the system, thus leading to improved call center efficiency.  Data profiling can also help an organization understand and monitor the quality of a purchased list before integration, avoiding issues with third-party data.

    Enterprise Resource Planning (ERP) and Supply Chain Management (SCM)

    If data is accurate, you will have a more complete picture of the supply chain. Data quality technology can be used to more accurately report inventory levels, lowering inventory costs. When you make it part of your ERP project, you may also be able to improve bargaining power with suppliers by gaining improved intelligence about their corporate buying power. 

    Data Warehouse and Business  Intelligence
    Data quality helps disparate data sources act as one when migrated to a data warehouse. Data quality makes the data warehouse possible by standardizing disparate data. You will be able to generate more accurate reports when trying to understand sales patterns, revenue, customer demographics and more.

    Master Data Management (MDM)
    Data quality is a key component of master data management. An integral part of making applications communicate and share data is to have standardized data.  MDM enhances the basic premise of data quality with additional features like persistent keys, a graphical user interface to mitigate matching, the ability to publish and subscribe to enterprise applications, and more.

    So keep in mind, when you decide to improve data quality, it is often because of your need to make a major IT initiative even stronger.  In most projects, data quality is the secret ingredient to make your IT projects extraordinary.  Share the recipe.

    Thursday, January 21, 2010

    ETL, Data Quality and MDM for Mid-sized Business


    Is data quality a luxury that only large companies should be able to afford?  Of course the answer is no. Your company should be paying attention to data quality no matter if you are a Fortune 1000 or a startup. Like a toothache, poor data quality will never get better on its own.

    As a company naturally grows, the effects of poor data quality multiply.  When a small company expands, it naturally develops new IT systems. Mergers often bring in new IT systems, too. The impact of poor data quality slowly invades and hinders the company’s ability to service customers, keep the supply chain efficient and understand its own business. Paying attention to data quality early and often is a winning strategy for even the small and medium-sized enterprise (SME).

    However, SMEs have challenges with the investment needed in enterprise-level software. While it's true that the benefit often outweighs the costs, it is difficult for the typical SME to invest in the license, maintenance and services needed to implement a major data integration, data quality or MDM solution.

    At the beginning of this year, I started with a new employer, Talend. I became interested in them because they were offering something completely different in our world – open source data integration, data quality and MDM.  If you go to the Talend Web site, you can download some amazing free software, like:
    • a fully functional, very cool data integration package (ETL) called Talend Open Studio
    • a data profiling tool, called Talend Open Profiler, providing charts and graphs and some very useful analytics on your data
    The two packages sit on top of a database, typically MySQL – also an open source success.

    For these solutions, Talend uses a business model similar to what my friend Jim Harris has just blogged about – Freemium. Under this new model, free open source content is made available to everyone—providing the opportunity to “up-sell” premium content to a percentage of the audience. Talend works like this.  You can enhance your experience from Talend Open Studio by purchasing Talend Integration Suite (in various flavors).  You can take your data quality initiative to the next level by upgrading Talend Open Profiler to Talend Data Quality.

    If you want to take the combined data integration and data quality to an even higher level, Talend just announced a complete Master Data Management (MDM) solution, which you can use in a more enterprise-wide approach to data governance. There’s a very inexpensive place to start and an evolutionary path your company can take as it matures its data management strategy.

    The solutions have been made possible by the combined efforts of the open source community and Talend, the corporation. If you'd like, you can take a peek at some source code, use the basic software and try your hand at coding an enhancement. Sharing that enhancement with the community will only lead to a world full of better data, and that's a very good thing.

    Thursday, October 22, 2009

    Book Review: Data Modeling for Business


    A couple of weeks ago, I book-swapped with author Donna Burbank. She has a new book entitled Data Modeling for Business. Donna, an experienced consultant by trade, has teamed up with Steve Hoberman, a previously published author and technologist, and Chris Bradley, also a consultant, for an excellent exploration of the process of creating a data model. With a subtitle like "A handbook for Aligning the Business with IT using a High-Level Data Model," I knew I was going to find some value in the swap.

    The book describes in plain English the proper way to create a data model, but that simple description doesn't do it justice. The book is designed for those who are learning from scratch – those who only vaguely understand what a data model is. It uses commonly understood, everyday concepts to explain data modeling. The book describes the impact of the data model on the project's success and digs into setting up data definitions and the levels of detail necessary for them to be effective. All of this is accomplished in a very plain-talk, straight-forward tone without the pretentiousness you sometimes get in books about data modeling.

    We often talk about the need for business and IT to work together to build a data governance initiative. But many, including myself, have pointed to the communication gap that can exist in a cross-functional team. In order to bridge the gap, a couple of things need to happen. First, IT teams need to expand their knowledge of business processes, budgets and corporate politics. Second, business team members need to expand their knowledge of metadata and data modeling. This book provides an insightful education for the latter. In my book, the Data Governance Imperative, the goal was the former.

    The book is well-written and complete. It's a perfect companion for those who are trying to build a knowledgeable, cross-functional team for data warehouse, MDM or data governance projects. Therefore, I've added it to my recommended reading list on my blog.

    Monday, October 12, 2009

    Data May Require Unique Data Quality Processes


    A few things in life have the same appearance, but the details can vary widely.  For example, planets and stars look the same in the night sky, but traveling to them and surviving once you get there are two completely different problems. It’s only when you get close to your destination that you can see the difference.

    All data quality projects can appear the same from afar but ultimately can be as different as stars and planets. One of the biggest ways they vary is in the data itself and whether it is chiefly made up of name and address data or some other type of data.

    Name and Address Data
    A customer database or CRM system contains data that we know much about. We know that letters will be transposed, names will be comma reversed, postal codes will be missing and more.  There are millions of things that good data quality tools know about broken name and address data since so many name and address records have been processed over the years. Over time, business rules and processes are fine-tuned for name and address data.  Methods of matching up names and addresses become more and more powerful.

    Data quality solutions also understand what names and addresses are supposed to look like since the postal authorities provide them with correct formatting. If you're somewhat precise about following the rules of the postal authorities, most mail makes it to its destination.  If you're very precise, the postal services can offer discounts. The rules are clear in most parts of the civilized world. Everyone follows the same rules for name and address data because it makes for better efficiency.

    So, if you know what the broken item looks like and you know what the fixed item is supposed to look like, you can design and develop processes that involve trained, knowledgeable workers and automated solutions to solve real business problems. There's knowledge inherent in the system, and you don't have to start from scratch every time you want to cleanse it.
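
    A toy illustration of that inherent knowledge (the abbreviation table below is a tiny assumption standing in for the postal authorities' full reference data) might standardize a street line like this:

        # Minimal street-suffix reference, standing in for full postal reference data.
        suffixes = {"st": "Street", "st.": "Street", "ave": "Avenue", "rd": "Road"}

        def standardize_address_line(line: str) -> str:
            """Expand common suffix abbreviations and normalize capitalization."""
            words = [suffixes.get(w.lower(), w) for w in line.strip().split()]
            return " ".join(w.capitalize() for w in words)

        print(standardize_address_line("123 main st."))   # -> 123 Main Street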

    ERP, Supply Chain Data
    However, when we take a look at other types of data domains, the picture is very different.  There isn't a clear body of knowledge about what is typically input and what the output should look like, so you must set up processes to build that knowledge. In supply chain data or ERP data, we can't immediately see why the data is broken or what we need to do to fix it.  ERP data is likely to be sort of a history lesson of your company's origins, the acquisitions that were made, and the partnership changes throughout the years. We don't immediately have an idea about how the data should ultimately look. The data that exists in this world is specific to one client or a single use scenario, and cannot be handled by existing out-of-the-box rules.

    With this type of data you may find the need to collaborate more with the business users of the data, whose expertise in determining the correct context for the information enables you to effect change more rapidly. Because of the inherent unknowns about the data, few of the steps for fixing the data are done for you ahead of time. It then becomes critical to establish a methodology for:
    • Data profiling, in order to understand the issues and challenges in the data.
    • Discussions with the users of the data to understand context, how it’s used and the most desired representation.  Since there are few governing bodies for ERP and supply chain data, the corporation and its partners must often come up with an agreed-upon standard.
    • Setting up business rules, usually from scratch, to transform the data (a minimal sketch follows this list)
    • Testing the data in the new systems
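
    As a small, hypothetical example of such a from-scratch business rule (the legacy unit-of-measure codes and their mapping are invented for illustration), a transformation agreed with the business users might look like this:

        # Hypothetical rule agreed with the business: legacy units of measure
        # from an acquired system map onto the current ERP standard.
        legacy_uom_map = {"CS": "CASE", "EA": "EACH", "BX": "BOX"}

        def transform_supply_record(record: dict) -> dict:
            """Apply the agreed business rule to one legacy supply-chain record."""
            cleaned = dict(record)
            cleaned["uom"] = legacy_uom_map.get(str(record.get("uom", "")).upper(), "UNKNOWN")
            return cleaned

        print(transform_supply_record({"part": "A-100", "uom": "cs"}))   # uom becomes "CASE"
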
    I write about this because I've read so much about this topic lately. As practitioners, you should be aware that the problem is not the same across all domains. While you can generally solve name and address data problems with a technology focus, you will often rely more on collaboration with subject matter experts to solve issues in other data domains.

    Tuesday, July 21, 2009

    Data Quality – Technology’s Prune

    Prunes. When most of us think of prunes, we tend to think of a cure for older people suffering from constipation. In reality, prunes are not only sweet but also highly nutritious. Prunes are a good source of potassium and a good source of dietary fiber. Prunes suffer from a stigma that's just not there for dried apricots, figs and raisins, which have similar nutritional and medicinal benefits. Prunes suffer from bad marketing.

    I have no doubt that data quality is considered technology’s prune by some. We know that information quality is good for us, having many benefits to the corporation. It also can be quite tasty in its ability to deliver benefit, yet most of our corporations think of it as a cure for business intelligence constipation – something we need to “take” to cure the ills of the corporation. Like the lowly prune, data quality also suffers from bad marketing.

    In recent years, prune marketers in the United States have begun marketing their product as "dried plums” in an attempt to get us to change the way we think about them. Commercials show the younger, soccer Mom crowd eating the fruit and being surprised at its delicious flavor. It may take some time for us to change our minds about prunes. I suppose if Lady Gaga or Zac Efron would be spokespersons, prunes might have a better chance.

    The biggest problem in making data quality beloved by the business world is that it’s well… hard to explain. When we talk about it, we get crazy with metadata models and profiling metrics. It’s great when we’re communicating among data professionals, but that talk tends to plug-up business users.

    In my recent presentations and in recent blog posts, I’ve made it clear that it’s up to us, the data quality champions, to market data quality, not as a BI laxative, but as a real business initiative with real benefits. For example:

    • Take a baseline measurement and track ROI, even if you think you don’t have to
    • If the project has no ROI, you should not be doing it. Find the ROI by asking the business users of the data what they use it for.
    • Aggregate and roll up our geeky metrics of nulls, accuracy, conformity, etc. into metrics that a business user would understand – for example, "according to our evaluation, 86.4% of our customers are fully reachable by mail" (a minimal sketch follows this list).
    • Create and use aggregated scores similar to the Dow Jones Industrial Average. Publish them at regular intervals. To raise awareness of data quality, talk about why the score is up and why it has gone down.
    • Have a business-focused elevator pitch ready when someone asks you what you do. “My team is saving the company millions by ensuring that the ERP system accurately reflects inventory levels.”
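
    As a back-of-the-envelope sketch of that roll-up (the field checks and the "reachable by mail" definition are assumptions you would agree on with the business), the aggregated score can be computed from low-level completeness checks like this:

        customers = [
            {"name": "Jane Doe", "street": "123 Main Street", "zip": "60045"},
            {"name": "John Roe", "street": "", "zip": "60046"},
        ]

        def reachable_by_mail(c: dict) -> bool:
            """Count a customer as reachable only if name, street and ZIP are populated."""
            return all(c.get(field, "").strip() for field in ("name", "street", "zip"))

        score = 100.0 * sum(reachable_by_mail(c) for c in customers) / len(customers)
        print(f"{score:.1f}% of our customers are fully reachable by mail")   # 50.0%
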
    Of course there's more. You'll find more in my previous blog posts, in posts yet to come, and in my book The Data Governance Imperative. Marketing the value of data quality is just something we all need to do more of. Not selling the business importance of data quality... it's just plum-crazy!

    Disclaimer: The opinions expressed here are my own and don't necessarily reflect the opinion of my employer. The material written here is copyright (c) 2010 by Steve Sarsfield. To request permission to reuse, please e-mail me.