Monday, May 19, 2008

Unusual Data Quality Problems

When I talk to folks who are struggling with data quality issues, there are some who are worried that they have data unlike any data anyone has ever seen. Often there’s a nervous laugh in the voice as if the data is so unusual and so poor that an automated solution can’t possibly help.

Yes, there are wide variations in data quality and consistency and it might be unlike any we’ve seen. On the other hand, we’ve seen a lot of unusual data over the years. For example:

  • A major motorcycle manufacturer used data quality tools to pull out nicknames from their customer records. Many of the names they had acquired for their prospect list were from motorcycle events and contests where the entries were, shall we say, colorful. The name fields contained data like “John the Mad Dog Smith” or “Frank Motor-head Jones”. The client used the tool to separate the name from the nickname, making it a more valuable marketing list.
  • One major utility company used our data quality tools to identify and record notations on meter-reader records that were important to keep for operational uses, but not in the customer billing record. Upon analysis of the data, the company noticed random text like “LDIY” and “MOR” in the customer records. After some investigation, they figured out that LDIY meant “Large Dog in Yard,” which was particularly important for meter readers. MOR meant “Meter on Right,” which was also valuable. The readers were given their own notes field, so that they could maintain the integrity of the name and address while also keeping this valuable data. It probably saved a lot of meter readers from dog bites.
  • Banks have used our data quality tools to separate items like "John and Judy Smith/221453789 ITF George Smith". The organization wanted to treat this type of record as three separate records, "John Smith", "Judy Smith" and "George Smith", with obvious linkage between the individuals. This type of data is actually quite common in mainframe migrations.
  • A food manufacturer standardizes and cleanses ingredient names to get better control of manufacturing costs. In data from their worldwide manufacturing plants, an ingredient might be “carrots”, “chopped frozen carrots”, “frozen carrots, chopped”, “chopped carrots, frozen” and so on. (Not to mention all the possible abbreviations for the words carrots, chopped and frozen.) Without standardization of these ingredients, there was really no way to tell how many carrots the company purchased worldwide. There was no bargaining leverage with the carrot supplier, or with any other ingredient supplier, until the data was fixed.
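The ingredient example can be sketched in a few lines. This is a hypothetical illustration, not any vendor's actual method: the abbreviation map and function name are my own. The idea is to normalize case and punctuation, expand abbreviations, and sort the words so that word order no longer matters:

```python
# Minimal sketch of ingredient standardization. The abbreviation map is
# hand-built and purely illustrative; real tools learn rules from the data.
ABBREVIATIONS = {"frz": "frozen", "chpd": "chopped", "crrts": "carrots"}

def standardize(ingredient: str) -> str:
    words = ingredient.lower().replace(",", " ").split()
    words = [ABBREVIATIONS.get(w, w) for w in words]
    return " ".join(sorted(words))  # sorting makes word order irrelevant

variants = ["carrots, chopped frozen", "frozen carrots, chopped",
            "Chopped Frozen Carrots", "chpd frz crrts"]
print({standardize(v) for v in variants})  # all collapse to one key
```

With every variant mapped to a single canonical key, counting worldwide carrot purchases becomes a simple group-by.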

Not all data quality solutions can handle these types of anomalies. Many will pass these "odd" values through without attempting to cleanse them. It’s key to have a system that will learn from your data and allow you to develop business rules that meet the organization’s needs.

Now there are times, quite frankly, when data gets so bad that automated tools can do nothing about it, but that’s where data profiling comes in. Before you attempt to cleanse or migrate data, you should profile it to gain a complete understanding of it. This lets you weigh the cost of fixing very poor data against the value it will bring to the organization.
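As a rough illustration of what profiling surfaces, here is a minimal sketch of my own (not any particular profiling product) that reports a column's null rate, cardinality, and value patterns:

```python
from collections import Counter
import re

def profile_column(values):
    """Basic column profile: null rate, distinct count, and value patterns."""
    total = len(values)
    nulls = sum(1 for v in values if v is None or str(v).strip() == "")
    # Reduce each value to a shape pattern: digits become 9, letters become A
    patterns = Counter(
        re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", str(v)))
        for v in values if v
    )
    return {"null_pct": round(100 * nulls / total, 1),
            "distinct": len(set(values)),
            "top_patterns": patterns.most_common(3)}

# A hypothetical postal-code column with some junk mixed in
print(profile_column(["02139", "90210-1234", "", "ABC12", None]))
```

Even this crude pattern analysis makes anomalies like "ABC12" jump out immediately, which is the kind of evidence you need to estimate cleansing costs.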

Wednesday, May 14, 2008

The Best Books on Data Governance

Is there a comprehensive book on data governance that we should all read to achieve success? At the time of this post, I'm not sure there is. I haven't seen it yet. If you think about it, such a book would make War and Peace look like a Harlequin novel in terms of size, in order to cover all aspects of the topic. Instead, we really must become students of data governance and begin to understand large knowledge areas such as 1) how to optimize and manage processes; 2) how to manage teams and projects; 3) public relations and marketing for internal project promotion; and 4) how to implement technologies to achieve data governance, just to name a few.

I’ve recently added an Amazon widget to my blog that lists some printed books on data governance-related topics. The books cover the four areas I’ve mentioned. As summer vacation arrives, now is the time to buy your books for the beach and read up! After all, what could be more relaxing on a July afternoon than a big frozen margarita and the book “Business Process Improvement: The Breakthrough Strategy for Total Quality, Productivity, and Competitiveness” by James Harrington?

The Amazon affiliate program generates just a few pennies for each book, but what money it does generate will be donated to charity. The appeal of the Amazon widget is that it's a good way to store a list of books and provide direct links to buy. If you have some suggestions to add to the list, please share them.

EDIT: My book on data governance is now available on Amazon. The Data Governance Imperative.

Sunday, May 4, 2008

Data Governance Structure and Organization Webinar

My colleague Jim Orr just did a great job delivering a webinar on data governance. You can see a replay of the webinar in case you missed it. Jim is our Data Quality Practice Leader and he has a very positive point of view when it comes to developing a successful data governance strategy.
In this webinar, Jim talks exclusively about the structure and the organization behind data governance. If you believe that data governance is people, process and technology, this webinar covers the "people" side of the equation.

Sunday, April 27, 2008

The Solution Maturity Cycle


I saw the news about Informatica’s acquisition of Identity Systems, and it got me thinking. I recognize a familiar pattern that all too often occurs in the enterprise software business. I’m going to call it the Solution Maturity Cycle. It goes something like this:

1. The Emergence Phase: A young, fledgling company emerges that provides an excellent product that fills a need in the industry. This was Informatica in the 90’s. Rather than hand coding a system of metadata management, companies could use a cool graphical user interface to get the job done. Customers were happy. Informatica became a success. Life was good.

2. The Mashup Phase: Customers begin to realize that if they mash up the features of say, an ETL tool and a data quality tool, they can reap huge benefits for their companies. Eventually, the companies see the benefit of working together, and even begin to talk to prospective customers together. This was Informatica in 2003-5, working with FirstLogic and Trillium Software. Customers could decide which solution to use. Customers were happy that they could mashup, and happy that others had found success in doing so.

3. The Market Consolidation Phase: Under pressure from stockholders to increase revenue, the company looks to buy a solution in order to sell it in-house. The pressure also comes from industry analysts, who if they’re doing their job properly, interpret the mashup as a hole in the product. Unfortunately, the established and proven technology companies are too expensive to buy, so the company looks to a young, fledgling data quality company. The decision on which company to buy is more influenced by bean counters than technologists. Even if there are limitations on the fledgling’s technology, the sales force pushes hard to eliminate mashup implementations, so that annual maintenance revenue will be recognized. This is what happened with Informatica and Similarity Systems in my opinion. Early adopters are confused by this and fearful that their mashup might not be supported. Some customers fight to keep their mashups, some yield to the pressure and install the new solution.

4. Buy and Grow Phase: When bean counters select technology to support the solution, they usually get some product synergies wrong. Sure, the acquisition works from a revenue-generating perspective, but from the technology solution perspective, it is limited. The customers are at the same time under pressure from the mega-vendors, who want to own the whole enterprise. What to do? Buy more technology. It’ll fill the holes, keep the mega-vendor wolves at bay, and build more revenue.

The Solution Maturity Cycle is something that we all must pay attention to when dealing with vendors. For example, I’m seeing phase 3 of this cycle occur in the SAP world, where SAP’s acquisition of Business Objects dropped several data quality solutions in SAP’s lap. Now despite the many successful mashups of Trillium Software and SAP, customers are being shown other solutions from the acquisition. History makes me question whether an ERP vendor will be committed to the data quality market for the long term.

After a merger occurs, a critical decision point comes to customers. Should you resist pulling out mashups, or should you try to unify the solution under one vendor? It's a tough decision. The decision may affect internal IT teams, causing conflict between those who have been working on the mashup and the mega-vendor team. In making this decision, there are a couple of key questions to ask:

  • Is the newly acquired technology in the vendor’s core competency?
  • Is the vendor committed to interoperability with other enterprise applications, or just their own? How will this affect your efforts for an enterprise-wide data governance program?
  • Is the vendor committed to continual improvement of this part of the solution?
  • How big is the development team and how many people has the vendor hired from the purchased company? (Take names.)
  • Can the vendor prove that taking out a successful solution to put in a new one will make you more successful?
  • Are there any competing solutions within the vendor’s own company, poised to become the standard?
  • Who has been successful with this solution, and do they have the same challenges that I have?
As customers of enterprise applications, we should be aware of history and the Solution Maturity Cycle.

Wednesday, April 9, 2008

Must-read Analyst Reports on Data Governance

If you’re thinking of implementing a data governance strategy at your company, here are some key analyst reports I believe are a must-read.

Data Governance: What Works And What Doesn't
by Rob Karel, Forrester
A high-level overview of data governance strategies. It’s a great report to hand to a C-level executive in your company who may need some nudging.

Data Governance Strategies
by Philip Russom and TDWI
A comprehensive overview of data governance, including extensive research and case studies. This one is hot off the presses from TDWI. Sponsored by many of the top information quality vendors.

The Forrester Wave™: Information Quality Software by J. Paul Kirby, Forrester
This report covers the strengths and weaknesses of top information quality software vendors. Many of the vendors covered here have been gobbled up by other companies, but the report is still worth a read. $$

Best Practices for Data Stewardship
Magic Quadrant for Data Quality Tools

by Ted Friedman, Gartner
I have included the names of two of Ted’s reports on this list, but Ted offers much insight in many forms. He has written and spoken often on the topic. (When you get to the Gartner web site, you're going to have to search on the above terms as Gartner makes it difficult to link directly.) $$
Ed Note: The latest quadrant (2008) is now available here.

The case for a data quality platform
Philip Howard, Bloor Research
Andy Hayler and Philip Howard are prolific writers on information quality at Bloor Research. They bring an international flair to the subject that you won’t find elsewhere.

Sunday, April 6, 2008

Politics, Presidents and Data Governance

I was curious about the presidential candidates and their plans to build national ID cards and a database of citizens, so I set out to do some research on the candidates’ stances on this issue. It strikes me as a particularly difficult task, given the size and complexity of the database that would be needed. Just how realistic would the candidates’ data governance strategies be?

I searched the candidates’ web sites with the following Google queries:
database site:http://www.johnmccain.com
database site:http://www.barackobama.com
database site:http://www.hillaryclinton.com

Hardly scientific, but interesting results nonetheless. The candidates have very different data management plans for the country. This simple search gave some insight into the candidates’ data management priorities.

Clinton:
Focused on national health care and the accompanying data challenges.
• Patient Health Care Records Database
• Health Care Provider Performance Tracking Database
• Employer History of Complaints
Comments: It’s clear that starting a national database of doctors and patients is a step toward a national health plan. There are huge challenges with doctor data, however. Many doctors work in multiple locations, having a practice at a major medical center and a private practice, for example. Consolidating doctor lists from insurance companies would rely heavily on unique health care provider ID numbers, doctor age and sex, and factors other than name and address for information quality. This is an ambitious plan, particularly given data compliance regulations, but necessary for a national health plan.

Obama:
Not much about actual database plans, but Obama has commented in favor of:
• Lobbyist Activity Database
• National Sex Offender Database
Comments: Many states currently monitor sex offenders, so the challenge would be coordinating a process and managing the metadata from the states. Not a simple task to say the least. I suspect none of the candidates are really serious about this, but it’s a strong talk-track. Ultimately, this may be better left to the states to manage.
As for the lobbyist activity database, I frankly can’t see how it would work. Would lobbyists complete online forms describing their activities with politicians? If lobbyists have to describe their interactions with a politician, would they be given an open slate in which to scribble some notes about the event/gift/dinner/meeting topics? This would likely be chock full of unstructured data, and its usefulness would be questionable in my opinion.

McCain:
• Grants and Contracts Database
• Lobbyist Activity Database
• National Sex Offender Database
Comments: Adding the grants and contracts database to McCain’s plan, I see it as similar to Obama’s plan in that it involves storing unstructured data.

To succeed in any of these plans from our major presidential candidates, I see a huge effort in the “people” and “process” components of data governance. Congress will have to enact laws that describe data models, data security, information quality, exceptions processing and much more. Clearly, this is not their area of expertise. Yet the candidates seem to be talking about technology as a magic wand to heal our country’s problems. It’s not going to be easy for any of them to make any of this a reality, even with all the government’s money.
Instead of these popular vote-grabbing initiatives, wouldn't the government be better served by a president who understands data governance? When you think about it, the US Government is making the same mistake that businesses make, growing and expanding data silos, leading to more and more inefficiencies. I can’t help thinking that what we really need is a federal information quality and metadata management agency (since the government likes acronyms, shall we call it FIQMM?) to oversee the government’s data. The agency could be empowered by the president to have access to government data, define data models, and provide people, process and technologies to improve efficiency. Imagine what efficiencies we could gain with a federal data governance strategy. Just imagine.

Thursday, March 27, 2008

Mergers and Acquisitions: Data's Influence on Company Value

Caveat Emptor! Many large companies have a growth strategy that includes mergers and acquisitions, but many are missing a key negotiating strategy during the buying process.

If you’re a big company, buying other companies in your market brings new customers into your fold. So, rather than paying for a marketing advertising campaign to get new customers, you can buy them as part of an acquisition. Because of this, most venture capitalists and business leaders know that two huge factors in determining a company’s value during an acquisition are the customer and prospect lists.

Having said that, it’s strange how little this is examined in the buy-out process. Before they buy, companies look at certain assets under a microscope - tangible assets like buildings and inventory are examined. Human assets, like the management staff, are given a strong look. Cash flow is audited and examined with due diligence. But data assets are often given only a hasty passing glance.

Data assets quickly dissolve when the company being acquired has data quality issues. It’s not uncommon for a company to have 20%, 40%, or even 50% customer duplication (or near duplicates) in their database, for example. So, if you think you’re getting 100,000 new customers, you may actually be getting 50,000 after you cleanse. It’s also common for actual inventory levels in the physical warehouse to be misaligned with the inventory levels in the ERP systems. This too may be due to data quality issues, and can lead to surprises after the acquisition.
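To get a rough feel for that duplication rate, here is a hedged sketch using only the Python standard library. Real matching engines use parsed name components, phonetics, and tuned rules; the threshold here is an arbitrary assumption of mine:

```python
from difflib import SequenceMatcher

def is_near_duplicate(a: str, b: str, threshold: float = 0.85) -> bool:
    """Crude fuzzy match on normalized name strings. The 0.85 threshold
    is illustrative; production matching rules are far richer."""
    a, b = a.lower().strip(), b.lower().strip()
    return SequenceMatcher(None, a, b).ratio() >= threshold

customers = ["John Smith", "Jon Smith", "Judith A. Smith", "J. Smith"]
# Compare every pair of records and flag likely duplicates
pairs = [(a, b) for i, a in enumerate(customers)
         for b in customers[i + 1:] if is_near_duplicate(a, b)]
print(pairs)
```

Even a quick pass like this over a prospective acquisition's customer file can tell you whether that 100,000-record list is really worth 100,000 customers.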

So what can you do as an acquiring company to mitigate these risks? The key is due diligence on data. Ask to profile the data of the company you’re going to buy. Bring in your team, or hire a third party to examine the data. Look at the customer data, the inventory data, the supply chain data or whatever data is a valuable asset in the acquisition. If privacy and security are an issue the results of the profiling can usually be rolled up into some nice charts and graphs that’ll give you a picture of the status of organizational information.

In my work with Trillium Software, I have talked to customers who have saved millions in acquisition costs by evaluating the data prior to buying a company. Some have gone so far as to evaluate the overlap between their own customer base and the new customer base to determine value. Why pay for a customer when (s)he is already on the customer list?

Profiling lets you set up business rules that are important to your company. Does each record have a valid tax ID number? What percentage of the database contact information is null? How many bogus e-mails appear? Does the data make sense, or are there a lot of near duplicates and misfielded data? In inventory data, how structured or unstructured is the data? All of these can quickly be ascertained with data profiling technology. All of these technical issues can be correlated into business value, and therefore negotiating value, for your company.
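Those business rules can be prototyped quickly. Here is a minimal sketch; the rule patterns, field names, and the assumed US EIN tax-ID format are purely illustrative, not from any specific tool:

```python
import re

# Hypothetical due-diligence rules. Patterns and thresholds are mine.
TAX_ID = re.compile(r"^\d{2}-\d{7}$")  # assumes a US EIN-style format
BOGUS_EMAIL = re.compile(r"(test|asdf|noemail|none)@", re.IGNORECASE)

def assess(records):
    """Score a customer file against a few simple business rules."""
    total = len(records)
    bad_tax = sum(1 for r in records if not TAX_ID.match(r.get("tax_id", "")))
    null_email = sum(1 for r in records if not r.get("email"))
    bogus = sum(1 for r in records
                if r.get("email") and BOGUS_EMAIL.search(r["email"]))
    return {"invalid_tax_id_pct": 100 * bad_tax / total,
            "null_email_pct": 100 * null_email / total,
            "bogus_email_pct": 100 * bogus / total}

sample = [{"tax_id": "12-3456789", "email": "jane@acme.com"},
          {"tax_id": "BAD", "email": "test@test.com"},
          {"tax_id": "98-7654321", "email": ""}]
print(assess(sample))
```

Percentages like these are exactly the kind of figures that roll up into the charts and graphs a negotiating team can act on.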

The data governance teams I have met that have done this due diligence for their companies have become real superstars, and are very much a strategic part of their corporations. It’s easy for a CEO to see the value you bring when you can prove the company is paying the right price for an acquisition.

Disclaimer: The opinions expressed here are my own and don't necessarily reflect the opinion of my employer. The material written here is copyright (c) 2010 by Steve Sarsfield. To request permission to reuse, please e-mail me.