Tuesday, December 9, 2008
2009 MIT Information Quality Industry Symposium
This time of year, we're all looking at our budgets and planning for 2009. I'd like to recommend an event that I've been participating in for the past several years – the MIT IQ symposium. It's in my travel budget, and I'm looking forward to attending again this year.
The symposium is a July event in Boston built around a discussion and exchange of ideas about data quality between practitioners and academicians. The tone is less commercial than at a typical industry symposium; this MIT event is more about the mission and philosophy of information quality.
Day one focuses on education, with highly qualified and very interesting speakers teaching you about enterprise architecture, data governance, business intelligence, data warehousing, and data quality. The topics cover the latest methodologies, frameworks, and best-practice cases. On day two, the sessions deconstruct industry-specific topics, with government, healthcare and business tracks. The last day, a half day, looks more at the future of information quality.
I've grown to really enjoy the presentations, information quality theory and hallway chat that you find here. If you have some travel budget, please consider earmarking some of it for this event.
Friday, December 5, 2008
Short Ham Rule and Data Governance
One of my old bosses, a longtime IBM VP who was trained in the traditional Big Blue executive training program, used to refer to the "short ham" rule quite often. With my apologies for its lack of political correctness, the story goes something like this:
Sarah is recently married and, for the first time, decides to cook the Easter ham for her new extended family. Her spouse's sisters, mother and grandmother are all coming to dinner, and as a new bride, she is nervous. As the family arrives, she begins preparing the ham for dinner.
Sarah's sister-in-law Debbie helps with the preparation. As Sarah begins to put the ham into the oven, Debbie stops her. "You must cut off the back half of the ham before it goes into the oven," she says.
Sarah is nervous, but somehow musters the courage to ask a simple question – why? Debbie is shaken for a moment at the nerve of her new sister-in-law. How dare she question the family tradition?
Debbie pauses then says, “Well, I’m not sure. My Mom always does it. Let’s ask her why.”
When asked, Mom also hesitates. “Well, my Mom always cut off that part of the ham. I’m not sure why.”
Finally, the group turns to Grandma, who is sitting in her rocking chair listening to the discussion. By now, the entire party has heard about the outrageous boldness of Sarah. The party turns silent as the elder slowly begins to whisper her answer. "Well, I grew up in the Depression and we didn't have a pan big enough to fit the whole ham. So, we'd cut off part of it and save it for another meal."
Three factors in the short ham story caused change. First, Sarah's courage to take on the project of cooking the ham started the change. Second, Sarah's willingness to listen to and learn the processes of others in the family gave her credibility in their eyes. Finally, it was Sarah's question – why? – that created change. It was only with audacity that Sarah was able to educate the family and make the holiday feast more enjoyable.
The same can be said about leading your company toward data governance. You have to have the courage to take on new projects, understand the business processes, and ask why to become an agent for change in your organization. A leader has to get past resistance and convince others to embrace new ways of doing things.
Building credibility is the key to overcoming the resistance. If you sit down and work for a day in the billing center, the call center or a purchasing agent's job, for example, the people there will see that you understand them and care about their processes. At the very least, you could invite a business person to lunch to understand their challenges. The hearts and minds of the people can be won if you walk a mile in their shoes.
Monday, December 1, 2008
Information Quality Success at Nectar
It's great when you see data quality programs work. Such is the case in Europe, where Loyalty Management Group (LMG) has improved efficiency and information quality in a very large, retail-based customer loyalty program. I hadn't heard much about Nectar here in the USA, but the Nectar card is very well-known in the UK. About half of all UK households use it to earn points from everyday purchases and later redeem those points for gifts and prizes. Recently, Groupe Aeroplan purchased LMG, and Nectar is now one of its brands.
Using the databases generated by Nectar, the company also provides database marketing and consulting services to retailers, service providers and consumer packaged goods companies worldwide. Data is really the company’s primary asset.
Nectar data
The data management effort needed to handle half the population of the UK and a good portion of Europe could be perilous. To make matters worse, data enters the Nectar system from many channels – paper-based forms available in stores or received through mailings, online registration, and call center enrollment. All of these sources can produce poor data if left unchecked.
To gain closer business control, the company made business management responsible for data integrity rather than IT. The company also embedded the Trillium Software System in its own systems, including in real-time for online and call center applications.
At first, LMG used just the basic capabilities of the tool to ensure that, at enrollment, addresses matched the UK Postcode Address File (PAF). Later, the company engaged a business-oriented data quality steward to review existing processes and propose new policy. For example, they set up rules in Trillium Software to verify mandatory information at the point of registration. A process is now in place to notify the data collector of missing information.
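As a rough illustration of that kind of check – a minimal sketch of my own, with invented field names, not LMG's actual rules or the Trillium implementation – a mandatory-information test at registration might look something like this:

```python
# Hypothetical sketch of a mandatory-field check at registration.
# Field names and the notification hook are invented for illustration,
# not LMG's actual rules or any Trillium Software API.
REQUIRED_FIELDS = ["first_name", "last_name", "address_line1", "postcode"]

def missing_fields(record: dict) -> list:
    """Return the mandatory fields that are absent or blank in a registration record."""
    return [f for f in REQUIRED_FIELDS if not str(record.get(f, "")).strip()]

def process_registration(record: dict, notify) -> list:
    gaps = missing_fields(record)
    if gaps:
        # Tell the collector (cardholder) what is missing so it can be fixed at the source.
        notify(record.get("collector_id"), gaps)
    return gaps

# Example: a registration with no address or postcode triggers a notification.
process_registration({"collector_id": 42, "first_name": "Ann", "last_name": "Lee", "postcode": ""},
                     lambda cid, gaps: print(cid, gaps))
```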
Information quality often lands and expands in an organization once folks see how powerful it can be. In LMG's case, the Trillium Software System was also implemented to help partners match their own customer databases with the Nectar collector database. For certain campaigns, Nectar partners might want to know which individuals appear on both their own customer database and the Nectar database. The Trillium Software System allows for this, including pre-processing the partner's data where necessary to bring it up to a standard sufficient for accurate matching.
You can download the whole story on LMG here.
Sunday, November 23, 2008
Picking the Boardwalk and Park Place DQ Projects
This weekend, I was playing a game of Monopoly with my kids. Monopoly is the ultimate game of capitalism. It's a great way to teach a young one about money. (Given its length, a single game can be a weekend-long lesson.) The companies that we work for are also playing the capitalism game. So, it's not a stretch that there are lessons to be learned while playing this game.
As I took in hefty rents from Pacific Ave, I could see that my daughter was beginning to realize that it’s really tough to win if you buy low-end properties like Baltic and Mediterranean, or any of the properties on that side of the board. Even with hotels, Baltic will only get you $450. It’s only with the yellow, green and blue properties that you can really make an impression on your fellow players. She got excited by finally getting a hold of Boardwalk and Park Place.
Likewise, it's difficult to win at the data governance game if you pick projects that have limited upside. The tendency might be to fix the data of the business users who are complaining the most or those that the CEO tells you to fix. The key is to keep capitalism and the game of Monopoly in mind when you pick projects.
When you begin picking high-value targets with huge upside potential, you'll begin to win at the data governance game. People will stand up and notice when you begin to bring in the high-end returns that Boardwalk and Park Place can bring in. You'll get better traction in the organization. You'll be able to expand your domain across Ventnor and St. James Place, gathering up other clean-data monopolies.
This is the tactic that I've seen so many successful data governance initiatives take at Trillium Software. The most successful project managers are also good marketers, promoting their success inside the company. And if no one will listen inside the company, they promote it to trade journals, analysts and industry awards. There's nothing like a little press to make the company look up and notice.
So take the $200 you get from passing GO and focus on high value, high impact projects. When you land on Baltic, pass it by, at least at first. By focusing on the high impact data properties, you’ll get a better payoff in the end.
To hear a few more tips, I recommend the webinar by my friend Jim Orr at Trillium Software. You can listen to his webinar here.
Wednesday, November 19, 2008
What is DIG?
In case you haven't heard, financial services companies are in crunch time right now. Some say the current stormy conditions are unprecedented. Some say it's a rocky time, but certainly manageable. Either way, financial services companies have to be smarter than ever in managing risk.
That’s what DIG is all about, helping financial services companies manage risk from their data. It's a new solution set from Trillium Software.
In Europe, BASEL II is standard operating procedure at many financial services companies, and the US is starting to come on board. BASEL II is complex, but it includes mandates for increased transparency of key performance indicators, such as probability of default (PD) and loss given default (LGD), to better determine exposure at default (EAD). Strict rules on capital risk reserve provisions penalize those institutions highly exposed to risk and those unable to provide 'provably correct' analysis of their risk position.
Clearly, the lack of risk calculations had something to do with the situation that banks are in today. Consider all the data that it takes to make a risk compliance calculation: customer credit quality measurements, agency debt ratings, accounts receivables, and current market exposures. When this type of data is spread out over multiple systems, it introduces risk that can shake the financial world.
To comply with BASEL II, financial services companies and those who issue credit have to be smarter than ever in managing data. Data drives decision-making and risk calculation models. For example, let's say you're a bank and you're calculating the risk of your debtors. You enrich your data with Standard & Poor's ratings to understand the risk. But if the data is non-standardized, you may have a hard time matching the Standard & Poor's data to your customers. If no match is found, a company with an AA- bond rating might default to BB- in the database. After all, it is prudent to be conservative if you don't know the risk. But that error can cause thousands, even millions, to be set aside unnecessarily. These additional capital reserves can be a major drag on the company.
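To make the cost of a missed match concrete, here is a deliberately simplified sketch – my own illustration with made-up reserve percentages, not a real BASEL II risk-weight calculation – of how a failed lookup against the ratings data forces a conservative assumption and inflates the capital set aside:

```python
# Illustrative only: the reserve rates are invented, not actual Basel II risk weights.
RESERVE_RATE = {"AA-": 0.02, "BB-": 0.08}   # fraction of exposure held in reserve

# Enrichment data keyed by a standardized company name.
RATINGS = {"ACME MANUFACTURING INC": "AA-"}

def required_reserve(customer_name: str, exposure: float) -> float:
    # If the customer name fails to match the enrichment data,
    # fall back to a conservative (worse) rating.
    rating = RATINGS.get(customer_name.strip().upper(), "BB-")
    return exposure * RESERVE_RATE[rating]

# A matched name reserves 2% of a 10M exposure; the same customer keyed as
# "Acme Mfg." misses the lookup and reserves 8% -- an extra 600,000 set aside.
print(required_reserve("Acme Manufacturing Inc", 10_000_000))   # 200000.0
print(required_reserve("Acme Mfg.", 10_000_000))                # 800000.0
```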
With the Data Intelligence and Governance (DIG) announcement from Trillium Software, we're starting to leverage our enterprise technology platform to fix the risk rating process and become proactive participants in the validation, measurement, and management of all data fed into risk models. The key is to establish a framework for the context of data and best practices for enterprise governance. When we leverage our software and services to work on key data attributes and set up rules to ensure the accuracy of data, it can save financial services companies a great deal of money.
To support DIG, we’ve brought on board some additional financial services expertise. We’ve revamped our professional services and are working closely with some of our partners on the DIG initiative. We’ve also been updating our software, like our data quality dashboard, TS Insight, to help meet financial services challenges. For more information, see the DIG page on the Trillium Software web site.
Wednesday, November 12, 2008
The Data Governance Insider - Year in Review
Today is the one-year anniversary of this blog. We've covered some interesting ground this year. It's great to look back and see if the thoughts I had in my 48 blog entries made any sense at all. For the most part, I'm proud of what I said this year.
Probably the most controversial entries this year were the ones on probabilistic matching, where I pointed out some of the shortcomings of the probabilistic approach to matching data. Some people read and agreed. Others voiced their dissent.
Visitors seemed to like the entry on approaching data-intensive projects with data quality in mind. This is a popular white paper on Trilliumsoftware.com, too. We'll have to do more of those nuts-and-bolts articles in the year ahead.
As a data guy, I like reviewing the stats from Google Analytics. In terms of traffic, it was very slow going at first, but as traffic started to build, we were able to eke out 3,506 visits, 2,327 of them unique. That means that either someone came back 1,179 times or 1,179 people came back… or some combination of the two. Maybe my mother just loves reading my stuff.
The visitors came from the places you'd expect. The top ten were United States, United Kingdom, Canada, Australia, India, Germany, France, Netherlands, Belgium, and Israel. We had a few visitors from unexpected places - one visitor from Kazakhstan apparently liked my entry on the Trillium Software integration with Oracle, but not enough to come back. A visitor from the Cayman Islands took a break from SCUBA diving to read my story on the successes Trillium Software has had with SAP implementations. There's a nice webinar that we recorded that's available there. A visitor from Croatia took time to read my story about data quality on the mainframe. Even outside Croatia, the mainframe is still a viable platform for data management.
I’m looking forward to another year of writing about data governance and data quality. Thanks for all your visits!
Tuesday, October 21, 2008
Financial Service Companies Need to Prepare for New Regulation
We’re in the midst of a mortgage crisis. Call it a natural extension of capitalism, where greed can inspire unregulated “innovation”. That greed is now coming home to roost.
This problem has many moving pieces and it's difficult to describe in an elevator pitch. By the actions of our leaders and bankers, mortgage lenders were inspired to write dubious mortgages, and the US population was encouraged to apply for them. At first, these unchecked mortgages led to more free cash, more spending, and a boom in the economy. Unfortunately, the boom was built on a foundation of quicksand. It forced us to take drastic measures to bring balance back to the system. The $700 billion bill already passed is an example of those measures. Hopefully for our government, we won't need too many more balancing measures.
So where do we go from here? The experts say that the best-case scenario would be for the world economy to do well - unemployment stays low, personal income keeps pace with inflation and real estate prices find a bottom. I'm optimistic that we'll see that day soon. Many of the experts aren't so sure.
One thing that history teaches us is that regulatory oversight is bound to get stiffer after this fiasco. We had similar "innovations" in capitalism with the savings and loan scandal, the artificial dot-com boom, Enron, Tyco and WorldCom. Those scandals were followed by new worldwide regulations like Sarbanes-Oxley, Bill 198 (Canada), JSOX (Japan) and the Deutscher Corporate Governance Kodex (Germany), to name just a few. These laws tightened oversight of the accounting industry and toughened corporate disclosure rules. They also moved to make the leaders of corporations more personally liable for reporting irregularities.
The same should be true after the mortgage crisis. The types of loans that have brought us to this situation may only exist in tightly regulated form in the future. In the coming months, we should see a renewed emphasis on detecting fraud at every step of the process. For the financial services industry especially, it will be more important than ever to have good clean data, accurate business intelligence and holistic data governance to meet the regulations to come.
If you're running a company that still can't get a handle on its customers, has a hard time detecting fraud, has a lot of missing and outlier values in its data, and has many systems with duplicated forms of the same data values, you'll want to get started now on governing your data. Go now, run, since data governance and building intelligence can take years of hard work. The goal here would be to begin to mitigate the potential risk you have in meeting regulatory edicts. If you get going now, you'll not only beat the rush to comply, but you'll reap the real and immediate benefits of data governance.
Thursday, October 9, 2008
Teradata Partners User Group
Road trip! Next week, I'm heading to the Teradata Partners User Group and Conference in Las Vegas, and I'm looking forward to it. The event should be a fantastic opportunity to take a peek inside the Teradata world.
The event is a way for Trillium Software to celebrate its partnership with Teradata. This partnership has always made a lot of sense to me. Teradata and Trillium Software have had similar game-plans throughout the years – focus on your core competency, be the best you can be at it, but maintain an open and connectible architecture that allows in other high-end technologies. There are many similarities in the philosophies of the two companies.
Both companies have architecture that works well in particularly large organizations with vast amounts of data. One key feature with Teradata, for example, is that you can expand database capacity linearly, while preserving response time, by adding more nodes to the existing database. Similarly, with Trillium Software, you can expand the number of records cleansed in real-time by adding more nodes to the cleansing system. Trillium Software uses a load balancing technology called the Director to manage cleansing and matching on multiple servers. In short, both technologies will scale to support very large volumes of complex, global data.
The estimate is for about 4000 Teradata enthusiasts to show up and participate in the event. So, if you’re among them, please come by the Trillium Software exhibit and say hello.
Monday, October 6, 2008
Data Governance and Chicken Parmesan
With the tough economy and shrinking 401(k)s, some of my co-workers at Trillium are starting to cut back a bit on personal spending. They talk about how expensive everything is, and they speak with regret when they buy lunch at the Trillium cafeteria instead of bringing one from home. Until now, I've kept quiet about this topic and waited politely until the conversation turned to, say, fantasy football. But between you and me, I don't agree that there is a huge cost savings in making your own.
Restaurants can sell chicken parmesan for $15.99 and still make a profit because they have a system for making it that uses economies of scale. They buy ingredients cheaper, and because they use the sauce in other dishes, they have 'reusability' working for them, too. They use the sauce in their eggplant parmesan, spaghetti with meatballs, and many other dishes, and that reuse is powerful. Most of the high-end technologies you choose for your company have to have the same reusability as the sauce for the maximum benefit. Using data quality technologies that only plug into SAP, for example, when your future data governance projects may lead you to Oracle and Tibco and Siperian, just doesn't make sense.
One other consideration - what if something goes wrong with my homemade chicken parmesan? I have little recourse if my own home-cooked solution goes up in flames, except to incur even more expense and order out. But if the restaurant's chicken parmesan is bad, I can call them and they'll make me another one at no charge. Likewise, you have contractual recourse when a vendor solution doesn't do what they say it will.
If you’re thinking of cooking up your own technical solutions for data governance hoping to save a ton of money, think again. Your most economical solution might just be to order out.
Monday, September 29, 2008
The Data Intelligence Gap: Part Two
In part one, I wrote about the evolution of a corporation and how rapid growth leads to a data intelligence gap. It makes sense that a combination of people, process and technology is needed to close the gap, but just what kind of technology can be used to help you cross the divide and connect the needs of the business with the data available in the corporation?
Of course, the technology needed depends on the company’s needs and how mature they are about managing their data. Many technologies exist to help close the gap, improve information quality and meet the business needs of the organization. Let’s look at them:
CATEGORY | TECHNOLOGIES | HOW IT CLOSES THE GAP
Preventative | Type-Ahead Technology | This technology watches the user type and helps complete the data entry in real time. For example, products like Harte-Hanks Global Address help call center staff and others who enter address data into your system by speeding up the process and ensuring the data is correct.
Preventative | Data Quality Dashboard | Dashboards allow business users and IT users to keep an eye on data anomalies by constantly checking whether the data meets business specifications. Products like TS Insight even give you some attractive charts and graphs on the status of data compliance and the trend of its conformity. Dashboards are also a great way to communicate the importance of closing the data intelligence gap. When your people get smarter about it, they will help you achieve cleaner, more useful information.
Diagnostic and Health | Data Profiling | Not sure about the health and suitability of any given data set? Profile it with products like TS Discovery, and you'll begin to understand how much data is missing, outlier values in the data, and many other anomalies. Only then will you be able to understand the scope of your data quality project.
Diagnostic and Health | Batch Data Quality | Once the anomalies are discovered, a batch cleansing process can solve many problems with name and address data, supply chain data and more. Some solutions are batch-centric, while others can do both batch cleansing and scalable enterprise-class data quality (see below).
Infrastructure | Master Data Management (MDM) | Products from the mega-vendors like …
Infrastructure | Enterprise-Class Data Quality | Products like the Trillium Software System provide real time data quality to any application in the enterprise, including the …
Infrastructure | Data Monitoring | You can often use the same technology to monitor data as you do for profiling data. These tools keep track of the quality of the data. Unlike data quality dashboards, the IT staff can really dig into the nitty-gritty if necessary.
Enrichment | Services and Data Sources | Companies like Harte-Hanks offer data sources that can help fill the gaps when mission-critical data is missing. You can buy data and services to segment your database, check customer lists for change of address, look for customers on the do-not-call list, do reverse phone number look-ups, and more.
These are just some of the technologies involved in closing the data intelligence gap. In my next installment of this series, I’ll look at people and process. Stay tuned.
Monday, September 22, 2008
Are There Business Advantages to Poor Data Management?
I have long held the belief, perhaps even religion, that companies who do a good job governing and managing their data will be blessed with so many advantages over those who don’t. This weekend, as I was walking through the garden, the serpent tempted me with an apple. Might there actually be some business advantage in poorly managing your data?
The experience started when I noticed a bubble on the sidewall of my tire. Just a small bubble, but since I was planning on a trip down and back on the lonely Massachusetts Turnpike (Mass Pike) on a Sunday night, I decided to get it checked out. No need to risk a blow-out.
I remembered that I had purchased one of those “road hazard replacement” policies. I called the nearest location of a chain of stores that covers New England. Good news. The manager assured me that I didn’t need my paperwork and that the record would be in the database.
Of course, when I arrived at the tire center, no record of my purchase or my policy could be found. Since I didn't bring the printed receipt, the tire center manager gave me a few options: 1) drive down the Mass Pike with the bubbly tire and come back on Monday when they could "access the database in the main office"; 2) drive home, find the paperwork (hmm, not sure where it was), and come back to the store; or 3) buy a new tire at full price.
I opted to buy a new tire and attempt to claim a refund from the corporate office later when I found my receipts. The jury is still out on the success of that strategy.
However, this got me thinking. Could the inability of the stores to maintain more than 18 months of records actually be a business advantage? How many customers lose the paperwork, or even forget about their road hazard policies, and just pay the replacement price? How much additional revenue was this shortcoming actually generating each year? What additional revenue would be possible if the database only stored 12 months of transactions?
Finding fault in the one truth - data management is good - did hurt. However, I realized that any advantage from the poor data infrastructure design at the tire chain is very short-sighted. True, it may lower pay-outs on the road hazard policies in the short term, but eventually this poor customer database implementation has to catch up with them in decreased customer satisfaction and word-of-mouth badwill. There are so many tire stores here competing for the same buck that, eventually, the poor service will cause most good customers to move on.
If you're buying tires soon in New England and want to know what tire chain it was, e-mail me and I'll tell. But before I tell you all, I'm going to hold out hope for justice... and hope that our foundational beliefs are still intact.
Saturday, September 20, 2008
New Data Governance Books
A couple of new, important books hit the streets this month. I’m adding these books to my recommended reading list.
Data Driven: Profiting from Your Most Important Business Asset is Tom Redman's new book about making the most of your data to sharpen your company's competitive edge and enhance its profitability. I like how Tom uses real-life metaphors in this book to simplify the concepts of governing your data.
Master Data Management is David Loshin’s new book that provides help for both business and technology managers as they strive to improve data quality. Among the topics covered are strategic planning, managing organizational change and the integration of systems and business processes to achieve better data.
Both Tom and David have written several books on data quality and master data management, and I think their material gets stronger and stronger as they plug in new experiences and reference new strategies.
EDIT: In April of 2009, I also released my own book on data governance called "The Data Governance Imperative".
Check it out.
Monday, August 11, 2008
The Data Intelligence Gap: Part One
There is a huge chasm in many corporations today, one that hurts companies by keeping them from more revenue, more profit, and better operating efficiency. The gap, of course, lies in corporate information.
What the Business Wants to Know | Data Needed | What's Inhibiting Peak Efficiency
Can I lower my inventory costs and purchase prices? Can I get discounts on high volume items purchased? | Reliable inventory data. | Multiple ERP and …
Are my marketing programs effective? Am I giving customers and prospects every opportunity to love our company? | Customer attrition rates. Results of marketing programs. | Typos. Lack of standardization of name and address. Multiple …
Are any customers or prospects "bad guys"? Are we complying with all international laws? | Reliable customer data for comparison to "watch" lists. | Lack of standards. Ability to match names that may have slight variations against watch lists. Missing values.
Am I driving the company in the right direction? | Reliable business metrics. Financial trends. | Extra effort and time needed to compile sales and finance data – time to cross-check results.
Is the company we're buying worth it? | Fast comprehension of the reliability of the information provided by the seller. | Ability to quickly check the accuracy of the data, especially the customer lists, inventory level accuracy, financial metrics, and the existence of "bad guys" in the data.
Thursday, July 24, 2008
Forget the Data. Eat the Ice Cream.
It’s summer and time for vacations. Even so, it’s difficult for a data-centric guy like me to shut off thoughts of information quality, even during times of rest and relaxation.
Case in point, my family and I just took a road trip from Boston to Burlington, VT to visit the shores of Lake Champlain. We loaded up the mini-van and headed north. Along the way, you drive along beautiful RT 89, which winds its way through the Green Mountains and past the capital - Montpelier.
No trip to western Vermont is complete without a trip to the Ben and Jerry’s ice cream manufacturing plant in Waterbury. They offer a tour of the plant and serve up a sample of the freshly made flavor of the day at the end. The kids were very excited.
However, when I see a manufacturing process, my mind immediately turns to data. As the tour guide spouted off statistics about how much of any given ingredient they use, and which flavor was the most popular (Cherry Garcia), my thoughts turned to the trustworthiness of the data behind it. I wanted him to back it up by telling me what ERP system they used and what data quality processes were in place to ensure the utmost accuracy in the manufacturing process. Inside, I wondered if they had the data to negotiate properly with the ingredients vendors and if they really knew how many Heath bars, for example, they were buying across all of their manufacturing plants. Just having clean data and accurate metrics around their purchasing processes could save them thousands and thousands of dollars.
The tour guide talked about a Jack Daniels flavored ice cream that was now in the “flavor graveyard” mostly because the main ingredient was disappearing from the production floor. I thought about inventory controls and processes that could be put in place to stop employee pilfering.
It went on and on. The psychosis continued until my daughter exclaimed, "Dad! This is the coolest thing ever! That's how they make Chunky Monkey!" She was right. It was perhaps the coolest thing ever to see how they made something we enjoy nearly every day. It was cool to take a peek inside the corporate culture of Ben and Jerry's. It popped me back into reality.
Take your vacation this year, but remember that life isn’t only about the data. Remember to eat the ice cream and enjoy.
Tuesday, July 1, 2008
The Soft Costs of Information Quality
Choosing data quality technology simply on price could mean that you end up paying far more than you need to, thanks to the huge differences in how the products solve the problems. While your instinct may tell you to focus solely on the price of your data quality tool, your big costs come in less visible areas – like time to implement, re-usability, time spent preprocessing data so the tool can read it, performance and overall learning curve.
As if it weren't confusing enough for the technology buyer having to choose between desktop and enterprise-class technology, local and global solutions, or a built-in solution vs. a universal architecture, now you have to work out soft costs too. But you need to know that there are some huge differences in the way the technologies are implemented and work day-to-day, and those differences will impact your soft costs.
So just what should you look for to limit soft costs when selecting an information quality solution? Here are a few suggestions:
- Does the data quality solution understand data at the field level only, or can it see the big picture? For example, can you pass it an address that's a blob of text, or do you need to pass it individual first name, last name, address, city, state, and postal code fields? Importance: If the data is misfielded, you'll have a LOT of work to do to get it ready for a field-level solution.
- On a similar note, what is the approach to multi-country data? Is there an easy way to pre-process mixed global data, or is it a manual process? Importance: If the data has mixed countries of origin, again you'll have a lot of preprocessing work to do to get it ready.
- What is the solution's approach to complex records like "John and Diane Cougar Mellencamp DBA John Cougar"? Does the solution have the intelligence to understand all of the people in a record, or do you have to post-process the name?
- Despite the look of the user interface, is the product a real application or is it a development environment? Importance: In a real application, an error will be indicated if you pass in some wild and crazy data. In a development environment, even slight data quirks can cause nothing to run, and just getting the application to run can be very time-consuming and wasteful.
- How hard is it to build a process? As a user you’ll need to know how to build an entire end-to-end process with the product. During proof of concept, the data quality vendor may hide that from you. Importance: Whether you’re using it on one project, or across many projects, you’re eventually going to want to build or modify a process. You should know up-front how hard this is. It shouldn’t be a mystery, and you need to follow this during the proof-of-concept.
- Are web services the only real-time implementation strategy? Importance: Compared to a scalable application server, web services can be slow and actually add costs to the implementation.
- Does the application actually use its own address correction worldwide or a third party solution? Importance: Understanding how the application solves certain problems will let you understand how much support you’ll get from the company. If something breaks, it’s easier for the program’s originator to fix it. A company using a lot of third party applications may have challenges with this.
- Does the application have different ways to find duplicates? Importance: During a complex clean-up, you may want to dedupe your records based on, say, e-mail and name for the first pass. But what about the records where your e-mail isn't populated? For those records, you'll need to go back and use other attributes to match (see the sketch below). The ability to multi-match allows you to achieve cleaner, more efficient data by using whatever attributes are best in your specific data.
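Here is a minimal sketch of that multi-pass idea – my own illustrative Python, not any vendor's API: match on e-mail plus name where the e-mail is populated, then fall back to name plus postal code for the rest.

```python
# Illustrative multi-pass duplicate detection -- not any vendor's matching engine.
from collections import defaultdict

def dedupe_passes(records):
    """Group by e-mail + name first, then by name + postcode for records with no e-mail."""
    groups = defaultdict(list)
    no_email = []
    for rec in records:
        email = (rec.get("email") or "").strip().lower()
        name = (rec.get("name") or "").strip().lower()
        if email:
            groups[("pass1", email, name)].append(rec)        # first pass: e-mail + name
        else:
            no_email.append(rec)
    for rec in no_email:                                       # second pass: name + postcode
        name = (rec.get("name") or "").strip().lower()
        postcode = (rec.get("postcode") or "").replace(" ", "").upper()
        groups[("pass2", name, postcode)].append(rec)
    return [g for g in groups.values() if len(g) > 1]          # candidate duplicate sets

dupes = dedupe_passes([
    {"name": "Jane Doe", "email": "jane@example.com", "postcode": "01824"},
    {"name": "Jane Doe", "email": "jane@example.com", "postcode": "01824-1234"},
    {"name": "John Smith", "email": "", "postcode": "01824"},
    {"name": "John Smith", "email": None, "postcode": "01824"},
])
print(len(dupes))   # 2 candidate duplicate sets, found by two different passes
```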
I could go on. The point is – there are many technical, in-the-weeds differences between vendors, and those differences have a BIG impact on your ability to deliver information quality. The best way to understand a data quality vendor’s solution is to look over their shoulder during the proof-of-concept. Ask questions. Challenge the steps needed to cleanse your data. Diligence today will save you from having to buy Excedrin tomorrow.
Wednesday, June 25, 2008
Data Quality Events – Powerful and Cozy
For those of you who enjoy hobnobbing with the information quality community, I have a couple of recommendations for you. These events are your chance to rub elbows with different factions of the community. The crowds are small, but the information is powerful.
MIT Information Quality Symposium
We’re a couple of weeks away from the MIT Information Quality Symposium in Boston. I’ll be sharing the podium with a couple of other data quality vendors in delivering a presentation this year. I’m really looking forward to it.
Dr. Wang and his cohorts from MIT fill a certain niche in information quality with these gatherings. Rather than a heavily sponsored, high-pressure selling event, this one really focuses on concepts and the study of information quality. There are presenters from all over the globe, some who have developed thought-provoking theories on information quality, and others who just want to share the results of a completed information quality project. The majority of the presentations offer smart ways of dissecting and tackling data quality problems that aren't so much tied to vendor solutions as they are to processes and people.
My presentation this year will discuss the connections between the rate at which a company grows and the degree of poor information in the organization. While a company may have a strong desire to own their market, they may wind up owning chaos and disorder instead, in the form of disparate data. It’s up to data quality vendors to provide solutions to help high-growth companies defeat chaos and regain ownership of their companies.
If you decide to come to the MIT event, please come by the vendor session and introduce yourself.
Information and Data Quality Conference
One event that I'm regrettably going to miss this year is Larry English's Information and Data Quality Conference (IDQ) taking place September 22-25 in San Antonio, Texas. I've been to Larry's conferences in past years and have always had a great time. What struck me, at least in past years, was the fact that most of the people who went to the IDQ conference really "got it" in terms of the data quality issue. Most of the people I've talked with were looking to share advice on taking what they knew as the truth – that information quality is an important business asset – and making believers out of the rest of their organizations. Larry and the speakers at that conference will definitely make a believer out of you and send you out into the world to proclaim the information quality gospel. Hallelujah!
Thanks
On another topic, I’d like to thank Vince McBurney for the kind words in his blog last week. Vince runs a blog covering IBM Information Server. In his latest installment, Vince has a very good analysis of the new Gartner Magic Quadrant on data quality. Thanks for the mention, Vince.
Monday, June 16, 2008
Get Smart about Your Data Quality Projects
With all due respect to Agent Maxwell Smart, there is a mini battle between good and evil, CONTROL and KAOS, happening in many busy, fast-growing corporations. It is, of course, over information quality. Faster-growing companies are more vulnerable to chaos because opening up new national and international divisions, expanding through acquisition, manufacturing offshore, and doing all the other things that an aggressive company does lead to more misalignment and more chaotic data.
While a company may have a strong desire to “own the world” or at least their market, they may wind up owning chaos and disorder instead - in the form of disparate data. The challenges include:
- trying to reconcile technical data quality issues, such as different code pages like ASCII, Unicode and EBCDIC
- dealing with different data quality processes across your organization, each of which delivers different results
- being able to cleanse data from various platforms and applications
- dealing with global data, including local languages and nuances
Agent 99: Sometime I wish you were just an ordinary businessman.
Maxwell Smart: Well, 99, we are what we are. I'm a secret agent, trained to be cold, vicious, and savage... but not enough to be a businessman.
In an aggressive company, as your sphere of influence increases, it’s harder to gather key intelligence. How much did we sell yesterday? What’s the sales pipeline? What do we have in inventory worldwide? Since many company assets are tied to data, it’s hard to own your own company assets if they are a jumble.
Not only are decision-making metrics lost, but opportunity for efficiency is lost. With poor data, you may not be able to reach customers effectively. You may be paying too much to suppliers by not understanding your worldwide buying power. You may be driving your own employees away from innovation, as users begin to avoid new applications because of the poor data inside them.
KAOS Agent: Look, I'm a sportsman. I'll let you choose the way you want to die.
Maxwell Smart: All right, how about old age?
So, it's up to data quality vendors to provide solutions to help high-growth companies "get smart" and defeat chaos (KAOS) to regain ownership of their companies. They can do it with smart data-centric consulting services that help bring together business and IT. They can do it with technology that is easy to use and powerful enough to tackle even the toughest data quality problems. Finally, they can do it with a great team of people, working together to solve data issues.
Agent 99: Oh Max, you're so brave. You're going to get a medal for this.
Maxwell Smart: There's something more important than medals, 99.
Agent 99: What?
Maxwell Smart: It's after six. I get overtime.
Monday, June 9, 2008
Probabilistic Matching: Part Two
Matching algorithms, the functions that allow data quality tools to determine duplicate records and create households, are always a hot topic in the data quality community. In a previous installment of the Data Governance and Data Quality Insider, I wrote about the folly of probabilistic matching and its inability to precisely tune match results.
To recap, decisions for matching records together with probabilistic matchers are based on three things: 1) statistical analysis of the data; 2) a complicated mathematical formula; and 3) a "loose" or "tight" control setting. Statistical analysis is important because, under probabilistic matching, data that is more unique in your data set has more weight in determining a pass/fail on the match. In other words, if you have a lot of 'Smith's in your database, Smith becomes a less important matching criterion for those records. If a record has a unique last name like 'Afinogenova', that'll carry more weight in determining the match.
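As a toy illustration of that weighting idea – my own simplification, not any vendor's algorithm – you can derive a weight per value from how rare it is in the data set, sum the weights of the fields on which two records agree, and compare the total against a single loose or tight threshold:

```python
# Toy frequency-weighted ("probabilistic-style") scorer -- a simplification, not a vendor algorithm.
import math
from collections import Counter

def build_weights(records, field):
    """Rarer values earn higher weights: weight = -log(relative frequency)."""
    counts = Counter(r[field] for r in records)
    total = sum(counts.values())
    return {value: -math.log(n / total) for value, n in counts.items()}

def match_score(a, b, weights_by_field):
    """Sum the weights of the fields on which the two records agree."""
    return sum(w.get(a[f], 0.0) for f, w in weights_by_field.items() if a[f] == b[f])

data = [{"last": "Smith", "street": "Main St"}] * 90 + \
       [{"last": "Afinogenova", "street": "Main St"}] * 10
weights = {f: build_weights(data, f) for f in ("last", "street")}

common = match_score({"last": "Smith", "street": "Main St"},
                     {"last": "Smith", "street": "Main St"}, weights)
rare = match_score({"last": "Afinogenova", "street": "Main St"},
                   {"last": "Afinogenova", "street": "Main St"}, weights)

# Agreement on "Afinogenova" scores much higher than agreement on "Smith",
# and the only tuning knob is a single loose/tight threshold on the total score.
LOOSE, TIGHT = 0.1, 2.0
print(round(common, 2), round(rare, 2))        # ~0.11 vs ~2.3
print(common >= TIGHT, rare >= TIGHT)          # False True
```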
The trouble comes when you don’t like the way records are being matched. Your main course of action is to turn the dial on the loose/tight control to see if you can get the records to match without affecting record matching elsewhere in the process. Little provision is made for precise control of what records match and what records don’t. Always, there is some degree of inaccuracy in the match.
In other forms of matching, like deterministic matching and rules-based matching, you can very precisely control which records come together and which ones don’t. If something isn’t matching properly, you can make a rule for it. The rules are easy to understand. It’s also very easy to perform forensics on the matching and figure out why two records matched, and that comes in handy should you ever have to explain to anyone exactly why you deduped any given record.
But there is another major folly of probabilistic matching – namely performance. Remember, probabilistic matching relies heavily on statistical analysis of your data. It wants to know how many instances of “John” and “Main Street” are in your data before it can determine if there’s a match.
Consider for a moment a real time implementation, where records are entering the matching system, say once per second. The solution is trying to determine if the new record is almost like a record you already have in your database. For every record entering the system, shouldn’t the solution re-run statistics on the entire data set for the most accurate results? After all, the last new record you accepted into your database is going to change the stats, right? With medium-sized data sets, that’s going to take some time and some significant hardware to accomplish. With large sets of data, forget it.
Many vendors who tout their probabilistic matching secretly have work-arounds for real-time matching performance issues. They recommend that you don't update the statistics for every single new record. Depending on the real-time volumes, you might update statistics nightly or, say, every 100 records. But it's safe to say that real-time performance is something you're going to have to deal with if you go with a probabilistic data quality solution.
Better yet, you can stay away from probabilistic matching and take a much less complicated and much more accurate approach – using time-tested pre-built business rules supplemented with your own unique business rules to precisely determine matches.
Friday, June 6, 2008
Data Profiling and Big Brown
Big Brown is positioned to win the third leg of the Triple Crown this weekend. In many ways picking a winner for a big thoroughbred race is similar to planning for a data quality project. Now, stay with me on this one.
When making decisions on projects, we need statistics and analysis. With horse racing, we have a nice report that is already compiled for us called the daily racing form. It contains just about all the analysis we need to make a decision. With data-intensive projects, you've got to do the analysis up front in order to win. We use data profiling tools to gather a wide array of metrics in order to make reasonable decisions. Like in our daily racing form, we look for anomalies, trends, and ways to cash in.
In data governance project planning, where company-wide projects abound, we may even have the opportunity to pick the projects that will deliver the highest return on investment. It's similar to picking a winner at 10:1 odds. We may decide to bet our strategy on a big winner, and when that horse comes in, we'll win big for our company.
Now needless to say, neither the daily racing form nor the results of data profiling are completely infallible. For example, Big Brown’s quarter crack in his hoof is something that doesn’t show up in the data. Will it play a factor? Does newcomer Casino Drive, for whom there is very little data available, have a chance to disrupt our Big Brown project? In data intensive projects, we must communicate, bring in business users to understand processes, study and prepare contingency plans in order to mitigate risks from the unknown.
So, Big Brown is positioned to win the Triple Crown this weekend. Are you positioned to win on your next data intensive IT project? You can better your chances by using the daily racing form for data governance – a data profiling tool.
Tuesday, June 3, 2008
Trillium Software News Items
A few big items hit the news wire today from Trillium Software that are significant for data quality enthusiasts.
Item One:
Trillium Software cleansed and matched the huge database of Loyalty Management Group (LMG), the database company that owns the Nectar and Air Miles customer loyalty schemes in the UK and Europe.
Significance:
LMG has saved £150,000 by using data quality software to cleanse its mailing list, which is the largest in Europe, some 10 million customers strong. I believe this speaks to Trillium Software’s outstanding scalability and global data support. This particular implementation is an Oracle database with Trillium Software as the data cleansing process.
Item Two:
Trillium Software delivered the latest version of the Trillium Software System version 11.5. The software now offers expanded cleansing capabilities across a broader range of countries.
Significance:
Again, global data is a key take-away here. Being able to handle all of the cultural challenges you encounter with international data sets is a problem that requires continual improvement from data quality vendors. Here, Trillium is leveraging their parent company’s buyout of Global Address to improve the Trillium technology.
Item Three:
Trillium Software released a new mainframe version of version 11.5, too.
Significance:
Trillium Software continues to support data quality processes on the mainframe. Unfortunately, you don't see other enterprise software companies offering many new mainframe releases these days, despite the fact that the mainframe is still very much a viable and vibrant platform for managing data.
Monday, May 19, 2008
Unusual Data Quality Problems
When I talk to folks who are struggling with data quality issues, there are some who are worried that they have data unlike any data anyone has ever seen. Often there’s a nervous laugh in the voice as if the data is so unusual and so poor that an automated solution can’t possibly help.
Yes, there are wide variations in data quality and consistency, and your data might be unlike any we've seen. On the other hand, we've seen a lot of unusual data over the years. For example:
- A major motorcycle manufacturer used data quality tools to pull out nicknames from their customer records. Many of the names they had acquired for their prospect list were from motorcycle events and contests where the entries were, shall we say, colorful. The name fields contained data like “John the Mad Dog Smith” or “Frank Motor-head Jones”. The client used the tool to separate the name from the nickname, making it a more valuable marketing list.
- One major utility company used our data quality tools to identify and record notations on meter-reader records that were important to keep for operational uses, but not in the customer billing record. Upon analysis of the data, the company noticed random text like "LDIY" and "MOR" along with the customer records. After some investigation, they figured out that LDIY meant "Large Dog in Yard," which was particularly important for meter readers. MOR meant "Meter on Right," which was also valuable. The readers were given their own notes field, so that they could maintain the integrity of the name and address while also keeping this valuable data. It probably saved a lot of meter readers from dog bite situations.
- Banks have used our data quality tools to separate items like "John and Judy Smith/221453789 ITF George Smith". The organization wanted to consider this type of record as three separate records "John Smith" and "Judy Smith" and "George Smith" with obvious linkage between the individuals. This type of data is actually quite common on mainframe migrations.
- A food manufacturer standardizes and cleanses ingredient names to get better control of manufacturing costs. In data from their worldwide manufacturing plants, an ingredient might be "carrots", "chopped frozen carrots", "frozen carrots, chopped", "chopped carrots, frozen" and so on. (Not to mention all the possible abbreviations for the words carrots, chopped and frozen.) Without standardization of these ingredients, there was really no way to tell how many carrots the company purchased worldwide. There was no bargaining leverage with the carrot supplier, and all the other ingredient suppliers, until the data was fixed (see the sketch below).
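A minimal sketch of that kind of standardization – the rules here are invented for illustration, not the manufacturer's actual ones: expand abbreviations, then sort the tokens into a canonical form so that all of the carrot variants roll up to a single key.

```python
# Illustrative ingredient standardization -- abbreviation rules invented for this example.
ABBREVIATIONS = {"frz": "frozen", "frzn": "frozen", "chpd": "chopped", "carrot": "carrots"}

def standardize(ingredient: str) -> str:
    tokens = ingredient.lower().replace(",", " ").split()
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]   # expand known abbreviations
    return " ".join(sorted(tokens))                      # word order no longer matters

variants = ["chopped frozen carrots", "frozen carrots, chopped",
            "chopped carrots, frozen", "Carrots chpd frz"]
print({standardize(v) for v in variants})   # one canonical key: {'carrots chopped frozen'}
```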
Not all data quality solutions can handle these types of anomalies. Some will simply pass these "odd" values through without attempting to cleanse them. It's key to have a system that will learn from your data and allow you to develop business rules that meet the organization's needs.
Now there are times, quite frankly, when data gets so bad that automated tools can do nothing about it, but that's where data profiling comes in. Before you attempt to cleanse or migrate data, you should profile it to have a complete understanding of it. This will let you weigh the cost of fixing very poor data against the value that it will bring to the organization.
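As a rough idea of what a first profiling pass surfaces – a hand-rolled sketch, not TS Discovery or any other product – you can compute per-column fill rates, distinct counts, and the dominant values before deciding whether the data is worth fixing:

```python
# Hand-rolled profiling sketch -- real profiling tools go much further than this.
from collections import Counter

def profile(rows, column):
    values = [r.get(column) for r in rows]
    filled = [v for v in values if v not in (None, "")]
    counts = Counter(filled)
    return {
        "fill_rate": len(filled) / len(values) if values else 0.0,
        "distinct": len(counts),
        "most_common": counts.most_common(3),   # a quick look at the dominant values
    }

rows = [{"state": "MA"}, {"state": "MA"}, {"state": "ma."}, {"state": ""}, {"state": "XX"}]
print(profile(rows, "state"))
# {'fill_rate': 0.8, 'distinct': 3, 'most_common': [('MA', 2), ('ma.', 1), ('XX', 1)]}
```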
Wednesday, May 14, 2008
The Best Books on Data Governance
Is there a comprehensive book on data governance that we should all read to achieve success? At the time of this post, I'm not sure there is. I haven't seen it yet. If you think about it, such a book would have to make War and Peace look like a Harlequin novel in terms of size in order to cover all aspects of the topic. Instead, we really must become students of data governance and begin to understand large knowledge areas such as 1) how to optimize and manage processes; 2) how to manage teams and projects; 3) public relations and marketing for internal project promotion; and 4) how to implement technologies to achieve data governance, just to name a few.
I’ve recently added an Amazon widget to my blog that lists some printed books on data governance-related topics. The books cover the four areas I’ve mentioned. As summer vacation arrives, now is the time to buy your books for the beach and read up! After all, what could be more relaxing on a July afternoon than a big frozen margarita and the book “Business Process Improvement: The Breakthrough Strategy for Total Quality, Productivity, and Competitiveness” by James Harrington?
The Amazon affiliate program generates just a few pennies for each book, but what money it does generate will be donated to charity. The appeal of the Amazon widget is that it's a good way to store a list of books and provide direct links to buy. If you have some suggestions to add to the list, please share them.
EDIT: My book on data governance is now available on Amazon. The Data Governance Imperative.
Sunday, May 4, 2008
Data Governance Structure and Organization Webinar
My colleague Jim Orr just did a great job delivering a webinar on data governance. You can see a replay of the webinar in case you missed it. Jim is our Data Quality Practice Leader and he has a very positive point of view when it comes to developing a successful data governance strategy.
In this webinar, Jim talks exclusively about the structure and the organization behind data governance. If you believe that data governance is people, process and technology, this webinar covers the "people" side of the equation.
Sunday, April 27, 2008
The Solution Maturity Cycle
I saw the news about Informatica’s acquisition of Identity Systems, and it got me thinking. I recognize a familiar pattern that all too often occurs in the enterprise software business. I’m going to call it the Solution Maturity Cycle. It goes something like this:
1. The Emergence Phase: A young, fledgling company emerges that provides an excellent product that fills a need in the industry. This was Informatica in the 90’s. Rather than hand coding a system of metadata management, companies could use a cool graphical user interface to get the job done. Customers were happy. Informatica became a success. Life was good.
2. The Mashup Phase: Customers begin to realize that if they mash up the features of say, an ETL tool and a data quality tool, they can reap huge benefit for their companies. Eventually, the companies see the benefit of working together, and even begin to talk to prospective customers together. This was Informatica in 2003-5, working with FirstLogic and Trillium Software. Customers could decide which solution to use. Customers were happy that they could mashup, and happy that others had found success in doing so.
3. The Market Consolidation Phase: Under pressure from stockholders to increase revenue, the company looks to buy a solution in order to sell it in-house. The pressure also comes from industry analysts, who, if they're doing their job properly, interpret the mashup as a hole in the product. Unfortunately, the established and proven technology companies are too expensive to buy, so the company looks to a young, fledgling data quality company. The decision on which company to buy is influenced more by bean counters than technologists. Even if there are limitations on the fledgling's technology, the sales force pushes hard to eliminate mashup implementations, so that annual maintenance revenue will be recognized. This is what happened with Informatica and Similarity Systems, in my opinion. Early adopters are confused by this and fearful that their mashup might not be supported. Some customers fight to keep their mashups; some yield to the pressure and install the new solution.
4. Buy and Grow Phase: When bean counters select technology to support the solution, they usually get some product synergies wrong. Sure, the acquisition works from a revenue-generating perspective, but from the technology solution perspective, it is limited. The customers are at the same time under pressure from the mega-vendors, who want to own the whole enterprise. What to do? Buy more technology. It’ll fill the holes, keep the mega-vendor wolves at bay, and build more revenue.
The Solution Maturity Cycle is something we all must pay attention to when dealing with vendors. For example, I’m seeing phase 3 of this cycle occur in the SAP world, where SAP’s acquisition of Business Objects dropped several data quality solutions in SAP’s lap. Now, despite the many successful mashups of Trillium Software and SAP, customers are being shown other solutions from the acquisition. Through it all, history makes me question whether an ERP vendor will be committed to the data quality market over the long term.
After a merger occurs, customers face a critical decision point. Should you resist pulling out your mashups, or should you try to unify the solution under one vendor? It's a tough decision, and it may affect internal IT teams, causing conflict between those who have been working on the mashup and the mega-vendor team. In making this decision, there are several key questions to ask:
- Is the newly acquired technology in the vendor’s core competency?
- Is the vendor committed to interoperability with other enterprise applications, or just their own? How will this affect your efforts for an enterprise-wide data governance program?
- Is the vendor committed to continual improvement of this part of the solution?
- How big is the development team and how many people has the vendor hired from the purchased company? (Take names.)
- Can the vendor prove that taking out a successful solution to put in a new one will make you more successful?
- Are there any competing solutions within the vendor’s own company, poised to become the standard?
- Who has been successful with this solution, and do they have the same challenges that I have?
Wednesday, April 9, 2008
Must-read Analyst Reports on Data Governance
If you’re thinking of implementing a data governance strategy at your company, here are some key analyst reports I believe are a must-read.
Data Governance: What Works And What Doesn't by Rob Karel, Forrester
A high-level overview of data governance strategies. It’s a great report to hand to a C-level executive in your company who may need some nudging.
Data Governance Strategies by Philip Russom, TDWI
A comprehensive overview of data governance, including extensive research and case studies. This one is hot off the presses from TDWI. Sponsored by many of the top information quality vendors.
The Forrester Wave™: Information Quality Software by J. Paul Kirby, Forrester
This report covers the strengths and weaknesses of top information quality software vendors. Many of the vendors covered here have been gobbled up by other companies, but the report is still worth a read. $$
Best Practices for Data Stewardship and Magic Quadrant for Data Quality Tools by Ted Friedman, Gartner
I have included the names of two of Ted’s reports on this list, but Ted offers much insight in many forms. He has written and spoken often on the topic. (When you get to the Gartner web site, you're going to have to search on the above terms as Gartner makes it difficult to link directly.) $$
Ed Note: The latest quadrant (2008) is now available here.
The case for a data quality platform by Philip Howard, Bloor Research
Andy Hayler and Philip Howard are prolific writers on information quality at Bloor Research. They bring an international flair to the subject that you won’t find elsewhere on this list.
Sunday, April 6, 2008
Politics, Presidents and Data Governance
I was curious about the presidential candidates and their plans to build national ID cards and a database of citizens, so I set out to do some research on the candidates’ stances on this issue. It strikes me as a particularly difficult task, given the size and complexity of the database that would be needed. Just how realistic would each candidate’s data governance strategy be?
I searched the candidates’ web sites with the following Google commands:
database site:http://www.johnmccain.com
database site:http://www.barackobama.com
database site:http://www.hillaryclinton.com
Hardly scientific, but the results were interesting nonetheless. The candidates have very different data management plans for the country, and this simple search gave some insight into their priorities.
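For anyone who wants to repeat the exercise, here is a minimal sketch (my own, not part of the original searches) that builds the same site-restricted query URLs in Python. The candidate domains come from the commands above; everything else is illustrative.

from urllib.parse import quote_plus

def build_query_url(term, site):
    """Return a Google search URL restricted to a single candidate site."""
    return "https://www.google.com/search?q=" + quote_plus(term + " site:" + site)

candidate_sites = [
    "http://www.johnmccain.com",
    "http://www.barackobama.com",
    "http://www.hillaryclinton.com",
]

for site in candidate_sites:
    # Prints one URL per candidate; paste each into a browser to run the search.
    print(build_query_url("database", site))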
Clinton:
Focused on national health care and the accompanying data challenges.
• Patient Health Care Records Database
• Health Care Provider Performance Tracking Database
• Employer History of Complaints
Comments: It’s clear that starting a national database of doctors and patients is a step toward a national health plan. There are huge challenges with doctor data, however. Many doctors work in multiple locations, having a practice at a major medical center and a private practice, for example. Consolidating doctor lists from insurance companies would rely heavily on unique health care provider ID numbers, doctor age and sex, and other factors beyond name and address to maintain information quality. This is an ambitious plan, particularly given data compliance regulations, but it is necessary for a national health plan.
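To make that matching point concrete, here is a minimal sketch of that kind of consolidation in Python. The field names (provider_id, birth_year and so on) are my own illustrative assumptions, not any insurer’s actual schema: match on the unique provider ID when it exists, and fall back to name plus age and sex attributes when it doesn’t.

from collections import defaultdict

def consolidation_key(record):
    """Build a match key: prefer the unique provider ID; otherwise fall back
    to name plus birth year and sex, since name and address alone are
    unreliable for doctors with multiple practice locations."""
    if record.get("provider_id"):
        return ("id", record["provider_id"])
    return ("fallback",
            record.get("last_name", "").strip().lower(),
            record.get("first_name", "").strip().lower(),
            record.get("birth_year"),
            record.get("sex"))

def consolidate(records):
    """Group doctor records from multiple source lists under one match key."""
    groups = defaultdict(list)
    for rec in records:
        groups[consolidation_key(rec)].append(rec)
    return groups

# The same doctor reported by two insurers at two practice locations
sample = [
    {"provider_id": "1234567890", "last_name": "Lee", "first_name": "Ann",
     "address": "Major Medical Center"},
    {"provider_id": "1234567890", "last_name": "Lee", "first_name": "Ann",
     "address": "Private Practice, Elm St."},
]
for key, recs in consolidate(sample).items():
    print(key, "->", len(recs), "records")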
Obama:
Not much about actual database plans, but Obama has commented in favor of:
• Lobbyist Activity Database
• National Sex Offender Database
Comments: Many states currently monitor sex offenders, so the challenge would be coordinating a process and managing the metadata from the states. Not a simple task to say the least. I suspect none of the candidates are really serious about this, but it’s a strong talk-track. Ultimately, this may be better left to the states to manage.
As for the lobbyist activity database, I frankly can’t see how it would work. Would lobbyists complete online forms describing their activities with politicians? If lobbyists have to describe their interactions with a politician, would they be given a blank slate on which to scribble some notes about the event, gift, dinner, or meeting topics? Such a database would likely be chock full of unstructured data, and in my opinion its usefulness would be questionable.
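To illustrate why that matters, here is a small sketch contrasting a free-text disclosure with a structured record that could actually be queried and audited. The fields and values are hypothetical, purely to make the structured-versus-unstructured point.

from dataclasses import dataclass
from datetime import date
from enum import Enum

class ActivityType(Enum):
    MEETING = "meeting"
    GIFT = "gift"
    DINNER = "dinner"
    EVENT = "event"

@dataclass
class LobbyistActivity:
    lobbyist_name: str
    official_name: str
    activity_type: ActivityType   # constrained value, easy to aggregate
    activity_date: date
    amount_usd: float             # 0.0 when nothing of value changed hands
    notes: str = ""               # free text kept only as a supplement

# Free text alone ("had dinner w/ Sen. X, discussed energy bill, ~$200?")
# cannot be reliably summed, filtered, or audited; the structured record can.
disclosure = LobbyistActivity(
    lobbyist_name="J. Smith",
    official_name="Sen. Example",
    activity_type=ActivityType.DINNER,
    activity_date=date(2008, 4, 1),
    amount_usd=200.00,
    notes="Discussed pending energy legislation.",
)
print(disclosure)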
McCain:
• Grants and Contracts Database
• Lobbyist Activity Database
• National Sex Offender Database
Comments: Adding the grants and contracts database to the mix, I see McCain’s plan as similar to Obama’s in that it amounts to storing unstructured data.
For any of these plans from our major presidential candidates to succeed, I see a huge effort required in the “people” and “process” components of data governance. Congress will have to enact laws that describe data models, data security, information quality, exceptions processing and much more. Clearly, this is not its area of expertise. Yet the candidates seem to be talking about technology as a magic wand to heal our country’s problems. It’s not going to be easy for any of them to make any of this a reality, even with all the government’s money.
Instead of these popular vote-grabbing initiatives, wouldn't the government be better served by a president who understands data governance? When you think about it, the US government is making the same mistake that businesses make: growing and expanding data silos, leading to more and more inefficiencies. I can’t help thinking that what we really need is a federal information quality and metadata management agency (since the government likes acronyms, shall we call it FIQMM?) to oversee the government’s data. The agency could be empowered by the president to access government data, define data models, and provide the people, processes and technologies to improve efficiency. Imagine what efficiencies we could gain with a federal data governance strategy. Just imagine.