
Tuesday, August 30, 2011

Top Ten Root Causes of Data Quality Problems: Part Four

Part 4 of 5: Data Flow
In this continuing series, we're looking at root causes of data quality problems and the business processes you can put in place to solve them.  In part four, we examine some of the areas involving the pervasive nature of data and how it flows to and fro within an organization.

Root Cause Number Seven: Transaction Transition

More and more data is exchanged between systems through real-time (or near real-time) interfaces. As soon as the data enters one database, it triggers procedures necessary to send transactions to other downstream databases. The advantage is immediate propagation of data to all relevant databases.

However, what happens when transactions go awry? A malfunctioning upstream system can corrupt downstream business applications, and even a small data model change can break the interfaces that feed them.

Root Cause Attack Plan
  • Schema Checks – Employ schema checks in your job streams to make sure your real-time applications are producing consistent data.  Schema checks will do basic testing to make sure your data is complete and formatted correctly before loading.
  • Real-time Data Monitoring – One level beyond schema checks is to proactively monitor data with profiling and data monitoring tools.  Tools like the Talend Data Quality Portal and others will ensure the data contains the right kind of information.  For example, if your part numbers are always a certain shape and length, and contain a finite set of values, any variation on that attribute can be monitored. When variations occur, the monitoring software can notify you.
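To make the monitoring idea concrete, here is a minimal sketch in Python of the kind of shape-and-domain rule described above. The part-number format, the prefix list and the function names are assumptions for illustration; a tool like the Talend Data Quality Portal would manage the rules, trending and notification for you.

```python
import re

# Hypothetical rule: part numbers are three letters, a dash, and five digits,
# and the prefix must come from a known, finite set of values.
PART_PATTERN = re.compile(r"^[A-Z]{3}-\d{5}$")
VALID_PREFIXES = {"ENG", "CHS", "ELE"}  # assumed domain values

def check_part_number(value: str) -> list[str]:
    """Return a list of rule violations for a single part number."""
    problems = []
    if not PART_PATTERN.match(value):
        problems.append(f"bad shape: {value!r}")
    elif value[:3] not in VALID_PREFIXES:
        problems.append(f"unknown prefix: {value!r}")
    return problems

def monitor(batch: list[str]) -> None:
    """Run the checks over an incoming batch and report any variations."""
    violations = [p for v in batch for p in check_part_number(v)]
    if violations:
        # In a real deployment this would notify someone or raise a ticket.
        print(f"{len(violations)} part-number violations found:")
        for v in violations:
            print("  -", v)

monitor(["ENG-00412", "eng-00413", "XYZ-99001", "CHS-1234"])
```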

Root Cause Number Eight: Metadata Metamorphosis

A metadata repository should be shared by multiple projects, with an audit trail maintained on usage and access. For example, your company might have part numbers and descriptions that are universal to CRM, billing, ERP systems, and so on. When a part number becomes obsolete in the ERP system, the CRM system should know. Metadata changes, and it needs to be shared.

In theory, documenting the complete picture of what is going on in the database and how the various processes are interrelated would let you mitigate the problem entirely. The descriptions and part numbers need to be shared among all applicable applications. From there, you could analyze the data quality implications of any change in code, processes, data structure, or data collection procedures, and thus eliminate unexpected data errors. In practice, this is a huge task.

Root Cause Attack Plan
  • Predefined Data Models – Many industries now have basic definitions of what should be in any given set of data.  For example, the automotive industry follows certain ISO 8000 standards.  The energy industry follows Petroleum Industry Data Exchange standards or PIDX.  Look for a data model in your industry to help.
  • Agile Data Management – Data governance is achieved by starting small and building out a process that first fixes the most important problems from a business perspective. You can leverage agile solutions to share metadata and set up optional processes across the enterprise.

This post is an excerpt from a white paper available here. My final post on this subject will follow in the days ahead.

Monday, June 13, 2011

The Differences Between Small and Big Data

There is a lot of buzz today about big data and companies stepping up to meet the challenge of ever-increasing data volumes. At the center of it all are Hadoop and the cloud. Hadoop can intelligently manage the distribution of processing and of your files; it manages the infrastructure needed to break big data down into more manageable chunks for processing by multiple servers. Likewise, a cloud strategy can take data management outside the walls of a corporation into a highly scalable infrastructure.

Do you have big data? It’s difficult to know precisely, because big data is vaguely defined. You may qualify for big data technology if you face hundreds of gigabytes of data, or it may take hundreds or thousands of terabytes. The classification of “big data” is not strictly a matter of data size; other business factors come into play, too. Your data management infrastructure needs to take into account future data volumes, peaks and lulls in demand, business requirements and much more.

Small and Medium-Sized Data

What about “small” and medium-sized data? Data from spreadsheets, the occasional flat file, leads from a trade show, and catalog data from vendors may be vital to your business processes. With a new industry focus on transparency, business-user involvement and data sharing, small data is a constant issue. Spreadsheets and flat files are the preferred way to share data today because most companies have some process for handling them. When you get these small to medium-sized data sets, it is still necessary to do the following (see the sketch after this list):
  • profile them
  • integrate them into your relational database
  • aggregate data from these sources, or extract only the vital parts
  • apply data quality standards when necessary
  • use them as part of a master data management (MDM) initiative
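For the small-data case, here is a minimal sketch of that first profiling step, assuming pandas and a hypothetical tradeshow_leads.csv with name, email, company and state columns; the column names and the e-mail rule are placeholders, not a prescribed layout.

```python
import pandas as pd

# Quick profile of a hypothetical trade-show leads file before loading it
# into the relational database.
leads = pd.read_csv("tradeshow_leads.csv")  # assumed columns: name, email, company, state

profile = pd.DataFrame({
    "non_null_pct":    leads.notna().mean() * 100,  # completeness per column
    "distinct_values": leads.nunique(),             # cardinality per column
})
print(profile)

# One simple quality rule before integration: keep only rows with a plausible e-mail.
clean = leads[leads["email"].str.contains("@", na=False)]
print(f"{len(leads) - len(clean)} rows rejected for a missing or malformed e-mail")
```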

The Differing Goals of Big Data and Little Data
With big data, the concern is usually whether your data management technology can handle massive quantities and still produce meaningful aggregates. You need solutions that will scale to meet your data management needs. Handling small and medium data sets, however, is more about short- and long-term costs: how can you quickly and easily integrate data without a lot of red tape, big license fees, pain and suffering?

Think about it. When you need to handle small and medium data, you have options:
  • Hand-coding: Hand-coding is sometimes faster than any tool, and it may still be fine for ad-hoc, one-off data integration. Once you find yourself hand-coding again and again, though, you’ll rethink that strategy; eventually managing all that code wastes time and costs you a bundle. If your data volumes grow, hand-coded solutions quickly become obsolete because they don’t scale. Hand-coding gets high marks on speed to value, but falters on sustainability and long-term costs.
  • Open Source: Open source data management tools provide a quick way to get started, low overall costs and high sustainability.  By just downloading and learning the tools, you’re on your way to getting data management done.  The open source solutions may have some limitations on scalability, but most open source providers have low-cost commercial upgrades that meet these needs.  In other words, it's easy to start today and leverage Hadoop and the Cloud if you need it later. Open source gets high marks on speed to value, sustainability and costs.
  • Traditional Data Management Vendors: Small data is a tough issue for the mega-vendors. Even for 50K-100K records, the license cost in both the short term and the long term can be prohibitive. The mega-vendor solutions do tend to scale well, making them sustainable at a cost. However, mergers in the data management business do happen, and the sustainability of a product can be affected by them. Commercial vendors get respectable marks on speed to value and sustainability, but falter on high up-front costs and maintenance fees.
I've heard it a million times in this business - start small and fast with technology that gives you a quick success but also scales to future tasks.

    Monday, April 25, 2011

    Data Quality Scorecard: Making Data Quality Relevant

Most data governance practitioners agree that a data quality scorecard is an important tool in any data governance program. It provides comprehensive information about the quality of the data in a database and, perhaps even more importantly, allows business users and technical users to collaborate on the quality issue.

However, there are multiple levels of metrics to consider:

1. Metrics that the technologists use to fix data quality problems. Examples: 7% of the e-mail attribute is blank; 12% of the e-mail attribute does not follow the standard e-mail syntax; 13% of our US mail addresses fail address validation.
2. Metrics that business people use to make decisions about the data. Examples: 9% of my contacts have invalid e-mails; 3% have both invalid e-mails and invalid addresses.
3. Metrics that managers use to get the big picture. Example: this customer data is good enough to use for a campaign.

All three levels matter to the various members of the data governance team. Level one shows the steps you need to take to fix the data. Level two gives context for the task at hand. Level three tells the uninformed about the business issue without making them dig into the details.

So, when you’re building your DQ metrics, remember to roll the data up into progressively higher-level metrics. You must design the scorecards to meet the interests of the different audiences, from technical through to business and up to executive. At the base of a data quality scorecard is information about the data quality of individual data attributes; this is the default information that most profilers deliver out of the box. As you aggregate scores, the high-level measures of data quality become more meaningful. In the middle are various score sets that let your company analyze and summarize data quality from different perspectives. If you define the objective of a data quality assessment project as calculating these different aggregations, you will have a much easier time maturing your data governance program. The business users and the c-level will begin to pay attention.
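As a rough illustration of the roll-up, here is a minimal sketch in Python. The numbers, the threshold and the way the level-one metrics combine are invented for the example; in practice the aggregation depends on how your rules overlap at the record level.

```python
# Level one: attribute-level metrics as a profiler might report them
# (values invented for illustration).
level_one = {
    "email_blank_pct": 7.0,
    "email_bad_syntax_pct": 12.0,
    "address_failed_validation_pct": 13.0,
}

# Level two: roll the attribute metrics up into business-facing statements.
# Blank and badly formed e-mails are disjoint here, so they simply add up;
# your own rules may overlap and need record-level counting instead.
invalid_email_pct = level_one["email_blank_pct"] + level_one["email_bad_syntax_pct"]
invalid_address_pct = level_one["address_failed_validation_pct"]

# Level three: a single big-picture answer, based on an assumed threshold.
CAMPAIGN_THRESHOLD_PCT = 25.0
good_enough = max(invalid_email_pct, invalid_address_pct) < CAMPAIGN_THRESHOLD_PCT

print(f"Level 2: {invalid_email_pct:.0f}% of contacts have unusable e-mails")
print(f"Level 3: customer data {'is' if good_enough else 'is not'} good enough for a campaign")
```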

    Tuesday, November 16, 2010

    Ideas Having Sex: The Path to Innovation in Data Management

I read a recent analyst report on the data quality market and “enterprise-class” data quality solutions. Per usual, the open source solutions were mentioned only in passing while the data quality solutions of the past were given high marks. Some of the top-ranked solutions originated in the days when the mainframe was king. Some of the top contenders are still cobbled together from ill-conceived acquisitions. It got me thinking about the way we do business today and how much of it is changing.

Back in the 1990s or earlier, if you had an idea for a new product, you’d work with an internal team of engineers and build the individual parts. That kind of innovation took time, because you didn’t always have exactly the right people on the job. It was slow and tedious, and the product was always confined by its own lineage.

The Android phone market is a perfect example of the modern way to innovate. Today, when you want to build something groundbreaking like an Android phone, you pull in expertise from all around the world. Sure, Samsung might make the CPU and video processing chips, but Primax Electronics in Taiwan might make the digital camera, Broadcom in the US makes the touch screen, and many others contribute. Software vendors push the platform further with their cool apps. Innovation happens at breakneck speed because the Android is a collection of ideas that have sex and produce incredible offspring.

Isn’t that really the model of a modern company? You have ideas getting together and making new ideas. When there is free exchange between people, there is no need to re-invent something that has already been invented. See the TED talk on this concept, in which British author Matt Ridley argues that, throughout history, the engine of human progress and prosperity has been “ideas having sex.”

    The business model behind open source has a similar mission.  Open source simply creates better software. Everyone collaborates, not just within one company, but among an Internet-connected, worldwide community. As a result, the open source model often builds higher quality, more secure, more easily integrated software. It does so at a vastly accelerated pace and often at a lower cost.

So why do some industry analysts ignore it? There’s no denying that there are capitalist and financial reasons. I think if an industry analyst actually came out and said that an open source solution is the best, it would be career suicide: the old school would shun the analyst, making him less relevant. The way the industry pays and promotes analysts (and vice versa) seems to favor enterprise application vendors.

Yet the open source community, along with Talend, has developed a very strong data management offering that should be considered top of its class. The solution leverages other cutting-edge technology. To name just a few examples:
• if you want to scale up, you can use distributed platform technology from Hadoop, which enables it to work with thousands of nodes and petabytes of data
• it includes very strong, enterprise-class data profiling
• it offers matching that users can actually use and tune without having to jump between multiple applications
• the platform grows with your data management strategy, so that if your future is MDM, you can move there seamlessly without having to learn a new GUI
    The way we do business today has changed. Innovation can only happen when ideas have sex, as Matt Ridley puts it. As long as we’re engaged in exchange and specialization, we will achieve those new levels of innovation.

    Monday, August 9, 2010

    Data Quality Pro Discussion

Last week I sat down with Dylan Jones of DataQualityPro.com to talk about data governance. Here is the replay. We discussed a range of topics including organic governance approaches, the challenges of defining data governance, industry adoption trends, policy enforcement vs. legislation, and much more.

    Link

    Friday, April 9, 2010

    Links from my eLearning Webinar

    I recently delivered a webinar on the Secrets of Affordable Data Governance. In the webinar, I promised to deliver links for lowering the costs of data management.  Here are those links:

    • Talend Open Source - Download free data profiling, data integration and MDM software.
    • US Census - Download census data for cleansing of city name and state with latitude and longitude appends.
    • Data.gov - The data available from the US government.
    • Geonames - Postal codes and other location reference data for almost every country in the world.
    • GRC Data - A source of low-cost customer reference data, including names, addresses, salutations, and more.
    • Regular Expressions - Check the shape of data in profiling software or within your database application.
    If you search on the term "download reference data", you will find many other sources.
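Picking up on the regular-expressions item above, here is a minimal sketch of a shape check in Python, using US ZIP codes as the assumed format; swap in whatever pattern matches your own reference data.

```python
import re

# Assumed rule: a US ZIP code is five digits, optionally followed by -4 digits.
ZIP_SHAPE = re.compile(r"^\d{5}(-\d{4})?$")

values = ["02139", "02139-4307", "2139", "ABCDE", ""]
failures = [v for v in values if not ZIP_SHAPE.match(v)]

print(f"{len(failures)} of {len(values)} values fail the shape check: {failures}")
```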

    Friday, April 2, 2010

    Donating the Data Quality Asset

    If you believe like I do that proper data management can change the world, then you have to start wondering if it’s time for all us data quality professionals to stand up and start changing it.

It’s clear that every organization, no matter its size or influence, can benefit from properly managing its data. Even charitable organizations benefit from a cleaner contact list to get the word out when they need donations. Non-profits that handle charitable goods benefit from better data in their inventory management. If food banks had a better way of managing data and soliciting volunteers, wouldn’t more people be fed? If churches kept better records of their members, would their positive influence be more widespread? If organizations that accept donated goods kept a better inventory system, wouldn’t more people benefit? The data asset is not limited to Fortune 1000 companies, but until recently, solutions to manage data properly were only available to the elite.

Open source is coming on strong, and it makes it far easier for us to donate data quality work. In the past, it may have been a challenge to get mega-vendors to donate high-end solutions, but these days we can make significant progress on the data quality problem with little or no solution cost. Solutions like Talend Open Profiler, Talend Open Studio, Pentaho and DataCleaner offer data integration and data profiling.

    In my last post, I discussed the reference data that is now available for download.  Reference data used to be proprietary and costly. It’s a new world – a better one for low-cost data management solutions.

    Can we save the world through data quality?  If we can help good people spread more goodness, then we can. Let’s give it a try.

    Tuesday, February 16, 2010

    The Secret Ingredient in Major IT Initiatives

    One of my first jobs was that of assistant cook at a summer camp.  (In this case, the term ‘cook’ was loosely applied meaning to scrub pots and pans for the head cook.) It was there I learned that most cooks have ingredients that they tend to use more often.  The cook at Camp Marlin tended to use honey where applicable.  Food TV star Emeril likes to use garlic and pork fat.  Some cooks add a little hot pepper to their chocolate recipes – it is said to bring out the flavor of the chocolate.  Definitely a secret ingredient.
For head chefs taking on major IT initiatives, the secret ingredient is always data quality technology. Attention to data quality doesn’t make up the whole recipe of an IT initiative so much as it makes the initiative better. Let’s take a look at how this happens.

    Profiling
No matter what the project, data profiling provides a complete understanding of the data before the project team attempts to migrate it, which helps the team create a more accurate integration plan. Migrating data to your new solution as-is, on the other hand, is ill-advised: it can lead to major cost overruns and project delays as you load and reload the data.

    Customer Relationship Management (CRM)
By using data quality technology in CRM, the organization benefits from a cleaner customer list with fewer duplicate records. Data quality technology can run as a real-time process, limiting the typos and duplicates that get into the system and thus improving call center efficiency. Data profiling can also help an organization understand and monitor the quality of a purchased list before integration, avoiding issues with third-party data.
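As a rough sketch of the real-time idea, the snippet below rejects an obvious duplicate at entry time by building a normalized match key from name and e-mail. It is a toy: commercial data quality tools use much fuzzier matching, and the function names and sample data here are invented.

```python
import re

# Toy real-time duplicate check: build a normalized match key from name and
# e-mail before a new contact is inserted into the CRM.
existing_keys: set[tuple[str, str]] = set()

def match_key(name: str, email: str) -> tuple[str, str]:
    """Lowercase, strip punctuation, and sort name tokens so word order is ignored."""
    tokens = re.sub(r"[^a-z ]", " ", name.lower()).split()
    return " ".join(sorted(tokens)), email.strip().lower()

def add_contact(name: str, email: str) -> bool:
    """Return False (and skip the insert) when an equivalent contact already exists."""
    key = match_key(name, email)
    if key in existing_keys:
        return False
    existing_keys.add(key)
    return True

print(add_contact("Smith, John", "JSmith@example.com"))  # True: first time seen
print(add_contact("john smith", "jsmith@example.com"))   # False: caught as a duplicate
```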

    Enterprise Resource Planning (ERP) and Supply Chain Management (SCM)

If data is accurate, you will have a more complete picture of the supply chain. Data quality technology can be used to report inventory levels more accurately, lowering inventory costs. When you make it part of your ERP project, you may also be able to improve bargaining power with suppliers by gaining better intelligence about your corporate buying power.

Data Warehouse and Business Intelligence
Data quality helps disparate data sources act as one when they are migrated to a data warehouse. By standardizing disparate data, data quality makes the data warehouse possible, and you will be able to generate more accurate reports when analyzing sales patterns, revenue, customer demographics and more.

    Master Data Management (MDM)
Data quality is a key component of master data management. An integral part of making applications communicate and share data is having standardized data. MDM builds on the basic premise of data quality with additional features like persistent keys, a graphical user interface for managing matching, the ability to publish and subscribe to enterprise applications, and more.

    So keep in mind, when you decide to improve data quality, it is often because of your need to make a major IT initiative even stronger.  In most projects, data quality is the secret ingredient to make your IT projects extraordinary.  Share the recipe.

    Thursday, January 21, 2010

    ETL, Data Quality and MDM for Mid-sized Business


Is data quality a luxury that only large companies should be able to afford? Of course the answer is no. Your company should be paying attention to data quality whether you are a Fortune 1000 company or a startup. Like a toothache, poor data quality will never get better on its own.

    As a company naturally grows, the effects of poor data quality multiply.  When a small company expands, it naturally develops new IT systems. Mergers often bring in new IT systems, too. The impact of poor data quality slowly invades and hinders the company’s ability to service customers, keep the supply chain efficient and understand its own business. Paying attention to data quality early and often is a winning strategy for even the small and medium-sized enterprise (SME).

However, SMEs face challenges with the investment needed for enterprise-level software. While it’s true that the benefit often outweighs the costs, it is difficult for the typical SME to invest in the licenses, maintenance and services needed to implement a major data integration, data quality or MDM solution.

    At the beginning of this year, I started with a new employer, Talend. I became interested in them because they were offering something completely different in our world – open source data integration, data quality and MDM.  If you go to the Talend Web site, you can download some amazing free software, like:
    • a fully functional, very cool data integration package (ETL) called Talend Open Studio
    • a data profiling tool, called Talend Open Profiler, providing charts and graphs and some very useful analytics on your data
    The two packages sit on top of a database, typically MySQL – also an open source success.

    For these solutions, Talend uses a business model similar to what my friend Jim Harris has just blogged about – Freemium. Under this new model, free open source content is made available to everyone—providing the opportunity to “up-sell” premium content to a percentage of the audience. Talend works like this.  You can enhance your experience from Talend Open Studio by purchasing Talend Integration Suite (in various flavors).  You can take your data quality initiative to the next level by upgrading Talend Open Profiler to Talend Data Quality.

    If you want to take the combined data integration and data quality to an even higher level, Talend just announced a complete Master Data Management (MDM) solution, which you can use in a more enterprise-wide approach to data governance. There’s a very inexpensive place to start and an evolutionary path your company can take as it matures its data management strategy.

The solutions have been made possible by the combined efforts of the open source community and Talend, the corporation. If you’d like, you can take a peek at some source code, use the basic software and try your hand at coding an enhancement. Sharing that enhancement with the community will only lead to a world full of better data, and that’s a very good thing.

    Friday, January 2, 2009

    Building a More Powerful Data Quality Scorecard

Most data governance practitioners agree that a data quality scorecard is an important tool in any data governance program. It provides comprehensive information about the quality of the data in a database and, perhaps even more importantly, allows business users and technical users to collaborate on the quality issue.

    However, if we show that 7% of all tables have data quality issues, the number is useless - there is no context. You can’t say whether it is good or bad, and you can’t make any decisions based on this information. There is no value associated with the score.

In an effort to improve processes, data governance teams should roll the data up into progressively higher-level metrics. In their book “Journey to Data Quality”, authors Lee, Pipino, Funk and Wang correctly suggest that making the measurements quantifiable and traceable provides the next level of transparency to the business. The metrics may be rolled up into a completeness rating, for example: if your database contains 100,000 name-and-address postal codes and 3,500 records are incomplete, then 3.5% of your postal codes fail and 96.5% pass. Similar simple formulas exist for accuracy, correctness, currency and relevance, too. However, this first aggregation still doesn’t support data governance, because business users don’t think that way. They have processes that are supported by data, and it’s still a stretch for them to figure out why this all matters.
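The completeness arithmetic is simple enough to show directly. Here is a sketch in Python using the numbers from the paragraph above; the function name is just for illustration.

```python
def dimension_score(total_records: int, failing_records: int) -> float:
    """Percentage of records that pass a given data quality rule."""
    return (total_records - failing_records) / total_records * 100

# The completeness example from the text: 100,000 postal codes, 3,500 incomplete.
completeness = dimension_score(100_000, 3_500)
print(f"{completeness:.1f}% pass, {100 - completeness:.1f}% fail")  # 96.5% pass, 3.5% fail

# The same ratio works for accuracy, correctness, currency and relevance once
# you can count the records that violate each rule.
```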

    Views of Data Quality Scorecard
    Your plan must be to make data quality scorecards for different internal audiences - marketing, IT, c-level, etc.

The aggregation might look something like this: you must design the scorecards to meet the interests of the different audiences, from technical through to business and up to executive. At the base of a data quality scorecard is information about the data quality of individual data records; this is the default information that most profilers deliver out of the box. As you aggregate scores, the high-level measures of data quality become more meaningful. In the middle are various score sets that let your company analyze and summarize data quality from different perspectives. If you define the objective of a data quality assessment project as calculating these different aggregations, you will have a much easier time maturing your data governance program. The business users and the c-level will begin to pay attention.

Business users are looking for whether the data supports the business process. They want to know if the data is facilitating compliance with laws. They want to decide whether their programs are “Go”, “Caution” or “Stop”, like a traffic light. They want to know whether the current processes are giving them good data so they can change those processes if necessary. You can only do this by aggregating the information quality results and aligning those results with the business.

    Friday, June 6, 2008

    Data Profiling and Big Brown

    Big Brown is positioned to win the third leg of the Triple Crown this weekend. In many ways picking a winner for a big thoroughbred race is similar to planning for a data quality project. Now, stay with me on this one.

When making decisions on projects, we need statistics and analysis. With horse racing, we have a nice report already compiled for us: the daily racing form. It contains just about all the analysis we need to make a decision. With data-intensive projects, you’ve got to do the analysis up front in order to win. We use data profiling tools to gather a wide array of metrics so we can make reasonable decisions. As in our daily racing form, we look for anomalies, trends, and ways to cash in.

In data governance project planning, where company-wide projects abound, we may even have the opportunity to pick the projects that will deliver the highest return on investment. It’s similar to picking a winner at 10:1 odds. We may decide to bet our strategy on a big winner, and when that horse comes in, we’ll win big for our company.

    Now needless to say, neither the daily racing form nor the results of data profiling are completely infallible. For example, Big Brown’s quarter crack in his hoof is something that doesn’t show up in the data. Will it play a factor? Does newcomer Casino Drive, for whom there is very little data available, have a chance to disrupt our Big Brown project? In data intensive projects, we must communicate, bring in business users to understand processes, study and prepare contingency plans in order to mitigate risks from the unknown.

    So, Big Brown is positioned to win the Triple Crown this weekend. Are you positioned to win on your next data intensive IT project? You can better your chances by using the daily racing form for data governance – a data profiling tool.

    Monday, May 19, 2008

    Unusual Data Quality Problems

    When I talk to folks who are struggling with data quality issues, there are some who are worried that they have data unlike any data anyone has ever seen. Often there’s a nervous laugh in the voice as if the data is so unusual and so poor that an automated solution can’t possibly help.

    Yes, there are wide variations in data quality and consistency and it might be unlike any we’ve seen. On the other hand, we’ve seen a lot of unusual data over the years. For example:

    • A major motorcycle manufacturer used data quality tools to pull out nicknames from their customer records. Many of the names they had acquired for their prospect list were from motorcycle events and contests where the entries were, shall we say, colorful. The name fields contained data like “John the Mad Dog Smith” or “Frank Motor-head Jones”. The client used the tool to separate the name from the nickname, making it a more valuable marketing list.
• One major utility company used our data quality tools to identify and record notations on meter-reader records that were important to keep for operational use, but did not belong in the customer billing record. Upon analysis of the data, the company noticed random text like “LDIY” and “MOR” alongside the customer records. After some investigation, they figured out that LDIY meant “Large Dog in Yard”, which was particularly important for meter readers, and MOR meant “Meter on Right”, which was also valuable. The meter readers were given their own notes field so they could maintain the integrity of the name and address while keeping this valuable data. It probably saved a lot of meter readers from dog bites.
    • Banks have used our data quality tools to separate items like "John and Judy Smith/221453789 ITF George Smith". The organization wanted to consider this type of record as three separate records "John Smith" and "Judy Smith" and "George Smith" with obvious linkage between the individuals. This type of data is actually quite common on mainframe migrations.
    • A food manufacturer standardizes and cleanses ingredient names to get better control of manufacturing costs. In data from their worldwide manufacturing plants, an ingredient might be “carrots” “chopped frozen carrots” “frozen carrots, chopped” “chopped carrots, frozen” and so on. (Not to mention all the possible abbreviations for the words carrots, chopped and frozen.) Without standardization of these ingredients, there was really no way to tell how many carrots the company purchased worldwide. There was no bargaining leverage with the carrot supplier, and all the other ingredient suppliers, until the data was fixed.

Not every data quality solution can handle these types of anomalies; many will pass such “odd” values through without attempting to cleanse them. It’s key to have a system that learns from your data and lets you develop business rules that meet the organization’s needs.
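As a toy example of such a business rule, here is a sketch in Python that collapses the carrot variations from the list above into one standard form; the synonym map and the sort-the-tokens trick are illustrative assumptions, not how any particular product does it.

```python
# Toy standardization rule for the ingredient example above: expand assumed
# abbreviations to canonical words, then sort tokens so word order is ignored.
SYNONYMS = {"frz": "frozen", "chpd": "chopped", "carrot": "carrots"}

def standardize(ingredient: str) -> str:
    tokens = ingredient.lower().replace(",", " ").split()
    return " ".join(sorted(SYNONYMS.get(t, t) for t in tokens))

raw = ["chopped frozen carrots", "frozen carrots, chopped", "Carrot, chpd frz"]
print({standardize(r) for r in raw})  # all three collapse to {'carrots chopped frozen'}
```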

Now there are times, quite frankly, when data gets so bad that automated tools can do nothing about it, and that’s where data profiling comes in. Before you attempt to cleanse or migrate data, you should profile it to get a complete understanding of it. This lets you weigh the cost of fixing very poor data against the value it will bring to the organization.

    Thursday, March 27, 2008

    Mergers and Acquisitions: Data's Influence on Company Value

    Caveat Emptor! Many large companies have a growth strategy that includes mergers and acquisitions, but many are missing a key negotiating strategy during the buying process.

    If you’re a big company, buying other companies in your market brings new customers into your fold. So, rather than paying for a marketing advertising campaign to get new customers, you can buy them as part of an acquisition. Because of this, most venture capitalists and business leaders know that two huge factors in determining a company’s value during an acquisition are the customer and prospect lists.

Having said that, it’s strange how little this is examined in the buy-out process. Before they buy, companies look at certain assets under a microscope: tangible assets like buildings and inventory are examined, human assets like the management staff are given a strong look, and cash flow is audited with due diligence. But data assets are often given only a hasty, passing glance.

Data assets quickly dissolve when the company being acquired has data quality issues. It’s not uncommon for a company to have 20%, 40%, or even 50% customer duplication (or near duplicates) in its database, for example. So, if you think you’re getting 100,000 new customers, you may actually be getting 50,000 after you cleanse. It’s also common for actual inventory levels in the physical warehouse to be misaligned with the inventory levels in the ERP systems. This too may be due to data quality issues, and it can lead to surprises after the acquisition.

So what can you do as an acquiring company to mitigate these risks? The key is due diligence on the data. Ask to profile the data of the company you’re going to buy. Bring in your team, or hire a third party, to examine the data. Look at the customer data, the inventory data, the supply chain data, or whatever data is a valuable asset in the acquisition. If privacy and security are an issue, the results of the profiling can usually be rolled up into charts and graphs that give you a picture of the status of the organization’s information.

In my work with Trillium Software, I have talked to customers who have saved millions in acquisition costs by evaluating the data prior to buying a company. Some have gone so far as to evaluate the overlap between their own customer base and the new customer base to determine value. Why pay for a customer when (s)he is already on your customer list?

Profiling lets you set up business rules that are important to your company. Does each record have a valid tax ID number? What percentage of the database contact information is null? How many bogus e-mails appear? Does the data make sense, or are there a lot of near duplicates and misfielded data? In inventory data, how structured or unstructured are the descriptions? All of these questions can quickly be answered with data profiling technology, and all of these technical findings can be translated into business value, and therefore negotiating value, for your company.
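Those business rules translate directly into profiling metrics. Here is a hedged sketch in Python using pandas; the file name, the column names and the nine-digit tax ID rule are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical extract of the target company's customer table.
customers = pd.read_csv("target_customers.csv")  # assumed columns: tax_id, email, phone

print(f"rows profiled: {len(customers):,}")

metrics_pct = {
    "no contact info at all": customers[["email", "phone"]].isna().all(axis=1).mean() * 100,
    "valid 9-digit tax IDs":  customers["tax_id"].astype(str).str.fullmatch(r"\d{9}").mean() * 100,
    "bogus e-mails":          (~customers["email"].str.contains("@", na=False)).mean() * 100,
    "exact duplicate rows":   customers.duplicated().mean() * 100,
}
for rule, pct in metrics_pct.items():
    print(f"{rule:25} {pct:5.1f}%")
```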

The data governance teams I have met that have done this due diligence for their companies have become real superstars and very much a strategic part of their corporations. It’s easy for a CEO to see the value you bring when you can prove the company is paying the right price for an acquisition.

    Thursday, January 24, 2008

    The Rise of the Business-focused Data Steward


In a December 2007 research note from Gartner entitled “Best Practices for Data Stewardship”, Gartner gives some very practical and accurate advice on starting and executing a data steward program, and reiterates it in a press release issued this month. The advice is to have business people become your data stewards. So, in marketing, you assign someone as a data steward to work with IT. The business person knows the meaning of the data as well as where they want to go with it. They become responsible for the data, and owners of it.

It’s a great concept, and one that I expect will become more and more of a reality this year. However, some growth needs to happen in the software industry first. There are very few tools that serve a business-focused data steward; most offerings on the market are features tacked onto IT-focused tools. Sure, a data profiler can show some cool charts and graphs, but not many business users want to learn how to use one. Should a business user really have to learn about metadata, entities, and attributes in order to find out whether the data meets the needs of the organization?

Rather, a marketing person wants to know if (s)he can do an offer mailing without getting most of it back. A CIO wants to know if a customer database just acquired in a merger has complete and current information. Accounting wants to know that it has valid tax ID numbers (social security numbers) for customers to whom it extends credit, and the compliance team wants to know that it is stopping those on the OFAC list from opening accounts. Metadata? They don’t care. They just need the metrics to track the business problem.

    This was really the concept that Trillium Software had when we designed TS Insight, our data quality reporting tool. The tool uses business rules and analysis from our profiler and presents them in a very friendly way – via a web browser. The more technical users can set up regular updates that display compliance with the business rules. The less technical users can open their web browsers to their customized page and metrics that are important to them. The business rules can track pretty much anything about the data without being too technical.

    TS Insight is still in ramp-up for us. We came out with version 1.0 last year and we’re about to release version 2.5 this quarter. Still, we have a big head start on anyone else in the industry with this tool, serving the needs of the business-focused data steward. If this is something you’d like to see, please send me an e-mail and I’ll set up a demo.

    Disclaimer: The opinions expressed here are my own and don't necessarily reflect the opinion of my employer. The material written here is copyright (c) 2010 by Steve Sarsfield. To request permission to reuse, please e-mail me.