
Monday, April 2, 2012

Why Code Base is Important in Vendor Selection


The horticulture of software

Spring has sprung here in the northern hemisphere, and minds turn to the plant life that will be sprouting all across our home towns. The new growth has me thinking about the similarities between horticulture and the code base of our data management solutions.

Reviewing software solutions before you buy is a major effort for users and vendor selection committees. Much time is spent looking at whether the features of the product will meet team needs. Features are so important that companies will spend time producing RFPs with extensive feature lists. They may even require a proof of concept, where the vendor must install and test the solution in the purchaser’s work environment. This goes for applications used to manage data, but also for many other applications.

However, I believe that buyers should also carefully look at the growth pattern of the code base. In the data management field, we have undergone decades of technological change combined with decades of market consolidation. The code base for the application you’re about to buy may have grown from one of the following horticultural strategies:

  • Grafting – A large software company sees potential in the data management field and begins acquiring companies and grafting them together to create a solution. Sometimes the acquisition isn’t driven by technologists, but by upper management seeking to fill holes in the product line. Sometimes they even buy competing technologies, leaving everyone trying to figure out which will win. Sometimes the graft doesn’t take.
  • Old Growth – Companies have an existing technology that has worked for decades. However, back in 1990 when they released version 1.0, Java was experimental and not the dominant force it is today. FORTRAN was the preferred programming language and COBOL copybooks were the data model. I know some companies in the data management market have spent millions updating old growth code to be more competitive, and others who have not. This becomes a dilemma for all vendors at some point: when do you prune out the dead wood?
  • Sapling – Companies that are just breaking into the data management marketplace with a good-looking start. However, the sapling doesn’t yet have all the branches you want on it. Will the sapling survive among the other deciduous solutions in the market?

When you’re selecting a vendor, you ideally want a code base that is mature, but not too mature. You want limited grafting. The growth of the code and the grafting affect:

  • Speed of innovation for the vendor
  • Customization for you
  • Future expansion for both of you
  • The age and experience of the technologists necessary to operate it
  • Consulting requirements
  • Ability to cross-train personnel (e.g., DI people running DQ and vice versa)

So, when you’re selecting a data management solution, or any technology solution, don’t just compare the features, but take a look at how the product grew to where it is today.  Look for the solution in the optimal stage of growth that will meet your needs today and those for the future.


Monday, April 25, 2011

Data Quality Scorecard: Making Data Quality Relevant

Most data governance practitioners agree that a data quality scorecard is an important tool in any data governance program. It provides comprehensive information about the quality of data in a database and, perhaps even more importantly, allows business users and technical users to collaborate on quality issues.

However, there are multiple levels of metrics that you should consider:

1. Metrics that technologists use to fix data quality problems. Examples: 7% of the e-mail attribute is blank; 12% of the e-mail attribute does not follow standard e-mail syntax; 13% of our US mail addresses fail address validation.
2. Metrics that business people use to make decisions about the data. Examples: 9% of my contacts have invalid e-mails; 3% have both invalid e-mails and invalid addresses.
3. Metrics that managers use to get the big picture. Example: this customer data is good enough to use for a campaign.

All levels are important for the various members of the data governance team. Level one shows the steps you need to take to fix the data. Level two gives context to the task at hand. Level three tells the uninformed about the business issue without making them dig into the details.

So, when you’re building your DQ metrics, remember to roll the detailed metrics up into successively higher-level formulations. You must design the scorecards to meet the interests of the different audiences, from technical through business and up to executive. At the base of a data quality scorecard is information about the quality of individual data attributes. This is the default information that most profilers deliver out of the box. As you aggregate scores, the high-level measures of data quality become more meaningful. In the middle are various score sets allowing your company to analyze and summarize data quality from different perspectives. If you define the objective of a data quality assessment project as calculating these different aggregations, you will have a much easier time maturing your data governance program. The business users and C-level will begin to pay attention.
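
To make the roll-up concrete, here is a minimal Python sketch, assuming a simple in-memory contact list. The field names, the simplistic e-mail pattern and the 90% campaign threshold are illustrative assumptions, not features of any particular profiler.

```python
import re

# Illustrative contact records; in practice these come from your database.
contacts = [
    {"email": "ann@example.com", "address_valid": True},
    {"email": "bob@broken", "address_valid": False},
    {"email": "", "address_valid": True},
]

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # simplistic syntax check

# Level 1: attribute-level metrics that technologists act on
blank_pct = 100 * sum(1 for c in contacts if not c["email"]) / len(contacts)
bad_syntax_pct = 100 * sum(
    1 for c in contacts if c["email"] and not EMAIL_RE.match(c["email"])
) / len(contacts)

# Level 2: record-level metrics that business people reason about
invalid_pct = 100 * sum(
    1 for c in contacts
    if not EMAIL_RE.match(c["email"]) or not c["address_valid"]
) / len(contacts)

# Level 3: one big-picture judgment, using an assumed 90% fitness threshold
fit_for_campaign = invalid_pct <= 10

print(f"Level 1: {blank_pct:.0f}% blank, {bad_syntax_pct:.0f}% bad syntax")
print(f"Level 2: {invalid_pct:.0f}% of contacts are unusable")
print(f"Level 3: fit for campaign use? {fit_for_campaign}")
```

The same numbers roll up cleanly: the level two figure is computed from the level one checks, and the level three verdict is simply a threshold on level two.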

Tuesday, March 15, 2011

Open Source Data Management or Do-it-Yourself

With the tough economy, people are still cutting back on corporate spending. There is a sense of urgency to just get things done, and sometimes that can lead to hand-coding your own data integration, data quality or MDM functions. When you begin to develop your plans and strategies for data management, you have to think about all the hidden costs of getting solutions out of the box versus building your own.

Reusability is one key consideration. Using data management technologies that only plug into one system just doesn’t make sense. It’s difficult to get that reusability with custom code, unless your programmers have high visibility into other projects. On the other hand, all tool vendors, even open source ones, have pressure from their clients to support multiple databases and business solutions. Open source solutions are built to work in a wider variety of architectures. You can move your data management processes between JD Edwards, SAP and Salesforce, for example, with relative ease.

Indemnity is another consideration. What if something goes wrong with your home-grown solution after the chief architect leaves his job? Who are you going to call? If something goes wrong with your open source solution, you can turn to the community or call the vendor for support.

Long-term costs are yet another issue. Home-grown solutions have a tendency to start cheap and get more expensive as time goes on. It’s difficult to maintain custom code, especially if it is poorly documented. You end up hiring consultants to manage the code. Eventually, you have to rip and replace, and that can be costly.

You should consider your human resources, too. Does it make sense to have a team hand-coding database extractions and transformations, or would the total cost/benefit be better if you used an open source data integration tool? It might just free up some of your programmers to pursue more important, ROI-centric ventures.

If you’re thinking of cooking up your own technical solutions for data management, hoping to just get it done, think again. Your most economical solution might just be to leverage the community of experts and go with open source.

Friday, December 10, 2010

Six Data Management Predictions for 2011

This time of year everyone makes prognostications about the state of the data management field for 2011. I thought I’d take my turn by offering my predictions for the coming year.

Data will become more open
In the old days, good quality reference data was an asset kept in the corporate lockbox. If you had a good reference table for common misspellings of parts, cities, or names, for example, the mindset was to keep it close and prevent it from falling into the wrong hands. The data might have been sold for profit or simply not made available. Today, there really are no “wrong hands”. Governments and corporations alike are seeing the societal benefits of sharing information. More reference data is there for the taking on the internet from sites like data.gov and geonames.org. That trend will continue in 2011. Perhaps we’ll even see some of the bigger players make announcements about the availability of their data. Are you listening, Google?

Business and IT will become blurry
It’s becoming harder and harder to tell an IT guy from the head of marketing. That’s because in order to succeed, the IT folks need to become more like the marketers and vice versa. In the coming year, the difference will be less noticeable as business people get more and more involved in using data to their benefit. Newsflash one: if you’re in IT, you need marketing skills to pitch your projects and get funding. Newsflash two: if you’re in business, you need to know enough about data management practices to succeed.

Tools will become easier to use
As the business users come into the picture, they will need access to the tools to manage data.  Vendors must respond to this new marketplace or die.

Tools will do less heavy lifting
Despite the improvements in the tools, corporations will turn to improving processes and reporting in order to achieve better data management. Dwindling are the days when we’re dealing with data so poorly managed that it requires overly complicated data quality tools. We’re getting better at the data management process, and therefore the burden on the tools becomes lighter. Future tools will focus on supporting process improvement with workflow features, reporting and better graphical user interfaces.

CEOs and Government Officials will gain enlightenment
Feeding off the success of a few pioneers in data governance, as well as the failures of IT projects in our past, CEOs and government officials will gain enlightenment about managing their data and put teams in place to handle it. It has taken decades of sweet-talk and cajoling for governments and CEOs to achieve enlightenment, but I believe it is practically here.

We will become more reliant on data
Ten years ago, it was difficult to imagine where we are today with respect to our data addiction. Today, data is a pervasive part of our internet-connected society, living in our PCs, our TVs, our mobile phones and many other devices. It’s a huge part of our daily lives. As I’ve said in past posts, the world is addicted to data, and that bodes well for anyone who helps the world manage it. In 2011, no matter whether the economy turns up or down, our industry will continue to feed the addiction to good, clean data.

Friday, July 30, 2010

Deterministic and Probabilistic Matching White Paper

I’ve been busy this summer working on a white paper on record matching, the result of which is available on the Talend web site here.

The white paper is a primer containing the elementary principles of record matching. As the description says, it outlines the basic theories and strategies of record matching. It describes the nuances of deterministic and probabilistic matching and the algorithms used to identify relationships within records. It also covers the processes to employ in conjunction with matching technology to transform raw data into powerful information that drives success in enterprise applications like CRM, data warehouse and ERP.
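
The white paper covers the algorithms in depth; as a rough illustration of the core distinction (my own sketch, not the paper’s implementation), a deterministic match is all-or-nothing on a normalized key, while a probabilistic match computes a weighted similarity score and applies a threshold. The weights, the threshold and the standard-library SequenceMatcher below are stand-ins for production matching algorithms such as Jaro-Winkler.

```python
from difflib import SequenceMatcher

def deterministic_match(a, b):
    """Exact match on a normalized key: all-or-nothing."""
    key = lambda r: (r["last"].lower(), r["zip"])
    return key(a) == key(b)

def probabilistic_match(a, b, threshold=0.85):
    """Weighted field similarity; declare a match above a threshold."""
    name_sim = SequenceMatcher(None, a["last"].lower(), b["last"].lower()).ratio()
    zip_sim = 1.0 if a["zip"] == b["zip"] else 0.0
    score = 0.7 * name_sim + 0.3 * zip_sim  # illustrative weights
    return score >= threshold, round(score, 2)

r1 = {"last": "Sarsfield", "zip": "01801"}
r2 = {"last": "Sarsfeild", "zip": "01801"}  # transposed letters

print(deterministic_match(r1, r2))   # False: the normalized keys differ
print(probabilistic_match(r1, r2))   # (True, ~0.92): similar enough to match
```

The transposed-letter example shows why both approaches exist: deterministic matching misses the duplicate entirely, while probabilistic matching catches it at the cost of tuning weights and thresholds.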

Friday, April 9, 2010

Links from my eLearning Webinar

I recently delivered a webinar on the Secrets of Affordable Data Governance. In the webinar, I promised to deliver links for lowering the costs of data management.  Here are those links:

  • Talend Open Source - Download free data profiling, data integration and MDM software.
  • US Census - Download census data for cleansing of city name and state with latitude and longitude appends.
  • Data.gov - The data available from the US government.
  • Geonames - Postal codes and other location reference data for almost every country in the world.
  • GRC Data - A source of low-cost customer reference data, including names, addresses, salutations, and more.
  • Regular Expressions - Check the shape of data in profiling software or within your database application (see the sketch after this list).
If you search on the term "download reference data", you will find many other sources.
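
To make the regular expressions item above concrete, here is a minimal Python sketch of a shape check. The patterns are illustrative and should be tightened to your own data standards.

```python
import re

# Illustrative shape patterns; tune these to your own standards.
PATTERNS = {
    "us_zip": re.compile(r"^\d{5}(-\d{4})?$"),    # 02144 or 02144-1234
    "phone": re.compile(r"^\d{3}-\d{3}-\d{4}$"),  # 617-555-0123
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}

def shape_report(values, pattern_name):
    """Return the fraction of values matching the expected shape."""
    pattern = PATTERNS[pattern_name]
    return sum(1 for v in values if pattern.match(v)) / len(values)

zips = ["02144", "2144", "02144-1234", "ABCDE"]
print(f"{shape_report(zips, 'us_zip'):.0%} of ZIP codes match")  # 50%
```

The same patterns can typically be reused inside profiling tools or, in many databases, in column constraints, so the shape rule lives alongside the data.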

Monday, October 12, 2009

Data May Require Unique Data Quality Processes


Many things in life have the same appearance, but the details can vary widely. For example, planets and stars look the same in the night sky, but traveling to them and surviving once you get there are two completely different problems. It’s only when you get close to your destination that you can see the difference.

All data quality projects can appear the same from afar but ultimately can be as different as stars and planets. One of the biggest ways they vary is in the data itself and whether it is chiefly made up of name and address data or some other type of data.

Name and Address Data
A customer database or CRM system contains data that we know much about. We know that letters will be transposed, names will be comma-reversed, postal codes will be missing and more. There are millions of things that good data quality tools know about broken name and address data, since so many name and address records have been processed over the years. Over time, business rules and processes have been fine-tuned for name and address data, and methods of matching names and addresses have become more and more powerful.

Data quality solutions also understand what names and addresses are supposed to look like, since the postal authorities provide the correct formatting. If you’re somewhat precise about following the rules of the postal authorities, most mail makes it to its destination. If you’re very precise, the postal services can offer discounts. The rules are clear in most parts of the world, and everyone follows the same rules for name and address data because it makes for better efficiency.

So, if we know what the broken item looks like and we know what the fixed item is supposed to look like, we can design and develop processes that involve trained, knowledgeable workers and automated solutions to solve real business problems. There’s knowledge inherent in the system, and you don’t have to start from scratch every time you want to cleanse it.
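
As a minimal sketch of what that built-in knowledge looks like in code, here are two illustrative name-and-address rules in Python. Real data quality tools encode thousands of such rules; the function names and patterns here are my own assumptions.

```python
import re

def fix_comma_reversed(name):
    """'Smith, John' -> 'John Smith': a well-known breakage pattern."""
    if "," in name:
        last, _, first = name.partition(",")
        return f"{first.strip()} {last.strip()}"
    return name

def zip_is_suspect(record):
    """The postal authority defines the expected shape, so we can test for it."""
    return not re.match(r"^\d{5}(-\d{4})?$", record.get("zip", ""))

record = {"name": "Smith, John", "zip": ""}
record["name"] = fix_comma_reversed(record["name"])  # 'John Smith'
print(record["name"], "| suspect ZIP:", zip_is_suspect(record))
```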

ERP, Supply Chain Data
However, when we look at other types of data domains, the picture is very different. There isn’t a clear body of knowledge about what is typically input and what the output should look like, so you must establish those processes yourself. In supply chain or ERP data, we can’t immediately see why the data is broken or what we need to do to fix it. ERP data is likely to be a history lesson in your company’s origins, the acquisitions that were made, and the partnership changes throughout the years. We don’t immediately have an idea of how the data should ultimately look. The data in this world is specific to one client or a single use scenario and cannot be handled by existing out-of-the-box rules.

With this type of data, you may find the need to collaborate more with the business users of the data, whose expertise in determining the correct context for the information enables you to effect change more rapidly. Because of the inherent unknowns about the data, few of the steps for fixing the data are done for you ahead of time. It then becomes critical to establish a methodology for:
  • Data profiling, to understand the issues and challenges (see the sketch after this list)
  • Discussions with the users of the data to understand context, how it’s used and the most desired representation. Since there are few governing bodies for ERP and supply chain data, the corporation and its partners must often come up with an agreed-upon standard.
  • Setting up business rules, usually from scratch, to transform the data
  • Testing the data in the new systems
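
As a sketch of the profiling step in the first bullet, here is a minimal Python profile of a single column, assuming simple in-memory values. Real profilers run against the database; the pattern generalization (digits to 9, letters to A) is just one common approach, and the column values are invented for illustration.

```python
from collections import Counter
import re

def profile_column(values):
    """Basic profile: null rate, distinct count and value-pattern frequencies."""
    total = len(values)
    nulls = sum(1 for v in values if v in (None, ""))
    # Generalize each value into a shape pattern: digits -> 9, letters -> A
    patterns = Counter(
        re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", v)) for v in values if v
    )
    return {
        "null_pct": 100 * nulls / total,
        "distinct": len(set(v for v in values if v)),
        "top_patterns": patterns.most_common(3),
    }

part_numbers = ["AB-1234", "AB-1299", "ab1234", "", "XY-77"]
print(profile_column(part_numbers))
# {'null_pct': 20.0, 'distinct': 4, 'top_patterns': [('AA-9999', 2), ...]}
```

A profile like this makes the discussion with business users concrete: you can show them the competing shapes in the data and ask which one should become the agreed-upon standard.
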
I write about this because I’ve read so much on the topic lately. As practitioners, you should be aware that the problem is not the same across all domains. While you can generally solve name and address data problems with a technology focus, you will often rely more on collaboration with subject matter experts to solve issues in other data domains.

Disclaimer: The opinions expressed here are my own and don't necessarily reflect the opinion of my employer. The material written here is copyright (c) 2010 by Steve Sarsfield. To request permission to reuse, please e-mail me.