Tuesday, November 30, 2010

Match Mitigation: When Algorithms Aren’t Enough

I’d like to get a little technical on this post. I try to keep my posts business-friendly, but sometimes there's importance in detail. If none of this post makes any sense to you, I wrote a sort of primer on how matching works in many data quality tools, which you can get here.

Matching Algorithms
When you use a data quality tool, you’re often using matching algorithms and rules to make decisions on whether records match or not.  You might be using deterministic algorithms like Jaro, SoundEx and Metaphones. You might also be using probabilistic matching algorithms.

In many tools, you can set the rules to be tight where the software uses tougher criteria to determine a match, or loose where the software is not so particular. Tight and loose matches are important because you may have strict rules for putting records together, like customers of a bank, or not so strict rules, like when you’re putting together a customer list for marketing purposes.

What to do with Matches
Once data has been processed through the matcher, there are several possible outcomes. Between any two given records, the matcher may find:

  • No relationship
  • Match – the matcher found a definite match based on the criteria given
  • Suspect – the matcher thinks it found a match but is not confident. The results should be manually reviewed.
It’s that last category that the tough one.  Mitigating the suspect matches is the most time-consuming follow-up task after the matching is complete. Envision a million record database where you have 20,000 suspect matches.   That’s still going to take you some time to review.

Some of the newer (and cooler) tools offer strategies for dealing with suspect matches. The tools will present the suspect matches in a graphical user interface and allow users to pick which relationships are accurate and which are not. For example, Talend now offers a data stewardship console that lets you pick and choose records and attributes that will make up a best of breed record.

The goal, of course, is to not have suspect matches, so tuning the matches and limiting the suspect matches is the ultimate. The newest tools will make this easy. Some of the legacy tools make this hard.

Match mitigation is perhaps one of the most often overlooked processes of data quality. Don’t overlook it in your planning and processes.

Tuesday, November 16, 2010

Ideas Having Sex: The Path to Innovation in Data Management

I read a recent analyst report on the data quality market and “enterprise-class” data quality solutions. Per usual, the open source solutions were mentioned at a passing while the data quality solutions of the past were given high marks. Some of the solutions picked in the top originated from days when mainframe was king. Some of the top contenders still contained cobbled-together applications from ill-conceived acquisitions. It got me thinking about the way we do business today and how so much of it is changing.

Back in the 1990’s or earlier, if you had an idea for a new product, you’d work with an internal team of engineers and build the individual parts.  This innovation took time, as you might not always have exactly the right people working on the job.  It was slow and tedious. The product was always confined by its own lineage.

The Android phone market is a perfect examples of the modern way to innovate.  Today, when you want to build something groundbreaking like an Android, you pull in expertise from all around the world. Sure, Samsung might make the CPU and Video processing chips, but Primax Electronics in Taiwan might make the digital camera and Broadcomm in the US makes the touch screen, plus many others. Software vendors push the platform further with their cool apps. Innovation happens at break-neck speed because the Android is a collection of ideas that have sex and produce incredible offspring.

Isn’t that really the model of a modern company?  You have ideas getting together and making new ideas. When you have free exchange between people, there is no need to re-invent something that has already been invented. See the TED for more on this concept, where British author Matt Ridley argues that, through history, the engine of human progress and prosperity is "ideas having sex.”

The business model behind open source has a similar mission.  Open source simply creates better software. Everyone collaborates, not just within one company, but among an Internet-connected, worldwide community. As a result, the open source model often builds higher quality, more secure, more easily integrated software. It does so at a vastly accelerated pace and often at a lower cost.

So why do some industry analysts ignore it? There’s no denying that there are capitalist and financial reasons.  I think if an industry analyst were to actually come out and say that the open source solution is the best, it would be career suicide. The old-school would shun the analysts making him less relevant. The link between the way the industry pays and promotes analysts and vice versa seems to favor enterprise application vendors.

Yet the open source community along with Talend has developed a very strong data management offering that should be considered in the top of its class. The solution leverages other cutting edge solutions. To name just a few examples:
  • if you want to scale up, you can use distributed platform technology from Hadoop, which enables it to work with thousands of nodes and petabytes of data.
  • very strong enterprise class data profiling.  
  • matching that users can actually use and tune without having to jump between multiple applications.
  • a platform that grows with your data management strategy so that if your future is MDM, you can seamlessly move there without having to learn a new GUI.
The way we do business today has changed. Innovation can only happen when ideas have sex, as Matt Ridley puts it. As long as we’re engaged in exchange and specialization, we will achieve those new levels of innovation.

Disclaimer: The opinions expressed here are my own and don't necessarily reflect the opinion of my employer. The material written here is copyright (c) 2010 by Steve Sarsfield. To request permission to reuse, please e-mail me.