Tuesday, November 30, 2010

Match Mitigation: When Algorithms Aren’t Enough

I’d like to get a little technical in this post. I try to keep my posts business-friendly, but sometimes the details matter. If none of this post makes any sense to you, I wrote a sort of primer on how matching works in many data quality tools, which you can get here.

Matching Algorithms
When you use a data quality tool, you’re often using matching algorithms and rules to decide whether records match. You might be using deterministic algorithms like Jaro, Soundex, and Metaphone. You might also be using probabilistic matching algorithms.
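To give you the flavor of these algorithms, here’s a minimal sketch in Python. The standard library’s difflib ratio stands in for a Jaro-style edit similarity, and the Soundex below is a simplified hand-rolled version; real tools ship tuned implementations of all of these.

```python
import difflib

def soundex(name: str) -> str:
    """Simplified American Soundex: first letter plus three digits."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    first, tail = name[0], name[1:]
    digits = []
    prev = codes.get(first, "")
    for ch in tail:
        code = codes.get(ch, "")
        if code and code != prev:
            digits.append(code)
        if ch not in "hw":           # h and w don't reset the previous code
            prev = code
    return (first.upper() + "".join(digits) + "000")[:4]

# Two spellings of the same surname: edit-based vs. phonetic comparison
a, b = "Sarsfield", "Sarsfeild"
print(difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio())  # ~0.89
print(soundex(a), soundex(b))    # S621 S621 -- phonetically identical
```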

In many tools, you can set the rules to be tight, where the software uses tougher criteria to determine a match, or loose, where the software is not so particular. Tight and loose matches are important because some jobs demand strict rules for putting records together, like consolidating the customers of a bank, while others can tolerate looser rules, like assembling a customer list for marketing purposes.
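As an illustration, if we assume the tool boils each comparison down to a 0-to-1 similarity score (the cutoffs below are made up), tight and loose amount to two different thresholds on that score:

```python
# Hypothetical thresholds -- real tools expose these as match rule settings
TIGHT = 0.95   # e.g., bank customer consolidation: only near-certain matches
LOOSE = 0.80   # e.g., marketing list dedupe: tolerate some over-matching

def is_match(score: float, threshold: float) -> bool:
    return score >= threshold

score = 0.88
print(is_match(score, TIGHT))   # False -- too risky to merge bank records
print(is_match(score, LOOSE))   # True  -- fine for a mailing list
```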

What to do with Matches
Once data has been processed through the matcher, there are several possible outcomes. Between any two given records, the matcher may find one of the following (sketched in code after the list):

  • No relationship – the matcher sees no evidence the records are related
  • Match – the matcher found a definite match based on the criteria given
  • Suspect – the matcher thinks it found a match but is not confident. The results should be manually reviewed.
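
Here’s a minimal sketch of how a matcher might bucket a pairwise similarity score into these three outcomes. The thresholds are illustrative, not taken from any particular tool:

```python
def classify(score: float, match_at: float = 0.92, suspect_at: float = 0.80) -> str:
    """Bucket a pairwise similarity score using two illustrative thresholds."""
    if score >= match_at:
        return "match"             # confident enough to link automatically
    if score >= suspect_at:
        return "suspect"           # queue for manual review
    return "no relationship"

for s in (0.97, 0.85, 0.40):
    print(s, "->", classify(s))    # match, suspect, no relationship
```
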
It’s that last category that’s the tough one. Mitigating the suspect matches is the most time-consuming follow-up task after the matching is complete. Envision a million-record database where you have 20,000 suspect matches. Even at a brisk 30 seconds per pair, that’s roughly 170 hours of review.

Some of the newer (and cooler) tools offer strategies for dealing with suspect matches. They present the suspect matches in a graphical user interface and let users pick which relationships are accurate and which are not. For example, Talend now offers a data stewardship console that lets you pick and choose the records and attributes that will make up a best-of-breed record.
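Talend’s console does this through a point-and-click interface, but the underlying survivorship idea is easy to sketch. To be clear, this is not Talend’s API; the field names and the “most recent non-empty value wins” rule are assumptions for illustration only:

```python
# Hypothetical survivorship rules: for each attribute, pick the "best"
# value across a cluster of records the matcher has linked together.
records = [
    {"name": "J. Smith",   "phone": "",         "updated": "2009-01-15"},
    {"name": "John Smith", "phone": "555-0100", "updated": "2010-06-02"},
]

def best_of_breed(cluster):
    survivor = {}
    for field in ("name", "phone"):
        candidates = [r for r in cluster if r[field]]
        # rule: prefer the most recently updated non-empty value
        survivor[field] = (max(candidates, key=lambda r: r["updated"])[field]
                           if candidates else "")
    return survivor

print(best_of_breed(records))  # {'name': 'John Smith', 'phone': '555-0100'}
```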

The goal, of course, is to have no suspect matches at all, so tuning the match rules to limit suspects is the ultimate aim. The newest tools make this easy; some of the legacy tools make it hard.
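One simple tuning exercise is to re-run the matcher with different cutoffs and count how many pairs would land in the suspect band. A toy sketch with made-up scores:

```python
# Illustrative tuning loop: widen or narrow the suspect band and count the
# pairs that would land in manual review. These scores are invented.
scores = [0.99, 0.95, 0.91, 0.88, 0.84, 0.81, 0.77, 0.60]

def suspect_count(scores, match_at, suspect_at):
    return sum(1 for s in scores if suspect_at <= s < match_at)

for match_at, suspect_at in [(0.95, 0.80), (0.92, 0.85), (0.90, 0.88)]:
    print(f"match>={match_at}, suspect>={suspect_at}: "
          f"{suspect_count(scores, match_at, suspect_at)} suspects to review")
```

The narrower the band between the match threshold and the suspect threshold, the less manual review you create, at the risk of auto-merging pairs that deserved a second look.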

Match mitigation is perhaps one of the most often overlooked processes in data quality. Don’t overlook it in your planning and processes.

4 comments:

Frank Harland said...

I won't forget Steve. I'm glad I now know the term for something I did a lot back in 1995: Match Mitigation.
And you're absolutely right about the real data quality work being labour-intensive, from the Match Mitigation process all the way up to defining good data architectures and setting up Data Governance. Very little can be left to a software tool.

Dylan Jones said...

Interesting post Steve. I know you guys at Talend have been working on this, so I need to take a fresh look.

I think a natural step for these tools is to create a richer matching experience, pulling in data from social media networks perhaps, past records, other media - giving the steward more control over unstructured data to make that decision far more quickly.

Naiem Yeganeh said...

Thank you for this post. One thing that is underestimated is the ability for users to easily identify matching exceptions. For example, you know that Kath and Kate are two different people even though they might be a close match. I have seen this especially when people are trying to correct postcodes and suburbs against a master data source.

Steve Sarsfield said...

We humans can pull in our vast intelligence and experience to make match decisions. We have a level of understanding of people and places in the world that few computer systems possess. Therefore, one strategy is to leave the easy decisions up to the match algorithms, allowing us to handle the more challenging ones.
As long as you can teach the technology, so you don't have to keep making the same decisions, it's an ideal way to achieve match accuracy.

Disclaimer: The opinions expressed here are my own and don't necessarily reflect the opinion of my employer. The material written here is copyright (c) 2010 by Steve Sarsfield. To request permission to reuse, please e-mail me.