Showing posts with label matching algorithms.

Tuesday, November 30, 2010

Match Mitigation: When Algorithms Aren’t Enough

I’d like to get a little technical in this post. I try to keep my posts business-friendly, but sometimes the details matter. If none of this post makes sense to you, I wrote a sort of primer on how matching works in many data quality tools, which you can get here.

Matching Algorithms
When you use a data quality tool, you’re often using matching algorithms and rules to decide whether records match. You might be using deterministic algorithms like Jaro, Soundex and Metaphone. You might also be using probabilistic matching algorithms.
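
To make that concrete, here is a minimal sketch in Python of what deterministic comparisons look like. The soundex function below is a simplified Soundex, and the similarity score uses Python’s difflib as a stand-in for an edit-distance-style measure such as Jaro; it is illustrative only, not any particular tool’s implementation.

# A minimal sketch of deterministic comparisons (not any vendor's implementation).
# soundex() is a simplified American Soundex; similarity() uses difflib as a
# stand-in for an edit-distance-style score such as Jaro.
from difflib import SequenceMatcher

def soundex(name: str) -> str:
    """Simplified Soundex: first letter plus three digits."""
    mapping = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
               **dict.fromkeys("DT", "3"), "L": "4",
               **dict.fromkeys("MN", "5"), "R": "6"}
    name = name.upper()
    code, prev = name[0], mapping.get(name[0], "")
    for ch in name[1:]:
        digit = mapping.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit
    return (code + "000")[:4]

def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(soundex("Smith"), soundex("Smyth"))             # S530 S530 -> phonetic match
print(round(similarity("Jonathan", "Johnathan"), 2))  # ~0.94 -> likely the same name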

In many tools, you can set the rules to be tight, where the software uses tougher criteria to determine a match, or loose, where the software is not so particular. Tight and loose matches are important because you may have strict rules for putting records together, as with customers of a bank, or less strict rules, as when you’re assembling a customer list for marketing purposes.

What to do with Matches
Once data has been processed through the matcher, there are several possible outcomes. Between any two given records, the matcher may find:

  • No relationship – the matcher found no connection between the records
  • Match – the matcher found a definite match based on the criteria given
  • Suspect – the matcher thinks it found a match but is not confident. The results should be manually reviewed.
It’s that last category that’s the tough one. Mitigating the suspect matches is the most time-consuming follow-up task after the matching is complete. Envision a million-record database where you have 20,000 suspect matches. That’s still going to take you some time to review.
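
Here is a hedged sketch of how those three outcomes typically fall out of two score cutoffs. The threshold values and the difflib-based score are illustrative assumptions, not settings from any specific product.

# Illustrative only: two cutoffs split scored record pairs into match,
# suspect (manual review), and no relationship.
from difflib import SequenceMatcher

MATCH_THRESHOLD = 0.95    # "tight": auto-accept above this score (illustrative value)
SUSPECT_THRESHOLD = 0.75  # "loose": anything between the two cutoffs goes to review

def score(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def classify(rec_a: str, rec_b: str) -> str:
    s = score(rec_a, rec_b)
    if s >= MATCH_THRESHOLD:
        return "match"
    if s >= SUSPECT_THRESHOLD:
        return "suspect"          # queue for a data steward to review
    return "no relationship"

print(classify("Steve Sarsfield, 12 Main St", "Steven Sarsfield, 12 Main Street"))  # suspect
print(classify("Steve Sarsfield, 12 Main St", "Maria Afinogenova, 9 Elm Ave"))      # no relationship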

Some of the newer (and cooler) tools offer strategies for dealing with suspect matches. They present the suspect matches in a graphical user interface and let users pick which relationships are accurate and which are not. For example, Talend now offers a data stewardship console that lets you pick and choose the records and attributes that will make up a best-of-breed record.
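
As a rough illustration of the best-of-breed idea (and emphatically not how Talend’s console works internally), here is a toy survivorship rule that keeps the most complete value for each attribute across a confirmed group of duplicates.

# Hypothetical survivorship sketch: keep the longest non-empty value per attribute.
def best_of_breed(records: list) -> dict:
    survivor = {}
    for field in records[0]:
        candidates = [r.get(field, "") or "" for r in records]
        survivor[field] = max(candidates, key=len)   # simple survivorship rule
    return survivor

dupes = [
    {"name": "S. Sarsfield", "street": "12 Main Street", "phone": ""},
    {"name": "Steve Sarsfield", "street": "12 Main St", "phone": "555-0100"},
]
print(best_of_breed(dupes))
# {'name': 'Steve Sarsfield', 'street': '12 Main Street', 'phone': '555-0100'}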

The goal, of course, is to have no suspect matches, so tuning the match rules to limit suspect matches is the ultimate aim. The newest tools make this easy. Some of the legacy tools make it hard.

Match mitigation is perhaps one of the most often overlooked processes of data quality. Don’t overlook it in your planning and processes.

Saturday, October 16, 2010

Is 99.8% data accuracy enough?

Ripped from recent headlines, we see how even a 0.2% failure rate can have a big impact.

WASHINGTON (AP) ― More than 89,000 stimulus payments of $250 each went to people who were either dead or in prison, a government investigator says in a new report.

Let’s take a good, hard look at this story. It begins with the US economy slumping. The president proposes, and Congress passes, one of the biggest stimulus packages ever. The idea sounds right to many: get America working by offering jobs in green energy and shovel-ready infrastructure projects. Among other actions, the plan is to give lower-income people some government money so they can stimulate the economy.

I’m not really here to praise or zing the wisdom of this. I’m just here to give the facts. In hindsight, it appears as though it hasn’t stimulated the economy as many had hoped, but that’s beside the point.

Continuing on, the government issues each of the 52 million people on Social Security a check for $250. It turns out that of that number, nearly 100,000 were in prison or dead, roughly 0.2% of the checks. Some checks are returned, some are cashed. Ultimately, the government loses $22.3 million on the 0.2% error.
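
A quick back-of-the-envelope check on those figures, using the 89,000 payments cited in the AP report; the rounding is mine.

# Sanity-checking the numbers in the story.
payments = 89_000          # improper payments cited in the AP report
check_amount = 250         # dollars per stimulus check
total_checks = 52_000_000  # Social Security recipients who received a check

print(f"error rate: {payments / total_checks:.2%}")        # ~0.17%, i.e. roughly 0.2%
print(f"dollars at risk: ${payments * check_amount:,}")    # $22,250,000, about $22.3 million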

While $22.3 million is a HUGE number, 0.2% is a tiny number. It strikes at the heart of why data quality is so important. Social Security spokesman Mark Lassiter said, "…Each year we make payments to a small number of deceased recipients usually because we have not yet received reports of their deaths."

There is strong evidence that the SSA is hooked up to the right commercial data feeds and has the processes in place to use them. The Social Security Administration seems quite proactive in its search for the dead and the imprisoned, but people die and go to prison all the time. They also move, get married and become independent of their parents.

If we try to imagine what it would take to get closer to 100% accuracy, it would require up-to-the-minute reference data. The only real solution seems to be legislation requiring that these life-changing events be reported to the federal government. Should we mandate that the bereaved, or perhaps funeral directors, report a death immediately to a central database? Even with such a law, a small percentage of checks would still be issued while the recipient was alive and delivered after the recipient was dead. We’d have better accuracy on this issue, but not 100%.

While this story takes a poke at the SSA for sending checks to dead people, I have to applaud its achievement of 99.8% accuracy. It could be a lot worse, America. A lot worse.

Friday, July 30, 2010

Deterministic and Probabilistic Matching White Paper

I’ve been busy this summer working on a white paper on record matching, the result of which is available on the Talend web site here.

The white paper is a sort of primer on the elementary principles of record matching. As the description says, it outlines the basic theories and strategies of record matching. It describes the nuances of deterministic and probabilistic matching and the algorithms used to identify relationships within records. It also covers the processes to employ in conjunction with matching technology to transform raw data into powerful information that drives success in enterprise applications like CRM, data warehousing and ERP.

Monday, June 9, 2008

Probabilistic Matching: Part Two

Matching algorithms, the functions that allow data quality tools to determine duplicate records and create households, are always a hot topic in the data quality community. In a previous installment of the Data Governance and Data Quality Insider, I wrote about the folly of probabilistic matching and its inability to precisely tune match results.

To recap, decisions for matching records together with probabilistic matchers are based on three things: 1) statistical analysis of the data; 2) a complicated mathematical formula; and 3) a “loose” or “tight” control setting. Statistical analysis is important because, under probabilistic matching, data that is more unique in your data set carries more weight in determining a pass/fail on the match. In other words, if you have a lot of Smiths in your database, Smith becomes a less important matching criterion for those records. If a record has a unique last name like ‘Afinogenova’, that will carry more weight in determining the match.
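
As a rough illustration of that frequency-based weighting, here is a toy example in which rarer surnames get larger weights. The log-based weight is a common simplification I’m assuming for illustration, not the formula any particular vendor uses.

# Illustrative frequency-based weighting: rarer surnames contribute more to the score.
import math
from collections import Counter

surnames = ["Smith", "Smith", "Smith", "Jones", "Afinogenova"]  # toy "database"
counts = Counter(surnames)
total = len(surnames)

def surname_weight(name: str) -> float:
    frequency = counts.get(name, 1) / total
    return math.log(1 / frequency)           # rarer name -> larger weight

print(round(surname_weight("Smith"), 2))        # ~0.51 -> common, low weight
print(round(surname_weight("Afinogenova"), 2))  # ~1.61 -> rare, high weight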

The trouble comes when you don’t like the way records are being matched. Your main course of action is to turn the dial on the loose/tight control to see if you can get the records to match without affecting record matching elsewhere in the process. Little provision is made for precise control of what records match and what records don’t. Always, there is some degree of inaccuracy in the match.

In other forms of matching, like deterministic matching and rules-based matching, you can very precisely control which records come together and which ones don’t. If something isn’t matching properly, you can make a rule for it. The rules are easy to understand. It’s also very easy to perform forensics on the matching and figure out why two records matched, and that comes in handy should you ever have to explain to anyone exactly why you deduped any given record.
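
Here is a small sketch of what that kind of rules-based matching and its "forensics" might look like: every rule is explicit, and the matcher reports which rule fired. The rules and field names are hypothetical.

# Hypothetical rules-based matcher that reports which rule caused the match.
def exact_name_and_zip(a: dict, b: dict) -> bool:
    return a["last"] == b["last"] and a["zip"] == b["zip"]

def same_phone(a: dict, b: dict) -> bool:
    return bool(a["phone"]) and a["phone"] == b["phone"]

RULES = [("last name + ZIP", exact_name_and_zip), ("identical phone", same_phone)]

def match(a: dict, b: dict):
    for rule_name, rule in RULES:
        if rule(a, b):
            return True, rule_name        # audit trail: why these records matched
    return False, None

a = {"last": "Sarsfield", "zip": "01752", "phone": "555-0100"}
b = {"last": "Sarsfield", "zip": "01752", "phone": ""}
print(match(a, b))   # (True, 'last name + ZIP')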

But there is another major folly of probabilistic matching – namely performance. Remember, probabilistic matching relies heavily on statistical analysis of your data. It wants to know how many instances of “John” and “Main Street” are in your data before it can determine if there’s a match.

Consider for a moment a real time implementation, where records are entering the matching system, say once per second. The solution is trying to determine if the new record is almost like a record you already have in your database. For every record entering the system, shouldn’t the solution re-run statistics on the entire data set for the most accurate results? After all, the last new record you accepted into your database is going to change the stats, right? With medium-sized data sets, that’s going to take some time and some significant hardware to accomplish. With large sets of data, forget it.

Many vendors who tout their probabilistic matching secretly have work-arounds for real time matching performance issues. They recommend that you don’t update the statistics for every single new record. Depending on the real-time volumes, you might update statistics nightly or say every 100 records. But it’s safe to say that real time performance is something you’re going to have to deal with if you go with a probabilistic data quality solution.
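
Here is a sketch of that work-around, assuming a simple frequency table that is refreshed every N records rather than on every insert; the class, field names and refresh interval are illustrative, not a product parameter.

# Illustrative work-around: recompute value frequencies every N inserts, not per record.
from collections import Counter

REFRESH_INTERVAL = 100   # illustrative setting; could just as well be nightly

class MatchStats:
    def __init__(self):
        self.records = []
        self.surname_counts = Counter()
        self.pending = 0

    def add(self, record: dict) -> None:
        self.records.append(record)
        self.pending += 1
        if self.pending >= REFRESH_INTERVAL:
            self.refresh()

    def refresh(self) -> None:
        """Full recompute: expensive, so it runs every N records instead of per insert."""
        self.surname_counts = Counter(r["last"] for r in self.records)
        self.pending = 0

stats = MatchStats()
for i in range(250):
    stats.add({"last": "Smith" if i % 3 else "Jones"})
print(stats.surname_counts)   # reflects the last refresh, not every insert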

Better yet, you can stay away from probabilistic matching and take a much less complicated and much more accurate approach – using time-tested pre-built business rules supplemented with your own unique business rules to precisely determine matches.

Friday, December 7, 2007

Probabilistic Matching: Sounds like a good idea, but...

I've been thinking about the whole concept of probabilistic matching and how flawed it is to assume that this matching technique is the best there is. Even in concept, it isn't.

To summarize, decisions for matching records together with probabilistic matchers are based on three things: 1) statistical analysis of the data; 2) a complicated mathematical formula; and 3) a “loose” or “tight” control setting. Statistical analysis is important because, under probabilistic matching, data that is more unique in your data set carries more weight in determining a pass/fail on the match. In other words, if you have a lot of Smiths in your database, Smith becomes a less important matching criterion for those records. If a record has a unique last name like ‘Afinogenova’, that will carry more weight in determining the match.

So the only control you really have is the loose or tight setting. Imagine for a moment that you had a volume control for the entire world. This device allows you to control the volume of every living thing and every device on the planet. The device uses a strange and mystical algorithm of sound dynamics and statistics that only the most knowledgeable scientists can understand. So, if construction noise gets too much outside your window, you could turn the knob down. The man in the seat next to you on the airplane is snoring too loud? Turn down the volume.

Unfortunately, the knob does control EVERY sound on the planet, so when you turn down the volume, the ornithologist in Massachusetts can’t hear the rare yellow-bellied sapsucker she’s just spotted. A mother in Chicago may be having a hard time hearing her child coo, so she and a thousand other people call you to ask you to turn up the volume.

Initially, the idea of a world volume control sounds really cool, but after you think about the practical applications, it’s useless. By making one adjustment to the knob, the whole system must readjust.

That’s exactly why most companies don’t use probabilistic matching. To bring records together, probabilistic matching uses statistics and algorithms to determine a match. If you don’t like the way it’s matching, your only recourse is to adjust the volume control. However, the correct and subtle matches that probabilistic matching found on the previous run will be affected by your adjustment. It just makes more sense for companies to have the individual volume controls that deterministic and rules-based matching provides to find duplicates and households.
Perhaps more importantly, certain types of companies can't use probabilistic matching because of transparency. If you're changing data at a financial institution, for example, you need to be able to explain exactly why you did it. An auditor may ask why you matched two customer records. That's easy to explain with a rules-based system, and much less transparent with probabilistic matching.

I have yet to talk to a company that actually uses 100% probabilistic matching in their data quality production systems. Like the master volume control, it sounds like a good idea when the sales guy pitches it, but once implemented, the practical applications are few.
Read more on probabilistic matching.

Disclaimer: The opinions expressed here are my own and don't necessarily reflect the opinion of my employer. The material written here is copyright (c) 2010 by Steve Sarsfield. To request permission to reuse, please e-mail me.