Friday, December 7, 2007

Probabilistic Matching: Sounds like a good idea, but...

I've been thinking about the whole concept of probabilistic matching and how flawed it is to assume that this matching technique is the best there is. Even in concept, it isn't.

To summarize, decisions for matching records together with probabilistic matchers are based on three things: 1) statistical analysis of the data; 2) a complicated mathematical formula; and 3) a “loose” or “tight” control setting. Statistical analysis is important because under probabilistic matching, values that are rarer in your data set carry more weight in determining a pass/fail on the match. In other words, if you have a lot of ‘Smith’s in your database, Smith becomes a less important matching criterion for that record. If the record has a rare last name like ‘Afinogenova’, that’ll carry more weight in determining the match.
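
To make the weighting idea concrete, here is a minimal Python sketch, not any vendor's actual formula. It assumes a Fellegi-Sunter style agreement weight, log2(m/u), where u is estimated from how often a surname appears in the data set and m is an assumed constant of 0.95. The function name and the sample data are made up for illustration.

```python
import math
from collections import Counter

def surname_agreement_weights(surnames, m=0.95):
    """Return a log2(m/u) agreement weight for each distinct surname.

    u is estimated as the relative frequency of the value in the data set,
    so common names get small weights and rare names get large ones.
    m (the chance the field agrees on a true match) is an assumed constant.
    """
    counts = Counter(name.strip().lower() for name in surnames)
    total = sum(counts.values())
    return {name: math.log2(m / (count / total)) for name, count in counts.items()}

# Hypothetical data set: many Smiths, a single Afinogenova.
surnames = ["Smith"] * 50 + ["Jones"] * 30 + ["Lee"] * 19 + ["Afinogenova"]
weights = surname_agreement_weights(surnames)

for name in ("smith", "afinogenova"):
    print(name, round(weights[name], 2))
```

With this sample, agreement on ‘Afinogenova’ earns a much larger weight than agreement on ‘Smith’. A real matcher would add up weights like these across several fields and compare the total to the loose or tight threshold.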

So the only control you really have is the loose or tight setting. Imagine for a moment that you had a volume control for the entire world. This device allows you to control the volume of every living thing and every device on the planet. The device uses a strange and mystical algorithm of sound dynamics and statistics that only the most knowledgeable scientists can understand. So, if construction noise gets too much outside your window, you could turn the knob down. The man in the seat next to you on the airplane is snoring too loud? Turn down the volume.

Unfortunately, the knob does control EVERY sound on the planet, so when you turn down the volume, the ornithologist in Massachusetts can’t hear the rare yellow-bellied sapsucker she’s just spotted. A mother in Chicago may be having a hard time hearing her child coo, so she and a thousand other people call you to ask you to turn up the volume.

Initially, the idea of a world volume control sounds really cool, but after you think about the practical applications, it’s useless. One adjustment to the knob forces the whole system to readjust.

That’s exactly why most companies don’t use probabilistic matching. To bring records together, probabilistic matching uses statistics and algorithms to determine a match. If you don’t like the way it’s matching, your only recourse is to adjust the volume control. However, the correct and subtle matches that probabilistic matching found on the previous run will be affected by your adjustment. It just makes more sense for companies to have the individual volume controls that deterministic and rules-based matching provides to find duplicates and households.
Perhaps more importantly, certain types of companies can't use probabilistic matching because of transparency. If you're changing the data at financial institutions, for example, you need to be able to explain exactly why you did it. An auditor may ask you why you matched two customer records. That's easy to explain with a rules-based system, and much less transparent with probabilistic matching.
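
As a rough illustration of that transparency, here is a small Python sketch of a rules-based matcher that reports which rule brought two records together. The rule names, field names, and sample records are all hypothetical; a real product would have its own rule language.

```python
def normalize(value):
    """Lowercase a string and collapse runs of whitespace."""
    return " ".join(value.lower().split())

# Hypothetical rules, checked in order. Each entry is (rule name, predicate).
RULES = [
    ("exact_tax_id",
     lambda a, b: bool(a["tax_id"]) and a["tax_id"] == b["tax_id"]),
    ("name_and_postcode",
     lambda a, b: normalize(a["name"]) == normalize(b["name"])
     and a["postcode"] == b["postcode"]),
]

def match_with_reason(a, b):
    """Return the name of the first rule that matches, or None if none do."""
    for name, rule in RULES:
        if rule(a, b):
            return name
    return None

rec1 = {"name": "Ann  Lee", "postcode": "02139", "tax_id": ""}
rec2 = {"name": "ann lee", "postcode": "02139", "tax_id": ""}
rec3 = {"name": "Ann Leigh", "postcode": "02139", "tax_id": ""}

print(match_with_reason(rec1, rec2))  # name_and_postcode
print(match_with_reason(rec1, rec3))  # None
```

The returned rule name is the audit trail: it is the answer to "why did these two records match?", and it points at the one rule to adjust if the match was wrong.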

I have yet to talk to a company that actually uses 100% probabilistic matching in their data quality production systems. Like the master volume control, it sounds like a good idea when the sales guy pitches it, but once implemented, the practical applications are few.

8 comments:

Daniel said...

You know, it's one thing to discuss the challenges regarding probabilistic/statistical classifiers (as required for matching); however, your poor analogy only shows your lack of expertise in this area. This reminds me of discussions surrounding deterministic/rule-based email filters and probabilistic email filters. Most experts agree the probabilistic filters clearly won out in terms of accuracy over time. The same can be said about identifying and linking/matching individuals across multiple systems; however, most companies lack the expertise to know where to begin and how to build such systems. That does not mean that probabilistic systems could not beat out (and probably will) deterministic approaches. The rules-based approach always hits a wall when you reach a large number of rule sets. I have yet to see a company/solution that can successfully and coherently debug/profile hundreds of rules running against multiple sets of data without overwhelming the end-user/analyst. Cheers.

Steve Sarsfield said...

Daniel, personal attacks aside, thanks for your opinion. This is somewhat of a religious war and it makes a great blog topic.
The fact is, debugging matching rules is not only possible with big data sets and a rules-based system, it’s downright simple. When records come together, the reports will tell you exactly what rule caused the match. So, with razor precision, you can know what rule to work on if a bad match occurs. You can’t say the same when a probabilistic matcher decides on a match. It’s difficult to trace back through the statistical analysis and advanced math to figure out why exactly two records came together and near impossible to precisely tune the results.
That’s why some probabilistic matching vendors have rules in addition to algorithms to do the match. Talk about confusing the user.
With regard to overwhelming the user with rules, well I just don’t think it’s an issue. However, we have had end users ask us to come in and give them a rules tune-up from time to time. The vendor can often provide this tune-up in a single day of professional services with excellent results.

Professor said...

Actually, the analogy is not that bad. However, the opinion is a little short-sighted. I agree that fundamentally the probabilistic approach is generally more accurate and more practical under certain circumstances. However, in very quality-sensitive scenarios, probabilistic alone doesn't cut it. What is necessary is to build a short-list layer of deterministic logic on top of a foundation of looser probabilistic settings. This is a hybrid that works extremely efficiently and accurately.
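
One way to picture such a hybrid, as a minimal Python sketch only: a loose probabilistic threshold nominates candidate pairs, then a short list of deterministic checks can veto them. The score is assumed to come from some probabilistic matcher upstream, and the threshold value, field names, and veto rule are hypothetical.

```python
def hybrid_match(score, a, b, loose_threshold=5.0):
    """Return (is_match, reason) for a candidate pair.

    score is an assumed, precomputed probabilistic match weight.
    The deterministic layer only vetoes; it never promotes a low score.
    """
    if score < loose_threshold:
        return False, "below loose probabilistic threshold"
    # Deterministic short list: reject pairs that conflict on a hard identifier.
    if a.get("birth_date") and b.get("birth_date") and a["birth_date"] != b["birth_date"]:
        return False, "vetoed: birth_date conflict"
    return True, "probabilistic candidate passed deterministic checks"

a = {"name": "J. Smith", "birth_date": "1970-01-01"}
b = {"name": "John Smith", "birth_date": "1970-01-01"}
c = {"name": "John Smith", "birth_date": "1982-06-15"}

print(hybrid_match(7.2, a, b))
print(hybrid_match(7.2, a, c))
print(hybrid_match(2.1, a, b))
```

Because each decision comes back with a reason, the deterministic layer stays explainable even though the candidates were found statistically.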

Steve Sarsfield said...

Yes, Professor, I'm starting to believe. After having three years now to think about it, I do think that probabilistic matching does have some practical value, particularly in a hybrid approach. As long as I can hold on to the fact that probabilistic is NOT the Rosetta Stone of matching, I'll concede to your point about being short-sighted.

Anonymous said...

I know this is a really old post, but I'm puzzled why the analogy has a single volume knob. Probabilistic methods block data and then apply weights to fields, so why wouldn't there be a volume knob for each field? So, let's say we make a "Smith" block: the weight on this very common name would be "turned down" quite low, and the volume knob on "Zambi" in some other block would be high. You are tuning the weights at the field level, not at the record level, and the data is blocked, so the volume knob is wrong on two levels.

Deterministic approaches just don't scale well. If data quality isn't much of an issue and there are few fields and records, this primitive approach works fine. But the numbers don't lie. Probabilistic matching algorithms are much higher in their matching accuracy. But they are harder to implement, and that's why they are rarer, not based on their relative performance. So, the record linkage that ships with, say, an EMR, is going to be deterministic because it is easy and doesn't require the domain knowledge of Fellegi-Sunter and other math to apply them. The record linkage that a specialized company with expertise in the domain will provide will include probabilistic methods. Or at least the one I would buy would.

Probabilistic with some tricks, like string distance formulae to afford tolerance for typographical errors and methods to identify digit transpositions (12 versus 21) and swaps (1/2/3 versus 1/3/2), perhaps augmented with some machine learning, is the way to go. After all, record linkage is all about simulating human intelligence at pattern matching, and you have to figure a human would "weight" a match of last name Smith differently than they would Zambi.
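
For readers who want to see what those tricks can look like, here is a short Python sketch under stated assumptions. It uses the optimal string alignment distance (a restricted Damerau-Levenshtein variant) as one possible string distance that counts an adjacent transposition as a single edit, plus a hypothetical helper that flags swapped components in a delimited value. Neither is claimed to be what any particular matching product uses.

```python
def osa_distance(s, t):
    """Optimal string alignment distance between s and t.

    Counts insertions, deletions, substitutions, and swaps of two
    adjacent characters, each as one edit.
    """
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or exact match)
            )
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[-1][-1]

def is_component_swap(a, b, sep="/"):
    """True if a and b hold the same delimited parts in a different order."""
    return a != b and sorted(a.split(sep)) == sorted(b.split(sep))

print(osa_distance("12", "21"))           # 1
print(osa_distance("Smith", "Smiht"))     # 1
print(is_component_swap("1/2/3", "1/3/2"))  # True
```

A small distance or a detected swap would then feed into the field's weight as partial agreement, rather than being treated as a flat mismatch.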

Steve Sarsfield said...

Thanks for your comments. They are good and obviously come from experience. However, I would argue that probabilistic doesn't scale particularly well either. You constantly have to run statistical analysis and that can be a drag on real-time implementations.
Deterministic matching will support the transpositions you mentioned. Algorithms like Jaro-Winkler and Levenshtein are good at that.
I think you're right. It may be time for a new intelligent, real-time method that shakes the matching world.

Anonymous said...

My issue is: to what end? Meaning, okay, probabilistic matching may not be as good as some would have you believe; however, in many situations it is notably better than deterministic. So, what do I do if not probabilistic? You have stated a problem but provided no solution which might be better.

Steve Sarsfield said...

These days, I'm seeing that matching algorithms matter less while upstream business process change matters so much more. Our jobs seem to be less about finding the optimal algorithm and more about setting and policing DQ standards. All algorithms that attempt to think like a human will have flaws. We can be much more effective if we attack root cause with DQ reporting and monitoring of data quality metrics.

Disclaimer: The opinions expressed here are my own and don't necessarily reflect the opinion of my employer. The material written here is copyright (c) 2010 by Steve Sarsfield. To request permission to reuse, please e-mail me.