
Friday, July 30, 2010

Deterministic and Probabilistic Matching White Paper

I’ve been busy this summer working on a white paper on record matching, the result of which is available on the Talend web site here.

The white paper is sort of a primer containing elementary principles of record matching. As the description says, it outlines the basic theories and strategies of record matching. It describes the nuances of deterministic and probabilistic matching and the algorithms used to identify relationships within records. It covers the processes to employ in conjunction with matching technology to transform raw data into powerful information that drives success in enterprise applications like CRM, data warehouse and ERP.

Monday, June 9, 2008

Probabilistic Matching: Part Two

Matching algorithms, the functions that allow data quality tools to determine duplicate records and create households, are always a hot topic in the data quality community. In a previous installment of the Data Governance and Data Quality Insider, I wrote about the folly of probabilistic matching and its inability to precisely tune match results.

To recap, decisions for matching records together with probabilistic matchers are based on three things: 1) statistical analysis of the data; 2) a complicated mathematical formula; and 3) a “loose” or “tight” control setting. Statistical analysis is important because under probabilistic matching, data that is more unique in your data set has more weight in determining a pass/fail on the match. In other words, if you have a lot of Smiths in your database, Smith becomes a less important matching criterion for that record. If the record has a unique last name like ‘Afinogenova’ that’ll carry more weight in determining the match.
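To make the weighting idea concrete, here is a minimal sketch of frequency-based scoring. It is purely illustrative: the record layout, the scoring formula, and the field names are my own assumptions, not any particular vendor's algorithm.

```python
from collections import Counter

# Illustrative only: weight a surname agreement by how rare the value is
# in the data set, so 'Afinogenova' counts for more than 'Smith'.
records = [
    {"last_name": "Smith"}, {"last_name": "Smith"}, {"last_name": "Smith"},
    {"last_name": "Jones"}, {"last_name": "Afinogenova"},
]

surname_counts = Counter(r["last_name"] for r in records)
total = len(records)

def surname_weight(name: str) -> float:
    """Return a higher weight for rarer surname values."""
    frequency = surname_counts.get(name, 1) / total
    return 1.0 - frequency  # common values approach 0.0, rare values approach 1.0

print(surname_weight("Smith"))        # common surname -> low weight (0.4)
print(surname_weight("Afinogenova"))  # rare surname   -> high weight (0.8)
```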

The trouble comes when you don’t like the way records are being matched. Your main course of action is to turn the dial on the loose/tight control to see if you can get the records to match without affecting record matching elsewhere in the process. Little provision is made for precise control of what records match and what records don’t. Always, there is some degree of inaccuracy in the match.

In other forms of matching, like deterministic matching and rules-based matching, you can very precisely control which records come together and which ones don’t. If something isn’t matching properly, you can make a rule for it. The rules are easy to understand. It’s also very easy to perform forensics on the matching and figure out why two records matched, and that comes in handy should you ever have to explain to anyone exactly why you deduped any given record.
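As a rough sketch of what "make a rule for it" looks like, here is a hypothetical deterministic match function. The fields and the specific rules are assumptions for illustration; the point is that each rule is explicit, so you can report exactly which rule caused two records to merge.

```python
from typing import Optional

def records_match(a: dict, b: dict) -> Optional[str]:
    """Return the name of the rule that matched the two records, or None."""
    # Rule 1: exact match on a populated national ID
    if a["ssn"] and a["ssn"] == b["ssn"]:
        return "exact SSN"
    # Rule 2: same last name, date of birth, and ZIP code
    if (a["last_name"] == b["last_name"]
            and a["dob"] == b["dob"]
            and a["zip"] == b["zip"]):
        return "last name + date of birth + ZIP"
    return None

a = {"ssn": "", "last_name": "Sarsfield", "dob": "1970-01-01", "zip": "01801"}
b = {"ssn": "", "last_name": "Sarsfield", "dob": "1970-01-01", "zip": "01801"}
print(records_match(a, b))  # -> "last name + date of birth + ZIP"
```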

But there is another major folly of probabilistic matching – namely performance. Remember, probabilistic matching relies heavily on statistical analysis of your data. It wants to know how many instances of “John” and “Main Street” are in your data before it can determine if there’s a match.

Consider for a moment a real time implementation, where records are entering the matching system, say once per second. The solution is trying to determine if the new record is almost like a record you already have in your database. For every record entering the system, shouldn’t the solution re-run statistics on the entire data set for the most accurate results? After all, the last new record you accepted into your database is going to change the stats, right? With medium-sized data sets, that’s going to take some time and some significant hardware to accomplish. With large sets of data, forget it.

Many vendors who tout their probabilistic matching secretly have work-arounds for real time matching performance issues. They recommend that you don’t update the statistics for every single new record. Depending on the real-time volumes, you might update statistics nightly or say every 100 records. But it’s safe to say that real time performance is something you’re going to have to deal with if you go with a probabilistic data quality solution.
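A bare-bones sketch of that work-around might look like the following. The refresh interval, the in-memory frequency table, and the record fields are all assumptions on my part, intended only to show the trade-off: between refreshes, matching runs against slightly stale statistics.

```python
from collections import Counter

REFRESH_INTERVAL = 100          # assumed policy: refresh stats every 100 records
surname_counts = Counter()      # frequency statistics used for weighting
records_since_refresh = 0
all_records = []

def ingest(record: dict) -> None:
    """Accept a new record, refreshing matching statistics only periodically."""
    global records_since_refresh, surname_counts
    all_records.append(record)
    records_since_refresh += 1
    if records_since_refresh >= REFRESH_INTERVAL:
        # Recompute frequencies over the full data set only every N records,
        # instead of on every arrival.
        surname_counts = Counter(r["last_name"] for r in all_records)
        records_since_refresh = 0
    # ...match the incoming record here, using the possibly stale surname_counts
```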

Better yet, you can stay away from probabilistic matching and take a much less complicated and much more accurate approach – using time-tested pre-built business rules supplemented with your own unique business rules to precisely determine matches.

Friday, December 7, 2007

Probabilistic Matching: Sounds like a good idea, but...

I've been thinking about the whole concept of probabilistic matching and how flawed it is to assume that this matching technique is the best there is. Even in concept, it isn't.

To summarize, decisions for matching records together with probabilistic matchers are based on three things: 1) statistical analysis of the data; 2) a complicated mathematical formula; and 3) a “loose” or “tight” control setting. Statistical analysis is important because under probabilistic matching, data that is more unique in your data set has more weight in determining a pass/fail on the match. In other words, if you have a lot of Smiths in your database, Smith becomes a less important matching criterion for that record. If the record has a unique last name like ‘Afinogenova’ that’ll carry more weight in determining the match.

So the only control you really have is the loose or tight setting. Imagine for a moment that you had a volume control for the entire world. This device allows you to control the volume of every living thing and every device on the planet. The device uses a strange and mystical algorithm of sound dynamics and statistics that only the most knowledgeable scientists can understand. So, if construction noise gets too much outside your window, you could turn the knob down. The man in the seat next to you on the airplane is snoring too loud? Turn down the volume.

Unfortunately, the knob does control EVERY sound on the planet, so when you turn down the volume, the ornithologist in Massachusetts can’t hear the rare yellow-bellied sapsucker she’s just spotted. A mother in Chicago may be having a hard time hearing her child coo, so she and a thousand other people call you to ask you to turn up the volume.

Initially, the idea of a world volume control sounds really cool, but after you think about the practical applications, it’s useless. By making one adjustment to the knob, the whole system must readjust.

That’s exactly why most companies don’t use probabilistic matching. To bring records together, probabilistic matching uses statistics and algorithms to determine a match. If you don’t like the way it’s matching, your only recourse is to adjust the volume control. However, the correct and subtle matches that probabilistic matching found on the previous run will be affected by your adjustment. It just makes more sense for companies to have the individual volume controls that deterministic and rules-based matching provides to find duplicates and households.
Perhaps more importantly, certain types of companies can't use probabilistic matching because of transparency. If you're changing the data at financial institutions, for example, you need to be able to explain exactly why you did it. An auditor may ask you why you matched two customer records. That's something that's easy to explain with a rules-based system, and much less transparent with probabilistic matching.

I have yet to talk to a company that actually uses 100% probabilistic matching in their data quality production systems. Like the master volume control, it sounds like a good idea when the sales guy pitches it, but once implemented, the practical applications are few.
Read more on probabilistic matching.

Disclaimer: The opinions expressed here are my own and don't necessarily reflect the opinion of my employer. The material written here is copyright (c) 2010 by Steve Sarsfield. To request permission to reuse, please e-mail me.