Tuesday, November 30, 2010

Match Mitigation: When Algorithms Aren’t Enough

I’d like to get a little technical in this post. I try to keep my posts business-friendly, but sometimes the details matter. If none of this makes sense to you, I wrote a sort of primer on how matching works in many data quality tools, which you can get here.

Matching Algorithms
When you use a data quality tool, you’re often using matching algorithms and rules to decide whether records match. You might be using deterministic algorithms like Jaro, Soundex and Metaphone. You might also be using probabilistic matching algorithms.

In many tools, you can set the rules to be tight, where the software uses tougher criteria to determine a match, or loose, where the software is not so particular. Tight and loose matching matters because you may have strict rules for putting records together, as with customers of a bank, or not-so-strict rules, as when you’re putting together a customer list for marketing purposes.
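
To make the idea concrete, here’s a minimal sketch in Python of the deterministic style of comparison, using a simplified Soundex and a generic string-similarity score from the standard library as a stand-in for Jaro. The thresholds and the tight/loose logic are my own illustration, not any particular tool’s defaults.

```python
import difflib

def soundex(name: str) -> str:
    """Simplified Soundex: first letter plus three digits (ignores the H/W rule)."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = name.upper()
    encoded, last = name[0], codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != last:
            encoded += code
        last = code
    return (encoded + "000")[:4]

def name_similarity(a: str, b: str) -> float:
    """Edit-style similarity between 0 and 1."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def names_match(a: str, b: str, tight: bool = True) -> bool:
    """Tight demands a phonetic match AND a high similarity score; loose accepts either."""
    phonetic = soundex(a) == soundex(b)
    similar = name_similarity(a, b) >= 0.8
    return (phonetic and similar) if tight else (phonetic or similar)

print(names_match("Smith", "Smyth"))                 # True: sounds alike and looks alike
print(names_match("Robert", "Rupert", tight=True))   # False: sounds alike, but similarity is low
print(names_match("Robert", "Rupert", tight=False))  # True under the loose setting
```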

What to do with Matches
Once data has been processed through the matcher, there are several possible outcomes. Between any two given records, the matcher may find:

  • No relationship
  • Match – the matcher found a definite match based on the criteria given
  • Suspect – the matcher thinks it found a match but is not confident. The results should be manually reviewed.

It’s that last category that’s the tough one. Mitigating the suspect matches is the most time-consuming follow-up task after the matching is complete. Envision a million-record database with 20,000 suspect matches. That’s going to take you some time to review.
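
As a rough illustration of that triage (the score cutoffs here are my own, not any product’s defaults), it usually comes down to two thresholds applied to each pair’s match score:

```python
def classify_pair(score: float, tight: bool = True) -> str:
    """Turn a pair's match score (0..1) into one of the three outcomes.
    The cutoffs are illustrative only."""
    match_cut, suspect_cut = (0.92, 0.80) if tight else (0.85, 0.70)
    if score >= match_cut:
        return "match"
    if score >= suspect_cut:
        return "suspect"          # route to a manual-review queue
    return "no relationship"

scored_pairs = [(("rec1", "rec2"), 0.95), (("rec3", "rec4"), 0.83), (("rec5", "rec6"), 0.40)]
review_queue = [pair for pair, score in scored_pairs if classify_pair(score) == "suspect"]
print(review_queue)   # [('rec3', 'rec4')] -- the 20,000-suspect problem in miniature
```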

Some of the newer (and cooler) tools offer strategies for dealing with suspect matches. They present the suspect matches in a graphical user interface and let users pick which relationships are accurate and which are not. For example, Talend now offers a data stewardship console that lets you pick and choose the records and attributes that will make up a best-of-breed record.

The goal, of course, is to have no suspect matches, so tuning the match rules to limit them is the ultimate aim. The newest tools make this easy. Some of the legacy tools make it hard.

Match mitigation is perhaps one of the most often overlooked processes of data quality. Don’t overlook it in your planning and processes.

Friday, July 30, 2010

Deterministic and Probabilistic Matching White Paper

I’ve been busy this summer working on a white paper on record matching, the result of which is available on the Talend web site here.

The white paper is a sort of primer containing the elementary principles of record matching. As the description says, it outlines the basic theories and strategies of record matching. It describes the nuances of deterministic and probabilistic matching and the algorithms used to identify relationships within records. It covers the processes to employ in conjunction with matching technology to transform raw data into powerful information that drives success in enterprise applications like CRM, data warehousing and ERP.

Tuesday, July 1, 2008

The Soft Costs of Information Quality

Choosing data quality technology simply on price could mean that you end up paying far more than you need to, thanks to the huge differences in how the products solve the problems. While your instinct may tell you to focus solely on the price of your data quality tool, your big costs come in less visible areas – like time to implement, re-usability, time spent preprocessing data so the tool can read it, performance and overall learning curve.

As if it weren’t confusing enough for the technology buyer to choose between desktop and enterprise-class technology, local and global solutions, or a built-in solution vs. a universal architecture, now you have to work out soft costs too. But you need to know that there are some huge differences in the way the technologies are implemented and work day-to-day, and those differences will impact your soft costs.

So just what should you look for to limit soft costs when selecting an information quality solution? Here are a few suggestions:

  • Does the data quality solution understand data at the field level only, or can it see the big picture? For example, can you pass it an address that’s a blob of text, or do you need to pass it individual first name, last name, address, city, state and postal code fields? Importance: If the data is misfielded, you’ll have a LOT of work to do to get it ready for a field-level solution.
  • On a similar note, what is the approach to multi-country data? Is there an easy way to pre-process mixed global data, or is it a manual process? Importance: If the data has mixed countries of origin, again you’ll have to do a lot of preprocessing work to get it ready.
  • What is the solution’s approach to complex records like “John and Diane Cougar Mellencamp DBA John Cougar”? Does the solution have the intelligence to understand all of the parties in such a record, or do you have to post-process the name yourself?
  • Despite the look of the user interface, is the product a real application or a development environment? Importance: A real application will flag an error if you pass in some wild and crazy data. In a development environment, even slight data quirks can keep anything from running, and just getting the process to run can be very time-consuming and wasteful.
  • How hard is it to build a process? As a user you’ll need to know how to build an entire end-to-end process with the product. During proof of concept, the data quality vendor may hide that from you. Importance: Whether you’re using it on one project, or across many projects, you’re eventually going to want to build or modify a process. You should know up-front how hard this is. It shouldn’t be a mystery, and you need to follow this during the proof-of-concept.
  • Are web services the only real-time implementation strategy? Importance: Compared to a scalable application server, web services can be slow and actually add costs to the implementation.
  • Does the application actually use its own address correction worldwide, or a third-party solution? Importance: Understanding how the application solves certain problems will tell you how much support you’ll get from the company. If something breaks, it’s easier for the program’s originator to fix it. A company relying on a lot of third-party applications may have challenges with this.
  • Does the application have different ways to find duplicates? Importance: During a complex clean-up, you may want to dedupe your records based on, say, e-mail and name for the first pass. But what about the records where the e-mail isn’t populated? For those records, you’ll need to go back and use other attributes to match. The ability to multi-match lets you achieve cleaner, more efficient data by using whatever attributes are best in your specific data (see the sketch after this list).
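
As a rough sketch of that multi-pass idea (the record layout and field names below are hypothetical, not any vendor’s API), a first pass can key on e-mail plus name, with a fallback pass on name plus postal code for records missing an e-mail:

```python
from collections import defaultdict

def multi_pass_dedupe(records):
    """Group candidate duplicates in two passes over a list of record dicts.
    Pass 1: e-mail + name where e-mail is populated.
    Pass 2: name + postal code for everything else."""
    groups = defaultdict(list)
    no_email = []
    for rec in records:
        if rec.get("email"):
            groups[("pass1", rec["email"].lower(), rec["name"].lower())].append(rec)
        else:
            no_email.append(rec)
    for rec in no_email:
        groups[("pass2", rec["name"].lower(), rec.get("postal", ""))].append(rec)
    # Only groups with more than one record are duplicate candidates.
    return [group for group in groups.values() if len(group) > 1]

records = [
    {"name": "Diane Mellencamp", "email": "diane@example.com", "postal": "47401"},
    {"name": "Diane Mellencamp", "email": "Diane@Example.com", "postal": "47401"},
    {"name": "John Cougar", "email": "", "postal": "47401"},
    {"name": "John Cougar", "email": None, "postal": "47401"},
]
print(multi_pass_dedupe(records))   # two candidate groups, one found by each pass
```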

I could go on. The point is – there are many technical, in-the-weeds differences between vendors, and those differences have a BIG impact on your ability to deliver information quality. The best way to understand a data quality vendor’s solution is to look over their shoulder during the proof-of-concept. Ask questions. Challenge the steps needed to cleanse your data. Diligence today will save you from having to buy Excedrin tomorrow.

Monday, June 9, 2008

Probabilistic Matching: Part Two

Matching algorithms, the functions that allow data quality tools to identify duplicate records and create households, are always a hot topic in the data quality community. In a previous installment of the Data Governance and Data Quality Insider, I wrote about the folly of probabilistic matching and the difficulty of precisely tuning its match results.

To recap, decisions for matching records together with probabilistic matchers are based on three things: 1) statistical analysis of the data; 2) a complicated mathematical formula; and 3) a “loose” or “tight” control setting. Statistical analysis is important because under probabilistic matching, data that is more unique in your data set carries more weight in determining a pass/fail on the match. In other words, if you have a lot of Smiths in your database, Smith becomes a less important matching criterion for those records. If a record has a rare last name like ‘Afinogenova’, that name will carry more weight in determining the match.
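
As a back-of-the-envelope illustration of that weighting idea (a crude inverse-frequency weight, not the actual formula any particular product uses):

```python
import math
from collections import Counter

def surname_weights(surnames):
    """Rarer values earn larger weights, so agreement on them counts for more.
    The log(total/frequency) weight is a simplification for illustration."""
    counts = Counter(s.lower() for s in surnames)
    total = sum(counts.values())
    return {s: math.log(total / c) for s, c in counts.items()}

weights = surname_weights(["Smith"] * 500 + ["Jones"] * 200 + ["Afinogenova"])
print(round(weights["smith"], 2))        # ~0.34 -- agreement on Smith says little
print(round(weights["afinogenova"], 2))  # ~6.55 -- agreement on Afinogenova says a lot
```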

The trouble comes when you don’t like the way records are being matched. Your main course of action is to turn the dial on the loose/tight control to see if you can get the records to match without affecting record matching elsewhere in the process. Little provision is made for precise control over which records match and which don’t. There is always some degree of inaccuracy in the match.

In other forms of matching, like deterministic matching and rules-based matching, you can very precisely control which records come together and which ones don’t. If something isn’t matching properly, you can make a rule for it. The rules are easy to understand. It’s also very easy to perform forensics on the matching and figure out why two records matched, and that comes in handy should you ever have to explain to anyone exactly why you deduped any given record.
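
To sketch what that kind of rule, and its audit trail, might look like in code (the rule names and record fields here are made up for illustration):

```python
def match_with_rules(a, b):
    """Apply explicit rules in order and return (matched, rule_name),
    so every decision can be explained after the fact."""
    rules = [
        ("exact_email", lambda a, b: bool(a["email"]) and a["email"].lower() == b["email"].lower()),
        ("name_and_postal", lambda a, b: a["last"].lower() == b["last"].lower()
                                         and a["postal"] == b["postal"]),
    ]
    for name, rule in rules:
        if rule(a, b):
            return True, name
    return False, None

a = {"last": "Mellencamp", "email": "jc@example.com", "postal": "47401"}
b = {"last": "Mellencamp", "email": "",               "postal": "47401"}
print(match_with_rules(a, b))   # (True, 'name_and_postal') -- easy to explain to an auditor
```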

But there is another major folly of probabilistic matching – namely performance. Remember, probabilistic matching relies heavily on statistical analysis of your data. It wants to know how many instances of “John” and “Main Street” are in your data before it can determine if there’s a match.

Consider for a moment a real-time implementation, where records enter the matching system, say, once per second. The solution is trying to determine whether the new record is almost like a record you already have in your database. For every record entering the system, shouldn’t the solution re-run statistics on the entire data set for the most accurate results? After all, the last new record you accepted into your database is going to change the stats, right? With medium-sized data sets, that’s going to take some time and some significant hardware to accomplish. With large sets of data, forget it.

Many vendors who tout their probabilistic matching secretly have work-arounds for real-time matching performance issues. They recommend that you don’t update the statistics for every single new record. Depending on the real-time volumes, you might update statistics nightly or, say, every 100 records. But it’s safe to say that real-time performance is something you’re going to have to deal with if you go with a probabilistic data quality solution.
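
That work-around might look something like the following hypothetical wrapper, where the refresh interval is an assumption for illustration rather than any vendor’s recommendation:

```python
import math
from collections import Counter

class PeriodicStats:
    """Refresh value-frequency weights only every `interval` new records,
    instead of on every insert -- the work-around described above."""

    def __init__(self, interval=100):
        self.interval = interval
        self.values = []
        self.weights = {}

    def add(self, value):
        self.values.append(value.lower())
        if len(self.values) % self.interval == 0:
            counts = Counter(self.values)
            total = len(self.values)
            self.weights = {v: math.log(total / c) for v, c in counts.items()}
        # Between refreshes, matching runs against stale weights -- that's the trade-off.

stats = PeriodicStats(interval=100)
for name in ["Smith"] * 150:
    stats.add(name)
print(len(stats.values), len(stats.weights))   # 150 records in, weights last refreshed at 100
```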

Better yet, you can stay away from probabilistic matching and take a much less complicated and much more accurate approach – using time-tested pre-built business rules supplemented with your own unique business rules to precisely determine matches.

Sunday, December 16, 2007

Data Governance or Magic

Today, I wanted to report on something I have discovered - an extremely large data governance project. The project is shrouded in secrecy, but bits and pieces have come out that point to the largest data governance project in the world. I hesitate to give you the details. This quasi-governmental, cross-secular organization is one of the foundational organizations of our society. Having said that, not everyone recognizes it as an authority.

Some statistics: the database contains over 40 million names in the US alone. In Canada, Mexico, South America, and many countries in Europe, the names and addresses of up to 15 percent of the population are stored in this data warehouse. Along with geospatial information used to optimize product delivery, there’s a huge amount of transactional data. Customers in the data warehouse are served for up to 12 years, at which point the trends show that most customers move on and eventually pass their memberships on to their children. Because of the nature of the work, there is sleep-pattern information on each individual, as well as a transaction when they do something “nice” for society or pursue more “naughty” actions. For example, when an individual exhibits emotional outbursts, such as pouting or crying, this kicks off a series of events that affect a massive manufacturing facility and supply chain, staffed by thousands of specialty workers who adjust as the clients’ disposition reports come into the system. Many of the clients are simply delivered coal, but other customers receive the toy, game, or new sled of their dreams. Complicating matters even more, the supply chain must deliver all products on a single day each year, December 25th.

I am of course talking about the implementation managed by Kris Kringle at the North Pole. I tried to find out more about the people, processes and products involved, but apparently there is a custom application in place. According to Mr. Kringle, “Our elves use ‘magic’ to understand our customers and manage our supply chain, so there is no need for Teradata, SAP, Oracle, Trillium Software or any other enterprise application in this case. Our magic solution has served us well for many years, and we plan to continue with this strategy for years to come.” If only we could productize some of that Christmas magic.

Friday, December 7, 2007

Probabilistic Matching: Sounds like a good idea, but...

I've been thinking about the whole concept of probabilistic matching and how flawed it is to assume that this matching technique is the best there is. Even in concept, it isn't.

To summarize, decisions for matching records together with probabilistic matchers are based on three things: 1) statistical analysis of the data; 2) a complicated mathematical formula; and 3) a “loose” or “tight” control setting. Statistical analysis is important because under probabilistic matching, data that is more unique in your data set carries more weight in determining a pass/fail on the match. In other words, if you have a lot of Smiths in your database, Smith becomes a less important matching criterion for those records. If a record has a rare last name like ‘Afinogenova’, that name will carry more weight in determining the match.

So the only control you really have is the loose or tight setting. Imagine for a moment that you had a volume control for the entire world. This device allows you to control the volume of every living thing and every device on the planet. The device uses a strange and mystical algorithm of sound dynamics and statistics that only the most knowledgeable scientists can understand. So, if construction noise gets too much outside your window, you could turn the knob down. The man in the seat next to you on the airplane is snoring too loud? Turn down the volume.

Unfortunately, the knob does control EVERY sound on the planet, so when you turn down the volume, the ornithologist in Massachusetts can’t hear the rare yellow-bellied sapsucker she’s just spotted. A mother in Chicago may be having a hard time hearing her child coo, so she and a thousand other people call you to ask you to turn up the volume.

Initially, the idea of a world volume control sounds really cool, but after you think about the practical applications, it’s useless. By making one adjustment to the knob, the whole system must readjust.

That’s exactly why most companies don’t use probabilistic matching. To bring records together, probabilistic matching uses statistics and algorithms to determine a match. If you don’t like the way it’s matching, your only recourse is to adjust the volume control. However, the correct and subtle matches that probabilistic matching found on the previous run will be affected by your adjustment. It just makes more sense for companies to have the individual volume controls that deterministic and rules-based matching provide to find duplicates and households.

Perhaps more importantly, certain types of companies can’t use probabilistic matching because of transparency. If you’re changing the data at financial institutions, for example, you need to be able to explain exactly why you did it. An auditor may ask you why you matched two customer records. That’s something that’s easy to explain with a rules-based system, and much less transparent with probabilistic matching.

I have yet to talk to a company that actually uses 100% probabilistic matching in their data quality production systems. Like the master volume control, it sounds like a good idea when the sales guy pitches it, but once implemented, the practical applications are few.

Read more on probabilistic matching.

Disclaimer: The opinions expressed here are my own and don't necessarily reflect the opinion of my employer. The material written here is copyright (c) 2010 by Steve Sarsfield. To request permission to reuse, please e-mail me.