
Saturday, October 16, 2010

Is 99.8% data accuracy enough?

Ripped from recent headlines, we see how even a 0.2% failure rate can have a big impact.

WASHINGTON (AP) ― More than 89,000 stimulus payments of $250 each went to people who were either dead or in prison, a government investigator says in a new report.

Let's take a good, hard look at this story. It begins with the US economy slumping. The president proposes and passes through Congress one of the biggest stimulus packages ever. The idea sounds good to many: get America working by offering jobs in green energy and shovel-ready infrastructure projects. Among other actions, the plan is to give lower-income people some government money so they can stimulate the economy.

I’m not really here to praise or zing the wisdom of this. I’m just here to give the facts. In hindsight, it appears as though it hasn’t stimulated the economy as many had hoped, but that’s beside the point.

Continuing on, the government issues a $250 check to each of 52 million people on Social Security. It turns out that, of that number, nearly 100,000 recipients were in prison or dead, roughly 0.2% of the checks. Some checks are returned, some are cashed. Ultimately, the government loses $22.3 million on that 0.2% error.
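
A quick back-of-the-envelope check, using the AP's figure of 89,000 bad payments, shows how those numbers hang together. The exact loss depends on how many checks were returned, so treat this as a rough sanity check rather than an audit:

    # Rough sanity check of the figures in the AP story
    total_checks = 52_000_000   # Social Security recipients who were sent a check
    bad_checks = 89_000         # payments sent to deceased or imprisoned recipients
    payment = 250               # dollars per check

    error_rate = bad_checks / total_checks
    exposure = bad_checks * payment

    print(f"Error rate: {error_rate:.2%}")    # ~0.17%, which rounds to the 0.2% cited
    print(f"Worst-case loss: ${exposure:,}")  # $22,250,000, in line with the $22.3 million reported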

While $22.3 million is a HUGE number, 0.2% is a tiny number. It strikes at the heart of why data quality is so important. Social Security spokesman Mark Lassiter said, "…Each year we make payments to a small number of deceased recipients usually because we have not yet received reports of their deaths."

There is strong evidence that the SSA is hooked up to the right commercial data feeds and has the processes in place to use them. It seems as though the Social Security Administration is quite proactive in its search for the dead and imprisoned, but people die and go to prison all the time. They also move, get married and become independent of their parents.

If we try to imagine what it would take to achieve closer to 100% accuracy, the answer is up-to-the-minute reference data. It seems that the only real solution is legislation requiring that these life-changing events be reported to the federal government. Should we mandate that the bereaved, or perhaps funeral directors, report a death immediately to a central database? Even with such a law, a small percentage of checks would still be issued while the recipient was alive and delivered after the recipient had died. We'd have better accuracy on this issue, but not 100%.

While this story takes a poke at the SSA for sending checks to dead people, I have to applaud its achievement of 99.8% accuracy. It could be a lot worse, America. A lot worse.

Monday, February 22, 2010

Referential Treatment - The Open Source Reference Data Trend

Reference data can be used in a huge number of data quality and data enrichment processes.  The simplest example is a table that contains cities and their associated postal codes – you can use an ETL process to make sure that all your customer records that contain 02026 for a postal code always refer to the standardized “Dedham, MA” for the city and state, not variations like “Deadham Mass”  or “Dedam, Massachusetts”.
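
Here's a minimal sketch of that ETL step in Python. The lookup table, field names and function below are illustrative only, not taken from any particular product:

    # Minimal sketch: standardize city/state from a postal code reference table.
    # The reference table and record layout are hypothetical examples.
    ZIP_REFERENCE = {
        "02026": ("Dedham", "MA"),
        "02108": ("Boston", "MA"),
    }

    def standardize_city_state(record: dict) -> dict:
        """Overwrite free-form city/state with the standardized values for the ZIP."""
        match = ZIP_REFERENCE.get(record.get("zip", ""))
        if match:
            record["city"], record["state"] = match
        return record

    print(standardize_city_state(
        {"name": "J. Smith", "city": "Deadham Mass", "state": "", "zip": "02026"}))
    # -> {'name': 'J. Smith', 'city': 'Dedham', 'state': 'MA', 'zip': '02026'}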

Reference data is not limited to customer addresses, however. If everyone were to use the same reference data for parts, you could easily exchange procurement data between partners. If only certain values are allowed in a given table, reference data also supports validation. By having standards for procurement, supply chain, finance and accounting data, processes become more efficient. Organizations like ISO and ECCMA are working on that.

Availability of Reference Data
In the past, it was difficult to get your hands on reference data. Long ago, no one wanted to share reference data with you - you had to send your customer data to a service provider and get the enriched data back.  Others struggled to develop reference data on their own. Lately I’m seeing more and more high quality reference data available for free on the Internet.   For data jockeys, these are good times.

GeoNames
A good example of this is GeoNames. The GeoNames geographical database is available for download free of charge under a Creative Commons Attribution license. According to the web site, it "aggregates over 100 different data sets to build a list containing over eight million geographical names and consists of 7 million unique features whereof 2.6 million populated places and 2.8 million alternate names. The data is accessible free of charge through a number of web services and a daily database export."

GeoNames combines geographical data such as names of places in various languages, elevation, population and others from various sources. All lat/long coordinates are in WGS84 (World Geodetic System 1984). Like Wikipedia, users may manually edit, correct and add new names.
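
As a quick illustration, here is one way to hit the GeoNames search web service from Python. This is only a sketch: you would register your own free GeoNames account and substitute it for the placeholder username below, and the response fields you care about may differ:

    # Sketch: look up a place via the GeoNames "searchJSON" web service.
    # Replace "your_username" with a registered GeoNames account name.
    import json
    import urllib.parse
    import urllib.request

    params = urllib.parse.urlencode({
        "q": "Dedham",
        "country": "US",
        "maxRows": 1,
        "username": "your_username",
    })
    url = "http://api.geonames.org/searchJSON?" + params

    with urllib.request.urlopen(url) as response:
        data = json.load(response)

    for place in data.get("geonames", []):
        print(place.get("name"), place.get("adminCode1"), place.get("lat"), place.get("lng"))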

US Census Data
Another rich set of reference data is the US Census “Gazetteer” data. Courtesy of the US government, you can download a database with the following fields:
  • Field 1 - State Fips Code
  • Field 2 - 5-digit Zipcode
  • Field 3 - State Abbreviation
  • Field 4 - Zipcode Name
  • Field 5 - Longitude in Decimal Degrees (West is assumed, no minus sign)
  • Field 6 - Latitude in Decimal Degrees (North is assumed, no plus sign)
  • Field 7 - 2000 Population (100%)
  • Field 8 - Allocation Factor (decimal portion of state within zipcode)
So, our Dedham, MA entry includes this data:
  • "25","02026","MA","DEDHAM",71.163741,42.243685,23782,0.003953
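
A minimal parsing sketch in Python; the field names are mine, and the sign handling simply applies the "West is assumed" and "North is assumed" conventions described above:

    import csv
    import io

    # The Dedham, MA record from the Gazetteer file, as shown above.
    raw = '"25","02026","MA","DEDHAM",71.163741,42.243685,23782,0.003953'
    row = next(csv.reader(io.StringIO(raw)))

    record = {
        "state_fips": row[0],
        "zip": row[1],
        "state": row[2],
        "name": row[3].title(),
        # West longitude is assumed in the file, so negate it for signed coordinates.
        "longitude": -float(row[4]),
        # North latitude is assumed, so the value is already positive.
        "latitude": float(row[5]),
        "population_2000": int(row[6]),
        "allocation_factor": float(row[7]),
    }
    print(record["name"], record["latitude"], record["longitude"])
    # -> Dedham 42.243685 -71.163741
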
It’s Really Exciting!
When I talk about reference data at parties, I immediately see eyes glaze over, and it's clear that my fellow party-goers want to escape my enthusiasm for it. But this availability of reference data is really great news! Together with open source data integration tools like Talend Open Studio, we're starting to see what I like to call "open source reference data" becoming available. It all makes the price of improving data quality much lower and our future much brighter.

There’s so much to talk about with regard to reference data and so many good sources.  I plan to make more posts on this topic, but feel free to post your beloved reference data sources here in the comments section.

Wednesday, June 3, 2009

Informatica Acquires AddressDoctor

Global Data is Hard to Do

Yesterday, Informatica announced their intent to acquire AddressDoctor. This acquisition is all about being able to handle global data quality in today’s market, but it has a surprising potential twist. Data quality vendors have been striving for a better global solution because so many of the large data quality projects contain global data. If your solution doesn’t handle global data, it often just won’t make the cut.

The interesting twist here is that both IBM and DataFlux leverage AddressDoctor for their handling of global address data. Several other smaller vendors do as well: MelissaData, QAS, and Datanomic. Trillium Software's technology is not impacted by this acquisition; Trillium has been building in-house technology for years to support the parsing of global data and has leveraged its parent company's acquisition of Global Address to beef up the geocoding capability of the Trillium Software System.

Informatica has dealt the competition a strong blow here. Where will these vendors go to get their global data quality? In the months to come, there will be challenges to face. Informatica, still busy integrating the disparate parts of Evoke, Similarity and Identity Systems, will now have to integrate AddressDoctor as well. Other vendors like IBM, DataFlux, MelissaData, QAS and Datanomic may have to figure out what to do for global data if Informatica decides not to renew partner agreements.

For more analysis on this topic, you can read Rob Karel's blog, where the Forrester analyst explains why he thinks the move is about limiting choices on MDM platforms.

To be on the safe side, I'd like to restate that the opinions in this blog are my own. Even though I work for Harte-Hanks Trillium Software, my comments are my independent thoughts and not necessarily those of my employer.

Tuesday, June 3, 2008

Trillium Software News Items

A few big items hit the news wire today from Trillium Software that are significant for data quality enthusiasts.

Item One:
Trillium Software cleansed and matched the huge database of Loyalty Management Group (LMG), the database company that owns the Nectar and Air Miles customer loyalty schemes in the UK and Europe.
Significance:
LMG has saved £150,000 by using data quality software to cleanse its mailing list, which is the largest in Europe, some 10 million customers strong. I believe this speaks to Trillium Software’s outstanding scalability and global data support. This particular implementation is an Oracle database with Trillium Software as the data cleansing process.


Item Two:
Trillium Software delivered the latest release of the Trillium Software System, version 11.5. The software now offers expanded cleansing capabilities across a broader range of countries.
Significance:
Again, global data is a key takeaway here. Being able to handle all of the cultural challenges you encounter with international data sets is a problem that requires continual improvement from data quality vendors. Here, Trillium is leveraging its parent company's buyout of Global Address to improve the Trillium technology.


Item Three:
Trillium Software also released a new mainframe edition of version 11.5.
Significance:
Trillium Software continues to support data quality processes on the mainframe. Unfortunately, you don't see other enterprise software companies offering many new mainframe releases these days, despite the fact that the mainframe is still a viable and vibrant platform for managing data.

Monday, May 19, 2008

Unusual Data Quality Problems

When I talk to folks who are struggling with data quality issues, there are some who are worried that they have data unlike any data anyone has ever seen. Often there’s a nervous laugh in the voice as if the data is so unusual and so poor that an automated solution can’t possibly help.

Yes, there are wide variations in data quality and consistency, and your data might be unlike any we've seen. On the other hand, we've seen a lot of unusual data over the years. For example:

  • A major motorcycle manufacturer used data quality tools to pull out nicknames from their customer records. Many of the names they had acquired for their prospect list were from motorcycle events and contests where the entries were, shall we say, colorful. The name fields contained data like “John the Mad Dog Smith” or “Frank Motor-head Jones”. The client used the tool to separate the name from the nickname, making it a more valuable marketing list.
  • One major utility company used our data quality tools to identify and record notations on meter-reader records that were important to keep for operational uses, but not in the customer billing record. Upon analysis of the data, the company noticed random text like "LDIY" and "MOR" along with the customer records. After some investigation, they figured out that LDIY meant "Large Dog in Yard," which was particularly important for meter readers. MOR meant "Meter on Right," which was also valuable. The readers were given their own notes field, so that they could maintain the integrity of the name and address while also keeping this valuable data. It probably saved a lot of meter readers from dog bite situations.
  • Banks have used our data quality tools to separate items like "John and Judy Smith/221453789 ITF George Smith". The organization wanted to treat this type of record as three separate records, "John Smith," "Judy Smith," and "George Smith," with linkage between the individuals (a toy parsing sketch follows this list). This type of data is actually quite common in mainframe migrations.
  • A food manufacturer standardizes and cleanses ingredient names to get better control of manufacturing costs. In data from their worldwide manufacturing plants, an ingredient might be "carrots", "chopped frozen carrots", "frozen carrots, chopped", "chopped carrots, frozen" and so on. (Not to mention all the possible abbreviations for the words carrots, chopped and frozen.) Without standardization of these ingredients, there was really no way to tell how many carrots the company purchased worldwide. There was no bargaining leverage with the carrot supplier, or with any other ingredient supplier, until the data was fixed.
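
As a concrete illustration of the bank example above, here is a toy sketch of splitting an "ITF" (in trust for) record into linked individual records. The parsing rules and field names are simplified for illustration; a real solution would use a proper name-parsing engine and far more robust rules:

    import re

    def split_itf_record(raw: str) -> list:
        """Toy parser: split 'John and Judy Smith/221453789 ITF George Smith'
        into linked individual records. Illustrative only."""
        owners_part, _, beneficiary = raw.partition(" ITF ")
        owners_text, _, account = owners_part.partition("/")

        # "John and Judy Smith" -> ["John Smith", "Judy Smith"]
        match = re.match(r"(.+) and (.+) (\S+)$", owners_text.strip())
        if match:
            first1, first2, surname = match.groups()
            names = [f"{first1} {surname}", f"{first2} {surname}"]
        else:
            names = [owners_text.strip()]

        records = [{"name": n, "account": account, "role": "owner"} for n in names]
        if beneficiary:
            records.append({"name": beneficiary.strip(), "account": account, "role": "beneficiary"})
        return records

    for rec in split_itf_record("John and Judy Smith/221453789 ITF George Smith"):
        print(rec)
    # -> {'name': 'John Smith', 'account': '221453789', 'role': 'owner'}
    #    {'name': 'Judy Smith', 'account': '221453789', 'role': 'owner'}
    #    {'name': 'George Smith', 'account': '221453789', 'role': 'beneficiary'}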

Not all data quality solutions can handle these types of anomalies; many will simply pass the "odd" values through without attempting to cleanse them. It's key to have a system that will learn from your data and allow you to develop business rules that meet the organization's needs.

Now there are times, quite frankly, when data gets so bad that automated tools can do nothing about it, but that's where data profiling comes in. Before you attempt to cleanse or migrate data, you should profile it to gain a complete understanding of it. This will let you weigh the cost of fixing very poor data against the value that fixing it will bring to the organization.

Sunday, December 16, 2007

Data Governance or Magic

Today, I wanted to report on what I have discovered: an extremely large data governance project. The project is shrouded in secrecy, but bits and pieces have come out that point to the largest data governance project in the world. I hesitate to give you the details. This quasi-governmental, cross-secular organization is one of the foundational organizations of our society. Having said that, not everyone recognizes it as an authority.

Some statistics: the database contains over 40 million names in the US alone. In Canada, Mexico, South America, and many countries in Europe, the names and addresses of up to 15 percent of the population are stored in this data warehouse. Along with geospatial information, used to optimize product delivery, there's a huge amount of transactional data. Customers in the data warehouse are served for up to 12 years, when the trends show that most customers move on and eventually pass their memberships on to their children. Because of the nature of the work, there is sleep pattern information on each individual, as well as a transaction recorded whenever they do something "nice" for society or pursue more "naughty" actions. For example, when an individual exhibits emotional outbursts, such as pouting or crying, this kicks off a series of events that affect a massive manufacturing facility and supply chain, staffed by thousands of specialty workers who adjust as the clients' disposition reports come into the system. Many of the clients are simply delivered coal, but other customers receive the toy, game, or new sled of their dreams. Complicating matters even more, the supply chain must deliver all products on a single day each year, December 25th.

I am of course talking about the implementation managed by Kris Kringle at the North Pole. I tried to find out more about the people, processes and products involved, but apparently it is all handled by a custom application. According to Mr. Kringle, "Our elves use 'magic' to understand our customers and manage our supply chain, so there is no need for Teradata, SAP, Oracle, Trillium Software or any other enterprise application in this case. Our magic solution has served us well for many years, and we plan to continue with this strategy for years to come." If only we could productize some of that Christmas magic.

Sunday, November 18, 2007

Postal Validation for Australia Post



One of the most basic functions you can offer as a data quality vendor is validating data against the local postal service's standards. With this validation, the postal service is saying that it has tested your software and agrees that your product can effectively cleanse local data. Customers of said products then become eligible for postal discounts and save money when they mail to their customers. The US, Canada, and Australia each have their own way of testing software to ensure results.
I took a look at the Australia Post web site to see who was on the latest AMAS (Address Matching Approval System) list and who was missing. It's interesting to note that, as of this posting, only two of the major enterprise software vendors (those in the Gartner Magic Quadrant 'leaders' section) support AMAS.
According to the AMAS list, only Trillium Software and Business Objects (FirstLogic) support the Australian postal system with software certified by Australia Post.
Sure, a good data quality solution should have connectivity - it should integrate well with your systems. It should be fast, and it should support the business user as well as the technologist. It should have many other features that meet the needs of a global company. However, postal validation for global name and address data is basic. It helps the marketing department hit their targets, it helps the billing department's invoices reach the customer, and it keeps revenue flowing into an organization.

Disclaimer: The opinions expressed here are my own and don't necessarily reflect the opinion of my employer. The material written here is copyright (c) 2010 by Steve Sarsfield. To request permission to reuse, please e-mail me.