Monday, October 12, 2009

Data May Require Unique Data Quality Processes

Some things in life look the same from a distance, but the details can vary widely.  For example, planets and stars look the same in the night sky, but traveling to them and surviving once you get there are two completely different problems. It’s only when you get close to your destination that you can see the difference.

All data quality projects can appear the same from afar but ultimately can be as different as stars and planets. One of the biggest ways they vary is in the data itself and whether it is chiefly made up of name and address data or some other type of data.

Name and Address Data
A customer database or CRM system contains data that we know much about. We know that letters will be transposed, names will be comma-reversed, postal codes will be missing, and more.  Good data quality tools know about millions of ways that name and address data can be broken, because so many name and address records have been processed over the years. Over time, business rules and processes are fine-tuned for name and address data.  Methods of matching up names and addresses become more and more powerful.
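As a rough illustration of that built-in knowledge, here is a minimal Python sketch that undoes comma reversal and then scores two names with a similarity ratio that tolerates transposed letters. The function names and the 0.85 threshold are assumptions made for this example; commercial data quality tools use far richer rules and reference data than this.

```python
from difflib import SequenceMatcher


def normalize_name(raw: str) -> str:
    """Undo comma reversal ("Smith, John" -> "john smith") and collapse whitespace."""
    if "," in raw:
        last, first = raw.split(",", 1)
        raw = f"{first} {last}"
    return " ".join(raw.lower().split())


def name_similarity(a: str, b: str) -> float:
    """Return a similarity score between 0 and 1 for two names after normalization."""
    return SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()


def is_probable_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two names as the same person when similarity reaches the threshold.

    The threshold is an illustrative assumption, not a recommended setting.
    """
    return name_similarity(a, b) >= threshold


if __name__ == "__main__":
    # A comma-reversed name with two transposed letters still scores as a match.
    print(is_probable_match("Sarsfield, Steve", "Steve Sarsfeild"))
```

The point is that the broken forms are known in advance, so rules like these can be written once and reused across customers.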

Data quality solutions also understand what names and addresses are supposed to look like since the postal authorities provide them with correct formatting. If we’re somewhat precise about following the rules of the postal authorities, most mail makes it to its destination.  If we’re very precise, the postal services can offer discounts. The rules are clear in most parts of the civilized world. Everyone follows the same rules for name and address data because it makes for better efficiency.

So, if we know what the broken item looks like and we know what the fixed item is supposed to look like, we can design and develop processes that involve trained, knowledgeable workers and automated solutions to solve real business problems. There’s knowledge inherent in the system and you don’t have to start from scratch every time you want to cleanse it.

ERP, Supply Chain Data
However, when we look at other data domains, the picture is very different.  There isn’t an established body of knowledge about what the input typically looks like and what the output should look like, so you must set up processes to define both. In supply chain data or ERP data, we can’t immediately see why the data is broken or what we need to do to fix it.  ERP data is likely to be a history lesson of your company’s origins, the acquisitions that were made, and the partnership changes throughout the years. We don’t immediately have an idea about how the data should ultimately look. The data in this world is often specific to one client or a single use scenario and cannot be handled by existing out-of-the-box rules.

With this type of data you may need to collaborate more with the business users of the data, whose expertise lets you determine the correct context for the information more quickly and therefore effect change more rapidly. Because of the inherent unknowns about the data, few of the steps for fixing the data are done for you ahead of time. It then becomes critical to establish a methodology for:
  • Data profiling in order to understand the issues and challenges in the data.
  • Discussions with the users of the data to understand context, how it’s used and the most desired representation.  Since there are few governing bodies for ERP and supply chain data, the corporation and its partners must often come up with an agreed-upon standard.
  • Setting up business rules, usually from scratch, to transform the data.
  • Testing the data in the new systems.
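
The profiling step in the list above can be sketched in a few lines of Python. This is only an illustration under assumed names: `value_pattern` and `profile_column` are hypothetical helpers, and the part numbers are made-up sample values. It reduces each value to a shape (letters become "A", digits become "9") and counts how often each shape occurs, which is often enough to start a conversation with the business users about which representation is the intended one.

```python
from collections import Counter


def value_pattern(value: str) -> str:
    """Reduce a value to its shape: letters -> 'A', digits -> '9', others kept as-is."""
    shape = []
    for ch in value.strip():
        if ch.isdigit():
            shape.append("9")
        elif ch.isalpha():
            shape.append("A")
        else:
            shape.append(ch)
    return "".join(shape)


def profile_column(values):
    """Summarize one column: row count, missing values, distinct values, shape counts."""
    non_empty = [v for v in values if v and v.strip()]
    return {
        "total": len(values),
        "missing": len(values) - len(non_empty),
        "distinct": len({v.strip() for v in non_empty}),
        "patterns": Counter(value_pattern(v) for v in non_empty).most_common(),
    }


if __name__ == "__main__":
    # Made-up part numbers, as might be left behind by two merged legacy systems.
    part_numbers = ["AB-1234", "CD-5678", "ab1234", "", None, "AB-1234"]
    for key, value in profile_column(part_numbers).items():
        print(key, value)
```

A report like this does not say which format is correct. That decision still has to come from the people who use the data.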
I write about this because I’ve read so much about this topic lately. As practitioners, you should be aware that the problem is not the same across all domains. While you can generally solve name and address data problems with a technology focus, you will often rely more on collaboration with subject matter experts to solve issues in other data domains.


Dylan Jones said...

A very timely topic Steve, great reading.

I think we're definitely seeing some positive moves by the technology industry to create some standardised modules and more focused propositions based on non-name and address data but there is still a massive potential here.

In particular, I think so many consultancies are missing a major trick by not combining deep business expertise with the advanced data quality capabilities of modern technology to create focused solutions for specific business problems.

For example, in Telco there are always issues with linking customers, service, equipment and billing. There are literally billions wasted every year from under-billing and under-utilised equipment. To resolve this requires some very specialist knowledge and the right technology but I still see organisations trying to cobble together solutions themselves, loads of potential for a proposition.

I completely agree that when you veer away from name and address data things can get complex but there are always patterns and structures to be found so that something innovative, scalable and marketable can be developed.

I would also add a requirement to perform information chain analysis to your list as that, for me, is always the starting point as it enables the business to get involved from the off.

It would be great to see more members of the data quality community forming alliances with grizzled business experts to create some innovative new services, the time is certainly right I feel.

Great post as ever.

Steve Sarsfield said...

Thanks Dylan. At least one organization is trying to come up with standards for supply chain data - ECCMA. If practitioners want to be involved in setting ISO standards, they can join.

It's a challenging task, though. Setting a standard would work if you erased every company's legacy data and we all started over, but it's the historical data and past ways of managing ERP data that haunt us.

Disclaimer: The opinions expressed here are my own and don't necessarily reflect the opinion of my employer. The material written here is copyright (c) 2010 by Steve Sarsfield. To request permission to reuse, please e-mail me.