Monday, February 22, 2010

Referential Treatment - The Open Source Reference Data Trend

Reference data can be used in a huge number of data quality and data enrichment processes.  The simplest example is a table that contains cities and their associated postal codes – you can use an ETL process to make sure that all your customer records that contain 02026 for a postal code always refer to the standardized “Dedham, MA” for the city and state, not variations like “Deadham Mass”  or “Dedam, Massachusetts”.
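The postal code example above can be sketched in a few lines of Python. This is only an illustration, not a real ETL job: the `POSTAL_REFERENCE` table, the `standardize_city_state` function and the record field names are all hypothetical, and a real reference table would be loaded from a file or database rather than typed in.

```python
# Hypothetical reference table: postal code -> standardized (city, state).
POSTAL_REFERENCE = {
    "02026": ("Dedham", "MA"),
}


def standardize_city_state(record, reference=POSTAL_REFERENCE):
    """Return a copy of the record with city/state taken from the reference table.

    Records whose postal code is missing from the table are returned unchanged.
    """
    cleaned = dict(record)
    postal_code = (record.get("postal_code") or "").strip()
    match = reference.get(postal_code)
    if match is not None:
        cleaned["city"], cleaned["state"] = match
    return cleaned


# Made-up customer records with the misspellings mentioned in the text.
customers = [
    {"name": "A. Smith", "city": "Deadham", "state": "Mass", "postal_code": "02026"},
    {"name": "B. Jones", "city": "Dedam", "state": "Massachusetts", "postal_code": "02026"},
    {"name": "C. Brown", "city": "Somewhere", "state": "??", "postal_code": "99999"},
]
standardized = [standardize_city_state(c) for c in customers]
```

Records with a postal code that is not in the table are left alone on purpose, so they can be reviewed rather than silently overwritten.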

Reference data is not limited to customer addresses, however. If everyone used the same reference data for parts, you could easily exchange procurement data between partners. If only certain values are allowed in a given table, the reference data also supports validation. With standards for procurement, supply chain, finance and accounting data, processes become more efficient. Organizations like the ISO and ECCMA are working on that.

Availability of Reference Data
In the past, it was difficult to get your hands on reference data. Long ago, no one wanted to share reference data with you - you had to send your customer data to a service provider and get the enriched data back.  Others struggled to develop reference data on their own. Lately I’m seeing more and more high quality reference data available for free on the Internet.   For data jockeys, these are good times.

GeoNames
A good example of this is GeoNames.  The GeoNames geographical database is available for download free of charge under a Creative Commons Attribution license. According to the web site, it “aggregates over 100 different data sets to build a list containing over eight million geographical names and consists of 7 million unique features whereof 2.6 million populated places and 2.8 million alternate names. The data is accessible free of charge through a number of web services and a daily database export.”

GeoNames combines geographical data such as names of places in various languages, elevation, population and others from various sources. All lat/long coordinates are in WGS84 (World Geodetic System 1984). Like Wikipedia, users may manually edit, correct and add new names.

US Census Data
Another rich set of reference data is the US Census “Gazetteer” data. Courtesy of the US government, you can download a database with the following fields:
  • Field 1 - State Fips Code
  • Field 2 - 5-digit Zipcode
  • Field 3 - State Abbreviation
  • Field 4 - Zipcode Name
  • Field 5 - Longitude in Decimal Degrees (West is assumed, no minus sign)
  • Field 6 - Latitude in Decimal Degrees (North is assumed, no plus sign)
  • Field 7 - 2000 Population (100%)
  • Field 8 - Allocation Factor (decimal portion of state within zipcode)
So, our Dedham, MA entry includes this data:
  • "25","02026","MA","DEDHAM",71.163741,42.243685,23782,0.003953
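As a rough sketch, here is one way to read a record like this in Python, following the field list above. The field names in `FIELDS` and the `parse_gazetteer_line` function are my own labels, not anything defined by the Census file. Since the file leaves the minus sign off West longitudes, the sketch negates the value to get conventional signed decimal degrees.

```python
import csv
import io

# Hypothetical names for the eight fields, in the order listed above.
FIELDS = [
    "state_fips", "zipcode", "state", "name",
    "longitude", "latitude", "population_2000", "allocation_factor",
]


def parse_gazetteer_line(line):
    """Parse one comma-separated gazetteer record into a dict with typed values."""
    row = next(csv.reader(io.StringIO(line)))
    record = dict(zip(FIELDS, row))
    # West is assumed with no minus sign, so negate to get a signed longitude.
    record["longitude"] = -float(record["longitude"])
    record["latitude"] = float(record["latitude"])
    record["population_2000"] = int(record["population_2000"])
    record["allocation_factor"] = float(record["allocation_factor"])
    return record


dedham = parse_gazetteer_line(
    '"25","02026","MA","DEDHAM",71.163741,42.243685,23782,0.003953'
)
```

The FIPS code and the zipcode stay as strings so that leading zeros, as in 02026, are not lost.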
It’s Really Exciting!
When I talk about reference data at parties, I immediately see eyes glaze over and it’s clear that my fellow party-goers want to escape my enthusiasm for it.  But this availability of reference data is really great news! Together with the open source data integration tools like Talend Open Studio, we’re starting to see what I like to call “open source reference data” becoming available. It all makes the price of improving data quality much lower and our future much brighter.

There’s so much to talk about with regard to reference data and so many good sources.  I plan to make more posts on this topic, but feel free to post your beloved reference data sources here in the comments section.

Tuesday, February 16, 2010

The Secret Ingredient in Major IT Initiatives

One of my first jobs was that of assistant cook at a summer camp.  (In this case, the term ‘cook’ was loosely applied; it meant scrubbing pots and pans for the head cook.) It was there I learned that most cooks have ingredients that they tend to use more often.  The cook at Camp Marlin tended to use honey where applicable.  Food TV star Emeril likes to use garlic and pork fat.  Some cooks add a little hot pepper to their chocolate recipes; it is said to bring out the flavor of the chocolate.  Definitely a secret ingredient.
For head chefs taking on major IT initiatives, the secret ingredient is always data quality technology. Attention to data quality doesn’t make the recipe of an IT initiative on its own so much as it makes an IT initiative better.  Let’s take a look at how this happens.

Profiling
No matter what the project, data profiling gives the project team a complete understanding of the data before they attempt to migrate it. This helps the team create a more accurate plan for integration.  Migrating data to your new solution as-is, by contrast, is ill-advised: it can lead to major cost overruns and project delays as you have to load and reload it.
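To make profiling a bit more concrete, here is a minimal sketch of a single-column profile in Python. It is an assumption-laden toy, not how any particular profiling tool works: `profile_column`, the pattern notation (9 for a digit, A for a letter) and the sample values are all made up for illustration.

```python
from collections import Counter


def profile_column(values):
    """Summarize a column: row count, blanks, distinct values, top patterns."""

    def pattern(value):
        # Map digits to 9 and letters to A, so "02026" becomes "99999".
        return "".join(
            "9" if ch.isdigit() else "A" if ch.isalpha() else ch for ch in value
        )

    non_blank = [v.strip() for v in values if v is not None and v.strip()]
    return {
        "rows": len(values),
        "blank": len(values) - len(non_blank),
        "distinct": len(set(non_blank)),
        "patterns": Counter(pattern(v) for v in non_blank).most_common(3),
    }


# Made-up postal code column with a few typical problems.
postal_codes = ["02026", "02026", "2026", "", None, "0202B"]
profile = profile_column(postal_codes)
```

Even on this tiny sample, the profile surfaces blanks, a code with a dropped leading zero and a code with a stray letter, which are the kinds of surprises you want to find before the migration rather than during it.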

Customer Relationship Management (CRM)
By using data quality technology in CRM, the organization will benefit from a cleaner customer list with fewer duplicate records. Data quality technology can work as a real-time process, limiting the number of typos and duplicates in the system, thus leading to improved call center efficiency.  Data profiling can also help an organization understand and monitor the quality of a purchased list before integration, which helps avoid issues with third-party data.

Enterprise Resource Planning (ERP) and Supply Chain Management (SCM)

If data is accurate, you will have a more complete picture of the supply chain. Data quality technology can be used to more accurately report inventory levels, lowering inventory costs. When you make it part of your ERP project, you may also be able to improve bargaining power with suppliers by gaining improved intelligence about your corporate buying power.

Data Warehouse and Business Intelligence
Data quality helps disparate data sources to act as one when migrated to a data warehouse. By standardizing disparate data, data quality makes a data warehouse possible. You will be able to generate more accurate reports when trying to understand sales patterns, revenue, customer demographics and more.

Master Data Management (MDM)
Data quality is a key component of master data management. An integral part of making applications communicate and share data is to have standardized data.  MDM enhances the basic premise of data quality with additional features like persistent keys, a graphical user interface to mediate matching, the ability to publish and subscribe to enterprise applications, and more.

So keep in mind, when you decide to improve data quality, it is often because of your need to make a major IT initiative even stronger.  In most projects, data quality is the secret ingredient to make your IT projects extraordinary.  Share the recipe.

Monday, February 1, 2010

A Data Governance Mission Statement

Every organization, including your data governance team, has a purpose and a mission. It can be very effective to communicate your mission in a mission statement to show the company that you mean business.  When you show the value of your team, it can change your relationship with management for the better.

The mission statement should pay tribute to the mission of the organization with regard to values, while defining why the data governance organization exists and setting a big picture goal for the future.
The data governance mission statement could revolve around any of the following key components:

  • increasing revenue
  • lowering costs
  • reducing risks (compliance)
  • meeting any of the organization’s other policies such as being green or socially responsible

The most popular format seems to follow:
Our mission is to [purpose] by doing [high level initiatives] to achieve [business benefits]

So, let’s try one:
Our mission is to ensure that the highest quality data is delivered via a company-wide data governance strategy for the purpose of improving the efficiency, increasing the profitability and lowering the risk of the business units we serve.
Flipped around:
Our mission is to improve the efficiency, increase the profitability and lower the business risks of Acme’s business units by ensuring that the highest quality data is delivered via a company-wide data governance strategy.
Not bad, but a mission statement should be inspiring to the team and to management. Since the passions of the company described above are unknown, it’s difficult for a generic mission statement to be inspirational about the data governance program. That’s up to you.
 
Goals & Objectives
There are mission statements and there are objectives. While every mission statement should say who you are and why you exist, every objective should specify what you’re going to do and the results you expect.  Objectives include activities that can be easily tracked, measured and achieved and that, of course, serve the mission.  When you start data governance projects, you can look back to the mission statement to make sure you’re on track. Are you using your people and technology in a way that will benefit the company?

Staying On Mission
When you take on a new project, the mission statement can help protect you and ensure that the project is worthwhile for both the team and the company. The mission statement should be considered as a way to block busy-work and unimportant projects.  In our mission statement example above, if the project doesn’t improve efficiency, increase profitability or lower business risk, it should not be considered.


In this case, you can clearly map three projects to the mission, but the fourth project is not as clear.  Dig deeper into the mainframe project to see if any efficiency will come out of the migration.  Is the data being used by anyone for a business purpose?

A Mission Never Ends
A mission statement is a written declaration of a data governance team's purpose and focus. This focus normally remains steady, while objectives may change often to adapt to changes in the business environment. A properly crafted mission statement will serve as a filter to separate what is important from what is not and to communicate your value to the entire organization.

Disclaimer: The opinions expressed here are my own and don't necessarily reflect the opinion of my employer. The material written here is copyright (c) 2010 by Steve Sarsfield. To request permission to reuse, please e-mail me.