Monday, February 22, 2010

Referential Treatment - The Open Source Reference Data Trend

Reference data can be used in a huge number of data quality and data enrichment processes.  The simplest example is a table that contains cities and their associated postal codes – you can use an ETL process to make sure that all your customer records that contain 02026 for a postal code always refer to the standardized “Dedham, MA” for the city and state, not variations like “Deadham Mass”  or “Dedam, Massachusetts”.
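The postal-code lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not production ETL code: the one-row `ZIP_REFERENCE` table and the field names are my own placeholders standing in for a full reference dataset.

```python
# A tiny stand-in for a full ZIP-code reference table (hypothetical data layout).
ZIP_REFERENCE = {
    "02026": ("Dedham", "MA"),
}

def standardize_city_state(record):
    """Overwrite free-text city/state with the reference values for the ZIP."""
    ref = ZIP_REFERENCE.get(record.get("zip"))
    if ref is None:
        return record  # unknown ZIP: leave the record alone for manual review
    city, state = ref
    return {**record, "city": city, "state": state}

messy = {"zip": "02026", "city": "Deadham", "state": "Mass"}
clean = standardize_city_state(messy)
# clean now reads Dedham, MA regardless of how the city was originally spelled
```

The key design point is that the postal code, not the free-text city name, is treated as the trusted key.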

Reference data is not limited to customer addresses, however. If everyone used the same reference data for parts, you could easily exchange procurement data between partners. If only certain values are allowed in a given table, reference data supports validation. By having standards for procurement, supply chain, finance, and accounting data, processes become more efficient. Organizations like ISO and ECCMA are working on that.

Availability of Reference Data
In the past, it was difficult to get your hands on reference data. Long ago, no one wanted to share reference data with you - you had to send your customer data to a service provider and get the enriched data back.  Others struggled to develop reference data on their own. Lately I’m seeing more and more high quality reference data available for free on the Internet.   For data jockeys, these are good times.

A good example of this is GeoNames. The GeoNames geographical database is available for download free of charge under a Creative Commons Attribution license. According to the web site, it “aggregates over 100 different data sets to build a list containing over eight million geographical names and consists of 7 million unique features whereof 2.6 million populated places and 2.8 million alternate names. The data is accessible free of charge through a number of web services and a daily database export.”

GeoNames combines geographical data such as names of places in various languages, elevation, population and others from various sources. All lat/long coordinates are in WGS84 (World Geodetic System 1984). Like Wikipedia, users may manually edit, correct and add new names.
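Because GeoNames coordinates are plain WGS84 latitude/longitude, standard great-circle math works on them directly. Here is a minimal haversine sketch; the Dedham and Boston coordinates below are approximate values I supply for illustration, not pulled from GeoNames.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two WGS84 lat/long points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

# Approximate coordinates: Dedham, MA to downtown Boston, MA (~15 km apart)
d = haversine_km(42.2437, -71.1637, 42.3601, -71.0589)
```

A distance function like this is handy for sanity-checking enriched records, e.g. flagging a customer whose geocoded location is hundreds of kilometers from the centroid of their claimed postal code.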

US Census Data
Another rich set of reference data is the US Census “Gazetteer” data. Courtesy of the US government, you can download a database with the following fields:
  • Field 1 - State Fips Code
  • Field 2 - 5-digit Zipcode
  • Field 3 - State Abbreviation
  • Field 4 - Zipcode Name
  • Field 5 - Longitude in Decimal Degrees (West is assumed, no minus sign)
  • Field 6 - Latitude in Decimal Degrees (North is assumed, no plus sign)
  • Field 7 - 2000 Population (100%)
  • Field 8 - Allocation Factor (decimal portion of state within zipcode)
So, our Dedham, MA entry includes this data:
  • "25","02026","MA","DEDHAM",71.163741,42.243685,23782,0.003953
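Given the field layout above, a short script can parse such a record and restore conventionally signed coordinates. This is a sketch: the dictionary keys are my own labels, not official Census field names, and the sign flip on longitude follows the "West is assumed, no minus sign" convention stated in the field list.

```python
import csv
import io

RAW = '"25","02026","MA","DEDHAM",71.163741,42.243685,23782,0.003953'

def parse_gazetteer_row(line):
    """Parse one Gazetteer ZIP record into a dict with signed coordinates."""
    fips, zipcode, state, name, lon, lat, pop, alloc = next(
        csv.reader(io.StringIO(line)))
    return {
        "state_fips": fips,
        "zip": zipcode,
        "state": state,
        "name": name.title(),
        "longitude": -float(lon),   # West is assumed in the file, so negate
        "latitude": float(lat),     # North is assumed, already positive
        "population_2000": int(pop),
        "allocation_factor": float(alloc),
    }

row = parse_gazetteer_row(RAW)
# row["longitude"] is -71.163741, the conventional signed value for Dedham, MA
```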
It’s Really Exciting!
When I talk about reference data at parties, I immediately see eyes glaze over and it’s clear that my fellow party-goers want to escape my enthusiasm for it.  But this availability of reference data is really great news! Together with the open source data integration tools like Talend Open Studio, we’re starting to see what I like to call “open source reference data” becoming available. It all makes the price of improving data quality much lower and our future much brighter.

There’s so much to talk about with regard to reference data and so many good sources.  I plan to make more posts on this topic, but feel free to post your beloved reference data sources here in the comments section.


Unknown said...

Steve, I agree with you, using rich external reference data is also in my eyes one of the big things happening in data quality automation.

It’s a trend all over the world that governments are releasing large volumes of reference data in many cases for free.

Using external reference data also goes hand in hand with cloud computing. Besides infrastructure and software you also will find (maintained) reference data in the cloud.

Garnie Bolling said...

Hi Steve, great post.

It is encouraging to see how this data is starting to evolve and become easier to access. I remember the days of vendors selling reference data models, and exchanges asking you to join in order to use their reference data (like supply chain)... granted, we still need those vendors / organizations.

Like Henrik says, the more we leverage reference models from the net, the closer we get to realizing the benefits of a cloud.

Now if we can just get more tooling and solutions to become more "open" about incorporating these reference models.

Keep up the great posts.

XaviMann said...

Steve, you're right, there is a lot of great reference data out there. On my last project we used data similar to GeoNames to create a geography dimension. It was great to be able to find multiple sources with global databases of geographical information. I wish some enterprising data hound would assemble a single global db of postal code data. That would be huge. It does beg the question: how do you verify the quality of the reference data?

Unknown said...

I share your excitement, Steve! :) I'm delighted to see more and more trusted sources of data opening their databases, and similar initiatives around the world.

However, with the multiplication of sources for the same kind of data - independently of their "open source" business model - you will have to figure out which source can give you the most accurate information and values to serve as your own reference data (a kind of "golden copy").

We face these kinds of issues within the financial services industry for securities and pricing data: once you've worked hard to map the different data models (with the related maintenance) and to integrate the data from each of your sources, you then have to figure out in which cases you should take a value from one source rather than another to fill in your attribute, since sources do not have the same coverage and relevance for what you need.

In this area, the Linked Data initiative is also a really interesting phenomenon as a network of free data sources, but it will not spare you the pain of mapping ontologies and implementing logic to decide on your trusted data source at the attribute level. All we need is standardized data models and free data sources... Standards (useful and in use) rule!

I don't exactly get your statement, Henrik, on how cloud computing can specifically help with integrating more and more data sources and possibly aggregating them - apart from the strict performance point of view and the related cost.

But for sure, open-source integration tools and free, trusted data sources are a really good opportunity to enhance the trust in your own data at a really reduced cost!

Unknown said...

Olivier, I see cloud computing as a driver for using rich external reference data.

First of all, many SaaS solutions naturally come with maintained reference data built in. That may be simple things such as country lists, postal code tables, industry product codes and so on.

Also, there is great potential in us maintaining the master data that everyone else today maintains about you and me – LinkedIn is a great example here.

Steve Sarsfield said...

The Linked Data Initiative that Olivier mentioned:

If you're into reference data, definitely explore here.

Ken O'Connor said...

Hi Steve,

Great post - well done.

Access to quality external reference data will help us break out of the "bespoke era of Master Data Management". In the past, there was little or no external reference data available. Thankfully this has improved in recent years, and it is great to learn from you about this great free "open source" external reference data.

The human genome project, in which DNA is being ‘deciphered’, is one of the finest examples of how the “open source” model can bring greater benefit to all. The equivalent within the data world could be the opening up of proprietary data models. IBM developed the Financial Services Data Model (FSDM). The FSDM became an ‘overnight success’ when Basel II arrived: those financial institutions that had adopted the FSDM were in a position to find the data required by the regulators relatively easily.

Imagine a world in which the financial regulator(s) used the same data model as the financial organisations.

Perhaps such an "open source" data modelling project is already underway?? If so, I would love to hear about it.

Regarding XaviMann's wish for "some enterprising data hound to assemble a single global db of postal code data." I believe Graham Rhind may well be that data hound...
Check out:

Rgds Ken

Steve Sarsfield said...

I am familiar with Graham Rhind's offering and I think it's worthy of its own blog post... to come.

institute of lraqi scholars & academician said...

thank you very much

Disclaimer: The opinions expressed here are my own and don't necessarily reflect the opinion of my employer. The material written here is copyright (c) 2010 by Steve Sarsfield. To request permission to reuse, please e-mail me.