Friday, July 9, 2010
The Book of Life
I found a very interesting paper on record matching while researching my own upcoming white paper. It was written by the Chief of the National Office of Vital Statistics for the U.S. Public Health Service. The paper describes, in almost poetic fashion, a person’s “book of life”: the data trail each of us leaves as we go through life. It describes how hard it is to put together the pages of your book of life as you are born, get married, change homes, and earn degrees and certifications.
Naturally, there are benefits to society in each person having their own book of life. In the case of the bureau chief, he cited the need to understand what factors influence health and longevity. The tricky part, he said, was to “bind the book of life” despite its tendency to be misaligned, non-standard and incoherent.
It sounds like the good Doctor is describing record matching and data cleansing, and to some degree a national ID. But the most interesting and amazing thing about this is that the paper was written in 1946. Even back then, there were smart people who knew what we had to do to bring benefit to society.
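As a thought experiment, here’s what stitching together two “pages” of someone’s book of life might look like in code. This is a toy sketch with invented records, using Python’s standard-library SequenceMatcher as a stand-in for a real record matching engine:

```python
from difflib import SequenceMatcher

def normalize(record):
    """Lowercase, strip periods, and collapse whitespace for comparison."""
    return " ".join(record.lower().replace(".", "").split())

def similarity(a, b):
    """Rough string similarity between two normalized records (0.0 to 1.0)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

# Two pages of the same (fictional) person's book of life, years apart
birth_record = "Mary E. Smith, 12 Oak St, Dedham MA"
marriage_record = "Mary Elizabeth Smith, 12 Oak Street, Dedham, MA"

score = similarity(birth_record, marriage_record)
print(f"match score: {score:.2f}")  # a high score suggests the same person
```

A real matcher would weigh individual fields (name, address, birth date) rather than whole strings, but the binding problem is the same one the doctor described back then.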
Thursday, May 13, 2010
Three Conversations to Have with an Executive - the Only Three
If you’re reading this, you’re most likely in the business of data management. In many companies, particularly large ones, the folks who manage data don’t talk much to the executives. But every so often, there is that luncheon, a chance meeting in the elevator, or even a break from a larger meeting where you and an executive are standing face to face. (S)he asks what you’re working on. Like a Boy Scout, be prepared. Keep your response to one of these three things:
- Revenue – How has your team increased revenue for the corporation?
- Efficiency – How has your team lowered costs by improving efficiency for the corporation?
- Risk – How have you and your team lowered risk to the corporation through better regulatory compliance?
The executive doesn’t want to hear about schemas, transformations or even data quality. Some examples of appropriate responses might include:
- We work on making the CRM/ERP system more efficient by keeping an eye on the information within it. My people ensure that the reports are accurate and complete so you have the tools to make the right decisions.
- We’re doing things like making sure we’re in compliance with [HIPAA/Solvency II/Basel II/Antispam] so no one runs afoul of the law.
- We’re speeding up the time it takes to get valuable information to the [marketing/sales/business development] team so they can react quickly to sales opportunities.
- We’re fixing [business problem] to [company benefit].
When you talk to your CEO, it’s your opportunity to get him/her into the mindset that your team is beneficial, so when it comes to funding, it will be something they remember. It’s your chance to get out of the weeds and elevate the conversation. Let the sales guys talk about deals. Let the marketing people talk about market forces or campaigns. As data champions, we also need to be prepared to talk about the value we bring to the game.
Thursday, May 6, 2010
Open Source Data Profiler Demo
Here I am giving a demo on Talend Open Profiler. Demo and commentary included.
You can download Talend Open Profiler here.
Tuesday, May 4, 2010
Are we ready for a National ID card in America?
I read with interest the story about the National ID card. Although I don’t like to link myself to one political party or another, I applaud the effort of trying to get a system in place for national ID. I like efficient data.
However, I have my doubts that a group of Senators can really understand the enormous challenges of such a project. The issue is a politically charged one for certain, so that will be the focus. The details, which we all know contain the devil, will likely be forgotten.
I recall just a short time ago the US government’s Cash for Clunkers program. The program involved turning in your old “clunker” and buying a new fuel-efficient car. The idea was to support the auto industry and get the gas guzzlers off the road. The devil was in the details, however. Rather than a secure web site with sufficient backbone to properly serve car dealerships, the program required dealers to complete pages and pages of paperwork… real paper paperwork… and fax it to the newly formed government agency for approval. Then workers were hired on the other end to enter the data. It was a business process that would have been appropriate for 1975, not 2010.
ACLU legislative counsel Christopher Calabrese said of this National ID program that “all of this will come with a new federal bureaucracy — one that combines the worst elements of the DMV and the TSA”. Based on recent history it’s an accurate description of what will likely happen.
If the government wants to do this thing, they need to bring in a dream team of database experts. Guys like Dr. Ralph Kimball or Bill Inmon, both of whom are world-renowned for data modeling, should contribute if they are willing. They should ask Dr. Rich Wang from MIT’s IQ program to take charge of information quality issues. They should invite guys like Jim Harris to communicate the complex issues to the public. Also, they need to bring in folks with practical experience, like a Jill Dyche or Gwen Thomas. There are probably some others that I haven’t mentioned. Security experts, hardware scalability experts and business process experts need to be part of the mix to protect the citizenry of the United States. They would need to make a plan without bias toward any district or political action committee. That’s why a national database won’t happen.
Don’t get me wrong: done right, this could give us much more efficient systems for checking backgrounds, I-9 job verification, international travel, and more. Identity theft is a big problem here and everywhere, but with a central citizen repository, the US could legislate a notification system that alerts you when new bank accounts are opened in your name. The census would always show a more accurate number and wouldn’t cost us billions of dollars every ten years. Let’s face it, the business process of the census, mailing paper forms and conducting door-to-door interviews, is outdated.
Let’s start this by making it voluntary. If you want to be in the database and avoid long lines at the airport, fine. If you want to be anonymous and wait, that’s fine, too. We’ll get the kinks worked out with the early adopters and roll it out to the laggards later.
What we’re really talking about here is a personal primary key. That data already exists in multiple linkable systems with your name and addresses (past and present) linking it. We as data professionals spend a lot of time and effort working with data to try to find these links. So why not have a primary key to link your personal data instead? Are you really giving up anything that DBAs haven't already figured out?
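To make the “primary key” point concrete, here’s a toy illustration with entirely made-up records. Today, linking a bank record to a DMV record means fuzzy matching on names and addresses; with a shared national key, it becomes an exact join:

```python
# Hypothetical records from two separate systems; the shared "person_id"
# plays the role a national ID would.
bank_accounts = [
    {"person_id": "US-0001", "name": "Jon Smith", "account": "A-77"},
    {"person_id": "US-0002", "name": "Maria Gonzales", "account": "A-78"},
]
dmv_records = [
    {"person_id": "US-0001", "name": "Jonathan Smith", "license": "MA-123"},
]

# With a shared key, linking is a simple exact lookup -- no fuzzy matching
dmv_by_id = {r["person_id"]: r for r in dmv_records}
links = []
for acct in bank_accounts:
    match = dmv_by_id.get(acct["person_id"])
    if match:
        links.append((acct["account"], match["license"]))
print(links)
```

Notice that “Jon Smith” versus “Jonathan Smith” never matters once the key exists. That is exactly the linkage work data professionals do today, only with far messier inputs.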
For those of you against a national database, I don’t think you have anything to fear. Call me a skeptic, but given the political divide between groups, it’s unlikely that any national database of citizens will be built within this decade. But if you’re listening, Senators, and you decide to move forward, make sure you have the right people, processes and technology in place to do it right.
Labels:
national ID database
Friday, April 9, 2010
Links from my eLearning Webinar
I recently delivered a webinar on the Secrets of Affordable Data Governance. In the webinar, I promised to deliver links for lowering the costs of data management. Here are those links:
- Talend Open Source - Download free data profiling, data integration and MDM software.
- US Census - Download census data for cleansing city and state names, with latitude and longitude appends.
- Data.gov - The data available from the US government.
- Geonames - Postal codes and other location reference data for almost every country in the world.
- GRC Data - A source of low-cost customer reference data, including names, addresses, salutations, and more.
- Regular Expressions - Check the shape of data in profiling software or within your database application.
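As a quick illustration of that last item, here’s a regex shape check in Python. The US ZIP code pattern is my own example; swap in whatever pattern fits your data:

```python
import re

# Expected "shape" of a US ZIP code: 5 digits, optionally followed by +4
zip_pattern = re.compile(r"^\d{5}(-\d{4})?$")

values = ["02026", "02026-1234", "2026", "ABCDE"]
invalid = [v for v in values if not zip_pattern.match(v)]
print(f"{len(invalid)} of {len(values)} values fail the shape check: {invalid}")
```

Most profiling tools and database engines accept similar pattern syntax, so a check like this ports easily between environments.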
Friday, April 2, 2010
Donating the Data Quality Asset
If you believe like I do that proper data management can change the world, then you have to start wondering if it’s time for all us data quality professionals to stand up and start changing it.
It’s clear that every organization, no matter its size or influence, can benefit from properly managing its data. Even charitable organizations can benefit from a cleaner customer list to get the word out when they need donations. Non-profits that handle charitable goods can benefit from better data in their inventory management. If food banks had a better way of managing data and soliciting volunteers, wouldn’t more people be fed? If churches kept better records of their members, would their positive influence be more widespread? If organizations that accept donated goods kept a better inventory system, wouldn’t more people benefit? The data asset is not limited to Fortune 1000 companies, but until recently, solutions to manage data properly were only available to the elite.
Open source is coming on strong, and it makes donating data quality work far easier. In the past, it may have been a challenge to get mega-vendors to donate high-end solutions, but these days we can make significant progress on the data quality problem with little or no software cost. Solutions like Talend Open Profiler, Talend Open Studio, Pentaho and DataCleaner offer data integration and data profiling.
In my last post, I discussed the reference data that is now available for download. Reference data used to be proprietary and costly. It’s a new world – a better one for low-cost data management solutions.
Can we save the world through data quality? If we can help good people spread more goodness, then we can. Let’s give it a try.
Labels:
data profiling,
data quality,
donation,
open source,
supply chain
Monday, February 22, 2010
Referential Treatment - The Open Source Reference Data Trend
Reference data can be used in a huge number of data quality and data enrichment processes. The simplest example is a table that contains cities and their associated postal codes – you can use an ETL process to make sure that all your customer records that contain 02026 for a postal code always refer to the standardized “Dedham, MA” for the city and state, not variations like “Deadham Mass” or “Dedam, Massachusetts”.
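A minimal sketch of that exact standardization, assuming a tiny Python lookup table in place of a real ETL job and a full postal reference file:

```python
# Reference table: postal code -> standardized (city, state)
reference = {"02026": ("Dedham", "MA")}

# Customer records with messy city values (invented for the example)
customers = [
    {"zip": "02026", "city": "Deadham Mass"},
    {"zip": "02026", "city": "Dedam, Massachusetts"},
]

for row in customers:
    standard = reference.get(row["zip"])
    if standard:
        row["city"], row["state"] = standard  # overwrite with the standard form

print(customers)  # every 02026 record now reads Dedham, MA
```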
Reference data is not limited to customer addresses, however. If everyone were to use the same reference data for parts, you could easily exchange procurement data between partners. If only certain values are allowed in any given table, it would support validation. By having standards for procurement, supply chain, finance and accounting data, processes become more efficient. Organizations like the ISO and ECCMA are working on that.
Availability of Reference Data
In the past, it was difficult to get your hands on reference data. Long ago, no one wanted to share reference data with you - you had to send your customer data to a service provider and get the enriched data back. Others struggled to develop reference data on their own. Lately I’m seeing more and more high quality reference data available for free on the Internet. For data jockeys, these are good times.
GeoNames
A good example of this is GeoNames. The GeoNames geographical database is available for download free of charge under a creative commons attribution license. According to the web site, it “aggregates over 100 different data sets to build a list containing over eight million geographical names and consists of 7 million unique features whereof 2.6 million populated places and 2.8 million alternate names. The data is accessible free of charge through a number of web services and a daily database export. “
GeoNames combines geographical data such as names of places in various languages, elevation, population and others from various sources. All lat/long coordinates are in WGS84 (World Geodetic System 1984). Like Wikipedia, users may manually edit, correct and add new names.
US Census Data
Another rich set of reference data is the US Census “Gazetteer” data. Courtesy of the US government, you can download a database with the following fields:
- Field 1 - State Fips Code
- Field 2 - 5-digit Zipcode
- Field 3 - State Abbreviation
- Field 4 - Zipcode Name
- Field 5 - Longitude in Decimal Degrees (West is assumed, no minus sign)
- Field 6 - Latitude in Decimal Degrees (North is assumed, no plus sign)
- Field 7 - 2000 Population (100%)
- Field 8 - Allocation Factor (decimal portion of state within zipcode)
- "25","02026","MA","DEDHAM",71.163741,42.243685,23782,0.003953
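Here’s one way you might parse that sample row, assuming the field layout above. The one trap is longitude: since “West is assumed,” you must add the minus sign yourself to get standard signed coordinates:

```python
import csv
import io

# The sample Gazetteer row from above
raw = '"25","02026","MA","DEDHAM",71.163741,42.243685,23782,0.003953'

row = next(csv.reader(io.StringIO(raw)))
record = {
    "state_fips": row[0],
    "zip": row[1],
    "state": row[2],
    "name": row[3].title(),        # "DEDHAM" -> "Dedham"
    "longitude": -float(row[4]),   # West is assumed: negate for signed degrees
    "latitude": float(row[5]),     # North is assumed: already positive
    "population_2000": int(row[6]),
    "allocation_factor": float(row[7]),
}
print(record["name"], record["latitude"], record["longitude"])
```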
When I talk about reference data at parties, I immediately see eyes glaze over and it’s clear that my fellow party-goers want to escape my enthusiasm for it. But this availability of reference data is really great news! Together with the open source data integration tools like Talend Open Studio, we’re starting to see what I like to call “open source reference data” becoming available. It all makes the price of improving data quality much lower and our future much brighter.
There’s so much to talk about with regard to reference data and so many good sources. I plan to make more posts on this topic, but feel free to post your beloved reference data sources here in the comments section.
Disclaimer: The opinions expressed here are my own and don't necessarily reflect the opinion of my employer. The material written here is copyright (c) 2010 by Steve Sarsfield. To request permission to reuse, please e-mail me.