Friday, December 10, 2010
Six Data Management Predictions for 2011
This time of year, everyone makes prognostications about the state of the data management field for 2011. I thought I’d take my turn by offering my predictions for the coming year.
Data will become more open
In the old days, good quality reference data was an asset kept in the corporate lockbox. If you had a good reference table for common misspellings of parts, cities, or names, for example, the mindset was to keep it close so it wouldn’t fall into the wrong hands. The data might have been sold for profit or simply not made available. Today, there really are no “wrong hands”. Governments and corporations alike are seeing the societal benefits of sharing information. More reference data is there for the taking on the internet, from sites like data.gov and geonames.org. That trend will continue in 2011. Perhaps we’ll even see some of the bigger players make announcements about the availability of their data. Are you listening, Google?
The line between business and IT will become blurry
It’s becoming harder and harder to tell an IT guy from the head of marketing. That’s because, in order to succeed, the IT folks need to become more like the marketers and vice versa. In the coming year, the difference will be even less noticeable as business people get more and more involved in using data to their benefit. Newsflash One: If you’re in IT, you need marketing skills to pitch your projects and get funding. Newsflash Two: If you’re in business, you need to know enough about data management practices to succeed.
Tools will become easier to use
As the business users come into the picture, they will need access to the tools to manage data. Vendors must respond to this new marketplace or die.
Tools will do less heavy lifting
Despite the improvements in the tools, corporations will turn to improving processes and reporting in order to achieve better data management. Dwindling are the days when we’re dealing with data so poorly managed that it requires overly complicated data quality tools. We’re getting better at the data management process, and therefore the burden on the tools is getting lighter. Future tools will focus on supporting process improvement with workflow features, reporting and better graphical user interfaces.
CEOs and Government Officials will gain enlightenment
Feeding off the success of a few pioneers in data governance, as well as the failures of past IT projects, CEOs and government officials will gain enlightenment about managing their data and put teams in place to handle it. It has taken decades of our sweet talk and cajoling for governments and CEOs to achieve enlightenment, but I believe it is practically here.
We will become more reliant on data
Ten years ago, it was difficult to imagine where we’d be today with respect to our data addiction. Today, data is a pervasive part of our internet-connected society, living in our PCs, our TVs, our mobile phones and many other devices. It’s a huge part of our daily lives. As I’ve said in past posts, the world is addicted to data, and that bodes well for anyone who helps the world manage it. In 2011, no matter whether the economy turns up or down, our industry will continue to feed the addiction to good, clean data.
Tuesday, November 30, 2010
Match Mitigation: When Algorithms Aren’t Enough
I’d like to get a little technical in this post. I try to keep my posts business-friendly, but sometimes the detail matters. If none of this post makes any sense to you, I wrote a sort of primer on how matching works in many data quality tools, which you can get here.
Matching Algorithms
When you use a data quality tool, you’re often using matching algorithms and rules to decide whether records match or not. You might be using deterministic algorithms like Jaro, Soundex and Metaphone. You might also be using probabilistic matching algorithms.
In many tools, you can set the rules to be tight, where the software uses tougher criteria to determine a match, or loose, where the software is not so particular. Tight and loose matches are important because you may have strict rules for putting records together, like the customers of a bank, or not-so-strict rules, like when you’re putting together a customer list for marketing purposes.
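To make that concrete, here is a minimal Python sketch of the idea (my own illustration, not any particular vendor’s engine): a simplified Soundex code stands in for the phonetic algorithms, a generic string-similarity ratio stands in for Jaro-style scoring, and the 0.92 and 0.80 cut-offs are invented examples of “tight” and “loose” criteria.

```python
from difflib import SequenceMatcher

def soundex(word: str) -> str:
    """Simplified Soundex: first letter plus up to three digit codes."""
    codes = {c: d for group, d in [("BFPV", "1"), ("CGJKQSXZ", "2"),
                                   ("DT", "3"), ("L", "4"),
                                   ("MN", "5"), ("R", "6")] for c in group}
    word = word.upper()
    result, prev = word[0], codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]

def similarity(a: str, b: str) -> float:
    """Generic 0-to-1 similarity score (a stand-in for Jaro and friends)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

pairs = [("Jon Smith", "John Smyth"), ("Acme Corp", "ACME Corporation")]
for a, b in pairs:
    score = similarity(a, b)
    tight = score >= 0.92                                        # tougher criteria
    loose = score >= 0.80 or soundex(a.split()[0]) == soundex(b.split()[0])
    print(f"{a!r} vs {b!r}: score={score:.2f} tight={tight} loose={loose}")
```

A real matcher weighs several such comparisons per field, but the principle is the same: the tighter the criteria, the fewer and safer the matches, which is why the bank gets the tight rules and the marketing list gets the loose ones.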
What to do with Matches
Once data has been processed through the matcher, there are several possible outcomes. Between any two given records, the matcher may find one of the following (a small sketch after the list shows how these outcomes typically fall out of a similarity score):
- No relationship
- Match – the matcher found a definite match based on the criteria given
- Suspect – the matcher thinks it found a match but is not confident. The results should be manually reviewed.
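Here is a rough sketch of how those three buckets fall out of a score; the 0.95 and 0.80 cut-offs below are made up for illustration and would be tuned per project, not taken from any specific product.

```python
from difflib import SequenceMatcher

MATCH_THRESHOLD = 0.95    # at or above this: definite match
SUSPECT_THRESHOLD = 0.80  # between the two thresholds: needs human review

def classify(record_a: str, record_b: str) -> str:
    """Bucket a pair of records into the three matcher outcomes."""
    score = SequenceMatcher(None, record_a.lower(), record_b.lower()).ratio()
    if score >= MATCH_THRESHOLD:
        return "Match"
    if score >= SUSPECT_THRESHOLD:
        return "Suspect"          # route to a data steward for review
    return "No relationship"

print(classify("123 Main St, Springfield", "123 Main Street, Springfield"))  # lands in the Suspect band
print(classify("123 Main St, Springfield", "45 Oak Ave, Shelbyville"))       # No relationship
```

In practice the score comes from comparing individual fields (name, address, tax ID) with weights, but it is this kind of thresholding that produces the match, suspect and no-relationship piles that the stewardship tools then work through.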
Some of the newer (and cooler) tools offer strategies for dealing with suspect matches. They present the suspect matches in a graphical user interface and let users pick which relationships are accurate and which are not. For example, Talend now offers a data stewardship console that lets you pick and choose the records and attributes that will make up a best-of-breed record.
The goal, of course, is to not have suspect matches, so tuning the match rules to limit suspects is the ultimate aim. The newest tools make this easy; some of the legacy tools make it hard.
Match mitigation is perhaps one of the most often overlooked processes of data quality. Don’t overlook it in your planning and processes.
Tuesday, November 16, 2010
Ideas Having Sex: The Path to Innovation in Data Management
Back in the 1990s or earlier, if you had an idea for a new product, you’d work with an internal team of engineers and build the individual parts. This innovation took time, as you might not always have exactly the right people working on the job. It was slow and tedious. The product was always confined by its own lineage.
The Android phone market is a perfect example of the modern way to innovate. Today, when you want to build something groundbreaking like an Android phone, you pull in expertise from all around the world. Sure, Samsung might make the CPU and video processing chips, but Primax Electronics in Taiwan might make the digital camera and Broadcom in the US makes the touch screen, plus many others. Software vendors push the platform further with their cool apps. Innovation happens at breakneck speed because the Android is a collection of ideas that have sex and produce incredible offspring.
Isn’t that really the model of a modern company? You have ideas getting together and making new ideas. When there is free exchange between people, there is no need to re-invent something that has already been invented. See Matt Ridley’s TED talk for more on this concept, in which the British author argues that, throughout history, the engine of human progress and prosperity has been “ideas having sex.”
The business model behind open source has a similar mission. Open source simply creates better software. Everyone collaborates, not just within one company, but among an Internet-connected, worldwide community. As a result, the open source model often builds higher quality, more secure, more easily integrated software. It does so at a vastly accelerated pace and often at a lower cost.
So why do some industry analysts ignore it? There’s no denying that there are capitalist and financial reasons. I think if an industry analyst were to actually come out and say that the open source solution is the best, it would be career suicide. The old school would shun the analyst, making him less relevant. The way the industry pays and promotes analysts, and vice versa, seems to favor the enterprise application vendors.
Yet the open source community, along with Talend, has developed a very strong data management offering that should be considered top of its class. The solution leverages other cutting-edge technologies. To name just a few examples:
- if you want to scale up, you can use distributed platform technology from Hadoop, which enables it to work with thousands of nodes and petabytes of data.
- very strong enterprise class data profiling.
- matching that users can actually use and tune without having to jump between multiple applications.
- a platform that grows with your data management strategy so that if your future is MDM, you can seamlessly move there without having to learn a new GUI.
Saturday, October 16, 2010
Is 99.8% data accuracy enough?
Ripped from recent headlines, we see how even a 0.2% failure rate can have a big impact.
WASHINGTON (AP) ― More than 89,000 stimulus payments of $250 each went to people who were either dead or in prison, a government investigator says in a new report.
Let’s take a good, hard look at this story. It begins with the US economy slumping. The president proposes, and passes through Congress, one of the biggest stimulus packages ever. The idea sounds good to many: get America working by offering jobs in green energy and shovel-ready infrastructure projects. Among other actions, the plan is to give lower-income people some government money so they can stimulate the economy.
I’m not really here to praise or zing the wisdom of this. I’m just here to give the facts. In hindsight, it appears as though it hasn’t stimulated the economy as many had hoped, but that’s beside the point.
Continuing on, the government issues 52 million people on Social Security a check for $250. It turns out that of that number, nearly 100,000 people were in prison or dead, roughly 0.2% of the checks. Some checks are returned, some are cashed. Ultimately, the government loses $22.3 million on the 0.2% error.
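For what it’s worth, the round numbers check out. A quick back-of-the-envelope calculation using the figures above (the report’s exact counts differ slightly):

```python
checks_issued = 52_000_000   # $250 checks sent to Social Security recipients
bad_checks    = 89_000       # "more than 89,000" went to the dead or imprisoned
check_amount  = 250

print(f"error rate ≈ {bad_checks / checks_issued:.2%}")   # ≈ 0.17%, i.e. roughly 0.2%
print(f"exposure   ≈ ${bad_checks * check_amount:,}")     # $22,250,000, in line with the $22.3 million figure
```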
While $22.3 million is a HUGE number, 0.2% is a tiny number. It strikes at the heart of why data quality is so important. Social Security spokesman Mark Lassiter said, "…Each year we make payments to a small number of deceased recipients usually because we have not yet received reports of their deaths."
There is strong evidence that the SSA is hooked up to the right commercial data feeds and has the processes in place to use them. It seems as though the Social Security Administration is quite proactive in its search for the dead and imprisoned, but people die and go to prison all the time. They also move, get married and become independent of their parents.
If we try to imagine what it would take to achieve closer to 100% accuracy, it would require up-to-the-minute reference data. It seems that the only real solution is to put forth legislation requiring that these life-changing events be reported to the federal government. Should we mandate that the bereaved, or perhaps funeral directors, report a death immediately to a central database? Even with such a law, there still would be a small percentage of checks issued while the recipient was alive and delivered after the recipient was dead. We’d have better accuracy for this issue, but not 100%.
While this story takes a poke at the SSA for sending checks to dead people, I have to applaud their achievement of 99.8% accuracy. It could be a lot worse, America. A lot worse.
Saturday, August 28, 2010
ERP and SCM Data Profiling Techniques
In this YouTube tutorial for Talend, I walk through some techniques for profiling ERP, SCM and materials master data using Talend Open Profiler. In addition to basic profiling, the correlation analysis feature can be used to identify relationships between part numbers and descriptions.
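The video uses Talend Open Profiler, but the basic ideas, completeness, distinct counts and pattern analysis on part numbers, can be sketched in a few lines of Python. The tiny materials-master table below is hypothetical, just to show the shape of the exercise:

```python
import pandas as pd

# Hypothetical materials-master extract; real ERP/SCM columns will differ
df = pd.DataFrame({
    "part_number": ["AB-1001", "AB-1002", "ab1003", None, "AB-1001"],
    "description": ["Hex bolt 10mm", "Hex bolt 12mm", "hex bolt 10 mm", "Washer", None],
})

# Basic column profiling: completeness and distinct counts
print(df.isna().mean().rename("null_ratio"))
print(df.nunique().rename("distinct_values"))

# Pattern analysis: reduce each part number to a shape like "AA-9999"
patterns = (df["part_number"].dropna().str.upper()
              .str.replace(r"[A-Z]", "A", regex=True)
              .str.replace(r"[0-9]", "9", regex=True))
print(patterns.value_counts())   # outlier patterns often signal data-entry problems
```

Talend Open Profiler provides these analyses, plus the correlation views mentioned above, without writing code.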
Monday, August 16, 2010
Data Governance and Data Quality Insider 100th
I have reached my 100th post milestone. I hope you won't mind if I get a little introspective here and tell you a little about my social media journey over these past three years.
How did I get started? One day back in 2007, I disagreed with Vince McBurney’s post (the topic is unimportant now). I responded, and Vince politely told me to shut up and, if I really wanted to have an opinion, to write my own blog. I did. Thanks for the kick in the pants, Vince.
Some of my most popular posts over these past three years have been:
- Probabilistic Matching: Sounds like a good idea, but…
Here, I take a swipe at the sanctity of probabilistic matching. I have probably received the most hate mail from this post. My stance is still that a hybrid approach to matching, using both probabilistic and deterministic techniques, is key to getting good match results. Probabilistic alone is not the solution.
- Data Governance and the Coke Machine Syndrome
I recount a parable about meeting management, given to me by a well-respected former boss. Meetings can take unexpected turns where huge issues get settled in minutes, while insignificant ones eat up the resources of your company. I probably wrote it just after a meeting.
- Data Quality Project Selection
A posting about picking the right data quality projects to work on.
- The “Do Nothing” Option
A posting that recounts a lesson I learned about selling the power of data quality to management.
Blogging has not always been easy. I’ve met some opposition along the way. There were times when my blogging was perceived as somehow threatening to the corporation. At the time, blogging was new and corporations didn't know how to handle it. More companies now have definitive blogging policies and realize the positive impact blogging can have.
What about the people I’ve met? I’ve gained a lot of friendships along the way with people I’ve yet to meet face-to-face. We’re able to build a community here in cyberspace – a data geek community that I am very fond of. I’m hesitant to write a list because I don’t want to leave anyone out, but you know who you are.
If you're thinking of blogging, please, find something you’re passionate about and write. You’ll have a great time!
Thursday, August 12, 2010
Change Management and Data Governance
As I read through the large amount of information on change management, I’m struck by the parallels between change management and data governance. Both focus on process: ensuring that no matter what changes happen in a corporation, whether it’s downsizing or rapid growth, significant changes are implemented in an orderly fashion and make everyone more effective.
On the other hand, humans are resistant to change. Change management aims to gain buy-in from management to achieve the organization's goal of an orderly and effective transformation. Sound familiar? Data governance speaks to the same ability to manage data properly, no matter what growth spurts, mergers or downsizing occur. It is about changing the hearts and minds of individuals to better manage data and achieve more success while doing so.
Change Management Models
As you examine data governance models, look toward the change management models that have been developed by vendors and analysts in the change management space. One that caught my attention was the ADKAR model developed by a company called Prosci. In this model, there are five specific stages that must be realized in order for an organization to successfully change. They are:
- Awareness - An organization must know why a specific change is necessary.
- Desire - The organization must have the motivation and desire to participate in the call for change.
- Knowledge – The organization must know how to change. Knowing why you must change is not enough.
- Ability - Every individual in the company must implement new skills and processes to make the necessary changes happen.
- Reinforcement - Individuals must sustain the changes, making them the new behavior and averting the tendency to revert to their old processes.
I often talk about business users and IT working together to solve the data governance problem. By looking at the extensive information available on change management, you can learn a lot about making changes for data governance.




