
Monday, June 13, 2011

The Differences Between Small and Big Data

There is a lot of buzz today about big data and companies stepping up to meet the challenge of ever-increasing data volumes. At the center of it all are Hadoop and the cloud.  Hadoop can intelligently manage the distribution of processing and your files, handling the infrastructure needed to break big data down into more manageable chunks for processing by multiple servers. Likewise, a cloud strategy can take data management outside the walls of a corporation into a highly scalable infrastructure.

Do you have big data?  It’s difficult to know precisely, because big data is vaguely defined. You may qualify for big data technology if you face hundreds of gigabytes of data, or it may take hundreds or thousands of terabytes. The classification of “big data” is not strictly a matter of data size; other business factors come into play, too. Your data management infrastructure needs to take into account future data volumes, peaks and lulls in demand, business requirements and much more.

Small and Medium-Sized Data

What about “small” and medium-sized data? For example, data from spreadsheets, the occasional flat file, leads from a trade show, and catalog data from vendors may be vital to your business processes. With a new industry focus on transparency, business user involvement and sharing of data, small data is a constant issue.  Spreadsheets and flat files are the preferred method of sharing data today because most companies have some process for handling them. When you get these small to medium-sized data sets, it is still necessary to:
  • profile them
  • integrate them into your relational database
  • aggregate data from these sources, or extract only the vital parts
  • apply data quality standards when necessary
  • use them as part of a master data management (MDM) initiative
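These steps don’t require heavyweight tooling. As a rough sketch (the file contents and table name below are made up for illustration), profiling a small flat file and loading it into a relational database can take just a few lines of Python:

```python
import csv
import io
import sqlite3

# Illustrative flat file: trade-show leads with one missing email.
data = io.StringIO("name,email\nAcme Corp,info@acme.example\nBeta LLC,\n")
rows = list(csv.DictReader(data))

# Profile: count missing values per column before loading anything.
missing = {col: sum(1 for r in rows if not r[col]) for col in rows[0]}
print(missing)  # {'name': 0, 'email': 1}

# Integrate: load the rows into a relational table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE leads (name TEXT, email TEXT)")
con.executemany("INSERT INTO leads VALUES (?, ?)",
                [(r["name"], r["email"]) for r in rows])
print(con.execute("SELECT COUNT(*) FROM leads").fetchone()[0])  # 2
```

The profiling step flags quality problems (here, a missing email) before the data ever reaches your database.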

The Differing Goals of Big Data and Little Data
With big data, the concern is usually about your data management technology’s ability to handle massive quantities in order to provide you with meaningful aggregates.  You need solutions that will scale to meet your data management needs.  However, handling small and medium data sets is more about short- and long-term costs.  How can you quickly and easily integrate data without a lot of red tape, big license fees, pain and suffering?

Think about it. When you need to handle small and medium data, you have options:
  • Hand-coding: Hand-coding is sometimes faster than any solution, and it may still be OK for ad-hoc, one-off data integration.  Once you find yourself hand-coding again and again, you’ll find yourself rethinking that strategy. Eventually managing all that code will waste time and cost you a bundle. If your data volumes grow, hand-coded solutions quickly become obsolete because they don’t scale. Hand-coding gets high marks on speed to value, but falters in sustainability and long-term costs.
  • Open Source: Open source data management tools provide a quick way to get started, low overall costs and high sustainability.  By just downloading and learning the tools, you’re on your way to getting data management done.  The open source solutions may have some limitations on scalability, but most open source providers have low-cost commercial upgrades that meet these needs.  In other words, it's easy to start today and leverage Hadoop and the Cloud if you need it later. Open source gets high marks on speed to value, sustainability and costs.
  • Traditional Data Management Vendors: Small data is a tough issue for the mega-vendors. Even for 50K-100K records, the license cost in both the short term and long term can be prohibitive.  The mega-vendor solutions do tend to scale well, making them sustainable at a cost. However, mergers do happen in the data management business, and the sustainability of a product can be affected by them.  Commercial vendors get respectable marks in speed to value and sustainability, but falter with high up-front costs and maintenance fees.
I've heard it a million times in this business – start small and fast with technology that gives you a quick success but also scales to future tasks.

    Monday, May 16, 2011

    The Butterfly Effect and Data Quality

    I just wrote a paper called the ‘Butterfly Effect’ of poor data quality for Talend.

    The term butterfly effect refers to the way a minor event – like the movement of a butterfly’s wing – can have a major impact on a complex system – like the weather. The movement of the butterfly wing represents a small change in the initial condition of the system, but it starts a chain of events: moving pollen through the air, which causes a gazelle to sneeze, which triggers a stampede of gazelles, which raises a cloud of dust, which partially blocks the sun, which alters the atmospheric temperature, which ultimately alters the path of a tornado on the other side of the world.

    Enterprise data is equally susceptible to the butterfly effect.  When poor quality data enters the complex system of enterprise data, even a small error – transposed letters in a street address or part number – can lead to 1) revenue loss; 2) process inefficiency; and 3) failure to comply with industry and government regulations. Organizations depend on the movement and sharing of data throughout the organization, so the impact of data quality errors is costly and far-reaching. Data issues often begin with a tiny mistake in one part of the organization, but the butterfly effect can produce far-reaching results.

    The Pervasiveness of Data
    When data enters the corporate ecosystem, it rarely stays in one place.  Data is pervasive. As it moves throughout a corporation, data impacts systems and business processes. The negative impact of poor data quality reverberates as it crosses departments, business units and cross-functional systems.
    • Customer Relationship Management (CRM) - By standardizing customer data, you will be able to offer better, more personalized customer service.  And you will be better able to contact your customers and prospects for cross-sell, up-sell, notification and services.
    • ERP / Supply Chain Data - If you have clean data in your supply chain, you can achieve some tangible benefits.  First, the company will have a clear picture about delivery times on orders because of a completely transparent supply chain. Next, you will avoid unnecessary warehouse costs by holding the right amount of inventory in stock.  Finally, you will be able to see all the buying patterns and use that information when negotiating supply contracts.
    • Orders / Billing System - If you have clean data in your billing systems, you can achieve the tangible benefits of more accurate financial reporting and correct invoices that reach the customer in a timely manner.  An accurate bill not only leads to trust among workers in the billing department, but customer attrition rates will be lower if invoices are delivered accurately and on time.
    • Data Warehouse - If you have standardized the data feeding into your data warehouse, you can dramatically improve business intelligence. Employees can access the data warehouse and be assured that the data they use for reports, analysis and decision making is accurate. Using the clean data in a warehouse can help you find trends, see relationships between data, and understand the competition in a new light.
    To read more about the butterfly effect of data quality, download it from the Talend site.

    Monday, February 22, 2010

    Referential Treatment - The Open Source Reference Data Trend

    Reference data can be used in a huge number of data quality and data enrichment processes.  The simplest example is a table that contains cities and their associated postal codes – you can use an ETL process to make sure that all your customer records that contain 02026 for a postal code always refer to the standardized “Dedham, MA” for the city and state, not variations like “Deadham Mass”  or “Dedam, Massachusetts”.
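That kind of standardization is simple to sketch in code. Here is a minimal, illustrative version of the lookup step (the reference table below is a stand-in for a real postal-code source, not actual reference data):

```python
# Illustrative reference table: postal code -> standardized (city, state).
REFERENCE = {
    "02026": ("Dedham", "MA"),
}

def standardize(record):
    """Overwrite free-text city/state with the reference values for the zip."""
    match = REFERENCE.get(record["zip"])
    if match:
        record["city"], record["state"] = match
    return record

# Variations like "Deadham Mass" all collapse to the standard form.
record = {"zip": "02026", "city": "Deadham Mass", "state": ""}
print(standardize(record))  # city becomes 'Dedham', state becomes 'MA'
```

In a real ETL job the reference table would be loaded from the source data itself, but the lookup-and-overwrite pattern is the same.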

    Reference data is not limited to customer addresses, however. If everyone were to use the same reference data for parts, you could easily exchange procurement data between partners.  Restricting a table to an allowed set of values also supports validation.  By having standards for procurement, supply chain, finance and accounting data, processes become more efficient.  Organizations like the ISO and ECCMA are working on that.

    Availability of Reference Data
    In the past, it was difficult to get your hands on reference data. Long ago, no one wanted to share reference data with you - you had to send your customer data to a service provider and get the enriched data back.  Others struggled to develop reference data on their own. Lately I’m seeing more and more high quality reference data available for free on the Internet.   For data jockeys, these are good times.

    GeoNames
    A good example of this is GeoNames.  The GeoNames geographical database is available for download free of charge under a Creative Commons attribution license. According to the web site, it “aggregates over 100 different data sets to build a list containing over eight million geographical names and consists of 7 million unique features whereof 2.6 million populated places and 2.8 million alternate names. The data is accessible free of charge through a number of web services and a daily database export.”

    GeoNames combines geographical data such as names of places in various languages, elevation, population and others from various sources. All lat/long coordinates are in WGS84 (World Geodetic System 1984). Like Wikipedia, users may manually edit, correct and add new names.

    US Census Data
    Another rich set of reference data is the US Census “Gazetteer” data. Courtesy of the US government, you can download a database with the following fields:
    • Field 1 - State Fips Code
    • Field 2 - 5-digit Zipcode
    • Field 3 - State Abbreviation
    • Field 4 - Zipcode Name
    • Field 5 - Longitude in Decimal Degrees (West is assumed, no minus sign)
    • Field 6 - Latitude in Decimal Degrees (North is assumed, no plus sign)
    • Field 7 - 2000 Population (100%)
    • Field 8 - Allocation Factor (decimal portion of state within zipcode)
    So, our Dedham, MA entry includes this data:
    • "25","02026","MA","DEDHAM",71.163741,42.243685,23782,0.003953
    It’s Really Exciting!
    When I talk about reference data at parties, I immediately see eyes glaze over and it’s clear that my fellow party-goers want to escape my enthusiasm for it.  But this availability of reference data is really great news! Together with the open source data integration tools like Talend Open Studio, we’re starting to see what I like to call “open source reference data” becoming available. It all makes the price of improving data quality much lower and our future much brighter.

    There’s so much to talk about with regard to reference data and so many good sources.  I plan to make more posts on this topic, but feel free to post your beloved reference data sources here in the comments section.

    Thursday, January 21, 2010

    ETL, Data Quality and MDM for Mid-sized Business


    Is data quality a luxury that only large companies should be able to afford?  Of course the answer is no. Your company should be paying attention to data quality whether you are a Fortune 1000 company or a startup. Like a toothache, poor data quality will never get better on its own.

    As a company naturally grows, the effects of poor data quality multiply.  When a small company expands, it naturally develops new IT systems. Mergers often bring in new IT systems, too. The impact of poor data quality slowly invades and hinders the company’s ability to service customers, keep the supply chain efficient and understand its own business. Paying attention to data quality early and often is a winning strategy for even the small and medium-sized enterprise (SME).

    However, SMEs have challenges with the investment needed for enterprise-level software. While it’s true that the benefit often outweighs the costs, it is difficult for the typical SME to invest in the license, maintenance and services needed to implement a major data integration, data quality or MDM solution.

    At the beginning of this year, I started with a new employer, Talend. I became interested in them because they were offering something completely different in our world – open source data integration, data quality and MDM.  If you go to the Talend Web site, you can download some amazing free software, like:
    • a fully functional, very cool data integration package (ETL) called Talend Open Studio
    • a data profiling tool, called Talend Open Profiler, providing charts and graphs and some very useful analytics on your data
    The two packages sit on top of a database, typically MySQL – also an open source success.

    For these solutions, Talend uses a business model similar to what my friend Jim Harris has just blogged about – Freemium. Under this new model, free open source content is made available to everyone—providing the opportunity to “up-sell” premium content to a percentage of the audience. Talend works like this.  You can enhance your experience from Talend Open Studio by purchasing Talend Integration Suite (in various flavors).  You can take your data quality initiative to the next level by upgrading Talend Open Profiler to Talend Data Quality.

    If you want to take the combined data integration and data quality to an even higher level, Talend just announced a complete Master Data Management (MDM) solution, which you can use in a more enterprise-wide approach to data governance. There’s a very inexpensive place to start and an evolutionary path your company can take as it matures its data management strategy.

    The solutions have been made possible by the combined efforts of the open source community and Talend, the corporation. If you’d like, you can take a peek at some source code, use the basic software and try your hand at coding an enhancement. Sharing that enhancement with the community will only lead to a world full of better data, and that’s a very good thing.

    Thursday, October 22, 2009

    Book Review: Data Modeling for Business


    A couple of weeks ago, I book-swapped with author Donna Burbank. She has a new book entitled Data Modeling for Business. Donna, an experienced consultant by trade, has teamed up with Steve Hoberman, a previously published author and technologist, and Chris Bradley, also a consultant, for an excellent exploration of the process of creating a data model. With a subtitle like “A handbook for Aligning the Business with IT using a High-Level Data Model,” I knew I was going to find some value in the swap.

    The book describes in plain English the proper way to create a data model, but that simple description doesn’t do it justice. The book is designed for those who are learning from scratch – those who only vaguely understand what a data model is. It uses commonly understood concepts to describe data modeling concepts. The book describes the impact of the data model on the project’s success and digs into setting up data definitions and the levels of detail necessary for them to be effective. All of this is accomplished in a plain-talk, straightforward tone without the pretentiousness you sometimes get in books about data modeling.

    We often talk about the need for business and IT to work together to build a data governance initiative. But many, including myself, have pointed to the communication gap that can exist in a cross-functional team. In order to bridge the gap, a couple of things need to happen. First, IT teams need to expand their knowledge of business processes, budgets and corporate politics. Second, business team members need to expand their knowledge of metadata and data modeling. This book provides an insightful education for the latter. In my book, the Data Governance Imperative, the goal was the former.

    The book is well-written and complete. It’s a perfect companion for those who are trying to build a knowledgeable, cross-functional team for data warehouse, MDM or data governance projects. Therefore, I’ve added it to my recommended reading list on my blog.

    Monday, May 4, 2009

    Don’t Sweat the Small Stuff, Except in Data Quality

    April was a busy month. I was the project manager on a new web application, nearly completed my first German web site (also as project manager) and released the book “Data Governance Imperative”. All this real work has taken me away from something I truly love – blogging.

    I did want to share something that affected my project this month, however. Data issues can come in the smallest of places and can have a huge effect on your timeline.

    For the web project I completed this month, the goal was to replace a custom-coded application with a similar application built within a content management system. We had to migrate the login data of the application’s users, all with various access levels, to the new system.

    During go live, we were on a tight deadline to migrate the data, do final testing of the new application and seamlessly switch everyone over. That all had to happen on the weekend. No one would be the wiser come Monday morning. If you’ve ever done an enterprise application upgrade, you may have followed a similar plan.

    We had done our profiling and knew that there were no data issues. However, when the migration actually took place, lo and behold – the old system allowed # as a character in the username and password while the new system didn’t. It forced us to stop the migration and write a rule to handle the issue. Even with this simple issue, the timeline came close to missing its Monday morning deadline.

    Should we have spotted that issue? Yes, in hindsight we could have better understood the system restrictions on the username and password and set up a custom business rule in the data profiler to test it. We might have even forced the users to change the # before the switch while they were still using the old application.
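    The rule itself is trivial once you know to write it. A sketch of the kind of custom check we should have run up front (the allowed character set here is illustrative, standing in for the real restrictions of the target system):

```python
import re

# Hypothetical target-system rule: alphanumerics plus a few punctuation
# characters; '#' is not allowed, which is what broke our migration.
ALLOWED = re.compile(r"^[A-Za-z0-9_.@-]+$")

def find_violations(usernames):
    """Return the usernames that the target application would reject."""
    return [u for u in usernames if not ALLOWED.match(u)]

print(find_violations(["jsmith", "mary#1", "bob.lee"]))  # ['mary#1']
```

    Running a rule like this during profiling would have surfaced the offending accounts days before go-live instead of mid-migration.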

    The experience reminds me that data quality is not just about making the data right; it’s about making the data fit for business purpose – fit for the target application. Data that is correct for one legacy application can be unfit for another. It reminds me that you can plan and test all you want, but you have to be ready for hiccups during the go-live phase of the project. The tools, like profiling, are there to help you limit the damage. We were lucky in that this database was relatively small and a reload was relatively simple once we figured it all out. For bigger projects, more complete staging – making a dry run before the go-live phase – would have been more effective.

    Disclaimer: The opinions expressed here are my own and don't necessarily reflect the opinion of my employer. The material written here is copyright (c) 2010 by Steve Sarsfield. To request permission to reuse, please e-mail me.