Monday, June 13, 2011

The Differences Between Small and Big Data

There is a lot of buzz today about big data and about companies stepping up to meet the challenge of ever-increasing data volumes. At the center of it all are Hadoop and the Cloud. Hadoop can intelligently manage the distribution of processing and your files. It manages the infrastructure needed to break big data down into more manageable chunks for processing by multiple servers. Likewise, a cloud strategy can take data management outside the walls of a corporation and into a highly scalable infrastructure.

Do you have big data? It’s difficult to know precisely whether you do, because big data is vaguely defined. You may qualify for big data technology if you face hundreds of gigabytes of data, or it may take hundreds or thousands of terabytes. The classification of “big data” is not defined strictly by data size, but by other business factors, too. Your data management infrastructure needs to take into account factors like future data volumes, peaks and lulls in requirements, business requirements and much more.

Small and Medium-Sized Data

What about “small” and medium-sized data? For example, data from spreadsheets, the occasional flat file, leads from a trade show, and catalog data from vendors may be vital to your business processes. With a new industry focus on transparency, business user involvement and sharing of data, small data is a constant issue. Spreadsheets and flat files are the preferred method to share data today because most companies have some process for handling them. When you get these small to medium-sized data sets, it is still necessary to:
  • profile them
  • integrate them into your relational database
  • aggregate data from these sources, or extract only the vital parts
  • apply data quality standards when necessary
  • use them as part of a master data management (MDM) initiative
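
As a rough illustration of the first step in the list above, here is a minimal sketch of profiling a small flat file with nothing but the Python standard library. The function name profile_rows, the column names and the sample rows are all hypothetical, made up for this example; real data would normally be read from a file rather than from an inline string.

```python
import csv
import io
from collections import Counter


def profile_rows(rows):
    """Count filled, empty and distinct values for each column."""
    stats = {}
    for row in rows:
        for column, value in row.items():
            col = stats.setdefault(
                column, {"filled": 0, "empty": 0, "values": Counter()}
            )
            text = (value or "").strip()
            if text:
                col["filled"] += 1
                col["values"][text] += 1
            else:
                col["empty"] += 1
    return {
        column: {
            "filled": c["filled"],
            "empty": c["empty"],
            "distinct": len(c["values"]),
        }
        for column, c in stats.items()
    }


# Hypothetical trade-show leads. In practice this would be something like
# csv.DictReader(open("leads.csv", newline="")).
SAMPLE = """name,email,state
Ann Lee,ann@example.com,MA
Bob Ray,,MA
Ann Lee,ann@example.com,NH
"""

profile = profile_rows(csv.DictReader(io.StringIO(SAMPLE)))
for column, summary in profile.items():
    print(column, summary)
```

Even a quick pass like this shows which columns have gaps and which contain repeated values, which is useful to know before loading the file into a relational database.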

The Different Goals of Big Data and Little Data
With big data, the concern is usually about your data management technology’s ability to handle massive quantities in order to provide you with aggregates that are meaningful. You need solutions that will scale to meet your data management needs. However, handling small and medium data sets is more about short-term and long-term costs. How can you quickly and easily integrate data without a lot of red tape, big license fees, pain and suffering?

Think about it. When you need to handle small and medium data, you have options:
  • Hand-coding: Hand-coding is sometimes faster than any other solution, and it may still be OK for ad hoc, one-off data integration. Once you find yourself hand-coding again and again, you’ll find yourself rethinking that strategy. Eventually, managing all that code will waste time and cost you a bundle. If your data volumes grow, hand-coded solutions quickly become obsolete because they do not scale. Hand-coding gets high marks on speed to value, but falters in sustainability and long-term costs.
  • Open Source: Open source data management tools provide a quick way to get started, low overall costs and high sustainability. By just downloading and learning the tools, you’re on your way to getting data management done. The open source solutions may have some limitations on scalability, but most open source providers have low-cost commercial upgrades that meet these needs. In other words, it's easy to start today and leverage Hadoop and the Cloud if you need them later. Open source gets high marks on speed to value, sustainability and costs.
  • Traditional Data Management Vendors: Small data is a tough issue for the mega-vendors. Even for 50K-100K records, the license cost in both the short term and the long term could be prohibitive. The mega-vendor solutions do tend to scale well, making them sustainable at a cost. However, mergers in the data management business do happen, and the sustainability of a product can be affected by these mergers. Commercial vendors get respectable marks in speed to value and sustainability, but falter on high up-front costs and maintenance fees.
I've heard it a million times in this business - start small and fast with technology that gives you a fast success but also scales to future tasks.

    3 comments:

    Paige Roberts said...

    Really liked this post.

    It points out one of the aspects people don't often think about in relation to the definition of big data. It isn't just about data volume and complexity, but about the complexity of what you need to do with it. Data that would only be considered medium sized can become a big data problem when the work you need to do with it overwhelms your available compute resources.

    A million rows may not seem like a large data volume, but when you have to do fuzzy matching for deduping, and that means a million times a million comparisons, assuming you're only matching on one field which isn't likely, it can bring traditional systems to their knees.

    Make that data 10 million rows or 100 million and you really have a huge headache.

    Big data scaling needs can affect even medium-sized data volumes.

    And on another note, you hit on one of my pet peeves: commercial integration software does not have to cost as much as a new office building. Good software at a good price is not a mythical beast.

    Just sayin.

    Paige

    Steve Sarsfield said...

    Thanks Paige. The job of the vendors needs to be to lower the barrier to adoption. It shouldn't be so difficult and expensive to deal with 100,000 records or an Excel spreadsheet or a text file. I think open source is now a key player in data management for that reason.
    The matching issue you mentioned really can be solved with blocking. It's a feature of many solutions that allows you to eliminate pairs of records that have no resemblance to one another; there's no need to send them into the complicated algorithms in the matcher. But again, you shouldn't need a PhD in mathematics to set up blocks and optimize your match.
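
The blocking idea in the comment above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular product does it: the blocking rule (first letter of the surname plus ZIP code), the sample records, the 0.8 similarity threshold and the function names are all hypothetical. For reference, a naive all-pairs comparison of n records needs n(n-1)/2 comparisons.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def block_key(record):
    # Assumed blocking rule: first letter of the surname plus the ZIP code.
    return (record["last"][:1].upper(), record["zip"])


def candidate_pairs(records):
    """Yield only pairs of records that share a block key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[block_key(record)].append(record)
    for members in blocks.values():
        yield from combinations(members, 2)


def similar(a, b, threshold=0.8):
    # Stand-in for the "complicated algorithms in the matcher".
    ratio = SequenceMatcher(None, a["last"].lower(), b["last"].lower()).ratio()
    return ratio >= threshold


# Hypothetical sample records.
records = [
    {"id": 1, "last": "Sarsfield", "zip": "02110"},
    {"id": 2, "last": "Sarsfeild", "zip": "02110"},
    {"id": 3, "last": "Roberts", "zip": "78701"},
    {"id": 4, "last": "Rivera", "zip": "78701"},
    {"id": 5, "last": "Smith", "zip": "10001"},
]

all_pairs = len(records) * (len(records) - 1) // 2
pairs = list(candidate_pairs(records))
matches = [(a["id"], b["id"]) for a, b in pairs if similar(a, b)]

print(f"all pairs: {all_pairs}, pairs after blocking: {len(pairs)}")
print("likely duplicates:", matches)
```

The trade-off is that a blocking rule that is too strict can hide true duplicates: a typo in the ZIP code would put two matching records in different blocks, so the choice of key matters.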

    Sydney said...

    Great post!

    Disclaimer: The opinions expressed here are my own and don't necessarily reflect the opinion of my employer. The material written here is copyright (c) 2010 by Steve Sarsfield. To request permission to reuse, please e-mail me.