Monday, April 17, 2017

Avoiding the three common myths of big data



There are many common myths when it comes to big data analytics. Like the lost city of Atlantis and the Bermuda Triangle, they seem to be ubiquitous in the lore of big data. Let’s explore three of them.

Myth One:  Big Data = Hadoop
You often see the discussion about big data transition right into how to solve any issue on Hadoop.  However, Hadoop is not the only solution for big data analytics.  Many of the bigger vendors in the RDBMS space had been handling very large data sets for years before the emergence of Hadoop-based big data. The fact is, Hadoop is not a database and has severe limitations on 1) the depth of analytics it can perform; 2) the number of concurrent queries it can handle; and 3) support for database standards like ACID compliance, SQL compliance and more.
For example, Vertica has been handling huge volumes of data for years.  One customer loads 60 TB per hour into Vertica and has thousands of users (and applications) running analytics on it. That is an extreme example of big data, but it is proof that other solutions can scale to almost any workload. Hadoop is fantastic on the cost front, but it is not the only solution for big data.
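To give a concrete picture of what bulk loading looks like, here is a minimal sketch using the vertica_python client. The hostname, credentials, table name, and file path are placeholders I've made up for illustration; a load measured in terabytes per hour would run many such COPY streams in parallel across nodes and files, not a single script.

# Minimal bulk-load sketch with the vertica_python client.
# Connection details, table name, and file path are all placeholders.
import vertica_python

conn = vertica_python.connect(host="vertica-host", port=5433,
                              user="dbadmin", password="secret",
                              database="analytics")
try:
    cur = conn.cursor()
    # COPY ... DIRECT streams rows straight into Vertica's columnar storage.
    with open("/data/events.csv", "rb") as f:
        cur.copy("COPY events FROM STDIN DELIMITER ',' DIRECT", f)
    conn.commit()
finally:
    conn.close()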

Myth Two: Databases are too expensive and incapable of big data
I see in the media and in the white papers I read that relational databases aren’t capable of performing analytics on big data.  It’s true that some RDBMSs cannot handle big data. It’s also true that some legacy databases charge you a lot and yet don’t seem to be able to scale. However, databases like Vertica that have adopted a columnar, MPP architecture, greater scalability and a simplified pricing model will often fit the bill for many companies.
These systems are often not perceived as cost-effective solutions. The truth is that you’re paying for a staff of engineers who can debug and build stronger, better products.  Although open source is easy to adopt and easy to test, most companies I see invest more in engineering to support the open source solutions. You can pay licensing costs or you can pay engineers, but either way, there is cost.
One of the biggest benefits of open source is that it has driven down the cost of all analytical platforms, so some of the new platforms like Vertica have much lower costs than your legacy data warehouse technology.

Myth Three: Big Data = NoSQL or Spark
Again, I see other new technologies being described as the champion of big data. The truth is that the use cases for Hadoop, NoSQL and Spark are all slightly different.  These nuances are crucial when deciding how to architect your big data platform.
NoSQL is best when you don’t want to spend the time putting structure on the data.  You can load data into NoSQL databases with less attention to structure and analyze it.  However, it’s the optimizations and the way that data is stored that make a platform capable of big data analytics at the petabyte scale, so don’t expect a schema-light NoSQL store to scale for deep analytics.

Spark is great for fast analytics in memory, and particularly operational analytics, but it’s also hard to scale if you need to keep all of the data in memory in order to run fast.  That hardware gets expensive.  Most successful architectures that I’ve seen use Spark for fast-running queries on data streams, then hand the data off to other solutions for the deep analysis.

Vertica and other solutions are really best for deep analysis of a lot of data and potentially a lot of concurrent users.  Analytical systems need to support things like mixed workload management so that if you have concurrent users and a whopper of a query comes in, it won’t eat up all the resources and drag down the shorter queries. Analytical systems also need optimizations for disk access, since you can’t always load petabytes into memory. This is the domain of Vertica.
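To make that handoff pattern concrete, here is a minimal PySpark sketch using a recent Structured Streaming API: Spark does the fast windowed aggregation on a stream, and each micro-batch is pushed over JDBC to a warehouse such as Vertica for deep analysis. The Kafka broker and topic, hostnames, table name, and credentials are all hypothetical, and the Kafka and Vertica JDBC connector jars are assumed to be on the classpath; treat this as a sketch of the pattern, not a reference implementation.

# Sketch: fast stream aggregation in Spark, handed off over JDBC to a
# warehouse (e.g. Vertica) for deep analysis. All names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, count, col

spark = SparkSession.builder.appName("stream-handoff").getOrCreate()

# Read a stream of events from Kafka (broker address and topic are made up).
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "clicks")
          .load())

# Fast, in-memory work in Spark: count events per one-minute window.
counts = (events
          .groupBy(window(col("timestamp"), "1 minute"))
          .agg(count("*").alias("events"))
          .select(col("window.start").alias("window_start"),
                  col("events")))

# Hand each micro-batch off to the warehouse for deep, disk-optimized analysis.
def hand_off(batch_df, batch_id):
    (batch_df.write
     .format("jdbc")
     .option("url", "jdbc:vertica://vertica-host:5433/analytics")
     .option("driver", "com.vertica.jdbc.Driver")
     .option("dbtable", "minute_counts")
     .option("user", "dbadmin")
     .option("password", "secret")
     .mode("append")
     .save())

query = (counts.writeStream
         .outputMode("update")
         .foreachBatch(hand_off)
         .start())
query.awaitTermination()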

Today’s Modern Architecture
In today’s modern architecture, you may have to rely on a multitude of solutions to solve your big data challenges.  If you have only a couple of terabytes, almost any of the solutions mentioned will do the trick.  However, if you eventually want to scale into the tens or hundreds of terabytes (or more), using one solution for a varied analytical workload will start to show signs of strain. It’s then that you need to explore a hybrid solution and use the right tool for the right job.

Disclaimer: The opinions expressed here are my own and don't necessarily reflect the opinion of my employer. The material written here is copyright (c) 2010 by Steve Sarsfield. To request permission to reuse, please e-mail me.