There are many common myths when it comes to big data
analytics. Like the lost city of Atlantis and the Bermuda Triangle, they seem
to turn up everywhere big data is taught and discussed. Let’s explore three of them.
Myth One: Big Data = Hadoop
You often see discussions of big data transition straight
into how to solve every problem with Hadoop.
However, Hadoop is not the only solution for big data analytics. Many of the bigger vendors in the RDBMS space
have been handling very large data sets for years, since before the emergence of
Hadoop-based big data. The fact is, Hadoop is not a database, and it has severe
limitations on 1) the depth of analytics it can perform; 2) the number of concurrent
queries it can handle; and 3) database standards like ACID compliance, SQL
compliance, and more.
For example, Vertica has been handling enormous data volumes for years. One customer loads 60 TB per hour into
Vertica and has thousands of users (and applications) running analytics on it.
That is an extreme example of big data, but it shows that other solutions
can scale to almost any workload. Hadoop is fantastic on the cost front, but it is
not the only solution for big data.
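To make “depth of analytics” concrete, here is a minimal sketch of the kind of windowed, analytical SQL that a mature relational engine runs natively in a single statement. It assumes the vertica_python client and a hypothetical sales table; the connection details are placeholders, and this is meant as an illustration rather than a benchmark.

```python
# A minimal sketch: running an analytical (windowed) SQL query against a
# relational engine. The connection details and the "sales" table are
# hypothetical; this only illustrates the class of query, not a real system.
import vertica_python  # assumes the Vertica Python client is installed

conn_info = {
    "host": "warehouse.example.com",  # placeholder
    "port": 5433,
    "user": "dbadmin",
    "password": "secret",
    "database": "analytics",
}

# A rolling 30-day revenue per region, computed by the database itself.
QUERY = """
    SELECT
        region,
        sale_date,
        revenue,
        SUM(revenue) OVER (
            PARTITION BY region
            ORDER BY sale_date
            ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
        ) AS rolling_30_day_revenue
    FROM sales
    ORDER BY region, sale_date
"""

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    cur.execute(QUERY)
    for region, sale_date, revenue, rolling in cur.fetchall():
        print(region, sale_date, revenue, rolling)
```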
Myth Two: Databases
are too expensive and incapable of big data
I see claims in the media and in the white papers I read that
relational databases aren’t capable of performing analytics on big data. It’s true that some RDBMSs
cannot handle big data. It’s also true that some legacy databases charge you a
lot and yet don’t seem to be able to scale. However, database vendors like
Vertica that have adopted columnar, MPP architectures, along with greater scalability and a
simplified pricing model, will often fit the bill for many companies.
These systems are often not perceived as cost-effective
solutions. The truth is that you’re paying for a staff of engineers who can
debug and build stronger, better products.
Although open source is easy to adopt and easy to test, most companies I
see invest more in engineering to support the open source solutions. You can
pay licensing costs or you can pay engineers, but either way, there is cost.
One of the biggest benefits of open source is that it has
driven down the cost of all analytical platforms, so some of the new platforms
like Vertica have much lower costs than your legacy data warehouse technology.
Myth Three: Big Data
= NoSQL or Spark
Again, I see other new technologies being described as the champions
of big data. The truth is that the use cases for Hadoop, NoSQL, and Spark are all
slightly different. These nuances are
crucial when deciding how to architect your big data platform.
NoSQL is best when you don’t want to spend the time putting
structure on the data up front. You can load data
into NoSQL databases with little attention to structure and start analyzing it. However, it’s the storage
optimizations and the way data is physically organized that make a platform capable of big data analytics at the petabyte
scale, so don’t expect that schema flexibility alone to get you there.
Spark is great for fast analytics in memory, and particularly for operational analytics, but it’s also hard
to scale if you need to keep all of the data in memory in order to run
fast; that hardware gets expensive. Most successful architectures
that I’ve seen use Spark for fast-running queries on data streams, then
hand the data off to other solutions for the deep analysis (a sketch of that hand-off follows below).
Vertica and similar solutions are really best
for deep analysis of a lot of data, potentially with a lot of concurrent
users. Analytical systems need to
support things like mixed workload management, so that if you have concurrent
users and a whopper of a query comes in, it won’t eat up all the resources and
drag down the shorter queries. They also need optimizations for disk
access, since you can’t always load petabytes into memory. This is the domain
of Vertica.
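Here is a minimal sketch of that hand-off pattern, assuming a Kafka stream, the PySpark Structured Streaming API, and a JDBC-accessible warehouse. The broker, topic, table, and credentials are hypothetical placeholders, and the Kafka connector plus a JDBC driver would need to be on Spark’s classpath.

```python
# A minimal sketch (not production code): Spark handles the fast, in-memory
# aggregation over a stream, then each micro-batch is handed off to an
# analytical database over JDBC for deep, historical analysis.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-then-hand-off").getOrCreate()

# Fast path: read click events from Kafka and keep a rolling per-minute count.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
    .option("subscribe", "clicks")                      # hypothetical topic
    .load()
    .selectExpr("CAST(value AS STRING) AS page", "timestamp")
)

per_minute = (
    events
    .withWatermark("timestamp", "2 minutes")
    .groupBy(F.window("timestamp", "1 minute"), "page")
    .count()
)

def hand_off(batch_df, batch_id):
    """Append each micro-batch to the analytical store for deep analysis.
    (Upsert/deduplication of updated windows is omitted for brevity.)"""
    (batch_df.write
        .format("jdbc")
        .option("url", "jdbc:vertica://warehouse:5433/analytics")  # hypothetical DSN
        .option("dbtable", "public.page_views_by_minute")
        .option("user", "dbadmin")
        .option("password", "secret")
        .mode("append")
        .save())

query = (
    per_minute.writeStream
    .outputMode("update")
    .foreachBatch(hand_off)
    .start()
)
query.awaitTermination()
```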
Today’s Modern
Architecture
In today’s architectures, you may have to rely on a
multitude of solutions to solve your big data challenges. If you have only a couple of terabytes,
almost any of the solutions mentioned will do the trick. However, if you eventually want to scale into
the tens or hundreds of terabytes (or more), using one solution for a varied
analytical workload will start to show signs of strain. It’s then that you need
to explore a hybrid solution and use the right tool for the right job.