Monday, April 17, 2017

Avoiding the three common myths of big data



There are many common myths when it comes to big data analytics. Like the lost city of Atlantis and the Bermuda Triangle, they keep resurfacing in discussions of big data. Let’s explore three of them.

Myth One:  Big Data = Hadoop
You often see a discussion about big data transition right into how to solve any issue on Hadoop. However, Hadoop is not the only solution for big data analytics. Many of the bigger vendors in the RDBMS space were handling very large data sets for years before the emergence of Hadoop-based big data. The fact is, Hadoop is not a database, and it has severe limitations on 1) the depth of analytics it can perform; 2) the number of concurrent queries it can handle; and 3) database standards like ACID compliance, SQL compliance and more.
For example, Vertica has been handling huge volumes of data for years. One customer loads 60 TB per hour into Vertica and has thousands of users (and applications) running analytics on it. That is an extreme example of big data, but it is proof that other solutions can scale to almost any workload. Hadoop is fantastic on the cost front, but it is not the only solution for big data.

Myth Two: Databases are too expensive and incapable of big data
I see claims in the media and in the white papers I read that relational databases aren’t capable of performing analytics on big data. It’s true that some RDBMSs cannot handle big data. It’s also true that some legacy databases charge you a lot and yet don’t seem to be able to scale. However, databases like Vertica that have adopted a columnar, MPP architecture, greater scalability and a simplified pricing model will often fit the bill for many companies.
These systems are not always perceived as cost-effective solutions. The truth is that you’re paying for a staff of engineers who can debug and build stronger, better products. Although open source is easy to adopt and easy to test, most companies I see end up investing more in engineering to support the open source solutions. You can pay licensing costs or you can pay engineers, but either way, there is a cost.
One of the biggest benefits of open source is that it has driven down the cost of all analytical platforms, so some of the new platforms like Vertica have much lower costs than your legacy data warehouse technology.

Myth Three: Big Data = NoSQL or Spark
Again, I see other new technologies being described as the champion of big data. The truth is that the use cases for Hadoop, NoSQL and Spark are all slightly different. These nuances are crucial when deciding how to architect your big data platform.
NoSQL is best when you don’t want to spend the time putting structure on the data. You can load data into NoSQL databases with less attention to structure and analyze it. However, it’s the optimizations in how data is stored that make a platform capable of big data analytics at petabyte scale, so don’t expect these solutions to scale for deep analytics.
Spark is great for fast analytics in memory, particularly operational analytics, but it’s also hard to scale if you need to keep all of the data in memory in order to run fast; that hardware gets expensive. Most successful architectures that I’ve seen use Spark for fast-running queries on data streams, then hand the data off to other solutions for the deep analysis (a pattern sketched below).
Vertica and other analytical databases are really best for deep analysis of a lot of data, potentially with a lot of concurrent users. Analytical systems need to support things like mixed workload management, so that if you have concurrent users and a whopper of a query comes in, it won’t eat up all the resources and drag down the shorter queries. Analytical systems also need optimizations for disk access, since you can’t always load petabytes into memory. This is the domain of Vertica.
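To make that hand-off pattern concrete, here is a minimal PySpark Structured Streaming sketch: it computes fast, in-memory aggregates on a stream and, in parallel, lands the raw events as Parquet files for an analytical database to load later. The rate source, paths and column names are placeholders for illustration, not a production design.

```python
# Sketch only: fast aggregation on a stream, plus landing the raw data for deep analysis.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-handoff").getOrCreate()

# "rate" is a built-in test source that emits (timestamp, value) rows.
events = spark.readStream.format("rate").option("rowsPerSecond", 100).load()

# Fast, operational-style analytics: counts per 10-second window.
counts = events.groupBy(F.window("timestamp", "10 seconds")).count()

fast_query = (counts.writeStream
              .outputMode("complete")
              .format("console")          # stand-in for a dashboard or alerting sink
              .start())

# Hand-off: append the raw events to Parquet so an analytical database
# (Vertica, for example) can load them for deep, historical analysis.
landing = (events.writeStream
           .format("parquet")
           .option("path", "/tmp/landing/events")              # placeholder path
           .option("checkpointLocation", "/tmp/landing/_chk")  # placeholder path
           .outputMode("append")
           .start())

spark.streams.awaitAnyTermination()
```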

Today’s Modern Architecture
In today’s modern architecture, you may have to rely on a multitude of solutions to solve your big data challenges.  If you have only a couple of terabytes, almost any of the solutions mentioned will do the trick.  However, if you eventually want to scale into the tens or hundreds of terabytes (or more), using one solution for a varied analytical workload will start to show signs of strain. It’s then that you need to explore a hybrid solution and use the right tool for the right job.

Thursday, November 3, 2016

MPP Analytical Database vs SQL on Hadoop

Users find lower licensing costs when storing data in Hadoop, although they often do pay for subscriptions. Storing data efficiently in a cluster of nodes is table stakes for data management today. However, it’s important to remember what happens next: the next step is often performing analytics on the data as it sits in the Hadoop cluster. When it comes to this, our internal benchmark testing reveals limitations of the Apache Hadoop platform.

Since I work at Vertica, I recently got some metrics from a team of Vertica engineers who set up a 5-node cluster of Hewlett Packard Enterprise ProLiant DL380 servers. They created 3 TB of data in ORC, Parquet, and our own ROS format. Then they put the TPC-DS benchmark queries to the test with Vertica, Impala, Hive on Tez, and even Apache Spark. They looked at CDH 5.7.1 with Impala 2.5.0, and HDP 2.4.2 with Hawq 2.0, in comparison to Vertica.

Performing Complex Analytics
They first took note of whether all the benchmark queries would run at all. This becomes important when you’re thinking about the analytical workload. Do you plan to perform any complex analytics? In these benchmarks, Vertica completed 100% of the TPC-DS queries, while none of the others could.



For example, if you want to perform time series analytics and the queries you need are not available, how much will it cost you to engineer a solution? How many lines of code will you have to write and maintain to accomplish the desired analytics? Hadoop-based solutions often do not have out-of-the-box geospatial, pattern matching, machine learning, and data preparation functions – and these types of analytics are not even part of the benchmark.
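As a rough illustration of the kind of code you end up writing and maintaining when those functions are missing, here is a small pandas sketch that does time series gap filling and interpolation by hand – something a database with built-in time series support would do in a single query. The data and column names are invented.

```python
# Sketch: hand-rolled gap filling/interpolation you would otherwise get
# as a built-in time series feature of an analytical database.
import pandas as pd

readings = pd.DataFrame({
    "ts": pd.to_datetime(["2016-11-03 09:00", "2016-11-03 09:01",
                          "2016-11-03 09:04", "2016-11-03 09:05"]),
    "cpu_pct": [41.0, 43.5, 39.0, 40.2],   # note the gap at 09:02 and 09:03
})

filled = (readings
          .set_index("ts")
          .resample("1min")      # regularize to one-minute buckets
          .mean()
          .interpolate())        # linearly fill the missing minutes

print(filled)
```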

Achieving Top Speed
Let’s assume that you don’t need to run all of the TPC-DS queries, or that you can spend the time and resources to modify them. When I compared the performance metrics on just the queries that would run, the Hadoop-based solutions were not comparable in performance either. For example, the numbers for Impala were as follows:



This was actually the closest result. On the 59 queries that Hive on Tez could complete, it took about 21 hours to Vertica’s 90 minutes. Apache Spark took 11 hours to complete 29 queries while Vertica took 25 minutes.

Handling Concurrent Queries
In the metrics I received, I also found that Hadoop-based solutions had limits on the number of concurrent queries they could run before queries started to fail. In the tests, the engineers continually and simultaneously ran 1 long-, 2 medium-, and 5 short-running queries. In most cases, the Hadoop-based solutions choked on the long query, even while speeding through the short queries in a reasonable time. Vertica completed all queries, every time.
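The test setup described above can be approximated with a simple harness like the Python sketch below, which fires a mix of long, medium and short queries at the database concurrently over a DB-API connection and times each one. The connection details, table names and query texts are hypothetical.

```python
# Sketch: run a mix of long, medium and short queries concurrently and time them,
# roughly mirroring the 1 long / 2 medium / 5 short test described above.
import time
from concurrent.futures import ThreadPoolExecutor

import vertica_python  # any DB-API driver would do; vertica_python is just one example

CONN_INFO = {"host": "localhost", "port": 5433,          # placeholder credentials
             "user": "dbadmin", "password": "", "database": "bench"}

QUERY_MIX = (["SELECT /* long */ COUNT(*) FROM big_fact a JOIN big_dim b USING (id)"] * 1 +
             ["SELECT /* medium */ region, SUM(amount) FROM sales GROUP BY region"] * 2 +
             ["SELECT /* short */ COUNT(*) FROM sales WHERE sale_date = CURRENT_DATE"] * 5)

def run_query(sql):
    start = time.time()
    with vertica_python.connect(**CONN_INFO) as conn:
        cur = conn.cursor()
        cur.execute(sql)
        cur.fetchall()
    return sql.split("*/")[0] + "*/", time.time() - start

with ThreadPoolExecutor(max_workers=len(QUERY_MIX)) as pool:
    for label, seconds in pool.map(run_query, QUERY_MIX):
        print(f"{label:22s} finished in {seconds:8.2f}s")
```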

An Analytical Database is the Right Way to Perform Big Data Analytics
Although they are an inexpensive way to store data, Hadoop-based solutions are no match for columnar analytical databases like Vertica for big data analytics.

Hadoop-based solutions cannot:
  • Perform at the same level of ANSI SQL analytics, often failing to complete TPC-DS benchmark queries
  • Deliver analytics as fast, sometimes running significantly slower than a column store
  • Offer the concurrency of an analytical database for a combination of long-, medium- and short-running queries

Hadoop is a good platform for low-cost storage for big data and data transformation. It delivers some level of analytics for a small number of users or data scientists. But if you need to provide your organization with robust advanced analytics for hundreds or thousands of concurrent users and achieve excellent performance, a big data analytics database is the best solution.

Tuesday, February 23, 2016

Why you may need yet another database - operational vs analytical systems



If your company has long been, say, an Oracle shop, yet you’ve got a purchase order in your hand for yet another database, you may be wondering why, just why, you need another database.
Let’s face it, when it comes to performing analytics on big data, there are major structural differences in the ways that databases work. Your project team is asking you for technology that is best suited for the problem at hand. You need to know that databases tend to specialize, offering different characteristics and benefits to an organization.
Let’s start exploring this concept by considering a challenge where multiple analytical environments are needed to solve a problem. For example, consider a security analytics application where a company wants to both a) look at the live stream of web and application logs and be aware immediately of unusual activity in order to thwart an attack, and b) perform forensic analysis of, say, three months of log data to determine vulnerabilities and understand completely what has happened in the past. You need to be able to look quickly at both the stream and the data lake for answers.
Unfortunately, no product on the market offers the ultimate solution for both of these tasks, particularly if you need to accomplish them with huge volumes of data. Be suspicious of any vendor who claims to specialize in both, because the very underpinnings of a database are usually formulated with one or the other (or something completely different) in mind. Either you use a ton of memory and cache for quick storage and in-memory analytics, or you optimize the data as it’s stored to enhance the performance of long-running queries.

A Database is Not Just a Database
Two common types of databases used in the above scenario are operational and analytical. In operational systems, the goal is to ingest data quickly with minimal transformations. The analytics that are performed tend to look at the stream of data, watching for outliers or interruptions in normal operations. You may hear these referred to as “small queries” because they tend to look at smaller amounts of data and ask simpler questions.

On the other hand, analytical databases are more closely tied to the questions that the business wants to answer from the data. To answer questions like “how many widgets did we sell last year by region” quickly, the data is modeled to respond in the fastest way possible. These systems are often where long queries are executed, queries that involve JOINs across lots of data. Highly scalable databases are usually the best solution here, since you want to be able to scale out with more hardware, give access to information consumers and democratize the analytics. Columnar databases like Vertica fit the bill very well for analytics because they do just that – they organize the data up front for fast analytics at petabyte scale.
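For instance, the “widgets by region” question above typically becomes a scan-and-aggregate query over a large fact table. Here is a minimal sketch using the vertica_python driver; the schema, table and credentials are made up for illustration.

```python
# Sketch: a typical long-running analytical query -- aggregate a year of sales
# by region -- issued from Python. Schema and credentials are hypothetical.
import vertica_python

conn_info = {"host": "localhost", "port": 5433,
             "user": "dbadmin", "password": "", "database": "analytics"}

sql = """
    SELECT region, COUNT(*) AS widgets_sold
    FROM   sales
    WHERE  sale_date BETWEEN '2015-01-01' AND '2015-12-31'
    GROUP  BY region
    ORDER  BY widgets_sold DESC
"""

with vertica_python.connect(**conn_info) as connection:
    cursor = connection.cursor()
    cursor.execute(sql)
    for region, widgets_sold in cursor.fetchall():
        print(region, widgets_sold)
```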


Enter the Messaging Bus
If you agree that sometimes we need a nimble operational system to fly through our small queries and a full-fledged MPP analytical database to do our heavy lifting, then how do we reconcile data between the systems? In the past, practitioners would write custom code to have the systems share data, but the complexity of doing this, given that data models and applications are always changing, is high. An easier approach in the more recent past was to create data integration (ETL) jobs. The ETL tool would help manage the metadata, the data models and any changes in the applications.

Today, the choice is often a messaging bus. Apache Kafka is often used to take on this task because it’s fast and scalable. It uses a publish-subscribe messaging system to share data with any application that subscribes to it. Having one standard for data sharing makes sense for both the users and software developers.  If you want to make a new database part of your ecosystem, sharing data is simplified if it supports Kafka or another messaging bus technology.
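As a rough sketch of the publish-subscribe pattern, here is what producing and consuming a shared topic looks like with the kafka-python client. The broker address, topic name and message shape are placeholders; the point is that any number of systems can subscribe to the same stream.

```python
# Sketch: one application publishes events to a Kafka topic; any number of
# subscribers (an operational store, an analytical database loader, etc.)
# can consume the same stream independently.
import json
from kafka import KafkaProducer, KafkaConsumer

BROKERS = "localhost:9092"   # placeholder broker address
TOPIC = "weblogs"            # placeholder topic name

# Publisher side: serialize events as JSON and send them to the topic.
producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda event: json.dumps(event).encode("utf-8"))
producer.send(TOPIC, {"user": "u123", "action": "login"})
producer.flush()

# Subscriber side: for example, a loader that feeds the analytical database.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKERS,
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")))

for message in consumer:
    print(message.value)     # hand the event to the downstream system here
```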


Who is doing this today?
As I mentioned earlier, for many companies the solution is to have both analytical and operational systems. With today’s big data workloads, companies like Playtika, for example, have implemented Kafka and Spark to handle operational data and a columnar database for in-depth analytics. You can read more about Playtika’s story here. These architectures may be more complex, but they have the huge benefit of being able to handle just about any workload thrown at them. They can handle the volume and velocity of data while maximizing the value it can bring to the organization.

That’s not all
There are other specialists in the database world. For example, graph databases apply graph theory to the storage of information about the relationships between entities. Think about social media, where understanding the relationships between people is the goal, or recommendation engines that link a buyer’s affinity to purchase an item based on their history. Relationship queries in your standard database can be slow and unpredictable; graph databases are designed specifically for this sort of thing. More about that topic can be found in Walt Maguire’s excellent blog posts.
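To illustrate the kind of relationship query a graph engine is built for, here is a tiny sketch using the networkx library in Python. The social graph is invented; a dedicated graph database applies the same idea to billions of edges.

```python
# Sketch: relationship traversal ("friends of friends") on a toy social graph.
import networkx as nx

g = nx.Graph()
g.add_edges_from([("alice", "bob"), ("bob", "carol"),
                  ("carol", "dave"), ("alice", "erin")])

# Who are alice's friends-of-friends (people exactly two hops away)?
friends = set(g.neighbors("alice"))
friends_of_friends = {fof for f in friends for fof in g.neighbors(f)} - friends - {"alice"}
print(friends_of_friends)                    # {'carol'}

# Shortest chain of relationships between two people.
print(nx.shortest_path(g, "alice", "dave"))  # ['alice', 'bob', 'carol', 'dave']
```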

Monday, February 1, 2016

The Format War for Hadoop Structured Data



A war is raging that pits Hadoop distribution vendors against each other in determining exactly how to store structured big data. The battle is between the ORC file format, spearheaded by Hortonworks, and the Parquet file format, promoted by Cloudera.
ORC and Parquet are separate Apache projects with the similar goal of providing very fast analytics. To achieve that performance, the formats have similar characteristics in that they both store data in columns rather than rows. This enables the majority of analytics to run faster than if the data were stored in rows or some semi-structured format. They also both support compression; when you store data in columns, it tends to compress very efficiently. It’s easier to compress a column of dates, for example, than it is to compress mixed numbers, dates and strings. Compression spares you intensive disk access, a common bottleneck for analytics.
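A quick way to see the effect is to compress the same records laid out row by row versus column by column. The toy Python sketch below uses invented data; the column-wise layout typically compresses noticeably better because each column, the repeated dates especially, is homogeneous.

```python
# Sketch: the same records compressed as interleaved rows vs. as separate columns.
# Homogeneous columns (for example, repeated dates) tend to compress much better.
import zlib

dates   = ["2016-02-01"] * 10000
amounts = [str(i * 7 % 997) for i in range(10000)]
names   = ["cust_%d" % (i % 500) for i in range(10000)]

row_layout = "".join(d + "," + a + "," + n + "\n"
                     for d, a, n in zip(dates, amounts, names))

row_size = len(zlib.compress(row_layout.encode()))
col_size = sum(len(zlib.compress("".join(col).encode()))
               for col in (dates, amounts, names))

print("row-wise layout   :", row_size, "bytes compressed")
print("column-wise layout:", col_size, "bytes compressed")
```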
If you’re part of the HPE Vertica community, the goals of ORC and Parquet may sound familiar.  Columnar databases, including Vertica, have had columnar formats as part of the core product since the beginning. Before ORC and Parquet were in incubation, Vertica developed the ROS format for columnar, compressed big data storage.  Over the years, we have tuned and enhanced the format by adding a large number of compression algorithms designed to make the data storage and retrieval very efficient.  We’ve thought through features like backup and restore. After all, with a columnar store database, the concept of incremental backup/restore changes quite a bit.  We’ve had time to think through security, encryption and a long list of challenges when managing data in columnar format.

ORC vs Parquet – War, what is it good for?
Which format is better? Hortonworks has argued that ORC is ahead of Parquet in its ability to do predicate pushdown. In layman’s terms, this claim is about performing analytics closer to where the data sits rather than generating excess network traffic. Cloudera has argued for Parquet on the strength of its efficient C++ code base. It also argues that ORC data containers are primarily described with Hive, while Parquet’s data containers can be described using Hive, Thrift and Avro. The important thing to remember is that if you have chosen Hortonworks as your Hadoop distribution, it may be a little tricky to perform analytics on Parquet. Accessing ORC files from Cloudera might also be a challenge.
At HPE, our goal is to seamlessly support ORC, Parquet and ROS as part of the Vertica analytics platform. Vertica has developed an ORC reader, in collaboration with Hortonworks, to be super-efficient at performing analytics on ORC files.  Just this week we also announced certification of Vertica on the CDH 5 platform and we have connectors into Parquet via our HDFS connector.  We’re also working with Cloudera to continuously optimize our Parquet file access. The goal is to read, write and federate multiple formats to minimize unnecessary data movement and transformations. For the information workers who need to run analytics, it shouldn’t matter where the data sits or in what format.
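In practice, format-agnostic access looks something like the sketch below, where the same analysis can be pointed at ORC or Parquet files without the analyst caring which one is underneath. The paths and the country column are placeholders.

```python
# Sketch: the same query over data stored as ORC or as Parquet.
# The storage format should be an implementation detail, not the analyst's problem.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-agnostic").getOrCreate()

clicks_orc     = spark.read.orc("hdfs:///data/clicks_orc")           # placeholder path
clicks_parquet = spark.read.parquet("hdfs:///data/clicks_parquet")   # placeholder path

for clicks in (clicks_orc, clicks_parquet):
    clicks.groupBy("country").count().show(10)
```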

On the Horizon – Kudu
The aforementioned file formats are tied to analytical use cases. In other words, if you have petabytes of data in your data lake and you need to crunch through it in short order, ORC, Parquet and ROS are valuable. However, Cloudera recently announced a new storage engine and project called Kudu (link) that also addresses the needs of operational analytics use cases – ones where you need to run small queries on smaller data sets, particularly as they are ingested into the data lake. It’s still in incubation, but if the vision is realized, it will mean better efficiency and easier implementation for companies that need both analytical and operational systems. We’ll explore this and its tie to Kafka and Spark in my next post.

Disclaimer: The opinions expressed here are my own and don't necessarily reflect the opinion of my employer. The material written here is copyright (c) 2010 by Steve Sarsfield. To request permission to reuse, please e-mail me.