A war is raging that pits Hadoop distribution vendors against each other over exactly how to store structured big data. The battle is between the ORC file format, spearheaded by Hortonworks, and the Parquet file format, promoted by Cloudera.
ORC and Parquet are separate Apache projects that share a similar goal: very fast analytics. To achieve that performance, both formats store data in columns rather than rows. This lets the majority of analytic queries run faster than they would if the data were stored in rows or in some semi-structured format. Both formats also support compression; when you store data in columns, it tends to compress very efficiently. It’s easier to compress a column of dates, for example, than it is to compress a mix of numbers, dates and strings. Compression saves you intensive disk access, a common bottleneck for analytics.
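To make the compression point concrete, here is a minimal sketch using pyarrow (not something from the original post; the file names and columns are made up) that writes the same small table with and without compression so you can compare the resulting file sizes:

import os
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical sample data: a highly repetitive date column compresses far
# better than a column of mostly unique strings.
table = pa.table({
    "event_date": ["2016-03-01"] * 500 + ["2016-03-02"] * 500,
    "user_label": ["user-%d" % i for i in range(1000)],
})

# Write the same table with and without compression, then compare file sizes.
pq.write_table(table, "events_snappy.parquet", compression="snappy")
pq.write_table(table, "events_none.parquet", compression="NONE")
print(os.path.getsize("events_none.parquet"),
      os.path.getsize("events_snappy.parquet"))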
If you’re part of the HPE Vertica community, the goals of ORC and Parquet may sound familiar. Columnar databases such as Vertica have had columnar storage formats as part of the core product since the beginning. Before ORC and Parquet were in incubation, Vertica developed the ROS format for columnar, compressed big data storage. Over the years, we have tuned and enhanced the format, adding a large number of compression algorithms designed to make data storage and retrieval very efficient. We’ve thought through features like backup and restore; after all, with a columnar-store database, the concept of incremental backup and restore changes quite a bit. We’ve also had time to work through security, encryption and a long list of other challenges that come with managing data in a columnar format.
ORC vs. Parquet – War, what is it good for?
Which format is better? Hortonworks has argued that ORC is ahead of Parquet in its ability to do predicate pushdown. In layman’s terms, this claim is about performing analytics closer to where the data sits rather than generating excess network traffic. Cloudera has argued for Parquet on the strength of its efficient C++ code base. It also points out that ORC data containers are primarily described with Hive, while Parquet’s data containers can be described using Hive, Thrift and Avro. The important thing to remember is that if you have chosen Hortonworks as your Hadoop distribution, it may be a little tricky to perform analytics on Parquet. Accessing ORC files from Cloudera might also be a challenge.
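As a rough illustration of what predicate pushdown buys you (a sketch, not Hortonworks’ or Cloudera’s code; the file, column names and filter values are hypothetical), a Parquet reader that supports pushdown can use row-group statistics to skip chunks of the file that cannot match the filter, rather than scanning everything and filtering afterward:

import pyarrow.parquet as pq

# Readers that support predicate pushdown check per-row-group statistics
# (min/max values) and skip row groups that cannot contain matching rows,
# so far less data is read from disk or shipped across the network.
filtered = pq.read_table(
    "sales.parquet",
    columns=["region", "amount"],
    filters=[("region", "=", "EMEA")],
)
print(filtered.num_rows)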
At HPE, our goal is to seamlessly support ORC, Parquet and
ROS as part of the Vertica analytics platform. Vertica has developed an ORC
reader, in collaboration with Hortonworks, to be super-efficient at performing
analytics on ORC files. Just this week we also announced certification
of Vertica on the CDH 5 platform and we have connectors into Parquet via
our HDFS connector. We’re also working
with Cloudera to continuously optimize our Parquet file access. The goal is to
read, write and federate multiple formats to minimize unnecessary data movement
and transformations. For the information workers who need to run analytics, it
shouldn’t matter where the data sits or in what format.
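As a sketch of what that can look like from the Vertica side (the connection settings, table definition and HDFS path here are hypothetical, and this assumes Vertica’s ORC reader and the vertica_python client), an external table can be defined directly over ORC files sitting in HDFS and queried in place:

import vertica_python

# Hypothetical connection settings.
conn_info = {
    "host": "vertica.example.com",
    "port": 5433,
    "user": "dbadmin",
    "password": "secret",
    "database": "analytics",
}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    # Define an external table over ORC data already in HDFS, then query it
    # in place; no load step or format conversion required.
    cur.execute("""
        CREATE EXTERNAL TABLE web_clicks (ts TIMESTAMP, url VARCHAR(2048), user_id INT)
        AS COPY FROM 'hdfs:///data/clicks/*.orc' ORC
    """)
    cur.execute("SELECT COUNT(*) FROM web_clicks")
    print(cur.fetchone()[0])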
On the Horizon – Kudu
The aforementioned file formats are tied to analytical use
cases. In other words, if you have
petabytes of data in your data lake and you need to crunch through it in short
order, ORC, Parquet and ROS are valuable.
However, Cloudera recently announced a new data structure and project called Kudu (link) that also addresses the needs of an operational analytics use case – one where you need to run small queries on smaller data sets, particularly as they are ingested into the data lake. It’s still in incubation, but if the vision is realized, it will mean better efficiency and easier implementation for companies that need to run both analytical and operational workloads. We’ll explore this and its ties to Kafka and Spark in my next post.