Tuesday, February 23, 2016

Why you may need yet another database - operational vs analytical systems

If your company has long been, say, an Oracle shop, yet you’ve got a purchase order in your hand for yet another database, you may be wondering why, just why, you need another database.
Let’s face it, when it comes to performing analytics on big data, there are major structural differences in the ways that databases work.  Your project team is asking you for technology that is best suited for the problem at hand.  You need to know that databases tend to specialize and offer different characteristics and benefits to an organization.
Let’s start exploring this concept by considering a challenge where multiple analytical environments are needed to solve a problem.  For example, consider a security analytics application where a company wants to both a) look at the live stream of web and application logs and be aware immediately of unusual activity in order to thwart an attack, and b) perform forensic analysis of, say, three months of log data to determine vulnerabilities and understand completely what has happened in the past.  You need to be able to look quickly at both the stream and the data lake for answers.
Unfortunately, no single product on the market is the ultimate solution for both of these tasks, particularly if we need to accomplish them with huge volumes of data. Be suspicious of any vendor who claims to specialize in both, because the very underpinnings of a database are usually designed with one or the other (or something completely different) in mind.  Either you use a ton of memory and cache for quick storage and in-memory analytics, or you optimize the data as it’s stored to enhance the performance of long-running queries.

A Database is Not Just a Database
Two common types of databases used in the above scenario are operational and analytical. In operational systems, the goal is to ingest data quickly with minimal transformations. The analytics that are performed often look at the stream of data, searching for outliers or interruptions in normal operations.  You may hear these referred to as “small queries” because they tend to look at smaller amounts of data and ask simpler questions.
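As a rough illustration of the kind of “small query” an operational system might run against a stream, here is a minimal Python sketch that flags unusual values using a rolling mean and standard deviation. The function name, the window size and the three-standard-deviation threshold are assumptions made up for this example, not settings taken from any product.

```python
from collections import deque
from statistics import mean, pstdev

def find_outliers(values, window=5, threshold=3.0):
    """Yield (index, value) for points far from the recent rolling average."""
    recent = deque(maxlen=window)
    for i, v in enumerate(values):
        if len(recent) == window:
            mu = mean(recent)
            sigma = pstdev(recent)
            # Skip the check when the recent window is perfectly flat.
            if sigma > 0 and abs(v - mu) > threshold * sigma:
                yield i, v
        # Every value, including a flagged one, joins the window.
        recent.append(v)

# Hypothetical requests-per-second readings with one obvious spike.
requests_per_second = [100, 102, 98, 101, 99, 100, 950, 101, 99]
outliers = list(find_outliers(requests_per_second))
print(outliers)
```

Note that the check only ever looks at the last few values, which is what keeps it cheap enough to run on every incoming event.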

On the other hand, analytical databases are more likely tied to the questions that the business wants to answer from the data. To more quickly answer questions like “how many widgets did we sell last year by region”, data is modeled to answer in the quickest way possible. This is often where long queries are executed, queries that involve JOINs with lots of data. Highly scalable databases are often the best solution here, since you can add hardware as the data grows, give access to information consumers and democratize the analytics.  Columnar databases like Vertica fit the bill very well for analytics because they do just that: they organize the data in advance for fast analytics at petabyte scale.
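To make the “widgets by region” question concrete, here is a small sketch of that query with a JOIN and a GROUP BY. It uses Python’s built-in sqlite3 module only because it needs no setup; SQLite is a row store, not a columnar or MPP database, and the table names, column names and figures are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stores (store_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE sales  (store_id INTEGER, sale_year INTEGER, widgets INTEGER);
""")
conn.executemany("INSERT INTO stores VALUES (?, ?)",
                 [(1, "East"), (2, "West"), (3, "East")])
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [(1, 2015, 120), (2, 2015, 80), (3, 2015, 40), (1, 2014, 999)])

# "How many widgets did we sell last year by region?"
widgets_by_region = conn.execute("""
    SELECT s.region, SUM(f.widgets)
    FROM sales AS f
    JOIN stores AS s ON s.store_id = f.store_id
    WHERE f.sale_year = 2015
    GROUP BY s.region
    ORDER BY s.region
""").fetchall()
print(widgets_by_region)
```

The SQL itself would look much the same on an analytical database; what changes is how the engine stores and scans the data underneath it.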

Enter the Messaging Bus
If you agree that sometimes we need a nimble analytical database to fly through our small queries and a full-fledged MPP system to do our heavy lifting, then how do we reconcile data between the systems? In the past, practitioners would write custom code to have the systems share data, but the complexity of doing this, given that data models and applications are always changing, is high. An easier approach in the recent past was to create data integration (ETL) jobs.  The ETL would help manage the metadata, data models and any change in the applications. 

Today, the choice is often a messaging bus. Apache Kafka is often used to take on this task because it’s fast and scalable. It uses a publish-subscribe messaging system to share data with any application that subscribes to it. Having one standard for data sharing makes sense for both the users and software developers.  If you want to make a new database part of your ecosystem, sharing data is simplified if it supports Kafka or another messaging bus technology.
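The publish-subscribe idea can be sketched in a few lines of Python. This is a toy, in-memory bus and not the Kafka API: the MessageBus class, the topic name and the two target lists are all hypothetical, and a real bus such as Kafka persists messages and lets each consumer read at its own pace rather than calling handlers directly.

```python
from collections import defaultdict

class MessageBus:
    """Toy in-memory publish-subscribe bus (illustration only)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        # Deliver the same message to every subscriber of the topic.
        for handler in self._subscribers[topic]:
            handler(message)

# Stand-ins for an operational system and an analytical system.
operational_store = []
analytical_store = []

bus = MessageBus()
bus.subscribe("web-logs", operational_store.append)
bus.subscribe("web-logs", analytical_store.append)

bus.publish("web-logs", {"path": "/login", "status": 401})
bus.publish("web-logs", {"path": "/home", "status": 200})
print(len(operational_store), len(analytical_store))
```

The producer never needs to know which systems are listening, which is why adding a new database to the ecosystem becomes a matter of adding one more subscriber.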

Who is doing this today?
As I mentioned earlier, for many companies, the solution is to have both analytical and operational solutions. With today’s big data workloads, companies like Playtika, for example, have implemented Kafka and Spark to handle operational data and a columnar database for in-depth analytics.  You can read more about Playtika’s story here.  These architectures may be more complex, but they have the huge benefit of being able to handle just about any workload thrown at them.  They can handle the volume and veracity of data while maximizing the value it can bring to the organization.

That’s not all
There are other specialists in the database world.  For example, graph databases apply graph theory to the storage of information about the relationships between entries. Think about social media, where understanding the relationships between people is the goal, or recommendation engines that link buyers’ affinity to purchase an item based on their history. Relationship queries in your standard database can be slow and unpredictable. Graph databases are designed specifically for this sort of thing. More about that topic can be found in Walt Maguire’s excellent blog posts.
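To give a feel for a relationship query, here is a minimal Python sketch of a “friends of friends” lookup over an adjacency list. The names and the friends_of_friends helper are invented for the example; a real graph database would store and index these relationships for you, and would express the query in its own query language.

```python
# Hypothetical social graph: each person maps to the set of their friends.
friends = {
    "ana": {"ben", "cara"},
    "ben": {"ana", "dev"},
    "cara": {"ana", "dev", "eli"},
    "dev": {"ben", "cara"},
    "eli": {"cara"},
}

def friends_of_friends(graph, person):
    """Return people two hops away who are not already direct friends."""
    direct = graph[person]
    candidates = set()
    for friend in direct:
        candidates |= graph[friend]
    return candidates - direct - {person}

suggestions = sorted(friends_of_friends(friends, "ana"))
print(suggestions)
```

In a relational database the same question typically becomes a self-join per hop, which is part of why deeper relationship queries get slow there.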

Monday, February 1, 2016

The Format War for Hadoop Structured Data

A war is raging that pits Hadoop distribution vendors against each other in determining exactly how to store structured big data. The battle is between the ORC file format, spearheaded by Hortonworks, and the Parquet file format, promoted by Cloudera.
ORC and Parquet are separate Apache projects with a similar goal of providing very fast analytics. To achieve performance, the formats have similar characteristics in that they both store data in columns rather than rows. This enables a majority of analytics to run faster than if the data were stored in rows or some semi-structured format. They also both support compression; when you store data in columns, it tends to compress very efficiently.  It’s easier to compress a column of dates, for example, than it is to compress mixed numbers, dates and strings. Compression reduces disk access, a common bottleneck for analytics.
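Here is a small Python sketch of why a column of dates compresses well. It pivots a handful of made-up rows into columns and applies run-length encoding, one simple scheme of the kind columnar formats build on. The sample data and the run_length_encode helper are assumptions for illustration; the encodings in the real formats are more elaborate.

```python
from itertools import groupby

# Hypothetical row-oriented records: (date, region, amount).
rows = [
    ("2016-02-01", "East", 120),
    ("2016-02-01", "West", 80),
    ("2016-02-01", "East", 40),
    ("2016-02-02", "West", 75),
    ("2016-02-02", "West", 60),
    ("2016-02-02", "East", 90),
]

# Pivot the rows into one list per column.
dates, regions, amounts = (list(col) for col in zip(*rows))

def run_length_encode(column):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]

encoded_dates = run_length_encode(dates)
print(encoded_dates)

# A query that only needs amounts never has to touch dates or regions.
total_amount = sum(amounts)
print(total_amount)
```

Six date values collapse into two pairs here because like values sit next to each other, which is exactly what storing by column arranges.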
If you’re part of the HPE Vertica community, the goals of ORC and Parquet may sound familiar.  Columnar databases, including Vertica, have had columnar formats as part of the core product since the beginning. Before ORC and Parquet were in incubation, Vertica developed the ROS format for columnar, compressed big data storage.  Over the years, we have tuned and enhanced the format by adding a large number of compression algorithms designed to make the data storage and retrieval very efficient.  We’ve thought through features like backup and restore. After all, with a columnar store database, the concept of incremental backup/restore changes quite a bit.  We’ve had time to think through security, encryption and a long list of challenges when managing data in columnar format.

ORC vs Parquet – War, what is it good for?
Which format is better? Hortonworks has argued that ORC is ahead of Parquet in its capabilities to do predicate pushdown.  In layman’s terms, this claim is about performing analytics closer to where the data sits rather than spurring on excess network traffic. Cloudera has argued for Parquet on the strength of its efficient C++ code base.  It also argues that ORC data containers are primarily described with Hive, while Parquet’s data containers can be described using Hive, Thrift and Avro.  The important thing to remember is that if you have chosen Hortonworks as your Hadoop distribution, it may be a little tricky to perform analytics on Parquet.  Accessing ORC files from Cloudera might also be a challenge.
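One way to picture predicate pushdown is a reader that checks per-block minimum and maximum values and skips blocks that cannot contain matching rows. The Python sketch below shows that idea only; the block layout, field names and scan_with_pushdown function are hypothetical and do not reproduce the ORC or Parquet file structures or any reader’s API.

```python
# Hypothetical storage blocks, each carrying min/max statistics for "year".
blocks = [
    {"min_year": 2012, "max_year": 2013, "rows": [(2012, 10), (2013, 20)]},
    {"min_year": 2014, "max_year": 2014, "rows": [(2014, 30), (2014, 5)]},
    {"min_year": 2015, "max_year": 2016, "rows": [(2015, 40), (2016, 50)]},
]

def scan_with_pushdown(blocks, year):
    """Sum amounts for one year, reading only blocks that may contain it."""
    blocks_read = 0
    total = 0
    for block in blocks:
        # The filter is evaluated against statistics before any rows are read.
        if not (block["min_year"] <= year <= block["max_year"]):
            continue
        blocks_read += 1
        total += sum(amount for y, amount in block["rows"] if y == year)
    return total, blocks_read

total, blocks_read = scan_with_pushdown(blocks, 2015)
print(total, blocks_read)
```

Only one of the three blocks is read for this query, so less data has to leave the storage layer and cross the network.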
At HPE, our goal is to seamlessly support ORC, Parquet and ROS as part of the Vertica analytics platform. Vertica has developed an ORC reader, in collaboration with Hortonworks, to be super-efficient at performing analytics on ORC files.  Just this week we also announced certification of Vertica on the CDH 5 platform and we have connectors into Parquet via our HDFS connector.  We’re also working with Cloudera to continuously optimize our Parquet file access. The goal is to read, write and federate multiple formats to minimize unnecessary data movement and transformations. For the information workers who need to run analytics, it shouldn’t matter where the data sits or in what format.

On the Horizon – Kudu
The aforementioned file formats are tied to analytical use cases.  In other words, if you have petabytes of data in your data lake and you need to crunch through it in short order, ORC, Parquet and ROS are valuable.  However, Cloudera recently announced a new data structure and project called Kudu that also addresses the needs of an operational analytics use case, one where you need to run small queries on smaller data sets, particularly as they are ingested into the data lake. It’s still in incubation, but if the vision is realized, it will mean better efficiency and easier implementation for companies that need to run both analytical and operational systems.  We’ll explore this and its tie to Kafka and Spark in my next post.

Thursday, January 14, 2016

What’s in store for Big Data Analytics in 2016

It’s the time of year again for predictions on all sorts of topics. Worthy, solid predictions are often based on the past and present trends and then projecting those trends into the coming year. Since I spend a lot of time studying trends of big data and analytics, I’m going to offer my predictions for the upcoming year.

Big Data will Triumph over Global Troubles
While there were awesome use cases, big data in 2015 was still somewhat of a science experiment. This year there is hope for major breakthroughs in solving some of the world’s most challenging problems with big data.  Organizations are already doing amazing things, but we’re just scratching the surface of what we can accomplish with big data.  I’ve had several conversations with clients who are looking to map the human genome and tackle problems like cancer, Alzheimer’s disease and more by mapping the genes linked to them. I believe there are imminent breakthroughs here, thanks to our improving ability to handle huge data volumes and perform faster and faster analytics.

But that’s not all.  People are using big data science for transportation research, making planes, trains and automobiles smarter and more efficient.  Non-profits are using big data to drive decisions about conservation and ecology. We have a real opportunity this year to make the world a better place with big data.  Data is the new currency in scientific breakthroughs. The capability we now have to crunch through it with our algorithms is the disruptor.

Algorithms will be the New Edge
2016 is sure to be a year for using algorithms, specifically predictive analytics, to boost company revenue. Analysts like Gartner predict that differentiated algorithms alone will help corporations achieve a boost of 5% to 10% in revenue in the near future. Algorithms will make the best use of the huge volume of customer-generated data that we get from our phones, devices and the internet of things to formulate more helpful, targeted offers for prospects and customers.  New, younger companies will leverage predictive analytics to disrupt their markets and potentially unseat the established leaders.  Predictive analytics can serve to update power delivery and consumption, medical research and treatment, and other lofty human problems, in addition to generating new revenue.

It’s difficult to see whether the algorithms themselves will be an emerging market, as some analysts say, or whether we will share most of our algorithms in our communities of data scientists. I think society will benefit more from an open source approach here, and the young minds who develop the algorithms will probably be more willing to take an open approach. Think about it: if you could predict Alzheimer’s disease with your algorithm, wouldn’t you want to share it with the world?

Hybrid Architectures will Rule in 2016
Companies are adopting a strategy where they use the right tool for the right job when it comes to big data analytics. This means that daily analytics run, and proprietary data is analyzed, on-premise in ever-growing data warehouses. Small, short-lived projects are often deployed on the cloud, and Hadoop is often used to keep costs low on data that is important, or data that needs to be farmed for mission-critical information. Finally, technologies like Spark, though still in their infancy, are emerging to help with real-time, operational analytics.

It will be up to the vendors and open source community to provide some consistency across these different deployment strategies. Information workers really won’t care where it is running, just that they can use their favorite visualization tools, SQL, R and Python. Sometimes these workloads run in their own environment, but vendors can help reduce the work involved if, for example, you want to move your cloud project to on-premise. By offering a consistent SQL, for example, across these deployment architectures, you can avoid the headaches of a hybrid environment.

Open Source will Attain New Maturity
I’ve written many times about the hype around Hadoop and the maturity of the Hadoop platform by comparison to commercially available software. Let’s face it, many open source solutions for big data analytics were somewhat immature in 2015. As I mentioned in my last post, it’s a matter of taking software that is extremely useful and spending a few years to overcome shortcomings and build out a complete platform for big data analytics.  This year, the Hadoop community will build it out to be a more complete platform.  My prediction is that we’ll see greater maturity in 2016. With greater maturity will come wider adoption.

That said, I have observed that the open source community tends to focus on the start and not the finish. For example, over the past few years, SQL users have heard about many flavors of SQL on Hadoop.  Spark seems to be the latest and coolest new project offering SQL analytics on big data, and it shows great promise. However, the shift seems to be toward new projects and away from making the legacy projects work better.

Hewlett Packard Enterprise’s Role
I was inspired to write these predictions by a webinar that I attended in which some of the executives of Hewlett Packard Enterprise and influencers gave their vision of 2016.  For more information, watch the replay video here. Hewlett Packard Enterprise (HPE) has a role to play in making these predictions come true. HPE’s vision starts with the understanding that data fuels the new style of business driving the idea economy. Data will distinguish disruptors from the disrupted. Big data promises new customers, better experiences and new revenue streams. But all opportunities come with challenges. The recipe for success is continuously iterating on what questions to ask, which data to analyze and how to use the insights at all levels of your organization.

Sunday, November 10, 2013

Big Data is Not Just Hadoop

Hybrid Solutions will Solve our Big Data Problems for Years to Come 

When I talk to the people on the front line of big data, I notice that the most common use case of big data is to provide visualization and analytics across the types of data and volumes of data we have in the modern world.  For many, it’s an expansion of the power of the data warehouse that deals with the new data-bloated world in which we live.

 Today, you have bigger volumes, more sources and you are being asked to turn around analytics even faster than before.  Overnight runs are still in use, but real-time analytics are becoming more and more expected by our business users. 

To deal with the new volumes of data, the yellow elephant craze is in full swing and many companies are looking for ways to use Hadoop to store and process big data. Last week at Strata/Hadoop World, many of the keynote speeches talked about the fact that there are really no limits to Hadoop.  I agree. However, in data governance, you must consider not only the technical solutions, but also the processes and people in your organization, and you must fit the solutions to the people and process.

As powerful as Hadoop is, there still is a skill shortage of MapReduce coders and Pig scripters.  There are still talented analytics professionals who aren't experts in R yet. This shortage will be with us for decades as a new generation of IT workers is trained in Hadoop.

This is in part why so many Hadoop distributions are in the process of putting SQL on Hadoop.  This is also why many traditional analytics vendors are adding Hadoop and ways to access the Hadoop cluster from their SQL-based applications.  The two worlds are colliding and it's very good for the world of analytics.

I’ve blogged about the cost of big data solutions and traditional enterprise solutions, and how they differ.  In short, you tend to spend money on licenses when you have an old school analytics solution, while your money goes to expertise and training if you adopt a Hadoop-centric approach.  But even this line is getting blurry with SQL-based solutions opening up their queries to Hadoop storage. Analytical databases can deliver fast big data analytics with access to Hadoop, as well as compression and columnar storage when the data is stored within.  You don’t even need open source to have a term license model today.  Term licenses are available more and more in other data storage solutions, as are pay-per-use models that charge per terabyte.

If you have a big data problem that needs to be solved, don’t jump right on the Hadoop bandwagon.  Consider the impact that big data will have on your solutions and on your teams and take a long look at the new generation of columnar data storage and SQL-centric analytical platforms to get the job done.

Sunday, January 20, 2013

Top Four Reasons Why Financial Services Companies Need Solid Data Governance

Image licensed from iStockPhoto
In working with clients in the financial services business, I’ve noticed that there is a common set of reasons why they adopt data governance.  When it comes down to proving value of data management, it’s all about revenue, efficiency and compliance.

Number One - Accurate Risk Assessment

Based on new regulations like Sarbanes-Oxley and Dodd-Frank, a financial services company's risk and assurance teams are often asked to determine the amount of regulatory capital reserves when building credit risk models. A crucial part of this function is understanding the effect the underlying data has on the accuracy of the calculations. Teams must be able to attest to the quality of the data by having in place the appropriate monitoring, controls, and alerts.  They must provide regulators with information they can believe in.

Data champions in this field must be able to draw the link between the regulations and data. They must assess the alignment of data and processes that support your models, quantify the impact of poor data quality on your regulatory capital calculations, and put into place monitoring and governance to manage this data over time.

Number Two – Process Efficiency

If your team is spending a lot of time checking and rechecking your reports, it can be quite inefficient. When one report conflicts with another, it may cast doubt on the validity of all reports. There is likely a data quality issue behind it. The problem manifests itself as a huge time-suck on monthly and quarterly closes.  Data champions must point to this inefficiency in order to put in place a solid data management strategy.

Number Three - Anti-money Laundering

Financial services companies need to be vigilant about money laundering. To do this, some look for currency transactions designed to evade current reporting requirements. If a client is making five deposits of $3,000 each in a single day, for example, it may be an attempt to keep under the radar on reporting. Data quality must help identify these transactions, even if the client is making deposits from different branches, using different deposit mechanisms (ATM or customer service representative) and even when they are using slight variations on their name.
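The deposit example above can be sketched as a simple daily aggregation in Python. Everything here is illustrative: the client IDs and records are made up, the 10,000 threshold is an assumed placeholder for whatever reporting rule applies to you, and the sketch presumes that name variations have already been resolved to a single client ID.

```python
from collections import defaultdict
from decimal import Decimal

# Assumed reporting threshold; substitute the rule that applies to you.
REPORTING_THRESHOLD = Decimal("10000")

# Hypothetical deposits: (client_id, date, channel, amount).
deposits = [
    ("C-1001", "2013-01-15", "ATM", Decimal("3000")),
    ("C-1001", "2013-01-15", "Teller", Decimal("3000")),
    ("C-1001", "2013-01-15", "ATM", Decimal("3000")),
    ("C-1001", "2013-01-15", "Teller", Decimal("3000")),
    ("C-1001", "2013-01-15", "ATM", Decimal("3000")),
    ("C-2002", "2013-01-15", "Teller", Decimal("12000")),
    ("C-3003", "2013-01-15", "ATM", Decimal("500")),
]

def flag_possible_structuring(deposits, threshold=REPORTING_THRESHOLD):
    """Flag client-days whose total meets the threshold though no single deposit does."""
    totals = defaultdict(Decimal)
    largest = defaultdict(Decimal)
    for client_id, day, _channel, amount in deposits:
        key = (client_id, day)
        totals[key] += amount
        largest[key] = max(largest[key], amount)
    return sorted(
        key for key in totals
        if totals[key] >= threshold and largest[key] < threshold
    )

flagged = flag_possible_structuring(deposits)
print(flagged)
```

The single 12,000 deposit is not flagged because it would be caught by ordinary reporting; the check is aimed at many small deposits that add up. The hard part in practice is the step this sketch assumes away, which is knowing that all five deposits belong to the same client.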

Other systems monitor wire transfers to look for countries or individuals that appear on a list compiled by Treasury’s Office of Foreign Assets Control (OFAC). Being able to successfully match your clients against the OFAC list using fuzzy matching is crucial to success.
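As a rough sketch of fuzzy name matching, the Python below scores a client name against a watch list using the standard library’s difflib.SequenceMatcher. The names, the normalize and screen helpers and the 0.85 cutoff are all invented for illustration; production screening tools use richer techniques (phonetics, transliteration, aliases) and the cutoff would need tuning against real data.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, drop periods and commas, and collapse whitespace."""
    cleaned = name.lower().replace(".", " ").replace(",", " ")
    return " ".join(cleaned.split())

def similarity(a, b):
    """Similarity ratio between 0.0 and 1.0 for two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def screen(client_name, watch_list, cutoff=0.85):
    """Return watch-list entries whose similarity meets the cutoff."""
    scored = ((entry, similarity(client_name, entry)) for entry in watch_list)
    return [(entry, score) for entry, score in scored if score >= cutoff]

# Entirely fictional watch-list entries.
watch_list = ["Jonathan Q. Example", "Maria Placeholder"]

matches = screen("Jonathon Q Example", watch_list)
print([entry for entry, _score in matches])
```

A misspelled first name and a missing period still produce a hit here, while an unrelated name falls well below the cutoff.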

Number Four – Revenue
Despite all of the regulations and reporting that banks must attend to, there is still an obligation to stockholders to make money while providing excellent service to the customers.  Revenue hinges upon a consistent, current and relevant view of clients across all of the bank’s products.  Poor data management creates significant hidden cost and can hinder your ability to recognize and understand opportunity – where you can up-sell and cross-sell your customers.  Data champions and data scientists must work with the marketing teams to identify and tackle the issues here.  Knowing when and how to ask the customer for new business can lead to significant growth.

These are just some examples that are very common to financial services.  In my experience, most financial services companies have all of these issues to some degree, but tackle them with an agile approach, taking a small portion of one of these problems and solving it little by little. Along the way, they follow the value brought and the value potential if more investment is made.

Sunday, January 6, 2013

Big Data After the Hype

Total Data Management

This year, I’ve been following the meteoric rise of big data. It has been a boon for vendors who are venturing into this area.  It has produced countless start-ups and much buzz in the data management world.

However, when it comes down to it, what we’re really talking about here is data management and data governance.  Whether you have to deal with big data, enterprise data or spread-marts, data needs to be managed no matter what size. The tides are turning for a total data management approach. Recent surveys show that despite the market hype, most technologists and business users feel that big data is an off-shoot of data management, not a branch of technology in itself.

So, why the hype?  I'm convinced it is mostly vendor-generated. In 2010, when big data began to gain notoriety, there was a disconnect for some vendors.  While partnered with traditional enterprise data management companies like the Oracles and IBMs of this world, not all vendors were prepared for the growing popularity of open source and Hadoop. Others were (and still are) better positioned. They began talking about big data as a product differentiator. Vendors who don’t have the basic architecture for managing data in Hadoop have struggled and will continue to struggle.

For example, ETL tools that have a basic connection to move data in and out of HBase, Hortonworks and Cloudera can’t stop there.  The power of Hadoop must be harnessed, and it’s not always an easy thing to do when your technology requires executables tied to CPUs.  One of the powerful things about Hadoop is that it scales based on languages and tools like Pig, Sqoop and Java without having to install anything.  Want to expand the number of servers?  Add a datanode server, tell the namenode and rebalance, and you’re off and running.  However, even this simple innovation is more difficult on some vendors’ architectures than others.

Another rethinking that is taking place in the market concerns the long-standing CPU-based pricing structure.  Vendors who keep their pricing structure based on core processors for Hadoop will continue to struggle because it runs counter to the power of Hadoop. You hear about volume, velocity and variety.  Technically, if you want to step up the volume with another datanode, it’s no big deal. However, it becomes a big deal if you have to renegotiate a vendor contract each time.

Last year, around this time, I wrote about the various costs associated with the scale of data. In summary, the costs of licenses and connectors are bigger for enterprise data, while the costs associated with skills are more likely to affect you with big data.  There will come a time when the skills gap will be closed, however.

In the year 2013, we’ll begin to see the un-hyping of big data in favor of this total data management approach. For buyers, big data will be a tick-box in their RFPs in the effort to manage data, no matter what the size.

Disclaimer: The opinions expressed here are my own and don't necessarily reflect the opinion of my employer. The material written here is copyright (c) 2010 by Steve Sarsfield. To request permission to reuse, please e-mail me.