
Tuesday, February 23, 2016

Why you may need yet another database - operational vs analytical systems



If your company has long been, say, an Oracle shop, yet you've got a purchase order in your hand for yet another database, you may be wondering: why, exactly, do you need another one?
Let's face it, when it comes to performing analytics on big data, there are major structural differences in the ways that databases work. Your project team is asking you for technology that is best suited to the problem at hand. You need to know that databases tend to specialize, and that different types offer different characteristics and benefits to an organization.
Let's start exploring this concept by considering a challenge where multiple analytical environments are needed to solve a problem. For example, consider a security analytics application where a company wants to a) look at the live stream of web and application logs and be aware immediately of unusual activity in order to thwart an attack, and b) perform forensic analysis of, say, three months of log data to determine vulnerabilities and understand completely what has happened in the past. You need to be able to look quickly at both the stream and the data lake for answers.
Unfortunately, no single product on the market handles both of these tasks well, particularly at huge volumes of data. Be suspicious of any vendor who claims to specialize in both, because the very underpinnings of a database are usually formulated with one or the other (or something completely different) in mind. Either you use a ton of memory and cache for quick storage and in-memory analytics, or you optimize the data as it's stored to enhance the performance of long-running queries.

A Database is Not Just a Database
Two common types of databases used in the above scenario are operational and analytical. In operational systems, the goal is to ingest data quickly with minimal transformations. The analytics that are performed often look at the stream of data, watching for outliers or interruptions in normal operations. You may hear these referred to as "small queries" because they tend to look at smaller amounts of data and ask simpler questions.
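To make the "small query" idea concrete, here is a toy sketch (not any particular product's API) of the kind of check an operational system runs continuously on a live stream: flag any value that strays far from the recent norm.

```python
from collections import deque

def stream_outliers(values, window=20, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the
    mean of the trailing window -- the kind of 'small query' an
    operational system runs continuously on live data."""
    recent = deque(maxlen=window)
    flagged = []
    for v in values:
        if len(recent) >= 5:  # wait for a minimal baseline
            mean = sum(recent) / len(recent)
            var = sum((x - mean) ** 2 for x in recent) / len(recent)
            std = var ** 0.5
            if std > 0 and abs(v - mean) > threshold * std:
                flagged.append(v)
        recent.append(v)
    return flagged

# A steady stream with one spike: only the spike is flagged.
data = [10, 11, 9, 10, 12, 10, 11, 500, 10, 9]
print(stream_outliers(data))  # -> [500]
```

Each arriving value is checked against only a small trailing window, which is why these queries stay fast no matter how long the stream runs.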

On the other hand, analytical databases are more closely tied to the questions that the business wants to answer from the data. To answer questions like "how many widgets did we sell last year by region" as fast as possible, the data is modeled for quick retrieval. These are often where long queries are executed, queries that involve JOINs across lots of data. Highly scalable databases are often the best solution here, since the ability to scale out with more hardware lets you give information consumers access and democratize the analytics. Columnar databases like Vertica fit the bill very well for analytics because they do just that: preconfigure the data for fast analytics at petabyte scale.
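A toy illustration (with made-up data) of why a columnar layout helps these aggregate queries: in a row store, every record's fields travel together, so summing one attribute still touches every field of every row; in a column store, the same sum scans one contiguous array and nothing else.

```python
# Row store: each record kept together -- good for fetching whole rows.
rows = [
    {"region": "East", "widgets": 120},
    {"region": "West", "widgets": 95},
    {"region": "East", "widgets": 80},
]

# Column store: each attribute kept together -- an aggregate like
# SUM(widgets) touches one contiguous array, skipping other columns.
columns = {
    "region":  ["East", "West", "East"],
    "widgets": [120, 95, 80],
}

total_row_store = sum(r["widgets"] for r in rows)  # reads every record
total_col_store = sum(columns["widgets"])          # reads one column
print(total_row_store, total_col_store)  # 295 295
```

Real columnar engines add compression and vectorized execution on top of this layout, but the core advantage is the same: an analytical query reads only the columns it needs.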


Enter the Messaging Bus
If you agree that sometimes we need a nimble operational database to fly through our small queries and a full-fledged MPP system to do our heavy lifting, then how do we reconcile data between the systems? In the past, practitioners would write custom code to have the systems share data, but the complexity of doing this, given that data models and applications are always changing, is high. An easier approach in the recent past was to create data integration (ETL) jobs. The ETL tools would help manage the metadata, the data models and any changes in the applications.

Today, the choice is often a messaging bus. Apache Kafka is often used to take on this task because it’s fast and scalable. It uses a publish-subscribe messaging system to share data with any application that subscribes to it. Having one standard for data sharing makes sense for both the users and software developers.  If you want to make a new database part of your ecosystem, sharing data is simplified if it supports Kafka or another messaging bus technology.
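The publish-subscribe pattern Kafka uses can be sketched in a few lines. This is a toy in-process bus, not the Kafka client itself (a real deployment adds brokers, partitions, and durable offsets), but it shows why the pattern simplifies sharing: producers don't know or care how many consumers are listening.

```python
from collections import defaultdict

class Bus:
    """Toy publish-subscribe bus: producers publish to a named topic,
    and every subscriber to that topic receives each message."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, message):
        for handler in self.subscribers[topic]:
            handler(message)

# Both an operational and an analytical consumer see the same event,
# without the producer knowing either one exists.
bus = Bus()
operational, analytical = [], []
bus.subscribe("web-logs", operational.append)
bus.subscribe("web-logs", analytical.append)
bus.publish("web-logs", {"ip": "10.0.0.1", "status": 404})
print(operational == analytical)  # True
```

Adding a new database to the ecosystem then means adding one more subscriber, not one more point-to-point integration.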


Who is doing this today?
As I mentioned earlier, for many companies, the solution is to have both analytical and operational solutions. With today's big data workloads, companies like Playtika, for example, have implemented Kafka and Spark to handle operational data and a columnar database for in-depth analytics. You can read more about Playtika's story here. These architectures may be more complex, but they have the huge benefit of being able to handle just about any workload thrown at them. They can handle the volume and veracity of the data while maximizing the value it can bring to the organization.

That’s not all
There are other specialists in the database world. For example, graph databases apply graph theory to the storage of information about the relationships between entities. Think about social media, where understanding the relationships between people is the goal, or recommendation engines that link a buyer's affinity to purchase an item based on their history. Relationship queries in your standard database can be slow and unpredictable. Graph databases are designed specifically for this sort of thing. More about that topic can be found in Walt Maguire's excellent blog posts.
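To see why relationship queries suit a graph model, here is a minimal sketch (with an invented social graph) of a "friends of friends" query as a breadth-first traversal over adjacency lists; in a relational store, each hop would typically be another self-JOIN.

```python
from collections import deque

# Tiny made-up social graph: adjacency sets make relationship hops cheap.
friends = {
    "ana":  {"bob", "carl"},
    "bob":  {"ana", "dana"},
    "carl": {"ana"},
    "dana": {"bob"},
}

def within_hops(graph, start, max_hops):
    """People reachable from `start` in at most `max_hops` hops
    (breadth-first search) -- the core of a friends-of-friends query."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        person, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for friend in graph[person]:
            if friend not in seen:
                seen.add(friend)
                frontier.append((friend, depth + 1))
    seen.discard(start)  # don't count the starting person
    return seen

print(sorted(within_hops(friends, "ana", 2)))  # ['bob', 'carl', 'dana']
```

A dedicated graph engine stores the data in roughly this shape on disk, so each additional hop is a pointer chase rather than a JOIN over the whole table.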

Sunday, November 10, 2013

Big Data is Not Just Hadoop

Hybrid Solutions will Solve our Big Data Problems for Years to Come 

When I talk to the people on the front line of big data, I notice that the most common use case is providing visualization and analytics across the types and volumes of data we have in the modern world. For many, it's an expansion of the power of the data warehouse to deal with the data-bloated world in which we live.

 Today, you have bigger volumes, more sources and you are being asked to turn around analytics even faster than before.  Overnight runs are still in use, but real-time analytics are becoming more and more expected by our business users. 

To deal with the new volumes of data, the yellow elephant craze is in full swing and many companies are looking for ways to use Hadoop to store and process big data. Last week at Strata/Hadoop World, many of the keynote speeches talked about the fact that there are really no limits to Hadoop.  I agree. However, in data governance, you must consider not only the technical solutions, but also the processes and people in your organization, and you must fit the solutions to the people and process.

As powerful as Hadoop is, there is still a skill shortage of Map/Reduce coders and Pig scripters, and many talented analytics professionals aren't experts in R yet. This shortage will be with us for years as a new generation of IT workers is trained in Hadoop.

This is in part why so many Hadoop distributions are in the process of putting SQL on Hadoop. This is also why many traditional analytics vendors are adding Hadoop support and ways to access the Hadoop cluster from their SQL-based applications. The two worlds are colliding, and it's very good for the world of analytics.

I've blogged about the cost of big data solutions, traditional enterprise solutions and how they differ. In short, you tend to spend money on licenses when you have an old-school analytics solution, while your money goes to expertise and training if you adopt a Hadoop-centric approach. But even this line is getting blurry, with SQL-based solutions opening up their queries to Hadoop storage. Analytical databases can deliver fast big data analytics with access to Hadoop, as well as compression and columnar storage when the data is stored natively. You don't even need open source to avoid a term license model today: pay-per-use models that charge per terabyte are available in more and more data storage solutions.

If you have a big data problem that needs to be solved, don’t jump right on the Hadoop bandwagon.  Consider the impact that big data will have on your solutions and on your teams and take a long look at the new generation of columnar data storage and SQL-centric analytical platforms to get the job done.

Disclaimer: The opinions expressed here are my own and don't necessarily reflect the opinion of my employer. The material written here is copyright (c) 2010 by Steve Sarsfield. To request permission to reuse, please e-mail me.