If your company has long been, say, an Oracle shop, yet you've got a purchase order in your hand for yet another database, you may be wondering: why, exactly, do you need another database?
Let's face it: when it comes to performing analytics on big data, there are major structural differences in the ways that databases work. Your project team is asking you for the technology best suited to the problem at hand, and you need to know that databases tend to specialize, each offering different characteristics and benefits to an organization.
Let's start exploring this concept with a challenge where multiple analytical environments are needed to solve a problem. Consider a security analytics application where a company wants to a) watch the live stream of web and application logs and spot unusual activity immediately in order to thwart an attack, and b) perform forensic analysis of, say, three months of log data to find vulnerabilities and understand completely what has happened in the past. You need to be able to look quickly at both the stream and the data lake for answers.
Unfortunately, no product on the market is the ultimate solution for both of these tasks, particularly when we need to accomplish them with huge volumes of data. Be suspicious of any vendor who claims to specialize in both, because the very underpinnings of a database are usually formulated with one or the other (or something completely different) in mind. Either you use a ton of memory and cache for quick storage and in-memory analytics, or you optimize the data as it's stored to boost the performance of long-running queries.
A Database is Not Just a Database
Two common types of databases used in the above scenario are operational and analytical. In operational systems, the goal is to ingest data quickly with minimal transformation. The analytics performed often look at the stream of data itself, hunting for outliers or interruptions in normal operations. You may hear these referred to as "small queries" because they tend to look at smaller amounts of data and ask simpler questions.
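To make the "small query" idea concrete, here is a minimal sketch of the kind of continuous check an operational system might run over a stream. The window size, threshold, and metric are assumptions for illustration, not taken from any particular product.

```python
from collections import deque

# Sliding window over the most recent readings of a stream metric.
WINDOW = 60      # assumed: keep the last 60 measurements
THRESHOLD = 3.0  # assumed: flag readings 3 standard deviations from the mean

recent = deque(maxlen=WINDOW)

def looks_unusual(requests_per_second: float) -> bool:
    """Return True if the new reading deviates sharply from the recent norm."""
    suspicious = False
    if len(recent) == WINDOW:
        mean = sum(recent) / WINDOW
        std = (sum((x - mean) ** 2 for x in recent) / WINDOW) ** 0.5
        suspicious = std > 0 and abs(requests_per_second - mean) > THRESHOLD * std
    recent.append(requests_per_second)
    return suspicious
```

Each call touches only the last few readings, which is exactly why these checks stay fast: they never scan months of history.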
On the other hand, analytical databases are more closely tied to the questions the business wants to answer from the data. To answer questions like "how many widgets did we sell last year by region" as fast as possible, the data is modeled for quick retrieval. These are often where long queries are executed, queries that JOIN large amounts of data. Highly scalable databases are usually the best solution here, since the goal is to scale out across more hardware, give information consumers access, and democratize the analytics. Columnar databases like Vertica fit the bill very well for analytics because they do just that: preconfigure the data for fast analytics at petabyte scale.
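As a concrete illustration, the widget question above boils down to a single aggregate query that a columnar engine can answer by scanning just a few columns. The sketch below uses the vertica_python client; the connection details, table, and column names are all hypothetical.

```python
import vertica_python  # assumed installed: pip install vertica-python

# Hypothetical connection details for illustration only.
conn_info = {
    "host": "vertica.example.com",
    "port": 5433,
    "user": "analyst",
    "password": "...",
    "database": "warehouse",
}

# "How many widgets did we sell last year by region?" as one aggregate query.
QUERY = """
    SELECT region, SUM(quantity) AS widgets_sold
    FROM sales
    WHERE sale_date >= '2016-01-01' AND sale_date < '2017-01-01'
    GROUP BY region
    ORDER BY widgets_sold DESC
"""

conn = vertica_python.connect(**conn_info)
try:
    cur = conn.cursor()
    cur.execute(QUERY)
    for region, widgets_sold in cur.fetchall():
        print(region, widgets_sold)
finally:
    conn.close()
```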
Enter the Messaging Bus
If you agree that sometimes we need a nimble operational database to fly through our small queries and a full-fledged MPP analytical system to do our heavy lifting, then how do we reconcile data between the systems? In the past, practitioners would write custom code to make the systems share data, but the complexity of doing this, given that data models and applications are always changing, is high. A more recent and easier approach was to create data integration (ETL) jobs, letting the ETL tool manage the metadata, the data models, and any change in the applications.
Today, the choice is often a messaging bus. Apache Kafka frequently takes on this task because it's fast and scalable, using a publish-subscribe model to share data with any application that subscribes to a topic. Having one standard for data sharing makes sense for both users and software developers. If you want to make a new database part of your ecosystem, sharing data is much simpler if it supports Kafka or another messaging bus technology.
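To show the pattern, here is a minimal publish-subscribe sketch using the kafka-python client. The broker address, topic name, and event fields are assumptions for illustration.

```python
import json
from kafka import KafkaProducer, KafkaConsumer  # assumed: pip install kafka-python

BROKER = "localhost:9092"  # assumed broker address
TOPIC = "web-logs"         # hypothetical topic name

# Producer side: the operational system publishes each log event once.
producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)
producer.send(TOPIC, {"path": "/login", "status": 401, "ip": "203.0.113.7"})
producer.flush()

# Consumer side: any system that subscribes sees the same stream, whether
# it is the operational store or the analytical warehouse.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKER,
    group_id="warehouse-loader",  # each subscriber keeps its own position
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
for message in consumer:
    print(message.value)  # in practice: batch these up and load the target database
```

Because the producer never knows or cares who subscribes, adding a new database to the ecosystem means writing one more consumer, not another point-to-point integration.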
Who is doing this today?
As I mentioned earlier, for many companies the answer is to have both analytical and operational solutions. With today's big data workloads, companies like Playtika, for example, have implemented Kafka and Spark to handle operational data and a columnar database for in-depth analytics. You can read more about Playtika's story here. These architectures may be more complex, but they have the huge benefit of being able to handle just about any workload thrown at them. They can handle the volume and velocity of data while maximizing the value it brings to the organization.
That’s not all
There are other specialists in the database world. Graph databases, for example, apply graph theory to storing information about the relationships between entities. Think about social media, where understanding the relationships between people is the goal, or recommendation engines that link a buyer's affinity to purchase an item based on their history. Relationship queries in a standard relational database can be slow and unpredictable; graph databases are designed specifically for this sort of thing. More about that topic can be found in Walt Maguire's excellent blog posts.
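To see why relationships map so naturally onto a graph, here is a toy Python sketch of a friends-of-friends recommendation over an adjacency list. Real graph databases run traversals like this natively and at far greater scale; all names and data here are invented for illustration.

```python
# Toy adjacency list standing in for a social graph; in a graph database,
# edges are first-class and traversals like this are the native operation.
follows = {
    "ana": {"ben", "cai"},
    "ben": {"cai", "dee"},
    "cai": {"dee"},
    "dee": {"ana"},
}

def recommend(user: str) -> set:
    """Suggest accounts followed by the people `user` already follows."""
    direct = follows.get(user, set())
    candidates = set()
    for friend in direct:
        candidates |= follows.get(friend, set())
    return candidates - direct - {user}

print(recommend("ana"))  # -> {'dee'}
```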