
Monday, April 17, 2017

Avoiding the three common myths of big data



There are many common myths when it comes to big data analytics. Like the lost city of Atlantis and the Bermuda Triangle, they seem to be ubiquitous in the teachings of big data. Let’s explore three of them.

Myth One:  Big Data = Hadoop
You often see discussions of big data transition right into how to solve any issue on Hadoop.  However, Hadoop is not the only solution for big data analytics.  Many of the bigger vendors in the RDBMS space were handling very large data sets for years before the emergence of Hadoop-based big data. The fact is, Hadoop is not a database and has severe limitations on 1) the depth of analytics it can perform; 2) how many concurrent queries it can handle; and 3) database standards like ACID compliance, SQL compliance and more.
For example, Vertica has been handling huge loads of data.  One customer loads 60 TB per hour into Vertica and has thousands of users (and applications) running analytics on it. That is an extreme example of big data, but it is proof that other solutions can scale to almost any workload. Hadoop is fantastic on the cost front, but is not the only solution for big data.  

Myth Two: Databases are too expensive and incapable of big data
I see claims in the media and in the white papers I read that relational databases aren’t capable of performing analytics on big data.  It’s true that some RDBMSs cannot handle big data. It’s also true that some legacy databases charge you a lot and yet don’t seem to be able to scale. However, database companies like Vertica, which have adopted columnar storage, MPP architectures, greater scalability and simplified pricing models, will often fit the bill for many companies.
These systems are often not perceived as cost-effective solutions. The truth is that you’re paying for a staff of engineers who can debug and build stronger, better products.  Although open source is easy to adopt and easy to test, most companies I see invest more in engineering to support their open source solutions. You can pay licensing costs or you can pay engineers, but either way, there is a cost.
One of the biggest benefits of open source is that it has driven down the cost of all analytical platforms, so some of the new platforms like Vertica have much lower costs than your legacy data warehouse technology.

Myth Three: Big Data = NoSQL or Spark
Again, I see other new technologies being described as the champion of big data. The truth is that the use cases for Hadoop, NoSQL and Spark are all slightly different.  These nuances are crucial when deciding how to architect your big data platform.
NoSQL is best when you don’t want to spend the time putting structure on the data.  You can load data into NoSQL databases with less attention to structure and analyze it.  However, it’s the optimizations and the way that data is stored that make a platform capable of big data analytics at the petabyte scale, so don’t expect this solution to scale for deep analytics. Spark is great for fast analytics in memory, particularly operational analytics, but it’s also hard to scale if you need to keep all of the data in memory in order to run fast; that hardware gets expensive.  Most successful architectures that I’ve seen use Spark for fast-running queries on data streams, then hand the data off to other solutions for the deep analysis.

Vertica and other solutions are really best for deep analysis of a lot of data and potentially a lot of concurrent users.  Analytical systems need to support things like mixed workload management, so that if you have concurrent users and a whopper of a query comes in, it won’t eat up all the resources and drag down the shorter queries. Analytical systems also need optimizations for disk access, since you can’t always load petabytes into memory. This is the domain of Vertica.
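To make that hand-off pattern a little more concrete, here’s a minimal sketch, assuming PySpark Structured Streaming on the fast path and an analytical database reachable over JDBC on the deep-analysis path. The connection URL, credentials and table name are hypothetical, and the built-in “rate” source stands in for a real event stream.

```python
# Sketch of the fast-path / deep-path hand-off, assuming PySpark is installed
# and an analytical database is reachable over JDBC; the URL, credentials and
# table name are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, window

spark = SparkSession.builder.appName("fast-path-handoff").getOrCreate()

# Fast path: small, quick queries over the live stream (events per 10 seconds).
events = spark.readStream.format("rate").option("rowsPerSecond", 100).load()
summary = (events
           .groupBy(window(col("timestamp"), "10 seconds"))
           .agg(count("value").alias("events"))
           .select(col("window.start").alias("window_start"),
                   col("window.end").alias("window_end"),
                   col("events")))

# Deep path: hand each micro-batch off to the analytical database, where the
# long-running queries, JOINs and concurrent users are handled.
def hand_off(batch_df, batch_id):
    (batch_df.write.format("jdbc")
        .option("url", "jdbc:vertica://analytics-host:5433/warehouse")  # hypothetical
        .option("dbtable", "stream_summary")
        .option("user", "dbadmin")
        .option("password", "secret")
        .mode("append")
        .save())

(summary.writeStream
    .outputMode("update")
    .foreachBatch(hand_off)
    .start()
    .awaitTermination())
```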

Today’s Modern Architecture
In today’s modern architecture, you may have to rely on a multitude of solutions to solve your big data challenges.  If you have only a couple of terabytes, almost any of the solutions mentioned will do the trick.  However, if you eventually want to scale into the tens or hundreds of terabytes (or more), using one solution for a varied analytical workload will start to show signs of strain. It’s then that you need to explore a hybrid solution and use the right tool for the right job.

Tuesday, February 23, 2016

Why you may need yet another database - operational vs analytical systems



If your company has long been, say, an Oracle shop, yet you’ve got a purchase order in your hand for yet another database, you may be wondering why, just why, you need another database.
Let’s face it, when it comes to performing analytics on big data, there are major structural differences in the ways that databases work.  Your project team is asking you for technology that is best suited for the problem at hand.  You need to know that databases tend to specialize, offering different characteristics and benefits to an organization.
Let’s start exploring this concept by considering a challenge where multiple analytical environments are needed to solve a problem.  For example, consider a security analytics application where a company wants to both a) look at the live stream of web and application logs and be aware immediately of unusual activity in order to thwart an attack, and b) perform forensic analysis of, say, three months of log data to determine vulnerabilities and understand completely what has happened in the past.  You need to be able to look quickly at both the stream and the data lake for answers.
Unfortunately, no single product on the market is the ultimate solution for both of these tasks, particularly if we need to accomplish them with huge volumes of data. Be suspicious of any vendor who claims to specialize in both, because the very underpinnings of the database are usually formulated with one or the other (or something completely different) in mind.  Either you use a ton of memory and cache for quick storage and in-memory analytics, or you optimize the data as it’s stored to enhance the performance of long-running queries.

A Database is Not Just a Database
Two common types of databases used in the above scenario are operational and analytical. In operational systems, the goal is to ingest data quickly with minimal transformations. The analytics performed often look at the stream of data, watching for outliers or interruptions in normal operations.  You may hear these referred to as “small queries” because they tend to look at smaller amounts of data and ask simpler questions.

On the other hand, analytical databases are more closely tied to the questions that the business wants to answer from the data. To answer questions like “how many widgets did we sell last year by region” as quickly as possible, the data is modeled for fast answers. This is where long queries are executed, queries that involve JOINs across lots of data. Highly scalable databases are often the best solution here, since it’s always best to be able to scale up with more hardware, give access to information consumers and democratize the analytics.  Columnar databases like Vertica fit the bill very well for analytics because they do just that – preconfigure the data for fast analytics at petabyte scale.
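To illustrate the difference in query shape, here’s a minimal sketch, assuming a PEP 249 (DB-API) connection to an analytical database such as Vertica; the table and column names are hypothetical.

```python
# Sketch contrasting the two query shapes, assuming a PEP 249 (DB-API)
# connection to an analytical database; the table and column names are
# hypothetical.

# Operational "small query": a pinpoint lookup over very little data.
OPERATIONAL_QUERY = "SELECT status FROM orders WHERE order_id = ?"

# Analytical "long query": a JOIN and aggregation over a year of sales.
ANALYTICAL_QUERY = """
    SELECT r.region_name, COUNT(*) AS widgets_sold
    FROM   sales s
    JOIN   regions r ON r.region_id = s.region_id
    WHERE  s.sale_date >= DATE '2015-01-01'
      AND  s.sale_date <  DATE '2016-01-01'
    GROUP BY r.region_name
    ORDER BY widgets_sold DESC
"""

def widgets_sold_last_year_by_region(conn):
    """Run the analytical query on any DB-API connection (e.g. vertica_python)."""
    cur = conn.cursor()
    cur.execute(ANALYTICAL_QUERY)
    rows = cur.fetchall()
    cur.close()
    return rows
```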


Enter the Messaging Bus
If you agree that sometimes we need a nimble operational database to fly through our small queries and a full-fledged MPP system to do our heavy lifting, then how do we reconcile data between the systems? In the past, practitioners would write custom code to have the systems share data, but the complexity of doing this, given that data models and applications are always changing, is high. A somewhat easier approach in the recent past was to create data integration (ETL) jobs.  The ETL would help manage the metadata, the data models and any changes in the applications.

Today, the choice is often a messaging bus. Apache Kafka is often used to take on this task because it’s fast and scalable. It uses a publish-subscribe messaging system to share data with any application that subscribes to it. Having one standard for data sharing makes sense for both the users and software developers.  If you want to make a new database part of your ecosystem, sharing data is simplified if it supports Kafka or another messaging bus technology.
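As a minimal sketch of the publish-subscribe pattern, assuming the kafka-python client and a broker running on localhost; the topic name and payload are hypothetical.

```python
# Minimal publish-subscribe sketch, assuming the kafka-python package
# (pip install kafka-python) and a broker at localhost:9092.
# The topic name and message payload are hypothetical.
import json
from kafka import KafkaProducer, KafkaConsumer

# The operational system publishes events to the bus...
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("web_events", {"user_id": 42, "action": "login", "status": "ok"})
producer.flush()

# ...and any database or application that needs the data simply subscribes.
consumer = KafkaConsumer(
    "web_events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    consumer_timeout_ms=5000,  # stop iterating when no new messages arrive
)
for message in consumer:
    print(message.value)  # hand the event off to the analytical loader here
```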


Who is doing this today?
As I mentioned earlier, for many companies, the solution is to have both analytical and operational solutions. With today’s big data workloads, companies like Playtika, for example, have implemented Kafka and Spark to handle operational data and a columnar database for in-depth analytics.  You can read more about Playtika’s story here.  These architectures may be more complex, but they have the huge benefit of being able to handle just about any workload thrown at them.  They can handle the volume and veracity of data while maximizing the value it can bring to the organization.

That’s not all
There are other specialists in the database world.  For example, graph databases apply graph theory to store information about the relationships between entries. Think about social media, where understanding the relationships between people is the goal, or recommendation engines that link a buyer’s affinity to purchase an item based on their history. Relationship queries in your standard database can be slow and unpredictable. Graph databases are designed specifically for this sort of thing. More about that topic can be found in Walt Maguire’s excellent blog posts.

Thursday, January 14, 2016

What’s in store for Big Data Analytics in 2016

It’s the time of year again for predictions on all sorts of topics. Worthy, solid predictions are often based on past and present trends, projected into the coming year. Since I spend a lot of time studying trends in big data and analytics, I’m going to offer my predictions for the upcoming year.

Big Data will Triumph over Global Troubles
While there were awesome use cases, big data in 2015 was still somewhat of a science experiment. This year there is hope for major breakthroughs in solving some of the world’s most challenging problems with big data.  Organizations are already doing amazing things, but we’re just scratching the surface of what we can accomplish with big data.  I’ve had several conversations with clients who are looking to map the human genome and tackle problems like cancer, Alzheimer’s disease and more by mapping the genes linked to them. I believe imminent breakthroughs here will be credited to our ability to handle huge data volumes and to perform faster and faster analytics.

But that’s not all.  People are using big data science for transportation research, making planes, trains and automobiles smarter and more efficient.  Non-profits are using big data to drive decisions about conservation and ecology. We have a real opportunity this year to make the world a better place with big data.  Data is the new currency in scientific breakthroughs. The capability we now have to crunch through it with our algorithms is the disruptor.

Algorithms will be the New Edge
2016 is sure to be a year for using algorithms, specifically predictive analytics, to boost company revenue. Analysts like Gartner predict that differentiated algorithms alone will help corporations achieve a boost of 5% to 10% in revenue in the near future. Algorithms will make the best use of the huge volumes of customer-generated data we get from our phones, devices and the Internet of Things to formulate more helpful, targeted offers for prospects and customers.  New, younger companies will leverage predictive analytics to disrupt their markets and potentially unseat the established leaders.  Predictive analytics can also serve to improve power delivery and consumption, medical research and treatment, and other lofty human problems, in addition to generating new revenue.
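As a toy sketch of the kind of predictive algorithm I’m describing, assuming scikit-learn and NumPy are available; the customer features and “responded to offer” labels below are entirely synthetic.

```python
# Toy sketch of predictive analytics for targeted offers, assuming
# scikit-learn and numpy; the customer features and labels are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=7)
n = 5000

# Hypothetical customer-generated signals: visits, device events, past spend.
X = np.column_stack([
    rng.poisson(3, n),          # site visits last month
    rng.poisson(10, n),         # device / IoT events
    rng.gamma(2.0, 50.0, n),    # past spend
])
# Synthetic "responded to offer" label, loosely tied to the features.
y = (0.2 * X[:, 0] + 0.05 * X[:, 1] + 0.01 * X[:, 2] + rng.normal(0, 1, n)) > 2.5

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=7)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Score prospects and target the top decile with the offer.
scores = model.predict_proba(X_test)[:, 1]
top_decile = scores >= np.quantile(scores, 0.9)
print(f"holdout accuracy: {model.score(X_test, y_test):.2f}, "
      f"targeting {top_decile.sum()} of {len(scores)} prospects")
```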

It’s difficult to see whether the algorithms themselves will be an emerging market, as some analysts say, or whether we will share most of our algorithms in our communities of data scientists. I think society will benefit more from an open source approach here, and the young minds who develop the algorithms will probably be more willing to take an open approach. Think about it, if you could predict Alzheimer’s disease with your algorithm, wouldn’t you want to share it with the world?

Hybrid Architectures will Rule in 2016
Companies are adopting a strategy of using the right tool for the right job when it comes to big data analytics. This means that daily analytics on proprietary data run on-premise against ever-growing data warehouse volumes. Small, short-lived projects are often deployed in the cloud, and Hadoop is often used to keep costs low on data that is important or that needs to be farmed for mission-critical information. Finally, technologies like Spark, though still in their infancy, are helping with real-time, operational analytics.

It will be up to the vendors and the open source community to provide some consistency across these different deployment strategies. Information workers really won’t care where a workload is running, just that they can use their favorite visualization tools, SQL, R and Python. Sometimes these workloads run in their own environments, but vendors can help reduce the work involved if, for example, you want to move your cloud project on-premise. By offering a consistent SQL dialect across these deployment architectures, for example, they can help you avoid the headaches of a hybrid environment.
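Here’s a minimal sketch of what that consistency buys you: the same report SQL runs unchanged against any PEP 249 (DB-API) connection, wherever the warehouse happens to live. The sqlite3 connection below is just a runnable stand-in; the cloud and on-premise connectors are hypothetical.

```python
# Minimal sketch of keeping one SQL dialect across deployments: the same
# report runs unchanged against any PEP 249 (DB-API) connection, whether
# the warehouse is on-premise or in the cloud. sqlite3 stands in here so
# the sketch is runnable; real cloud/on-prem connectors are hypothetical.
import sqlite3

REPORT_SQL = "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY 2 DESC"

def run_report(conn):
    cur = conn.cursor()
    cur.execute(REPORT_SQL)
    rows = cur.fetchall()
    cur.close()
    return rows

# Stand-in "on-premise" warehouse for demonstration purposes.
onprem = sqlite3.connect(":memory:")
onprem.execute("CREATE TABLE sales (region TEXT, amount REAL)")
onprem.executemany("INSERT INTO sales VALUES (?, ?)",
                   [("east", 120.0), ("west", 340.0), ("east", 80.0)])

print(run_report(onprem))
# A cloud deployment would reuse run_report() unchanged, e.g. with a
# hypothetical DB-API connection such as cloud = some_driver.connect(...).
```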

Open Source will Attain New Maturity
I’ve written many times about the hype around Hadoop and the maturity of the Hadoop platform compared to commercially available software. Let’s face it, many open source solutions for big data analytics were somewhat immature in 2015. As I mentioned in my last post, it’s a matter of taking software that is extremely useful and spending a few years overcoming its shortcomings and building out a complete platform for big data analytics.  My prediction is that the Hadoop community will do just that, and we’ll see greater maturity in 2016. With greater maturity will come wider adoption.

That said, I have observed that the open source community tends to focus on the start and not the finish. For example, over the past few years, SQL users have heard about many flavors of SQL on Hadoop.  Spark seems to be the latest and coolest new project offering SQL analytics on big data, and it shows great promise. However, the shift seems to be toward new projects and away from making the legacy projects work better.

Hewlett Packard Enterprise’s Role
I was inspired to write these predictions by a webinar I attended in which some Hewlett Packard Enterprise executives and influencers gave their vision for 2016.  For more information, watch the replay video here. Hewlett Packard Enterprise (HPE) has a role to play in making these predictions come true. HPE’s vision starts with the understanding that data fuels the new style of business driving the idea economy. Data will distinguish disruptors from the disrupted. Big data promises new customers, better experiences and new revenue streams. But all opportunities come with challenges. The recipe for success is continuously iterating on what questions to ask, which data to analyze and how to use the insights at all levels of your organization.

Sunday, November 10, 2013

Big Data is Not Just Hadoop

Hybrid Solutions will Solve our Big Data Problems for Years to Come 

When I talk to the people on the front line of big data, I notice that the most common use case for big data is to provide visualization and analytics across the types and volumes of data we have in the modern world.  For many, it’s an expansion of the power of the data warehouse to deal with the new, data-bloated world in which we live.

 Today, you have bigger volumes, more sources and you are being asked to turn around analytics even faster than before.  Overnight runs are still in use, but real-time analytics are becoming more and more expected by our business users. 

To deal with the new volumes of data, the yellow elephant craze is in full swing and many companies are looking for ways to use Hadoop to store and process big data. Last week at Strata/Hadoop World, many of the keynote speeches talked about the fact that there are really no limits to Hadoop.  I agree. However, in data governance, you must consider not only the technical solutions, but also the processes and people in your organization, and you must fit the solutions to the people and process.

As powerful as Hadoop is, there is still a shortage of MapReduce coders and Pig scripters.  There are still talented analytics professionals who aren't experts in R yet. This shortage will be with us for decades as a new generation of IT workers is trained in Hadoop.

This is in part why so many Hadoop distributions are in the process of putting SQL on Hadoop.  This is also why many traditional analytics vendors are adding Hadoop connectivity and ways to access the Hadoop cluster from their SQL-based applications.  The two worlds are colliding, and it's very good for the world of analytics.

I’ve blogged about the cost of big data solutions, traditional enterprise solutions and how they differ.  In short, you tend to spend money on licenses when you have an old-school analytics solution, while your money goes to expertise and training if you adopt a Hadoop-centric approach.  But even this line is getting blurry, with SQL-based solutions opening up their queries to Hadoop storage. Analytical databases can deliver fast big data analytics with access to Hadoop, as well as compression and columnar storage when the data is stored within them.  You don’t even need open source to get a term license model today; term licenses are available more and more in other data storage solutions, as are pay-per-use models that charge per terabyte.
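As a minimal sketch of the SQL-on-Hadoop idea, assuming the PyHive client, a HiveServer2 endpoint on localhost, and log files already sitting in HDFS; the table name, columns and path are hypothetical.

```python
# Minimal SQL-on-Hadoop sketch, assuming the PyHive package and a
# HiveServer2 endpoint at localhost:10000; the table name, columns and
# HDFS path are hypothetical.
from pyhive import hive

conn = hive.connect(host="localhost", port=10000)
cur = conn.cursor()

# Expose files already stored in Hadoop to SQL users as an external table.
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
        ts STRING,
        ip STRING,
        url STRING,
        status INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION 'hdfs:///data/web_logs/'
""")

# SQL-based tools and analysts can now query the Hadoop-resident data.
cur.execute("SELECT status, COUNT(*) FROM web_logs GROUP BY status")
for status, hits in cur.fetchall():
    print(status, hits)
```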

If you have a big data problem that needs to be solved, don’t jump right on the Hadoop bandwagon.  Consider the impact that big data will have on your solutions and on your teams and take a long look at the new generation of columnar data storage and SQL-centric analytical platforms to get the job done.

Thursday, March 22, 2012

Big Data Hype is an Opportunity for Data Management Pros

Big Data is a hot topic in the data management world. Recently, I’ve seen press and vendors describing it with words like “crucial,” “tremendous opportunity,” “overcoming vexing challenges,” and “enabling technology.”  With all the hoopla, management is probably asking many of you about your Big Data strategy. It has risen to the corporate management level; your CxO is probably aware.

Most of the data management professionals I’ve met are fairly down-to-earth, pragmatic folks.  Data is being managed correctly or not. The business rule works, or it does not. Marketing spin is evil. In fact, the hype and noise around big data may be something many of you filter out. You’re appropriately trying to look through the hype and get to the technology or business process that’s being enhanced by Big Data.
However, in addition to filtering through the big data hype to the IT impact, data management professionals should also embrace the hype.

Sure, we want to handle the high volume transactions that often come with big data, but we still have relational databases and unstructured data sources to deal with.  We still have business users using Excel for databases with who-knows-what in them.  We still have e-mail attachments from partners that need to be incorporated into our infrastructure.  We still have a wide range of data sources and targets that we have to deal with, including, but not limited to, big data. In my last blog post, I wrote about how big data is just one facet of total data management.

The opportunity is for data management pros to think about their big data management strategy holistically and solve some of their old and tired issues around data management. It’s pretty easy to draw a picture for management showing that Big Data needs a Total Data Management approach, one that includes some of our worn-out and politically charged data governance issues, including:


  • Data Ownership – One barrier to big data management is accountability for the data.  By deciding you are going to plan for big data, you also need to make decisions about who owns the big data, and all your data sets for that matter.
  • Spreadmarts – Keeping unmanaged data out of spreadsheets becomes even more crucial in companies that must handle Big Data. So-called “spreadmarts,” important pieces of data stored in Excel spreadsheets, are easily replicated to team desktops. In this scenario, you lose control of versions as well as standards. However, handled well, big data can make it easy for everyone to use corporate information, no matter what size (see the sketch after this list).
  • Unstructured Data – Although big data might tend to be more analytical than operational, big data is most commonly unstructured data.  A total data management approach takes unstructured data into account in either case. Having technology and processes that handle unstructured data, big or small, is crucial to total data management.
  • Corporate Strategy and Mergers – If your company is one that grows through acquisition, managing big data is about being able to handle not only your own data but also the data of the companies you acquire.  Since you don’t know what systems those companies will have, a big data governance strategy and flexible tools are important to big data.
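
A minimal sketch of bringing a spreadmart under management, assuming pandas, SQLAlchemy and an openpyxl-readable workbook; the file name, sheet, table and connection string are hypothetical.

```python
# Minimal sketch of consolidating a "spreadmart" into a managed database,
# assuming pandas, SQLAlchemy and openpyxl are installed; the workbook
# name, sheet, table and connection string are hypothetical.
import pandas as pd
from sqlalchemy import create_engine

# The unmanaged spreadsheet living on someone's desktop...
customers = pd.read_excel("regional_customer_list.xlsx", sheet_name="customers")

# Light cleanup so versions and standards are enforced in one place.
customers.columns = [c.strip().lower().replace(" ", "_") for c in customers.columns]
customers = customers.drop_duplicates(subset=["customer_id"])

# ...lands in the governed, central repository instead.
engine = create_engine("postgresql://warehouse-host/enterprise_dw")  # hypothetical
customers.to_sql("customer_master", engine, if_exists="replace", index=False)
```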


My point is, with big data, try to avoid the typical noise-filtering exercise you normally apply to the latest buzzword.  Instead, use the hype and buzz to your advantage to address a holistic view of data management in your organization.

Tuesday, January 24, 2012

Big Data, Enterprise Data and Discrete Data

Total Data Management©
The data management world is buzzing about big data.  Many are the blog posts, articles and white papers covering this new area. Just about every data management vendor is scrambling to build tools to meet the needs of big data.

The world is right to take notice. The ability for companies to handle big data represents exciting innovation, where large relational databases with high price tags are sometimes replaced with flat files, technologies like Hadoop and intelligent parsers to create analytics from massive amounts of data.  It’s a game-changer for those in the Business Intelligence and relational database business.  It’s about managing an increasingly common huge-data problem more effectively and at lower cost.

However, where there is big data, there is also enterprise (medium) data and discrete (small) data. With each size of data come very specific challenges.   



BIG DATA
  • Technologies: Hadoop and flat files to reduce costs and avoid relational database costs.
  • Use Cases: Real-time analytics of a large number of transactions, including web analytics, SaaS up-time optimization, and mission-critical analysis of transactions.
  • Innovation: Handles huge amounts of data, predominantly used for business analytics and operational BI.
  • Positives: Replaces high-cost, multi-server relational databases with lower-cost flat files and Hadoop server farms.
  • Negatives: Relatively new technology with a limited pool of Big Data experts. Legacy medium-sized systems can sometimes scale.
  • Cost Focus: Expertise.

ENTERPRISE DATA
  • Technologies: Relational databases.
  • Use Cases: Just about every business application today, including CRM, ERP, Data Warehouse, and MDM.
  • Innovation: Provides a powerful data management architecture that can be accessed by a common language (SQL).
  • Positives: Provides a scalable, reproducible environment in which database applications and solutions can be developed. Replaces unwieldy, human-intensive data processes with a streamlined central repository of information. Used in many businesses in day-to-day operations.
  • Negatives: Can be costly when data volumes become high, as new servers and new enterprise licenses become necessary. The number of sources and the diversity of data types can also be a challenge.
  • Cost Focus: Servers and licenses; connectors and database technology.

DISCRETE DATA
  • Technologies: Spreadsheets, flat files and flat databases. May come from other non-relational sources, such as e-mail attachments, social media JSON, and XML data.
  • Use Cases: Companies with little or no data management strategy, or companies dealing with an immature data architecture. Companies who receive mission-critical data via e-mail. Companies who need to closely follow social media streams.
  • Innovation: Handles more diverse and more dynamic sources.
  • Positives: ‘Simplifies’ the data management process to the point of being completely within the grasp of business users, without too much complicated technology.  In the long run, however, data management is more costly and unwieldy when it lives in spreadmarts.
  • Negatives: Error-prone and labor intensive.
  • Cost Focus: Efficiency and productivity.

Growing Up
An organization’s data management maturity plays a role in big and little data.  If you’re still managing your customer list in a spreadsheet, it’s probably something you started when your company was fairly young.  Now the uses for that data have expanded, but you are still stuck in the young company’s process. Something that was agile when you were young is inefficient today.

Your pain may also have something to do with your partners’ data management maturity.  While the other companies you do business with are good at what they do, supplying products and services to your company, they may not be as good at data management. The new parts catalog comes every so often as an e-mail attachment.  You need an efficient process to update whoever uses it.

No matter how mature you are, it is likely that you will have to deal with all types of data. When selecting tools, make sure you examine the cost and efficiency of all of these types, not just big data.


Disclaimer: The opinions expressed here are my own and don't necessarily reflect the opinion of my employer. The material written here is copyright (c) 2010 by Steve Sarsfield. To request permission to reuse, please e-mail me.