Sunday, November 10, 2013

Big Data is Not Just Hadoop

Hybrid Solutions will Solve our Big Data Problems for Years to Come 

When I talk to people on the front lines of big data, I notice that the most common use case is providing visualization and analytics across the types and volumes of data we have in the modern world.  For many, it’s an expansion of the power of the data warehouse to deal with the data-bloated world in which we live.

Today, you have bigger volumes and more sources, and you are being asked to turn around analytics even faster than before.  Overnight runs are still in use, but business users increasingly expect real-time analytics. 

To deal with the new volumes of data, the yellow elephant craze is in full swing and many companies are looking for ways to use Hadoop to store and process big data. Last week at Strata/Hadoop World, many of the keynote speeches talked about the fact that there are really no limits to Hadoop.  I agree. However, in data governance, you must consider not only the technical solutions, but also the processes and people in your organization, and you must fit the solutions to the people and process.

As powerful as Hadoop is, there is still a shortage of skilled Map/Reduce coders and Pig scripters, and many talented analytics professionals aren't experts in R yet. This shortage will be with us for decades as a new generation of IT workers is trained in Hadoop.

This is in part why so many Hadoop distributions are putting SQL on Hadoop.  It is also why many traditional analytics vendors are adding Hadoop connectivity and ways to access the Hadoop cluster from their SQL-based applications.  The two worlds are colliding, and it's very good for the world of analytics.
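
To make that collision concrete, here is a minimal sketch of what SQL-on-Hadoop access can look like from a plain Java application using Hive's JDBC interface (HiveServer2). The host, database, credentials, table and column names are placeholders I've made up for illustration, and it assumes the hive-jdbc driver is on the classpath; your cluster and schema will differ.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SqlOnHadoop {
    public static void main(String[] args) throws Exception {
        // Load the Hive JDBC driver (requires the hive-jdbc jar on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Placeholder connection details: jdbc:hive2://<host>:<port>/<database>
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hadoop-edge-node:10000/default", "analyst", "");
             Statement stmt = conn.createStatement()) {

            // Ordinary SQL, but the heavy lifting happens on the Hadoop cluster.
            ResultSet rs = stmt.executeQuery(
                "SELECT region, SUM(amount) AS total_sales " +
                "FROM web_transactions GROUP BY region");

            while (rs.next()) {
                System.out.println(rs.getString("region") + "\t" + rs.getDouble("total_sales"));
            }
        }
    }
}

The same query could just as easily come from a BI tool's SQL generator, which is exactly the point: the analyst writes SQL, not Map/Reduce.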

I’ve blogged about the cost of big data solutions and traditional enterprise solutions, and how they differ.  In short, you tend to spend money on licenses with an old-school analytics solution, while your money goes to expertise and training if you adopt a Hadoop-centric approach.  But even this line is getting blurry as SQL-based solutions open up their queries to Hadoop storage. Analytical databases can deliver fast big data analytics with access to Hadoop, as well as compression and columnar storage when the data is stored natively.  You don’t even need open source to get a term-license model today; term licenses are increasingly available in other data storage solutions, as are pay-per-use models that charge per terabyte.

If you have a big data problem that needs to be solved, don’t jump right on the Hadoop bandwagon.  Consider the impact that big data will have on your solutions and on your teams and take a long look at the new generation of columnar data storage and SQL-centric analytical platforms to get the job done.

Sunday, January 20, 2013

Top Four Reasons Why Financial Services Companies Need Solid Data Governance

In working with clients in the financial services business, I’ve noticed a common set of reasons why they adopt data governance.  When it comes down to proving the value of data management, it’s all about revenue, efficiency and compliance.

Number One - Accurate Risk Assessment

Driven by regulations like Sarbanes-Oxley and Dodd-Frank, a financial services company's risk and assurance teams are often asked to determine the amount of regulatory capital reserves when building credit risk models. A crucial part of this function is understanding the effect the underlying data has on the accuracy of the calculations. Teams must be able to attest to the quality of the data by having the appropriate monitoring, controls, and alerts in place.  They must provide regulators with information they can believe in.
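
As a simple illustration of what "monitoring, controls, and alerts" can mean in practice, here is a hypothetical sketch of a completeness check on one critical risk field. The records, the field name (exposure_at_default) and the 99% threshold are assumptions I've invented for the example; a real control framework would profile many fields, store the metrics and route alerts to data stewards.

import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class CompletenessControl {
    // Assumed tolerance: at least 99% of records must carry the critical field.
    static final double THRESHOLD = 0.99;

    public static void main(String[] args) {
        // Hypothetical loan records feeding a credit risk model.
        List<Map<String, Object>> loans = Arrays.asList(
            Map.of("loan_id", "L-1001", "exposure_at_default", 250_000.0),
            Map.of("loan_id", "L-1002"),                        // missing the field
            Map.of("loan_id", "L-1003", "exposure_at_default", 90_000.0)
        );

        long populated = loans.stream()
            .filter(r -> r.get("exposure_at_default") != null)
            .count();
        double completeness = (double) populated / loans.size();

        System.out.printf("exposure_at_default completeness: %.1f%%%n", completeness * 100);
        if (completeness < THRESHOLD) {
            // In practice this would alert the data steward or halt the capital run.
            System.err.println("ALERT: completeness below threshold; investigate before calculating capital");
        }
    }
}

The same pattern extends to validity and range checks; the point is that the control is explicit, repeatable and something you can attest to.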

Data champions in this field must be able to draw the link between the regulations and the data. They must assess the alignment of the data and processes that support the models, quantify the impact of poor data quality on regulatory capital calculations, and put monitoring and governance in place to manage this data over time.

Number Two – Process Efficiency

If your team is spending a lot of time checking and rechecking reports, it can be quite inefficient. When one report conflicts with another, it casts doubt on the validity of all reports, and a data quality issue is likely behind it. The problem manifests itself as a huge time-suck on monthly and quarterly closes.  Data champions must point to this inefficiency in order to put a solid data management strategy in place.

Number Three - Anti-money Laundering

Financial services companies need to be vigilant about money laundering. To do this, some look for currency transactions designed to evade current reporting requirements. If a client makes five deposits of $3,000 each in a single day, for example, it may be an attempt to stay under the radar on reporting. Data quality processes must help identify these transactions, even if the client makes the deposits at different branches, uses different deposit mechanisms (ATM or customer service rep), or uses slight variations on their name.
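
To show the shape of such a check, here is a small, hypothetical sketch that groups one day's deposits by client and flags the pattern described above: several deposits that each stay under the reporting threshold but together exceed it. The transactions, client IDs and the $10,000 threshold are illustrative assumptions; real AML monitoring adds entity resolution, rolling time windows and many more rules.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class StructuringCheck {
    // Assumed reporting threshold for the example.
    static final double REPORTING_THRESHOLD = 10_000.0;

    record Deposit(String clientId, String date, String channel, double amount) {}

    public static void main(String[] args) {
        // Hypothetical deposits made on the same day via different channels and branches.
        List<Deposit> deposits = List.of(
            new Deposit("C-042", "2013-01-18", "ATM",    3_000.0),
            new Deposit("C-042", "2013-01-18", "BRANCH", 3_000.0),
            new Deposit("C-042", "2013-01-18", "BRANCH", 3_000.0),
            new Deposit("C-042", "2013-01-18", "TELLER", 3_000.0),
            new Deposit("C-042", "2013-01-18", "ATM",    3_000.0),
            new Deposit("C-077", "2013-01-18", "BRANCH", 2_500.0)
        );

        // Group by resolved client identity and day, ignoring branch and channel.
        Map<String, List<Deposit>> byClientDay = deposits.stream()
            .collect(Collectors.groupingBy(d -> d.clientId() + "|" + d.date()));

        byClientDay.forEach((key, txns) -> {
            double total = txns.stream().mapToDouble(Deposit::amount).sum();
            boolean eachUnderThreshold = txns.stream()
                .allMatch(d -> d.amount() < REPORTING_THRESHOLD);
            if (txns.size() > 1 && eachUnderThreshold && total >= REPORTING_THRESHOLD) {
                System.out.println("Possible structuring: " + key + " made "
                    + txns.size() + " deposits totaling $" + total);
            }
        });
    }
}

Note that the grouping key assumes you can already resolve a client to a single identity; that is exactly where data quality and matching earn their keep.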

Other systems monitor wire transfers to look for countries or individuals that appear on the list compiled by the Treasury’s Office of Foreign Assets Control (OFAC). Being able to successfully match your clients against the OFAC list using fuzzy matching is crucial to success.
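
Here is a minimal sketch of the fuzzy-matching idea using a plain Levenshtein edit distance. The client name, the list entry and the 0.85 match threshold are all made up for illustration; production screening engines layer on phonetic matching, alias handling, transliteration and carefully tuned thresholds.

public class FuzzyNameMatch {
    // Classic Levenshtein edit distance between two strings.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    // Normalize both names and convert the distance into a 0..1 similarity score.
    static double similarity(String a, String b) {
        String x = a.trim().toLowerCase();
        String y = b.trim().toLowerCase();
        int maxLen = Math.max(x.length(), y.length());
        return maxLen == 0 ? 1.0 : 1.0 - (double) levenshtein(x, y) / maxLen;
    }

    public static void main(String[] args) {
        String clientName = "Jonh Q. Sample";   // hypothetical client record, typo and all
        String listedName = "John Q. Sample";   // hypothetical watch-list entry
        double score = similarity(clientName, listedName);
        System.out.printf("similarity = %.2f%n", score);
        if (score >= 0.85) {                    // assumed review threshold
            System.out.println("Potential match - route to an analyst for review");
        }
    }
}

In practice, an algorithm like Jaro-Winkler handles transposed characters better than raw edit distance, but the principle is the same: score the similarity and route borderline scores to a human.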

Number Four – Revenue

Despite all of the regulations and reporting that banks must attend to, there is still an obligation to stockholders to make money while providing excellent service to customers.  Revenue hinges on a consistent, current and relevant view of clients across all of the bank’s products.  Poor data management creates significant hidden costs and can hinder your ability to recognize and understand opportunity – where you can up-sell and cross-sell your customers.  Data champions and data scientists must work with the marketing teams to identify and tackle the issues here.  Knowing when and how to ask a customer for new business can lead to significant growth.

These are just some examples that are very common to financial services.  In my experience, most financial services companies have all of these issues to some degree, but tackle them with an agile approach, taking a small portion of one of these problems and solving it little by little. Along the way, they track the value delivered and the potential value of further investment.

Sunday, January 6, 2013

Big Data After the Hype

Total Data Management

This year, I’ve been following the meteoric rise of big data. It has been a boon for vendors who are venturing into this area.  It has produced countless start-ups and much buzz in the data management world.

However, when it comes down to it, what we’re really talking about here is data management and data governance.  Whether you have to deal with big data, enterprise data or spreadmarts, data needs to be managed no matter what the size. The tides are turning toward a total data management approach. Recent surveys show that despite the market hype, most technologists and business users feel that big data is an off-shoot of data management, not a branch of technology in itself. 

So, why the hype?  I'm convinced it is mostly vendor-generated. In 2010, when big data began to gain notoriety, there was a disconnect for some vendors.  While partnered with traditional enterprise data management companies like the Oracles and IBMs of this world, not all vendors were prepared for the growing popularity of open source and Hadoop. Others were (and still are) better positioned. They began talking about big data as a product differentiator. Vendors who don’t have the basic architecture for managing data in Hadoop have been struggling and will continue to struggle. 

For example, ETL tools that have a basic connection to move data in and out of HBase, Hortonworks and Cloudera can’t stop there.  The power of Hadoop must be harnessed, and that’s not always easy when your technology requires executables tied to CPUs.  One of the powerful things about Hadoop is that it scales using languages like Pig, Sqoop and Java without having to install anything.  Want to expand the number of servers?  Add a datanode, tell the namenode and rebalance - and you’re off and running.  However, even this simple innovation is more difficult on some vendors’ architectures than others.

Another rethinking taking place in the market is the long-standing CPU-based pricing structure.  Vendors who keep their pricing based on core processors for Hadoop will continue to struggle, because it runs counter to the power of Hadoop. You hear about the volume, velocity and variety.  Technically, if you want to step up the volume with another datanode, it’s no big deal. However, it becomes a big deal if you have to renegotiate a vendor contract each time.

Around this time last year, I wrote about the various costs associated with the scale of data. In summary, the costs of licenses and connectors are bigger for enterprise data, while the costs associated with skills are more likely to affect you with big data.  There will come a time when the skills gap is closed, however.

In 2013, we’ll begin to see the un-hyping of big data in favor of this total data management approach. For buyers, big data will be a tick-box in their RFPs in the effort to manage data, no matter what the size.

Thursday, August 9, 2012

Big Data, Good and Evil

As I get more and more involved in the world of Big Data, I find myself reflecting upon where it all will go.  Big Data could help us live better lives by solving crimes, predicting scientific outcomes, detecting fraud and, of course, optimizing our marketing so that we don’t bother people who don’t want our products and target them when we think they do. While the ‘goodness’ of some of those items is decidedly debatable, that’s the bright side. Big Data does represent a paradigm shift for our society, but since it’s still young, we’re just not sure exactly how big Big Data is yet.

When I write about Big Data, I’m talking about leveraging new sources of data like social media, transaction data, sensor data, networked devices and more.  These data sources tend to be… well, big. Mashing them up with your traditional CRM data or supply chain data can tell you some fascinating things.  They even tell you some interesting things all by themselves. It can give you information that wasn’t possible to attain until recently, when we achieved the technology and ability to handle Big Data in a meaningful way. We are already starting to see amazing case studies from Big Data.

On the other hand, there is potential folly. Despite the absolute evolutionary power that Big Data can bring us, it’s also human nature for some to abuse it.  When technological evolution brought us snail mail, many abused it with junk mail.  When technology brought us e-mail, a few abused it by spamming us. Abuse is my biggest concern. The potential abuse with Big Data is that corporations completely figure out what makes us tick, giving them unprecedented power over our buying decisions. It could lead to social issues, too.  For example, if Big Data says that people who eat cheeseburgers after 9 PM are more likely to have a heart attack, do we justify outlawing cheeseburgers after 9?  I'd rather make my own decisions.

The movie “Minority Report,” starring Tom Cruise, comes to mind.  As truth imitates fiction, I can’t help but think of the mall scene from the movie, which painted a fairly grim picture of marketing in the future. Now, I see it as prophetic.

This type of marketing already exists within some free online e-mail systems.  For example, if I’m e-mailing my friends about a trip to Vegas or gambling, or even when I post this blog that mentions Vegas, it’s no mistake when ads for Caesars Palace appear.  It’s cool, and yet I am uneasy. Will future employers use big data to help decide if I am worthy of work? Will my e-mail conversations about Las Vegas lead them to believe I am a compulsive gambler, thus giving the edge to someone else?  If so, what is my recourse to set the record straight?

Government has reportedly been getting in on big data, too.  A recent Wired magazine story talked about a huge government data facility in Utah. While there is clearly a "good" aspect to this big data, namely the catching of bad guys, the most troubling aspect might be that citizens have no control over their own data. Oversight of what can and cannot be done with the wealth of information at this facility is unclear.

That said, I have a generally positive view of the good that Big Data will bring to society, and the positive influence it will have on data management professionals. We have a society today that is more open and more willing to post private information publicly. Society is therefore more tolerant today and will be even more so in the future.

Ultimately, if and when Big Data becomes abusive to privacy, fuels overzealous capitalism or creates social issues, expect capitalism to solve that, too. Look for companies that set up online e-mail and promote the fact that they don’t track conversations. Look for services that overwhelm any negative information about you in the Big Data universe with positive information. We could be looking at a cottage industry of managing and protecting your Big Data image.

Thursday, May 17, 2012

Naming your Data Management Project


In my line of work, I get to see many requests for proposals, and sometimes I am invited to take part as a project progresses.  I may be one of the only people on earth who takes pleasure in seeing companies improve their data management strategy, because I almost always see a huge return on investment. We’re making the world a better place by managing data the right way, so thanks to those who have made me part of your project.


I do have one word of advice for project managers, however. Please think when you name your projects. I can’t tell you how many times I’ve come into a project where some long description is the name of the project, and it soon becomes an equally uncompelling acronym.  They are project names like:

  • Salesforce Marketing Analyst Data Mart and Sales Marketing Information Daily Audit or you can go by the catchy acronym SMADMASMIDA
  • Outlook Sales Partner Contact Daily Reconciliation or OSPCDR
  • Operational Business Intelligence for Marketing Analytics or OBIMA

The names and their acronyms are pretty close to meaningless.  People will be more excited by references to the news and pop culture than by intellectual terminology. It matters. Using the technical terms puts you in an elitist IT club, and remember, we’re trying to break down the barriers between business and IT.

Some examples:

  • Any business intelligence project today that doesn’t have the name ‘Moneyball’ in the title is missing a huge opportunity.  Everyone knows what the movie Moneyball is about and how the Oakland A’s used business intelligence to win. It makes for an easy sale of your project to the business.
  • Big Data initiatives could be named after Adele’s “Rolling in the Deep”.  Rolling in the deep is what a ship does while out at sea; the image is a small ship tossed on a very deep, dark ocean (of data). The song title is also an adaptation of the British slang phrase “roll deep”, which means to have a group who always has your back and can get you out of trouble. It’s a nice image for the pervasiveness of data, the strength in numbers, and the case for data governance.  

Of course, pop culture is a good way to start, but company culture and the history of your organization are also great inspiration for naming your project.  Given the French background of Talend, my current employer, a name for a data consolidation project might be something like ‘Pas de Deux’, which promotes a vision of a relationship between two people or things.

The point is, try to use the name of the project to promote a vision of the business problem you’re trying to solve.  It’ll play better with the business folks. The name matters.

Monday, April 2, 2012

Why Code Base is Important in Vendor Selection


The horticulture of software

Spring has sprung here in the northern hemisphere, and minds turn to the plant life that will be sprouting all across our home towns. The new growth has me thinking about the similarities between horticulture and the code base of our data management solutions.

Reviewing software solutions before you buy is a major effort for users and vendor selection committees. Much time is spent looking at whether the features of the product will meet the team's needs. Features are so important that companies will spend time producing RFPs with extensive feature lists. They may even require a proof of concept, where the vendor must install and test the solution in the purchaser’s environment. This goes for the applications used to manage data, but also for many other applications.

However, I believe that buyers should also look carefully at how the code base has grown. In the data management field, we have undergone decades of technology evolution combined with decades of market consolidation.  The code base of the application you’re about to buy may have grown according to one of the following horticultural strategies:

  • Grafting  –  A large software company sees potential in the data management field and begins to acquire companies, grafting them together to create a solution. Sometimes the acquisition isn’t done by technologists, but by upper management seeking to fill holes in the product line. Sometimes they even buy competing technologies, leaving everyone trying to figure out which will win. Sometimes the graft doesn’t take.
  • Old Growth – Companies have an existing technology that has worked for decades. However, back in 1990 when they released version 1.0, Java was experimental and not the dominant force it is today.  FORTRAN was the preferred programming language and COBOL copybooks were the data model.  I know some companies in the data management market have spent millions updating old-growth code to be more competitive in this market, and some others have not.  This becomes a dilemma for all vendors at some point: when do you prune out the dead wood?
  • Sapling – Companies that are just breaking into the marketplace with a good-looking start for data management.  However, the sapling doesn’t yet have all the branches you want on it.  Will the sapling survive among the other deciduous solutions in the market?

When you’re selecting a vendor, you ideally want a code base that is mature, but not too mature.  You want limited grafting.  The growth of the code and the grafting affect:

  • Speed of innovation for the vendor
  • Customization for you
  • Future expansion for both of you
  • The age and experience of the technologists necessary to operate it
  • Consulting requirements
  • Ability to cross-train personnel (e.g., data integration people running data quality tools and vice versa)

So, when you’re selecting a data management solution, or any technology solution, don’t just compare the features, but take a look at how the product grew to where it is today.  Look for the solution in the optimal stage of growth that will meet your needs today and those for the future.


Thursday, March 22, 2012

Big Data Hype is an Opportunity for Data Management Pros

Big Data is a hot topic in the data management world. Recently, I’ve seen press and vendors describing it with words like crucial, tremendous opportunity, overcoming vexing challenges, and enabling technology.  With all the hoopla, management is probably asking many of you about your Big Data strategy. It has risen to the corporate management level; your CxO is probably aware.

Most of the data management professionals I’ve met are fairly down-to-earth, pragmatic folks.  Data is being managed correctly or not. The business rule works, or it does not. Marketing spin is evil. In fact, the hype and noise around big data may be something many of you simply filter out. You’re appropriately trying to look through the hype and get to the technology or business process that’s being enhanced by Big Data.

However, in addition to filtering through the big data hype to the IT impact, data management professionals should also embrace the hype.

Sure, we want to handle the high-volume transactions that often come with big data, but we still have relational databases and unstructured data sources to deal with.  We still have business users using Excel for databases with who-knows-what in them.  We still have e-mail attachments from partners that need to be incorporated into our infrastructure.  We still have a wide range of data sources and targets to deal with, including, but not limited to, big data. In my last blog post, I wrote about how big data is just one facet of total data management.

The opportunity is for data management pros to think about their big data management strategy holistically and solve some of their old and tired issues around data management. It’s pretty easy to draw a picture for management that Big Data needs a Total Data Management approach, one that addresses some of our worn-out and politically charged data governance issues, including:


  • Data Ownership – One barrier to big data management is accountability for the data.  Once you decide to plan for big data, you also need to decide who owns the big data, and all your data sets for that matter.
  • Spreadmarts – Keeping unmanaged data out of spreadsheets is increasingly crucial in companies that must handle Big Data. So-called “spreadmarts,” important pieces of data stored in Excel spreadsheets, are easily replicated to team desktops. In this scenario, you lose control of versions as well as standards. However, big data can help make it easy for everyone to use corporate information, no matter what the size.
  • Unstructured Data – Although big data might tend to be more analytical than operational, big data is most commonly unstructured.  A total data management approach takes unstructured data into account in either case. Having technology and processes that handle unstructured data, big or small, is crucial to total data management.
  • Corporate Strategy and Mergers – If your company grows through acquisition, managing big data is about being able to handle not only your own data, but the data of the companies you acquire.  Since you don’t know what systems those companies will have, a big data governance strategy and flexible tools are important.


My point is, with big data, try to avoid the typical noise-filtering exercise you normally apply to the latest buzzword.  Instead, use the hype and buzz to your advantage to address a holistic view of data management in your organization.


Disclaimer: The opinions expressed here are my own and don't necessarily reflect the opinion of my employer. The material written here is copyright (c) 2010 by Steve Sarsfield. To request permission to reuse, please e-mail me.