
Monday, June 13, 2011

The Differences Between Small and Big Data

There is a lot of buzz today about big data and companies stepping up to meet the challenge of ever-increasing data volumes. At the center of it all are Hadoop and the Cloud. Hadoop can intelligently manage the distribution of both processing and files, handling the infrastructure needed to break big data down into more manageable chunks for processing by multiple servers. Likewise, a cloud strategy can take data management outside the walls of a corporation and into a highly scalable infrastructure.
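To make the chunking idea concrete, here is a toy sketch in plain Python of the split/process/combine pattern that Hadoop automates across a cluster. It is only an illustration of the pattern, not Hadoop itself; the word-count job and the four-way split are made up for the example.

```python
# Toy illustration of the split/process/combine pattern that Hadoop automates
# across many servers. Here the "chunks" are slices of a list and the
# "cluster" is a local process pool.
from multiprocessing import Pool

def count_words(chunk):
    # The "map" step: compute a partial result for one chunk.
    counts = {}
    for line in chunk:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

def merge(partials):
    # The "reduce" step: combine the partial results into one answer.
    total = {}
    for part in partials:
        for word, n in part.items():
            total[word] = total.get(word, 0) + n
    return total

if __name__ == "__main__":
    lines = ["big data big buzz", "big files many servers", "data data data"]
    chunks = [lines[i::4] for i in range(4)]      # break the input into 4 chunks
    with Pool(4) as pool:                         # process the chunks in parallel
        partials = pool.map(count_words, chunks)
    print(merge(partials))
```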

Do you have big data?  It’s difficult to know precisely, because big data is vaguely defined. You may qualify for big data technology if you face hundreds of gigabytes of data, or it may take hundreds or thousands of terabytes. The classification of “big data” is not strictly a matter of data size; other business factors come into play, too. Your data management infrastructure needs to take into account future data volumes, peaks and lulls in demand, business requirements and much more.

Small and Medium-Sized Data

What about “small” and medium-sized data? For example, data from spreadsheets, the occasional flat file, leads from a trade show, and catalog data from vendors may be vital to your business processes. With a new industry focus on transparency, business-user involvement and the sharing of data, small data is a constant issue.  Spreadsheets and flat files are the preferred way to share data today because most companies have some process for handling them. When you get these small to medium-sized data sets, it is still necessary to (see the sketch after this list):
  • profile them
  • integrate them into your relational database
  • aggregate data from these sources, or extract only the vital parts
  • apply data quality standards when necessary
  • use them as part of a master data management (MDM) initiative
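As a rough illustration of the first two bullets, the sketch below loads a hypothetical CSV of trade-show leads into a relational table and runs one quick check on it. The file name and column names are made up for the example, and a real job would add proper typing, error handling and data quality rules.

```python
# Minimal sketch: pull a small CSV of leads into a relational table, then run
# one basic check. The file name and columns are hypothetical.
import csv
import sqlite3

conn = sqlite3.connect("leads.db")
conn.execute("CREATE TABLE IF NOT EXISTS leads (name TEXT, company TEXT, email TEXT)")

with open("tradeshow_leads.csv", newline="") as f:
    rows = [(r["name"], r["company"], r["email"]) for r in csv.DictReader(f)]

conn.executemany("INSERT INTO leads VALUES (?, ?, ?)", rows)
conn.commit()

# A first, crude profile: how many leads arrived without a usable email address?
missing = conn.execute(
    "SELECT COUNT(*) FROM leads WHERE email IS NULL OR TRIM(email) = ''"
).fetchone()[0]
print(f"{missing} of {len(rows)} leads are missing an email address")
```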

The Different Goals of Big Data and Little Data
With big data, the concern is usually whether your data management technology can handle massive quantities of data and still provide meaningful aggregates.  You need solutions that will scale to meet your data management needs.  Handling small and medium data sets, however, is more about short- and long-term costs.  How can you quickly and easily integrate data without a lot of red tape, big license fees, pain and suffering?

Think about it. When you need to handle small and medium data, you have options:
  • Hand-coding: Hand-coding is sometimes faster than any tool, and it may still be OK for ad-hoc, one-off data integration.  Once you find yourself hand-coding again and again, though, you’ll start rethinking that strategy. Eventually, managing all that code will waste time and cost you a bundle, and if your data volumes grow, hand-coded solutions quickly become obsolete because they don’t scale. Hand-coding gets high marks on speed to value, but falters in sustainability and long-term costs.
  • Open Source: Open source data management tools provide a quick way to get started, low overall costs and high sustainability.  By just downloading and learning the tools, you’re on your way to getting data management done.  The open source solutions may have some limitations on scalability, but most open source providers have low-cost commercial upgrades that meet these needs.  In other words, it's easy to start today and leverage Hadoop and the Cloud if you need it later. Open source gets high marks on speed to value, sustainability and costs.
  • Traditional Data Management Vendors: Small data is a tough issue for the mega-vendors. Even for 50K-100K records, the license cost in both the short term and the long term can be prohibitive.  The mega-vendor solutions do tend to scale well, making them sustainable at a cost. However, mergers in the data management business do happen, and the sustainability of a product can be affected by them.  Commercial vendors get respectable marks in speed to value and sustainability, but falter with high up-front costs and maintenance fees.
I've heard it a million times in this business: start small and fast with technology that gives you a quick success but also scales to future tasks.

    Tuesday, November 16, 2010

    Ideas Having Sex: The Path to Innovation in Data Management

    I read a recent analyst report on the data quality market and “enterprise-class” data quality solutions. Per usual, the open source solutions were mentioned only in passing while the data quality solutions of the past were given high marks. Some of the top-ranked solutions originated from the days when the mainframe was king. Some of the top contenders still contain applications cobbled together from ill-conceived acquisitions. It got me thinking about the way we do business today and how much of it is changing.

    Back in the 1990’s or earlier, if you had an idea for a new product, you’d work with an internal team of engineers and build the individual parts.  This innovation took time, as you might not always have exactly the right people working on the job.  It was slow and tedious. The product was always confined by its own lineage.

    The Android phone market is a perfect example of the modern way to innovate.  Today, when you want to build something groundbreaking like an Android phone, you pull in expertise from all around the world. Sure, Samsung might make the CPU and video processing chips, but Primax Electronics in Taiwan might make the digital camera and Broadcom in the US the touch screen, plus many others. Software vendors push the platform further with their cool apps. Innovation happens at breakneck speed because the Android is a collection of ideas that have sex and produce incredible offspring.

    Isn’t that really the model of a modern company?  You have ideas getting together and making new ideas. When there is free exchange between people, there is no need to re-invent something that has already been invented. See the TED talk for more on this concept, in which British author Matt Ridley argues that, throughout history, the engine of human progress and prosperity has been "ideas having sex.”

    The business model behind open source has a similar mission.  Open source simply creates better software. Everyone collaborates, not just within one company, but among an Internet-connected, worldwide community. As a result, the open source model often builds higher quality, more secure, more easily integrated software. It does so at a vastly accelerated pace and often at a lower cost.

    So why do some industry analysts ignore it? There’s no denying that there are capitalist and financial reasons.  I think if an industry analyst were to actually come out and say that an open source solution is the best, it would be career suicide. The old school would shun the analyst, making him less relevant. The way the industry pays and promotes analysts, and vice versa, seems to favor the enterprise application vendors.

    Yet the open source community, along with Talend, has developed a very strong data management offering that should be considered top of its class. The solution leverages other cutting-edge technology. To name just a few examples:
    • if you need to scale up, distributed platform technology from Hadoop enables it to work with thousands of nodes and petabytes of data.
    • very strong, enterprise-class data profiling.
    • matching that users can actually use and tune without having to jump between multiple applications.
    • a platform that grows with your data management strategy, so that if your future is MDM, you can move there seamlessly without having to learn a new GUI.
    The way we do business today has changed. Innovation can only happen when ideas have sex, as Matt Ridley puts it. As long as we’re engaged in exchange and specialization, we will achieve those new levels of innovation.

    Monday, February 22, 2010

    Referential Treatment - The Open Source Reference Data Trend

    Reference data can be used in a huge number of data quality and data enrichment processes.  The simplest example is a table that contains cities and their associated postal codes – you can use an ETL process to make sure that all your customer records with the postal code 02026 refer to the standardized “Dedham, MA” for the city and state, not variations like “Deadham Mass” or “Dedam, Massachusetts”.
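    A minimal version of that lookup, assuming the postal-code reference table has already been loaded into memory, might look like the sketch below; the table contents and field names are illustrative only, and a real ETL job would read the reference data from a database or file.

```python
# Illustrative only: standardize city/state from a postal-code reference table.
REFERENCE = {
    "02026": ("Dedham", "MA"),
    # ...one entry per postal code, loaded from your reference source
}

def standardize(record):
    """Overwrite free-text city/state with the reference values for the zip code."""
    match = REFERENCE.get(record.get("zip"))
    if match:
        record["city"], record["state"] = match
    return record

print(standardize({"name": "Acme Corp", "zip": "02026",
                   "city": "Deadham Mass", "state": ""}))
# city becomes "Dedham" and state becomes "MA"
```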

    Reference data is not limited to customer addresses, however. If everyone used the same reference data for parts, you could easily exchange procurement data between partners.  If only certain values are allowed in any given table, reference data supports validation.  Having standards for supply chain, procurement, finance and accounting data makes those processes more efficient.  Organizations like ISO and ECCMA are working on that.

    Availability of Reference Data
    In the past, it was difficult to get your hands on reference data. Long ago, no one wanted to share reference data with you - you had to send your customer data to a service provider and get the enriched data back.  Others struggled to develop reference data on their own. Lately I’m seeing more and more high quality reference data available for free on the Internet.   For data jockeys, these are good times.

    GeoNames
    A good example of this is GeoNames.  The GeoNames geographical database is available for download free of charge under a creative commons attribution license. According to the web site, it “aggregates over 100 different data sets to build a list containing over eight million geographical names and consists of 7 million unique features whereof 2.6 million populated places and 2.8 million alternate names. The data is accessible free of charge through a number of web services and a daily database export.”

    GeoNames combines geographical data such as names of places in various languages, elevation, population and others from various sources. All lat/long coordinates are in WGS84 (World Geodetic System 1984). Like Wikipedia, users may manually edit, correct and add new names.
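    If you pull down one of the country extracts, a few lines of Python are enough to start using it. The sketch below assumes the tab-delimited column layout described in the GeoNames readme (id, name, ascii name, alternate names, latitude, longitude, and so on), so verify the positions against the file you actually download.

```python
# Sketch: read a GeoNames country extract (e.g. US.txt from the daily export)
# and build a simple place-name -> (lat, lon) lookup.
lookup = {}
with open("US.txt", encoding="utf-8") as f:
    for line in f:
        cols = line.rstrip("\n").split("\t")
        # Column positions per the GeoNames readme (verify against your download):
        # 0 = geonameid, 1 = name, 2 = asciiname, 3 = alternatenames,
        # 4 = latitude, 5 = longitude, ...
        lookup[cols[2].lower()] = (float(cols[4]), float(cols[5]))

print(lookup.get("dedham"))   # the coordinates GeoNames holds for Dedham
```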

    US Census Data
    Another rich set of reference data is the US Census “Gazetteer” data. Courtesy of the US government, you can download a database with the following fields:
    • Field 1 - State Fips Code
    • Field 2 - 5-digit Zipcode
    • Field 3 - State Abbreviation
    • Field 4 - Zipcode Name
    • Field 5 - Longitude in Decimal Degrees (West is assumed, no minus sign)
    • Field 6 - Latitude in Decimal Degrees (North is assumed, no plus sign)
    • Field 7 - 2000 Population (100%)
    • Field 8 - Allocation Factor (decimal portion of state within zipcode)
    So, our Dedham, MA entry includes this data:
    • "25","02026","MA","DEDHAM",71.163741,42.243685,23782,0.003953
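    Parsing that record takes only a few lines; the sketch below maps the eight fields to names and restores the implied signs on the coordinates (West longitude becomes negative, North latitude stays positive).

```python
# Parse one line of the Census zip-code "Gazetteer" file into named fields.
# Longitude is published without a minus sign (West assumed), so we negate it.
import csv
import io

line = '"25","02026","MA","DEDHAM",71.163741,42.243685,23782,0.003953'
fields = next(csv.reader(io.StringIO(line)))

record = {
    "state_fips": fields[0],
    "zipcode": fields[1],
    "state": fields[2],
    "zip_name": fields[3],
    "longitude": -float(fields[4]),   # West is assumed in the source file
    "latitude": float(fields[5]),     # North is assumed in the source file
    "population_2000": int(fields[6]),
    "allocation_factor": float(fields[7]),
}
print(record)
```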
    It’s Really Exciting!
    When I talk about reference data at parties, I immediately see eyes glaze over and it’s clear that my fellow party-goers want to escape my enthusiasm for it.  But this availability of reference data is really great news! Together with the open source data integration tools like Talend Open Studio, we’re starting to see what I like to call “open source reference data” becoming available. It all makes the price of improving data quality much lower and our future much brighter.

    There’s so much to talk about with regard to reference data and so many good sources.  I plan to make more posts on this topic, but feel free to post your beloved reference data sources here in the comments section.

    Tuesday, February 16, 2010

    The Secret Ingredient in Major IT Initiatives

    One of my first jobs was as an assistant cook at a summer camp.  (In this case, the term ‘cook’ was loosely applied; it meant scrubbing pots and pans for the head cook.) It was there I learned that most cooks have ingredients they tend to use more often than others.  The cook at Camp Marlin tended to use honey where applicable.  Food TV star Emeril likes to use garlic and pork fat.  Some cooks add a little hot pepper to their chocolate recipes – it is said to bring out the flavor of the chocolate.  Definitely a secret ingredient.
    For head chefs taking on major IT initiatives, the secret ingredient is always data quality technology. Attention to data quality isn’t a recipe on its own so much as something that makes an IT initiative better.  Let’s take a look at how this happens.

    Profiling
    No matter what the project, data profiling provides a complete understanding of the data before the project team attempts to migrate it. This helps the team create a more accurate plan for integration.  Migrating data to your new solution as-is, on the other hand, is ill-advised; it can lead to major cost overruns and project delays as you load and reload the data.
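    Profiling does not have to wait for a big tool, either; even a quick script that reports blank rates and distinct values per column will surface surprises before anything is migrated. The sketch below assumes a hypothetical CSV extract of the source table and is no substitute for a real profiler.

```python
# Quick-and-dirty column profile of a CSV extract: row count, blank rate and
# distinct values per column. A real profiling tool goes much further.
import csv
from collections import defaultdict

rows = 0
blanks = defaultdict(int)
distinct = defaultdict(set)

with open("source_extract.csv", newline="") as f:   # hypothetical extract
    for record in csv.DictReader(f):
        rows += 1
        for column, value in record.items():
            if not (value or "").strip():
                blanks[column] += 1
            else:
                distinct[column].add(value)

for column in sorted(distinct.keys() | blanks.keys()):
    print(f"{column}: {blanks[column]}/{rows} blank, "
          f"{len(distinct[column])} distinct values")
```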

    Customer Relationship Management (CRM)
    By using data quality technology in CRM, the organization benefits from a cleaner customer list with fewer duplicate records. Data quality technology can work as a real-time process, limiting the number of typos and duplicates entering the system and thus improving call center efficiency.  Data profiling can also help an organization understand and monitor the quality of a purchased list before integration, avoiding issues with third-party data.
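    The matching behind that real-time check can be simple or sophisticated, depending on the product. As a rough illustration of the idea only, the sketch below flags an incoming CRM entry as a probable duplicate using a standard-library string-similarity score; the threshold, fields and sample records are all made up.

```python
# Rough illustration of duplicate detection at CRM data entry: compare the
# incoming record against existing customers with a simple similarity score.
from difflib import SequenceMatcher

existing = [
    {"name": "Dedham Savings Bank", "zip": "02026"},
    {"name": "Acme Corporation", "zip": "01876"},
]

def likely_duplicates(new_record, customers, threshold=0.85):
    hits = []
    for cust in customers:
        score = SequenceMatcher(None, new_record["name"].lower(),
                                cust["name"].lower()).ratio()
        if score >= threshold and new_record["zip"] == cust["zip"]:
            hits.append((cust, round(score, 2)))
    return hits

# A typo-ridden entry still matches the existing customer record.
print(likely_duplicates({"name": "Dedham Savings Bnk", "zip": "02026"}, existing))
```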

    Enterprise Resource Planning (ERP) and Supply Chain Management (SCM)

    If data is accurate, you will have a more complete picture of the supply chain. Data quality technology can be used to report inventory levels more accurately, lowering inventory costs. When you make it part of your ERP project, you may also be able to improve bargaining power with suppliers by gaining better intelligence about your company’s buying power.

    Data Warehouse and Business Intelligence
    Data quality helps disparate data sources act as one when they are migrated to a data warehouse; by standardizing disparate data, data quality makes the data warehouse possible. You will be able to generate more accurate reports when trying to understand sales patterns, revenue, customer demographics and more.

    Master Data Management (MDM)
    Data quality is a key component of master data management. An integral part of making applications communicate and share data is having standardized data.  MDM builds on the basic premise of data quality with additional features like persistent keys, a graphical user interface for managing matches, the ability to publish and subscribe to enterprise applications, and more.
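    As a rough sketch of the persistent-key idea, the snippet below keeps one master identifier per source record and applies a trivial “most recent wins” survivorship rule when updates are published against it. The keys, fields and rule are all illustrative, and real MDM adds matching, stewardship and much more.

```python
# Illustrative sketch of persistent keys in MDM: source records map to a
# master key that survives across loads, and updates merge into the golden
# record. The survivorship rule here (most recent wins) is deliberately naive.
import uuid

master = {}             # master_key -> golden record
source_to_master = {}   # (source system, source id) -> master_key

def publish(source, source_id, record):
    key = source_to_master.get((source, source_id))
    if key is None:
        key = str(uuid.uuid4())                 # persistent key, minted once
        source_to_master[(source, source_id)] = key
        master[key] = {}
    master[key].update(record)                  # "most recent wins" survivorship
    return key

k = publish("CRM", "C-1001", {"name": "Dedham Savings Bank", "zip": "02026"})
publish("CRM", "C-1001", {"name": "Dedham Savings Bank", "phone": "781-555-0100"})
# A matching step (not shown) would map the same customer arriving from ERP
# onto this same master key instead of minting a new one.
print(k, master[k])
```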

    So keep in mind, when you decide to improve data quality, it is often because of your need to make a major IT initiative even stronger.  In most projects, data quality is the secret ingredient to make your IT projects extraordinary.  Share the recipe.

    Thursday, January 21, 2010

    ETL, Data Quality and MDM for Mid-sized Business


    Is data quality a luxury that only large companies can afford?  Of course the answer is no. Your company should be paying attention to data quality whether you are a Fortune 1000 company or a startup. Like a toothache, poor data quality will never get better on its own.

    As a company naturally grows, the effects of poor data quality multiply.  When a small company expands, it naturally develops new IT systems. Mergers often bring in new IT systems, too. The impact of poor data quality slowly invades and hinders the company’s ability to service customers, keep the supply chain efficient and understand its own business. Paying attention to data quality early and often is a winning strategy for even the small and medium-sized enterprise (SME).

    However, SMEs face a challenge with the investment needed for enterprise-level software. While it’s true that the benefit often outweighs the cost, it is difficult for the typical SME to invest in the licenses, maintenance and services needed to implement a major data integration, data quality or MDM solution.

    At the beginning of this year, I started with a new employer, Talend. I became interested in them because they were offering something completely different in our world – open source data integration, data quality and MDM.  If you go to the Talend Web site, you can download some amazing free software, like:
    • a fully functional, very cool data integration package (ETL) called Talend Open Studio
    • a data profiling tool, called Talend Open Profiler, providing charts and graphs and some very useful analytics on your data
    The two packages sit on top of a database, typically MySQL – also an open source success.

    For these solutions, Talend uses a business model similar to what my friend Jim Harris has just blogged about – Freemium. Under this model, free open source content is made available to everyone, providing the opportunity to “up-sell” premium content to a percentage of the audience. Talend works like this: you can enhance your experience with Talend Open Studio by purchasing Talend Integration Suite (in various flavors), and you can take your data quality initiative to the next level by upgrading Talend Open Profiler to Talend Data Quality.

    If you want to take the combined data integration and data quality to an even higher level, Talend just announced a complete Master Data Management (MDM) solution, which you can use in a more enterprise-wide approach to data governance. There’s a very inexpensive place to start and an evolutionary path your company can take as it matures its data management strategy.

    These solutions have been made possible by the combined efforts of the open source community and Talend, the corporation. If you’d like, you can take a peek at the source code, use the basic software and try your hand at coding an enhancement. Sharing that enhancement with the community will only lead to a world full of better data, and that’s a very good thing.

    Thursday, October 22, 2009

    Book Review: Data Modeling for Business


    A couple of weeks ago, I book-swapped with author Donna Burbank. She has a new book entitled Data Modeling for Business. Donna, an experienced consultant by trade, teamed up with Steve Hoberman, a previously published author and technologist, and Chris Bradley, also a consultant, for an excellent exploration of the process of creating a data model. With a subtitle like “A Handbook for Aligning the Business with IT using a High-Level Data Model,” I knew I was going to find some value in the swap.

    The book describes in plain English the proper way to create a data model, but that simple description doesn’t do it justice. The book is designed for those who are learning from scratch – those who only vaguely understand what a data model is. It uses commonly understood, everyday concepts to explain data modeling. The book describes the impact of the data model on the project’s success and digs into setting up data definitions and the levels of detail necessary for them to be effective. All of this is accomplished in a very plain-talk, straightforward tone without the pretentiousness you sometimes get in books about data modeling.

    We often talk about the need for business and IT to work together to build a data governance initiative. But many, including myself, have pointed to the communication gap that can exist in a cross-functional team. In order to bridge the gap, a couple of things need to happen. First, IT teams need to expand their knowledge of business processes, budgets and corporate politics. Second, business team members need to expand their knowledge of metadata and data modeling. This book provides an insightful education for the latter. In my book, the Data Governance Imperative, the goal was the former.

    The book is well written and complete. It’s a perfect companion for those who are trying to build a knowledgeable, cross-functional team for data warehouse, MDM or data governance projects. Therefore, I’ve added it to my recommended reading list on my blog.

    Friday, March 20, 2009

    The Down Economy and Data Integration

    Vendors, writers and analysts are generating a lot of buzz about the poor economic conditions in the world. It’s true that in tough times, large, well-managed companies tend to put off IT purchases until the picture gets a bit rosier. Some speculate that the poor economy will affect data integration vendors and their ability to advance big projects with customers. Yet I don’t think it will have a deep or lasting impact. Here are some of the signs that still seem to point to a strong data integration economy.

    Stephen Swoyer at TDWI wrote a very interesting article that attempts to prove that data integration and BI projects are going full-steam ahead, despite a lock-down on spending in other areas.

    Research from Forrester seems to suggest that IT job cuts in 2009 won’t be as steep as they were in the 2001/2002 dot com bubble burst. Forrester says that the US market for jobs in information technology will not escape the recession, with total jobs in IT occupations down by 1.2% in 2009, but the pain will be relatively mild compared with past recessions. (You have to be a Forrester customer to get this report.)

    You can read the article by Doug Henschen from Intelligent Enterprise for further proof of the impact of BI and real-time analytics. The article contains success stories from Wal-Mart, Kimberly-Clark and Goodyear, too.

    On this topic, SAP BusinessObjects recently asked me if I’d blog about their upcoming webinar entitled Defy the Times: Business Growth in a Weak Economy. The premise of the webinar is that you can use business intelligence and analytics to cut operating expenses and discretionary spending and improve efficiencies. It might be a helpful webinar if you’re on a data warehouse team and trying to prove your importance to management during this economic downturn. Use vendors to help you provide third-party confirmation of your value.

    So, is the poor economy threatening the data integration economy? I don’t think so. When you look at the problems of growing data volumes and the value of data integration, I don’t see how these positive stories can change any time soon. You can run out of money, but the world will never run out of data.

    Monday, March 10, 2008

    Approaching IT Projects with Data Quality in Mind


    I co-authored a white paper at the end of 2006 with a simple goal: to talk directly to project managers about the process they go through when putting together a data-intensive project. By “data-intensive” project, I mean one that deals with mergers and acquisitions data, CRM, ERP consolidation, master data management, or any project where you have to move a lot of data.

    Project teams can be so focused on application features and functions that they sometimes miss the most important part. In the case of a merger, project teams must often deal with unknown data coming in from the merger that may require profiling as part of their project plan. In the case of a CRM system, companies are trying to consolidate whatever ad hoc system is in place, along with data from people who may care very little about data quality. In the case of master data management and data governance, the thought of sharing data across the enterprise brings to mind the need for a corporate standard for data. Data-intensive projects may have different specific needs, but just remembering to consider the data in your project will get you far.

    To achieve real success, companies need to plan a way to manage data as part of the project steps. If you don’t think about the data during project preparation, blueprinting, implementation, rollout preparation, go-live and maintenance, your project is vulnerable to failure. Most commonly, delay and failure are due to a late-project realization that the data has problems. Knowing the data challenges you face early in the process is the key to success.

    This white paper discusses the importance of involving business users in the project and the best ways to ensure their needs are met. It covers ways to stay in scope on the project while considering the big picture and the ongoing concern of data quality within your organization. Finally, it covers how to incorporate technology throughout a project to expedite data quality initiatives. The white paper is still available today for download. Click here and see "Data Quality Essentials: For Any Data-Intensive Project"


    Thursday, February 21, 2008

    SAP Data Management Success Stories

    I’m preparing for a web seminar on SAP data management success, and I’m really starting to look forward to it.

    Moen and Oki Data will be sharing their data quality success stories with our audience. These are two very successful implementations of the Trillium Software System in the SAP environment.

    My Trillium Software colleague, Laurie and I will take up only about ten minutes to first frame up and then wrap up the presentation. But the bulk of the presentation is about Moen and Oki Data and the success they’ve been able to achieve in a) quickly starting a data management program in SAP R/3, SAP ERP and SAP CRM; and b) taking the process and technology from one project to another.

    If you want to join us, please click here. The webinar is on February 27th at 2 PM Eastern.

    Sunday, February 10, 2008

    Mainframe Computing and Information Quality

    Looking for new ways to use the power of your mainframe? My friend Wally called me the other day and was talking about moving applications off the mainframe to the Unix platform and cleansing data during the migration. “Sure, we can help you with that,” I said. But he was surprised to hear that there is a version of the Trillium Software System that is optimized for the mainframe (z/OS). We’ve continually updated our mainframe data quality solution and we have no plans to stop.

    Mainframe computers still play a central role in the daily operations of many large companies. Mainframes are designed to allow many simultaneous users and applications to access the same data without interfering with one another. Security, scalability and reliability are key factors in the mainframe’s power in mission-critical applications. These applications typically include customer order processing, financial transactions, production and inventory control, payroll, and others.

    While others have abandoned the mainframe platform, the Trillium Software System supports the z/OS (formerly known as OS/390) environment. Batch data standardization executes on either a 64-bit or 31-bit system. It also supports CICS, the transaction-processing system designed for real-time work. z/OS and CICS easily support thousands of transactions per second, making the mainframe a very powerful data quality platform. The Trillium Software System can power your mainframe with an outstanding data quality engine, no matter whether your data is stored in DB2, text files, COBOL copybooks or XML.

    The Trillium Software System will standardize, cleanse and match data using our proprietary rules engine. You can remove duplicates, ensure that your name and address data will mail properly, CASS certify data and more. It’s a great way to get your data ready for SOA on the mainframe, too.

    My hat is off to Clara C. on our development team, who heads up the project for maintaining the mainframe version of the Trillium Software System. She’s well known at Trillium Software for her mainframe acumen and for hosting the annual pot-luck lunch around the holidays. (She makes an excellent mini hot dog in Jack Daniels sauce.)

    I’m not sure whether Wally will stick with his mainframe or migrate the whole thing to UNIX servers, but he was happy to know he has an option. With an open data quality platform, like the Trillium Software System, it’s not a huge job to move the whole process from the mainframe to UNIX by leveraging the business rules developed on one platform and copying them to the other.

    Tuesday, February 5, 2008

    Oracle Data Integration Suite - Trillium Software Inside

    Finally! Finally, I can talk about the exciting news regarding Trillium Software’s partnership with Oracle. It’s a perfect decision for Oracle to begin working with Trillium in the data integration market, combining Sunopsis technology with Trillium Software technology to address some of the competitive challenges from IBM and its WebSphere platform.

    Trillium Software has long been a supporter of the Oracle platform, first offering batch technology for cleansing Oracle databases. A few years ago, we began offering direct support for Oracle’s older data integration tool, Oracle Warehouse Builder (OWB). Now, this integration with Oracle Data Integrator (ODI) is going to serve Oracle customers with excellent data quality inside a superb data integration platform.

    Trillium Software prides itself on its connectivity to major enterprise applications. Here are a few of the most popular integrations:

    • SAP - SAP R/3, SAP CRM, SAP ERP and SAP NetWeaver MDM.
    • Oracle - OWB, ODI, Siebel eBusiness, Siebel UCM, Oracle CDH, and Oracle eBusiness Suite.
    • Ab Initio
    • Siperian

    In addition, we still have quite a few customers on the Informatica platform, and we continue to support them, despite the fact that Informatica has had a competing data quality solution since its acquisition of Similarity Systems. We even maintain our integration with IBM WebSphere, despite IBM’s acquisition of Ascential, which had itself acquired data quality vendor Vality. Still, we have a significant number of users who run DataStage with Trillium Software and don’t want to switch.

    Why support all these integration points when other vendors don’t? It’s where the reality of the marketplace meets product development. Let’s face it: large companies most often don’t run a single application platform across their entire enterprise. Most have a mixture of IBM, Oracle, Siebel and many other enterprise vendors. Sometimes this makes perfect sense for the organization. The heterogeneous enterprise often arises when no single application vendor can meet all the needs of the organization. So, for example, SAP ERP may meet the needs of manufacturing, while Siebel better meets the requirements of sales and marketing.

    On the other hand, it makes sense to standardize your company’s data platform. If you can plug the same rules engine into any of these platforms, data quality becomes a simple component of corporate governance. You don’t have to hire staff to operate and maintain multiple data quality tools, and you won’t have to tune one tool to make it behave like another. It is much easier to achieve a company-wide gold customer master record with a single information quality platform like the Trillium Software System.

    Disclaimer: The opinions expressed here are my own and don't necessarily reflect the opinion of my employer. The material written here is copyright (c) 2010 by Steve Sarsfield. To request permission to reuse, please e-mail me.