Hold this thought: To paraphrase the great Bob Hoffman, just when you think that if the Big Data babblers were to generate one more ounce of bull**** the entire f****** solar system would explode, what do they do? Exceed expectations.
I am a mild-mannered person, but if there is one thing that irks me, it is when I hear variations on the theme of “Data Warehousing is Big Data”, “Big Data is in many ways an evolution of Data Warehousing” and “with Big Data you no longer need a Data Warehouse”.
Big Data is not Data Warehousing, it is not the evolution of Data Warehousing and it is not a sensible and coherent alternative to Data Warehousing. No matter what certain vendors will put in their marketing brochures or stick up their noses.
In spite of all of the high-visibility screw-ups that have carried the name of Data Warehousing, even when they were not Data Warehouse projects at all, the definition, strategy, benefits and success stories of Data Warehousing are known; they are in the public domain, and they are tangible.
Data Warehousing is a practical, rational and coherent way of providing information needed for strategic and tactical option-formulation and decision-making.
Data Warehousing is a strategy-driven, business-oriented and technology-based business process.
We stock Data Warehouses with data that, in one way or another, comes from internal and, optionally, external sources, and from structured and, optionally, unstructured data. The process of getting data from a data source to the target Data Warehouse involves extraction, scrubbing, transformation and loading, ETL for short.
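To make the ETL idea concrete, here is a minimal sketch in Python. The record layout, field names and scrubbing rules are purely illustrative assumptions, not taken from any particular product or methodology:

```python
# Minimal ETL sketch: extract raw records from an operational source,
# scrub out records that fail quality checks, transform the survivors
# into a consistent format, and hand them to the warehouse load step.
# All field names (order_id, amount, country) are hypothetical.

def extract(source_rows):
    """Pull raw records from an operational source system."""
    return list(source_rows)

def scrub(row):
    """Reject records that fail basic data-quality checks."""
    try:
        return row["order_id"] is not None and float(row["amount"]) >= 0
    except (KeyError, TypeError, ValueError):
        return False

def transform(row):
    """Standardise types and formats before loading."""
    return {
        "order_id": int(row["order_id"]),
        "amount": round(float(row["amount"]), 2),
        "country": row.get("country", "").strip().upper(),
    }

def etl(source_rows):
    """Run the full extract-scrub-transform pipeline."""
    return [transform(r) for r in extract(source_rows) if scrub(r)]

raw = [
    {"order_id": "1", "amount": "20.00", "country": " uk "},
    {"order_id": None, "amount": "5.00", "country": "es"},  # scrubbed out
]
warehouse_rows = etl(raw)
```

In practice each stage would be far richer, of course, but the shape of the process — source data in, cleaned and standardised data out — is the essence of ETL.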
Data Warehousing’s defining characteristics are:
Subject Oriented: Operational databases, such as order processing and payroll databases and ERP databases, are organized around business processes or functional areas. These databases grew out of the applications they served. Thus, the data was relative to the order processing application or the payroll application. Data on a particular subject, such as products or employees, was maintained separately (and usually inconsistently) in a number of different databases. In contrast, a data warehouse is organized around subjects. This subject orientation presents the data in a much easier-to-understand format for end users and non-IT business analysts.
Integrated: Integration of data within a warehouse is accomplished by making the data consistent in format, naming and other aspects. Operational databases, for historic reasons, often have major inconsistencies in data representation. For example, a set of operational databases may represent “male” and “female” by using codes such as “m” and “f”, by “1” and “2”, or by “b” and “g”. Often, the inconsistencies are more complex and subtle. In a Data Warehouse, on the other hand, data is always maintained in a consistent fashion.
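The inconsistent coding problem described above is typically reconciled with something as simple as a mapping table. The codes (“m”/“f”, “1”/“2”, “b”/“g”) come straight from the example in the text; the function itself is a hypothetical illustration of such an integration step:

```python
# Integration sketch: map each source system's private gender codes
# onto one consistent warehouse standard. The mapping reflects the
# article's example codes; everything else is illustrative.

GENDER_CODES = {
    "m": "male", "1": "male", "b": "male",
    "f": "female", "2": "female", "g": "female",
}

def integrate_gender(raw_code):
    """Translate any source system's code into the warehouse standard."""
    code = str(raw_code).strip().lower()
    return GENDER_CODES.get(code, "unknown")
```

Real-world inconsistencies are, as noted, often far more complex and subtle than this, but the principle is the same: one agreed representation in the warehouse, however many representations exist in the sources.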
Time Variant: Data warehouses are time variant in the sense that they maintain both historical and (nearly) current data. Operational databases, in contrast, contain only the most current, up-to-date data values. Furthermore, they generally maintain this information for no more than a year (and often much less). In contrast, data warehouses contain data that is generally loaded from the operational databases daily, weekly, or monthly, which is then typically maintained for a period of 3 to 10 years. This is a major difference between the two types of environments.
Historical information is of high importance to decision makers, who often want to understand trends and relationships between data. For example, the product manager for a Liquefied Natural Gas soda drink may want to see the relationship between coupon promotions and sales. This is information that is almost impossible – and certainly in most cases not cost effective – to determine with an operational database.
Non-Volatile: Non-volatility means that after the data warehouse is loaded there are no changes, inserts, or deletes performed against the informational database. The Data Warehouse is, of course, first loaded with cleaned, integrated and transformed data that originated in the operational databases.
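Taken together, the time-variant and non-volatile characteristics amount to an append-only loading discipline: each periodic load is stamped with its snapshot date, and prior loads are never updated or deleted. The sketch below illustrates the idea; the table and column names are hypothetical:

```python
# Append-only load sketch: each batch of cleaned, transformed data is
# appended with its load date. Existing rows are never changed or
# removed, so history accumulates and remains queryable.

import datetime

warehouse = []  # stands in for a warehouse fact table

def load_snapshot(clean_rows, load_date):
    """Append a cleaned, transformed batch; never mutate prior loads."""
    for row in clean_rows:
        warehouse.append({**row, "load_date": load_date})

# Two monthly loads: both snapshots are preserved side by side.
load_snapshot([{"product": "soda", "units": 120}],
              datetime.date(2015, 1, 31))
load_snapshot([{"product": "soda", "units": 95}],
              datetime.date(2015, 2, 28))
```

This is exactly what makes the trend analysis described above possible: the product manager can compare January with February because neither month's data was overwritten.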
We build Data Warehouses iteratively, a piece or two at a time, and each iteration is primarily a result of business requirements, and not technological considerations.
Each iteration of a Data Warehouse is well bound and understood – small enough to be deliverable in a short iteration, and large enough to be significant.
Conversely, Big Data is characterised as being about:
Massive volumes: so great are they that mainstream relational products and technologies such as Oracle, DB2 and Teradata just can’t hack it, and
High variety: not only structured data, but also the whole range of digital data, and
High velocity: the speed at which data is generated, transmitted and received.
These are known as the three Vs of Big Data, and they are subject to significant and debilitating contradictions, even amongst the gurus of Big Data (as I have commented elsewhere: Contradictions of Big Data).
From time to time, Big Data pundits slam Data Warehousing for not being able to cope with the Big Data type hacking that they are apparently used to carrying out, but this is a mistake of those who fail to recognise a false Data Warehouse when they see one.
So let’s call these false flag Data Warehouse projects something else, such as Data Doghouses.
“Data Doghouse, meet Pig Data.”
Failed or failing Data Doghouses fail for the same reasons that Big Data projects will frequently fail. Both will almost invariably fail to deliver artefacts on time and to expectations; there will be failures to deliver value, or even simply to break even on costs versus benefits; and, of course, there will be failures to deliver any recognisable insight.
Failure happens in Data Doghousing (and quite possibly in Big Data as well) because there is a lack of coherent and cohesive arguments for embarking on such endeavours in the first place; a lack of real business drivers; and, a lack of sense and sensibility.
There is also a willing tendency to ignore the advice of people who warn against joining in the Big Data hubris. Why do so many ignore the ulterior motives of interested parties who are solely engaged in riding the faddish Big Data bandwagon to maximise the revenue they can milk off punters? Why do we entertain pundits and charlatans who ‘big up’ Big Data whilst simultaneously cultivating an ignorance of data architecture, data management and business realities?
Some people say that the main difference between Big Data and Data Warehousing is that Big Data is technology, and Data Warehousing is architecture.
Now, whilst I totally respect the views of Bill Inmon, the father of Data Warehousing himself, I also think that he was being far too kind to the Big Data technology camp. That, of course, is Bill’s choice.
Let me put it this way: if Oracle gave me the code for Oracle 3, I could add 256-bit support and parallel processing, give it an interface makeover, and it would be 1000 times better than any Big Data technology currently on the market (and that version of Oracle is from about 1983).
Therefore, Data Warehousing has no serious competing paragon. Data Warehousing is a real architecture; it has real process methodologies; it is tried and proven; it has success stories that are no secret, stories that include details of the data, the applications and the names of the companies and people involved; and we can point at tangible benefits realised. It’s clear, it’s simple and it’s transparent.
Just like Big Data, right?
See what I mean?
Therefore, the next time someone says to you that Big Data will replace Data Warehousing or that Data Warehousing is Big Data, or any variations on that sort of ‘stupidity’ theme, you can now tell them to take a hike, in the confidence that you are on the side of reason.
Many thanks for reading.
More perspectives on Big Data
Aligning Big Data: http://www.linkedin.com/pulse/aligning-big-data-martyn-jones
Big Data and the Analytics Data Store: http://www.linkedin.com/pulse/big-data-analytics-store-martyn-jones
A Modern Manager’s Guide to Big Data: http://www.linkedin.com/pulse/managers-guide-big-data-context-martyn-jones
Core Statistics coexisting with Data Warehousing
Accommodating Big Data
And a big thank you to Bill Inmon (the father of Data Warehousing and of DW 2.0)
I remember similar foot stamping occurring from mainframe vendors 20 years ago.
The “purity” of the DW model won’t protect it if Big Data applications like Hadoop become easier to learn and deploy.
Data Warehousing was an excellent solution to a problem we’ve since found new ways of solving with better, more robust, more flexible and, more importantly, CHEAPER technologies. At the end of the day cost is king, and getting macro-scale analytics into more people’s hands is only ever a good thing.
Martyn Jones said:
What is this puerile and derogative “purity” of Data Warehousing to which you allude?
If you think that Data Warehousing is about technology, and that Hadoop or some such old and clunky technology is the Data Warehousing replacement then please continue to preach this line to the four winds.
Those who revel in Schadenfreude will thank you for your efforts and more importantly the (lack of positive) results.
Home Despot said:
Ah yes, here it is. The ultimate ground-breaking argument against naysayers. “You’re just like those mainframe guys”. If you don’t drink this Kool-Aid, it’s because you’re “old”, you’re a dinosaur.
The implication of course being that this technology is somehow worthy of being at the same table as the Personal Computer.
Home Despot said:
I enjoyed this article immensely. I’m a long-time data warehouse implementer, but in the early days of Big Data, I was excited to jump on something new. I tried to like Hadoop and Spark and NoSQL. I really did. I thought for sure this would be a new cool thing to learn and get good at. And learn it I did. I worked with it for nearly a year before the disappointment was too much. It reminded me of the time when object databases were going to replace relational. I jumped on that back then as well
None of these technologies could perform as well as a relational database on normal-sized data sets. By normal I mean under 100 Terabytes. Most of these technologies looked like they came from some hacker’s project for his masters degree. The technology is laughably rudimentary and hard to use. The performance was horrific, the querying was anything but flexible, and the reliability was simply not there.
But I see how this technology came about. Having spoken to many who embrace this technology, I see certain common characteristics in the user base:
1) They have little, if any, understanding of relational databases or what they do. Instead, they come from a pure coding background such as Java, C++, Python, etc.
In very rare cases, this technology is useful for extremely large datasets. CERN, I believe, generates over 90 Petabytes of data in one run. I saw a demo from a big Telco/cable company that was analysing 600 Terabytes of data to answer a few very specific, pre-defined questions.
But for your typical 100-500 GB to 5 TB data warehouse, you’ll enjoy the relational engine much more.