Considering the canvas that is the Pacific Ocean. "How on earth," he thought, "can people die of thirst and polluted water when we have so much fresh, clean and pristine water on this goddam planet?"
The Data Leviathan, Martyn Jones
This may come as a surprise to some people who know my opinions on the subjects of data in general and Big Data in particular. Nevertheless, I believe that Data Warehousing has the flexibility, agility and rationality required to save Big Data from its almost certain fate. Not all of it of course, but at least the parts of it that are actually worth saving, or better said, are worthy of some consideration, by some people and businesses, some of the time.
Big Data, as the ultimate expression of the fractious and feverish search for nuggets of gold in zillions of buckets of trash, has seen its day. Long gone are the times when Big Data pundits could, metaphorically speaking, look at a windswept Saharan desert, suffering temperatures in the high forties and enjoying the ageless absence of water, infrastructure and commerce, and see in it a massive opportunity of unequalled proportions for building-supply merchants around the world.
Look back on it as a "Want sand? We've got sand!" and "Want data? We've got data!" sort of thing.
First, it may be helpful to look at the existential threats facing Big Data.
Many people are very foolishly and ignorantly claiming that Big Data technology (and by this they mean the Hadoop ecosphere) will see the demise of the need for Data Warehousing and the use of relational database management systems. This represents an existential threat to Big Data because it tars it with a thick layer of hubris, ignorance and arrogance. When people see Big Data hype for what it is, and there is plenty of it around, will they damn all of Big Data, or will they be selective, considered and rational? Usually we are not so selective, and Big Data could become mortally wounded, a pariah at the edge of data management and architecture.
Another threat is in a similar vein. There is a plethora of Big Data success stories that are either not success stories, are not Big Data success stories, or are simply vacuous attempts at 'bigging up' Big Data without actually providing anything in the way of facts, reason or substance. We see swathes of this detritus on forums like Forum X (real name withheld to placate the censor) every week, as scoundrels pimp Big Data as if it were some ten-cent prostitute down on her luck. It's tacky, unprofessional and demeaning, and it is yet another way to ensure that Big Data becomes suspect, marginalized and ultimately ignored.
Maybe the biggest threat to the Big Data movement comes from the most unlikely of places, at least on a superficial level: the Hadoop ecosphere itself. Hadoop was born of necessity. When Google wanted to crawl the entire World Wide Web and index it on an ongoing basis, nothing on the market really satisfied the need, so they built their own solution (the Google File System and MapReduce); Hadoop began life as the open-source reimplementation of those ideas. All fine and dandy. But DIY software for data architecture and management on this scale is a costly business, so what better than sharing the love, right? With products such as Oracle there is a massive user base that supports the cost of maintenance, corrections, enhancements and new feature releases, and those costs are spread out so that individual businesses pay for only a minuscule part of development. Not so when you have developed your own proprietary in-house solution: it is not a product, it has not been productized, and it is almost impossible to get others to chip in to pay for bug fixes, corrections, architecture refactoring and the design, development, testing and release of feature enhancements, unless you are prepared either to take the product to market or to get people to participate in the global open-source game.
To be polite, one could say that launching Hadoop in the way it was launched, and at the time it was launched, was a mistake. But could its sponsors have continued to fund Hadoop on their own, or even expanded the Hadoop ecosphere, without the involvement of others? Maybe they could have; they had deep pockets. In business terms, however, would that have been wise? In my view, the public launch of Hadoop was like launching the Oracle RDBMS in 1971, twenty years before relational databases really started to take off in a big way. And that's being polite. To be brutally honest, in historical terms Hadoop is no Oracle. The problem is simple: just how many companies need a commodity-based engine for massive search, counting and the production of very simple lists? It is a brute-strength, simple product that meets Google's requirements, but how many businesses need to do what Google (or, for that matter, Facebook, Twitter, LinkedIn or YouTube) needs to do? That's the third existential risk right there.
Hadoop has been deliberately made synonymous with Big Data. If Hadoop falls, it might take all of Big Data down with it. We can wait and see, or we can do something about it, at least to save the 'good bits'.
So what is this ‘something’ that we could do?
Regardless of the mechanics of how we go about putting it all together, there are some things we need to have clear before we really start to bring Big Data into the mainstream:
Anything of significance we do should have a corresponding significant business imperative. Moreover, when I say business imperatives I mean business-imperatives, and not expedient IT-imperatives along the lines of “we must do something, anything”.
In the context of the full range of data integration possibilities, we can conceptualize how we can enrich data on the Data Warehousing landscape by complementing, for example, abstractions of corporate data derived from operational systems and internal and external structured databases, with data that refines, improves and expands segmentation, delineation, categorization and classification.
We should strive to understand how we can model ideas for improving the time and place utility of data, including through the incorporation of the outcomes of more immediate analysis, such as that derived from Big Data processing and analytics.
We should look at the Data Warehouse ecosphere options we have for addressing the near-term (including near-real-time) decision-making capabilities offered throughout the operational landscape, including through more aligned self-service web applications, common examples being those hosted by Amazon, Adidas and Zalando. This is where the Inmon conception of Data Warehousing will really kick in, especially where we augment a typical Data Warehouse landscape with a near-real-time Operational Data Store, an Analytics Data Store and a robust, multi-faceted and fully interconnected Core Statistics platform, which would include Big Data data management, technology and analytics.
In essence, we must see that we can use Big Data, combined with Enterprise Data Warehousing, as a better means to address significant business challenges, and on that basis create a strategy, and execute a programme, that delivers those synergies and the benefits that flow from them.
I have spoken and written about how to bring Big Data into the mainstream of Enterprise Data Warehousing, and I will end this piece by reiterating some of the key points of the approach.
Since the publication of the article Aligning Big Data, which laid out a draft view of the DW 3.0 Information Supply Framework and placed Big Data within a larger framework, I have been asked on a number of occasions to go into a little more detail regarding the Analytics Data Store (ADS) component. This is an initial response to those requests.
To recap, the overall architecture consists of three major components: Data Sources, Core Data Warehousing and Core Statistics.
Data Sources – This element covers all the current sources, varieties and volumes of data available which may be used to support processes of ‘challenge identification’, ‘option definition’, decision making, including statistical analysis and scenario generation.
Core Data Warehousing – This is a suggested evolution path of the DW 2.0 model. It faithfully extends the Inmon paradigm to include not only unstructured and complex data but also the information and outcomes derived from statistical analysis performed outside the Core Data Warehousing landscape.
Core Statistics – This element covers the core body of statistical competence, especially but not only with regards to evolving data volumes, data velocity and speed, data quality and data variety.
Fig.1 – 3 components of the Information Supply Framework
This piece will focus on the Core Statistics segment and in particular the Analytics Data Store, which is specifically designed to support professional statistical analysis and at the same time to support the speculative use of data.
Fig.2 – Core Statistics – Analytics Data Store
The Analytics Data Store
Daniel Keys Moran once stated that "You can have data without information, but you cannot have information without data." We'll deal with that nonsense at another time.
The Analytics Data Store is the reference data store collection for the entire Core Statistics segment.
The following is a high-level diagram of the Analytics Data Store together with some of its major optional features:
Fig.3 – Inside the Analytics Data Store
Operating System Platform – Typically the operating system platform will be a flavor of UNIX (Linux or some other variant).
Standard UNIX distributions support parallel file-manipulation commands for mapping and reducing data in files that can, in theory, be in the order of zebibytes.
Additionally, the Hadoop Distributed File System (HDFS) can be overlaid on the UNIX platform to leverage the underlying UNIX primitives, giving it access to and control over the underlying devices, whether that device is a file, disk, cluster, node or anything else (although HDFS files cannot be manipulated using regular UNIX primitives unless something like FUSE is used).
Hadoop – Hadoop is a framework (an organised collection of code) for the distributed storage and processing of data sets on clusters of commodity computer hardware. The modules in Hadoop are designed on the assumption that hardware failures are commonplace and should be handled automatically by the software. This is not, however, unique to Hadoop, as there are UNIX distributions that also fulfil these functions, and then some. Nevertheless, the attraction of open source software running on commodity hardware cannot be dismissed lightly.
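To make the processing model concrete, here is a minimal, hedged sketch of the MapReduce pattern that Hadoop popularized, written as a pair of Hadoop-Streaming-style scripts in Python; the word-count task, file names and field layout are illustrative assumptions rather than anything prescribed by Hadoop itself.

```python
# mapper.py - a hypothetical Hadoop Streaming mapper: reads raw text from
# standard input and emits one tab-separated (word, 1) pair per line.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word.lower()}\t1")
```

```python
# reducer.py - a hypothetical Hadoop Streaming reducer: Streaming delivers the
# mapper output sorted by key, so equal words arrive on consecutive lines and
# can be summed with a simple running total.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    line = line.rstrip("\n")
    if not line:
        continue
    word, count = line.rsplit("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

Locally, the same pipeline can be dry-run as cat input.txt | python mapper.py | sort | python reducer.py, which is exactly the brute-strength simplicity referred to above.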
Relational DBMS – This is the database model that most people who know anything about databases are familiar with. RDBMS is based on the relational data model. The relational data model provides an uncomplicated view of data to all users by representing data in two-dimensional tables of rows and columns. These tables are called relational tables. A relational database is a collection of relational tables. RDBMS is the data manager for relational databases.
Relational DBMS users interact with databases using Structured Query Language (SQL), the industry-standard relational database language, typically with some vendor-specific extensions.
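As a small, hedged illustration of the row-and-column model and of SQL, here is a self-contained example using Python's built-in sqlite3 module; the table, columns and data are invented for the purpose.

```python
# A minimal relational example using Python's standard-library sqlite3 module.
# The 'customers' table, its columns and its rows are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, segment TEXT)")
conn.executemany("INSERT INTO customers (name, segment) VALUES (?, ?)",
                 [("Acme Ltd", "enterprise"), ("Bob's Bikes", "SME"), ("Carol's Cafe", "SME")])

# SQL expresses the question declaratively over two-dimensional tables.
for segment, n in conn.execute("SELECT segment, COUNT(*) FROM customers GROUP BY segment"):
    print(segment, n)
```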
Document DBMS – This is a class of database management system oriented towards the management of unstructured, semi-structured and complexly structured documents, primarily digital textual documents. Examples of what might be labeled document-oriented DBMS include Documentum EDMS and MongoDB.
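Purely as a hedged sketch of the style of interaction, here is what storing and querying schema-flexible documents might look like with the pymongo driver; it assumes a MongoDB instance running locally, and the database, collection and field names are invented.

```python
# Hypothetical document-store interaction; assumes a local MongoDB instance
# and the pymongo driver. Database, collection and field names are invented.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
reports = client["analytics"]["reports"]

# Each document carries its own structure - no fixed relational schema.
reports.insert_one({"title": "Q1 churn analysis",
                    "tags": ["churn", "retail"],
                    "sections": [{"heading": "Method", "words": 840}]})

for doc in reports.find({"tags": "churn"}):
    print(doc["title"])
```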
Graph DBMS – Also known, back in the day, as Semantic Data Model databases. According to Wikipedia, "a graph database is a database that uses graph structures for semantic queries with nodes, edges, and properties to represent and store data." One of the features of some of the early Graph DBMS (my first contact with this technology was at Unisys in the late eighties, with a product called InfoExec) was that their query languages allowed structured queries to be stated in more business-like terms.
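To make the nodes-edges-properties idea concrete without tying it to any particular product, here is a toy property graph expressed in plain Python; the entities and relationships are invented.

```python
# A toy property graph in plain Python, just to illustrate nodes, edges and
# properties; the people, companies and relationships are invented.
nodes = {
    "alice": {"label": "Person",  "age": 34},
    "acme":  {"label": "Company", "sector": "retail"},
}
edges = [
    ("alice", "WORKS_FOR", "acme", {"since": 2019}),
]

# The "business-like" question: who works for acme, and since when?
for src, rel, dst, props in edges:
    if rel == "WORKS_FOR" and dst == "acme":
        print(f"{src} works for {dst} since {props['since']}")
```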
Key-Value DBMS – One can either view this type of database as an innovative reuse of the design of simple programmatic 'collections' (trust Microsoft to be the only ones to name a simple thing with a simple name) applied to the realm of database management, or as a mental aberration invented by bodgers and hackers. At the end of the day, a Key-Value DBMS simply provides a means of persisting in-memory 'associative arrays' to disk. If there is more to it than that, please let me know.
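Python's standard-library shelve module makes the point rather neatly: an associative array persisted to disk and looked up by key, nothing more. The key and value below are invented for illustration.

```python
# An associative array persisted to disk - the essence of a key-value store -
# using Python's standard-library shelve module; key and value are invented.
import shelve

with shelve.open("kv_demo") as store:
    store["session:42"] = {"user": "alice", "cart": ["sku-1", "sku-7"]}

with shelve.open("kv_demo") as store:
    print(store["session:42"])  # values are retrieved by key, nothing more
```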
Object DBMS – An object-oriented database management system stores information in the form of objects, as used in object-oriented programming.
Object-relational databases are a hybrid of the object-oriented and relational approaches. I have found use for object-relational databases in operational applications, but never in MIS reporting, OLAP, Data Warehousing, Business Intelligence or Statistics. Does anyone have an alternative perspective?
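For what it is worth, here is a hand-rolled, hedged sketch of the object-relational idea: a plain Python object persisted as a relational row and rehydrated again. The class, table and data are invented, and a real object-relational DBMS or ORM would of course do far more.

```python
# Hand-rolled sketch of the object-relational idea: an object is stored as a
# row and read back as an object. Class, table and data are invented.
import sqlite3
from dataclasses import dataclass

@dataclass
class Sensor:
    id: int
    location: str

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sensor (id INTEGER PRIMARY KEY, location TEXT)")
conn.execute("INSERT INTO sensor VALUES (?, ?)", (1, "warehouse-3"))

row = conn.execute("SELECT id, location FROM sensor WHERE id = 1").fetchone()
print(Sensor(*row))  # Sensor(id=1, location='warehouse-3')
```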
Column-Oriented DBMS – This refers to how data is stored. Typically we view data as being stored in rows (records), but that is not the only way of storing data.
Column-oriented DBMS store data column by column, value by value, hence the name.
Examples of this type of database implementation range from Apache HBase, a distributed NoSQL column-oriented store built on top of HDFS, to EXASOL, a high-performance in-memory analytic database management system.
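A quick, hedged way to visualize the difference is to lay the same invented table out both ways in plain Python: an analytic aggregate such as a column sum touches one contiguous column rather than every row.

```python
# Row-oriented versus column-oriented layout of the same (invented) table.
rows = [
    {"id": 1, "product": "bike",   "units": 3},
    {"id": 2, "product": "helmet", "units": 7},
]

columns = {                        # same data, stored column by column
    "id":      [1, 2],
    "product": ["bike", "helmet"],
    "units":   [3, 7],
}

# An analytic query like SUM(units) only needs one column in the columnar layout.
print(sum(columns["units"]))          # 10
print(sum(r["units"] for r in rows))  # 10, but touches every whole row
```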
As you can see, the Analytics Data Store is fast becoming a super-fantastic mix of artefacts, gadgets and toys that should satisfy everyone: from the most experienced and knowledgeable statisticians, through the data 'creatives', the data scientists and the data users, to the most game-oriented of data plumbers and punters.
The ADS is above all about quality over quantity, the now over the mañana, and the 'just do it' over the 'can we?'.
But, also remember these words from Colin Powell: “Experts often possess more data than judgement.” So, be forewarned and forearmed.
Using the Analytics Data Store
What are the applications that the Analytics Data Store might be used to support?
Here is a non-exhaustive list (first described in the mid eighties) of the potential applications:
Interpretation – Inferring situation descriptions from the analysis of a variety of data.
Prediction – Inferring likely consequences based on situational data.
Diagnosis – Inferring deviations and malfunctions from observables – from data.
Design – Analysing data and configuring objects under constraints.
Planning – Designing actions based on data feedback and analysis.
Monitoring – Comparing observations to known plan vulnerabilities.
Debugging – Prescribing remedies for malfunctions based on the analysis of data.
Repair – Devising and executing a plan to administer a prescribed remedy.
Instruction – Diagnosing, debugging and repairing behavioural patterns captured in data.
Control – Interpreting, predicting, repairing and monitoring systems behaviour.
Given the availability and quality of data to support the activities listed above, the Analytics Data Store can provide a sound source of data for a wide range of statistical analysis, forensic and speculative activities.
The Analytics Data Store is developed iteratively to support the data needs of a range of activities, from mainstream statistical analysis and formal data mining to creative and eclectic exercises in speculative analytics and non-traditional data correlation. This will ensure that business value can be assessed sooner rather than later.
The Analytics Data Store is essentially technology- and implementation-agnostic, and it has a clear mission and business objectives within an overall Information Supply Framework.
The choice of technology products is based on best-fit criteria, so the use of technology should be driven not by the old commercial 'solutions in search of problems' approach, which failed so miserably time and time again, but by the question 'what are the most appropriate artefacts, resources and technologies to use in approaching this problem or testing this hypothesis?'
Many thanks for reading.
If you enjoy this piece or find it useful then please consider joining The Big Data Contrarians: https://www.linkedin.com/grp/home?gid=8338976
Many thanks, Martyn.