To begin at the beginning
I am quite a fan of many aspects that sit under the Data Mesh umbrella. However, when it comes to a proper fact-based understanding and analysis of the history, place and architecture (business, data and technical) of Data Warehousing, the leading exponents of data mesh have it woefully wrong.
Therefore, the purpose of this blog article is to set the record straight.
The data warehouse as a place to copy OLTP exhaust data to?
Many of the data mesh, data lake, and data fabric talking-heads view modern data warehousing as just a place to dump and report on data. That view is quite problematic in its naivety, lack of rigor, and frivolity.
To be brief, this assertion about data as exhaust is a misreading of the past and demonstrates an ignorance of data, its management and its architecture.
So, to be clear, data didn’t get invented around the time of the birth of Windows 8, Hadoop, TikTok or the iPad. Data has been around for a very long time, even before computer systems were but a twinkle in the eyes of the pioneering mums and dads, but admittedly well after the dinosaurs went AWOL.
What some folk need to clue into is the fact that data warehousing has understandably borrowed from many areas of data management (digital and non-digital), including:
- The subject orientation of data
- Distributed data processing
- Time slicing, time-variance and time series as well as time-invariant data
- Iterative development and delivery
- Information Centre architectures
- Database analysis and design
- Data migration tools and techniques
- Function decomposition and business data domains
- Joint application development / rapid application development
- Reusable designs
- Timebox methodologies
- Decision Support Systems / Executive Information Systems
- End User Computing
- Entity relationship modelling / dimensional modelling
- Relational database management systems
- MPP, SMP and hybrid SMP platforms
In addition, there is a longer list of notable contributors.
Hello, ma and pa!
Whilst Data Warehousing has borrowed from things everywhere, Bill Inmon, the “father of Data Warehousing”, defines it as being “a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision making process.”
Subject Oriented: The data in the Data Warehouse is organised conceptually (the big canvas), logically (detailing the big picture) and physically (detailing how it is implemented) by subjects / data-domains of interest to the business, such as customer, product and sales.
The thing to remember about subject-areas / data-domains is that they are not created ad-hoc by IT according to the sentiments of the time, e.g. during requirements gathering, but through a deeper understanding of the business, its processes, and its pertinent business subject areas.
Integrated: All data entering the data warehouse is subject to normalisation and integration rules and constraints to ensure that the data stored is consistently and contextually unambiguous.
Time Variant: Time variance gives us the ability to view and contrast data from multiple viewpoints over time. It is an essential element in the organisation of data within the data warehouse and dependent data marts.
Non-Volatile: The data warehouse represents structured and consistent snapshots of business data over time. Once a data snapshot is established, it is rarely if ever modified.
Management Decision Making: This is the principal focus of Data Warehousing, although Data Warehouses have secondary uses, such as complementing operational reporting and analysis.
Demand driven: adding data to the data warehouse is based on business demand for that data and NOTHING else. Preemptive loading of data, just in case, should be avoided like the plague.
So Data Warehousing is about far more than dumping operational data elsewhere and letting people stick reporting and BI tools on top of it.
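To make the time-variant and non-volatile properties above a little more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how any particular warehouse product works: the names `CustomerSnapshot`, `SnapshotStore`, `load` and `as_of` are hypothetical, and a real data warehouse would do this inside a DBMS rather than in memory.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)  # frozen: a snapshot cannot be modified once created
class CustomerSnapshot:
    """One snapshot of the (hypothetical) customer subject area."""
    customer_id: str
    segment: str
    effective_from: date


class SnapshotStore:
    """Append-only, in-memory stand-in for one subject area of a warehouse."""

    def __init__(self) -> None:
        self._rows: list[CustomerSnapshot] = []

    def load(self, row: CustomerSnapshot) -> None:
        # Non-volatile: new snapshots are appended, existing ones are never updated.
        self._rows.append(row)

    def as_of(self, customer_id: str, when: date) -> Optional[CustomerSnapshot]:
        # Time-variant: return the latest snapshot effective on or before `when`.
        candidates = [
            r for r in self._rows
            if r.customer_id == customer_id and r.effective_from <= when
        ]
        return max(candidates, key=lambda r: r.effective_from, default=None)


store = SnapshotStore()
store.load(CustomerSnapshot("C001", "retail", date(2020, 1, 1)))
store.load(CustomerSnapshot("C001", "corporate", date(2022, 6, 1)))

print(store.as_of("C001", date(2021, 3, 15)).segment)  # retail
print(store.as_of("C001", date(2023, 1, 1)).segment)   # corporate
```

The point of the sketch is that the earlier "retail" snapshot is still there after the customer moves to "corporate", so the business can contrast the two points in time.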
Relational database management systems were first used for OLTP?
If anyone claims that relational database systems were first used with operational applications, treat everything they say with extreme caution.
The first RDBMS products got used for reporting needs, such as reporting on data in databases designed using dimensional modelling. There was a good reason for this as none of the implementations even came with a usable audit trail facility. So, not at all OLTP friendly at the beginning.
Data Warehousing necessarily means monolithic databases?
Another false meme doing the rounds is that Data Warehousing necessarily means monolithic databases. This is not in fact what data warehouses have been for many businesses.
Part of the problem with this claim is that even the term monolithic is being corrupted to mean “any architecture or technology that I don’t like”.
Having worked with massive clusters of computing power, with tons of nodes and disks and memory and ultra-fast mesh communications backplanes, I instinctively know this monolithic labeling is dubious. In addition, we have had the ability to isolate data at various levels of abstraction. Indeed, we can organise singular and interdependent subject/data domain areas using database technology that has been available for decades. So distributed data storage and compute is nothing new and certainly nothing new to data warehousing.
So, the notion that “data mesh is an emerging architecture that establishes an alternative, de-centralized pattern to the data warehouse,” is basically bullshit born of ignorance and immaturity.
Data Warehousing necessarily means monolithic and siloed teams?
This isn’t a problem with data warehousing. It is a problem of how IT companies and their customers reframed the idea of data warehousing development and infrastructure.
Before IT got its grubby mitts on data warehousing the initiatives in this solution space were highly dynamic and focused, and consisted of small multi-disciplinary teams working in close conjunction with the business. The data warehouses would be populated and the data marts would be built, on-demand, in chunks that were small enough to be doable in an iteration and large enough to have business significance.
Don’t lay the blame for the demise of this approach on data warehousing. Blame it on the vendors and the big “systems integration” providers focusing on revenue funnels.
The history of data in IT is about monolithic data
Where to begin with this? Back in the day and outside of the biggest IT shops, every major application ran on its own proprietary platform with its very own siloed databases. Some IT shops I used to visit would have a plethora of platforms from various vendors running mainly in-house developed applications and programs.
So, pulling together data from a disparate and heterogeneous IT landscape, and then properly integrating it in order to drive strategic and tactical reporting, was a nightmare. When open systems and Unix came on the scene, the costs of certain compute platforms tumbled as their power, throughput and storage capacity increased exponentially, especially with regard to MPP and SMP platforms. For many organisations, this enabled the solution to these distributed application platform and siloed database woes: the Data Warehouse.
Are the data mesh folk preaching a return to those pre-Information-Centre and pre-Data-Warehouse times?
Data Warehousing technology is just about databases?
Another rather silly assertion is that data warehousing technology is just about databases. Yes, an industry expert on data lakes actually said this in a podcast on Software Engineering Radio. That ignores the fact that data warehousing technology includes:
- Communications and connection technology
- Security technology
- Extract, Load and Transform / Extract, Transform and Load
- Data cleansing / data quality tools
- Meta-data management and cataloguing (data governance)
- Scheduling and keep-alive technology
- MPP, SMP and Hybrid SMP
- Database management systems
- Distributed file systems
- Business intelligence, advanced analytics tools, visualization tools, dashboard builders, and so on.
- Etc. Etc. Etc.
Data warehouse databases must be fully normalized?
This is yet another problematic meme that the data mesh folk are spreading. Data warehouse data models were typically modeled using a flexible class of third-normal form modeling. Data marts, on the other hand, were (and still are) modeled as dimensional schemas. There was never a movement to have everything in the DW model in 4th, 5th, or 6th normal form. This is simply untrue and not even true for Data Vault 2.0 models. As for the mention of DKNF, clearly, people are just making lazy stuff up about data warehousing rather than doing a bit of research.
Central ownership of data?
Yet another ill-informed claim from the data mesh folk. They really need to brush up on their history, or correct their total lack of a historical perspective.
For one, there was never a generalized centralized ownership of data even if there was central custodianship of data.
Data warehouses get queried directly?
Data mesh folk believe that data warehouses get queried directly.
What in the name of Sam Hill do they think Ralph Kimball and others were doing with all of those dimensional models and data marts? Trolling Bill Inmon? Digging for nuggets of gold? Making the partridges dizzy?
Data Warehousing means having thousands of ETL jobs
Another absurdly tetric myth being pushed by a cross-section of the expansive, permissive and obtuse talking-head data-mesh-massive is that data warehousing necessarily means that enterprises have to have thousands of complex, expensive and unmanageable ETL jobs in order to build, maintain and expand their data warehousing reach.
For a small segment of “data warehousing” cases I am sure that this might be the case. Thinking of FBI, CIA, NSA and Google… maybe so. But, come on, folks, your needs and those of your business are probably not along those lines, not even remotely.
I have a different view. In my experience, small and medium enterprises using data warehousing will typically have between twenty and ninety-odd ETL jobs, max. Large enterprises could well have hundreds of ETL jobs, but thousands of ETL jobs indicate either an extraordinary and somewhat unique organisation or an amazingly bad ETL pipeline architecture, and maybe even truly terrible “data warehouse” data-architectures and data-models.
You don’t need a data warehouse to do data warehousing
Here comes another one. You don’t need a data warehouse to do data warehousing.
Let me repeat that.
You don’t need a data warehouse to do data warehousing.
What in the name of all things sacred and profane contributed to producing this abjectly irritating anti-monolithic pattern baloney?
If you are doing data warehousing then you’ve got a data warehouse.
If you are doing data and analytics without a data warehouse you aren’t doing data warehousing.
So, here’s a question for the data pop-pickers out there: is the humungous obtuseness and frivolity of the data-warehouse bashers an accidental or deliberate thing?
Data and structures in the data warehouse are organised based on use cases
Another dopey idea doing the rounds is that data and structures in the data warehouse are directly influenced by use cases.
Unless you are doing it all wrong this should not be the case.
Data in the data warehouse should be structured generically and by subject areas so as to be universally applicable across the enterprise whilst being able to support the downstream data requirements of specific use cases.
Also, remember that there should be no ambiguity in the data warehouse.
The business users query the data warehouse
Unless you have a pretty exceptional use case, maybe a special strategic and urgent requirement, the business users will not directly query the data warehouse.
Instead, business users will use data stored in data marts built as generically as possible whilst delivering on specific business use cases for data and analytics. Done right, data marts are hugely business user and BI/visualisation tool friendly.
Just say “no” to directly querying the data warehouse!
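The split between a generic warehouse and a use-case-facing data mart can be sketched in a few lines of Python. This is only an illustration: the `warehouse_sales` rows, the column names and the `build_monthly_sales_mart` function are all hypothetical, and in practice the mart would be a dimensional schema in a database rather than a dictionary.

```python
from collections import defaultdict

# Hypothetical warehouse rows: generic, subject-oriented sales data,
# not shaped for any one report.
warehouse_sales = [
    {"sale_date": "2023-01-05", "product": "widget", "region": "north", "amount": 120.0},
    {"sale_date": "2023-01-19", "product": "gadget", "region": "north", "amount": 80.0},
    {"sale_date": "2023-01-22", "product": "widget", "region": "south", "amount": 50.0},
    {"sale_date": "2023-02-03", "product": "widget", "region": "north", "amount": 30.0},
]


def build_monthly_sales_mart(rows):
    """Aggregate warehouse rows into a (month, region) -> total sales mart."""
    mart = defaultdict(float)
    for row in rows:
        month = row["sale_date"][:7]  # "YYYY-MM" prefix of the ISO date
        mart[(month, row["region"])] += row["amount"]
    return dict(mart)


# Business users and BI tools would read the mart, not the warehouse rows.
monthly_sales_mart = build_monthly_sales_mart(warehouse_sales)
print(monthly_sales_mart[("2023-01", "north")])  # 200.0
```

The same warehouse rows could feed a second mart, say by product, without being restructured, which is the sense in which the warehouse stays generic while the marts serve specific use cases.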
The logical and economic contradictions of data and analytics with respect to data mesh
I am just leaving this here as a placeholder. I will be getting back to it in another blog post that will discuss the contradictions of data mesh when it comes to data warehousing.
Data warehousing is out of date?
According to some data-mesh folk, data warehousing is “a data management construct that dates back to the 1980s”, as if that were somehow a bad thing. For me, that’s a specious argument that actually treats people like idiots.
It’s like someone saying to Isaac Newton, “so, Sir Isaac, you don’t still believe in that old gravity nonsense do you?”
What they also don’t realise is that their beloved distributed computing and data paradigm-shifter is older than the oldest data warehouse.
That’s it folks!
To wrap up: there are many appealing aspects of data-mesh, probably because I have come across a lot of it before. In addition, its proponents are in the main reasonable, considerate and informed people, unlike some of the wide-boys of big data. However, what rankles are the gratuitous pot-shots being taken against data warehousing by people who are basing their assumptions and declarations on vague, inaccurate and weak evidence. We saw it with Big Data and Hadoop, then data lakes and lakehouses/outhouses, and now with data mesh.
It is like the fake-news of data: irritating, time-wasting and unnecessary. I do however hope that the data mesh folk rectify their view of data warehousing.
Many thanks for reading.