Why I called bullshit on the data lakehouse nonsense
I wrote an article that took to task the purveyors of the data lakehouse nonsense. It got plenty of positive reactions and quite a few negative ones. But what surprised me the most was the number of people who had already swallowed the pitch. Hook, line and sinker.
So why did I bother to express my opinions on the subject?
It all started when I first heard the term Data Lakehouse. After five minutes of intense research (aka googling), I was pointed in the direction of a blog article posted on the Databricks website, titled "What is a Lakehouse?" It was written by Ben Lorica, Michael Armbrust, Ali Ghodsi, Reynold Xin and Matei Zaharia. Interesting, I thought. So I read it.
Aside from some vague and high-level claims, what caught my attention was this diagram (Source: Databricks):
So, what does this diagram show? It shows a transition from 1980s data warehousing, through the data lake period, to the present-day data lakehouse. Which would be fine if it were true. To put it politely, the relationship between this diagram and reality is tenuous.
RED – In the text, the guys talk about data warehousing as if it were merely about technology. It's not. It's about many things, including business process, data architecture and management, and strategic initiatives. The part of the diagram highlighted by the red border (which I added) doesn't represent reality. What it shows is Ralph Kimball's Data Warehouseless approach to building siloed data marts, which was prominent in the 1990s and had nothing much to do with Data Warehousing.
ORANGE – Again, the data warehouse is conspicuous by its absence. A couple of data marts are shown. And that’s it. This is not data warehousing.
GREEN – Therefore, it comes as no surprise that the Lakehouse addresses the requirements of a 1990s alternative to data warehousing. This is, to all intents and purposes, a black box, full of features that can't all be pointed at, and with a total absence of tangible benefits.
You see, I’m a great believer in the power of doing data warehousing the right way and for all the right reasons.
What I think is needed to handle the new data that cannot be processed using a data warehouse is an analytics data store, one that can be used for data science activities on unstructured or highly structured data and information.
However, I see the lakehouse flimflam merchants as akin to kamikaze fools driving down the freeway at 100 MPH towards a cliff, on the wrong side of the road, hoping to slow their inevitable demise by deliberately running into oncoming traffic (i.e. the data warehouses).
It’s like the lakehouse folk are purposely elevating data lakes to the next level of gob-smacking ludicrousness.
That’s not a good look, and it’s just not on.
Many thanks for reading.