Martyn Richard Jones
Gif sur Yvette, 27th December 2017
To begin at the beginning
First, a thought.
Just because it’s there, it doesn’t mean it has to be done. Just because you can, it doesn’t mean that you should. Just because you believe you are right, it doesn’t mean that you are not wrong.
Read on, if you are new to Data Lakes.
Read on, if Big Data seems like the magic bullet to beat all magic bullets.
Read on, if you are inexplicably drawn towards the data-hype Kool-Aid mantra that anything can be anything.
Now, don’t get me wrong. It’s not really my place to discourage you from exploring further, but, if you are sensible, mature and wise then you really don’t need to ingest any more… of this.
Okay, so you’re still up for it. Well done!
Where to begin?
What to say?
Where to go?
To begin at the beginning seems a very good place to start.
<<Cue, dramatic entrance…>>
Bring me your empathy-free, your bored huddled-masses and your incurably curious, and we’ll continue. Hold onto your hats, this is going to be a bumpy ride, Snowflakes.
It’s that winter festive season again! Pre-Christmas, “AS-IS” Christmas, post-Christmas, pre-New Year, “TO-BE” New Year and “BAU” post-New Year. Phew! What a long litany of festivities we have at this time of year! Innit…
I’m just surprised there isn’t a Christmas DEV, TEST, UAT and PROD – with no budget for any in-depth testing whatsoever. As is traditional down my way.
So, under cover of Christmas and New Year, this is the ideal moment to get tanked up on ‘merry and fall-yea-down’ Grinch-tide laughing-juice and whoopee-doo flim-flam sauce and to prepare something really stupid. Like hiding Santa Corp’s Three Kings’ delivery list in Hadoop or putting a massive fake reindeer-log under the Yule tree, or spiking the kids’ Christmas punch with ‘Class A’ substances. Just the ticket! You’ll soon get the hang of this. Ho! Ho! Ho!
So, from the safety of liberal old Europe, I can liberally and safely predict that the anti-social and petty nit-picking aerosol faction, so prevalent in end-user IT communities – you must know them – will end up hating this piece of Christmas cheer.
Well, let’s be happy. It’s intentional. Shoo ’em!
Driving with your feet and choosing a Hadoop tech stack because it’s really cheap and it’s as ‘mature’ as ‘relational’
Let’s start easy and work up to the big one.
Between driving a car with your feet and using Hadoop as a surrogate relational database engine, the smartest thing must surely be to put your toes to the wheel. And seriously, that is very far from Smartville. In fact it’s so mainstream dumb it puts Hadoop-dumb way off the Richter scale of normal mainstream dumbness.
What do I mean?
Relationally speaking, the Hadoop takes on relational are as sophisticated, performant and reliable as the flaky relational database technology of the late 1970s, when relational engines were mere proofs of concept.
Yes, the hardware is much, much cheaper, and much, much faster, and can handle massive throughput, workloads and volumes, but heck, the Hadoop ‘relational’ tools can’t even get their basic functionality sorted, like coherent and resilient data typing or fail-soft or fail-over features. And I’m being kind. You lost data because you misspelled an object name? Please…
Cost-wise, Hadoop ‘relational’ looks cheaper. Even in theory it’s not as cheap as PostgreSQL, but it’s cheaper than a fully loaded Oracle stack – maybe. In practice, Hadoop ‘relational’ is really not cheap at all. Sure, the software is open source, but then you have to pay to have it supported, and flaky software needs a lot of support. Then you have to find people who can actually use it, and those people come at a premium as well. Add to that the extended analysis, design, development, test and implementation stages, and the higher probability of redesign and rework, and you are really paying a premium for going the ‘cheap’ route. So, you may as well spend the money on Exadata, which at least comes with the guarantee of maturity, reliability, scalability and performance.
Simply stated, compared to real relational, Hadoop ‘relational’ sucks.
Based on cost/benefit advantages, Hadoop ’relational’ sucks big-time.
In short, Hadoop ‘relational’ is an oxymoron, a bit like military music or Greggs the Patisserie, but more so.
Picking up molten lead wearing woollen mittens or using Hadoop as a surrogate for business requirements
Junior engineers think they know all of the business requirements without even asking, simply by second-guessing.
Senior engineers know that this is BS.
What do I mean?
Business users have a need for data.
This is all the data that the business user should want.
Let’s stick it in one great-big sandbox.
There they can play with the data any way they like.
They can even stick a colourfully dumb interface on the front of it, call it reporting and analytics, and declare Mission Accomplished.
Unfortunately, in the real world, things are not that simple.
It might work… Unless of course, the access to the IT ‘maintained’ sandbox is by invitation only.
It might work… Unless, of course, this scrapyard of unreliable and disjointed data is so difficult to navigate and manipulate that it renders the service worse than useless, i.e. a business liability.
Voilà! The unintended consequence of IT-driven stupidity and faith-based data architecture and management.
Emulating Icarus and trying to do Data Warehousing with Hadoop
“Oh Icarus, fly not so near the sun lest thy waxy wings should melt.”
What’s the really stupidest thing you can do with a Data Lake? Well, here it is!
Try and implement a real mature enterprise or manufacturing Data Warehouse using the Hadoop tech stack.
What do I mean?
Imagine you have a full portfolio of high-level business requirements.
It’s from a world-class widget manufacturing plant.
They want a series of things to ensure that what actually happens in the plant is accurately represented in the data. They want to do comparative analysis, forensic data analysis and data discovery. They also want to do reconciliations and canned reporting.
The requirements have data controls, data reporting and data analytics written all over them. In fact, this is what business requirements represent.
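For the avoidance of doubt, a reconciliation control is nothing exotic. Here is a minimal sketch in Python, assuming rows arrive as plain dictionaries. The field names `batch_id` and `units` and the sample figures are hypothetical and purely for illustration, not taken from any real plant. It totals a measure per key on both sides and reports every key where the plant and the warehouse disagree.

```python
from collections import Counter


def reconcile(source_rows, warehouse_rows, key="batch_id", measure="units"):
    """Return {key: (source_total, warehouse_total)} for every mismatch.

    The field names are illustrative assumptions, not a real schema.
    """
    def totals(rows):
        acc = Counter()
        for row in rows:
            acc[row[key]] += row[measure]
        return acc

    src, dwh = totals(source_rows), totals(warehouse_rows)
    breaks = {}
    # Check every key seen on either side, so that rows missing
    # from one side show up as a break too.
    for k in sorted(set(src) | set(dwh)):
        if src.get(k, 0) != dwh.get(k, 0):
            breaks[k] = (src.get(k, 0), dwh.get(k, 0))
    return breaks


# Hypothetical sample data: one B2 row never reached the warehouse.
plant = [
    {"batch_id": "B1", "units": 100},
    {"batch_id": "B2", "units": 50},
    {"batch_id": "B2", "units": 25},
]
warehouse = [
    {"batch_id": "B1", "units": 100},
    {"batch_id": "B2", "units": 50},
]

print(reconcile(plant, warehouse))  # {'B2': (75, 50)}
```

The point of the sketch is that a control like this is only worth anything if both sides can be trusted to hold exactly what was loaded, which is precisely the guarantee the business is asking for.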
The data volumes are modest. Less than forty terabytes of data, including lots of indexing.
The demand is for availability, reliability and speed. It’s got to be 100% reliable and it has to be fast.
Maybe what you have is not a greenfield development but the development of, for example, a set of flexible controls, reports and analytics to replace a globally trusted and used set of home-grown business applications.
To reiterate: things like that have 4th generation Data Warehousing, operational reporting, BI and analytics written all over them. So relatively speaking, money is no object.
So, given this scenario, Captain Sensible would choose Teradata or Oracle Exadata or SQL Server, etc., right? Maybe even a combination of SQL Server and EXASOL. Your end-user tools could be Tableau, Business Objects, Cognos or MicroStrategy, etc. This is the way to go, right?
No! Hell, no!
You choose to do operational controls, operational reporting and operational analytics using Hadoop. Because, this is what it definitely wasn’t designed to do. So what better tool for the job? The tool that is least appropriate. It’s counter-intuitive. It’s cock-eyed and crazy. It’s on another planet. But it’s ‘innovation’.
But, what do you care? Maybe your CIO declared Big Data to be a crock-of-fertiliser that promotes growth. So, you do what you do. Because you know no better and you have no ethical principles or professional integrity.
Just when you think it couldn’t get worse, we start making Hadoop the go-to architecture on which everything will eventually converge.
“Four! Why four? But, Martyn, you said it was three!”
I kid you not.
There’s stupid, stupider and there’s stupidest. But, then there is the “what the, what the ****?” moment of truly Homeric proportions.
I thought that I’d seen it all. But, this idea is so screwed up it deserves a whole institution of its own – a mental institution. In fact, one hopes that there is a modern-day Jung or Freud waiting in the wings, fit and ready to call out this fetish nonsense for what it really is, and in language that cannot be misinterpreted.
Here it is.
Adopting a strategy that dictates that all future data management requirements, no matter where or what or for whom, will converge on a Big Data based Hadoop ecosphere called, euphemistically, the Data Lake.
Absolutely stark raving bonkers.
What do I mean?
Any CIO who advocates converging all data management in Hadoop is seriously delusional and clearly knows less than the square root of bugger-all about the Hadoop ecosphere.
Any CEO who allows their CIO to become such a technological loose cannon deserves the heartiest of condemnations from the shareholders and stakeholders. In fact, that should earn a CIO and his or her close advisers a well-deserved early bath.
That’s it, folks
So, there you have it, folks. The dumb, dumber, dumbest and WTF stages of Data Lake stupidity. Gonzo journalist Hunter S. Thompson is credited with the phrase “when the going gets weird, the weird turn pro”. He could just as well have been referring to Big Data and Data Lake nerds.
The Hadoop ecosphere is the shadow-app-land of the software industry. Yes, some tools may be useful for some companies, but not everyone has the needs of Google, Amazon or Tencent Holdings Ltd.
Anyway, it’s been an eventful year and I hope you enjoyed the ride.
I will be back next year with another series of rants, insider insights and strategic directions.
In the meantime, don’t be tempted to build a Data Warehouse in Hadoop, even if it is for kicks and giggles. You’ll thank me later.
Take care and many thanks for reading.