Martyn Richard Jones
Baile Átha Cliath
It’s Easter in Baile Átha Cliath, and thoughts naturally turn to purple-shrouded, sombre-shadowed and mystical aspects of bereavement, resurrection and bunnies.
Which may go some way to explaining why people come up to me in the street and, apropos of nothing, ask me what the biggest reason is that Data Warehousing projects fail.
Try as I might, I cannot convince people that the reasons I give have any validity. It’s as if the mundaneness of the stupidity involved in these failures is just too frivolous, banal and immature to be believable, or even worth listening to.
So, frustrated by the lack of credibility of stories of failure, I have tried to come up with the mummy and daddy of data warehousing narratives that should leave people in no doubt.
Many people will know that I am quite happy to bash the bejaysus out of Camp Hadoop when claims of the Hadoop Ecosphere’s mighty power come even remotely near to encroaching on the areas of Data Warehousing and serious data management, of the business kind. But I will, in the spirit of the moment, gladly embrace the hyper-wave and adopt, albeit temporarily, a contrary position to my otherwise contrary position.
Yes, Hadoop can help you revolutionise the way you do Data Warehousing. It can be used to make-over your jaded warehouse, so much so, that you won’t even recognise the replacement as being anything like that data warehouse you once had. It’s like not being able to recognise your own son.
You know, that boring Inmon Data Warehouse that just delivered the business the data that the business people working in it wanted. The Data Warehouse that just did the job, and nothing more. The data warehouse that couldn’t do interesting things like giving you a real-time view on who was spending too much time on their lunch breaks, what Sharon had for dinner last night or how many cups of coffee were consumed by the developers during the last sprint.
So here is the end-to-end process of revolutionising your Data Warehouse by embracing and energising all the power and sophistication of the Hadoop Meta-Space.
1. Unfortunately, most of your corporate data is in structured databases, such as relational databases, hierarchical databases and network databases. It’s all locked in Bedlam hell or data gaol or that infernal corporate IT landscape. This is a problem. Getting this old-fashioned, decrepit and lifeless legacy-data into an amazing, fantastic and total-value Hadoop base requires maximum conceptualisation, guile and cunning.
2. But don’t despair. Help is at hand.
3. NextGen data science, analytics and deep-dive machine learning all need data to be in an unstructured form, the more unstructured the better. This is to protect the nature of the data and to ensure its usability rating is as it should be and that everything is aligned in the Hadoop night-sky. Having structured data here would harm the process and the credibility of the tools being used.
4. Getting this legacy structured data into the NextGen data structure requires some manipulation, but it can be done.
5. First, find a way to export each type of database data to a flat file format. (Flat file format is a revolutionary new database technology for advanced analytics and data science). Once you have managed that, work out how to export each and every data item, record or set into a non-standard XML format. Yes, this will add a massive overhead to the amount of storage that you will need, but rest assured, this won’t hurt me at all.
6. As you export each item of data to your flat files, ensure that you also encode that data using blockchain technology, and include the generated keys with each data item and collection of data items.
7. It isn’t necessary to use bitcoin or to replicate the ledger to China, but, if that’s what takes your fancy, then go ahead. However, if you are going to do that, don’t forget to add this data to all the other data in your flat files.
8. Also, don’t forget to timestamp every piece of data and every piece of enriched or inferred data that you add into the ever-expanding big data-driven mixing bowl.
9. Now, once you have unloaded all of your data, you will also want to ensure that there isn’t any nasty hidden uniformity in that data which would skew any downstream data-citizenry style analytics.
10. Your massive data files are loaded into an obfuscation engine that will guarantee the removal of all traces of high-level structuring, relational proximity and disintermediation.
11. Explanation time-out. Basically, the process harmonises the data, treating it as collections of streams of (5*2)*8-bit chunks. Then it performs a degenerative Rubik’s cube geometrical shift and sort – astrolabe, random walk down Wall Street and wheels-within-wheels style. But, enough of the tech-talk.
12. Then you apply an IoT data formatting regime to each of the new 80-bit slices in order to bring the data into the 21st century. This will give you your final staged data – let’s call it the Sophista Target.
13. From here on in, it’s plain sailing.
14. You connect all your commodity servers.
15. You spin up all the disks.
16. You install the flavour of Hadoop that you most fancy. Personally, I fancy the Parallel CatGrepAwkCut flavour, but each to their own.
17. You install Spark and Python (aka Snake and Pygmy).
18. You load your Sophista Targets onto your Haddock Platform.
19. You open the Apache Kimono and ingest the Sophista targets into Haddock On Cloudback (the infamous data-warehouse-as-a-service platform alternative to yokes such as Amazonked, IBeenHad or Azure Were). Apache Mints is a great anti-data-quality tool for use here.
20. You marvel at the data now residing on your cloud-oriented Hadoop platform locale… We are almost done! Oops! Don’t forget to add your fave SQL engine to access the data – and don’t forget, the clunkier the engine, the better.
21. Now the only limitations are your own imaginations – and the power and quality of what-you-get from the ‘man’.
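For the truly committed, the flat-file, blockchain-encoding and timestamping business of steps 5, 6 and 8 can even be sketched in code. What follows is a tongue-in-cheek, minimal Python sketch only: the staff table, its fields and the `sophista_staging.jsonl` file name are all invented for illustration, and the “blockchain” is nothing grander than a SHA-256 hash chain over timestamped XML fragments.

```python
import hashlib
import json
import sqlite3
from datetime import datetime, timezone

# A stand-in "legacy" structured database (table and rows invented for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (id INTEGER, name TEXT, lunch_minutes INTEGER)")
conn.executemany("INSERT INTO staff VALUES (?, ?, ?)",
                 [(1, "Sharon", 95), (2, "Martyn", 30)])

def export_to_flat_file(connection, table, path):
    """Dump every row as a non-standard XML fragment, hash-chained
    (blockchain-style) and timestamped, one record per line."""
    prev_hash = "0" * 64  # genesis hash for the start of the chain
    with open(path, "w") as flat:
        for row in connection.execute(f"SELECT * FROM {table}"):
            payload = "<record>" + "".join(
                f"<field>{value}</field>" for value in row) + "</record>"
            stamp = datetime.now(timezone.utc).isoformat()
            # Each record's hash covers the previous hash, so tampering
            # with one line breaks every line after it.
            record_hash = hashlib.sha256(
                (prev_hash + payload + stamp).encode()).hexdigest()
            flat.write(json.dumps(
                {"xml": payload, "timestamp": stamp,
                 "prev": prev_hash, "hash": record_hash}) + "\n")
            prev_hash = record_hash

export_to_flat_file(conn, "staff", "sophista_staging.jsonl")
```

Note the massive storage overhead, exactly as promised: every field gains XML tags, a timestamp, and two 64-character hashes. Sophista Target achieved.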
So. Bish, bash, bosh! Happy days!
Many thanks for reading.