Martyn Richard Jones
Bonn, Germany
27th June 2020
Heaven help us! Talk about taking a great idea and destroying its heart, spirit and soul.
Again, and in data architecture and management terms, that’s what the virtual data warehouse blaggers are doing with data warehousing, one more time.
These virtual data warehouse folk talk about it as if it were merely magic. No effort required.
And what’s one of the main reasons for this? The evangelists for virtualising data warehousing actually believe that a data warehouse is nothing more than a copy, yes a copy of operational data.
This is another case of “a little knowledge can be a dangerous thing.”
So just consider this.
Every time you run a virtual data warehouse query you:
- Access operational systems, and:
- Run queries on platforms that are finely tuned to balance transactional workloads.
- Are hitting those platform databases, memory and compute power with generalised, large or complex queries which might not be appreciated.
- Potentially compromising those folk who have the responsibility to ensure that service levels are within agreed tolerances.
- Place an extra and unpredictable load on the OLTP systems which might result in the purchase of additional compute, storage, memory and network capabilities. Of course, it’s all done behind the scenes. Still, each and every time you run anything requiring selection of data from multiple sources, even on the same box, you will have to:
- Bring big data sets across the network to provide collection, marshalling and processing at one integration point.
- Ensure the quality and cleanliness of data and fix it if it’s wrong.
- Integrate the data.
- Link the data.
- Generate a subset of the data.
- Provide that data.
- And you’ll do this each and every time you run this type of query.
That’s extra processing, extra complexity and additional cost, each and every time.
The idea of just going and getting the data is a compelling one, and there are reasons for doing it. In the days of ancient computing, it was reasonable to obtain operational reports by running reporting programs on the OLTP systems, usually outside of peak usage. And it’s still possible to do that. But that’s operational reporting, and it has very little to do with data warehousing.
These are the things that you can’t do with virtual data warehousing that you can do with real data warehousing.
Organise data by subjects: Data in the DW is organised in a way that can be understood even by the business user. It is held in a way that reflects the business. Data in OLTP systems are typically not represented in this way. To be fair, some tools can provide business-oriented views of operational data. And this is useful for operational reporting and ad-hoc, collaborative and speculative data exploration and investigation.
Integrate once, read many times: Data warehousing takes data from OLTP systems and data suppliers and integrates it. Once data has been incorporated in the DW, you don’t have to re-integrate it every time you run a query.
It’s done, it’s checked, and it’s ready for use.
And, once is enough.
Non-volatility and accurate reporting: data in operational systems is continually changing. That’s great if all you want to do is operational reporting or exploratory data work. But it’s not great if you’re going to use data for strategic and tactical reasons where everyone needs to be on the same hymn sheet when it comes to data.
Analysing the past and predicting the future: The OLTP systems will typically not store years of transactional data. And even if they did the platform owners wouldn’t take kindly to someone running a historical analysis query during peak business hours. Data warehouses are constructed in a way that makes it far easier to analyse the past, try and predict the future and to take multiple viewpoints of data. What I call “look back, look up and look ahead,” and from the views of the past, present and future.
Quality check and correct once: read many times: As for integration, the primary quality checks only need to be performed once on any of the data. It’s not something that must be done each time you run a query. This leads to less complexity and lower processing costs.
Slowly changing dimensions: Data and data structures in OLTP systems are designed to support operational applications. They represent the business as it is now and in the near past. They are not intended to reflect business changes in organisation or structure. That’s what well-architected data warehouses are also designed to do.
The simple fact of that matter is that data warehousing is a dedicated business-data and process platform.
In my opinion, data warehousing provides quality data that is subject-oriented, integrated, business-centric and analytics-enabled. If you do that stuff, then you need it. Virtual data warehousing is a totally bullshit term. It really is a nonsensical and ill-thought-out concept. One that actually has nothing to do with data warehousing.
And, if you are reporting off operational systems, even if that involves multiple sources, it’s still mainly about operational reporting.
And it is about that, no matter what gizmo you are using or what architectural and engineering terms you are abusing.
Thanks for reading. More significant thanks for not adopting Virtual Data Warehouse BS.
PS Don’t forget to grab my book on the world of data. It’s available as an ebook at Amazon. For more details go to this page: https://goodstrat.com/2020/05/08/laughing-at-big-data-ebook-great-new-insight-into-realities-of-it/