Martyn Richard Jones
Remastered for 2026
Intro
I aim to bring some semblance of simplicity to the Big Data debate. I also strive for coherence and integrity. Hence, I am sharing an evolving model for pervasive information architecture and management.
This is an overview of the realignment and placement of Big Data within a more generalised architectural framework, one that integrates data warehousing (DW 2.0), business intelligence, and statistical analysis.
The model is now referred to as the DW 3.0 Information Supply Framework, or DW 3.0 for short.
A recap
In a previous piece, titled 'Data Made Simple – Even "Big Data"', I looked at three broad-brush classes of data: Enterprise Operational Data, Enterprise Process Data, and Enterprise Information Data. The following is a diagram taken from that piece:
Fig. 1 – Data Made Simple
In simple terms, the three classes of data can be defined as follows:
Enterprise Operational Data – This is data that is used in applications that support the day-to-day running of an organisation’s operations.
Enterprise Process Data – This is measurement and management data collected to show how the operational systems are performing.
Enterprise Information Data – This is primarily data that is collected from internal and external data sources, the most significant source typically being Enterprise Operational Data.
These three classes form the underlying basis of DW 3.0.
The overall view
The following diagram illustrates the overall framework:
Fig. 2 – DW 3.0 Information Supply Framework
There are three main elements within this diagram: Data Sources; Core Data Warehousing (the Inmon architecture and process model); and, Core Statistics.
Data Sources – This element covers all the current sources, varieties and volumes of data available that may be used to support the processes of ‘challenge identification’, ‘option definition’ and decision making, including statistical analysis and scenario generation.
Core Data Warehousing – This is a suggested evolution path of the DW 2.0 model. It faithfully extends the Inmon paradigm to include not only unstructured and complex data but also the information and outcomes derived from statistical analysis performed outside of the Core Data Warehousing landscape.
Core Statistics – This element covers the core body of statistical competence, mainly but not only with regard to evolving data volumes, data velocity, data quality and data variety.
The focus of this piece is on the Core Statistics element. Mention will also be made of how the three elements provide useful synergies.
Core Statistics
The following diagram focuses on the Core Statistics element of the model:
Fig. 3 – DW 3.0 Core Statistics
What this diagram seeks to illustrate is the flow of data and information through the process of data acquisition, statistical analysis and outcome integration.
What this model also introduces is the concept of the Analytics Data Store. This is arguably the most important aspect of this architectural element.
Data Sources
For the sake of simplicity there are three explicitly named data sources in the diagram (of course there can be more, and the Enterprise Data Warehouse or its dependent Data Marts may also act as data sources), but for the purpose of this blog piece I have limited the number to three: Complex data; Event data; and, Infrastructure data.
Complex Data – This is unstructured data, or data with a highly complex structure, contained in documents and other complex data artefacts, such as multimedia documents.
Event Data – This is an aspect of Enterprise Process Data, typically at a fine-grained level of abstraction. Here are the business process logs, the internet web activity logs and other similar sources of event data. The volumes generated by these sources will tend to be higher than those of other data, and they are the volumes currently associated with the term Big Data, covering as it does the masses of information generated by tracking even the most minor piece of ‘behavioural data’ from, for example, someone casually surfing a web site.
Infrastructure Data – This aspect includes data which could well be described as signal data: continuous, high-velocity streams of potentially highly volatile data that might be processed through complex event correlation and analysis components.
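To make the signal-data point a little more concrete, here is a minimal sketch of the kind of early, time-windowed reduction such streams typically need before anything reaches the warehouse or the Analytics Data Store. The class and field names (SignalReading, sensor_id, window_seconds) are my own illustrative assumptions, not part of the framework.

```python
# Illustrative sketch: collapsing a high-velocity signal stream into per-sensor,
# per-window summaries. Names and the 60-second window are assumptions.
from dataclasses import dataclass
from statistics import mean
from typing import Iterable, Iterator

@dataclass
class SignalReading:
    timestamp: float   # epoch seconds
    sensor_id: str
    value: float

def reduce_signal(readings: Iterable[SignalReading],
                  window_seconds: int = 60) -> Iterator[dict]:
    """Yield one summary record per sensor per time window instead of raw readings."""
    buckets = {}
    for r in readings:
        window = int(r.timestamp // window_seconds)
        buckets.setdefault((r.sensor_id, window), []).append(r.value)
    for (sensor_id, window), values in sorted(buckets.items()):
        yield {
            "sensor_id": sensor_id,
            "window_start": window * window_seconds,
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
        }

# Two raw readings become a single summary row for the pump-1 sensor.
readings = [SignalReading(0.0, "pump-1", 3.1), SignalReading(12.0, "pump-1", 3.3)]
print(list(reduce_signal(readings)))
```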
The Revolution Starts Here
Here I will backtrack slightly to highlight some guiding principles behind this architectural element.
Without a business imperative, there is no business reason to do it: for every significant action or initiative, even a highly speculative one, there must be a tangible and credible business imperative behind it. The difference is as clear as that between the Sage of Omaha and Santa Claus.
A full and deep understanding of what needs to be achieved is the basis for all architectural decisions: consider all available options. For example, there must be sound reasons for rejecting the use of a high-performance database management product; even cost can be a sound reason, but an opinion such as “I don’t like the vendor much” is not. If a flavour of Hadoop makes absolute sense, then use it. If Exasol, Oracle or Teradata makes sense, then use them. You have to be technology agnostic, but not a dogmatic technology fundamentalist.
Statistics and non-traditional data sources must be fully integrated into future Data Warehousing landscape architectures: building even more corporate silos, whether through action or omission, will lead to greater inefficiencies, greater misunderstanding and greater risk.
The architecture must be coherent, usable, and cost-effective: If not, what’s the point, right?
No relevant technology, technique or method should be ignored: existing or emerging technologies must be incorporated into the architectural landscape cost-effectively.
Reduce early and reduce often: Massive volumes of data, especially at high speed, are problematic. Reducing those volumes, even if we can’t theoretically reduce the speed, is absolutely essential. I will elaborate on this point and the next separately.
Only the required data is sourced: ship only the data that needs to be shipped. Again, this underlines the importance of having clear business imperatives, because they determine which data is essential enough to send in the first place.
Reduce Early, Reduce Often
Here, I expand on the theme of early data filtering, reduction and aggregation. We are generating increasingly massive amounts of data, but we don’t need to hoard all of it to derive value from it.
In simple data terms, this is about putting the initial ET of ETL (Extract and Transform) as close to the data generators as possible. It is the concept of the database adapter, but in reverse.
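As a minimal sketch of that ‘reverse adapter’ idea, assume a web server emitting one JSON record per event and a keep-list of fields derived from the business imperative; everything here (field names, the health-check rule) is a hypothetical example rather than a prescribed design.

```python
# Hypothetical 'ET at the generator': extract and transform each event where it is
# produced, so that only compact, potentially useful records are ever shipped.
import json
from typing import Optional

KEEP_FIELDS = {"ts", "session_id", "page", "status"}   # only what the imperative needs

def transform_at_source(raw_log_line: str) -> Optional[str]:
    """Turn one verbose log record into a compact one, or drop it entirely."""
    try:
        event = json.loads(raw_log_line)
    except json.JSONDecodeError:
        return None                              # unparseable junk never leaves the source
    if event.get("page") == "/healthcheck":
        return None                              # zero-value noise is reduced early
    compact = {k: event[k] for k in KEEP_FIELDS if k in event}
    return json.dumps(compact, separators=(",", ":"))
```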
Let’s look at a scenario.
A corporation wants to carry out some speculative analysis on the many terabytes of internet web-site activity log data being generated and collected every minute of every day.
They are shipping massive log files to a distributed platform on which they can run data mapping and reduction, and then analyse the resulting data.
The problem they have, as with many web sites that were developed by hackers, designers and stylists rather than engineers, architects and database experts, is that they are lumbered with humongous and unwieldy artefacts such as massive log files of verbose, obtuse and zero-value-adding data.
What do we need to do to remove this challenge?
We need to rethink internet logging and then we need to redesign it.
- We need to be able to tokenise log data in order to reduce the massive data footprint created by badly designed and verbose data (see the sketch after this list).
- We need to have the dual option of being able to continuously send data to an Event Appliance that can be used to reduce data volumes on an event-by-event and session basis.
- If we must use log files, then many small log files are preferable to fewer massive log files, and frequent log cycles are preferable to infrequent ones. We must also maximise the benefits of parallel logging. Time-bound and volume-bound session logs are also worth considering in more depth.
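The first point in the list can be illustrated with a small tokenisation sketch: verbose, endlessly repeated strings (user agents, referrer URLs and the like) are replaced by short tokens, and the token dictionary is shipped once rather than with every event. The class and field names are assumptions for illustration only.

```python
# Illustrative tokenisation: shrink the log footprint by dictionary-encoding
# verbose, repetitive values.
class Tokeniser:
    def __init__(self) -> None:
        self._tokens = {}

    def tokenise(self, value: str) -> int:
        """Return a small integer token for a verbose string, registering it if new."""
        return self._tokens.setdefault(value, len(self._tokens))

    def dictionary(self) -> dict:
        """The reverse mapping, shipped once (or on change) rather than per event."""
        return {token: value for value, token in self._tokens.items()}

# A 200-byte user-agent string collapses to a couple of bytes per event.
ua = Tokeniser()
events = [
    {"page": "/home", "ua": ua.tokenise("Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...")},
    {"page": "/cart", "ua": ua.tokenise("Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...")},
]
```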
So now we are getting log data to the point of use either via log files, via log files produced by an Event Appliance (as part of a toolkit of Analytic Data Harvesting Adapters), or via messages sent by that appliance to a reception point.
Once that data has been transmitted (by conventional file transfer/sharing or by messaging) we can move to the next step: ET(A)L – Extract, Transform, Analyse and Load.
For log files we would typically employ the full ET(A)L, but for messages we do not, of course, need the E, the extract, as this is a direct connection.
Again, ET(A)L is another form of reduction, which is why the analysis step is included: to ensure that the data that gets through is the data that is needed, and that junk with no recognisable value gets cleaned out early and often.
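For what it is worth, here is one way an ET(A)L job stream might hang together, assuming tab-separated clickstream records and a dwell-time threshold as the analytical gate; the functions, field names and threshold are all illustrative assumptions.

```python
# A sketch of ET(A)L: Extract, Transform, Analyse (as a reduction gate) and Load.
from typing import Iterable, Iterator

def extract(paths: Iterable[str]) -> Iterator[str]:
    for path in paths:                      # E: only needed for files; messaging connects directly
        with open(path, encoding="utf-8") as fh:
            yield from fh

def transform(lines: Iterable[str]) -> Iterator[dict]:
    for line in lines:                      # T: parse and reshape into analysable records
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 3:
            yield {"session": parts[0], "page": parts[1], "dwell_ms": int(parts[2])}

def analyse(records: Iterable[dict], min_dwell_ms: int = 500) -> Iterator[dict]:
    for rec in records:                     # A: junk with no recognisable value is cleaned out here
        if rec["dwell_ms"] >= min_dwell_ms:
            yield rec

def load(records: Iterable[dict], store: list) -> None:
    store.extend(records)                   # L: in practice, a write to the Analytics Data Store

# Usage with in-memory sample lines standing in for extracted file content.
sample_lines = ["s1\t/home\t1200\n", "s1\t/ads/pixel\t40\n"]
analytics_data_store = []
load(analyse(transform(sample_lines)), analytics_data_store)   # keeps only the 1200 ms record
```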
The Analytics Data Store
The ADS (which can be a distributed data store in the cloud) supports the data requirements of statistical analysis. Here the data is organised, structured, integrated and enriched to meet the ongoing and occasionally volatile needs of the statisticians and data scientists focused on data mining. Data in the ADS can be accumulative or completely refreshed, and it can have a short life span or a significantly long one.
The ADS is the logistics centre for analytics data. It can be used to provide data to the statistical analysis process, and it can also be used to provide persistent long-term storage for analysis outcomes and scenarios. This is important for future analysis, hence the ability to ‘write back’.
The data and information in the ADS may be augmented with data derived from the data warehouse, and this augmentation may also benefit from having its own dedicated Data Mart specifically designed for the purpose.
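To illustrate the ‘write back’ idea, here is a minimal sketch using SQLite purely as a stand-in for whatever (possibly cloud-distributed) store actually plays the ADS role; the table and column names are assumptions.

```python
# Persisting analysis outcomes back into the Analytics Data Store so that future
# analyses (and, where useful, the warehouse) can reuse them.
import sqlite3

ads = sqlite3.connect(":memory:")           # stand-in for the real, possibly distributed, ADS
ads.execute("""CREATE TABLE analysis_outcome (
    run_id TEXT, scenario TEXT, metric TEXT, value REAL, created_at TEXT
)""")

def write_back(run_id: str, scenario: str, metrics: dict) -> None:
    """Store the outcomes of one analysis run so they remain available for future work."""
    ads.executemany(
        "INSERT INTO analysis_outcome VALUES (?, ?, ?, ?, datetime('now'))",
        [(run_id, scenario, metric, value) for metric, value in metrics.items()],
    )
    ads.commit()

write_back("run-042", "churn-baseline", {"auc": 0.81, "lift_at_decile_1": 3.2})
```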
Results of statistical analysis on the ADS data may also feed back into tuning the data reduction, with filtering and enrichment rules being adjusted further back in the pipeline, either in smart data analytics, complex event and discrimination adapters, or in ET(A)L job streams.
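One simple way to picture that feedback loop, purely as an assumption about how a tuning rule might look: if analysis finds that almost none of the loaded records carried signal, the upstream filter is tightened; if too much was discarded, it is relaxed. The thresholds and the notion of a single dwell-time filter are illustrative only.

```python
# Hypothetical feedback rule for tuning the reduction threshold pushed back to the
# adapters and ET(A)L job streams.
def tune_filter(current_min_dwell_ms: int, useful_fraction: float) -> int:
    """Return an adjusted filter threshold based on how useful the last load proved."""
    if useful_fraction < 0.05:
        return int(current_min_dwell_ms * 1.5)            # let less noise through next cycle
    if useful_fraction > 0.60:
        return max(100, int(current_min_dwell_ms * 0.8))  # we may be discarding signal
    return current_min_dwell_ms

# e.g. only 2% of loaded records proved useful, so the threshold rises from 500 to 750 ms.
new_threshold = tune_filter(500, 0.02)
```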

That’s all, folks.
This has necessarily been a very brief and high-level view of what I currently label DW 3.0.
The model doesn’t seek to define statistics or how statistical analysis is to be applied; that has been done more than adequately elsewhere. The model focuses on how statistics can be accommodated in an extended DW 2.0 architecture, without resorting to reactionary and ill-fitting solutions to problems that can be solved more effectively with good sense, sound engineering principles and the judicious application of appropriate methods, technologies and techniques.
If you have questions, suggestions, or observations about this framework, please feel free to contact me here. You can also reach out via LinkedIn mail.
Many thanks for reading.
File under: Good Strat, Good Strategy, Martyn Richard Jones, Martyn Jones, Cambriano Energy, Iniciativa Consulting, Iniciativa para Data Warehouse, Tiki Taka Pro

