Historical big data will save the future

Martyn Richard Jones

Prediction is very difficult, especially if it’s about the future.

Niels Bohr

Can we predict the future of humankind in the same way that we forecast the weather?

In an article published in The Guardian, the journalist Laura Spinney discusses how historical data could be used, not only to predict the future of humankind but to save it as well. The piece, titled “History as a giant data set: how analysing the past could help save the future”, discusses how a small group of academics came up with the notion that the analysis of historical big-data could be put to use for the common good. At least that was my initial takeaway.

The estate

But after some basic research, I established that the article is somewhat misleading, to say the least, because it appears to contradict the views of the academics that it discusses. Contrary to the ambiguous title, their work isn’t about merely analysing the past but about testing established historical explanations of the past against historical data, which is quite a different thing altogether.

That said, because I have found that many people interpret and relate the story in the way that Laura has done, I am still compelled to argue against the viewpoint that it is possible, or even desirable, to predict future regional, national and cross-national tensions, conflicts and upheavals just by analysing historical data.

I also have a bit of an issue with the generalised belief in big data because the incompleteness, imprecision and discriminatory biases of data, and not just historical big-data, will quite possibly lead to flawed analysis and the setting of unrealistic expectations.

However, a lot of academic historians, Paul Kennedy for example, have done an excellent job of analysing the past and present to contribute to our better understanding of the essential lessons, explanations and strategies that work for the common good. And they have done so without resorting to the casting of runes or the consultation of historical big-data.

The data and information that falls between the cracks

Now let’s turn our attention to data and information that falls between the cracks.

How many people have had great ideas, thoughts and insights that were never captured, even on paper, and were then forgotten and lost to history? It happens more often than we might realise. So, what about data and information that goes ‘absent without leave’?

When I first started in computing, I referred to data and information that is not captured digitally as the data that falls between the cracks. It applies to the unrecorded reference, transactional, organisation and environment data that is rarely documented and archived, often not digitally and not even on paper, and as such it is data that we won’t be able to retrieve. It’s gone, it’s lost to history, and it’s as if it never existed.

Or maybe what is worse than the idea of something never existing is the frustration of knowing that something did exist, but no matter how hard we think about it or how diligently we search for its content we are never going to reencounter it. Or to put it slightly differently, we might recall the existence of a container (e.g. a notebook) of data or information that we remember from the past, but we are unable to recall the detail, nor will Google provide us with the answers. This becomes amplified when we are considering lost artefacts of sentimental value.

“Do you remember warehouse manager Fred and the notebooks he used to keep? What did he write in them?”

“No idea, mate.”

This lost data includes things like the identity of a person in the process chain – a worker was a worker and not usually identified as an individual. We very rarely kept detailed result data from individual quality control tests – much reporting about quality was highly aggregated with very little by way of verifiable and actionable detail. And in a complex manufacturing environment, we couldn’t trace finished products back to the suppliers of their components – we didn’t even think it was much of a requirement. These days we leave our digital footprints everywhere. We record almost every aspect of quality assurance and testing. And a car manufacturer can trace a faulty component in a specific car back to a supplier and a batch of parts.

Other data which has fallen between the cracks includes data contained in the informal systems of record used in business and society. In journalism, this could be a reporter’s notepad; in manufacturing, it could be a production controller’s written journal; in the military, it could be written despatches. It could also include crib sheets, how-to cookbooks and diagrams of real rather than official and formal hierarchies and spheres of influence.

The granularity of data and information

At what level of granularity do you look at the data: local, regional, national, continental or global? Getting the granularity of the data right is essential for correct analysis, but with historical data, even historical big data, this might not be possible.

The granularity of the data in focus may also be a factor in the quality of the analysis. It may be that our level of scrutiny is based on data that is too detailed, therefore leading to difficulties in identifying clear trends, notable movements and discernible social behaviour patterns. Or the data could be aggregated to such a degree that all nuance is removed.
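
To make the aggregation point concrete, here is a minimal Python sketch. The two regions and their yearly figures are invented purely for illustration and are not drawn from any real dataset.

```python
# Invented yearly figures for two hypothetical regions (not real data).
north = [100, 110, 120, 130, 140]
south = [100, 90, 80, 70, 60]

# Aggregate to the national level by adding the regions year by year.
national = [n + s for n, s in zip(north, south)]


def net_change(series):
    """Difference between the last and first value of a series."""
    return series[-1] - series[0]


print("north:   ", net_change(north))     # 40
print("south:   ", net_change(south))     # -40
print("national:", net_change(national))  # 0
```

At the national level nothing appears to happen, while each region is moving sharply in the opposite direction. Which level is the right one depends on the question being asked, and with historical data we may not get to choose.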

The false relationships between data and information

There is quite a widespread notion that correlation implies causation, which is further strengthened by long-held beliefs about specific data points and their relationship with other data points. In the absence of causation, correlation can produce some bizarre examples. Here are a few:

A dearth of pirates caused climate change.
The more sunspots there are, the more Republicans there are in the US Senate.
The more bee colonies there are, the more juvenile arrests for marijuana there are.
The more hours of Californian sunlight there are, the more visitors to SeaWorld there are.
The more the consumption of sour cream goes up, the more the number of motorcycle deaths goes up.
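
One common way such pairings arise is that two unrelated quantities both happen to drift in the same direction over time. The Python sketch below is an illustration under stated assumptions, not an analysis of any of the examples above: both series are invented, each is just a steady trend plus its own independent noise, and the slopes, noise levels and random seed are arbitrary choices of mine.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def changes(series):
    """Period-to-period changes, which strip out a steady trend."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


rng = random.Random(42)  # fixed seed so the run is repeatable
periods = range(200)

# Two invented series with no causal link: a trend plus independent noise.
series_a = [10 + 2.0 * t + rng.gauss(0, 3) for t in periods]
series_b = [5 + 0.5 * t + rng.gauss(0, 1) for t in periods]

r_levels = pearson(series_a, series_b)
r_changes = pearson(changes(series_a), changes(series_b))

print(f"correlation of levels:  {r_levels:.2f}")
print(f"correlation of changes: {r_changes:.2f}")
```

The raw levels correlate very strongly simply because both series rise over time. Once the shared trend is removed by looking at period-to-period changes, what remains is only the independent noise, and the apparent relationship largely disappears.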

Indeed, taking information from the past that was ostensibly based on some long-lost dataset is also potentially problematic.

Also, there are quality issues with regard to data and their original constituency boundaries, and what we assume were their original boundaries. Again, this takes us into the realm of hypothesis and unicorns and is like assigning data to New Jersey when, in fact, it belongs more to New York.

I should mention that the lineage of data items and their association with other data items isn’t written in stone. It’s not a linear continuum, and both lineage and affinities can change over time. So, for example, data from 1956 and data from 2020 might ostensibly appear to be the same but may not necessarily mean the same thing or have the same affinities. The lenses of 1956 are not the lenses of 2020, and there are social changes and different ways of looking at data and what that data represents from one epoch to another.

The imprecision of data and information

With over 913 pages, the Domesday Book (1085-1086) is a fascinating and unequalled historical public record of ownership of property and resources and is still useful in the UK as proof of title to land. As the BBC put it “produced at amazing speed in the years after the Conquest, the Domesday Book provides a vivid picture of late 11th-century England.”

Historians frequently use the Domesday Book as part of their research into specific regions of the UK. However, the Domesday Book is incomplete. It excluded many large cities, such as London, Winchester and Bristol. It also omitted most of what was then Wales and all of independent Scotland. Indeed, work on the book was abandoned during the reign of William Rufus.

What this illustrates is that at a certain level of detail, the record is valuable to the historical researcher. However, it certainly doesn’t paint an accurate or complete picture of Britain as a whole.

And this is true of public records around the world.

Also, just because something has been recorded doesn’t make it accurate or even right. Maybe some data and information form part of a large and elaborate historical lie or perhaps they are historical eccentricities, the creation of serendipity. Unfortunately, both examples can lead to undesirable consequences.

We believe what we see, but is what we have seen what has occurred? The idea that a record made near to the actors, event or process is necessarily correct is in itself erroneous. Some of the most unreliable witnesses are first-hand witnesses: people who were there when it happened but still recall events, time, place and actors erroneously – that’s how humans are.

The data and information that is just wrong

What about fake news, post-truth, interested revisionism and big lies? How do they affect the value of data and the veracity of analysis?

The spreading of disinformation has been an activity of intelligence services and pressure groups since almost forever. Even Mark Antony was the subject of a smear campaign.

During World War I, the British newspapers The Times and The Daily Mail published articles claiming that Germany was compensating for the shortage of fats by boiling down the corpses of its soldiers for fats, bone meal and pig food.

Before the US-led invasion of Iraq in 2003, The New York Times carried an article detailing a camp where, it was claimed, biological weapons were being produced. These claims were part of the broader disinformation about weapons of mass destruction, allegedly spread by people such as the reporter Judith Miller, again through the medium of The New York Times.

There are also claims and counter-claims regarding the manipulation of social media and the influence on the press and electorate as part of the interference campaigns run in the 2016 US Presidential election and the 2016 Brexit referendum in the UK.

And remember, misinformation is accidental, but disinformation is intentional.

The discriminatory bias of data

March 2019 saw the publication of Invisible Women: Exposing Data Bias in a World Designed for Men, a book by the British feminist, activist, author and journalist Caroline Criado Perez, which explores gender bias in data.

As Undercover Economist Tim Harford, writing in The Times, put it, “Caroline Criado Perez, explores countless cases in which everything from the height of the top shelf to the functionality of an iPhone is predicated on the assumption that the user will be male.” Eliane Glaser, writing in The Guardian, reiterated the truth that “Data not only describes the world, it is increasingly being used to shape it.”

As well as gender bias in data, there is also ethnic, economic and age bias in data and, more importantly, in the algorithms that analyse those data sets. As a result, biased algorithms create even more subjective data while reinforcing the prejudice of the already biased data.

Again, this is data reflecting not the realities of society but the bias, noise and prejudice of certain segments of society. Hardly the sound footings of robust, trustworthy and desirable data analysis.

The survival bias of data

The data we have is the data we have. The things that the lost data could have told us are lost and irrecoverable. For example, if we analyse the performance of active hedge funds, how well they do at betting on financial instruments, then we are ignoring the data on hedge funds that closed, for whatever reason. This survival bias skews the data and, from a historical perspective, does not give a full and accurate account.
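
The hedge fund example can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the funds are simulated rather than real, the yearly return distribution, the five-year horizon, the closure rule (a fund shuts once its value falls below 70% of its starting value) and the random seed are all invented.

```python
import random

rng = random.Random(7)  # fixed seed so the run is repeatable


def simulate_fund(years=5, closure_level=0.7):
    """Return (final_value, survived) for one invented fund starting at 1.0."""
    value = 1.0
    for _ in range(years):
        value *= 1 + rng.gauss(0.03, 0.15)  # assumed yearly return distribution
        if value < closure_level:
            return value, False  # the fund closes and leaves today's listings
    return value, True


funds = [simulate_fund() for _ in range(1000)]

survivor_values = [value for value, survived in funds if survived]
all_values = [value for value, _ in funds]

mean_survivors = sum(survivor_values) / len(survivor_values)
mean_all = sum(all_values) / len(all_values)

print(f"funds still active:          {len(survivor_values)} of {len(all_values)}")
print(f"mean value, survivors only:  {mean_survivors:.3f}")
print(f"mean value, all funds:       {mean_all:.3f}")
```

An analyst who can only see the funds that are still active will report the first average, which is flattering by construction, because every fund that did badly enough to close has been removed from the sample.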

Indeed, some of the great libraries of the world provide us with an excellent source of data and information. But what about the Great Library of Alexandria? The Library of Pergamon? The Imperial Library of Constantinople? Or the Library of the Hanlin Yuan? All lost to history. Their loss has potentially contributed to the increase of bias in certain historical data and information.

The data and information that gets disappeared

Here’s a thought. The legacy of the computer manufacturer that was Univac stretches from 1951 through to the present-day mainframe products of Unisys. And it goes even further back if we include the computing pioneers of ENIAC in the story.

I joined Sperry Univac in March 1980 and left thirteen years later.

Recently I was researching my old company. To my surprise, I was able to explore the history of Univac from the beginnings up to 1980, but then there was an information gap (at least from a Google perspective) between 1980 and 1986, when Sperry Corporation merged with Burroughs. Hardly a brief hiatus. I know for a fact that at one time this information was available on the internet, but now it seems to have been, to some extent, disappeared. I am not a conspiracy theorist, and I’m sure there might be valid reasons for its absence, but it also means that, in terms of historical research into IT, and at least from the internet perspective, there are growing lagoons of incomplete, imprecise and missing data and information.

And that’s just one example of incomplete, misleading and erroneous data and information. And if that skews history, think of the volumes of revisionism that do even more than that.

The data that can’t be derived

Here’s a thought: How do you capture and measure the effects of hubris, irrational exuberance and wilful ignorance in a population with regard to the processes, events and political sentiment of that population? These factors have significant consequences, but we’re not capturing them as data, not even big data. So they get ignored.

And how do you measure the effects of drug usage on a population? According to the experts, people turn to drugs for all sorts of reasons, some of them legitimate and legal, but it isn’t necessarily about money or lifestyles, and this element in society is represented in data only in a manner that is detailed but ultimately superficial.

Predicting the future by the past

Trying to forecast the future by the past is a flawed endeavour, but at least it might be better than nothing. But is that true? Can we prove it? Does it stand up to scrutiny? Is something always better than nothing?

Well, it probably depends.

Opinion polls are a part of everyday life, and they are frequently wide of the mark, even with something as straightforward as an election. These incongruencies can be due to several things, such as unrepresentative samples and margins of error – data and analysis issues. Or maybe the ‘sample’ just lied to the poll taker.
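
The margin of error mentioned above can be made concrete with the standard textbook approximation for a simple random sample. The sketch below assumes a hypothetical poll of 1,000 respondents showing a 52% share for one side, with the conventional 95% confidence level (z = 1.96); none of these figures come from a real poll.

```python
from math import sqrt


def margin_of_error(share, sample_size, z=1.96):
    """Approximate margin of error for a proportion from a simple random sample."""
    return z * sqrt(share * (1 - share) / sample_size)


def interval(share, sample_size, z=1.96):
    """Lower and upper bounds of the approximate confidence interval."""
    moe = margin_of_error(share, sample_size, z)
    return share - moe, share + moe


# Hypothetical poll: 52% for one side among 1,000 respondents.
low, high = interval(0.52, 1000)
print(f"margin of error: +/- {margin_of_error(0.52, 1000) * 100:.1f} points")
print(f"plausible range: {low * 100:.1f}% to {high * 100:.1f}%")
```

With roughly three points of uncertainty either way, a 52 to 48 lead in such a poll is compatible with the other side actually being ahead. And this formula covers random sampling error only; it says nothing about unrepresentative samples or respondents who do not tell the truth, which is where much of the real trouble lies.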

So here’s another question: If we have problems in predicting the near future based on the sampling of data from the present, what chances will we have of effectively using the past to address future strategic challenges accurately?

Very little, right?

False analogies, logical fallacies and charlatanry

But not only do we have issues with the accuracy, completeness and veracity of historical data and information but we also have the false analogies, logical fallacies and charlatanry of those who would have us believe otherwise.

Take, for example, the claim that if we can forecast the weather based on historical data and information, then we can predict our social and political futures by reference to and analysis of historical data. There are things about this that concern me. In the first instance, I am compelled to ask, to what ends? We can make educated guesses about the weather, but why would we want to use historical big-data for making guesses about the future of society, nations and communities? Indeed, what is wrong with the existing techniques for doing precisely that?

I wonder if this isn’t about anticipatory intelligence, command and control, without putting in the leg work. Sure, the CIA, for example, might have been able to improve its prediction of significant geopolitical events based on better data and analytics (including machine learning), but it is still no match for the informed and intelligent early-warning signals that can come from aid agencies, who are involved on the ground with all of the actors and organisations (including government and opposition) in problematic situations in resource-poor countries and regions.

Nor does it require big data to know that wars over resources will grow, especially as a result of the need for fresh water. Neither do we need big data to understand what the adverse effects of climate change will be and whom they will most affect. These issues are already very well documented.

As for other geopolitical phenomena such as the spread of populism in the western world, we have witnessed a shift to the right for the best part of four decades, and this populism is part of that process. We don’t need big data to point this out to us or to inform us of how to fix it.

Poor context makes historical data-driven analysis a Sisyphean task

Those who ignore the past are destined to repeat it. So, don’t be like King Sisyphus.

As punishment for being a naughty boy and pooh-poohing context, Zeus made King Sisyphus roll a massive boulder of big data endlessly up a steep analytics learning curve. This exasperating form of rebuke was handed down to King Sisyphus due to his arrogant belief that his skill and cunning with big data and analytics bested the data and statistical smarts of Zeus himself. Zeus accordingly demonstrated his cleverness by bewitching the boulder of big data into rolling away from King Sisyphus just before he expected the results of his fabulous analytics to materialise, which ended up consigning old Sisyphus to an eternity of hopeless exertion and endless frustration.

So much for big data and data science, eh?

What’s the takeaway from this? If you want to do big data and analytics, then don’t piss off the Greeks.

Culture eats big data and analysis for breakfast

The issue is this: analysing data and anticipating significant political, economic and social challenges must necessarily take into account the prevailing culture – even if we believe we have the best of data, information and analytics – as culture is not easily understandable, representable or applicable in terms of data integration, statistics or data science.

But, it’s not just the power of culture that threatens our analysis. If we approach the study of our historical big-data without applying knowledge of behavioural economics, for example, then we are nothing better than ill-equipped charlatans.

The rap wrap

The world of data, information and analytics can be quite passionate at times, despite itself, and its contradictions can be rather frustrating, irritating and obtuse. Indeed, because it’s one of the flavours of the age, it draws in a lot of chancers, charlatans and bodgers. Rather than always being negative, this brings colour, vibrancy and challenging absurdity to the experience.

That said, big-data-driven historical analysis is in danger of being historical revisionism of the worst kind, because it necessarily removes humanity, empathy and morality from the equation. In my view, it’s a postmodern aberration gone mad – which is code for “it’s a crock of crap.”

So, can we predict the future of humankind in the same way that we forecast the weather?

In short, no, and it’s again a false analogy.

The issues with regards to historical data, information and analysis as highlighted in this chapter bring sharply into focus the intrinsic deceit in the claims that big data can be used to analyse history properly and to provide us with valuable insight which can then be applied to significant challenges going forward.

At the moment, the best way of analysing history is not through big data but through the tried and trusted academic research methods that have been employed in historical inquiry to date. At best the usage of big data in this field is a sideshow, a perverse and fanciful postmodern aberration that should rightfully be consigned to irrelevance.

The problem with just predicting the future by historical data is that this data can rarely provide its own narrative, so we create stories based on historical data, which we then combine with our own opinion and speculation. I find that approach to be somewhat subjective and problematic, and in a way, self-defeating.

I also have issues with the mathematical modelling and rationalisation of our values and our history. This approach smacks of an instrumentalisation of a crude form of reason that lacks any sobering and calming counter-weights of empathy, humanity and ethics, and this will ultimately lead us back to Hannah Arendt’s banality of evil. Which nobody should want.
