Did big-data kill the statistician?
Note: A version of this piece originally appeared in StatsLife, a publication of the Royal Statistical Society.
Without a grounding in statistics, a Data Scientist is a Data Lab Assistant. – Martyn Jones
Hold this thought: There are big lies, damn big lies and big-data science.
Statistics is a science, and some would argue that it is the oldest of the sciences.
Statistics can be traced back in history to the days of Augustus Caesar, statesman, military leader and first emperor of the Roman Empire. Some set its provenance in even earlier times.
Indeed, if we accept that censuses are a part of statistics, we can go back as far as the Chinese Han Dynasty of 2 AD, the Egyptians of 2,500 BC and the Babylonians of 4,000 BC.
Nonetheless, the first statistician in recorded history is Al-Kindi, a ninth-century Muslim polymath and intellectual from Kufa, a city and centre of learning on the banks of the Euphrates in the land now known as Iraq. Al-Kindi, educated in Baghdad, used frequency analysis in cryptography and code-breaking, which he wrote about in his book Manuscript on Deciphering Cryptographic Messages. The book was lost to civilisation until 1987, when the treatise was fortunately rediscovered in the Süleymaniye Ottoman Archive in Istanbul.
Of course, being educated in the UK, the first statistician I became aware of as a young person was the Lady with the Lamp, the eminent Victorian best known as Florence Nightingale. Amongst other things, she was a pioneer in the graphical representation of statistical data, something very much back in vogue these days.
In 1998, the US scientific journal The American Statistician published an article by Lynne Billard, an eminent Australian statistician and professor. It laid out the role of the statistician and statistics. She wrote that “no science began until man mastered the concepts and arts of counting, measuring, and weighting.”
I first became aware of the role of the statistician while studying a combination of philosophy, politics and economics back in the late seventies.
Later, in the world of work, my first two bosses were also enthusiastic and pedagogic members of the Royal Statistical Society (RSS). Its founders included the polymath Charles Babbage, the Belgian founder of the Brussels Observatory, Adolphe Quetelet, the economist Richard Jones, and the English cleric, scholar and economist, the great Thomas Malthus. Other notable members of the RSS have included the politician Harold Wilson and the social reformer and statistician Florence Nightingale. The highly laudable aim of the RSS, founded in London in 1834, is “advancing the science and application of statistics, and promoting use and awareness for public benefit.”
While I believe that the RSS does an excellent job of raising awareness about statistics and statisticians, I also feel that perhaps they aren’t getting enough of people’s attention. After all, many folks seem to think that statistical methods and quantitative analysis got discovered somewhere around 2001. Which, and sorry for raining on anyone’s parade, is not the case.
What motivated me to write this chapter was the notion that the rise of data science would see the demise of statistics and the need for statisticians. In particular, it’s a response to disturbing claims such as “data science is more than statistics: it also encompasses computer science and business concepts” and “a data scientist is someone better at statistics than any software engineer and better at software engineering than any statistician.” As if statisticians never engaged with business, understood computer science or programmed computers, ever. These are puerile claims at best. In my view, a well-trained statistician would have no problem in quickly developing excellent programming skills. After all, programming is hardly rocket surgery.
It may not be immediately intuitive, but for me, a great statistician, as well as being a great scientist, is also like a true artist: creating, practising and demonstrating their art. At face value that may be a controversial position to take, so let me try and explain what I mean by that.
Picasso was perhaps the most celebrated painter of the 20th century. He is on record as saying “It took me four years to paint like Raphael, but a lifetime to paint like a child.” But that’s not the same as a child painting, with little or no technique, skill or experience. Picasso took the time and trouble to learn how to paint like a child intentionally.
Picasso could recreate the visions we ascribe to a child through the hands of a genius. He could paint like Raphael, a child or anybody else. Many would argue that he was above all a true artist, painting as he wanted to, with purpose and above all intention.
The way Picasso painted isn’t the same as someone with no artistic or creative ability splodging some abstract and random colours and shapes onto a canvas. That doesn’t automatically make someone an artist. Not in any modern formal sense. Although, that said, in the age of postmodern drivel, some folks believe that everything can be anything or that everyone can be anybody. Which makes sense, considering the number of Lionel Messis, Rafa Nadals and Michael Jordans we have in the world.
But what about a statistician as a great composer of great symphonies?
My sentiments about the statistician-as-artist concept can be summed up by Lynne Billard when she said: “May the future roles of statistics and of statisticians be that beautiful (Beethoven) symphony that brings music to our ears!” That is why I believe that statistics will continue to lead through the force of its personality and creativity. Its practitioners will expand the influence of contemporary statistics into new areas. And they will do so by taking simple and proven methods and applying them on a grand scale to address significant problems – sometimes involving the orchestration and analysis of large data sets.
Those thoughts about art and culture lead me to the modern domestic shrine to the entertainment industry, the gogglebox.
Those who watched the American medical TV drama House might also connect with the following sentiment. In the series, Hugh Laurie played the part of Dr Gregory House. In entertainment terms, Laurie convinces viewers that he is a credible physician. The only thing is, Laurie isn’t a physician. He is an actor pretending to be a physician, and he does a great job of it. He learned his lines well, and he knew how to interpret them to perfection. But as an actor, not as a doctor.
So why do we think big-data is more than just a new name for a collection of old ideas? Why do we believe that data science is forward-looking, modern and sexy while simultaneously thinking that statistics is only about dealing with the past? And why indeed do we lend more credibility to rebranding, smoke and mirrors, and sexing-up than to historical fact, current evidence and critical appraisal?
More to the point, why do people clamour to define themselves as data scientists rather than take on the more recognisable, measurable and manageable role of a statistician? A modern statistician who can interpret the past, monitor the present and try to forecast the future?
I am well aware that there has been a proclivity to hire enthusiastic amateurs and certificate harvesters in place of trained, experienced and qualified professionals, especially if the price is right. But it is an inclination firmly planted in the absurd, incoherent and irrational. As silly as the notion that twopenny-halfpenny qualifications are more important than knowledge and experience.
So, call me old-fashioned, but when I need a haircut, I will go to a barber, and not to a hair artiste or a mop-follicle scientist. When I need a person who knows how to do a wide range of statistics, I will hire a professional and experienced statistician.
A statistician understands that “not everything that counts can be counted, and not everything that can be counted, counts” (a quote attributed variously to Albert Einstein and William Bruce Cameron), and would probably not be swayed by fact-free boloney. So, getting down to fundamentals, why would a statistician prefer to call themselves a data scientist, and why are some data scientists oblivious to or misinformed about the nature of contemporary statistics?
I think the biggest problem is in the way that the IT industry relentlessly flogs new fads. It’s ‘new lamps for old’, but no matter how much obfuscation and marketing get churned into the mixture, it’s still recognisably a massive overload of flimflam and hyperbole.
The other big problem is how so many people are willing to jump on the flimflam trend wagon to wing their way into a data scientist niche. Or are intent on rebranding themselves as data scientists as a knee-jerk reaction to the IT industry’s crude downgrading of the role of statistician – quite often backed by a long concatenation of meaningless clichés, logical fallacies, inaccuracies and blatant misrepresentation.
“Ah,” the blaggers will say, “but with big-data, we can now see what we couldn’t see before, and we can even predict the future, so there!” That’s also a flawed argument. Using the past to predict and shape the future is nothing new, and neither is the identification of hidden patterns. So why, I ask, do people go out of their way to pretend that this is all so recent?
I think it’s fairly clear where this is leading. And if not, I hope it soon will be.
My forecast, or better said, my guess, is that big-data will not kill the statistician. Not due to any benevolence on the part of the data science communities, but because the profession won’t be allowed to disappear so easily. There is a broad appreciation, where it matters – in government, academia, the public sector and industry – that statistical insight is exceedingly valuable, and that statistics is a vital part of modern thinking about data.
Besides, I am quite sure that in 2025, or thereabouts, the data scientists of the day will be criticising the next giant data-like fad and especially its hodgepodge of evangelising carpetbaggers.
Hopefully, by 2025 the data scientists, having acquired all the necessary skills, knowledge and experience in statistics, will be able to make it clear that this is about something with a very long and rich history. Statistics is a discipline with antiquity going back as far as the Babylonians of central-southern Mesopotamia – modern-day Iraq. So, not exactly the new kid on the block.
The probable and unavoidable downside is that there will also be a surfeit of bullshit babblers, as there is today, only more so. Those with the penchant to start every new piece of populist tripe with adjectives such as amazing, fantastic and biggest. For example: ‘amazing data feeding fantastic algorithms to create the biggest claptrap.’
That said, the industry hype, arbitrary zeal and dreary nonsense expounded by data analytics, artificial intelligence, big-data and data science pundits can be summed up by two pertinent quotes from Ben Goldacre’s book Bad Science:
“These corporations run our culture, and they riddle it with bullshit”, and “You cannot reason people out of a position that they did not reason themselves into.”
And, despite best efforts, big-data will not kill the statistician and big-data will become mere data – just like it always was.
Thanks for reading.
My new ebook, Laughing@BigData (Kindle Edition), is now available at the following Amazon country sites, where you can take a look inside for free:
United Kingdom: https://www.amazon.co.uk/dp/B086HS6VWX
This is my attempt to oil the wheels-of-industry during COVID-19.