Please note: This is an edited version of a previous piece with a similar name, focusing solely on the three main Vs of Big Data.
What we’ve been told
We’ve been told that business Big Data is the greatest thing since sliced bread, and that its major characteristics are:
- massive volumes – so great are they that mainstream relational products and technologies such as Oracle, DB2 and Teradata just can’t hack it, and
- high variety – not only structured data, but also the whole range of digital data, and
- high velocity – the speed at which data is generated, transmitted and received
Which is a simple and straightforward means of classification. Big Data is about massive volumes, high variety and high velocity. Right?
It’s not about big
I have never bought into the idea that more data is necessarily better data, or that it provides better focus or leads to increased insight. In fact, I have been quite vocal with my contrarian opinion. Now, though, this view is getting some additional support, and from some surprising corners.
In a recent blog piece on IBM’s Big Data and Analytics Hub (Big data: Think Smarter, not bigger), Bernard Marr wrote that “the truth is, it isn’t how big your data is, it’s what you do with it that matters!”
Over at Fierce Big Data it was Pam Baker who stated that “the term big data is unfortunate because it’s really not about the size of the data”. (Big data is not about petabytes, but complex computing).
Elsewhere, SAS echoed similar sentiments on their web site: “The real issue is not that you are acquiring large amounts of data. It’s what you do with the data that counts.”
Well, apparently Big Data isn’t about “massive volumes” of data.
Strike one!
It’s not about variety
It is claimed that 20% of digital data is structured, but this claim is based on the problematic suggestion that structured data is uniquely relational.
It is also said that unstructured data includes CSV files and XML data, and that this makes up far more than 20% of the data generated. But this definition is wrong.
If anything, CSV data is structured, and XML data is highly structured, and it’s typically regular ASCII data. So it does not add variety, even though it is not structured in the ways that someone might expect, especially if that someone lacks the required knowledge and experience. Simply stated, CSV data is structured. It just lacks rich metadata, and that doesn’t make it unstructured.
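To make the point concrete, here is a minimal sketch in Python. The sample data and its column names are invented purely for illustration. It shows that a CSV file carries enough structure (a header row and delimited fields) to be queried directly, while the lack of type metadata means every value arrives as text and has to be converted by the reader.

```python
import csv
import io

# Hypothetical sample data: the column names and values are invented for illustration.
SAMPLE_CSV = """customer_id,region,order_total
1001,North,250.00
1002,South,99.50
1003,North,410.25
"""

def total_by_region(csv_text):
    """Sum order_total per region using only the structure in the CSV itself."""
    totals = {}
    # The header row supplies the field names, so each row becomes a dict.
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        # Values arrive as strings: the CSV has structure but no type metadata,
        # so the numeric conversion is our responsibility.
        amount = float(row["order_total"])
        totals[row["region"]] = totals.get(row["region"], 0.0) + amount
    return totals

print(total_by_region(SAMPLE_CSV))
```

The only thing the code has to bring to the party is the knowledge that order_total is a number, which is exactly the kind of metadata a relational schema would have recorded.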
“But”, I hear you say, “what about all the non-textual data such as multi-media, and what about the masses of unstructured textual data?”
Take it from me, most businesses will not be basing their business strategies on the analysis of a glut of selfies, juvenile twittering, home videos of cute kittens, or the complete works of William Shakespeare. Almost all business analysis (whether done by a professional statistician or a data scientist) will continue to be carried out using structured data obtained primarily from internal operational systems and external structured data providers.
Variety, Sir? No problem.
Strike two!
It’s not even about velocity
So, if we accept that Big Data isn’t really about massive data volumes or high data variety, then that leaves us with velocity. But if it isn’t about record-breaking VLDBs or significant data variety, then for most commercial businesses the management of data velocity becomes either less of an issue or no issue at all.
Even in some extreme circumstances, one can explore the suggestion that data sampling can remove issues with data volume as well as velocity.
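As a rough sketch of what sampling a fast-moving feed can look like, the Python below uses reservoir sampling. That choice is my assumption for illustration, since no particular sampling technique is prescribed here. It keeps a fixed-size uniform random sample from a stream of unknown length, so neither the total volume nor the arrival rate needs to be known in advance. The function name, the parameter k and the stand-in event feed are all hypothetical.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of up to k items from a stream of unknown length."""
    if rng is None:
        rng = random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Pick a slot in 0..i inclusive; the new item replaces an
            # existing one with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

# Hypothetical usage: a generator stands in for a fast-arriving event feed.
events = (n for n in range(100_000))
subset = reservoir_sample(events, k=100, rng=random.Random(42))
print(len(subset))
```

Memory use stays fixed at k items however much data flows past, which is the sense in which sampling can take the sting out of both volume and velocity.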
However, the fact that some software vendors and IT service suppliers set up this ‘straw man’ velocity argument and then knock it down with the ‘amazing powers’ of their products and services is quite another matter.
So, is it really about velocity?
Strike three!
So what is it really about?
Big Data is a dopey term, applied necessarily ambiguously to a surfeit of tenuously connected vagaries, and its time has come and gone. Let’s dump the Big Data moniker, and the 3 Vs along with it, and embrace the fact that data is data, and there will always be more of it.
So, let’s consider ‘all data’, principally for its utility in a given time and place.
If there is something that you are not sure about or have questions about, then please leave a comment below or email me.
Thanks very much for reading.