To begin at the beginning
Fueled by the new fashions on the block, principally Big Data, the Internet of Things, and to a lesser extent Cloud computing, there’s a debate quietly taking please over what statistics is and is not, and where it fits in the whole new brave world of data architecture and management. For this piece I would like to put aspects of this discussion into context, by asking what ‘Core Statistics’ means in the context of the DW 3.0 Information Supply Framework.
Core Statistics on the DW 3.0 Landscape
The following diagram illustrates the overall DW 3.0 framework:
There are three main concepts in this diagram: Data Sources; Core Data Warehousing; and, Core Statistics.
Data Sources: All current sources, varieties, velocities and volumes of data available.
Core Data Warehousing: All required content, including data, information and outcomes derived from statistical analysis.
Core Statistics: This is the body of statistical competence, and the data used by that competence. A key data component of Core Statistics is the Analytics Data Store, which is designed to support the requirements of statisticians.
The focus of this piece is on Core Statistics. It briefly looks at the aspect of demand driven data provisioning for statistical analysis and what ‘statistics’ means in the context of the DW 3.0 framework.
Demand Driven Data Provisioning
The DW 3.0 Information Supply Framework isn’t primarily about statistics it’s about data supply. However, the provision of adequate, appropriate and timely demand-driven data to statisticians for statistical analysis is very much an integral part of the DW 3.0 philosophy, framework and architecture.
Within DW 3.0 there are a number of key activities and artifacts that support the effective functioning of all associated processes. Here are some examples:
All Data Investigation: An activity centre that carries out research into potential new sources of data and analyses the effectiveness of existing sources of data and its usage. It is also responsible for identifying markets for data owned by the organization.
All Data Brokerage: An activity that focuses on all aspects of matching data demand to data supply, including negotiating supply, service levels and quality agreements with data suppliers and data users. It also deals with contractual and technical arrangements to supply data to corporate subsidiaries and external data customers.
All Data Quality: Much of the requirements for clean and useable data, regardless of data volumes, variety and velocity, have been addressed by methods, tools and techniques developed over the last four decades. Data migration, data conversion, data integration, and data warehousing have all brought about advances in the field of data quality. The All Data Quality function focuses on providing quality in all aspects of information supply, including data quality, data suitability, quality and appropriateness of data structures, and data use.
All Data Catalogue: The creation and maintenance of a catalogue of internal and external sources of data, its provenance, quality, format, etc. It is compiled based on explicit demand and implicit anticipation of demand, and is the result of an active scanning of the ‘data markets’, ‘potential new sources’ of data and existing and emerging data suppliers.
All Data Inventory: This is a subset of the All Data Catalogue. It identifies, describes and quantifies the data in terms of a full range of metadata elements, including provenance, quality, and transformation rules. It encompasses business, management and technical metadata; usage data; and, qualitative and quantitative contribution data.
Of course there are many more activities and artifacts involved in the overall DW 3.0 framework.
Yes, but is it all statistics?
Statistics, it is said, is the study of the collection, organization, analysis, interpretation and presentation of data. It deals with all aspects of data, including the planning of data collection in terms of the design of surveys and experiments; learning from data, and of measuring, controlling, and communicating uncertainty; and it provides the navigation essential for controlling the course of scientific and societal advances[i]. It is also about applying statistical thinking and methods to a wide variety of scientific, social, and business endeavors in such areas as astronomy, biology, education, economics, engineering, genetics, marketing, medicine, psychology, public health, sports, among many.
Core Statistics supports micro and macro oriented statistical data, and metadata for syntactical projection (representation-orientation); semantic projection (content-orientation); and, pragmatic projection (purpose-orientation).
The Core Statistics approach provides a full range of data artifacts, logistics and controls to meet an ever growing and varied demand for data to support the statistician, including the areas of data mining and predictive analytics. Moreover, and this is going to be tough for some people to accept, the focus of Core Statistics is on professional statistical analysis of all relevant data of all varieties, volumes and velocities, and not, for example, on the fanciful and unsubstantiated data requirements of amateur ‘analysts’ and ‘scientists’ dedicated to finding causation free correlations and interesting shapes in clouds.
That’s all folks
This has been a brief look at the role of DW 3.0 in supplying data to statisticians.
One key aspect of the Core Statistics element of the DW 3.0 framework is that it renders irrelevant the hyperbolic claims that statisticians are not equipped to deal with data variety, volumes and velocity.
Even with the advent of Big Data alchemy is still alchemy, and data analysis is still about statistics.
If you have any questions about this aspect of the framework then please feel free to contact me, or to leave a comment below.
Many thanks for reading.
Catalogue under: #bigdata #technology big data, predictive analytics
[i] Davidian, M. and Louis, T. A., 10.1126/science.1218685