Martyn Richard Jones, Gif sur Yvette 23rd September 2017
Hi, Friends. As always, it’s great to be able to engage with you again. I am writing to you from the splendorous, wooded and verdant French town of Gif sur Yvette, and I wanted to take this opportunity to address aspects of data for an audience beyond the interests of data management and architecture people.
There is confusion about some of the fundamental aspects of business data, and there shouldn’t be. If we strip away all of the boloney, it’s a subject that is quite approachable. In short, ‘it is not rocket surgery’.
Now, that doesn’t mean that there isn’t a degree of complexity, ambiguity and fuzziness, and I will not attempt to simplify and trivialise important aspects of business data, but gaining a basic understanding of the various types of data is something that is within everyone’s grasp.
This is my attempt to provide a brief primer on what I consider to be key classes of business data.
So, with your active collaboration, we will be covering the following topics:
- Transaction data
- Reference data
- Master data
- Data Warehouse data
- Big Data data
As I have tried to emphasise before, the basics are really quite straightforward, even though a lot of the information about the subject can be frequently ambiguous and misleading.
That stated, just ignore the snakes and ladders of data and stick with me. In the end it will make sense… and if not, write to me and we can discuss it until we get it. Better still, leave a message on the blog.
Either way, I look forward to your comments, clarifications, criticisms and queries.
I will start with transaction data, because this is the data that I first came into contact with when I started out in data processing.
Transaction data are the data in transactions that we are interested in capturing. We’ve all heard of transactions; for a lot of us they are a daily activity. Paying for a coffee is a transaction. Transferring money from one bank account to another is a transaction. Ordering a book from Amazon is a transaction. Reserving a seat on a train is also a transaction.
In business data terms we can view transaction data as a record of a business event, even though there is an implied reciprocity in the deal.
When we buy a cup of coffee the transaction will consist primarily of what is being sold, the café latte, espresso or macchiato, or whatever (with or without a product code); maybe an indication of the quantity, single or double, etc.; the price we pay for the coffee; details of any taxes covered and the rates applied; and the date and time of the transaction. In addition, we may have data such as the ID of the barista; the unique bill number; and the ID of the client, especially if there is a loyalty scheme in operation.
If we are buying drinks for a group, we will essentially have a set of similar information with net and gross totals (including taxes and discounts) for the totality of items on the check. Simple stuff. Here we are not so interested in the details of who is involved in the transaction so much as what is being transacted and when. For example, one espresso at €1.50 on 2nd September 2016. Although we might be interested in who was involved in the transaction and where the transaction took place, it is not mandatory for all transactions.
The following is an example of some transaction data that a cafeteria might like to capture.
This is a simple bill. In this example we are interested in a group of items that were provided and the individual items that were sold. The order and the order items. The individual items, quantities and prices. We are also interested in the payment details.
As you can see, the core transaction data are more typically about items and numbers, which we can tie to involved parties, geography, sales channels, product lines and organisational structure.
Think of it this way. Transactional data are about what we sold (a product), by whom (a party), to whom (a party), via whom (a party), where (geographical location); in what quantities (volume); and, at what price.
Transactional data are the detailed records of transactions. These records are typically linked to reference data (e.g. a unique product name or ID) and master data (such as the details of the customer, the details of the product, and a pricing catalogue).
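To make the café bill a little more concrete, here is a minimal sketch in Python of what such a transaction record might look like. The class names, field names, product codes and the flat 10% tax rate are all my own assumptions for illustration; a real point-of-sale system will certainly differ.

```python
from dataclasses import dataclass, field
from datetime import datetime
from decimal import Decimal
from typing import List, Optional


@dataclass(frozen=True)
class OrderLine:
    product_code: str     # hypothetical code; points at reference/master data
    description: str
    quantity: int
    unit_price: Decimal   # net price per unit, before tax

    @property
    def net_total(self) -> Decimal:
        return self.unit_price * self.quantity


@dataclass
class Bill:
    bill_number: str
    issued_at: datetime
    currency: str                        # a currency code such as "EUR"
    tax_rate: Decimal                    # assumed flat rate, for illustration
    lines: List[OrderLine] = field(default_factory=list)
    barista_id: Optional[str] = None     # not mandatory for every transaction
    customer_id: Optional[str] = None    # e.g. only with a loyalty scheme

    @property
    def net_total(self) -> Decimal:
        return sum((line.net_total for line in self.lines), Decimal("0"))

    @property
    def tax_total(self) -> Decimal:
        return (self.net_total * self.tax_rate).quantize(Decimal("0.01"))

    @property
    def gross_total(self) -> Decimal:
        return self.net_total + self.tax_total


bill = Bill(
    bill_number="B-0001",
    issued_at=datetime(2016, 9, 2, 10, 15),
    currency="EUR",
    tax_rate=Decimal("0.10"),
    lines=[
        OrderLine("ESP", "Espresso", 2, Decimal("1.50")),
        OrderLine("LAT", "Café latte", 1, Decimal("2.80")),
    ],
)
print(bill.net_total, bill.tax_total, bill.gross_total)
```

Notice that the bill itself holds mostly items and numbers. The who and the where are optional keys that point elsewhere.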
CHF, CLP, CNY, EUR, GBP, ILS, JPY, MAD, SAR, USD, ZAR…
Have you guessed what they are yet?
Reference data are typically categorisation, classification and lookup data. Data such as currency codes (examples of which I have used above), country codes and the ‘Classification of Financial Instruments and Financial Instrument Short Name’ codes, are all examples of reference data. Fixed conversion rates (weights, temperature and length) are also reference data, together with calendar structures and constraints.
When compared to other classes of data, reference data in general are the data that are exposed to the least volatility. Product codes, organisation structure codes and international standard codes are added far less frequently than, for example, new customers, or other involved parties.
Reference data datasets are not complex, and typically consist of natural and surrogate keys, structural and hierarchical data (where applicable) and reference data entry descriptions.
For example, a business oriented currency table for a particular enterprise might contain the following entry for the Euro.
- Surrogate key: 101210121012978
- Coded alpha key: EUR
- Coded numeric key: 978
- Country/Region: European Union
- Currency: Euro
- Description: The legal currency of Eurozone members
Additional reference data may be held to indicate the countries in which a currency is legal tender, for example, countries that form part of the currency union known as the Eurozone.
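As a rough sketch, a currency reference table of this kind could be held as a simple lookup structure. The Python below is only an illustration: the `Currency` class, the `lookup_currency` function and the three-country subset of the Eurozone are my assumptions, not a description of any particular system.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class Currency:
    surrogate_key: int
    alpha_code: str       # coded alpha key
    numeric_code: str     # coded numeric key, kept as text
    region: str
    name: str
    description: str
    legal_tender_in: FrozenSet[str] = frozenset()


EUR = Currency(
    surrogate_key=101210121012978,
    alpha_code="EUR",
    numeric_code="978",
    region="European Union",
    name="Euro",
    description="The legal currency of Eurozone members",
    legal_tender_in=frozenset({"DE", "ES", "FR"}),  # illustrative subset only
)

# The lookup table, keyed by the natural (alpha) key.
CURRENCIES_BY_ALPHA: Dict[str, Currency] = {EUR.alpha_code: EUR}


def lookup_currency(alpha_code: str) -> Currency:
    """Return the reference entry for a code, or raise for unknown codes."""
    try:
        return CURRENCIES_BY_ALPHA[alpha_code.upper()]
    except KeyError:
        raise ValueError(f"Unknown currency code: {alpha_code!r}") from None


print(lookup_currency("eur").name)
```

Rejecting unknown codes is a deliberate choice here: reference data is only useful for classification if everything that uses it agrees on the valid values.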
Another example of the use of reference data is the Pantone Colour Matching System (PMS) from Pantone Inc., a New Jersey pioneer in the proprietary colour space. Pantone maintain reference data on a massive range of colours used across a wide range of industries. Pantone colours are categorised by a unique PMS number (for example, “PMS 239”). PMS colours are used extensively in advertising and especially in branding, and are also used in government regulation, legislation and military standards. The Pantone system also allows for many special colours to be produced, such as metallic and fluorescent colours.
Here is an example of Pantone classification:
However, reference data does not have to be an internationally agreed set of codes or coding regime; even the smallest of businesses may create their very own reference data.
Before the advent of master data management, master data was referred to according to the subject area that it represented. Common datasets in business would therefore be called things like Customer Information File, Customer Master and Product Information Database, and they would generally be far less integrated and enriched than, for example, the involved party and product data that we may find in some contemporary enterprise databases.
Although master data is non-transactional, it is used heavily to support transaction processing.
Typical business-oriented master data includes:
- Involved parties
- Consultants / Contractors
- Assets / Liabilities
- Contact Channels
The purpose of having master data is to provide a single source of the truth in terms of an enterprise’s subject area data.
For example, master data for a customer will include a wide variety of information related to the customer, such as:
- Customer surrogate key
- Customer short name and long name
- Customer geographical, location and channel details (addresses, emails, etc.)
- Customer contact details – individuals and their master data
- Customer account management details
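A small sketch may help to show how a customer master record supports transaction processing without being transactional itself. Everything named here (the `CustomerMaster` class, the `CUSTOMERS` table, the sample customer and orders) is hypothetical and purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CustomerMaster:
    surrogate_key: int
    short_name: str
    long_name: str
    addresses: List[str] = field(default_factory=list)
    emails: List[str] = field(default_factory=list)
    account_manager: str = ""


# One shared record per customer: the single source of the truth.
CUSTOMERS: Dict[int, CustomerMaster] = {
    1001: CustomerMaster(
        surrogate_key=1001,
        short_name="ACME",
        long_name="Acme Example Trading Ltd",
        addresses=["1 Example Street"],
        emails=["orders@acme.example"],
        account_manager="J. Doe",
    ),
}

# Transactions carry only the customer key, not a copy of the details.
orders = [
    {"order_id": "O-1", "customer_key": 1001, "amount": 250},
    {"order_id": "O-2", "customer_key": 1001, "amount": 100},
]


def describe(order: dict) -> str:
    """Resolve the customer key against the master data at read time."""
    customer = CUSTOMERS[order["customer_key"]]
    return f"{order['order_id']}: {customer.long_name} ({customer.account_manager})"


print(describe(orders[0]))
```

Because the orders hold only the key, a change to the master record (a new account manager, say) is seen by every transaction that refers to it.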
So, you might ask, isn’t master data a bit like reference data?
Well, not quite.
Reference data is generally used solely to categorise and classify, whereas master data supports the entirety of an enterprise’s operational systems, and may also be used in conjunction with Data Warehousing, Business Intelligence and Statistics / analytics.
One important thing to remember is the need to separate the understanding of master data from the quagmire of good intentions and impractical ideas that much of master data management has become.
Master data refers to the data itself, the customers, the products, the suppliers, etc.
Master data management is frequently a misguided, costly and badly-architected attempt to create a near-real-time master-data consolidation machine and virtual system of record for the entire range of data subject areas and applications in an enterprise, typically starting with customers and products. That’s fine. It’s a good ambition.
However, there are as many variations on the theme of Master Data Management as there are MDM consultants, and the numbers on both sides are growing fast. A typical malaise of contemporary IT.
For me, Master Data Management has one key role: to remove all business data ambiguity that isn’t irrelevant. This has to be done without compromising the essential ongoing processes of the business. If it can’t do that, then it isn’t worth a dime.
Data Warehouse Data
There are all types of data in the data warehouse, including: transaction data; event data; reference data; classification and categorisation data; analytical data; atomic data; lightly and highly summarised data; enriched data; metadata; and, master data.
IBM has an excellent approach to the conceptual modelling of data in a financial industry data warehouse. Here is my representation and explanation of that model.
This is a simplified high-level example of business data objects found in certain organisations. In the above diagram I have reused an industry example of nine business data objects to represent operational data.
What follows is a summary list of the nine key groups of business data – identified in the previous diagram – needed to have coherent and cohesive operational awareness. These data groups are also frequently referred to as business data objects.
A: Party embodies all of the participants that may have contact with the organisation or that are of interest to the organisation and about which the organisation maintains data. This includes data about the organisation itself; data about external organisations; data about external and internal individuals; and, data about the roles of involved parties.
Party data is simply data about organisations and people.
B: Arrangement represents a prospective or existing agreement, between two or more individuals, organizations or organizational units that provides and affirms the rights, rules and obligations associated with a transaction between parties.
Arrangement data is simply data about formal (and sometimes informal) arrangements. You give me beer, I give you money. I commission you to build a bridge, you build the bridge.
C: Condition describes the specific requirements that pertain to how the business is conducted and includes information such as prerequisite or qualification criteria and restrictions or limits associated with the requirements. Conditions can apply to various aspects of an enterprise’s operations, such as the operational parameters of a resource item, the sale and servicing of products, the determination of eligibility to purchase a product, the authority to perform business transactions, the assignment of specific general ledger accounts appropriate for different business transactions, the required file retention periods for various types of information kept by an enterprise and the selection criteria for a market segment.
What does that mean? For example, a condition may be ‘if a customer purchases so much in a month, then next month they get an additional discount’.
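One way to picture this is to hold the condition as data rather than burying it in program logic. The sketch below assumes a spend threshold of 500.00 and a 5% discount; both figures, and the `DiscountCondition` name, are invented for the example.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class DiscountCondition:
    monthly_spend_threshold: Decimal   # qualification criterion
    next_month_discount: Decimal       # what the customer earns, as a fraction

    def discount_for(self, spend_last_month: Decimal) -> Decimal:
        """Return the discount earned by last month's spend, or zero."""
        if spend_last_month >= self.monthly_spend_threshold:
            return self.next_month_discount
        return Decimal("0")


# Hypothetical values: spend 500.00 or more, get 5% off next month.
loyalty = DiscountCondition(Decimal("500.00"), Decimal("0.05"))
print(loyalty.discount_for(Decimal("620.00")))
```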
D: Product/Service describes the services, merchandise or facilities that can be offered, sold or purchased by the enterprise, its competitors and other Involved Parties during the normal course of its business. This concept also includes goods and services that are of interest to the enterprise such as supplies for manufacture.
What does that mean? For example, a sports shirt from Adidas might have a product name, it even might have information about the team, for example, and any customisation, e.g. having the name of Bale on the back, and may even list the individual parts that the shirt is made up of. In fact, product data could include data from the entire design, development and manufacturing chain.
E: Location covers a place where something can be found, a destination of information or a bounded area, such as a country or state, about which the enterprise wishes to keep information.
What does that mean? You are in a supermarket, you can’t find the Crottin de Chavignol so you ask. You get back instructions on how to find a store location where the product is on display; usually. Location could also mean, location of the banking branches, the product outlets, the train stations, in fact, anything to do with location.
F: Classification is used to organize and manage specific business information by defining structures that represent classification categories. Classification also organizes and manages groups of business concepts that apply to multiple concepts.
What does that mean? Cristiano Ronaldo is fantastic, Gareth Bale is fast, and Sergio Ramos is tough, intelligent and calm under pressure. Classification. We also classify wines, by taste, colour, aroma, variety of grape, age and origin, amongst other things.
G: Business Direction/Organisation Direction refers to and records expressions of a party’s intent with regard to the manner and environments in which it wishes to carry out its business. Business direction items contain, keep data about, and are used to support the enterprise’s business and financial plans, policies, procedures and schedules.
What does that mean? Basically, this is data that supports business strategy. For example, you know what you should be doing and what effects this activity should have, you also want to ensure that what you are doing is producing the desired outcomes, or indeed, if what you are doing is in effect what you planned to do.
H: Events describe a happening about which the organisation wishes to keep information as a part of carrying out its mission and conducting its business.
What does that mean? This records ‘things that happen’, and also anticipated events that don’t happen.
I: Resource object includes and describes any value item, either tangible or intangible, that is owned, managed or used by, or of specific interest to the organisation in the course of carrying out its business and working towards accomplishing its mission.
What does that mean? For example: people’s time, knowledge and experience (not the people themselves); good assets (cash in the bank, fast moving goods, ownership of sought-after property and valuable liquid financial instruments); and bad assets (liabilities, assets that are money pits, goods that don’t sell, and property that can’t be sold and is losing value).
The key facets of operational awareness detailed above constitute a potential of fundamental importance in the formulation of organizational strategy.
Timely, accurate and appropriate data at this level can temper ambition with the facts on the ground, with operational insight, and with the effectiveness of time and place utilisation.
So, what is a Data Warehouse?
Bill Inmon, the father of Data Warehousing, defines it as being “a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision making process.”
Subject Oriented: The data in the Data Warehouse is organised conceptually (the big canvas), logically (detailing the big picture) and physically (detailing how it is implemented) by subjects of interest to the business, such as customer and product.
The thing to remember about subject areas is that they are not created ad-hoc by IT according to the sentiments of the time, e.g. during requirements gathering, but through a deeper understanding of the business, its processes and its pertinent business subject areas.
Integrated: All data entering the data warehouse is subject to normalisation and integration rules and constraints to ensure that the data stored is consistently and contextually unambiguous.
Time Variant: Time variance gives us the ability to view and contrast data from multiple viewpoints over time. It is an essential element in the organisation of data within the data warehouse and dependent data marts.
Non-Volatile: The data warehouse represents structured and consistent snapshots of business data over time. Once a data snapshot is established, it is rarely if ever modified.
Management Decision Making: This is the principal focus of Data Warehousing, although Data Warehouses have secondary uses, such as complementing operational reporting, financial planning and statistical analysis.
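The time-variant and non-volatile properties are perhaps the easiest to show in a few lines of code. What follows is a toy sketch, assuming a single subject (customer) and a single attribute (a credit rating); the function names `load_snapshot` and `rating_as_of` are mine, and a real warehouse would of course sit in a database rather than a Python dictionary.

```python
from bisect import bisect_right
from datetime import date
from typing import Dict, List, Optional, Tuple

# customer key -> list of (effective date, credit rating), kept in date order
_history: Dict[int, List[Tuple[date, str]]] = {}


def load_snapshot(customer_key: int, effective: date, rating: str) -> None:
    """Append a new snapshot; earlier snapshots are never updated or deleted."""
    rows = _history.setdefault(customer_key, [])
    if rows and effective <= rows[-1][0]:
        raise ValueError("snapshots must be loaded in increasing date order")
    rows.append((effective, rating))


def rating_as_of(customer_key: int, when: date) -> Optional[str]:
    """Return the rating that was in force on a given date, if any."""
    rows = _history.get(customer_key, [])
    dates = [effective for effective, _ in rows]
    position = bisect_right(dates, when)
    return rows[position - 1][1] if position else None


load_snapshot(1001, date(2017, 1, 1), "A")
load_snapshot(1001, date(2017, 6, 1), "B")
print(rating_as_of(1001, date(2017, 3, 15)))
```

Loading the June snapshot does not overwrite the January one, which is what lets us ask what the business looked like on any given date.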
Metadata is simply information (and data) about data. Metadata can cover the ‘who, what, why, when, where and how’ of data. It doesn’t quite cover the ‘eh?’ of data yet, but maybe, given time, when the discipline of metadata management matures, we can incorporate that as well.
Some examples of metadata are:
- Employee ID – A unique key used to uniquely identify a past or present employee
- Product group – a unique identifier used in grouping products according to predefined business criteria
- Credit rating – an estimate of the ability of a person or organization to fulfil their financial commitments, based on previous dealings
- Event code – a code that uniquely identifies an event in a business process
- Delivery option – an attribute that captures how the goods will be delivered
- Component colour – a unique colour code as defined in the Pantone Matching System
- Invoice grand total – the sum of the order line item total prices plus all required taxes, etc. and modified by fixed discounts and other adjustments
There are a number of classes of metadata. Here I will describe four main types of metadata.
Business metadata is arguably the most important metadata for an enterprise. It describes business data in business terms, and covers the ‘who, what, why, when, where and how’ of business data.
Business metadata is provided at multiple levels of abstraction. In simple terms this means that we will have business metadata about classes of business data (involved party, product, materials, etc.) as well as business metadata down to the granularity of individual business data attributes such as ‘invoicing date’ or derived values, such as ‘total invoice value’ and even reference data values, such as those previously mentioned in the Pantone Matching System reference.
Business metadata can also include full textual descriptions of data items, data objects, and data attributes and data domains.
Business related metadata can also include business rules, data quality rules and valid values for business reference data. Business metadata also includes business requirements and functional requirements, but these are not within the scope of this primer, which is primarily data oriented.
The language used to capture business metadata is primarily business oriented.
Technical metadata is used to describe the technology related infrastructure, supply management, production, logistics, structure and content aspects of data.
Examples of technical metadata may include:
- Database connectivity information
- Database schemas and data dictionary information
- Technical descriptions of data transformations
- Source to target mapping, when moving data from one system to another
Management metadata might be data collected that informs us about the usage of data. Management metadata tends to be technical data that can be used to manage the infrastructure that supports data in motion and data at rest, and to inform the quality appraisal of an organisation’s data governance. For example, if there are performance issues with a database, management metadata might be used to identify the cause and effect.
In simple terms, audit metadata is data about data that is used to reassure an organisation that what should be done (in terms of data) is getting done, and what should not be done (again, in terms of data) isn’t being done.
Data quality issues can also be highlighted through audit metadata.
Of course, there is more to metadata than this. For example, when data are exchanged between distinct systems we can have metadata that describes how that data is translated.
Imagine a European corporate headquarters that wants to consolidate data from each of its subsidiaries, and for example each subsidiary has its own unique sales and marketing system. How does metadata help?
You have issues of different names for the same thing, same names but for different things, and a whole range of issues waiting to trip you up. Having metadata that shows how the centralised data relates to subsidiary data, is absolutely essential in order to provide trust, viability and usability. So, if your data integration metadata is accurate, it will greatly ease the task of today’s data integration whilst also building up a coherent and cohesive platform for longer range goals of continuous information integration and a more well-formed view of the competitive elements of a business.
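Here is a minimal sketch of what such integration metadata might look like when it is put to work. The two subsidiaries, their field names and the central field names are all made up for the example; the point is only that the mapping is held as data and applied mechanically.

```python
from typing import Dict

# Hypothetical source-to-target mappings, one per subsidiary system:
# local field name -> centralised field name.
FIELD_MAPPINGS: Dict[str, Dict[str, str]] = {
    "subsidiary_fr": {"client_id": "customer_key", "ca_net": "net_revenue"},
    "subsidiary_de": {"kunden_nr": "customer_key", "umsatz_netto": "net_revenue"},
}


def to_central(source_system: str, record: Dict[str, object]) -> Dict[str, object]:
    """Rename a subsidiary record's fields to the centralised names."""
    mapping = FIELD_MAPPINGS[source_system]
    unmapped = set(record) - set(mapping)
    if unmapped:
        # Refuse to guess: an unmapped field means the metadata is incomplete.
        raise KeyError(f"No mapping for fields: {sorted(unmapped)}")
    return {mapping[name]: value for name, value in record.items()}


print(to_central("subsidiary_fr", {"client_id": 7, "ca_net": 120.0}))
print(to_central("subsidiary_de", {"kunden_nr": 7, "umsatz_netto": 80.0}))
```

Different names for the same thing end up under one name at the centre, and a field with no mapping is flagged rather than silently passed through.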
Despite all of the evidence racked up against the vague, mercenary and otiose peddlers of big-data snake oil, not everything positive about big data is boloney, fabrication and flimflam.
Doug Laney of Gartner describes big data in terms of the major characteristics that turn data into big data: volume, velocity and variety.
In simple terms this implies that data becomes big data when:
- The volume of data generated is massive.
- The speed at which data is generated is immense.
- The variety of big data formats is colossal.
If you record the temperature of the environment every second rather than every minute you will naturally generate more data. Over a twenty-four hour period this amounts to 86,400 readings instead of 1,440, which is 84,960 additional data points; a sixty-fold expansion in data volumes and data velocity. That’s volume and velocity in a nutshell. If in addition to recording the temperature of the sea every second, we wish to capture the sounds of the sea and the image of the sea at a particular point, then we are expanding the variety of the data we capture, as well as adding to the volumes and velocity of data generation.
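For anyone who wants to check the per-second versus per-minute arithmetic for a single sensor over twenty-four hours, a few lines of Python will do it. The variable names are mine.

```python
SECONDS_PER_DAY = 24 * 60 * 60   # one reading per second
MINUTES_PER_DAY = 24 * 60        # one reading per minute

additional_points = SECONDS_PER_DAY - MINUTES_PER_DAY
expansion_factor = SECONDS_PER_DAY // MINUTES_PER_DAY

print(SECONDS_PER_DAY, MINUTES_PER_DAY)     # 86400 1440
print(additional_points, expansion_factor)  # 84960 60
```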
How useful all this might be is anyone’s guess. But, useful or not, it’s still data.
Data. It’s all just data
At a European Big Data conference in Madrid a couple of years back I was interviewed by the organisers (the video is on YouTube). In it I was asked what term I preferred to use, such as ‘Big Data’, ‘Small Data’, ‘Smart Data’, etc.
My reply, “just data”.
At the end of the day, no matter what novel, interesting and innovative ways we come up with for classifying and categorising data, data are just data.
That’s all folks
As you can see, there is a bit of overlap amongst these different classes of data. But this shouldn’t be a worry. It’s something to be understood and taken into consideration. Our understanding is important. That we can communicate well, more so. Being able to communicate imperfectly but well is more important than not being able to communicate at all.
Business transactions and events will generate transaction data, but that data typically also depends on reference data and master data, and will become increasingly tied in with metadata, data warehouse data, analytics data store data and unstructured data.
I’ll leave you with this handy little diagram that I use in order to prompt and provoke debate and dialogue around the subject of enterprise data. Because, for me, the biggest problem of enterprise data is that we don’t talk about it anything like as much as we should, so provoking discussion, and succeeding at that goal, is in my opinion something well worth trying.
That’s all from me for today. Many thanks for giving me your time and please don’t hesitate to leave your comments below. So, until our paths cross next time.
Bye for now.
You can follow us on Twitter @GoodStratTweet
Martyn,
The one thing that all data management categorization schemas leave out is a mechanism or model for managing individuals and the need to ‘pin’ them all together as and where needed. Your coffee transaction is a good example. “Ricky Raveons” should have the same Class (Person), ID (1234ae46c33f267) and Status (‘Active’) independent of his appearance in transactional, reference, master, meta or ‘x’ data collections. Yours is an elegantly exhaustive description of the various functional classes that data can occupy. Adding a universal method of orchestrating them would be the cherry on top.
John O’Gorman
Principal and Chief Disambiguation Officer