Martyn Richard Jones
Mountain View, 22nd January 2015
This article is the first in a series discussing aspects of the use of architectural patterns in the Cambriano Information Supply Framework 3.0.
The term architectural pattern may sound grand, misleading or daunting, but it’s really quite a simple concept. It’s like writing a function in a programming language to log in to a database, check that the connection is alive and working, and report back the success of the connection request. If that function can be reused either in the same application development, in the same IT shop or in IT in general (e.g. Java code to connect and test the connection to SQL Server), then it’s well on its way to becoming an architectural pattern. Of course, there are much more sophisticated architectural patterns, but generally a pattern is a simplified and generic template for addressing a commonly occurring problem. As with much in architecture, less usually turns out to be more.
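To make the idea concrete, here is a minimal sketch (in Python, purely for illustration) of such a reusable "connect and check" function. It assumes a generic DB-API style driver; SQLite is used only so the example is self-contained, and the probe query would change per database (e.g. "SELECT 1 FROM DUAL" on Oracle).

```python
# A minimal sketch of a reusable "connect and verify" function of the kind
# described above. sqlite3 stands in for whatever driver you actually use.
import sqlite3

def open_verified_connection(connection_factory, probe_query="SELECT 1"):
    """Open a connection, confirm it is alive and report the outcome."""
    try:
        conn = connection_factory()
        cursor = conn.cursor()
        cursor.execute(probe_query)     # cheap round trip to prove the link works
        cursor.fetchone()
        return conn, True, "connection verified"
    except Exception as exc:            # drivers raise different exception types
        return None, False, f"connection failed: {exc}"

# Example usage with an in-memory SQLite database.
conn, ok, message = open_verified_connection(lambda: sqlite3.connect(":memory:"))
print(ok, message)
```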
In this article I will be looking at patterns for the process known as ETL (Extract, Transform and Load). ETL is the typical mechanism used to take data from source systems (which may be systems of record), transform it (putting the data into a conformed format that real people can understand and use) and finally load the resulting data into the Enterprise Data Warehouse, from where it is subsequently loaded into and used by departmentally oriented, discipline-focused or special-project Data Marts.
The Cambriano Information Supply Framework 3.0 is the architectural blueprint for the integration of Big Data, Analytics (including Textual Analytics) and Structured Intellectual Capital (Knowledge Management) with Bill Inmon’s definitive Enterprise Data Warehouse architecture and process.
Links to more information about the Data Supply Framework (including 4th generation Data Warehousing) are listed at the end of this article.
In subsequent articles I will be looking at architectural patterns across the DSF 4.0 landscape, including patterns for Big Data and Analytics.
The following diagram illustrates the overall drivers for the Information Supply Framework.
Fig. 1 – DW 3.0 Information Supply Framework
From the right, the consumers and prospective consumers of data and information create the demand for information and data.
The middle process of ‘data processing, enrichment and information creation’ strives to meet the business demands and also ‘provokes’ business demands for data and information.
From the left, the data sources provide data to meet real and secondary demands.
As stated previously, this article will focus on ETL architectural patterns as they apply to conventional landscape architecture patterns in Enterprise Data Warehousing (EDW / DW, Data Marts, Operational Data Store) and will not include patterns such as Data Governors or the Analytic Data Store (which I shall cover in later articles).
For those unfamiliar with the term ETL, it is the most common method in IT for moving data from a source of data (such as an operational database containing customer information) to a target database (such as one used for reporting). It’s a logistics method for data (it copies data from where it is to where it is needed) and it’s a selection, manufacturing and packaging method (for selecting the data that is needed, enriching the data as needed and delivering the data in a form that can be used).
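By way of a toy illustration, here is a deliberately small sketch of that extract, transform and load flow. The table names and the single demo row are invented for the example, and SQLite stands in for the real source and target systems.

```python
# A tiny, illustrative extract / transform / load flow.
import sqlite3

def extract(source):
    return source.execute("SELECT id, name, country FROM src_customers").fetchall()

def transform(rows):
    # Conform the data into a shape "real people can understand and use".
    return [(rid, name.strip().title(), country.upper()) for rid, name, country in rows]

def load(target, rows):
    target.executemany("INSERT INTO dw_customers VALUES (?, ?, ?)", rows)
    target.commit()

source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
source.execute("CREATE TABLE src_customers (id INTEGER, name TEXT, country TEXT)")
source.execute("INSERT INTO src_customers VALUES (1, '  ada lovelace ', 'gb')")
target.execute("CREATE TABLE dw_customers (id INTEGER, name TEXT, country TEXT)")

load(target, transform(extract(source)))
print(target.execute("SELECT * FROM dw_customers").fetchall())  # [(1, 'Ada Lovelace', 'GB')]
```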
Anyway, the ETL I am focusing on in this article is highlighted (by the green boxes) in the following high-level conceptual view.
Fig. 2 – DW 3.0 Information Supply Framework
Analysis Patterns for ETL
I will not dwell too long on Analysis Patterns for ETL as this requires separate treatment from the discussion of ETL architectural patterns.
Needless to say, the usage of ETL Analysis Patterns should be a constant in Data Warehousing, regardless of whether the patterns used are client specific, industry specific or generic.
Typical analysis patterns cover:
- Structured requirements gathering and knowledge elicitation
- Data steward / platform owner exploration workshops and data provision negotiations
- Comprehensive source to target mapping, including the collection, enrichment and communication of business, management and technical metadata
I will be dealing with these and other pattern areas in subsequent articles.
Why do we use ETL patterns?
Creating reusable code structures and logic is important in most IT development, and it is just as important in ETL processes and structures. The modularisation of the ETL process helps avoid unnecessarily rewriting the same tested code, and reduces the total effort required to build and maintain ETL architectures.
ETL patterns can also be used by developers who have little or no idea of what they are doing, but who know enough to be able to drop in a component knowing what that component’s inputs and outputs should notionally be. This is probably why the approach is so popular with offshored project development.
Architectural Patterns in ETL
Now we have reached the main focus of this article. So, the ETL architectural patterns I wish to briefly address are:
- Data extraction patterns – including early change data capture patterns
- Data mapping patterns
- Data cleansing patterns
- Data transformation patterns
- Data loading patterns – including change data capture patterns
- Post-load processing patterns – including late change data capture patterns
- Metadata management patterns – business, technical and management metadata
- Logging and auditing patterns
- Error handling
- Process and data integrity
Given the nature of this type of article, the list focuses on the major areas. Of course, there will be some overlap between the process areas mentioned above, and some of these overlaps will be identified and explained shortly.
As hinted at previously, this article does not include a detailed discussion of streamed, queued or governed data mechanisms. These topics will be addressed in later articles in this series.
So, without more ado let’s get the ball rolling.
Simple data extraction patterns – Source / Intermediate / Target EDW
Simple data extraction patterns are required where data is not sourced directly from the system of record but is instead staged (in one form or another) in a staging area that typically sits between the source systems and the target EDW platform, making for a three-step process: source, intermediate staging area and target EDW.
The following patterns are frequently used in staging:
- Full or partial export from operational databases to an intermediate file store staging area as (ASCII) files (a minimal sketch of this pattern appears at the end of this section)
- Full or partial export of operational database reference data and operational transaction logs to intermediate file store staging area as files
- Full or partial export or replication of operational databases to an equivalent database management system platform for staging
- Full or partial export of operational database reference data, transaction data and/or operational transaction logs to an equivalent database management system platform for staging
- Database specific export to export format file copied to a staging area for subsequent unpacking and processing
- Patterns for taking data off queues or streams and storing it in text files, or loading it directly into an intermediate database management system such as Greenplum, Postgres, Oracle, SQL Server, EXASol, Vertica or Teradata.
There are also patterns for using distributed file systems (such as those provided by MapR or Hortonworks) for intermediate ETL processing, whereby the commodity platform is used as a staging area.
Most of these patterns are relatively simple and straightforward, and don’t actually require an immense amount of work – even if they are reinvented as patterns specific to the IT shop of a particular business. They can come in handy when there is little to no experience of how to go about doing these things.
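As an illustration of the first staging pattern listed above, here is a hedged sketch of a full or partial export of an operational table to delimited files in a staging area. The table, column and path names are assumptions, and a production version would add compression, file naming conventions and a manifest for downstream loads.

```python
# A hedged sketch of exporting an operational table to CSV in a staging area.
import csv
import sqlite3
from pathlib import Path

def export_to_staging(conn, table, staging_dir, where_clause=None):
    """Write a full (or, with where_clause, partial) extract of `table` to CSV."""
    query = f"SELECT * FROM {table}"
    if where_clause:
        query += f" WHERE {where_clause}"    # e.g. "updated_at >= '2015-01-01'"
    cursor = conn.execute(query)
    out_file = Path(staging_dir) / f"{table}.csv"
    with open(out_file, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([col[0] for col in cursor.description])  # header row
        writer.writerows(cursor)                                  # data rows
    return out_file

# Usage (illustrative): export_to_staging(sqlite3.connect("ops.db"), "orders", "/staging")
```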
Data mapping patterns – EDW
There are a number of architectural artefacts available that provide patterns for the process of mapping data through from source to target.
These patterns can take a number of forms:
- Spreadsheet layouts for source to target mapping – these can be used, in conjunction with code or not, to extract the metadata and generate some of the ETL processes. Teradata consultants in the Nordic countries, for example, have used ETL generation processes that take source-to-target mappings defined in MS Excel and convert them into simplified ETL processes (a minimal sketch of this approach appears at the end of this section).
- On a more sophisticated level, IBM, for example, provides a software utility (the Graphical Data Mapping Editor in IBM Integration Bus V9.0) for use in the analysis phase, which allows for the creation and management of partially directed data mapping patterns. This approach is not specific to EDW.
- Another tool sub-class, also provided by IBM and others, comprises EDW-oriented mapping tools; IBM’s mapping tool is named InfoSphere Data Architect (IDA). These tools provide an environment in which mappings can easily be created and modified; a good level of version control; straightforward reporting on mappings and mapping progress; and the ability to export mapping data to other formats.
- All major ETL tools provide a full set of mapping capabilities, and this includes InfoSphere (as mentioned previously), SAP BusinessObjects (Data Integrator / Data Services), Talend, Oracle ODI, Syncsort, Tibco, Pitney Bowes, Open Text, Sybase ETL, Informatica, Ab Initio, Altova, Clover ETL, Pentaho, SAS Data Integration Studio and Microsoft SSIS. ETL architectural patterns provided with these tools also cover most of the other areas as mentioned in this article (from extract, through transformation, to load – EDW and Data Marts).
Some mapping patterns, templates and software are available for free (as community releases or express versions), or as versions made available on a fixed-term or usage-based trial basis.
Another facet that must be taken into account, moving forward, is the rise of Textual Analysis. There are Textual ETL tools on the market, but as yet there has not been much by way of Textual Analysis and Textual ETL architectural patterns.
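To illustrate the spreadsheet-driven approach mentioned in the first bullet above, here is a minimal sketch in which source-to-target mappings held in a simple sheet (represented here as CSV) are turned into INSERT ... SELECT statements. The column headings and mapping rows are invented for the example; real generators also emit type casts, lookups and the associated technical metadata.

```python
# A hedged sketch: generate simple load SQL from a source-to-target mapping sheet.
import csv
import io
from collections import defaultdict

mapping_sheet = io.StringIO(
    "target_table,target_column,source_table,source_expression\n"
    "dw_customer,customer_id,crm_customer,cust_no\n"
    "dw_customer,full_name,crm_customer,TRIM(first_name) || ' ' || TRIM(last_name)\n"
)

def generate_load_sql(mapping_rows):
    by_target = defaultdict(list)
    for row in mapping_rows:
        by_target[(row["target_table"], row["source_table"])].append(row)
    for (target, source), rows in by_target.items():
        cols = ", ".join(r["target_column"] for r in rows)
        exprs = ", ".join(r["source_expression"] for r in rows)
        yield f"INSERT INTO {target} ({cols})\nSELECT {exprs}\nFROM {source};"

for statement in generate_load_sql(csv.DictReader(mapping_sheet)):
    print(statement)
```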
Data cleansing patterns
IBM and Pitney Bowes have spent time and effort on building up knowledge and experience of data cleansing patterns, not only in Data Warehousing but also in the broader area of data integration.
IBM described data cleansing patterns in these terms:
“The data cleansing pattern specifies rules for data standardization, matching, and survivorship that enforce consistency and quality on the data it is applied against. The data cleansing pattern validates the data (whether persistent or transient) against the cleansing rules and then applies changes to enforce consistency and quality… This pattern specification helps you, as a data or application architect, to make informed architectural decisions and to improve decision guidelines.” Source: http://www.ibm.com/developerworks/library/ws-soa-infoserv3/
Typically, the recycling of data cleansing patterns is achieved through the delineation of cleansing rules that can be applied at the field, record and data object level through a service, a stream or a batch process, the bulk of which should ideally occur before, and not during, the transformation steps. The second area of pattern usage is the reuse of these rules in an OLTP/operational environment for the execution of the cleansing rules.
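As a simple illustration of rule-based cleansing of this kind, here is a hedged sketch in which small field-level rules are registered once and applied record by record, so the same rules can be reused in batch ETL or in front of an operational system. The rules and field names are illustrative assumptions only.

```python
# A hedged sketch of reusable, composable field-level cleansing rules.
import re

field_rules = {
    "email": [str.strip, str.lower],
    "postcode": [str.strip, str.upper, lambda v: re.sub(r"\s+", " ", v)],
    "name": [str.strip, str.title],
}

def cleanse_record(record, rules=field_rules):
    """Apply every rule registered for each field; unknown fields pass through."""
    cleansed = {}
    for field, value in record.items():
        for rule in rules.get(field, []):
            value = rule(value)
        cleansed[field] = value
    return cleansed

print(cleanse_record({"name": "  ada LOVELACE ", "email": " Ada@Example.COM ", "postcode": "sw1a  1aa"}))
```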
Data transformation patterns – EDW
What I want to focus on briefly here are transformation pattern types and transformation rule patterns.
There are many formalized architectural patterns for theoretical source to target transformation; however, they are generally not available as plug, modify and play patterns, and they tend to be untested approaches to unstated or exotic requirements.
As mentioned previously, some of the major ETL (and Data Integration) software vendors do provide generic sets of patterns for data transformation.
Data loading patterns – EDW
EDW data loading patterns cover a wealth of areas, including those related to full and partial loads; stereotyping and partitioning; parallelism; replication; change data capture/recapture; data ageing and archiving; historicity; and categorization and classification. They also encompass patterns related to connectivity; security; logging; fail-safe or fail-soft recovery; management metadata collection; and auditing.
Please note that I am not addressing the creation of hub and spoke dimensional models in this segment. For clarity and coherence, the loading patterns for the EDW will not be of the same class and nature as the data loading patterns for Data Marts and Dimensional Models.
Typical patterns in EDW loading involve the use of fast, short and simple paths to data ingestion. These patterns will be determined quite significantly by the nature of the underlying infrastructure technology. A superfast and reliable load pattern for Oracle 12c is not the same as one for Teradata or for SQL Server, for example, nor, for that matter, for the platforms they are running on.
Patterns may also be provided to load data into data models created using 3NF, ER modelling and Data Vault modelling. Typically, dimensional modelling will not be a part of the EDW and will therefore not have associated patterns.
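To give a flavour of one of the simpler change data capture loading patterns, here is a hedged sketch of an incremental load driven by a high-water mark. The etl_watermarks table and the updated_at column are assumptions; a real implementation would also handle deletes, late-arriving data and restartability.

```python
# A hedged sketch of a high-water-mark incremental load into the EDW.
def incremental_load(source, target, table):
    last_mark = target.execute(
        "SELECT COALESCE(MAX(high_water_mark), '1900-01-01') FROM etl_watermarks WHERE table_name = ?",
        (table,),
    ).fetchone()[0]

    # Pull only the rows changed since the last recorded high-water mark.
    changed = source.execute(
        f"SELECT * FROM {table} WHERE updated_at > ?", (last_mark,)
    ).fetchall()

    if changed:
        placeholders = ", ".join("?" for _ in changed[0])
        target.executemany(f"INSERT INTO {table} VALUES ({placeholders})", changed)
        new_mark = max(row[-1] for row in changed)   # assumes updated_at is the last column
        target.execute(
            "INSERT INTO etl_watermarks (table_name, high_water_mark) VALUES (?, ?)",
            (table, new_mark),
        )
    target.commit()
    return len(changed)
```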
Data transformation patterns – Data Marts
These are patterns that are used to transform data located in the EDW into data that is used in Data Marts.
Generally speaking, only a limited degree of data transformation should be carried out here. What transformation is done should generally be constrained to the aggregation of data and the creation of additional dimensional facets. There is therefore scope for a small number of highly reusable patterns for dimensional transformation, but not for data transformation itself.
Also, no data cleaning should be contemplated at this stage. Data quality needs to be fixed before the data is stored in the EDW (although data may be staged on the same platform and in the same DBMS).
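As an illustration of the kind of narrow transformation that is acceptable at this stage, here is a hedged sketch that aggregates EDW data into a mart-level summary and derives one additional dimensional facet (order month). The table names are assumptions, and the date formatting shown is SQLite-flavoured; other engines use their own date functions.

```python
# A hedged sketch of a mart refresh limited to aggregation and one derived facet.
mart_refresh_sql = """
INSERT INTO mart_monthly_sales (order_month, customer_segment, total_revenue, order_count)
SELECT strftime('%Y-%m', o.order_date) AS order_month,   -- derived dimensional facet
       c.segment                       AS customer_segment,
       SUM(o.order_total)              AS total_revenue,
       COUNT(*)                        AS order_count
FROM dw_orders o
JOIN dim_customer c ON c.customer_id = o.customer_id AND c.is_current = 1
GROUP BY 1, 2
"""

def refresh_mart(conn):
    conn.execute("DELETE FROM mart_monthly_sales")   # full refresh of the mart table
    conn.execute(mart_refresh_sql)
    conn.commit()
```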
Data loading patterns – Data Marts
There will be initial and incremental loads of Data Mart data.
Alternatively, one may choose to completely refresh the Data Mart data as and when required.
Most Data Marts will be built using the database structure commonly known as the Dimensional Model, made famous by Ralph Kimball, the father of complex business SQL and a leading dimensional modelling evangelist.
Ralph has come up with a whole raft of ways in which data can be ingested into a Data Mart, and all of these types can be captured, communicated and reused using ETL patterns.
In brief, here is the list of slowly changing dimension types as defined by Kimball:
Fig. 3 – Slowly Changing Dimension Types
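By way of illustration, here is a hedged sketch of the best known of these, the Type 2 slowly changing dimension, in which a change to a tracked attribute closes the current dimension row and inserts a new current version. The dim_customer layout (effective_from, effective_to, is_current) is an assumption for the example.

```python
# A hedged sketch of applying a Type 2 slowly changing dimension update.
from datetime import date

def apply_scd_type2_change(conn, customer_id, new_segment, change_date=None):
    """Close the current row for the customer and insert a new current version."""
    change_date = change_date or date.today().isoformat()
    # Expire the existing current row (Type 2: preserve history, never overwrite).
    conn.execute(
        "UPDATE dim_customer SET effective_to = ?, is_current = 0 "
        "WHERE customer_id = ? AND is_current = 1",
        (change_date, customer_id),
    )
    # Insert the new current version of the row.
    conn.execute(
        "INSERT INTO dim_customer (customer_id, segment, effective_from, effective_to, is_current) "
        "VALUES (?, ?, ?, '9999-12-31', 1)",
        (customer_id, new_segment, change_date),
    )
    conn.commit()
```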
Post-load processing patterns – EDW
Not all transformation will be done before data is loaded. Sometimes it is more expedient and performant to load data and then transform it. This will depend on needs.
To some extent the ELT process lends itself to patterns, but be very wary of trying to over-pattern everything. If a pattern is only rarely applicable, the cost of maintaining it may well outweigh the benefits.
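To show what the ELT variant can look like in practice, here is a hedged sketch in which data is loaded as-is and the transformation is then pushed down to the database engine as set-based SQL. The staging and target table names are assumptions for the example.

```python
# A hedged sketch of post-load (ELT) transformation steps expressed as SQL.
post_load_steps = [
    # Each step is plain SQL so the warehouse engine does the heavy lifting.
    "UPDATE stg_orders SET currency = UPPER(currency)",
    """INSERT INTO dw_orders (order_id, customer_id, order_total, currency)
       SELECT order_id, customer_id, quantity * unit_price, currency
       FROM stg_orders""",
    "DELETE FROM stg_orders",
]

def run_post_load(conn, steps=post_load_steps):
    for statement in steps:
        conn.execute(statement)   # a real pattern would log timings and row counts
    conn.commit()
```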
An interesting aspect of patterns has emerged as the PaaS trend has picked up. For example, there are data loading guidelines for Microsoft’s SQL Data Warehouse. Several common loading options are described, such as SSIS, BCP, Azure Data Factory (ADF) and SQLBulkCopy, as well as PolyBase, which is apparently the preferred and fastest method for ingesting data into SQL Data Warehouse.
ETL and database technologies provide many options to load data, as discussed in this article, and each tool and method has its own advantages and disadvantages. In Microsoft’s case, it’s easy to lift, shift and drop your existing ETL packages, scripts and other control-node client gated methods to mitigate migration effort. “However, if you require higher speeds for data ingestion, consider rewriting your processes to take advantage of PolyBase with its high throughput, highly scalable loading methodology.”
Other vendors have similar approaches, with their own unique selling propositions.
Metadata management patterns
Unsurprisingly, one of the ways of automating ETL pattern development has been via the metadata route. For example, Microsoft SSIS can be used in conjunction with a rich set of metadata in order to generate reusable SSIS packages for data extraction, transformation and loading. SSIS control flows are generated directly from the metadata and imported into SSIS.
In the Microsoft SSIS example (published by Microsoft), the run-time environment of the platform is based on a controller/worker architecture, and Microsoft’s documentation provides further detail on distributed execution of SSIS packages and unified logging using controllers and workers.
The dearth of metadata management patterns in ETL compounds the general dearth of ETL patterns. This may have something to do with the lack of widespread genericity in Data Warehousing ETL processing, which, unlike much solution architecture at the application level, goes beyond a clear set of enterprise application building blocks associated with CRUD activities, transaction data, reference data and master data.
The fact of the matter is that although there has been much research into the automation of Data Warehousing, and especially ETL, the coverage provided by the efforts to date does not meet the minimum needs of many organisations lured by the promise of such technologies. Yes, they do have their niche, but they are certainly not receiving widespread acclaim, and that’s for a reason: applicability, usability and cost. Blind ETL automation can also lead to foreseeable risks and unforeseen costs that wipe out any financial benefit up front.
Logging and auditing patterns
Use logging and auditing patterns throughout your ETL process landscape.
Ensure that all generic aspects of exception handling have corresponding exception handling patterns.
Ensure that generic patterns are available for generic aspects of ETL process auditing.
Remember, there may be a legal requirement to show that the ETL process has been correctly audited – each and every time.
Better still, buy tools or add-ons that provide comprehensive logging and auditing.
There is no point in reinventing the wheel in exactly the same way, and there is no point in reusing patterns for the sake of development dogma.
Also, before I forget: ensure that all aspects of any developed or provided ETL functionality are well documented, especially if it is reusable pattern functionality and structure.
Last, but not least, don’t forget to use version control on all development aspects of ETL.
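Pulling those recommendations together, here is a hedged sketch of a reusable auditing wrapper: every ETL step is run through the same function, which records duration, row count and outcome whether the step succeeds or fails. The etl_audit table layout is an assumption; in practice this is often handled by the ETL tool itself or a dedicated audit schema.

```python
# A hedged sketch of a reusable logging / auditing wrapper for ETL steps.
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl.audit")

def audited_step(audit_conn, step_name, step_callable):
    """Run one ETL step and record an audit row whether it succeeds or fails."""
    started = time.time()
    try:
        rows = step_callable()            # the step reports how many rows it processed
        status = "OK"
        return rows
    except Exception:
        rows, status = None, "FAILED"
        raise
    finally:
        audit_conn.execute(
            "INSERT INTO etl_audit (step_name, status, rows_processed, duration_seconds) "
            "VALUES (?, ?, ?, ?)",
            (step_name, status, rows, round(time.time() - started, 3)),
        )
        audit_conn.commit()
        log.info("%s finished with status %s", step_name, status)
```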
Testing patterns
I want to deal with the subject of Data Warehousing and ETL testing (including test patterns) separately. But for completeness it also appears here.
Needless to say, when it comes to table-to-table and field-to-field testing, this is where reusable, dynamic patterns should come into their own.
At the moment there is not a wealth of ETL test patterns available in the public domain, as testing has always been a bit of an afterthought in Data Warehousing. However, there is a growing realization of the importance of testing, especially where regulatory reporting depends on Data Warehousing.
It is hoped that the ETL tools themselves will eventually provide test pattern functionality built in, which would remove the need for alternative means of designing, building, testing and documenting ETL test patterns.
Let’s see if the ETL vendors step up to the plate.
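In the meantime, here is a hedged sketch of the sort of table-to-table and field-to-field reconciliation test that lends itself to reuse: row counts and a per-column checksum compared between source and target. The table and column names are parameters precisely so the same pattern can be applied to every source-to-target mapping.

```python
# A hedged sketch of a reusable source-to-target reconciliation test.
def reconcile(source, target, source_table, target_table, numeric_column):
    checks = {}
    checks["row_count"] = (
        source.execute(f"SELECT COUNT(*) FROM {source_table}").fetchone()[0],
        target.execute(f"SELECT COUNT(*) FROM {target_table}").fetchone()[0],
    )
    checks["column_sum"] = (
        source.execute(f"SELECT COALESCE(SUM({numeric_column}), 0) FROM {source_table}").fetchone()[0],
        target.execute(f"SELECT COALESCE(SUM({numeric_column}), 0) FROM {target_table}").fetchone()[0],
    )
    failures = {name: pair for name, pair in checks.items() if pair[0] != pair[1]}
    return len(failures) == 0, failures
```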
Automation patterns
ETL and Data Warehouse scheduling and automation patterns will be addressed in a separate article on Data Warehousing Automation and Scheduling.
That’s all folks
As with many of these aspects of architectural patterns, there isn’t a wealth of information or usable patterns in the public domain, and this is further compounded if the tool you happen to be using is rather esoteric.
As this is the first of a series of articles, I would like to make a personal observation regarding patterns. Patterns are not new; they are another name for something that professionals in IT and other industries have been creating and applying for decades. Even the COBOL shop in the company where I had my first IT-related job used patterns, which they called copy-books. Patterns are neither new nor revolutionary; they are just another way to create, reuse and extend functionality, process and structure.
As an architect I am familiar with building blocks, components, libraries, blueprints, reference architectures, industry models, nonspecific artefacts, genericity and polymorphism, and a long list of etceteras. So the use of the term pattern does not actually introduce anything new, which in the big scheme of things does not matter too much.
However, and to reiterate: as a Data Architect I think that documented design patterns, aligned to the definition and use of architectural building blocks, should be used to direct and guide the deployment of solution building blocks and reusable solution functionality and structures, whether as patterns, models or library functions.
Bottom line? Reusable architectural patterns are useful, but don’t expect to be able to do more than 20% of your ETL with generic patterns, unless your business is super-generic or your ETL needs are minimal.
Until the next time. Have fun!
Many thanks for reading.
If you have any questions regarding this article then please feel free to connect with me here on LinkedIn or GoodStrat.com, or leave a comment below.
I tweet as @GoodStratTweet and my public email is martynineurope@gmail.com