Using an ETL platform for your Data Warehouse, is it still relevant?
When I started my career in Data Warehousing and Business Intelligence 15 years ago there was a massive push towards adopting ETL software. Traditionally, specialised ETL software such as Informatica Powercenter, Oracle Warehouse Builder (later superseded by Oracle Data Integration), Microsoft DTS (later superseded by SSIS) and many similar platforms were very successful because of two main reason:
- ETL software provided a way to ‘explain’ or document what was happening in a way that made it easier to understand by most people (often deceptively so, but still)
- ETL software provided capabilities that were not easily implemented using procedural SQL. Examples are shared and/or reusable caching, shared sequence generation and things like that. Depending on the ETL software and RDBMS platform combination some very neat tricks could be leveraged including optimized sorting, improved latency, pivoting and normalisation
In a way the performance balance between the ETL server and RDBSM server was different as well and ETL was often implemented to take some load off the database.
I got hooked and worked as a Powercenter specialist for a couple of years during this time.
At this point in time, anno 2014, we are perfecting ‘model driven design’ where we can read the data models to auto generate the required ETL processes following the predefined rules and (ETL) patterns. And we can generate these for any desired ETL tool, including SQL. The pattern-based approaches significantly change the way ETL is documented: gone are the big complex ETL processes that require custom documentation. They are replaced by many atomic ‘templated’ ETLs which are indeed so simple they do not require any additional documentation: just the description of the template will suffice as all ETLs are based on this!
In a similar wave the performance balance seems to have shifted back in favour of the database. With new ever scalable and Massively Parallel Processing (MPP) database, Cloud Databases, Appliances and even NoSQL databases it appears the need to have a separate ETL server is going away. Indeed, it often slows down the processing because the data has to be read from the system into memory, processed, and then inserted back into the database. With all the scalability now available for the data platform in many cases it’s better to just keep the processing there. This move has been going on for quite some time, which is visible by the increasing support for SQL generation (pushdown of SQL); allowing the ETL platform to act as documentation tool while letting the database handle the processing. Off topic, a similar thing is happening now with Hadoop / MapReduce as well.
This brings up the following question: if we have all the metadata we need to generate every ETL process, do we still need an ETL tool?
I’m tempted to say ‘no’ but perhaps it is better to make the distinction for applying ETL software functionality for separate and dedicated parts of the solution design. I think for the time being an ETL tool has a place in the following areas:
- Interfacing between various sources (flat files, different databases, web services, messages, CDC software etc.), because you still want a heterogeneous approach to interfacing if possible. Although it is acknowledged that various native functionality is present to create some, or all, interfaces in a specific situation most ETL tools provide easy connectors
- Deploying the Information Mart / Data Mart, as this still requires business logic to provide the data into a meaningful and accessible format. The way I see it ETL tools still have their original function here to provide additional documentation and controls in how information is presented – assuming you have a proper core Data Warehouse layer underneath your Presentation Layer! The key rule is that the Information Mart is ‘rerollable’, and can be regenerated with exactly the same data
However, this last bullet is in dispute. The recent push towards virtualising the Information Mart mean you don’t need a physical copy of your data in the shape of a Mart anymore. This is supported by upcoming tools such as AnalytixDS Mapping Manager that can store business rules / transformation logic and provide APIs that can hook into the Data Warehouse views.
My approach would be to seriously consider if you need a full platform with their complete functionality these days, and if you consider ETL focus primarily on the interfacing and delivery. The core Data Warehouse layer, such as a Data Vault model, can go either way but will probably benefit most from SQL processing.
The ability to forward engineer your ETL is a great benefit here since it gives you the flexibility to even re-roll your ETL from a dedicated tool to SQL and vice-versa, making you truly future proof and ‘agile’.