Do we still want to automate against ETL tools?
In the various Data Warehouse initiatives I have been involved with recently I tend to use ETL software less and less. Over the last few years I spent a lot of time figuring out how to automate/generate ETL in various tools – most notably in SSIS, Pentaho, Powercenter and recently SAS (but various others as well). To the point that most of the intended solution can be generated from metadata.
But as recently outlined in this post and the opportunities provided by virtualisation (posted in various articles on this blog) I feel that I really don’t need (want) ETL anymore for the ‘back-end’ of the Data Warehouse / Data Platform solution. This back-end (usually a staging / Data Vault or similar combination) needs to be rock solid and ‘just work’ and is highly standardised and pattern-based and while we can achieve this using ETL (tool) automation, I have come to believe this back-end is much better handled in SQL. For me a as a long-time ETL developer this is almost coming full circle in how I use(d) ETL. The main reason for this move is the fact that database performance is more and more outweighing ETL server performance, and the patterns are so easy that even investing energy in pushdown (generating SQL from ETL) isn’t really worth it.
So I’m at the point that while I can generate ETL in say SSIS, in many cases I’m not sure whether I still want to do it.
You want to minimize the data that leaves (and is written back to) the database and generally this means filtering and processing in SQL / the database. Ideally you would want to virtualise everything, which is a real option in many cases.
Dedicated ETL tooling can still have its place in the solution design, but specifically in:
- Providing connectivity to source systems; some tools have great built-in functions or connectors to various applications. Specifically when it comes to integration SaaS type applications. In other words: by providing data delta into the Staging Area only. Of course there are many applications that do not need a specific connector, in which case you don’t need dedicated ETL software.
- Implementing specific manual business logic in the Presentation Layer only (Information Marts). This is also up for discussion with the broader implementation of Managed Self Service BI and better business rules engines.
But not by default in the Persistent Staging Area or core Data Warehouse (Integration Layer, Data Vault and similar). It has actually become better, easier, portable and more scalable to actually do this type of ETL in SQL.
I do realise there are various ETL automation tools out there that generate SQL. Software such as Quipu (open source), WhereScape, BIReady etc. etc. But as far as I know they don’t do the virtualisation for the core Data Warehouse layer yet, and in some cases (to the extend that I have worked with the tools) have some limitations that I’m not comfortable with.
What I want is to create a Persistent Staging Area, virtualise the Data Vault, virtualise a raw data mart and go. All from metadata of course. And when persistent storage is needed I want to either swap or apply this in-place or generate ETL to do this for me. I’ve done some thinking on how this can easily be programmed and will post this shortly.
For the record, this still means you can implement Managed Self Service BI. It just means the data would be recorded in Staging / Persistent Staging as opposed to a dedicated area in the Integration Layer (such as the Business Data Vault). Same thing, different way of achieving it.