Exciting new ways of thinking about Persistent Staging
A Persistent Staging Area (PSA) if often associated with databases, but this is not how it needs to be. This post covers alternative ways of thinking about what a PSA really is, and what it can do for you.
More than just a database
I am a huge fan of having a PSA as a part of a Data Warehouse architecture. In fact, I would even go as far as stating that you should consider a PSA for any data solution architecture.
A PSA can loosely be defined as an insert-only time-stamped administration of all original transactions that have been presented to the data solution through feeding systems or interfaces, ordered by time of arrival.
This can be implemented as a database, or series of databases, but also in many other ways – which is the topic of this post.
A PSA allows you to refactor your upstream designs such as a Data Warehouse (‘models’, ‘business rules’) by being able to reprocess the events (transactions) and ‘replay history’ in a deterministic way. Reprocessing events simply means running the input through the (updated) logic again, in order to recalculate the output.
I still consider this a fundamental requirement for many data solutions, including Data Warehouse systems. I continue to be surprised about how long it takes for the industry to appreciate this and move away from the notion that data solutions can be delivered correctly in one go only through good data modelling and business requirements.
I’m not saying data modelling and business requirement gathering are not important – they very much are. I’m simply stating that it seems unlikely that the inherent overall complexity and required maturity of a given business will be solvable in a single pass – even for small scopes or subject areas. These things take time to evolve, and usually a lot.
I view data solutions as long-term assets, organisms that morph with an organisation and continuously evolve using different interpretations and technologies. Having a reprocessing capability that allows not only the business requirements to iterate but the entire data solution enables this.
A common argument against refactoring and reprocessing is that analysis was just not done thoroughly enough. I don’t subscribe to this view. Any system will incur technical debt over time, things you look back on and would like to have done differently. Or, some components just need a little bit of refurbishment after a while.
My view is that it is not possible for any individual to fully understand the complexities of modern businesses to have all the wisdom upfront, and that is reflected in the defined data models as well as how technology is implemented.
Why not allow for some room to refine these things over time? A PSA caters to this, and as we shall see can in come in handy for other reasons as well.
Scale does matter
There is still a lot of discussion on how a PSA should best be implemented, with significant focus on scalability in particular and size / volume in general. To facilitate these conversations, I believe it is helpful to think about the PSA as a log.
With a ‘log’ I don’t mean an extract of events that are meant to be viewed and actioned by people (i.e. for debugging purposes etc.). The kind of log I am referring to is closer to the transaction logs that databases use to guarantee consistency and durability. This log is a fundamental data set containing insert-only series of events, captured in the sequence of their arrival (i.e. ordered in time).
The transaction log is central to how relational databases work. In short: any data (event) is first written to the log, from where it gets propagated to (potentially) various data structures such as tables, indexes and even other systems (i.e. log shipping, transactional replication). The log is the source of truth, the master record of all occurred events – everything that happened.
This is the implementation of the ACID principles for relational databases. As a mature product, the Relational Database Management System (RDBMS) handles things like these fairly autonomously in the background – but it is fundamental functionality to enable a consistent representation of data. The log is where changes accumulate when events are triggered (updates, insert, deletes) and tables and indexes are representations of these events.
To use even more abstract terms: think of this approach as having external factors (systems) stream events (changes) into an application-readable log that allows any upstream (consuming) system to pick up these events at whatever rate is acceptable and supported by the available technology.
This creates a distinction between the more narrowly defined ‘informative’ log as used in the introduction. Let’s stick to the definition of a ‘application-readable’ log that allows potentially many systems to read from for the purposes of this post.
Persistent Staging is an application-readable log
If you think about it this way, there is a strong parallel between the role of the PSA in a Data Warehouse architecture, and how an application-readable log operates.
This is best outlined in the post explaining how to use the PSA to configure continuous loading and eventual consistency and the more in-depth paper covering the details of delayed consistency (latency) in continuous loading environments. This is a robust approach and while it is fairly technical, I recommend having a look.
These papers explain how data consistency is asserted in a fully parallel environment by using the control framework (ETL) metadata. The control framework orchestrates the many independent ETL processes that populate the Data Warehouse directly from the PSA. Every single ETL process manages its own load window against its mapped PSA counterpart, and can therefore run in parallel and in any order or frequency.
In a sense, they are are all individual consumers of the application-readable log.
As I explained in these previous papers, the combined load windows can be used to automatically calculate what data is ‘ready’ depending on the scope of the upstream query – I refer to this as ‘ACID at application level’. I still find this to be a very elegant and easy-to-manage scalable solution for more traditional Data Warehouse configurations.
The implementation of the PSA in these examples was done using database tables, but for this scenario there is no conceptual difference between a file and table.
In both cases, the PSA is an insert-only time-stamped ordered set of files. In the database context these are physically tables, but in other settings they can take on a number of formats like Parquet, Avro, JSON, HDFS, Spatial formats, etc. A PSA can be easily implemented as an application-readable log and achieve the same outcome in functionality by using different technology and options to scale out.
There really is no different in principle between the PSA and a log in this sense. Indeed, a log and a table can be used interchangeably – both are ways to store data in a certain order. The structure nor the principal architecture would need to change.
If you were to replace the ‘table PSA’ with the ‘application-readable log PSA’ you can see that the upstream consuming applications are the many individual processes that are running as their own little state-machines orchestrated by the (ETL) control framework. Again, each atomic process can be considered as an individual consuming system that has its own commit points to read from. Eventual consistency from a data delivery point of view (i.e. to a Data Mart) is asserted the same way as before.
Technology is available to support this, ranging from bridging technologies such as PolyBase to new and established distributed commit log processing technologies such as Apache Kafka, Microsoft Azure Event Hub, Amazon Kinesis and Google Pub/Sub.
PolyBase allows things like accessing (compressed) archives on Azure Data Lake or Blob Storage that PSA / log files are continuously loaded into through a SQL interface – bridging the file and database world. The PSA in this case is the zip file on the HDFS instance.
Distributed commit log technologies such as Kafka, Event Hub, Kinesis, Pub/Sub allow for massive scalability and throughput of reads and writes. They also offer features such as managing log retention, partitioning across nodes and managing throughput to different classification of consumers – with fault tolerance and fail-over. Consuming applications can subscribe to feeds exposed via APIs / services. The PSA in this case is the log file directly.
Best of both worlds?
There are additional and intriguing possibilities when viewing the PSA as an application-readable log, similar to how logs are used in distributed systems. Besides the myriad of automatically-generated atomic ETL processes that consume the log, the log can be subscribed to by any other information consuming system as well.
This means the log becomes more of a central ‘data hub’ that provides the ‘plumbing’ that unlocks raw data (change events) to a scalable centralised system – often an early win in data projects. Following this line of thought, there is a lot of merit in using event based (log) approaches to be the heart of any data solution.
Can the PSA be one huge table, or a series of tables in a database? Sure, and this can work for you – it has worked for me in many projects. It is, as always, all about options and considerations.
But, if scale and size are critical don’t let this be the barrier for adoption. Instead, consider the alternative view of treating the PSA as an application-readable log with all the associated benefits for multi-consumer, distributed- and stream processing.