Foreign Keys in the Staging Layer – joining or not?
Warning – this is another post in the ‘options and considerations’ context, meaning that some people will probably disagree with this based on their personal convictions or ideas!
One or two Satellites?
The case in question is how to handle complexities that may arise if you want to simplify loading by joining tables in the Staging Layer. You may want to do this depending on the design choices made for the source system your are receiving your data from (some systems are better designed than others). An example of this could be a ‘Customer’ table which contains an ID (a foreign key called ‘Type ID’ or something like that) to a generic reference table with the value for the ID for example being ‘Good Customer’.
In the corresponding Satellite I would ideally like to include the label (description) instead of the ID, otherwise I’d have to join later to get this. This is because I prefer the Integration Layer model (Data Vault) to be as self-explanatory as possible.
The example used here is a very simplified case, of course. But, real life scenarios occur often when there are central ‘generic’ code tables with IDs, and I have seen things such as ‘code’ with ‘name’ in the one table, and ‘code’ with ‘description’ in the other… Another common scenario is that the (real) Business Key you need is not in the table you intend to process, so you need to get it from somewhere else.
Consider the following situation:
The first decision to make is how to model this to begin with. Does it make sense to model out the codes as its own (unconnected / reference) Hub / Satellite? There are a couple of advantages to this:
- If the meaning or value for the reference code changes and it is captured independently in a dedicated Satellite, it can be combined with the Satellite at any (later) time to represent the correct point-in-time value relevant to that Satellite. For instance in the extreme example that the label (descriptions) change for the ID from ‘Good Customer’ to ‘Do Not Call Anymore’ this only will lead to a change in the reference Satellite as the code itself doesn’t change – only the meaning / description. This reference Hub / Satellite can be joined with the Customer Satellite to display the correct series of events occurred / history, including the change of the label.
- It is easier to develop 🙂
- You can (opt to) avoid a potential massive data redundancy when code labels change. Incidentally, this advantage here – when separating the Satellites – is exactly the same as the disadvantage when you choose to combine the Satellites as we will explore later on. As a quick introduction: some updates on codes (when combined into a single Satellite) may have a massive impact on row counts. If my ‘Customer’ table has a billion rows with the same ID, a change in the code label would blow this out to two billion just with this single change. It’s a consideration to make on things such as expected rate of change etc.
There are also some disadvantages:
- The model becomes harder to understand as you need to provide additional content to make it ‘readable’. It may not be immediately obvious what all the IDs mean or where they should join to.
- You usually end up with more ETL processes.
- You need to join the various tables in /after the Data Vault to get the complete picture of what really happened.
So, if you opt to model the Customer Satellite as one (without the generic Hub/Satellite) does that make things easier? It means you will need to find a way to load the combination of tables into the Integration Layer / Data Vault which effectively leaves you two options: joining the tables in the Staging Layer or defining a template that looks for changes for part of the Satellite (similar to the traditional conformed Dimension updates from various sources ).
The second option means you typically calculate a checksum for the change and inherit the rest of the attributes from the Satellite record into a new one if there is a change. For this post today I wanted to explore the first option though: how to join Staging Layer tables together to populate the combined Satellite in one go. A quick note: these kinds of ideas are explored in the context of Data Vault / Integration Layer virtualisation which supports rapid refactoring of the Integration Layer if required (no lock-in).
After the above conceptual introduction it’s time to look deeper into what it means to define joins in the Staging Layer in order to get all the attributes together before presenting them to the Integration Layer / Data Vault.
Using Views in Staging to combine data
Remember, in the architecture I defined the Staging Layer consists of the Staging Area and Persistent Staging Area (PSA). The reason I point this out is that you can’t join tables in the Staging Area because the information in this area is transient; only deltas are available. Joining two delta sets together gives you incorrect results as keys on both sides of the join may be missing (because no change was detected at the point of processing). What you can do, however, is to leverage the PSA to join against – as the full history of raw changes is maintained and available here. To use this in its simplest form you need to designate a ‘driving’ Staging Area table which delta is used to load to the Data Vault, and outer-join this against the PSA table(s) that has the rest of the attributes you need.
This is especially handy if all you need is the Business Key from another table since it may not be in the table you want to process. Joining in the Staging Layer in this instance eliminates the need for key-Satellite type exception logic.
The approach outlined here is easy to do as it allows you to develop a ‘raw’ Staging view that has the information you need to present to the Integration Layer in the format that makes sense to you (without changing the contents – no business logic should be added ever). This makes uploading information to the Data Vault incredibly straightforward. However, the obvious downside of using these kinds of views for your design is that you introduce dependencies: the PSA table you join to needs to be up-to-date first.
This is a potential serious issue and fundamental design decision as scaling beyond micro-batch will be difficult once loading dependencies are introduced. But – horses for courses as they say in Australia – and the PSA / virtualisation / refactoring safety catch is always present if you change your mind. I will go into more technical details in the next post but suffice it to say there are two main variants of the Staging Layer view as mentioned above:
1) Use a (one) Staging Area table as driving table and left-outer-join to (the point in time) equivalent in the PSA for as many tables as you need.
This is the simple option, and will be OK in most cases. The thing to remember for this approach is (when taking the example from the introduction of this post) that code label (= PSA table) changes for the customer will only be picked-up by the ETL when the customer record (= Staging Area driving table) changes. This may seem trivial but an extreme case example would be that the customer records never change after the initial load, but the code label changes after a while. In this example the Data Vault customer records would still show the old label value (until something else changes for the customer).
Why didn’t the customer record change? Because the Foreign Key ID stayed the same, so for specifically the customer record there was never a change to detect.In short – best used with volatile driving Staging Area tables and relatively static reference PSA tables. Good to avoid the aforementioned Key-Satellite logic though, and it’s easy to add as many PSA tables as makes sense.
2) Join everything bidirectionally. Enter the second option, which is created to counter the above issues related to triggering changes. This option uses all in-scope tables as driving tables, joins them against their respective PSA tables and unions the results to generate change records. Sounds a bit cryptic, but it is fairly easy.
Using the same example again: if you are worried that changes of the code label aren’t coming through quickly enough because the customer records aren’t changing that rapidly, you can use option 2. To do this use the ‘Customer’ table as driving Staging table and outer-join to the PSA ‘Generic Reference Code’ table. This results will then be unioned with the following: use the ‘Generic Reference Code’ table as driving Staging table and outer-join to the PSA ‘Customer’ table. This will pick up changes both ways, and will update Customer records when the code label has changed. Beware of potentially huge row count increases as mentioned previously: if you have a lot of customers, a simple description change may create new rows for all of them.
Data Vault provides solutions for all of this though, and you can easily separate Satellites into their rapid and slow-changing equivalents. In the next posts I’ll copy some of the logic to develop this with some examples to bring it to life a bit more.
My personal preference is not to let IDs into the Data Vault, I just like it better to be able to ‘read’ the model at a glance and having a clearer integrated model to expose upstream. I favour view-based approaches and like the ability to pull attributes together to simplify loading. Especially when all you are missing in your Staging Area table is the Business Key that you need. But, there’s options…