Staging and interpreting XMLs, BLOBs and similar
Recently I had a couple of conversations about the ‘staging’ (loading data delta into your environment) of multi-structured datatypes such as JSON, XML (and some blobs). Today I had one of these conversations with my esteemed ex-colleagues Bruce and Glenn which made me think some additional information and considerations about the recent post to parse XML using XQuery would be a good idea.
These conversations focused on where the interpretation of XML should happen: storing the unmodified (raw) XML in the Staging Area or run logic across the data delta to extract the elements out into (potentially many) separate Staging Area tables, normalising the file in other words. As a default option I recommend storing the original as-is (e.g. XMLs in the XML datatype). This is for the following reasons:
- Storing the unmodified XML in Staging (and the Persistent Staging Area) means we can use the default patterns to extract, stage and archive this type of data, but additional logic needs to be placed on top of this to extract the elements you need
- You are more future-proof as you are not impacted by schema changes in the underlying source system. Changes still impact you of course, but this now happens in the processes from Staging to Integration which are in your own environment and easier to handle compared to interface changes
- If you adopt the Persistent Staging Area (PSA) / Historical Staging Area (HSTG) you can come back later and add more information from the original content if required
If you interpret / normalise the files into the Staging Area (the alternative option) you have the benefit of neater formatting and arguably additional storage and performance benefits. Accessing raw XML can be slow (although special XML indexes may be an option), and in certain scenarios this may not meet requirements so normalising could be an option.
A potential issue is the delta detection, specifically for logical (soft) deleted within segments. This may be a requirement for loading certain parts of the XML into the Integration Layer in the correct context so it should be considered. The short answer to this is that complete delta detection essentially requires the content to be interpreted (more on this a bit later) whatever approach you take towards staging the data.
I encountered this situation recently and opted to store the data as unmodified XML as we weren’t sure which elements we needed out of the entire structure (it was large, varying and fairly convoluted). The idea is that if we kept the original as-is we could go back and get some more information if required – because the original is archived in the PSA / HSTG – and it made the Staging process easier because we could use the standard patterns.
I implemented a view on top of the Staging Layer (using the approach here) that interpreted the XML into a larger denormalised (virtual) structure. This view acted as ‘staging’ for the upstream processes (to the Integration Layer) while the original XML was archived in a PSA/HSTG as usual. This enabled us to use the normal Staging patterns and it is also consistent with a pattern I use that defines views on Staging tables for further processing to the Integration Layer – for instance if you need to join other tables to the Staging table to retrieve the business key. This means that various Staging tables supports views that are used to load the Integration Layer, and the XML interpretation is one of them.
Deriving the logical delete still is an issue, which can be countered by executing Full Outer Joins against the virtual (=view) normalised structure. It goes without saying that a physical table has the same result with the usual complexity / performance balance considerations.
We didn’t need to do implement the delete detection due to the pure transactional nature of the specific data we received. However, conceptually it is no different if you normalise things ‘before’ or ‘after’ the Staging Area when it comes to detecting deletes.
Because of all of the above I ultimately felt storing the original XMLs in Staging was the best way to go. It worked for us, let me know what works for you.