Denormalise (flatten) XML using Xquery
ETL software have ways to interpret XML files or structures that vary in maturity, but recently I had to find another way to import XML structures that are received as blobs using (transact) SQL only. Because the upstream patterns (Staging to Integration) can easily handle denormalised source data sets the easiest way was to cast each blob as the XML file it is anyway and completely denormalise it (‘flatten’, more rows…). This is opposed to normalising each segment into a separate table (either in memory or physical) and (optionally) joining them back together.
There is not that much out there on the internet to accomplish this, so I thought I’d post an example here. I’ve tried this with fairly large XMLs and it works for me as the data delta is not too big – so the performance is fit for purpose. In this example a sample policy file is denormalised from a selected starting point. Essentially once the root node is defined every segment that needs to be included is added using an OUTER APPLY function. This is needed to return records for empty leaf nodes. In a given XML file you can basically navigate to your starting (root) node and denormalise from there.
declare @XML xml set @XML =' <premiums> <premium> <plan_id>1000</plan_id> <plan_name>CANCELLATION COVER - INTERNATIONAL</plan_name> <internal_name>CANCELLATION - SINGLE</internal_name> <cover_level_brief>CANX</cover_level_brief> <cover_type_brief>DEF</cover_type_brief> <value>27.5</value> </premium> <premium> <plan_id>2000</plan_id> <plan_name>COMPREHENSIVE - INTERNATIONAL</plan_name> <internal_name>PLAN A - COMPREHENSIVE</internal_name> <cover_level_brief>COMP</cover_level_brief> <cover_type_brief>XYZ</cover_type_brief> <value>222.38</value> </premium> </premiums>'
SELECT --segment1.premiums.value('local-name(.)','varchar(100)') as [RootTagName], segment2.premium.value('plan_id','varchar(100)') as [PlanID], segment2.premium.value('plan_name','varchar(100)') as [PlanName], segment2.premium.value('internal_name','varchar(100)') as [InternalName], segment2.premium.value('cover_level_brief','varchar(100)') as [CoverLevel], segment2.premium.value('cover_type_brief','varchar(100)') as [CoverType], segment2.premium.value('value','numeric(38,2)') as [Value] FROM @XML.nodes ('premiums') segment1 (premiums) -- to set the root node OUTER APPLY premiums.nodes ('premium') segment2 (premium) -- needed for empty leaf nodes --Add deeper segments here…
You can basically add as many segments as you want, either in one go (as above) or as separate segments (with varying levels) and join them in a CTE or something.
This example uses a variable to select from, but if you load from a XML stored as text or the XML datatype you can select from the table as well. In this case you need to start with a CROSS APPLY to get all the nodes you want. The FROM clause will then look like this:
FROM ( SELECT POLICY_ID -- for instance, CAST(XML_DOCUMENT AS XML) XML_OUTPUT FROM <table> ) sub CROSS APPLY XML_OUTPUT.nodes ('premiums') segment1 (premiums) -- to set the root node OUTER APPLY premiums.nodes ('premium') segment2 (premium) -- needed for empty leaf nodes --Add deeper segments here…
The root node (segment1) in this example needs to be specified completely. In the above example premiums is in fact the top-level node, but if you want to extract data with a deeper node as starting node the full path needs to be provided. E.g. if I want to start collecting data from the ‘bulk premiums’ node onward the segment 1 node needs to be fully pathed as for instance ‘policy_response/calculated_multi_values/bulk_premiums’.
In both cases the output needs to be declared so you need to know what information you want to retrieve from your XML and what the structure is. A screenshot from the results is posted below.