Data Vault ETL Implementation using SSIS: Step 5 – Satellite ETL – part 3 – End Dating
For various reasons including (but not limited to) ease of maintenance, speed, reusability, parallelism and reducing of complexity in general, End Dating is implemented as a stand-alone generic ETL process to support the Data Vault Satellite ETL.
Conceptually, End Dating is a redundant step because the information is derived from already available information: the Effective Date. However, in most situations it makes upstream ETL processing easier to manage and may be worth the storage/performance trade-off. But, because its redundant nature End Dating has to be considered as an optional step and therefore needs to be designed to be removed (or optional) if required.
End Dating in this context is defined as the process that ‘closes off’ changed records by setting the Expiry/End Date and sometimes even an Active Record Indicator (if implemented). The End Dating process is defined for each Satellite entity and is therefore generic; it can be shared between the various ETL workflows / batches if for some reason a Satellite entity is populated from different data sets (exceptions, but it can happen). It goes without saying that it needs to be run after the regular Satellite ETL process.
In SSIS the End Dating process for Data Vault Satellites is created as a dedicated Package with an Execute SQL Task. Following standard ETL conventions it requires the same logging of records as the other processes do which means updated records will be flagged with the Process ID of the (instance of the) job that actually did the update. Earlier ventures to define End Dating were based on using a normal Data Flow, but this creates a row based operation which is quite performance intensive for SSIS to handle. Using an Execute SQL Task forces execution on database level which is set based. Thanks to SSIS’ ability to pass parameters into the Execute SQL Task we are able to provide the Process ID information into this SQL before execution, so it will be a relatively efficient task to run (for a redundant step, that is).
In any case it is a very straightforward process with only one operation in the SSIS Control Flow:
All the logic is buried in the SQL query which identifies any record that needs to be end-dated and also does the corresponding updates; the selected Expiry Date (the Effective Date from the next record). This means that this process can be run at any time and closes off all records which need to be closed. An example SQL is provided below. Please note the ‘?’ symbols are mapped to parameters via the ‘Parameter Mapping’ in the Execute SQL Task properties. Also, in my view of the world the high end date/times are always set to ‘9999-12-31’ so the logic is built around that. These high dates are inserted by the regular Satellite ETL and already in place from the perspective of End Dating.
With MyCTE AS ( SELECT A.<Hub Key>, -- Add additional key for multi-active satellites here! A.<EFFECTIVE_DATETIME>, B.<EFFECTIVE_DATETIME> AS EXPIRY_DATETIME FROM <Satellite table> A INNER JOIN <Satellite table> B ON A.<Hub Key>=B.<Hub Key> -- Add additional join condition for multi-active satellites here! AND B.EFFECTIVE_DATE = ( SELECT TOP 1 EFFECTIVE_DATETIME FROM <Satellite table> C WHERE A.<Hub Key> = C.<Hub Key> -- Add additional where clauses for multi-active satellites here! AND C.EFFECTIVE_DATETIME > EFFECTIVE_DATETIME ) JOIN ( SELECT <Hub Key> -- Add additional key for multi-active satellites FROM <Satellite table> WHERE EXPIRY_DATETIME = '9999-12-31' GROUP BY <Hub Key -- Add additional key in grouping condition for multi-active satellites HAVING COUNT(*) > 1 ) dupquery ON dupquery.<Hub Key>=A.<Hub Key> -- AND additional key for multi-active satellites AND A.EXPIRY_DATETIME='9999-12-31' ) UPDATE A SET A.EXPIRY_DATETIME = B.EXPIRY_DATETIME, A.ETL_PROCESS_ID = ?, -- If required A.ACTIVE_RECORD_INDICATOR = 'N' -- If required FROM <Satellite table> INNER JOIN MyCTE B ON A.<Hub Key>=B.<Hub Key> AND A.EFFECTIVE_DATETIME = B.EFFECTIVE_DATETIME