Implementing User Managed data (User Managed Staging Tables)
Related to the handling of reference data it is sometimes required to feed information into the Data Warehouse that does not have a formal source in the organisation’s information landscape.
This information can be needed in the form of reference data (additional information about attributes provided by a source), relationships or really anything that is required to ultimately meet reporting requirements. It can be a vital element to ‘glue’ information together or to provide details which are not currently stored in any system.
To keep the discussions and use of the concept clear it is helpful to make the distinction between reference data in itself and user-managed data. Reference data is not necessarily user-managed and can be provided by formal external bodies that provide context for a specific code or label. However, it can be user-managed if such a data source cannot be found. User-managed data is a potentially wider collection which is treated as a single source application that resides within a persistent section of the Staging Area.
The implementation of user-managed data is done as a persistent (set of) Staging Area tables with the same structure as the regular Staging Area tables. This set of tables is viewed and managed as a separate source system for the Data Warehouse meaning that all the subsequent (ETL) processes will be similar to the regular ETL processes from the Staging Area onwards.
While the structure in terms of attributes is the same between normal and user-managed Staging Area tables there is a difference in the way the attributes are populated:
- Insert / event date time; in normal Staging Area tables this attribute is provided by the source to staging interface but for User Managed Staging tables this date/time should be updated when table contents change. I.e. it is system generated (triggered). Similar to regular Staging Area processed the insert / event date time is propagated through the Data Warehouse to manage timelines (history).
- Record source; defaulted or hard-coded to ‘USER_MANAGED’ or any other name the tables should be referenced to as an operational system.
- CDC operation; similar to the insert / event date time information this attribute is system managed (triggered) for User Managed Staging tables. The information serves as metadata whether the record has been updated, inserted or deleted.
- (Source) row identifier; similar to the normal Staging Area this is still an auto generated sequence number.
- Any attributes that form the User Managed content. Only these attributes should be visible through a front-end solution.
It is recommended to provide a front-end solution to manage the table contents to support easy changes of the contents as well as to shield the metadata attributes for incorrect updates. Most Business Intelligence platforms provide some kind of data entry or direct-update functionality which can be used, but a lightweight web- or service interface is also an option.
The User Managed tables do not have an associated source to staging ETL process but all subsequent ETL processes are the same as they are for other source systems. This includes managing the changes over time in the History Area (part of the conceptual Staging Layer) and loading the data to the (Operational) Data Vault or any other core EDW model. The design decision to keep the table structure the same for User Managed and regular Staging Area tables pays off here since it keeps the upstream ETL processes generic and provides a historical view on changes made in the User Managed tables. And of course this consistency enables these processes to be template or generated while providing all the benefits from the EDW ETL architecture including disaster recovery, rollbacks and general robustness.
From a broader Information Management perspective it is recommended to continue working towards finding a suitable ‘home’ for the information that is stored in User Managed tables. When this information becomes available from a formal source/process at some point this dataset should replace the User Managed solution.