Best practices on developing Data Vault in SQL Server (including SSIS)
Sharing is caring, so today’s post covers some technical details for the Microsoft world: implementing Data Vault models on the SQL Server database and corresponding ETL using SSIS and technologies such as BIML. This is based on experiences gained developing many Data Warehouses (both Data Vault based as well as using other methodologies).
Physical modelling (for Data Vault-based Integration Layers):
- Don’t use clustered indexes on Primary Keys! This is the single biggest tip to be aware of since Hash keys basically act as ‘random’ seeds for an index. Hashes are not random of course, but their non-sequential nature means that they usually need to be inserted (in order) somewhere in between existing Hashes if you would use a clustered index. Because of this loading new keys leads to a continuous altering of the index (keys get moved up and down all the time) which severely degrades I/O performance when loading. It also causes massive fragmentation for the index when reading which further impacts updates (e.g. end-dating). Non-clustered indexes should be considered mandatory for Hash-based keys as this will make sure new keys are added ‘at the end’ of the index. This is fine, as there is no logical order for the hash keys anyway.
- Setup an index maintenance plan and run it on a schedule as per the MSDN best practices.
- Use clustered indexes on Unique Keys in Hubs. This (obviously) creates a separate index in the order of the real business keys for Hubs which improves joins / lookup performance. For Links you have the option to use a clustered or non-clustered index on the ‘unique key’ (the combination of Hub hash keys in the Link). You want to have this constraint in place to guarantee uniqueness but how you cluster the index or not depends on balancing the query select performance. If you use a clustered index be mindful of fragmentation caused by the reshuffling of keys. At least you don’t need to update a link so you have options here.
- Make your date/time attributes datetime2(7), as this is the highest precision date/time SQL Server provides (after 2008 from memory). This makes everything more future proof as datetime2(7) has a range from 0001-01-01 to 9999-12-31 while the datetime only supports 1753 onwards (to 9999). Additionally datetime2(7) can be more specific at time level: datetime2(7) can capture time down to 100ns whereas datetime starts rounding at 3ms.
- Don’t enable Foreign Keys anywhere. You need to do this to enable the parallel loading that using Hash keys make possible. This means you can’t generate a diagram in SQL Server as it read off these constraints. In the Data Vault world there are other ways to enforce referential integrity.
- Use Page compression on everything, pretty much everywhere (e.g. tables, indexes). I haven’t encountered a situation where the CPU hit doesn’t pay off against the I/O gain. Especially in historical tables compression gains are huge.
- Consider Unicode. I know, it takes up more space but you can handle the (conceptual) scenario that a unicode-based OLTP system actually uses different casing for different meaning (e.g. Abc = customer 1, ABC = customer 2).
- Always ‘sand’ Hash keys with delimiters (I use a pipe delimiter) as per Data Vault best practices. The built-in SQL Server MD5 or SHA1 algorithms deliver the results in upper case. An example is:
CONVERT(CHAR(32),HASHBYTES('MD5', ISNULL(RTRIM(CONVERT(NVARCHAR(100),[AGENT_CODE])),'NA')+'|'+ ISNULL(RTRIM(CONVERT(NVARCHAR(100),[POLICY_NR])),'NA')+'|' ),2) AS POLICY_AGENT_HSH
- Push-down SQL as much as possible when working with SSIS. As an ETL tool, SSIS isn’t particularly good at lookups and caching in general. Techniques such are (easy) shared caching are still foreign to SSIS. Any joins or checks are better done in the source SQL override, leaving the rest of the Data Flow to add mid-stream (ETL process) attributes or any passive transformations that are needed such as interpreting join results. The Control Flow is great to use to handle the ETL process itself (ETL Framework / exception handling). The result is that in SSIS most templates are really SQL templates with an SSIS wrapper.
- Investigate techniques such as BIML to automate development. I tend to use some C# code to iterate through metadata and pass variables into BIML code in order to generate pretty much everything, but at the very least the ETL process control handles. You can use the free BIDS Helper plugin for this, or alternatively Varigence sells an IDE that also covers the wider MS BI stack. There are various sites with code snippets, and it’s easy to setup.
- Use an ETL Framework! This is a more generic statement, but you really can’t manage and scale a solution without having a consistent framework and approach to handle ETL exceptions.
- Lastly, I can’t help it but don’t use the SCD operation. It doesn’t scale! But I leave the rest to consider based off the Project REAL insights, otherwise there’s too much in here that’s already covered somewhere else 😉
Hope this helps!