Using a Natural Business Key – the end of hash keys?

You may also like...

2 Responses

  1. dlinstedt says:

    An interesting article for sure, and yes, it does beg discussion. Unfortunately I disagree with Roland Bauman on some of the issues he has raised in his article and I have not yet prepared a response. He is not completely correct in some of his statements. That said, I wish to offer a few thoughts and possible corrective responses to your post (or considerations at the very least)..

    1) Go back to the ORIGINAL definition of a Hub… The full and base level definition of a hub is: A UNIQUE LIST OF BUSINESS KEYS… Never Ever Ever have I said that Hash Keys, or Sequences or Surrogates are “necessary”. Now, that stated, I have made statements that they are required for “technical reasons”. But they are NOT and never have been part of the original definition of a Hub. So why use hashes or sequences anyway?

    2) Clustered Indexes stink at volume or high velocity data – cause tremendous I/O problems on ingesting data (that arrives out of order to begin with) on real-time streaming queues. That said, Clustered Indexes don’t scale to the Petabyte / tens of petabyte levels. If they did, they would be applied or in use today on MPP machines in the NoSQL space. They DO cause problems. Yes, the performance of a clustered index on a join IS faster (as long as the data isn’t split over MPP divisions), but this is a highly platform specific and volume specific argument. The Data Vault model and methodology along with its standards is built for HIGH Scalability without re-engineering. Unfortunately your tests have not been attempted at the 1 petabyte levels. I’d be curious to see what happens to the clustering at these volumes, and I’d be curious to see what happens to 100k transactions per second on incoming feeds – against a standard clustered table, that requires an incremental nature. In my experience, scale and volume KILL clustering dead as an architecture.

    3) Unfortunately you might have missed the latest statements about switching your hashes in Oracle and SQLServer over to Binary(16) (for MD5), and Binary(20) for SHA-1. These will save tremendous amounts of space, as well as avoid any “issues” with improperly assigning a National or Unicode character set to the Hash Key.

    4) Regarding Natural Keys / Business Keys: Yes, they can and do outperform Hashes IF and ONLY IF a) they are concatenated to a single field (multi-field joins cause bigger performance issues over large volumes than single field joins), and b) the length of the concatenation is shorter in bytes than the length of the BINARY stored hash. That said, Teradata cannot store the Hash result in a binary format in a primary key (if it can, please correct me, this is the current research Ive found). Now that said, if your “natural / business keys” are in unicode because they need it, then they almost assuredly will end up in a longer concatenated length than the result of an ASCII based 8 bit hex Hash. Which translates in to binary(16) for MD5 and Binary(20) for SHA-1.

    5) what if you have a DATE as part of your business key? Interval? Time? or Float/Decimal? They need to be cast as Character / Varchar – same as it would have to be, to be included in the hash, somewhere you will eat the computation time to trim, and align the data sets.

    6) lets talk about business key / natural key variation…. Nothing to say here – because it is a wash. If it’s in the hash in the Hub or not, it won’t matter. As long as it’s concatenated to a single field.

    So, at the end of the day, yes hashes can be effective techniques for mitigating lots of issues. I’m not saying Hashes are the only solution, and I’m not saying they will always be the right one. In fact, IF you follow the purist design of the Data Vault you would never be using surrogates or hashes to begin with. That said, this is a Platform Issue – the question that is raised is: how does the platform best handle natural / business keys, and can it do it with massive volumes? and can it do it in true real time???

    These are the test cases that must be proven out, before a solution (such as the one suggested in this blog entry) can actually be applied properly.

    Just my two cents,.

    Hope this helps,
    Dan Linstedt

     
  2. Roelant Vos says:

    Thanks Dan. I’d like to think that discussions like these help in making the right decision (a platform decision as you indicate here) depending on the scenario. You see a lot of blanket application of standards and this may help in pointing out that it’s all about options and considerations to suit the intended outcome.

     

Leave a Reply