Loading Messages into ETL staging tables

Messages are composed of one header and one or more segments which hold the actual data. These are demands a DWH should set for loading messages into ETL staging tables:

  • The header should contain information about the version of the message definition and version of the header, the source system where the message originated from and a description.
  • The message segments can occur not, one or many times. This depends on the type of information the segment is carrying. The topmost segment is usually the information about event which triggers the message in the first place. This segment should contain the datetime event which indicates the moment the event was triggered by the change in the source system data. It should also contain information if the information is an update of a previous record, a delete or an insert. 

Other segments carry the specific information for which the message was modelled.

The way the staging tables for a message are modelled depends on the cardinality of the message. Every segment has an indicator of how many times it can occur (repeat) as part of a message. This should be documented in the message definition and can at least be viewed in the XSD of the message.

An example of message definitions how they may be documented in the Word or Excel (if done properly by system control):

Level Segment Name Minimum Repeats Maximum Repeats Remarks
0 Event 1 1 Message triggering event
1 InsertUpdateDelete_Flag 1 1 Change in data
3 Shipment 1 1 Main message container
3.1 NonBackHaul 0 1 Regular shipment for order
3.2 BackHaul 0 1 Return order shipment
3.3 Unit 1 Unbounded Groups of items in shipment
3.3.1 Item 0 Unbounded Items in shipment

 

Example of the message XSD:
(note: same example as above but in different –XSD– format)

As you can see in both presentations of the same message: segments one up until three occur only once. It goes without saying that a header also occurs only one time per message. The idea to use when loading messages into ETL staging tables is using the message cardinality to define a minimum number of separate staging tables for the message.

Because of the cardinality in the example message, the group of single-recurring segments and the header should be be modeled as part of a header table / highest level table of the staging message structure!

The levels 3.1 and 3.2 also have a maximum occurrence of one and therefore can also be integrated in the same table.

Because they do not have to contain data (i.e. the segment itself may be empty), all the attributes that belong to these segments could be considered as nullable. However, I would advice against using constraints in the staging area because the fewer constraints are implemented, the less chance of error there is. And in datawarehousing it’s usually better to continue processing, especially in the staging layer.

I would suggest to disable every attribute in the message staging tables to try to achieve a faultless processing. This is especially important with messages because (depending on the broker architecture) it may not be possible to resend the message (easily). The one exception would be the the header ID in the header table itself.

The segment 3.3 (ShipUnit in this example) should always be present once but possibly many times (unbounded or ∞) so this will have to be modeled in a separate table. The same is true for the last level (3.3.1) which is provides further detail on 3.3).

The end result for this message will be three tables. One header table which will always be modeled and two segment tables:

  • MESSAGE_<message number>_HEADER  (header tables and segments 0 up until 3.2)
  • MESSAGE_<message number>_<alias> (segment 3.3)
  • MESSAGE_<message number>_<alias> (segment 3.3.1)

The header table will always contain the following attributes:

  • HEADER_ID, meaningless key sequence which is defined and provided by the DWH staging area. The same ID is also inserted in the segment tables so the different segments can be related to one another.
  • MESSAGE_ID, if the message has its own ID this can be stored here.
  • DML_ACTION indicates if the records concerns an update, insert or delete.
  • DATETIME_SOURCE_EVENT, the datetime of the source event trigger.
  • DATETIME_INSERTED, the sysdate when the record is inserted into the message staging table.

The segment tables (varying alias) always contain the following attributes:

  • HEADER_ID (inherited from the header table)
  • DATETIME_INSERTED, the sysdate when the record is inserted into the message staging table.
  • The (disabled) primary key in the segment table always contain the header ID. For the header table this is the only attribute of the primary key, for the subsequent segment tables the primary key consists of the header ID and any logical key if available.

Ravos

Roelant Vos

You may also like...

Leave a Reply