6 Responses

  1. hennie7863 says:

    Hi, this looks great. I’ve built my own data integration model, but not as extensive as yours. Thanks for sharing.
    Hennie

     
  2. 00 says:

    Hi Ravos, looks nice. I am trying to fit this in with a SQL Server SSIS approach.

    Could you explain a couple of things about this?

    1. You say “special mention has to be made for the parameter tables, these can be used to provide changing input parameters in a way that allows for retracing any used parameters in the ETL processes. In some cases input (parameter) files are created at runtime and these tables support this while retaining history of the used variables or parameters.” However, I can’t see how, from this structure, you can record the runtime module parameters; it appears that the structure can only support recording design-time values. I would have expected a relationship from MD_Parameters to MD_Module_Instance?

    2.a It would be useful to see the structure populated in some areas. For example, how do you decide what is a module and what is a batch? What kinds of things are architectural layers (I guess staging, core, loader?) and areas (ODS, BDS, etc.)? I have built SSIS packages (= batch?) that contain sections for Extract and Load, so this is confusing me (there is a 1-to-many relationship here).
    2.b It would also be good to see the kinds of things you are using for ModuleType, ExecutionStatus, EventType, etc.

     
  3. Roelant Vos says:

    Hi Brett,

    Thank you for your reply; you are absolutely right, the structure shown would only support the capture of parameter values at design time. Thanks for pointing that out. It’s more of an error in the documentation than in the design because (upon rethinking) there does not seem to be a strong reason to store the runtime parameters / variables separately. They will be the same most of the time, except for things like the run datetime and start and end dates, which are stored in the Data Warehouse anyway. The parameter table would hold information that is presented to the ETL process without wanting to hard-code it directly into the ETL. Examples could be a specific commit point, database connections, the names of the mappings/workflows if required, mapping-specific flags that may need to be maintained, and so forth. The requirement for this varies from tool to tool, but it can give a good overview of the ‘quirky’ filters or special logic that sits in some mappings.
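
    As a rough sketch (the table and column names here are illustrative only, not the actual DDL), the parameter table could look something like this:

    ```sql
    -- Sketch only: table and column names are illustrative, not the actual model.
    CREATE TABLE MD_PARAMETERS (
        PARAMETER_ID    int IDENTITY(1,1) NOT NULL PRIMARY KEY,
        MODULE_ID       int           NOT NULL,  -- the module the value applies to
        PARAMETER_NAME  varchar(100)  NOT NULL,  -- e.g. 'COMMIT_POINT', 'SOURCE_CONNECTION'
        PARAMETER_VALUE varchar(4000) NOT NULL,
        EFFECTIVE_DATE  datetime      NOT NULL DEFAULT GETDATE(),
        EXPIRY_DATE     datetime      NULL       -- closed off when the value changes, to retain history
    );

    -- The current parameter set for one module, ready to be written to a parameter file:
    SELECT PARAMETER_NAME, PARAMETER_VALUE
    FROM   MD_PARAMETERS
    WHERE  MODULE_ID = 42
    AND    EXPIRY_DATE IS NULL;
    ```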

    I’ll update the post & documentation. Meanwhile, do you see a good case for maintaining parameter values at runtime, other than the ones already captured in the DWH?

    Regarding the structure, I was planning to add documentation as part of the EDW structure on the right side of the page, to write down the reasoning for every entity. After all, it’s all meant to become part of a collection of best practices. A module would be the smallest executable entity; it can be a script, mapping, file transfer and so forth. A batch is a sequence or grouping of modules. The layers and areas are those defined for the EDW (staging layer with areas, integration layer with areas), so modules and batches are ‘labelled’ as part of the DWH architecture.
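
    As a rough sketch of how these entities could hang together (entity and attribute names are illustrative only, not the final DDL):

    ```sql
    -- Sketch only: entity and attribute names are assumed, not taken from the actual model.
    CREATE TABLE MD_MODULE (
        MODULE_ID          int          NOT NULL PRIMARY KEY,
        MODULE_NAME        varchar(100) NOT NULL, -- smallest executable unit: mapping, script, file transfer
        MODULE_TYPE        varchar(50)  NOT NULL, -- e.g. 'Mapping', 'Script', 'File Transfer'
        ARCHITECTURE_LAYER varchar(50)  NOT NULL, -- e.g. 'Staging', 'Integration'
        AREA               varchar(50)  NOT NULL  -- e.g. 'Staging Area', 'History Area'
    );

    CREATE TABLE MD_BATCH (
        BATCH_ID   int          NOT NULL PRIMARY KEY,
        BATCH_NAME varchar(100) NOT NULL -- a sequence or grouping of modules
    );

    -- Modelled here as many-to-many; a plain BATCH_ID foreign key on MD_MODULE
    -- would do if a module only ever belongs to one batch.
    CREATE TABLE MD_BATCH_MODULE (
        BATCH_ID        int NOT NULL REFERENCES MD_BATCH (BATCH_ID),
        MODULE_ID       int NOT NULL REFERENCES MD_MODULE (MODULE_ID),
        SEQUENCE_NUMBER int NOT NULL, -- position of the module within the batch
        PRIMARY KEY (BATCH_ID, MODULE_ID)
    );
    ```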

    In SSIS terms there is no clear distinction, because a package can call other packages. I would design the packages as standard mappings and label the ‘wrapper’ packages as batches. As a side note, things are speeding up on the ETL generation proof of concept, and we are using SSIS as a first attempt to get everything to work. This is because SSIS packages can be created (generated) programmatically more easily than in most ETL tools. If I can match the OWB demo that would be great. Once I get some packages generated I’d like to compare them with your structure, if possible, to see if I forgot anything.

    Other than that, I am planning to update the documentation over the week / weekend and continue on the DDL, which should include the types and such.

    Kind regards,
    Roelant Vos

     
  4. 00 says:

    Hi Roelant, thanks for the response. I got the feeling that you weren’t using SSIS, so I will detail a bit more of my ETL architecture here, which may or may not be interesting for you to consider. I’m open to suggestions and I’d quite like to tie your work up with this; however, as I am learning with DWH, there are at least 4 million ways to do everything, with pros and cons for each.

    For the parameters, I’m still trying to establish how this would fit our case, so I can’t immediately tell if runtime values of parameters would be useful. If parameters are really just connection strings and mainly fixed or inferred values it may not be of much value, especially if this information can be obtained in better form from the data itself (lineage in the data tables, for example, is better than tracing the connection string).

    Currently I am using SSIS Configurations to configure the packages and SSIS Logging to catch the events, so I think this effectively replaces the parameters and event log in your design, and I am trying to connect those into the model. The other thing I would mention is that I am using uniqueidentifier data types for the instance, batch and module PKs, as SSIS naturally has these IDs for all the packages and uses them in the SSIS log, so I would suggest you consider using these also. As long as you have a GUID generator for any non-SSIS ETL pieces it can be used for both, and then you don’t need two IDs for the SSIS packages/executables. Alternatively, these could become unique-constraint surrogate keys on those tables, which may be more flexible.
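
    As a rough sketch of what I mean (table and column names are just illustrative, not your actual model), either option could look something like this:

    ```sql
    -- Option A (sketch): the SSIS package/executable GUID is the key itself.
    CREATE TABLE MD_MODULE_INSTANCE (
        MODULE_INSTANCE_ID uniqueidentifier NOT NULL DEFAULT NEWID() PRIMARY KEY, -- non-SSIS modules generate their own GUID
        MODULE_ID          int              NOT NULL,
        EXECUTION_STATUS   varchar(20)      NOT NULL,
        START_DATETIME     datetime         NOT NULL,
        END_DATETIME       datetime         NULL
    );

    -- Option B (sketch): keep an integer surrogate key and hold the GUID under a unique constraint.
    CREATE TABLE MD_MODULE_INSTANCE_ALT (
        MODULE_INSTANCE_ID  int IDENTITY(1,1) NOT NULL PRIMARY KEY,
        SSIS_EXECUTION_GUID uniqueidentifier  NOT NULL UNIQUE,
        MODULE_ID           int               NOT NULL,
        EXECUTION_STATUS    varchar(20)       NOT NULL,
        START_DATETIME      datetime          NOT NULL,
        END_DATETIME        datetime          NULL
    );
    ```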

    I am using a self-modified version of the Enhanced Execute Package Task that allows packaging packages up and reusing them more easily than standard SSIS. So I model each destination table as an Entity package with a standard template of containers for extract, transform, load and process, so actually it would be the GUIDs of those containers that would become a module. I then in turn have a “Full” package which has the same set of containers (ETLP), but which calls each table’s package (with variables to ensure only that container is run). The dependencies of individual entities are controlled within the “Full” SSIS package itself (as that is simplest to do with SSIS control flow). This ensures maximum parallelism on load, but means that it would be much more difficult to generate, as the dependencies could become quite complex to ensure referential integrity at the load side in the generated package – however, it could be done.

    Where tables are populated from common extracts, a single extract package is created (usually one for each source system, with variables to control the exact table to run) and this is then used in both the Entity package and the Full package.

    The Full package contains a bool variable for each destination table that can turn on/off the action for that entity, and also bool variables to allow isolating a single layer of the operation (ETLP). This way it is a simple set of SetVar statements to control which entities and layers you want to run, and you can schedule those up as necessary – the dependencies are then automatically followed as dictated by the control flow in SSIS. So actually the Entity packages don’t get executed manually except in a dev environment; they are more of an organisational container for development rather than execution.

    I have also considered using CozyRoc’s enhanced data flow tools to allow building the packages by hand while having the transform metadata defined in data structures. However, there is a cost to this that I would need to justify, and my project currently probably isn’t quite large enough to do so.

    Your approach of generating the entire ETL suite could be better if a rich enough model were built, with extension points to allow safely tweaking the model, and if you believed that the design would deal with all cases. I’m not really experienced enough in this particular domain to be sure that this is practical, but in previous projects I’ve been involved in where the whole process is generated, it is tricky: one issue in the generation can cause a lot of problems, so extension points are required to ensure that workarounds can be employed while the issue is fixed, rather than being dead in the water or having to find a totally new approach. The next version of SSIS will be a bit more friendly with copy/paste etc., so perhaps simply modifying the packages themselves would be OK, and each new generation can then have the tweaks grafted across if they are few; otherwise more formal methods of tweaking might need to be thought through.

    Just some thoughts – looking forward to seeing some more of your guidance documentation.

     
  5. 00 says:

    Oh, and of course, due to the design I put above, it would make sense for the variables that the package was run with to be stored, for both the “DoEntity” flags and the “RunExtract”, “RunTransform” etc. flags. However, I think I would be identifying the batches as the “Entity Packages” and the modules as the “Key Executables” inside each, so this would already be logged for audit purposes. It might still be useful to have a record of the parameter values a run was actually executed with (for example if the package failed), but I think I would probably try to get that information logged in the SSIS event log on startup of each package rather than in a more formal schema for it. Still thinking this one through…
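
    As a rough sketch, assuming SSIS logging to a SQL Server table and that each package raises an information event with its variable values at startup (e.g. from a Script Task), a query like this could reconstruct what was actually run:

    ```sql
    -- Sketch: reading back what a package actually ran with, assuming SSIS logging
    -- to SQL Server. The log table is dbo.sysssislog in SQL Server 2008
    -- (dbo.sysdtslog90 in 2005); variable values would only appear if the package
    -- logs them itself at startup via an information event.
    SELECT  l.executionid,   -- execution GUID, usable as the instance key
            l.source,        -- package or executable (container) name
            l.event,
            l.starttime,
            l.message        -- holds the logged text for OnInformation events
    FROM    dbo.sysssislog l
    WHERE   l.event IN ('PackageStart', 'OnInformation', 'OnError', 'PackageEnd')
    ORDER BY l.starttime;
    ```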

     
  6. Roelant Vos says:

    Hi Brett,

    You’re right, there are absolutely millions of ways to do everything. And to make things worse, all the open toolboxes (database, ETL, BI tool, data model) influence the design. It’s essentially the reason why I started building my personal beliefs and best practices into this ETL generation tool.

    I usually work with Informatica, and there it can be useful to have parameter files created that contain connectivity information, so you don’t have to change anything when moving between environments. But, as always, there are other ways around that.

    I tend to grab the module and batch names and load them into the ETL metadata tables as a source for these parameter files, because that way I only have to document them once. Otherwise I would have to hard-code them in the ETL. But these are smaller concerns; I was wrong to put the focus on the parameters initially 🙂 It does depend on the tool, and I think you’re better off using the SSIS built-in options for that, as you’re doing. Thanks for the info about the generated IDs; they should definitely relate to anything stored outside SSIS. The way I was looking at it is that the module and batch tables are essentially static and are loaded only once. There is no issue in using a GUID for the instances.

    The instance IDs would then be generated from the ETL with some generic, meaningless key generator and relate to the PK of the module or batch. It would be great if an outside module such as a DOS / Unix script could use the same sequencer; it should be possible to call some package for this, but I haven’t really thought about that.
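
    As a rough sketch (procedure and table names are illustrative only, not actual DDL), such a shared generator could be something like:

    ```sql
    -- Sketch of a central instance key generator that any module can call,
    -- whether it is an SSIS package or a DOS / Unix script (e.g. via sqlcmd).
    -- Procedure and table names are illustrative only.
    CREATE PROCEDURE dbo.CREATE_MODULE_INSTANCE
        @MODULE_ID          int,
        @MODULE_INSTANCE_ID uniqueidentifier OUTPUT
    AS
    BEGIN
        SET NOCOUNT ON;
        SET @MODULE_INSTANCE_ID = NEWID();

        INSERT INTO MD_MODULE_INSTANCE
            (MODULE_INSTANCE_ID, MODULE_ID, EXECUTION_STATUS, START_DATETIME)
        VALUES
            (@MODULE_INSTANCE_ID, @MODULE_ID, 'Running', GETDATE());
    END;
    -- A script could reuse the same generator, for example:
    -- sqlcmd -Q "DECLARE @id uniqueidentifier; EXEC dbo.CREATE_MODULE_INSTANCE 42, @id OUTPUT; SELECT @id;"
    ```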

    The idea behind the ETL generation is somewhat ‘brute force’. One of the reasons I started to like hybrid (semi-normalised) modelling approaches is that they make the patterns in ETL easier to implement. The trick is to issue the DWH key very early; this makes it possible to run things in parallel without having to worry about referential integrity (the process includes placeholders). I also try to make the various load jobs as small as possible and schedule them independently, to run as often as possible. The ETL generation is aimed at doing this as well as possible. It is possible to read the source data structure and follow a three-layered setup (as described here) up to and including a generated star or snowflake schema. How generic the ETL is and what the dependencies are is very much related to the model and process steps. That’s why I worked on fixing the process steps first and then used the resulting patterns to generate the ETL. Needless to say, this follows a very fixed approach and cannot be used in all situations because of existing approaches, logic and so forth.
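
    As an illustrative sketch of what issuing the key early could look like (the table and column names are made up for the example):

    ```sql
    -- Illustrative sketch of issuing the DWH key early: any new business key is
    -- inserted as a placeholder before the dependent loads run, so those loads can
    -- run in parallel without referential integrity problems.
    -- Table and column names are made up; CUSTOMER_SK is assumed to be an
    -- identity (or sequence-based) surrogate key issued at this point.
    INSERT INTO DWH.HUB_CUSTOMER (CUSTOMER_BUSINESS_KEY, LOAD_DATETIME, RECORD_SOURCE)
    SELECT DISTINCT
           stg.CUSTOMER_NUMBER,
           GETDATE(),
           'STG_CRM_CUSTOMER'
    FROM   STG.CUSTOMER stg
    WHERE  NOT EXISTS (SELECT 1
                       FROM   DWH.HUB_CUSTOMER hub
                       WHERE  hub.CUSTOMER_BUSINESS_KEY = stg.CUSTOMER_NUMBER);
    ```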

    Data Vault is very straightforward and easy to use, but it doesn’t solve integration or business logic issues. The way I see it, if I can generate all the ‘housekeeping’ bits, such as storing history and key distribution, I have a manageable base DWH to build the logic on. This will (for now) still be custom development to an extent, but at least I don’t have to worry about managing the technical side of things, such as changes in the source systems and EDW, managing deltas and things like that.

    The high road of ETL generation is to describe things in an abstract way and translate these objects to tool-specific code. I’ve worked on this for some time but felt that I first had to finish the basic things. Perhaps in due time I can pick this up again and use it to generate the logic that is still custom development in the current approach.

    There are some very interesting white papers available from HP Labs on this topic (ETL semantics).

    Your SSIS implementation sounds good. My definition of a module is a process that picks up something and writes it somewhere, so that would fit the container in your approach (which is more generic). It is usually a single mapping / process, since we’re trying to define the processes from one or more sources to, typically, one target.

    I am due to give a presentation next week, which takes up most of the time, but I can’t wait to start on the coding for SSIS. I’ll check what our resident SSIS expert thinks as well.

     
