101 uses for ETL job parameters

This is my third post in what I am now calling job parameter week. In future it would be better if I came up with an idea for a theme week at the start of the week instead of at the end. I'll chalk that one up to experience.

Welcome to job parameter theme week, where we bring job parameters to life. (Note to marketing department: please provide better mottos for future theme weeks.) I have tracked back to my previous two job parameter posts from this week: on Monday we looked at Project Specific Job Parameters and saw how they make parameter management a lot easier, and on Wednesday we looked at the exposure of unencrypted passwords by the NSW Police.

Today I am going to talk about what I use job parameters for.

ETL job parameters are values you can change when you run the job, as opposed to properties you can only change by modifying and recompiling the job. I don't know if I am going to get to 101, but I thought it was an eye-catching title.

Static Parameters
In programmer speak you could also call these constants. They are values you never expect to change. So why make them a job parameter? Because they might change, or because you don't want to have to try and remember the value every time you use it in a job.

In my current project I have a static parameter called MAX_DATE which holds a high end date value of 2999-12-31 23:59:00. I use it on tables that track variant data, such as a type 2 slowly changing dimension table, where it provides the high end date for an open or current row. This parameter gets copied into all jobs that update dimension tables. The good thing about having it around is that I don't need to remember the value of the field; I just enter MAX_DATE into my transformer.
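As a minimal sketch of what that looks like, a transformer constraint that keeps only the open rows can compare against the parameter by name instead of a hard-coded timestamp (the lnk_dim link and ROW_END_TS column here are hypothetical):

```
lnk_dim.ROW_END_TS = MAX_DATE
```

In user-defined SQL the same parameter would be referenced with the #MAX_DATE# substitution syntax instead.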

Another good thing is that come July of 2999, when a project team is assembled to address Y3K problems, they will only have to update this date in one place and will not have to recompile any jobs.

Slowly Changing Parameters
The most common example is the database password that **should** be changed every month. At my current project the DBA team changes the password on the database and the DataStage support team changes the project specific environment variable. I would prefer to see IBM provide a tool that lets users maintain these environment variables without going through so many Administrator screens, so passwords can be changed more easily.
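For what it's worth, the dsadmin command line client can script this kind of change; a minimal sketch, assuming a variable named DB_PASSWORD and a project named MyProject (the exact dsadmin options vary by version, so treat this as an assumption and check your documentation):

```sh
# Update a project-level environment variable from the command line.
# Variable, value and project names are hypothetical; the connection
# and authentication options for your DataStage version are omitted.
dsadmin -envset DB_PASSWORD -value "N3wPassw0rd" MyProject
```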

Environmental Parameters
These are the job parameters that change value as you move from one environment to another, for example from development into testing and into production. Typical values that change are directory locations and database login details.
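A minimal sketch of how a scheduler wrapper might feed these in, assuming hypothetical parameter names SRC_DIR and DB_USER, a project called MyProject and a job called LoadCustomerDim:

```sh
# Pick environment-specific values, then pass them as job parameters.
# ENV_NAME would be set per machine by the scheduler; the directories
# and logins are made up for illustration.
case "$ENV_NAME" in
  dev)  SRC_DIR=/data/dev/landing;  DB_USER=etl_dev  ;;
  test) SRC_DIR=/data/test/landing; DB_USER=etl_test ;;
  prod) SRC_DIR=/data/prod/landing; DB_USER=etl_prod ;;
esac

dsjob -run -param SRC_DIR="$SRC_DIR" -param DB_USER="$DB_USER" \
      MyProject LoadCustomerDim
```

Project specific environment variables achieve the same thing without the wrapper, since each project carries its own values.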

The Customer Keeps Changing Their Mind Parameters
On a previous project I had a QualityStage plugin cleansing names and addresses for delivery of marketing campaigns to mail houses. During testing there were some doubts about company name cleansing and personal name cleansing; sometimes the cleansed name was worse than the raw name.

There was much umming and aahing in meetings as to whether to turn it on. I created a set of true/false job parameter flags that turned personal name cleansing, company name cleansing and address cleansing on and off.

These parameters were used in a transformer that wrote out the final fields and had both original and cleansed fields to choose from. Now, in production, the set of flags defines what type of cleansing the job performs on each run.
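A minimal sketch of one such output derivation, assuming a flag parameter named CLEANSE_NAME set to TRUE or FALSE and a hypothetical input link lnk_qs carrying both versions of the field:

```
If CLEANSE_NAME = "TRUE" Then lnk_qs.CleansedName Else lnk_qs.RawName
```

Each output field follows the same pattern with its own flag, so any mix of cleansing can be switched on without recompiling.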

Dynamic Parameters
These are parameters that get changed with every job run.
- A shell script starts a job with the dsjob command; it can compute a dynamic value in the Unix scripting language and pass it in on the dsjob command line.
- A sequence job calls parallel jobs that require a dynamic parameter; the value can be set in a User Variables stage or in the Job Activity stage using BASIC commands or a BASIC routine (see the sketch after this list).
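For the routine option, here is a minimal job-control sketch in DataStage BASIC, with hypothetical job and parameter names (ProcId would be derived earlier in the routine):

```
* Attach the target job, set the dynamic parameter, then run it.
hJob = DSAttachJob("LoadCustomerDim", DSJ.ERRFATAL)
ErrCode = DSSetParam(hJob, "PROCESS_ID", ProcId)
ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)
```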

Currently I have two parameters, PROCESS_ID and PROCESS_DATE. The first is a unique number generated for every execution of a job. It goes into the ODS tables to indicate which job execution loaded the data, so we can roll back loads or identify data volumes. PROCESS_DATE is the date of the delta we are processing, since we do daily loads.

The PROCESS_DATE is retrieved from a database table via a Unix script. The script uses a simple DB2 command line statement to retrieve the date, then uses the dsjob command to set the parameter and call a sequence job. It also populates PROCESS_ID with an eight-character string representation of the PROCESS_DATE.
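A minimal sketch of such a wrapper, assuming a hypothetical control table etl.control, database DWDB, project MyProject and sequence job seqDailyLoad:

```sh
#!/bin/sh
# Fetch the delta date from a control table, derive PROCESS_ID from it,
# then launch the sequence job. All object names here are hypothetical.
db2 connect to DWDB > /dev/null
PROCESS_DATE=$(db2 -x "SELECT process_date FROM etl.control WHERE job_group = 'DAILY'" | tr -d ' ')
PROCESS_ID=$(echo "$PROCESS_DATE" | tr -d '-')   # e.g. 2999-12-31 -> 29991231
db2 connect reset > /dev/null

dsjob -run -param PROCESS_DATE="$PROCESS_DATE" \
      -param PROCESS_ID="$PROCESS_ID" \
      MyProject seqDailyLoad
```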

The sequence job uses a User Variables stage to append a unique two-character sequence job code to the PROCESS_ID, making it unique for that sequence job. This user variable is then used in each Job Activity stage with an additional two-character code that is unique for each Job Activity stage within the sequence job. Therefore each parallel or server job across the entire project gets a unique PROCESS_ID each time it runs.
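A minimal sketch of the two expressions involved, with a hypothetical User Variables activity named uvars and made-up codes. The first line is the uvars derivation, which appends the sequence job's code to the incoming parameter; the second is the PROCESS_ID value expression in a Job Activity stage, which appends that activity's own code:

```
PROCESS_ID : "S1"
uvars.ProcessID : "01"
```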