DataStage Parallel Routines

Abstract

 

DataStage is a powerful ETL tool with many built-in stages and routines that cover most required functionality. For the things DataStage EE cannot do, parallel routines can be written in C++. Parallel routines are invoked by parallel jobs.

Unlike a server routine, a parallel routine is mainly used in the Transformer stage and cannot be used in a job sequence as a job control method.

This paper introduces how to create and use a parallel routine in a parallel job.

 

Introduction

 

We can use the Parallel Routine window to create, view, or edit a parallel job routine.

There are two types of parallel routine:

  • External Function.

      This calls a function from a UNIX shared library, and may be used anywhere an expression can be defined. Any external function defined appears in the expression editor operand menu under Parallel Routines.

  • External Before/After Routine.

      This calls a routine from a UNIX shared library, and can be specified in the Triggers page of a transformer stage Properties dialog box.

 

Tutorial of creating and invoking a Parallel Routine

 

Parallel routines are C++ components built and compiled outside DataStage. Note that they must be compiled as C++ components, not C. In other words, with the GNU toolchain the C/C++ program must be compiled with g++ rather than gcc.

 

Here's the typical sequence of steps for creating a DataStage parallel routine:

Create --> Compile --> Link --> Execute

 

1) Create

 

Create a C/C++ program with main().

Test it and, if the test succeeds, remove the main().

For example, take the following C file, ParaTest.c:

#include <stdio.h>

/* Return i unchanged if it is greater than 5; otherwise add 5. */
int trans(int i)
{
   if(i>5)
     return i;
   else
     return i+5;
}

int main()
{
   int a = 6;
   printf("%d", trans(a));
   return 0;
}

Test the program. If it runs successfully, rewrite the program without main():

#include <stdio.h>

int trans(int i)
{
   if(i>5)
     return i;
   else
     return i+5;
}

Save it as IntTest.c.
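
Rather than deleting main() by hand after testing, one option is to keep the test driver behind a preprocessor guard. The sketch below assumes a macro named TEST_MAIN (a hypothetical choice for this example, not something DataStage defines): compiling with -DTEST_MAIN builds a standalone test program, while compiling without it leaves only trans in the object file.

```cpp
// IntTest.c with an optional test driver (sketch).
// TEST_MAIN is a hypothetical macro name chosen for this example.
#include <cassert>
#include <cstdio>

// Values greater than 5 pass through unchanged;
// all other values have 5 added.
int trans(int i)
{
    if (i > 5)
        return i;
    else
        return i + 5;
}

#ifdef TEST_MAIN
// Built only with: g++ -DTEST_MAIN IntTest.c -o IntTest
int main()
{
    assert(trans(6) == 6);   // greater than 5: unchanged
    assert(trans(5) == 10);  // not greater than 5: 5 added
    assert(trans(1) == 6);
    std::printf("%d\n", trans(6));
    return 0;
}
#endif
```

With this layout the compile command in the next step stays the same, because main() is left out unless TEST_MAIN is defined.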

 

2) Compile

 

Compile using the compiler options specified under "APT_COMPILEOPT":

g++ -O -fPIC -Wno-deprecated -c IntTest.c

This generates an object file named IntTest.o.

Note: The compiler and compiler options can be found in "DataStage --> Administrator --> Properties --> Environment --> Parallel --> Compiler". Compiling creates an object (*.o) file; place this file in the directory that the routine's Library Path will point to.

 

3) Link

Use the Parallel Routine window to create, view, or edit a parallel job routine.

Link the object file IntTest.o to a DataStage parallel routine by making the relevant entries in the General tab:

 

Routine Name: myRoutine

Type: External Function

Object Type: Object

External subroutine name: trans (the function name specified inside your C/C++ program)

Library Path: /home/dsadm/4Train/ParaRoutine/IntTest.o

Also specify the Return Type, and if any input parameters are to be passed, specify them in the Arguments tab.

Because the function returns an int value, we choose int as the return type.

The Arguments tab:

The job passes one argument to the function trans; we can name the argument i.

The native type is the type of the argument passed by the job that invokes the routine. The default type is int. If the data handled in the job is char or another type, we should set the type accordingly, for example char*.
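
To illustrate how the native type in the Arguments tab lines up with the C++ signature, here is a minimal sketch with two hypothetical functions (neither is part of the original job). The first takes an int; the second takes a char* for a char/varchar column. In the routine definition, the argument's native type would be int for the first and char* for the second.

```cpp
#include <cstring>

// Hypothetical example: argument native type int, return type int.
int addOffset(int i)
{
    return i + 5;
}

// Hypothetical example: argument native type char*, return type int.
// Returns the length of the string, or 0 for a null pointer.
int textLength(char *s)
{
    if (s == 0)
        return 0;
    return static_cast<int>(std::strlen(s));
}
```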

 

4) Execute

 

The parallel routine is now available inside your jobs.

Create a test job and call the parallel routine from a Transformer stage, in an output column derivation. Then compile and run the job.

Create a job named paraRoutine1 as the following snapshot shows:

 

 

After the job has run successfully, we can view the result. As expected, 5 has been added to every value that is not greater than 5.

 

General knowledge and practical usage of parallel routine

 

The section above gave a brief idea of how to create and use a parallel routine. We now describe the General page of the Parallel Routine window.

 

Note: Functions must be compiled with a C++ compiler (not a C compiler). Example parallel routines are supplied on the Client Installation CD in the directory Samples/TrxExternalFunctions. The readme file explains how to use the examples on each platform.

The General page contains the following controls and fields:

Routine Name

The name of the routine as it will appear in the repository.

Type

Choose External Function if this routine is calling a function to include in a transformer expression. Choose External Before/After Routine if you are defining a routine to execute as a transformer stage before/after routine.

Object Type

Choose Library or Object. This property specifies how the C function is linked in the job. If you choose Library, the function is not linked into the job and you must ensure that the shared library is available at run time. For the Library invocation method the routine must be provided in a shared library rather than an object file. If you choose Object the function is linked into the job, and so does not need to be available at run time. Note that, if you use the Object option, and subsequently update the function, the job will need to be recompiled to pick up the update. If you choose the Library option, you must enter the pathname of the shared library file in the Library path field. If you choose the Object option you must enter the pathname of the object file in the Library path field.

External subroutine name

The C function that this routine is calling (this property must be the name of a valid routine in a shared library).

Return type

Choose the type of the value that the function will return. The drop-down list offers a choice of native C types. This is unavailable for External Before/After Routines, which do not return a value.

Library path

If you have specified the Library option, type or browse on the server for the pathname of the shared library that contains the function. This is used at compile time to locate the function. The pathname should be the exact name of the library or object file, and must have the prefix lib and the appropriate suffix. For example, /disk1/userlibs/libMyFuncs.so, /disk1/userlibs/MyStaticFuncs.o. Suffixes are as follows:

Solaris - .so or .a

AIX - .so or .a

HPUX - .a or .sl

Tru64 - .so or .a

If you have specified the Object option, enter the pathname of the object file. Typically the file will be suffixed with .o. This file must exist and be a valid object file for the linker.

Short description

Type an optional brief description of the routine.

Long description

Type an optional detailed description of the routine.

Example job:

This example shows how to handle char/varchar data.

The data is as follows:

"Parallel1","a"

"Parallel2","b"

"Parallel3","c"

We need to append the string " Hello Testing" to column 1, so that the result looks like "Parallel1 Hello Testing". To do this, we write a program named ParaObj3.cpp that concatenates the strings:

#include <cstring>

using namespace std;

// Return a new string: the input s followed by " Hello Testing".
char *ParaObj(char *s)
{
    const char *append = " Hello Testing";

    // Appending directly to s with strncat(s, append, 14) caused a
    // segmentation fault, so the result is built in a new buffer.
    // Note: this buffer is allocated with new[] and is never freed here.
    char *OutStr = new char[strlen(s) + strlen(append) + 1];
    strcpy(OutStr, s);
    strncat(OutStr, append, 14);
    return OutStr;
}
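
The segmentation faults noted in the original comments are consistent with appending directly to the input buffer s, which has no spare room for the extra text. A minimal sketch of a safer pattern, which sizes the output buffer from the actual string lengths rather than a fixed 50 bytes, might look like this. The names appendGreeting and kGreeting are hypothetical, and how DataStage manages the returned buffer is not covered here.

```cpp
#include <cstring>

// Hypothetical constant for the text to append.
static const char kGreeting[] = " Hello Testing";

// Hypothetical variant of ParaObj: returns a newly allocated string
// holding s followed by kGreeting. The caller owns the buffer and
// releases it with delete[].
char *appendGreeting(const char *s)
{
    if (s == 0)
        s = "";  // treat a null input as an empty string

    // Room for both strings plus the terminating '\0'.
    std::size_t len = std::strlen(s) + std::strlen(kGreeting) + 1;
    char *out = new char[len];

    std::strcpy(out, s);
    std::strcat(out, kGreeting);
    return out;
}
```

For the sample row "Parallel1", this returns "Parallel1 Hello Testing".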

Use the following commands to generate the shared library libParaObj3.so:

g++ -O -fPIC -c ParaObj3.cpp -o ParaObj3.o

g++ -shared ParaObj3.o -o libParaObj3.so

 

During development, we usually write a makefile for this, so that a single command compiles the source code and generates the object file and then the library file.

 

Then we create a parallel routine named ParaRoutineTest.

Choose Library as the Object Type and char* as the return type.

The native type of the argument should also be char*; it corresponds to the function's argument type.

 

We create a job named ParaRoutine to test the routine. At first the job will not run successfully.

The reason is that the routine now uses a shared library, so the library search path (LD_LIBRARY_PATH) must include the directory that contains it.

One method is to add the library directory to the LD_LIBRARY_PATH variable:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/dsadm/4Train/ParaRoutine

Another method is to set the LD_LIBRARY_PATH variable in Administrator->Project Name->General->Environment.
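
If the job still cannot find the library, it can help to confirm that the directory really is in the search path. The following is a small diagnostic sketch, not part of DataStage. The helper names pathListContains and inLibraryPath are hypothetical, and the check only inspects the environment of the process that runs it, which may differ from the environment the DataStage engine uses.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: returns true if dir is one of the
// colon-separated entries in pathList.
bool pathListContains(const std::string &pathList, const std::string &dir)
{
    std::string::size_type start = 0;
    while (start <= pathList.size()) {
        std::string::size_type end = pathList.find(':', start);
        if (end == std::string::npos)
            end = pathList.size();
        if (pathList.compare(start, end - start, dir) == 0)
            return true;
        start = end + 1;
    }
    return false;
}

// Hypothetical helper: reports whether dir appears in the
// LD_LIBRARY_PATH of the current process.
bool inLibraryPath(const std::string &dir)
{
    const char *value = std::getenv("LD_LIBRARY_PATH");
    return value != 0 && pathListContains(value, dir);
}
```

For example, inLibraryPath("/home/dsadm/4Train/ParaRoutine") would be expected to return true after the export command above has been run in the same shell.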

 

Now rerun the job and view the result:
