Pipeline Partitioning

A pipeline consists of a source qualifier and all the transformations and targets that receive data from that source qualifier.

You can set the following attributes to partition a pipeline:
- Partition points. Partition points mark thread boundaries and divide the pipeline into stages. The Integration Service redistributes rows of data at partition points.
- Number of partitions. A partition is a pipeline stage that executes in a single thread. If you purchase the Partitioning option, you can set the number of partitions at any partition point. When you increase or decrease the number of partitions at any partition point, the Workflow Manager increases or decreases the number of partitions at all partition points in the pipeline.
- Partition types. The Integration Service creates a default partition type at each partition point. If you have the Partitioning option, you can change the partition type. The partition type determines how the Integration Service redistributes data across partition points.
 

 

A partition is a pipeline stage that executes in a single reader, transformation, or writer thread. The number of partitions in any pipeline stage equals the number of threads in the stage. By default, the Integration Service creates one partition in every pipeline stage.

 

You can define up to 64 partitions at any partition point in a pipeline. The number of partitions remains consistent throughout the pipeline. If you define three partitions at any partition point, the Workflow Manager creates three partitions at all other partition points in the pipeline. In certain circumstances, the number of partitions in the pipeline must be set to one.
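To make the thread arithmetic concrete, here is a minimal sketch (plain Python, not Informatica code; the stage names and partition count are hypothetical assumptions) of how partitions map to threads:

```python
# Minimal sketch of the stage/partition/thread relationship described above.
# Each pipeline stage runs one thread per partition.
PARTITIONS = 3
stages = ["reader (source qualifier)", "transformation (aggregator)", "writer (target)"]

for stage in stages:
    print(f"{stage}: {PARTITIONS} threads")

# With 3 partitions and 3 stages, the session runs 9 threads in total.
print(f"total session threads: {PARTITIONS * len(stages)}")
```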

 

You can define the following partition types in the Workflow Manager:
- Database partitioning. The Integration Service queries the IBM DB2 or Oracle database system for table partition information. It reads partitioned data from the corresponding nodes in the database. You can use database partitioning with Oracle or IBM DB2 source instances on a multi-node tablespace. You can use database partitioning with DB2 targets.
- Hash auto-keys. The Integration Service uses a hash function to group rows of data among partitions. The Integration Service groups the data based on a partition key, using all grouped or sorted ports as a compound partition key. You may need to use hash auto-keys partitioning at Rank, Sorter, and unsorted Aggregator transformations.
- Hash user keys. The Integration Service uses a hash function to group rows of data among partitions. You define the number of ports to generate the partition key.
- Key range. With key range partitioning, the Integration Service distributes rows of data based on a port or set of ports that you define as the partition key. For each port, you define a range of values. The Integration Service uses the key and ranges to send rows to the appropriate partition. Use key range partitioning when the sources or targets in the pipeline are partitioned by key range.
- Pass-through. In pass-through partitioning, the Integration Service processes data without redistributing rows among partitions. All rows in a single partition stay in that partition after crossing a pass-through partition point. Choose pass-through partitioning when you want to create an additional pipeline stage to improve performance but do not want to change the distribution of data across partitions.
- Round-robin. The Integration Service distributes data evenly among all partitions. Use round-robin partitioning where you want each partition to process approximately the same number of rows. A conceptual sketch of these routing rules follows this list.
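The sketch below pictures the main partition types as simple row-routing functions. It is a conceptual illustration only, not Informatica internals; the partition count, key ports, and ranges are made-up assumptions.

```python
# Conceptual sketch of how the main partition types route a row to a partition.
# Not Informatica code; partition count, key ports, and ranges are illustrative.
from itertools import count

NUM_PARTITIONS = 3

def hash_partition(row, key_ports):
    """Hash auto-keys / hash user keys: rows with equal key values
    always land in the same partition."""
    key = tuple(row[port] for port in key_ports)
    return hash(key) % NUM_PARTITIONS

def key_range_partition(row, port, ranges):
    """Key range: each partition owns a [low, high) range of key values."""
    value = row[port]
    for partition, (low, high) in enumerate(ranges):
        if low <= value < high:
            return partition
    raise ValueError(f"{value} falls outside all defined ranges")

_rr = count()
def round_robin_partition(_row):
    """Round-robin: rows are dealt out evenly, regardless of their content."""
    return next(_rr) % NUM_PARTITIONS

def pass_through_partition(_row, current_partition):
    """Pass-through: the row stays in whatever partition it arrived in."""
    return current_partition

row = {"customer_id": 42, "amount": 19.99}
print(hash_partition(row, ["customer_id"]))
print(key_range_partition(row, "customer_id", [(0, 100), (100, 1000), (1000, 10**9)]))
print(round_robin_partition(row))
print(pass_through_partition(row, current_partition=1))
```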

 

1) The unit of partitioning is the pipeline. Partition points divide the pipeline into stages, and at each partition point you can set the partition type and the number of partitions. The number of partitions must be the same throughout the pipeline. At a partition point, the Integration Service redistributes rows among the partitions according to the partition type, and the redistributed data is then processed by the transformation at that partition point. The partitioning of the data stays the same until it reaches the next partition point; if the next partition point uses a partition type other than pass-through (or other than the same partition type as the previous partition point), the data is repartitioned there.
At some pipeline stages you can place all of the source data in a single partition and leave the other partitions empty. This lets you sort all of the data within one partition and then pass the sorted data to a transformation that requires sorted input, such as a sorted Joiner transformation or a sorted Aggregator.
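As a toy illustration of this single-active-partition pattern (again plain Python, not Informatica internals; the data and partition count are made up), every row can be routed to partition 0 so that one thread sees, and can sort, the entire data set:

```python
# Toy sketch: route every row to partition 0 so a single partition holds
# (and can sort) the whole data set; the remaining partitions stay empty.
NUM_PARTITIONS = 3
rows = [{"order_id": 7}, {"order_id": 2}, {"order_id": 5}]

partitions = [[] for _ in range(NUM_PARTITIONS)]
for row in rows:
    partitions[0].append(row)  # everything lands in partition 0

partitions[0].sort(key=lambda r: r["order_id"])
print(partitions)
# [[{'order_id': 2}, {'order_id': 5}, {'order_id': 7}], [], []]
# A downstream sorted Joiner or sorted Aggregator fed from partition 0
# now receives globally sorted data.
```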

2) Some transformations have partition points set on them by default, for example:
Source Qualifier, Normalizer
Controls how the Integration Service extracts data from the source and passes it to the source qualifier.
You cannot delete this partition point.

Rank, Unsorted Aggregator:
Ensures that the Integration Service groups rows properly before it sends them to the transformation.
You can delete these partition points if the pipeline contains only one partition or if the Integration Service passes all rows in a group to a single partition before they enter the transformation.

Target Instances
Controls how the writer passes data to the targets.
You cannot delete this partition point.

Multiple Input Group
The Workflow Manager creates a partition point at a multiple input group transformation when it is configured to process each partition with one thread,
or when a downstream Custom transformation with one input group is configured to process each partition with one thread.
For example, the Workflow Manager creates a partition point at a Joiner transformation that is connected to a downstream Custom transformation configured to use one thread per partition.
This ensures that the Integration Service uses one thread to process each partition at a Custom transformation that requires one thread per partition.
You cannot delete this partition point.

3) Some transformations that need the data reorganized require you to set the partition point and partition type yourself, to ensure that the data passed to the transformation meets its requirements, such as grouped data, sorted data, or cached data. If the partition type or the number of partitions is configured incorrectly, the session fails.
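As a rough illustration of why the partition type matters for a transformation that groups data (a hypothetical Python example, not Informatica behavior), compare splitting a group's rows across partitions with keeping each group in one partition:

```python
# Illustrative sketch: aggregating per partition when a group's rows are
# split across partitions vs. kept together by hashing on the group key.
from collections import defaultdict

rows = [("east", 10), ("east", 20), ("west", 5), ("west", 7)]
NUM_PARTITIONS = 2

def aggregate(partitions):
    """Sum the amount per group independently within each partition."""
    results = []
    for part in partitions:
        sums = defaultdict(int)
        for region, amount in part:
            sums[region] += amount
        results.append(dict(sums))
    return results

# Round-robin: rows of the same group land in different partitions, so each
# partition reports only a partial sum for "east" and "west".
round_robin = [rows[0::2], rows[1::2]]
print(aggregate(round_robin))  # [{'east': 10, 'west': 5}, {'east': 20, 'west': 7}]

# Hash on the group key: every row of a group stays in one partition, so each
# group's sum is complete (exact placement depends on the hash values).
hashed = [[] for _ in range(NUM_PARTITIONS)]
for row in rows:
    hashed[hash(row[0]) % NUM_PARTITIONS].append(row)
print(aggregate(hashed))
```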

 

 
