Pipeline Partitioning

A pipeline consists of a source qualifier and all the transformations and targets that receive data from that source qualifier.

You can set the following attributes to partition a pipeline:
• Partition points. Partition points mark thread boundaries and divide the pipeline into stages. The Integration Service redistributes rows of data at partition points.
• Number of partitions. A partition is a pipeline stage that executes in a single thread. If you purchase the Partitioning option, you can set the number of partitions at any partition point. When you increase or decrease the number of partitions at any partition point, the Workflow Manager increases or decreases the number of partitions at all partition points in the pipeline.
• Partition types. The Integration Service creates a default partition type at each partition point. If you have the Partitioning option, you can change the partition type. The partition type determines how the Integration Service redistributes data across partition points.
 

 

A partition is a pipeline stage that executes in a single reader, transformation, or writer thread. The number of partitions in any pipeline stage equals the number of threads in the stage. By default, the Integration Service creates one partition in every pipeline stage.

 

You can define up to 64 partitions at any partition point in a pipeline. The number of partitions remains consistent throughout the pipeline. If you define three partitions at any partition point, the Workflow Manager creates three partitions at all other partition points in the pipeline. In certain circumstances, the number of partitions in the pipeline must be set to one.
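Since each partition point starts a new stage and each stage runs one thread per partition, the session's thread count is simply stages × partitions. A minimal sketch of this arithmetic (plain Python, not Informatica code; the stage breakdown is a hypothetical example):

# Minimal sketch, not Informatica code: each stage runs one thread per
# partition, so the total thread count is stages * partitions.
def session_threads(stages: int, partitions: int) -> int:
    return stages * partitions

# Hypothetical pipeline: reader, two transformation stages, writer (4 stages)
# with 3 partitions at every partition point -> 12 threads.
print(session_threads(stages=4, partitions=3))  # 12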

 

You can define the following partition types in the Workflow Manager (a routing sketch follows the list):
• Database partitioning. The Integration Service queries the IBM DB2 or Oracle database system for table partition information. It reads partitioned data from the corresponding nodes in the database. You can use database partitioning with Oracle or IBM DB2 source instances on a multi-node tablespace. You can use database partitioning with DB2 targets.
• Hash auto-keys. The Integration Service uses a hash function to group rows of data among partitions. The Integration Service groups the data based on a partition key. The Integration Service uses all grouped or sorted ports as a compound partition key. You may need to use hash auto-keys partitioning at Rank, Sorter, and unsorted Aggregator transformations.
• Hash user keys. The Integration Service uses a hash function to group rows of data among partitions. You define the number of ports to generate the partition key.
• Key range. With key range partitioning, the Integration Service distributes rows of data based on a port or set of ports that you define as the partition key. For each port, you define a range of values. The Integration Service uses the key and ranges to send rows to the appropriate partition. Use key range partitioning when the sources or targets in the pipeline are partitioned by key range.
• Pass-through. In pass-through partitioning, the Integration Service processes data without redistributing rows among partitions. All rows in a single partition stay in the partition after crossing a pass-through partition point. Choose pass-through partitioning when you want to create an additional pipeline stage to improve performance, but do not want to change the distribution of data across partitions.
• Round-robin. The Integration Service distributes data evenly among all partitions. Use round-robin partitioning where you want each partition to process approximately the same number of rows.
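To make these partition types concrete, here is a minimal sketch in Python, not Informatica code: the row layout, port names, and ranges are hypothetical. Each function decides which partition receives a row, mirroring the behaviors described above.

# Hypothetical routing sketch for the partition types (assumed row layout).
NUM_PARTITIONS = 3

def hash_keys(row, key_ports, num_partitions=NUM_PARTITIONS):
    # Hash auto-keys / hash user keys: hash a compound key built from the
    # key ports so rows with the same key values land in the same partition.
    compound_key = tuple(row[port] for port in key_ports)
    return hash(compound_key) % num_partitions

def key_range(row, port, ranges):
    # Key range: one (low, high) value range per partition.
    for partition, (low, high) in enumerate(ranges):
        if low <= row[port] < high:
            return partition
    raise ValueError("row falls outside every defined key range")

def round_robin(row_index, num_partitions=NUM_PARTITIONS):
    # Round-robin: spread rows evenly, ignoring their contents.
    return row_index % num_partitions

def pass_through(current_partition):
    # Pass-through: the row stays in the partition it arrived in.
    return current_partition

row = {"CUST_ID": 1042, "REGION": "WEST", "AMOUNT": 99.5}
print(hash_keys(row, key_ports=["REGION"]))                                # 0, 1, or 2
print(key_range(row, "CUST_ID", [(0, 1000), (1000, 2000), (2000, 3000)]))  # 1
print(round_robin(row_index=7))                                            # 1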

 

1) The unit of partitioning is the pipeline. Partition points divide the pipeline into stages, and at each partition point you can set the partition type and the number of partitions. The number of partitions must be the same throughout the pipeline. At a partition point, the Integration Service redistributes data among the partitions according to the partition type, and the redistributed data is then processed by the transformation at that partition point. The data distribution stays the same until the next partition point; if the next partition point uses a partition type other than pass-through (or other than the same partition type as the previous partition point), the data is redistributed again.
At some pipeline stages you can place all the source data in a single partition and leave the other partitions empty. This lets you sort all the data within that one partition and then pass the sorted data to transformations that require sorted input, such as a sorted Joiner transformation or a sorted Aggregator.

2) Some transformations have a partition point set by default, for example:
Source Qualifier, Normalizer
Controls how the Integration Service extracts data from the source and passes it to the source qualifier.
You cannot delete this partition point.

Rank, Unsorted Aggregator
Ensures that the Integration Service groups rows properly before it sends them to the transformation.
You can delete these partition points if the pipeline contains only one partition or if the Integration Service passes all rows in a group to a single partition before they enter the transformation.

Target Instances
Controls how the writer passes data to the targets.
You cannot delete this partition point.

Multiple Input Group
The Workflow Manager creates a partition point at a multiple input group transformation when it is configured to process each partition with one thread,
or when a downstream one input group Custom transformation is configured to process each partition with one thread.
For example, the Workflow Manager creates a partition point at a Joiner transformation that is connected to a downstream Custom transformation configured to use one thread per partition.
This ensures that the Integration Service uses one thread to process each partition at a Custom transformation that requires one thread per partition.
You cannot delete this partition point.

3) Some transformations that need to regroup or reorganize the data require you to set the partition point and partition type yourself, to guarantee that the data passed to the transformation meets that transformation's requirements, such as grouped data, sorted data, or cached data. If the partition type or the number of partitions is set incorrectly, the session can fail. A sketch of why grouped data must stay within one partition follows below.
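To show why this matters for an unsorted Aggregator, here is a minimal sketch in Python, not Informatica code, with hypothetical column names: each partition aggregates independently, so a SUM grouped by REGION is only correct if every row of a REGION lands in the same partition, which is what hash auto-keys partitioning provides.

# Hypothetical sketch: per-partition aggregation of SUM(AMOUNT) by REGION.
from collections import defaultdict

rows = [
    {"REGION": "WEST", "AMOUNT": 10},
    {"REGION": "WEST", "AMOUNT": 30},
    {"REGION": "EAST", "AMOUNT": 20},
    {"REGION": "EAST", "AMOUNT": 40},
]

def aggregate(partition_rows):
    # Aggregation happens inside a single partition.
    totals = defaultdict(int)
    for row in partition_rows:
        totals[row["REGION"]] += row["AMOUNT"]
    return dict(totals)

# Round-robin routing splits each REGION's rows across partitions, so each
# partition emits only partial totals for that group.
round_robin_parts = [rows[0::2], rows[1::2]]
print([aggregate(p) for p in round_robin_parts])
# [{'WEST': 10, 'EAST': 20}, {'WEST': 30, 'EAST': 40}] -- partial sums only

# Hash-based routing on the group key keeps every REGION's rows together,
# so exactly one partition produces the complete total for each group.
hash_parts = [[] for _ in range(2)]
for row in rows:
    hash_parts[hash(row["REGION"]) % 2].append(row)
print([aggregate(p) for p in hash_parts])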

 

 
