How Samza Divides Partitions


WW.SS


Category: Big Data. Tags: Samza, Partition, stream processing

Article link: https://blog.csdn.net/StupidPig0818/article/details/48623129


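In real applications we add parallelism to make good use of resources. In Samza, a Job can be split across multiple Partitions to achieve logical parallelism. So how do we use Partitions?
We need to consider two questions:
1. How is the number of Partitions specified in Samza?
2. How do we send our data to a specific Partition?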

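Start with the first question. How Partitions are divided in Samza, along with every related operation, depends on the messaging system that supplies the messages. If that system is Kafka, then Kafka decides. The official Samza documentation says as much: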

The number of partitions in the input streams is determined by the systems from which you are consuming. For example, if your input system is Kafka, you can specify the number of partitions when you create a topic from the command line or using the num.partitions in Kafka’s server properties file.

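In other words, the number of Partitions is determined entirely by the system being consumed. With Kafka, you can set it on the command line when creating the topic, or through num.partitions in the Kafka server configuration file.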

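Next, let's see how to send data to a specific Partition.
At the end of process, the task calls the send method of MessageCollector to emit the processed data. The OutgoingMessageEnvelope passed to send can carry a partition key.
For example: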

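public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) throws Exception {
    // userid is assumed to be an int taken from the incoming message.
    // Constructor arguments: system stream, partition key, message key, message.
    collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "topic-name"), new Partition(userid), null, "one user info"));
}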

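The code above uses userid to construct a Partition instance, which serves as the partition key. Now let's look at the source to see how Samza uses this partition key.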

The KafkaSystemProducer class:

 def send(source: String, envelope: OutgoingMessageEnvelope) {
    trace("Enqueueing message: %s, %s." format (source, envelope))
    if(producer == null) {
      info("Creating a new producer for system %s." format systemName)
      producer = getProducer()
      debug("Created a new producer for system %s." format systemName)
    }
    // Java-based Kafka producer API requires an "Integer" type partitionKey and does not allow custom overriding of Partitioners
    // Any kind of custom partitioning has to be done on the client-side
    // Get the topic name.
    val topicName = envelope.getSystemStream.getStream
    // Fetch the partition metadata for the topic; the size of this list is the partition count.
    val partitions: java.util.List[PartitionInfo]  = producer.partitionsFor(topicName)
    // Derive the integer partition key from the envelope and the partition list (see KafkaUtil.getIntegerPartitionKey below).
    val partitionKey = if(envelope.getPartitionKey != null) KafkaUtil.getIntegerPartitionKey(envelope, partitions) else null
    val record = new ProducerRecord(envelope.getSystemStream.getStream,
                                    partitionKey,
                                    envelope.getKey.asInstanceOf[Array[Byte]],
                                    envelope.getMessage.asInstanceOf[Array[Byte]])

The KafkaUtil.getIntegerPartitionKey function:

def getIntegerPartitionKey(envelope: OutgoingMessageEnvelope, partitions: java.util.List[PartitionInfo]): Integer = {
    val numPartitions = partitions.size
    // envelope.getPartitionKey is the Partition created in process; its hashCode is taken modulo the number of partitions.
    abs(envelope.getPartitionKey.hashCode()) % numPartitions
  }
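
To make the modulo step concrete, here is a minimal, self-contained Java sketch that mirrors the computation in the excerpt above. It is an illustration under stated assumptions, not Samza code: PartitionKeyDemo and toPartitionIndex are hypothetical names, and the partition count is passed in directly instead of being read from Kafka metadata.

```java
public class PartitionKeyDemo {
    // Mirrors abs(partitionKey.hashCode()) % numPartitions from the excerpt.
    // Hypothetical helper, not part of Samza or Kafka.
    static int toPartitionIndex(Object partitionKey, int numPartitions) {
        return Math.abs(partitionKey.hashCode()) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 4; // assumed partition count of the topic
        Object[] keys = {Integer.valueOf(7), Integer.valueOf(10), "user-42"};
        for (Object key : keys) {
            // The same key always maps to the same index for a fixed partition count.
            System.out.println(key + " -> partition " + toPartitionIndex(key, numPartitions));
        }
    }
}
```

Because the hashCode of an Integer is its own value, the keys 7 and 10 map to indexes 3 and 2 when there are 4 partitions. One edge case worth knowing: Math.abs(Integer.MIN_VALUE) is still negative, so a key with exactly that hash code would not produce a positive index; Math.floorMod is a common defensive alternative in one's own code.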

Finally, let's see what the hashCode of Partition actually is, and lift the last veil:

public class Partition implements Comparable<Partition> {
  private final int partition;

  /**
   * Constructs a new Samza stream partition from a specified partition number.
   * @param partition An integer identifying the partition in a Samza stream.
   */
  public Partition(int partition) { // This is the constructor used in process.
    this.partition = partition;
  }

  public int getPartitionId() {
    return partition;
  }

  @Override
  public int hashCode() {
    return partition; // The hashCode is simply the value passed to the constructor.
  }
}
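
Putting the pieces together: since the hashCode of a Partition is the int it wraps, new Partition(userid) ends up on Kafka partition abs(userid) % numPartitions. The sketch below illustrates this end to end with no Samza or Kafka dependency. DemoPartition is a hypothetical stand-in for the Partition class, and UserRoutingDemo, route, the user IDs, and the partition count of 3 are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class UserRoutingDemo {
    // Hypothetical stand-in for Samza's Partition: hashCode is the wrapped int.
    static final class DemoPartition {
        private final int partition;

        DemoPartition(int partition) {
            this.partition = partition;
        }

        @Override
        public int hashCode() {
            return partition;
        }

        @Override
        public boolean equals(Object other) {
            return other instanceof DemoPartition
                && ((DemoPartition) other).partition == partition;
        }
    }

    // Groups user IDs by the partition index they would be routed to.
    static Map<Integer, List<Integer>> route(int[] userIds, int numPartitions) {
        Map<Integer, List<Integer>> byPartition = new TreeMap<>();
        for (int userId : userIds) {
            Object partitionKey = new DemoPartition(userId);
            // Same computation as getIntegerPartitionKey in the excerpt above.
            int target = Math.abs(partitionKey.hashCode()) % numPartitions;
            byPartition.computeIfAbsent(target, k -> new ArrayList<>()).add(userId);
        }
        return byPartition;
    }

    public static void main(String[] args) {
        // User 1001 appears twice and lands on the same partition both times.
        System.out.println(route(new int[] {1001, 1002, 1003, 1004, 1001}, 3));
    }
}
```

All messages for a given userid go to the same partition, which is what makes per-user processing in a single task possible. Note that the mapping depends on the partition count, so changing the number of partitions of the topic changes where each user lands.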