Spark Streaming + Kafka Integration Guide: LocationStrategies and ConsumerStrategies (Translation Notes)


LocationStrategies

The new Kafka consumer API will pre-fetch messages into buffers. Therefore it is important for performance reasons that the Spark integration keep cached consumers on executors (rather than recreating them for each batch), and prefer to schedule partitions on the host locations that have the appropriate consumers.


In most cases, you should use LocationStrategies.PreferConsistent as shown above. This will distribute partitions evenly across available executors. If your executors are on the same hosts as your Kafka brokers, use PreferBrokers, which will prefer to schedule partitions on the Kafka leader for that partition. Finally, if you have a significant skew in load among partitions, use PreferFixed. This allows you to specify an explicit mapping of partitions to hosts (any unspecified partitions will use a consistent location).

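For reference, a minimal sketch of the three strategies in Scala; the topic name and host names are placeholders, only meant to show the shape of the API:

import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.LocationStrategies

// default choice: spread partitions evenly across available executors
val consistent = LocationStrategies.PreferConsistent

// only when executors run on the same hosts as the Kafka brokers
val brokers = LocationStrategies.PreferBrokers

// pin skewed partitions to specific hosts; unlisted partitions fall back to a consistent location
val fixed = LocationStrategies.PreferFixed(Map(
  new TopicPartition("test", 0) -> "host-with-spare-capacity",
  new TopicPartition("test", 1) -> "another-host"
))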

The cache for consumers has a default maximum size of 64. If you expect to be handling more than (64 * number of executors) Kafka partitions, you can change this setting via spark.streaming.kafka.consumer.cache.maxCapacity.

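As a rough sketch (the value 128 and the app name are arbitrary), the setting can be raised through SparkConf before the StreamingContext is created:

import org.apache.spark.SparkConf

val sparkConf = new SparkConf()
  .setAppName("kafka-cache-demo")
  // allow up to 128 cached consumers per executor instead of the default 64
  .set("spark.streaming.kafka.consumer.cache.maxCapacity", "128")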

The cache is keyed by TopicPartition and group.id, so use a separate group.id for each call to createDirectStream.

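For example (the bootstrap servers and group ids below are illustrative), two direct streams created in the same application would each get their own group.id:

import org.apache.kafka.common.serialization.StringDeserializer

val baseParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

// distinct group ids keep the cached consumers of the two streams from colliding
val paramsForStreamA = baseParams + ("group.id" -> "orders-stream")
val paramsForStreamB = baseParams + ("group.id" -> "audit-stream")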

ConsumerStrategies

The new Kafka consumer API has a number of different ways to specify topics, some of which require considerable post-object-instantiation setup. ConsumerStrategies provides an abstraction that allows Spark to obtain properly configured consumers even after restart from checkpoint.


ConsumerStrategies.Subscribe, as shown above, allows you to subscribe to a fixed collection of topics. SubscribePattern allows you to use a regex to specify topics of interest. Note that unlike the 0.8 integration, using Subscribe or SubscribePattern should respond to adding partitions during a running stream. Finally, Assign allows you to specify a fixed collection of partitions. All three strategies have overloaded constructors that allow you to specify the starting offset for a particular partition.

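A sketch of the three strategies side by side; the topics, pattern, and offset values are placeholders, and kafkaParams is assumed to be the Map[String, Object] built for createDirectStream earlier in the guide:

import java.util.regex.Pattern
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.ConsumerStrategies

// fixed collection of topics
val subscribe = ConsumerStrategies.Subscribe[String, String](Seq("topicA", "topicB"), kafkaParams)

// topics matching a regex; like Subscribe, responds to partitions added while the stream runs
val byPattern = ConsumerStrategies.SubscribePattern[String, String](Pattern.compile("topic.*"), kafkaParams)

// fixed collection of partitions, with an explicit starting offset for one of them
val assigned = ConsumerStrategies.Assign[String, String](
  Seq(new TopicPartition("topicA", 0)),
  kafkaParams,
  Map(new TopicPartition("topicA", 0) -> 1234L)
)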


If you have specific consumer setup needs that are not met by the options above, ConsumerStrategy is a public class that you can extend.

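As a rough, hypothetical sketch of what such an extension can look like, assuming the two abstract members executorKafkaParams and onStart exposed by spark-streaming-kafka-0-10's ConsumerStrategy; the class name and the seek handling are mine, not an official example:

import java.{ lang => jl, util => ju }
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.{ Consumer, KafkaConsumer }
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.ConsumerStrategy

class SingleTopicStrategy[K, V](topic: String, params: ju.Map[String, Object])
  extends ConsumerStrategy[K, V] {

  // params handed to the cached consumers on the executors
  override def executorKafkaParams: ju.Map[String, Object] = params

  // called on the driver; currentOffsets is non-empty when restoring from a checkpoint
  override def onStart(currentOffsets: ju.Map[TopicPartition, jl.Long]): Consumer[K, V] = {
    val consumer = new KafkaConsumer[K, V](params)
    consumer.subscribe(ju.Arrays.asList(topic))
    if (!currentOffsets.isEmpty) {
      consumer.poll(0)  // force partition assignment before seeking
      currentOffsets.asScala.foreach { case (tp, offset) => consumer.seek(tp, offset.longValue()) }
    }
    consumer
  }
}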



Creating an RDD

If you have a use case that is better suited to batch processing, you can create an RDD for a defined range of offsets.


// Import dependencies and create kafka params as in Create Direct Stream above

val offsetRanges = Array(
  // topic, partition, inclusive starting offset, exclusive ending offset
  OffsetRange("test", 0, 0, 100),
  OffsetRange("test", 1, 0, 100)
)

val rdd = KafkaUtils.createRDD[String, String](sparkContext, kafkaParams, offsetRanges, PreferConsistent)


The definition of OffsetRange:

final class OffsetRange private(val topic: String, val partition: Int, val fromOffset: Long, val untilOffset: Long) extends Serializable

Note that you cannot use PreferBrokers, because without the stream there is not a driver-side consumer to automatically look up broker metadata for you. Use PreferFixed with your own metadata lookups if necessary.



Obtaining Offsets

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.foreachPartition { iter =>
    val o: OffsetRange = offsetRanges(TaskContext.get.partitionId)
    println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
  }
}

Supplement: the inheritance hierarchy of the KafkaRDD class:

private[spark] class KafkaRDD extends RDD with Logging with HasOffsetRanges 

Note that the typecast to HasOffsetRanges will only succeed if it is done in the first method called on the result of createDirectStream, not later down a chain of methods. Be aware that the one-to-one mapping between RDD partition and Kafka partition does not remain after any methods that shuffle or repartition, e.g. reduceByKey() or window().

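For illustration, a hypothetical transformation chain where that cast would no longer work (the key/value reduction below is made up):

// After a shuffle (reduceByKey) the resulting RDDs are no longer KafkaRDDs,
// so casting them to HasOffsetRanges fails at runtime.
val reduced = stream.map(record => (record.key, record.value)).reduceByKey(_ + _)
reduced.foreachRDD { rdd =>
  // rdd.asInstanceOf[HasOffsetRanges].offsetRanges  // would throw ClassCastException
  println(s"shuffled batch with ${rdd.getNumPartitions} partitions")
}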


The remaining sections are relatively easy to follow, so they are kept in the original English.

Storing Offsets

Kafka delivery semantics in the case of failure depend on how and when offsets are stored. Spark output operations are at-least-once. So if you want the equivalent of exactly-once semantics, you must either store offsets after an idempotent output, or store offsets in an atomic transaction alongside output. With this integration, you have 3 options, in order of increasing reliability (and code complexity), for how to store offsets.

Checkpoints

If you enable Spark checkpointing, offsets will be stored in the checkpoint. This is easy to enable, but there are drawbacks. Your output operation must be idempotent, since you will get repeated outputs; transactions are not an option. Furthermore, you cannot recover from a checkpoint if your application code has changed. For planned upgrades, you can mitigate this by running the new code at the same time as the old code (since outputs need to be idempotent anyway, they should not clash). But for unplanned failures that require code changes, you will lose data unless you have another way to identify known good starting offsets.
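
A minimal sketch of checkpoint-based recovery; the checkpoint directory, batch interval, and app name are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{ Seconds, StreamingContext }

val checkpointDir = "/tmp/spark-streaming-checkpoint"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpointed-kafka-app")
  val ssc = new StreamingContext(conf, Seconds(5))
  ssc.checkpoint(checkpointDir)
  // build the Kafka direct stream and idempotent output operations here
  ssc
}

// On a clean start this runs createContext(); after a failure it restores the
// context, including stored offsets, from the checkpoint directory.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()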

Kafka itself

Kafka has an offset commit API that stores offsets in a special Kafka topic. By default, the new consumer will periodically auto-commit offsets. This is almost certainly not what you want, because messages successfully polled by the consumer may not yet have resulted in a Spark output operation, resulting in undefined semantics. This is why the stream example above sets “enable.auto.commit” to false. However, you can commit offsets to Kafka after you know your output has been stored, using the commitAsync API. The benefit as compared to checkpoints is that Kafka is a durable store regardless of changes to your application code. However, Kafka is not transactional, so your outputs must still be idempotent.

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // some time later, after outputs have completed
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}

As with HasOffsetRanges, the cast to CanCommitOffsets will only succeed if called on the result of createDirectStream, not after transformations. The commitAsync call is threadsafe, but must occur after outputs if you want meaningful semantics.
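
If commit failures should at least be logged, CanCommitOffsets also accepts an OffsetCommitCallback. A small sketch, with error handling that is only illustrative:

import java.{ util => ju }
import org.apache.kafka.clients.consumer.{ OffsetAndMetadata, OffsetCommitCallback }
import org.apache.kafka.common.TopicPartition

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... outputs complete here ...
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges, new OffsetCommitCallback {
    override def onComplete(m: ju.Map[TopicPartition, OffsetAndMetadata], e: Exception): Unit =
      if (e != null) println(s"offset commit failed: $e")  // alert or retry as appropriate
  })
}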

Your own data store

For data stores that support transactions, saving offsets in the same transaction as the results can keep the two in sync, even in failure situations. If you’re careful about detecting repeated or skipped offset ranges, rolling back the transaction prevents duplicated or lost messages from affecting results. This gives the equivalent of exactly-once semantics. It is also possible to use this tactic even for outputs that result from aggregations, which are typically hard to make idempotent.

// The details depend on your data store, but the general idea looks like this

// begin from the offsets committed to the database
val fromOffsets = selectOffsetsFromYourDatabase.map { resultSet =>
  new TopicPartition(resultSet.string("topic"), resultSet.int("partition")) -> resultSet.long("offset")
}.toMap

val stream = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  PreferConsistent,
  Assign[String, String](fromOffsets.keys.toList, kafkaParams, fromOffsets)
)

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  val results = yourCalculation(rdd)

  // begin your transaction

  // update results
  // update offsets where the end of existing offsets matches the beginning of this batch of offsets
  // assert that offsets were updated correctly

  // end your transaction
}
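
As one hypothetical, JDBC-flavored rendering of that skeleton: the connection URL, the stream_offsets table and its columns are invented, and yourCalculation is the same placeholder as above.

import java.sql.DriverManager

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  val results = yourCalculation(rdd)

  val conn = DriverManager.getConnection("jdbc:postgresql://localhost/app", "user", "secret")
  conn.setAutoCommit(false)                       // begin the transaction
  try {
    // 1. write `results` to your results table (schema-specific, omitted)

    // 2. advance offsets only if the stored end matches this batch's beginning
    val stmt = conn.prepareStatement(
      "UPDATE stream_offsets SET until_offset = ? " +
      "WHERE topic = ? AND kafka_partition = ? AND until_offset = ?")
    offsetRanges.foreach { o =>
      stmt.setLong(1, o.untilOffset)
      stmt.setString(2, o.topic)
      stmt.setInt(3, o.partition)
      stmt.setLong(4, o.fromOffset)
      if (stmt.executeUpdate() != 1) {            // repeated or skipped range detected
        throw new IllegalStateException(s"offset mismatch for ${o.topic}-${o.partition}")
      }
    }
    conn.commit()                                 // end the transaction
  } catch {
    case e: Exception => conn.rollback(); throw e
  } finally {
    conn.close()
  }
}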


