Spark Streaming + Kafka Integration Guide: LocationStrategies and ConsumerStrategies (Translation Notes)


LocationStrategies

The new Kafka consumer API will pre-fetch messages into buffers. Therefore it is important for performance reasons that the Spark integration keep cached consumers on executors (rather than recreating them for each batch), and prefer to schedule partitions on the host locations that have the appropriate consumers.


In most cases, you should use LocationStrategies.PreferConsistent as shown above. This will distribute partitions evenly across available executors. If your executors are on the same hosts as your Kafka brokers, use PreferBrokers, which will prefer to schedule partitions on the Kafka leader for that partition. Finally, if you have a significant skew in load among partitions, use PreferFixed. This allows you to specify an explicit mapping of partitions to hosts (any unspecified partitions will use a consistent location).

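For reference, a minimal sketch of the three strategies in Scala; the topic name and host names are placeholders, only meant to show the shape of the API:

import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.LocationStrategies

// default choice: spread partitions evenly across available executors
val consistent = LocationStrategies.PreferConsistent

// only when executors run on the same hosts as the Kafka brokers
val brokers = LocationStrategies.PreferBrokers

// pin skewed partitions to specific hosts; unlisted partitions fall back to a consistent location
val fixed = LocationStrategies.PreferFixed(Map(
  new TopicPartition("test", 0) -> "host-with-spare-capacity",
  new TopicPartition("test", 1) -> "another-host"
))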

The cache for consumers has a default maximum size of 64. If you expect to be handling more than (64 * number of executors) Kafka partitions, you can change this setting via spark.streaming.kafka.consumer.cache.maxCapacity.

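As a rough sketch (the value 128 and the app name are arbitrary), the setting can be raised through SparkConf before the StreamingContext is created:

import org.apache.spark.SparkConf

val sparkConf = new SparkConf()
  .setAppName("kafka-cache-demo")
  // allow up to 128 cached consumers per executor instead of the default 64
  .set("spark.streaming.kafka.consumer.cache.maxCapacity", "128")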

The cache is keyed by TopicPartition and group.id, so use a separate group.id for each call to createDirectStream.

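For example (the bootstrap servers and group ids below are illustrative), two direct streams created in the same application would each get their own group.id:

import org.apache.kafka.common.serialization.StringDeserializer

val baseParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

// distinct group ids keep the cached consumers of the two streams from colliding
val paramsForStreamA = baseParams + ("group.id" -> "orders-stream")
val paramsForStreamB = baseParams + ("group.id" -> "audit-stream")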

ConsumerStrategies

The new Kafka consumer API has a number of different ways to specify topics, some of which require considerable post-object-instantiation setup. ConsumerStrategies provides an abstraction that allows Spark to obtain properly configured consumers even after restart from checkpoint.


ConsumerStrategies.Subscribe, as shown above, allows you to subscribe to a fixed collection of topics. SubscribePattern allows you to use a regex to specify topics of interest. Note that unlike the 0.8 integration, using Subscribe or SubscribePattern should respond to adding partitions during a running stream. Finally, Assign allows you to specify a fixed collection of partitions. All three strategies have overloaded constructors that allow you to specify the starting offset for a particular partition.

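A sketch of the three strategies side by side; the topics, pattern, and offset values are placeholders, and kafkaParams is assumed to be the Map[String, Object] built for createDirectStream earlier in the guide:

import java.util.regex.Pattern
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.ConsumerStrategies

// fixed collection of topics
val subscribe = ConsumerStrategies.Subscribe[String, String](Seq("topicA", "topicB"), kafkaParams)

// topics matching a regex; like Subscribe, responds to partitions added while the stream runs
val byPattern = ConsumerStrategies.SubscribePattern[String, String](Pattern.compile("topic.*"), kafkaParams)

// fixed collection of partitions, with an explicit starting offset for one of them
val assigned = ConsumerStrategies.Assign[String, String](
  Seq(new TopicPartition("topicA", 0)),
  kafkaParams,
  Map(new TopicPartition("topicA", 0) -> 1234L)
)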


If you have specific consumer setup needs that are not met by the options above, ConsumerStrategy is a public class that you can extend.

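As a rough, hypothetical sketch of what such an extension can look like, assuming the two abstract members executorKafkaParams and onStart exposed by spark-streaming-kafka-0-10's ConsumerStrategy; the class name and the seek handling are mine, not an official example:

import java.{ lang => jl, util => ju }
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.{ Consumer, KafkaConsumer }
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.ConsumerStrategy

class SingleTopicStrategy[K, V](topic: String, params: ju.Map[String, Object])
  extends ConsumerStrategy[K, V] {

  // params handed to the cached consumers on the executors
  override def executorKafkaParams: ju.Map[String, Object] = params

  // called on the driver; currentOffsets is non-empty when restoring from a checkpoint
  override def onStart(currentOffsets: ju.Map[TopicPartition, jl.Long]): Consumer[K, V] = {
    val consumer = new KafkaConsumer[K, V](params)
    consumer.subscribe(ju.Arrays.asList(topic))
    if (!currentOffsets.isEmpty) {
      consumer.poll(0)  // force partition assignment before seeking
      currentOffsets.asScala.foreach { case (tp, offset) => consumer.seek(tp, offset.longValue()) }
    }
    consumer
  }
}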



Creating an RDD

If you have a use case that is better suited to batch processing, you can create an RDD for a defined range of offsets.


// Import dependencies and create kafka params as in Create Direct Stream above

val offsetRanges = Array(
  // topic, partition, inclusive starting offset, exclusive ending offset
  OffsetRange("test", 0, 0, 100),
  OffsetRange("test", 1, 0, 100)
)

val rdd = KafkaUtils.createRDD[String, String](sparkContext, kafkaParams, offsetRanges, PreferConsistent)


The definition of OffsetRange:

final class OffsetRange private(val topic: String, val partition: Int, val fromOffset: Long, val untilOffset: Long) extends Serializable

Note that you cannot use PreferBrokers, because without the stream there is not a driver-side consumer to automatically look up broker metadata for you. Use PreferFixed with your own metadata lookups if necessary.



Obtaining Offsets

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.foreachPartition { iter =>
    val o: OffsetRange = offsetRanges(TaskContext.get.partitionId)
    println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
  }
}

Supplement: the inheritance hierarchy of the KafkaRDD class:

private[spark] class KafkaRDD extends RDD with Logging with HasOffsetRanges 

Note that the typecast to HasOffsetRanges will only succeed if it is done in the first method called on the result of createDirectStream, not later down a chain of methods. Be aware that the one-to-one mapping between RDD partition and Kafka partition does not remain after any methods that shuffle or repartition, e.g. reduceByKey() or window().

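For illustration, a hypothetical transformation chain where that cast would no longer work (the key/value reduction below is made up):

// After a shuffle (reduceByKey) the resulting RDDs are no longer KafkaRDDs,
// so casting them to HasOffsetRanges fails at runtime.
val reduced = stream.map(record => (record.key, record.value)).reduceByKey(_ + _)
reduced.foreachRDD { rdd =>
  // rdd.asInstanceOf[HasOffsetRanges].offsetRanges  // would throw ClassCastException
  println(s"shuffled batch with ${rdd.getNumPartitions} partitions")
}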


The remaining sections are relatively easy to follow, so they are kept in the original English.

Storing Offsets

Kafka delivery semantics in the case of failure depend on how and when offsets are stored. Spark output operations are at-least-once. So if you want the equivalent of exactly-once semantics, you must either store offsets after an idempotent output, or store offsets in an atomic transaction alongside output. With this integration, you have 3 options, in order of increasing reliability (and code complexity), for how to store offsets.

Checkpoints

If you enable Spark checkpointing, offsets will be stored in the checkpoint. This is easy to enable, but there are drawbacks. Your output operation must be idempotent, since you will get repeated outputs; transactions are not an option. Furthermore, you cannot recover from a checkpoint if your application code has changed. For planned upgrades, you can mitigate this by running the new code at the same time as the old code (since outputs need to be idempotent anyway, they should not clash). But for unplanned failures that require code changes, you will lose data unless you have another way to identify known good starting offsets.
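
A minimal sketch of checkpoint-based recovery; the checkpoint directory, batch interval, and app name are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{ Seconds, StreamingContext }

val checkpointDir = "/tmp/spark-streaming-checkpoint"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpointed-kafka-app")
  val ssc = new StreamingContext(conf, Seconds(5))
  ssc.checkpoint(checkpointDir)
  // build the Kafka direct stream and idempotent output operations here
  ssc
}

// On a clean start this runs createContext(); after a failure it restores the
// context, including stored offsets, from the checkpoint directory.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()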

Kafka itself

Kafka has an offset commit API that stores offsets in a special Kafka topic. By default, the new consumer will periodically auto-commit offsets. This is almost certainly not what you want, because messages successfully polled by the consumer may not yet have resulted in a Spark output operation, resulting in undefined semantics. This is why the stream example above sets “enable.auto.commit” to false. However, you can commit offsets to Kafka after you know your output has been stored, using the commitAsync API. The benefit as compared to checkpoints is that Kafka is a durable store regardless of changes to your application code. However, Kafka is not transactional, so your outputs must still be idempotent.

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // some time later, after outputs have completed
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}

As with HasOffsetRanges, the cast to CanCommitOffsets will only succeed if called on the result of createDirectStream, not after transformations. The commitAsync call is threadsafe, but must occur after outputs if you want meaningful semantics.
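
If commit failures should at least be logged, CanCommitOffsets also accepts an OffsetCommitCallback. A small sketch, with error handling that is only illustrative:

import java.{ util => ju }
import org.apache.kafka.clients.consumer.{ OffsetAndMetadata, OffsetCommitCallback }
import org.apache.kafka.common.TopicPartition

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... outputs complete here ...
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges, new OffsetCommitCallback {
    override def onComplete(m: ju.Map[TopicPartition, OffsetAndMetadata], e: Exception): Unit =
      if (e != null) println(s"offset commit failed: $e")  // alert or retry as appropriate
  })
}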

Your own data store

For data stores that support transactions, saving offsets in the same transaction as the results can keep the two in sync, even in failure situations. If you’re careful about detecting repeated or skipped offset ranges, rolling back the transaction prevents duplicated or lost messages from affecting results. This gives the equivalent of exactly-once semantics. It is also possible to use this tactic even for outputs that result from aggregations, which are typically hard to make idempotent.

// The details depend on your data store, but the general idea looks like this

// begin from the offsets committed to the database
val fromOffsets = selectOffsetsFromYourDatabase.map { resultSet =>
  new TopicPartition(resultSet.string("topic"), resultSet.int("partition")) -> resultSet.long("offset")
}.toMap

val stream = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  PreferConsistent,
  Assign[String, String](fromOffsets.keys.toList, kafkaParams, fromOffsets)
)

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  val results = yourCalculation(rdd)

  // begin your transaction

  // update results
  // update offsets where the end of existing offsets matches the beginning of this batch of offsets
  // assert that offsets were updated correctly

  // end your transaction
}
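
As one hypothetical, JDBC-flavored rendering of that skeleton: the connection URL, the stream_offsets table and its columns are invented, and yourCalculation is the same placeholder as above.

import java.sql.DriverManager

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  val results = yourCalculation(rdd)

  val conn = DriverManager.getConnection("jdbc:postgresql://localhost/app", "user", "secret")
  conn.setAutoCommit(false)                       // begin the transaction
  try {
    // 1. write `results` to your results table (schema-specific, omitted)

    // 2. advance offsets only if the stored end matches this batch's beginning
    val stmt = conn.prepareStatement(
      "UPDATE stream_offsets SET until_offset = ? " +
      "WHERE topic = ? AND kafka_partition = ? AND until_offset = ?")
    offsetRanges.foreach { o =>
      stmt.setLong(1, o.untilOffset)
      stmt.setString(2, o.topic)
      stmt.setInt(3, o.partition)
      stmt.setLong(4, o.fromOffset)
      if (stmt.executeUpdate() != 1) {            // repeated or skipped range detected
        throw new IllegalStateException(s"offset mismatch for ${o.topic}-${o.partition}")
      }
    }
    conn.commit()                                 // end the transaction
  } catch {
    case e: Exception => conn.rollback(); throw e
  } finally {
    conn.close()
  }
}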


