Spark Streaming -> Kafka
- createDirectStream() takes three parameters
- ssc: a StreamingContext object
- LocationStrategies
- Location strategy: controls which executor a given topic partition is consumed on, i.e. how consumers for topic partitions are scheduled across executors. There are three location strategies:
- 1. PreferBrokers
- Prefer the Kafka brokers: only usable when the Kafka brokers and the executors run on the same hosts
- 2. PreferConsistent: prefer consistency
- The choice in most cases: distributes the topic's partitions evenly across all available executors. Advantage: makes full use of the cluster's compute resources
- 3. PreferFixed
- Prefer a fixed placement
- If load is unbalanced, this strategy places specified topic partitions on specific nodes; a manual-control option
- Note: partitions without an explicit placement still fall back to PreferConsistent
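The PreferFixed strategy described above can be sketched as follows; the topic name "t1" and the host names are made up for illustration, and the `kafkaParams` map is assumed to be configured as in the full example further down:

```scala
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.LocationStrategies

// Pin partitions 0 and 1 of topic "t1" to specific executor hosts.
val hostMap = Map(
  new TopicPartition("t1", 0) -> "host-a",
  new TopicPartition("t1", 1) -> "host-b"
)
val strategy = LocationStrategies.PreferFixed(hostMap)
// Any partition not listed in hostMap is scheduled as if by PreferConsistent.
```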
- ConsumerStrategies
- Consumer strategies
- Control how consumer objects are created and configured
- and define which Kafka messages are consumed, e.g. partitions 0 and 1 of topic t1
- or a specific range of messages within specific partitions
- Subscribe
- Partitions are assigned to consumers automatically; the internal assignment algorithm distributes topic partitions evenly, in an optimal way, among the consumers of the same group
- Assign: the consumer manually specifies the topic-partitions to consume, unconstrained by group.id; the group is effectively ignored for assignment
- SubscribePattern: subscribes to all topics whose names match a regular expression
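The Assign and SubscribePattern strategies can be sketched as below; the topic "t1", the regex, and the starting offsets are hypothetical, and `kafkaParams` is assumed to be the same `Map[String, Object]` used with Subscribe in the full example:

```scala
import java.util.regex.Pattern
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.ConsumerStrategies

// Assign: consume exactly partitions 0 and 1 of topic "t1",
// optionally starting from explicit offsets (group assignment is bypassed).
val partitions = Seq(new TopicPartition("t1", 0), new TopicPartition("t1", 1))
val fromOffsets = Map(new TopicPartition("t1", 0) -> 100L,
                      new TopicPartition("t1", 1) -> 0L)
val assign =
  ConsumerStrategies.Assign[String, String](partitions, kafkaParams, fromOffsets)

// SubscribePattern: subscribe to every topic whose name matches the regex.
val byPattern =
  ConsumerStrategies.SubscribePattern[String, String](Pattern.compile("t\\d+"), kafkaParams)
```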
package com.tyf.example

import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

object MySs {
  def main(args: Array[String]): Unit = {
    // Prepare the StreamingContext
    val conf = new SparkConf().setMaster("local[*]").setAppName("datahandler")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(5))
    // To recover from failures, enable checkpointing on the StreamingContext;
    // the consumed offsets can then be restored from the checkpoint.
    ssc.checkpoint("cks")
    // Prepare the Kafka parameters
    val kafkaParams = Map(
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "192.168.133.111:9092",
      ConsumerConfig.GROUP_ID_CONFIG -> "tyf2", // consumer group id
      ConsumerConfig.MAX_POLL_RECORDS_CONFIG -> "1000", // max records per poll (default 500)
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer], // key deserializer
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer], // value deserializer
      ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG -> "true", // auto-commit offsets
      ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> "earliest" // earliest: start from the beginning when no committed offset exists
    )
    // Read the Kafka data
    val ds = KafkaUtils.createDirectStream(
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Set("mmm"), kafkaParams))
    ds.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
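If exactly-once semantics matter more than convenience, one alternative to the auto-commit used above is to commit offsets manually after each batch is processed. This is a sketch assuming `ds` is the DStream created by createDirectStream above and that `enable.auto.commit` has been set to `"false"` in `kafkaParams`:

```scala
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

ds.foreachRDD { rdd =>
  // Capture the offset ranges covered by this batch.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process the batch here ...
  // Commit only after processing succeeded, so a failure reprocesses the batch
  // instead of silently skipping it.
  ds.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
```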
-
Required jars
-
<dependencies>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.11</version>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka_2.11</artifactId>
        <version>2.0.0</version>
        <exclusions>
            <exclusion>
                <groupId>com.fasterxml.jackson.core</groupId>
                <artifactId>*</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients -->
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-clients</artifactId>
        <version>2.0.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10 -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
        <version>2.3.4</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.3.4</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.3.4</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/com.google.guava/guava -->
    <dependency>
        <groupId>com.google.guava</groupId>
        <artifactId>guava</artifactId>
        <version>14.0.1</version>
    </dependency>
    <dependency>
        <groupId>com.fasterxml.jackson.core</groupId>
        <artifactId>jackson-core</artifactId>
        <version>2.6.6</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.11</artifactId>
        <version>2.3.4</version>
    </dependency>
</dependencies>
-
References
Detailed explanation of KafkaUtils.createDirectStream() parameters (1)