1. Kafka Stream programming example
- 1) In Spark 2.3.0 the Kafka 0.8.0 integration was marked as deprecated.
- 2) In production, integrate with Kafka 0.10.0 at the very least. That integration is a Direct Stream,
- and everything below is based on the Kafka direct stream. The receiver-based approach is obsolete and will certainly be dropped.
- 3) (Important) The RDDs produced by the Kafka direct stream have a 1:1 partition mapping with the Kafka topic partitions (see the partition-count sketch after the code below).
package com.wsk.spark.stream

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.internal.Logging
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.{Seconds, StreamingContext}

object KafkaDirectStreamApp extends Logging {

  def main(args: Array[String]): Unit = {
    if (args.length != 3) {
      logError("Usage: KafkaDirectStreamApp <brokers> <topics> <groupid>")
      System.exit(0)
    }
    val Array(brokers, topic, groupid) = args

    val sparkConf = new SparkConf()
      // Uncomment the following two lines for local testing
      // .setAppName("Kafka Stream App")
      // .setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> brokers,
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> groupid,
      "auto.offset.reset" -> "latest", // or "earliest"
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    val topics = topic.split(",")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    )

    // word count
    stream.flatMap(_.value().split(","))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
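To make point 3 above concrete, the 1:1 mapping can be checked by logging the partition count of each batch's RDD. A minimal sketch (illustrative, not part of the original example) that could be added before ssc.start():

    // each RDD produced by the direct stream has exactly as many partitions
    // as the subscribed Kafka topics have partitions in total
    stream.foreachRDD { rdd =>
      logInfo(s"Kafka direct stream RDD partitions this batch: ${rdd.getNumPartitions}")
    }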
2. A slim fat-jar solution for apps with many dependent jars
The program above uses org.apache.spark.streaming.kafka010, but Spark does not ship that jar, so it has to be passed in with --jars when submitting the job. That quickly becomes inconvenient once many jars are involved. A better approach is to build a slim fat jar: in Maven, mark the dependencies that do not need to be bundled as provided, so that only the jars we actually need end up inside the application jar and no long --jars list is required. Note: never ship a full fat jar (with everything bundled) to production; it can cause all kinds of dependency conflicts and bring the program down.
Step 1: Mark the dependencies that should not be bundled as provided.
Note: for local runs, these provided scopes must be removed.
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.ruoze.spark</groupId>
<artifactId>spark-train</artifactId>
<version>1.0</version>
<properties>
<scala.version>2.11.8</scala.version>
<spark.version>2.4.2</spark.version>
<!-- <spark.version>1.6.0</spark.version>-->
<hadoop.version>2.6.0-cdh5.7.0</hadoop.version>
<mysql.jdbc.version>5.1.38</mysql.jdbc.version>
<scalikejdbc.version>3.3.2</scalikejdbc.version>
</properties>
<repositories>
<!-- Add the Cloudera repository; otherwise the CDH builds of the Hadoop jars cannot be downloaded -->
<repository>
<!-- arbitrary id -->
<id>cloudera</id>
<!-- arbitrary name -->
<name>cdhRespository</name>
<url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
<repository>
<id>scala-tools.org</id>
<name>Scala-Tools Maven2 Repository</name>
<url>http://scala-tools.org/repo-releases</url>
</repository>
</repositories>
<dependencies>
<!-- Scala library dependency -->
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
<scope>provided</scope>
</dependency>
<!-- Spark Core dependency -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<!-- Hadoop client dependency -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.version}</version>
<scope>provided</scope>
</dependency>
<!-- additional dependencies used by the examples: start -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>${mysql.jdbc.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.11</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.2.49</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.scalikejdbc</groupId>
<artifactId>scalikejdbc_2.11</artifactId>
<version>${scalikejdbc.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.scalikejdbc</groupId>
<artifactId>scalikejdbc-config_2.11</artifactId>
<version>${scalikejdbc.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
</dependencies>
<build>
<sourceDirectory>src/main/scala</sourceDirectory>
<testSourceDirectory>src/test/scala</testSourceDirectory>
<plugins>
<plugin>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
<configuration>
<scalaVersion>${scala.version}</scalaVersion>
<args>
<arg>-target:jvm-1.5</arg>
</args>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-eclipse-plugin</artifactId>
<configuration>
<downloadSources>true</downloadSources>
<buildcommands>
<buildcommand>ch.epfl.lamp.sdt.core.scalabuilder</buildcommand>
</buildcommands>
<additionalProjectnatures>
<projectnature>ch.epfl.lamp.sdt.core.scalanature</projectnature>
</additionalProjectnatures>
<classpathContainers>
<classpathContainer>org.eclipse.jdt.launching.JRE_CONTAINER</classpathContainer>
<classpathContainer>ch.epfl.lamp.sdt.launching.SCALA_CONTAINER</classpathContainer>
</classpathContainers>
</configuration>
</plugin>
</plugins>
<pluginManagement>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
</plugin>
</plugins>
</pluginManagement>
</build>
<reporting>
<plugins>
<plugin>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<configuration>
<scalaVersion>${scala.version}</scalaVersion>
</configuration>
</plugin>
</plugins>
</reporting>
</project>
Step 2: Add the assembly packaging plugin.
As shown in the build section of Step 1, the jar-with-dependencies descriptor determines the suffix appended to the packaged jar's name.
Step 3: Configure the assembly packaging command and build the package.
Add the assembly:assembly goal to the Maven build command.
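For example (a sketch; the master, brokers, topic and group id below are placeholders): running mvn clean assembly:assembly should produce target/spark-train-1.0-jar-with-dependencies.jar, which can then be submitted without any --jars list, e.g. spark-submit --master yarn --class com.wsk.spark.stream.KafkaDirectStreamApp target/spark-train-1.0-jar-with-dependencies.jar <brokers> <topics> <groupid>.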
3. Kafka Offset Management
Zero data loss when Spark consumes from Kafka:
- 1) To achieve zero loss, i.e. at-least-once (or exactly-once) semantics, the code must commit or save the offsets manually, and only after the business processing of the messages has succeeded.
Extension 1: At most once: each message is consumed at most once (data may be lost); At least once: each message is consumed at least once (messages may be processed more than once); Exactly once: each message is consumed exactly once (the ideal state in production).
3.1 Checkpoint
Avoid using checkpoints in production: a checkpoint stores the serialized DStream graph together with the offsets, so it cannot be recovered once the application code is changed or upgraded.
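For reference only, the checkpoint-based variant looks roughly like the following minimal sketch (the checkpoint directory, broker, topic and group id are all hypothetical); on restart, StreamingContext.getOrCreate recovers the context, including the offsets, from the checkpoint directory:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object CheckpointSketchApp {
  val checkpointDir = "hdfs:///tmp/kafka-stream-checkpoint" // hypothetical path

  def createContext(): StreamingContext = {
    val ssc = new StreamingContext(new SparkConf().setAppName("Checkpoint Sketch"), Seconds(5))
    ssc.checkpoint(checkpointDir) // offsets and the DStream graph are persisted here
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092", // hypothetical broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "checkpoint-sketch-group",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("test-topic"), kafkaParams))
    stream.map(_.value()).print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // recover from the checkpoint if present, otherwise build a fresh context
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}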
3.2 Kafka itself
After the business logic of a batch has finished, commit that batch's offsets back to Kafka manually and asynchronously, and let Kafka itself manage the offsets, as shown below.
Extension 2: by default Kafka keeps a consumer group's committed offsets for only one day (the broker setting offsets.retention.minutes in older Kafka versions); it is best to raise this retention to a week or more.
Extension 3: the Kafka 0.8 integration has no API for committing offsets.
// Requires: import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}
stream.foreachRDD { rdd =>
  // offset ranges of this batch, one entry per Kafka partition
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process the batch ...
  // some time later, after outputs have completed, commit the offsets back to Kafka
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
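Two caveats from the Spark Kafka 0-10 integration guide: the cast to CanCommitOffsets only succeeds on the stream object returned by createDirectStream (not on a transformed stream), and commitAsync only queues the commit, which is then performed asynchronously (typically during a later batch); enable.auto.commit must remain false for this pattern.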
3.3 Your own data store
Store the Kafka offsets yourself in a database (consumer group, topic, partition id, consumed offset). When the job starts, read the offsets from the DB and start consuming from them; after each batch has been processed, write the offsets back to the DB with upsert-like semantics (update the row if it exists, insert it otherwise). The general pattern from the integration guide, followed by a sketch of the upsert step, is shown below.
// The details depend on your data store, but the general idea looks like this.
// Requires: import org.apache.kafka.common.TopicPartition
//           import org.apache.spark.streaming.kafka010.ConsumerStrategies.Assign
// Begin from the offsets previously committed to the database
val fromOffsets = selectOffsetsFromYourDatabase.map { resultSet =>
  new TopicPartition(resultSet.string("topic"), resultSet.int("partition")) -> resultSet.long("offset")
}.toMap
val stream = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  PreferConsistent,
  Assign[String, String](fromOffsets.keys.toList, kafkaParams, fromOffsets)
)
stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  val results = yourCalculation(rdd)
  // begin your transaction
  // update results
  // update offsets where the end of existing offsets matches the beginning of this batch of offsets
  // assert that offsets were updated correctly
  // end your transaction
}
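Since the project already depends on scalikejdbc and the MySQL driver, the "update offsets" step could look roughly like the sketch below. Everything here is illustrative: the connection settings and the table kafka_offsets(group_id, topic, partition_id, topic_offset), with a unique key on (group_id, topic, partition_id), are assumptions rather than part of the original example.

import org.apache.spark.streaming.kafka010.OffsetRange
import scalikejdbc._

object OffsetStore {
  // hypothetical MySQL connection; adjust URL, user and password for your environment
  Class.forName("com.mysql.jdbc.Driver")
  ConnectionPool.singleton("jdbc:mysql://localhost:3306/stream_meta", "user", "password")

  // Upsert one row per partition inside a single local transaction
  def saveOffsets(groupId: String, offsetRanges: Array[OffsetRange]): Unit = {
    DB.localTx { implicit session =>
      offsetRanges.foreach { or =>
        sql"""INSERT INTO kafka_offsets (group_id, topic, partition_id, topic_offset)
              VALUES ($groupId, ${or.topic}, ${or.partition}, ${or.untilOffset})
              ON DUPLICATE KEY UPDATE topic_offset = VALUES(topic_offset)""".update.apply()
      }
    }
  }
}

Inside foreachRDD, after the results have been written successfully, the call would simply be OffsetStore.saveOffsets(groupid, offsetRanges); reading the offsets back at startup is the mirror-image select.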
Extensions:
- Kafka producer ack mechanism: the possible values of request.required.acks (acks in the newer producer API) are:
  0: return without waiting for any broker acknowledgement; highest throughput, but a message can appear to be sent while never actually being stored, so data may be lost.
  1: return once the leader has written the message; if the leader fails before the followers replicate it, data can still be lost.
  all: return only after the leader and all replicas in the ISR have written the message; this gives at-least-once delivery and zero loss of the data sent to Kafka. A minimal producer configuration sketch follows this list.
- Exactly-once is hard to achieve in Kafka stream processing:
  1) Even at the code level it is complex: you would need to record the offset of every message whose business processing truly succeeded, yet batches are usually processed as a whole, e.g. the results are written to HBase only at the end of each batch. By then the recorded offset is already the batch's maximum, so if the HBase write fails halfway through, the offset committed in the catch handler is wrong.
  2) Even if the code achieves exactly-once, duplicate data pushed from upstream (very common when ingesting third-party data) is equivalent to consuming the same data twice.
Therefore, in production, the code should guarantee at-least-once consumption (zero data loss), and the architecture should be designed for idempotence, so that consuming a message more than once does not affect the result.
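A minimal sketch of a producer configured for zero loss on the send side (the broker address and topic below are placeholders; the kafka-clients dependency comes in transitively via spark-streaming-kafka-0-10):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object AckAllProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092") // hypothetical broker
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
    props.put(ProducerConfig.ACKS_CONFIG, "all")  // wait for leader + all in-sync replicas
    props.put(ProducerConfig.RETRIES_CONFIG, "3") // retry transient send failures

    val producer = new KafkaProducer[String, String](props)
    producer.send(new ProducerRecord[String, String]("test-topic", "key", "value"))
    producer.flush()
    producer.close()
  }
}

With acks=all plus retries, a send is only acknowledged once the leader and all in-sync replicas have the record, which is the producer-side half of the at-least-once guarantee; the consumer-side half is the manual offset handling described above.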