Original post: http://blog.csdn.net/high2011/article/details/53706446
Many thanks to the original author; this article saved me a lot of detours. I am reposting it here as a personal study copy. Please support the original author by reading it at the link above.
I. Scenario: when a Spark Streaming program exits unexpectedly, data keeps being pushed into Kafka. Because the consumer starts from the latest offset by default, everything produced while the job was down is skipped on restart, which amounts to data loss. To avoid this, we record the offset of each batch we consume, so that on the next start we can look up the record and resume reading from the stored offset.
II. Environment: kafka-0.9.0, spark-1.6.0, jdk-1.7, Scala-2.10.5, IDEA 16
III. Implementation code:
1. Add the Spark and Kafka dependencies (pom.xml)
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xmlns="http://maven.apache.org/POM/4.0.0"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.ngaa</groupId>
  <artifactId>test-my</artifactId>
  <version>1.0-SNAPSHOT</version>
  <inceptionYear>2008</inceptionYear>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
    <!--add maven release-->
    <maven.compiler.source>1.7</maven.compiler.source>
    <maven.compiler.target>1.7</maven.compiler.target>
    <encoding>UTF-8</encoding>
    <!-- Scala version -->
    <scala.version>2.10.5</scala.version>
    <!-- Scala version on the test machine -->
    <test.scala.version>2.11.7</test.scala.version>
    <jackson.version>2.3.0</jackson.version>
    <!-- slf4j version -->
    <slf4j-version>1.7.20</slf4j-version>
    <!-- CDH Spark -->
    <spark.cdh.version>1.6.0-cdh5.8.0</spark.cdh.version>
    <spark.streaming.cdh.version>1.6.0-cdh5.8.0</spark.streaming.cdh.version>
    <kafka.spark.cdh.version>1.6.0-cdh5.8.0</kafka.spark.cdh.version>
    <!-- CDH Hadoop -->
    <hadoop.cdh.version>2.6.0-cdh5.8.0</hadoop.cdh.version>
    <!-- httpclient must be compatible with the Hadoop version shipped with CDH
         (see /opt/cloudera/parcels/CDH/lib/hadoop/lib) -->
    <httpclient.version>4.2.5</httpclient.version>
    <!-- httpcore -->
    <httpcore.version>4.2.5</httpcore.version>
    <!-- fastjson -->
    <fastjson.version>1.1.39</fastjson.version>
  </properties>
  <repositories>
    <repository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </repository>
    <!-- Repository for the CDH-specific jars -->
    <repository>
      <id>cloudera</id>
      <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
  </repositories>
  <pluginRepositories>
    <pluginRepository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </pluginRepository>
  </pluginRepositories>
  <dependencies>
    <!-- fastjson -->
    <dependency>
      <groupId>com.alibaba</groupId>
      <artifactId>fastjson</artifactId>
      <version>${fastjson.version}</version>
    </dependency>
    <!-- httpclient -->
    <dependency>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpclient</artifactId>
      <version>${httpclient.version}</version>
    </dependency>
    <!-- httpcore -->
    <dependency>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpcore</artifactId>
      <version>${httpcore.version}</version>
    </dependency>
    <!-- slf4j -->
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-log4j12</artifactId>
      <version>${slf4j-version}</version>
    </dependency>
    <!-- Hadoop -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.cdh.version}</version>
      <exclusions>
        <exclusion>
          <groupId>javax.servlet</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>${hadoop.cdh.version}</version>
      <exclusions>
        <exclusion>
          <groupId>javax.servlet</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-hdfs</artifactId>
      <version>${hadoop.cdh.version}</version>
      <exclusions>
        <exclusion>
          <groupId>javax.servlet</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
    <!-- Scala library -->
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency>
      <groupId>com.fasterxml.jackson.core</groupId>
      <artifactId>jackson-databind</artifactId>
      <version>${jackson.version}</version>
    </dependency>
    <!-- Spark Streaming and the Kafka integration -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.10</artifactId>
      <version>${spark.streaming.cdh.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka_2.10</artifactId>
      <version>${kafka.spark.cdh.version}</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.12</version>
      <scope>test</scope>
    </dependency>
    <!-- Spark assembly jar from the local Windows filesystem -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-assembly_2.10</artifactId>
      <version>${spark.cdh.version}</version>
      <scope>system</scope>
      <systemPath>D:/crt_send_document/spark-assembly-1.6.0-cdh5.8.0-hadoop2.6.0-cdh5.8.0.jar</systemPath>
    </dependency>
    <!-- Spark assembly jar from the Linux test environment -->
    <!--<dependency>-->
    <!--<groupId>org.apache.spark</groupId>-->
    <!--<artifactId>spark-assembly_2.10</artifactId>-->
    <!--<version>${spark.cdh.version}</version>-->
    <!--<scope>system</scope>-->
    <!--<systemPath>/opt/cloudera/parcels/CDH/lib/spark/lib/spark-examples-1.6.0-cdh5.8.0-hadoop2.6.0-cdh5.8.0.jar-->
    <!--</systemPath>-->
    <!--</dependency>-->
    <!-- Spark assembly from the central repository -->
    <!--<dependency>-->
    <!--<groupId>org.apache.spark</groupId>-->
    <!--<artifactId>spark-assembly_2.10</artifactId>-->
    <!--<version>${spark.cdh.version}</version>-->
    <!--</dependency>-->
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-yarn-server-web-proxy -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-yarn-server-web-proxy</artifactId>
      <version>2.6.0-cdh5.8.0</version>
    </dependency>
  </dependencies>
  <!-- Maven packaging -->
  <build>
    <finalName>test-my</finalName>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <version>2.15.2</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <scalaVersion>${scala.version}</scalaVersion>
          <args>
            <arg>-target:jvm-1.7</arg>
          </args>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-eclipse-plugin</artifactId>
        <configuration>
          <downloadSources>true</downloadSources>
          <buildcommands>
            <buildcommand>ch.epfl.lamp.sdt.core.scalabuilder</buildcommand>
          </buildcommands>
          <additionalProjectnatures>
            <projectnature>ch.epfl.lamp.sdt.core.scalanature</projectnature>
          </additionalProjectnatures>
          <classpathContainers>
            <classpathContainer>org.eclipse.jdt.launching.JRE_CONTAINER</classpathContainer>
            <classpathContainer>ch.epfl.lamp.sdt.launching.SCALA_CONTAINER</classpathContainer>
          </classpathContainers>
        </configuration>
      </plugin>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
          <archive>
            <manifest>
              <mainClass></mainClass>
            </manifest>
          </archive>
        </configuration>
        <executions>
          <execution>
            <id>make-assembly</id>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
  <reporting>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <configuration>
          <scalaVersion>${scala.version}</scalaVersion>
        </configuration>
      </plugin>
    </plugins>
  </reporting>
</project>
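A note on the properties above: the project compiles against Scala 2.10.5 while the test machine runs Scala 2.11.7, and Scala minor versions are not binary compatible, so a mismatch between the _2.10 artifacts and the runtime Scala is a common source of NoSuchMethodError. A small sanity check, not from the original post (it assumes SPARK_VERSION, the public constant in the org.apache.spark package object), to confirm what actually landed on the classpath:

import org.apache.spark.SPARK_VERSION

object VersionCheck {
  def main(args: Array[String]): Unit = {
    // Print the Spark and Scala versions actually visible at runtime.
    println(s"Spark version: $SPARK_VERSION")
    println(s"Scala version: ${scala.util.Properties.versionString}")
  }
}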
2. Create the test class
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, TaskContext}
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.slf4j.LoggerFactory

/**
  * Created by yangjf on 2016/12/18
  * Update date:
  * Time: 11:10
  * Description: read Kafka data starting from specified offsets
  * Result of test:
  * Command:
  * Email: jifei.yang@ngaa.com.cn
  */
object ReadBySureOffsetTest {
  val logger = LoggerFactory.getLogger(ReadBySureOffsetTest.getClass)

  def main(args: Array[String]) {
    // Reduce log noise from the frameworks
    Logger.getLogger("org.apache.kafka").setLevel(Level.ERROR)
    Logger.getLogger("org.apache.zookeeper").setLevel(Level.ERROR)
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    logger.info("Main program started: consume Kafka from specified offsets")
    if (args.length < 1) {
      System.err.println("Your arguments were " + args.mkString(","))
      logger.info("Main program exiting: missing checkpoint directory argument")
      System.exit(1)
    }
    // e.g. hdfs://hadoop1:8020/user/root/spark/checkpoint
    val Array(checkpointDirectory) = args
    logger.info("checkpoint directory: " + checkpointDirectory)
    // Note: if a valid checkpoint already exists, the context is restored from
    // it and createContext (including the offsets below) is not invoked.
    val ssc = StreamingContext.getOrCreate(checkpointDirectory,
      () => {
        createContext(checkpointDirectory)
      })
    logger.info("Starting the streaming context")
    ssc.start()
    ssc.awaitTermination()
  }

  def createContext(checkpointDirectory: String): StreamingContext = {
    // Configuration
    val brokers = "hadoop3:9092,hadoop4:9092"
    val topics = "20161218a"
    // Batch interval in seconds
    val split_rdd_time = 8
    // Create the context
    val sparkConf = new SparkConf()
      .setAppName("SendSampleKafkaDataToApple").setMaster("local[2]")
      .set("spark.app.id", "streaming_kafka")
    val ssc = new StreamingContext(sparkConf, Seconds(split_rdd_time))
    ssc.checkpoint(checkpointDirectory)
    // Topic set (not needed when explicit offsets are supplied, kept for reference)
    val topicsSet: Set[String] = topics.split(",").toSet
    // Kafka parameters
    val kafkaParams: Map[String, String] = Map[String, String](
      "metadata.broker.list" -> brokers,
      "group.id" -> "apple_sample",
      "serializer.class" -> "kafka.serializer.StringEncoder"
      // For this 0.8-style direct consumer the valid reset values are:
      // "auto.offset.reset" -> "largest"  // start from the newest offset when none is stored (default)
      // "auto.offset.reset" -> "smallest" // start from the oldest available offset
      // (the values "earliest"/"latest"/"none" belong to the new 0.9+ consumer API)
    )
    /**
      * Read Kafka data starting from the specified offsets.
      * Note: because the direct stream tracks offsets itself, each record is read
      * from Kafka exactly once; end-to-end exactly-once additionally requires
      * idempotent or transactional output.
      * Given the starting offsets, the stream resumes reading Kafka from where
      * the previous run of the streaming program stopped.
      */
    // (topic, partition, offset); in practice these values would be read from
    // persisted storage rather than hard-coded -- see the sketch after this listing
    val offsetList = List((topics, 0, 22753623L), (topics, 1, 327041L))
    val fromOffsets = setFromOffsets(offsetList) // build the offset map
    val messageHandler = (mam: MessageAndMetadata[String, String]) => (mam.topic, mam.message()) // build MessageAndMetadata
    // Consume from the specified offsets via the direct (receiver-less) API; for details see
    // "http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.kafka.KafkaUtils$"
    val messages: InputDStream[(String, String)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](ssc, kafkaParams, fromOffsets, messageHandler)
    // Process the data
    messages.foreachRDD(mess => {
      // Grab the offset ranges of this batch
      val offsetsList = mess.asInstanceOf[HasOffsetRanges].offsetRanges
      mess.foreachPartition(lines => {
        lines.foreach(line => {
          val o: OffsetRange = offsetsList(TaskContext.get.partitionId)
          logger.info("++++++++++++++++++++++++++++++ record the offsets here +++++++++++++++++++++++++++++++++++++++")
          logger.info(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
          logger.info("+++++++++++++++++++++++++++++++ process the data here ++++++++++++++++++++++++++++++++++++++")
          logger.info("The kafka line is " + line)
        })
      })
    })
    ssc
  }

  // Build the Map expected by createDirectStream
  def setFromOffsets(list: List[(String, Int, Long)]): Map[TopicAndPartition, Long] = {
    var fromOffsets: Map[TopicAndPartition, Long] = Map()
    for (offset <- list) {
      val tp = TopicAndPartition(offset._1, offset._2) // topic and partition
      fromOffsets += (tp -> offset._3)                 // starting offset
    }
    fromOffsets
  }
}
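The listing above only logs each batch's offsets; to actually resume after a restart you also need to persist them somewhere and read them back when building offsetList. Below is a minimal sketch of that missing piece, assuming a plain local file as the store; the file path and the helper names saveOffsets/loadOffsets are my own illustration, not part of the original post. In production, ZooKeeper, HBase, or HDFS would be more typical.

import java.io.{File, PrintWriter}
import scala.io.Source
import kafka.common.TopicAndPartition
import org.apache.spark.streaming.kafka.OffsetRange

object OffsetStore {
  // Illustrative storage location; swap in ZooKeeper/HBase/HDFS for production.
  private val offsetFile = new File("/tmp/kafka_offsets.txt")

  // Persist one line per partition: topic,partition,untilOffset
  def saveOffsets(ranges: Array[OffsetRange]): Unit = {
    val writer = new PrintWriter(offsetFile)
    try ranges.foreach(r => writer.println(s"${r.topic},${r.partition},${r.untilOffset}"))
    finally writer.close()
  }

  // Read the offsets back; an empty map means "first run, fall back to defaults".
  def loadOffsets(): Map[TopicAndPartition, Long] = {
    if (!offsetFile.exists()) return Map.empty
    val source = Source.fromFile(offsetFile)
    try {
      source.getLines().map { line =>
        val Array(topic, partition, offset) = line.split(",")
        TopicAndPartition(topic, partition.toInt) -> offset.toLong
      }.toMap
    } finally source.close()
  }
}

With this in place, OffsetStore.saveOffsets(offsetsList) would be called at the end of each foreachRDD batch (it runs on the driver, after the partitions have been processed), and OffsetStore.loadOffsets() would replace the hard-coded offsetList in createContext whenever the store is non-empty. Saving after processing means a record may be reprocessed if the job dies between processing and saving, i.e. at-least-once semantics, which is usually the right trade-off here.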
IV. References:
1. Spark API: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.kafka.KafkaUtils$
2. Kafka configuration reference: http://kafka.apache.org/documentation.html#configuration
3. Kafka SimpleConsumer example: https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example
4. Spark Streaming + Kafka integration guide (offset handling): http://spark.apache.org/docs/1.6.0/streaming-kafka-integration.html
5. Kafka official API docs: http://kafka.apache.org/090/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html
Note: the code above has been tested and works; adapt it as needed. If you have questions, please leave a comment!