Analyzing Kafka Data with Spark Streaming


I. Environment

Development environment:
    OS: Windows 10
    IDE: scala-eclipse-IDE
    Build tool: Maven 3.6.0
    JDK 1.8
    Scala 2.11.11
    Spark 2.4.3
    spark-streaming-kafka-0-10_2.11 (the Kafka integration provided by Spark Streaming)
        Note 1: the trailing 2.11 is the Scala version;
        Note 2: kafka-0-10 means it supports Kafka 0.10 and later;
        Note 3: usage of the integration API is described in the official guide:
            http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
    Kafka_2.11-2.2.1 (2.11 is the Scala version, 2.2.1 is the Kafka version)

Job runtime environment:
    OS: Linux CentOS 7 (two machines: master and slave nodes)
        master : 192.168.190.200
        slave1 : 192.168.190.201
    JDK 1.8
    Scala 2.11.11
    Spark 2.4.3
    spark-streaming-kafka-0-10_2.11
    ZooKeeper 3.4.14
    Kafka_2.11-2.2.1

II. Case Overview

1. (Project 1) Push simulated data into Kafka as key-value pairs:
    1) Name-address data name_addr, input format (key: name, value: name\taddr\t0), for example:

<Key: name>    <Value: name\taddr\t0>
bob            bob      shanghai#200000    0
amy            amy      beijing#100000     0
alice          alice    shanghai#200000    0
tom            tom      beijing#100000     0
lulu           lulu     hangzhou#310000    0
nick           nick     shanghai#200000    0
Note 1: the trailing 0 is the record type (0 = address, 1 = phone).
Note 2: \t is the tab character separating name, addr, and type inside the value.

     2) Name-phone data name_phone, input format (key: name, value: name\tphone\t1), for example:

<Key: name>    <Value: name\tphone\t1>
bob            bob      15700079421    1
amy            amy      18700079458    1
alice          alice    17730079427    1
tom            tom      16700379451    1
lulu           lulu     18800074423    1
nick           nick     14400033426    1
Note 1: the trailing 1 is the record type (0 = address, 1 = phone).
Note 2: \t is the tab character separating name, phone, and type inside the value.

2. (Project 2) A Spark Streaming job that, on a 2-second batch interval, continuously pulls data from the corresponding Kafka topic and analyzes it (i.e., joins the two data sets above):
    The join result is printed to the console and merges each person's name, address, and phone number (a minimal sketch of the join semantics follows the sample output):

Name: tom, Address: beijing#100000, Phone: 16700379451
Name: alice, Address: shanghai#200000, Phone: 17730079427
Name: nick, Address: shanghai#200000, Phone: 14400033426
Name: lulu, Address: hangzhou#310000, Phone: 18800074423
Name: amy, Address: beijing#100000, Phone: 18700079458
Name: bob, Address: shanghai#200000, Phone: 15700079421
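
Conceptually this is an inner join of two keyed data sets on the name field. A minimal local Scala sketch of the join semantics (plain collections, no Spark; the data is a subset of the sample above and the object name is made up for illustration):

object JoinSketch extends App {
  val nameAddrs  = Map("bob" -> "shanghai#200000", "amy" -> "beijing#100000")
  val namePhones = Map("bob" -> "15700079421", "amy" -> "18700079458")

  // Inner join on the key: keep only names present in both maps,
  // pairing each address with the matching phone number.
  val joined = for {
    (name, addr) <- nameAddrs
    phone        <- namePhones.get(name)
  } yield (name, (addr, phone))

  joined.foreach { case (name, (addr, phone)) =>
    println(s"Name: $name, Address: $addr, Phone: $phone")
  }
}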

III. How Spark Streaming Receives Kafka Data

Official guide for the Spark Streaming + Kafka integration: streaming-kafka-integration.html. It describes two approaches: the 0-8 and the 0-10 versions.

This article uses spark-streaming-kafka-0-10: streaming-kafka-0-10-integration.html
           It reads via a Direct Stream, which has the following characteristics:
           1) simple parallelism;
           2) a 1 : 1 mapping between Kafka partitions and Spark partitions;
           3) access to Kafka offsets and metadata (see the sketch below).
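
Regarding point 3, the 0-10 integration exposes each micro-batch's per-partition offset ranges through HasOffsetRanges. A minimal sketch, assuming a stream named kafkaDirectStream created with KafkaUtils.createDirectStream as in the consumer code later in this article:

import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

// Inspect the Kafka offsets and metadata of each micro-batch.
// The cast must be done on the RDD produced directly by the direct stream,
// before any transformation is applied.
kafkaDirectStream.foreachRDD { rdd =>
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { o =>
    println(s"topic=${o.topic} partition=${o.partition} from=${o.fromOffset} until=${o.untilOffset}")
  }
}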

IV. Code Implementation

1. (Producer) Project kafkaGenerator: generates simulated data and pushes it to Kafka


    1) pom.xml: requires the kafka dependency

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com</groupId><!-- group id -->
  <artifactId>kafkaGenerator</artifactId><!-- project name -->
  <version>0.1</version><!-- version -->
  <dependencies>
  	<dependency><!-- Kafka dependency -->
  		<groupId>org.apache.kafka</groupId>
  		<artifactId>kafka_2.11</artifactId>
  		<version>2.2.1</version>
  		<exclusions><!-- exclude packages that cause conflicts -->
  			<exclusion>
  				<artifactId>jmxri</artifactId>
  				<groupId>com.sun.jmx</groupId>
  			</exclusion>
  			<exclusion>
  				<artifactId>jmxtools</artifactId>
  				<groupId>com.sun.jdmk</groupId>
  			</exclusion>
  			<exclusion>
  				<artifactId>jms</artifactId>
  				<groupId>javax.jms</groupId>
  			</exclusion>
  			<exclusion>
  				<artifactId>junit</artifactId>
  				<groupId>junit</groupId>
  			</exclusion>
  		</exclusions>
  	</dependency>
  </dependencies>
  <build>
  	<plugins>
  		<!-- mixed Scala/Java compilation -->
  		<plugin><!-- Scala compiler plugin -->
  			<groupId>org.scala-tools</groupId>
  			<artifactId>maven-scala-plugin</artifactId>
  			<executions>
  				<execution>
  					<id>compile</id>
  					<goals>
  						<goal>compile</goal>
  					</goals>
  					<phase>compile</phase>
  				</execution>
  				<execution>
  					<id>test-compile</id>
  					<goals>
  						<goal>testCompile</goal>
  					</goals>
  					<phase>test-compile</phase>
  				</execution>
  				<execution>
  					<phase>process-resources</phase>
  					<goals>
  						<goal>compile</goal>
  					</goals>
  				</execution>
  			</executions>
  		</plugin>
  		<plugin><!-- Maven compiler plugin -->
  			<artifactId>maven-compiler-plugin</artifactId>
  			<configuration>
  				<source>1.8</source><!-- Java source level -->
  				<target>1.8</target>
  			</configuration>
  		</plugin>
  		<!-- for fatjar -->
  		<plugin><!-- bundle all dependencies into a single jar -->
  			<groupId>org.apache.maven.plugins</groupId>
  			<artifactId>maven-assembly-plugin</artifactId>
  			<version>2.4</version>
  			<configuration>
  				<descriptorRefs>
  					<!-- suffix of the jar name -->
  					<descriptorRef>jar-with-dependencies</descriptorRef>
  				</descriptorRefs>
  			</configuration>
  			<executions>
  				<execution>
  					<id>assemble-all</id>
  					<phase>package</phase>
  					<goals>
  						<goal>single</goal>
  					</goals>
  				</execution>
  			</executions>
  		</plugin>
  		<plugin><!-- Maven jar plugin -->
  			<groupId>org.apache.maven.plugins</groupId>
  			<artifactId>maven-jar-plugin</artifactId>
  			<configuration>
  				<archive>
  					<manifest>
  					<!-- add the classpath -->
  					<addClasspath>true</addClasspath>
  					<!-- set the program's main class -->
  						<mainClass>sparkstreaming_action.kafka.generator.Producer</mainClass>
  					</manifest>
  				</archive>
  			</configuration>
  		</plugin>
  	</plugins>
  </build>
  <repositories>
  	<repository>  
		<id>alimaven</id>  
		<name>aliyun maven</name>  
		<url>http://maven.aliyun.com/nexus/content/groups/public/</url>  
		<releases>  
			<enabled>true</enabled>  
		</releases>  
		<snapshots>  
			<enabled>false</enabled>  
		</snapshots>  
	</repository>
  </repositories>
</project>

 2) Producer.scala: the program entry point

package sparkstreaming_action.kafka.generator

import scala.util.Random
import java.util.Properties
import org.apache.kafka.clients.producer.KafkaProducer
import org.apache.kafka.clients.producer.ProducerRecord

object Producer extends App {
  // topic read from the first runtime argument
  val topic = args(0)
  // broker list read from the second runtime argument
  val brokers = args(1)
  // random number generator
  val rnd = new Random()
  // producer configuration
  val props = new Properties()
  // bootstrap broker list in host:port,host:port form;
  // only used for the initial connection, so it need not list every server
  props.put("bootstrap.servers", brokers)
  // client id, so the server can trace the source of requests
  // (and allow some applications outside an ip/port allow-list to send messages)
  props.put("client.id", "kafkaGenerator")
  // serializer types
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  // create the Kafka producer connection
  val producer = new KafkaProducer[String, String](props)
  // current time in milliseconds
  val t = System.currentTimeMillis()
  // simulated name-address data
  val nameAddrs = Map("bob" -> "shanghai#200000", "amy" -> "beijing#100000",
      "alice" -> "shanghai#200000", "tom" -> "beijing#100000",
      "lulu" -> "hangzhou#310000", "nick" -> "shanghai#200000")
  // simulated name-phone data
  val namePhones = Map("bob" -> "15700079421", "amy" -> "18700079458",
      "alice" -> "17730079427", "tom" -> "16700379451",
      "lulu" -> "18800074423", "nick" -> "14400033426")
  // generate records of the form (name, addr, type:0)
  for (nameAddr <- nameAddrs) {
    val data = new ProducerRecord[String, String](topic, nameAddr._1,
        s"${nameAddr._1}\t${nameAddr._2}\t0")
    producer.send(data)  // asynchronous send to Kafka
    //if (rnd.nextInt(100) < 50) Thread.sleep(rnd.nextInt(10))
  }
  // generate records of the form (name, phone, type:1)
  for (namePhone <- namePhones) {
    val data = new ProducerRecord[String, String](topic, namePhone._1,
        s"${namePhone._1}\t${namePhone._2}\t1")
    producer.send(data)  // asynchronous send to Kafka
    //if (rnd.nextInt(100) < 50) Thread.sleep(rnd.nextInt(10))
  }

  // report the number of records sent per second
  System.out.println("sent per second: "
      + (nameAddrs.size + namePhones.size) * 1000 / (System.currentTimeMillis() - t))
  producer.close()
}
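
Note that producer.send above is asynchronous and its result is never checked. If delivery errors matter, a callback (or blocking on the returned Future) can be used; a minimal sketch, not part of the original project:

import org.apache.kafka.clients.producer.{Callback, KafkaProducer, ProducerRecord, RecordMetadata}

// Send with a delivery callback so failures are at least logged.
def sendWithCallback(producer: KafkaProducer[String, String],
                     record: ProducerRecord[String, String]): Unit =
  producer.send(record, new Callback {
    override def onCompletion(metadata: RecordMetadata, exception: Exception): Unit =
      if (exception != null)
        System.err.println(s"send failed: ${exception.getMessage}")
      else
        println(s"sent to ${metadata.topic}-${metadata.partition} at offset ${metadata.offset}")
  })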

2. (Consumer) Project kafkaSparkStreaming: pulls data from Kafka (topic: kafkaOperation), joins it, and prints the result to the console

    1) pom.xml: requires the spark-streaming-kafka-0-10_2.11 dependency

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com</groupId><!-- group id -->
  <artifactId>kafkaSparkStreaming</artifactId><!-- project name -->
  <version>0.1</version><!-- version -->
  <properties>
  	<spark.version>2.4.3</spark.version><!-- property holding the Spark version -->
  </properties>
  <dependencies>
  	<dependency><!-- Spark dependency -->
  		<groupId>org.apache.spark</groupId>
  		<artifactId>spark-core_2.11</artifactId>
  		<version>${spark.version}</version>
  		<scope>provided</scope><!-- provided at runtime by the Spark cluster, so not packaged -->
  	</dependency>
  	<dependency><!-- Spark Streaming dependency -->
  		<groupId>org.apache.spark</groupId>
  		<artifactId>spark-streaming_2.11</artifactId>
  		<version>${spark.version}</version>
  		<scope>provided</scope><!-- provided at runtime by the Spark cluster, so not packaged -->
  	</dependency>
  	<dependency><!-- Spark Streaming Kafka 0-10 integration dependency -->
  		<groupId>org.apache.spark</groupId>
  		<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
  		<version>${spark.version}</version>
  	</dependency>
  	<dependency><!-- log4j logging dependency -->
  		<groupId>log4j</groupId>
  		<artifactId>log4j</artifactId>
  		<version>1.2.17</version>
  	</dependency>
  	<dependency><!-- slf4j logging binding -->
  		<groupId>org.slf4j</groupId>
  		<artifactId>slf4j-log4j12</artifactId>
  		<version>1.7.12</version>
  	</dependency>
  </dependencies>
  <build>
  	<plugins>
  		<!-- mixed Scala/Java compilation -->
  		<plugin><!-- Scala compiler plugin -->
  			<groupId>org.scala-tools</groupId>
  			<artifactId>maven-scala-plugin</artifactId>
  			<executions>
  				<execution>
  					<id>compile</id>
  					<goals>
  						<goal>compile</goal>
  					</goals>
  					<phase>compile</phase>
  				</execution>
  				<execution>
  					<id>test-compile</id>
  					<goals>
  						<goal>testCompile</goal>
  					</goals>
  					<phase>test-compile</phase>
  				</execution>
  				<execution>
  					<phase>process-resources</phase>
  					<goals>
  						<goal>compile</goal>
  					</goals>
  				</execution>
  			</executions>
  		</plugin>
  		<plugin>
  			<artifactId>maven-compiler-plugin</artifactId>
  			<configuration>
  				<source>1.8</source><!-- Java source level -->
  				<target>1.8</target>
  			</configuration>
  		</plugin>
  		<!-- for fatjar -->
  		<plugin><!-- bundle all dependencies into a single jar -->
  			<groupId>org.apache.maven.plugins</groupId>
  			<artifactId>maven-assembly-plugin</artifactId>
  			<version>2.4</version>
  			<configuration>
  				<descriptorRefs>
  					<descriptorRef>jar-with-dependencies</descriptorRef>
  				</descriptorRefs>
  			</configuration>
  			<executions>
  				<execution>
  					<id>assemble-all</id>
  					<phase>package</phase>
  					<goals>
  						<goal>single</goal>
  					</goals>
  				</execution>
  			</executions>
  		</plugin>
  		<plugin>
  			<groupId>org.apache.maven.plugins</groupId>
  			<artifactId>maven-jar-plugin</artifactId>
  			<configuration>
  				<archive>
  					<manifest>
  					<!-- add the classpath -->
  					<addClasspath>true</addClasspath>
  					<!-- set the program's main class -->
  						<mainClass>sparkstreaming_action.kafka.operation.KafkaOperation</mainClass>
  					</manifest>
  				</archive>
  			</configuration>
  		</plugin>
  	</plugins>
  </build>
  <repositories>
  	<repository>  
		<id>alimaven</id>  
		<name>aliyun maven</name>  
		<url>http://maven.aliyun.com/nexus/content/groups/public/</url>  
		<releases>  
			<enabled>true</enabled>  
		</releases>  
		<snapshots>  
			<enabled>false</enabled>  
		</snapshots>  
	</repository>
  </repositories>
</project>

    2) KafkaOperation.scala: the program entry point

package sparkstreaming_action.kafka.operation

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies
import org.apache.spark.streaming.kafka010.ConsumerStrategies

object KafkaOperation extends App {
  // Spark configuration
  val sparkConf = new SparkConf()
      .setAppName("KafkaOperation")
      .setMaster("spark://master:7077")
      .set("spark.local.dir", "./tmp")
      .set("spark.streaming.kafka.maxRatePerPartition", "10")
      // spark.streaming.kafka.maxRatePerPartition: caps the number of messages
      // Spark reads from each Kafka partition per second
  // create the streaming context with a 2-second batch interval
  val ssc = new StreamingContext(sparkConf, Seconds(2))
  // Kafka parameters for the direct (Direct Kafka) connection, built from brokers and topic
  val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "master:9092,slave1:9092",  // bootstrap broker list
      "key.deserializer" -> classOf[StringDeserializer],  // key deserializer type
      "value.deserializer" -> classOf[StringDeserializer],  // value deserializer type
      "group.id" -> "kafkaOperationGroup",  // consumer group
      "auto.offset.reset" -> "latest",  // start from the latest offset
      "enable.auto.commit" -> (false: java.lang.Boolean)  // disable automatic offset commits
    )
  // create the Kafka DStream
  val kafkaDirectStream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](List("kafkaOperation"), kafkaParams)
    )
  // from the received Kafka records, extract the name-address DStream
  val nameAddrStream = kafkaDirectStream.map(_.value).filter(record => {
      // split the record on tabs
      val tokens = record.split("\t")
      // keep only (name, addr, type:0) records
      tokens(2).toInt == 0
    }).map(record => {
      // split the record on tabs
      val tokens = record.split("\t")
      // return a new DStream of (name, addr) pairs
      (tokens(0), tokens(1))
    })
  // from the received Kafka records, extract the name-phone DStream
  val namePhoneStream = kafkaDirectStream.map(_.value).filter(record => {
      // split the record on tabs
      val tokens = record.split("\t")
      // keep only (name, phone, type:1) records
      tokens(2).toInt == 1
    }).map(record => {
      // split the record on tabs
      val tokens = record.split("\t")
      // return a new DStream of (name, phone) pairs
      (tokens(0), tokens(1))
    })
  // join address and phone by name, producing records of the form (name, (addr, phone)),
  // then format them for output
  val nameAddrPhoneStream = nameAddrStream.join(namePhoneStream).map(record => {
      s"Name: ${record._1}, Address: ${record._2._1}, Phone: ${record._2._2}"
    })
  // print the result (first 10 elements of each batch by default)
  nameAddrPhoneStream.print()
  // start the computation
  ssc.start()
  ssc.awaitTermination()
}
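
Because enable.auto.commit is false, the job above never stores the offsets it has consumed, so on restart it simply resumes from auto.offset.reset (latest). If the consumed offsets should be tracked in Kafka, the 0-10 integration can commit them explicitly after each batch. A minimal sketch, not part of the original project:

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

// Commit the consumed offsets back to Kafka after each batch.
// Both calls must use the original direct stream, before any transformation.
kafkaDirectStream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process or output the batch here ...
  kafkaDirectStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}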

    3) Spark job flow diagram for this (consumer) project

The three DAG scheduling stages of the Spark Streaming print job
<screenshot from the Spark UI>

 

V. Packaging and Running the Projects

1. Build and package the projects
  1. In the root directory of project kafkaSparkStreaming, run mvn clean install,
    then upload target/kafkaSparkStreaming-0.1-jar-with-dependencies.jar to the Linux cluster
  2. In the root directory of project kafkaGenerator, run mvn clean install,
    then upload target/kafkaGenerator-0.1-jar-with-dependencies.jar to the Linux cluster

Run on all nodes:
2. Start ZooKeeper and Kafka
    $ zkServer.sh start
    $ kafka-server-start.sh -daemon /opt/kafka_2.11-2.2.1/config/server.properties

Connect to the master node from terminal A (e.g. Windows PowerShell):
3. Start Spark
    $ /opt/spark-2.4.3-bin-hadoop2.7/sbin/start-all.sh
4. Submit the Spark job, which continuously pulls data from Kafka (topic: kafkaOperation)
    $ spark-submit \
      --class sparkstreaming_action.kafka.operation.KafkaOperation \
      --num-executors 2 \
      --conf spark.default.parallelism=1000 \
      kafkaSparkStreaming-0.1-jar-with-dependencies.jar
    
    <If log output like the following appears, with a job printed every 2 s, the job started successfully>
    -------------------------------------------
    Time: 1560391586000 ms
    -------------------------------------------
    
    -------------------------------------------
    Time: 1560391588000 ms
    -------------------------------------------
    <Note: Kafka contains no data yet, so the batches print nothing>

Connect to the master node from terminal B (multiple PowerShell windows can be open):
5. Push simulated data into Kafka (can be repeated several times; watch the Streaming graphs in the Spark UI)
    $ java -cp kafkaGenerator-0.1-jar-with-dependencies.jar \
      sparkstreaming_action.kafka.generator.Producer \
      kafkaOperation master:9092,slave1:9092

    <On success, it prints the number of records sent to Kafka per second:>
    sent per second: 54

    <At this point, the streaming job in terminal A has pulled the Kafka data, joined it, and printed the result:>
    -------------------------------------------
    Time: 1560392046000 ms
    -------------------------------------------
    Name: tom, Address: beijing#100000, Phone: 16700379451
    Name: alice, Address: shanghai#200000, Phone: 17730079427
    Name: nick, Address: shanghai#200000, Phone: 14400033426
    Name: lulu, Address: hangzhou#310000, Phone: 18800074423
    Name: amy, Address: beijing#100000, Phone: 18700079458
    Name: bob, Address: shanghai#200000, Phone: 15700079421
    -------------------------------------------
    Time: 1560392048000 ms
    -------------------------------------------

6. In terminal B, check the consumer offsets for the topic kafkaOperation:
    $ kafka-consumer-groups.sh \
      --bootstrap-server master:9092 \
      --describe --group kafkaOperationGroup
    
    <Output like the following appears; the log-end offset increases by 1 for every record pushed into Kafka>
    TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                     HOST             CLIENT-ID
    kafkaOperation  0          -               2412            -               consumer-1-3feb3110-f90c-4a55-b453-4270038e724b /192.168.190.200 consumer-1

 

VI. References

1. 《Spark Streaming 实时流式大数据处理实战》 (Spark Streaming Real-Time Big Data Processing in Action), Chapter 5: Spark Streaming and Kafka

2. 再谈Spark Streaming Kafka反压 (Revisiting Spark Streaming Kafka backpressure)

3. Kafka配置说明 (Kafka configuration reference)

4. streaming-programming-guide.html (official Spark Streaming programming guide)

5. streaming-kafka-0-10-integration.html (official guide)

6. streaming-kafka-integration.html (official Spark Streaming + Kafka integration guide)

 
