Kafka + Spark Streaming + Redis Real-Time System in Practice

Spark is a general-purpose computing platform that extends well to many kinds of computational workloads, in particular through its built-in libraries such as Spark Streaming, Spark SQL, MLlib, and GraphX. These libraries provide high-level abstractions, so complex computation logic can be expressed in very concise code, something the conciseness of the Scala language also contributes to. Here we built our computing platform on Spark 1.3.0 and implemented real-time computation with Spark Streaming.

  Our application scenario is analyzing how users behave in a mobile app, described as follows:

  1. The mobile client collects user behavior events (we use click events as an example) and sends the data to a data server; here we assume the data goes directly into a Kafka message queue.
  2. A backend real-time service consumes the data from Kafka and analyzes it in real time. We chose Spark Streaming because it ships built-in support for Kafka integration.
  3. The Spark Streaming job analyzes the data and writes the results to Redis, so user behavior data can be fetched in real time and can also be exported for offline aggregate analysis.

Kafka + Spark Streaming + Redis Programming in Practice

  Below, we implement this real-time computation application for the scenario described above. First, we wrote a Kafka Producer simulator that writes simulated user behavior events to Kafka in real time. The data is in JSON format, for example:

{
    "uid": "068b746ed4620d25e26055a9f804385f",
    "event_time": "1430204612405",
    "os_type": "Android",
    "click_count": 6
}

An event contains 4 fields (a small typed sketch follows this list):
  1. uid: the user ID
  2. event_time: the timestamp at which the event occurred
  3. os_type: the operating system type of the mobile app
  4. click_count: the number of clicks
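
For readers who prefer a typed view of the payload, here is a small illustrative case class mirroring these fields. It is an assumption of mine added for clarity and is not used by the code in this article, which works with raw JSON:

// Hypothetical model of one event; not referenced by the code below.
case class UserClickEvent(
  uid: String,        // user ID
  eventTime: String,  // event timestamp in milliseconds, kept as a string as in the JSON
  osType: String,     // mobile OS type, e.g. "Android"
  clickCount: Int     // number of clicks
)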
The code we implemented is shown below:

package com.iteblog.spark.streaming.utils

import java.util.Properties

import scala.util.Random

import org.codehaus.jettison.json.JSONObject

import kafka.javaapi.producer.Producer
import kafka.producer.KeyedMessage
import kafka.producer.ProducerConfig

object KafkaEventProducer {

  private val users = Array(
      "4A4D769EB9679C054DE81B973ED5D768", "8dfeb5aaafc027d89349ac9a20b3930f",
      "011BBF43B89BFBF266C865DF0397AA71", "f2a8474bf7bd94f0aabbd4cdd2c06dcf",
      "068b746ed4620d25e26055a9f804385f", "97edfc08311c70143401745a03a50706",
      "d7f141563005d1b5d0d3dd30138f3f62", "c8ee90aade1671a21336c721512b817a",
      "6b67c8c700427dee7552f81f3228c927", "a95f22eabc4fd4b580c011a3161a9d9d")

  private val random = new Random()

  private var pointer = -1

  // Cycle through the predefined user IDs.
  def getUserID(): String = {
    pointer = pointer + 1
    if (pointer >= users.length) {
      pointer = 0
      users(pointer)
    } else {
      users(pointer)
    }
  }

  // Random click count between 0 and 9.
  def click(): Int = {
    random.nextInt(10)
  }

  // bin/kafka-topics.sh --zookeeper zk1:2181,zk2:2181,zk3:2181/kafka --create --topic user_events --replication-factor 2 --partitions 2
  // bin/kafka-topics.sh --zookeeper zk1:2181,zk2:2181,zk3:2181/kafka --list
  // bin/kafka-topics.sh --zookeeper zk1:2181,zk2:2181,zk3:2181/kafka --describe user_events
  // bin/kafka-console-consumer.sh --zookeeper zk1:2181,zk2:2181,zk3:2181/kafka --topic test_json_basis_event --from-beginning
  def main(args: Array[String]): Unit = {
    val topic = "user_events"
    val brokers = "10.10.4.126:9092,10.10.4.127:9092"
    val props = new Properties()
    props.put("metadata.broker.list", brokers)
    props.put("serializer.class", "kafka.serializer.StringEncoder")

    val kafkaConfig = new ProducerConfig(props)
    val producer = new Producer[String, String](kafkaConfig)

    while (true) {
      // prepare event data
      val event = new JSONObject()
      event
        .put("uid", getUserID)
        .put("event_time", System.currentTimeMillis.toString)
        .put("os_type", "Android")
        .put("click_count", click)

      // produce event message
      producer.send(new KeyedMessage[String, String](topic, event.toString))
      println("Message sent: " + event)

      Thread.sleep(200)
    }
  }
}

  The sleep interval in the last line of the program above controls the simulated write rate. Next we implement a real-time count of each user's clicks: events are grouped by user and their click counts accumulated. The logic is simple; the key is to watch out for a few issues during implementation, such as object serialization. Let's look at the code first; we will discuss the details afterwards:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

import kafka.serializer.StringDecoder
import net.sf.json.JSONObject

object UserClickCountAnalytics {

  def main(args: Array[String]): Unit = {
    var masterUrl = "local[1]"
    if (args.length > 0) {
      masterUrl = args(0)
    }

    // Create a StreamingContext with the given master URL
    val conf = new SparkConf().setMaster(masterUrl).setAppName("UserClickCountStat")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Kafka configurations
    val topics = Set("user_events")
    val brokers = "10.10.4.126:9092,10.10.4.127:9092"
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> brokers, "serializer.class" -> "kafka.serializer.StringEncoder")

    val dbIndex = 1
    val clickHashKey = "app::users::click"

    // Create a direct stream
    val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)

    val events = kafkaStream.flatMap(line => {
      val data = JSONObject.fromObject(line._2)
      Some(data)
    })

    // Compute user click times
    val userClicks = events.map(x => (x.getString("uid"), x.getInt("click_count"))).reduceByKey(_ + _)
    userClicks.foreachRDD(rdd => {
      rdd.foreachPartition(partitionOfRecords => {
        partitionOfRecords.foreach(pair => {
          val uid = pair._1
          val clickCount = pair._2
          val jedis = RedisClient.pool.getResource
          jedis.select(dbIndex)
          jedis.hincrBy(clickHashKey, uid, clickCount)
          RedisClient.pool.returnResource(jedis)
        })
      })
    })

    ssc.start()
    ssc.awaitTermination()
  }
}

  The code above uses the Jedis client to operate on Redis, accumulating the grouped counts into Redis storage. If another system needs this data in real time, it can simply read it directly from Redis. The RedisClient implementation is shown below:

import org.apache.commons.pool2.impl.GenericObjectPoolConfig

import redis.clients.jedis.JedisPool

object RedisClient extends Serializable {
  val redisHost = "10.10.4.130"
  val redisPort = 6379
  val redisTimeout = 30000
  lazy val pool = new JedisPool(new GenericObjectPoolConfig(), redisHost, redisPort, redisTimeout)

  lazy val hook = new Thread {
    override def run = {
      println("Execute hook thread: " + this)
      pool.destroy()
    }
  }
  sys.addShutdownHook(hook.run)
}

  We have run the code above successfully in both local[K] mode and on a Spark Standalone cluster.

  When debugging in a development environment we use the local[K] deployment mode, which starts K worker threads on the local machine, all inside a single JVM instance. The code above defaults to local[K] when no argument is passed, so creating the Redis connection pool or the connections themselves tends to work on the first try in this mode. However, under the Spark Standalone, YARN Client (or YARN Cluster), or Mesos deployment modes, the job fails, and the error comes from how the Redis connection pool or connections are handled. Let's look at the Spark architecture, as shown in the figure (taken from the official website):

  Whether we run in local mode, Standalone mode, or on Mesos or YARN, the structure of the whole Spark cluster can be represented abstractly by the figure above; only the runtime environments of the components differ, which makes them distributed, local, or confined to a single JVM instance. In local mode, for example, the figure collapses to multiple components inside a single process on one node; in YARN Client mode, the Driver program submits the Spark Application from a node outside the YARN cluster, while the other components run on nodes managed by YARN.

  After an Application is deployed in a Spark cluster, the functions applied to the RDD data sets (in Spark Streaming, the operations applied to the DStreams) are shipped to the Executors on the cluster's Workers when the computation runs. The objects those functions reference must therefore be serializable; otherwise they cannot be correctly deserialized and reconstructed into usable objects after being transferred across nodes. In Scala this can also be solved with a lazy reference, which is what the code above uses:

// lazy pool reference
lazy val pool = new JedisPool(new GenericObjectPoolConfig(), redisHost, redisPort, redisTimeout)
...
partitionOfRecords.foreach(pair => {
  val uid = pair._1
  val clickCount = pair._2
  val jedis = RedisClient.pool.getResource
  jedis.select(dbIndex)
  jedis.hincrBy(clickHashKey, uid, clickCount)
  RedisClient.pool.returnResource(jedis)
})
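
  To make the failure mode concrete, here is a minimal sketch (mine, not taken from the article's code) of the anti-pattern: a Jedis connection created on the driver is captured by the closure passed to the DStream output operation, so Spark must serialize it together with the closure when shipping tasks to the Executors, and the job aborts with a "Task not serializable" error caused by java.io.NotSerializableException. The names userClicks, dbIndex, and clickHashKey are the ones defined in the code above.

// Anti-pattern sketch: do NOT do this when running on a cluster.
// `driverSideJedis` is created on the driver; the closures below capture it,
// and Jedis instances are not serializable, so task serialization fails.
val driverSideJedis = new redis.clients.jedis.Jedis("10.10.4.130", 6379)

userClicks.foreachRDD(rdd => {
  rdd.foreachPartition(partitionOfRecords => {
    partitionOfRecords.foreach { case (uid, clickCount) =>
      driverSideJedis.select(dbIndex)                        // never runs on a cluster:
      driverSideJedis.hincrBy(clickHashKey, uid, clickCount) // Task not serializable
    }
  })
})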

  Alternatively, we can modify the code to put the management of the Redis connection inside the scope of the DStream output operation, because we know it is then initialized on a specific Executor, and manage it with a singleton object, as shown below:

package org.shirdrn.spark.streaming

import org.apache.commons.pool2.impl.GenericObjectPoolConfig
import org.apache.spark.SparkConf
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream.toPairDStreamFunctions
import org.apache.spark.streaming.kafka.KafkaUtils

import kafka.serializer.StringDecoder
import net.sf.json.JSONObject
import redis.clients.jedis.JedisPool

object UserClickCountAnalytics {

  def main(args: Array[String]): Unit = {
    var masterUrl = "local[1]"
    if (args.length > 0) {
      masterUrl = args(0)
    }

    // Create a StreamingContext with the given master URL
    val conf = new SparkConf().setMaster(masterUrl).setAppName("UserClickCountStat")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Kafka configurations
    val topics = Set("user_events")
    val brokers = "10.10.4.126:9092,10.10.4.127:9092"
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> brokers, "serializer.class" -> "kafka.serializer.StringEncoder")

    val dbIndex = 1
    val clickHashKey = "app::users::click"

    // Create a direct stream
    val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)

    val events = kafkaStream.flatMap(line => {
      val data = JSONObject.fromObject(line._2)
      Some(data)
    })

    // Compute user click times
    val userClicks = events.map(x => (x.getString("uid"), x.getInt("click_count"))).reduceByKey(_ + _)
    userClicks.foreachRDD(rdd => {
      rdd.foreachPartition(partitionOfRecords => {
        partitionOfRecords.foreach(pair => {

          /**
           * Internal Redis client for managing Redis connection {@link Jedis} based on {@link RedisPool}
           */
          object InternalRedisClient extends Serializable {

            @transient private var pool: JedisPool = null

            def makePool(redisHost: String, redisPort: Int, redisTimeout: Int,
                maxTotal: Int, maxIdle: Int, minIdle: Int): Unit = {
              makePool(redisHost, redisPort, redisTimeout, maxTotal, maxIdle, minIdle, true, false, 10000)
            }

            def makePool(redisHost: String, redisPort: Int, redisTimeout: Int,
                maxTotal: Int, maxIdle: Int, minIdle: Int, testOnBorrow: Boolean,
                testOnReturn: Boolean, maxWaitMillis: Long): Unit = {
              if (pool == null) {
                val poolConfig = new GenericObjectPoolConfig()
                poolConfig.setMaxTotal(maxTotal)
                poolConfig.setMaxIdle(maxIdle)
                poolConfig.setMinIdle(minIdle)
                poolConfig.setTestOnBorrow(testOnBorrow)
                poolConfig.setTestOnReturn(testOnReturn)
                poolConfig.setMaxWaitMillis(maxWaitMillis)
                pool = new JedisPool(poolConfig, redisHost, redisPort, redisTimeout)

                val hook = new Thread {
                  override def run = pool.destroy()
                }
                sys.addShutdownHook(hook.run)
              }
            }

            def getPool: JedisPool = {
              assert(pool != null)
              pool
            }
          }

          // Redis configurations
          val maxTotal = 10
          val maxIdle = 10
          val minIdle = 1
          val redisHost = "10.10.4.130"
          val redisPort = 6379
          val redisTimeout = 30000
          val dbIndex = 1
          InternalRedisClient.makePool(redisHost, redisPort, redisTimeout, maxTotal, maxIdle, minIdle)

          val uid = pair._1
          val clickCount = pair._2
          val jedis = InternalRedisClient.getPool.getResource
          jedis.select(dbIndex)
          jedis.hincrBy(clickHashKey, uid, clickCount)
          InternalRedisClient.getPool.returnResource(jedis)
        })
      })
    })

    ssc.start()
    ssc.awaitTermination()
  }
}

  This implementation benefits from a Scala language feature: a class or object can be defined anywhere in the code. By placing the code that manages the Redis connection inside the specific operation, we avoid the problem of serializing a transient object across nodes. Doing this also requires understanding how Spark operates on RDD data sets internally; see the RDD and Spark documentation for more details.
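
  One refinement worth considering (my variation, not part of the original article): the version above calls makePool and borrows a connection for every single record. Since foreachPartition already gives a per-partition scope that runs on the Executor, a common pattern is to borrow one connection per partition and reuse it for every record in that partition. A minimal sketch, assuming InternalRedisClient is lifted out of the per-record closure (for example, defined as a top-level object shipped in the application jar):

// Sketch: one Redis connection per partition instead of one per record.
userClicks.foreachRDD(rdd => {
  rdd.foreachPartition(partitionOfRecords => {
    // Same Redis configuration values as above, now at partition scope.
    InternalRedisClient.makePool("10.10.4.130", 6379, 30000, 10, 10, 1)

    val jedis = InternalRedisClient.getPool.getResource
    jedis.select(1) // dbIndex
    partitionOfRecords.foreach { case (uid, clickCount) =>
      jedis.hincrBy("app::users::click", uid, clickCount)
    }
    InternalRedisClient.getPool.returnResource(jedis)
  })
})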

  To run on the cluster in Standalone mode, execute the following commands:

cd /usr/local/spark
./bin/spark-submit --class org.shirdrn.spark.streaming.UserClickCountAnalytics \
            --master spark://hadoop1:7077 \
            --executor-memory 1G \
            --total-executor-cores 2 \
            ~/spark-0.0.SNAPSHOT.jar spark://hadoop1:7077

  You can check the status of the computation tasks running on each Worker node in the cluster, which is also very convenient to do through the Web UI.

  Now let's look at the computation results stored in Redis:

127.0.0.1:6379[1]> HGETALL app::users::click
 1) "4A4D769EB9679C054DE81B973ED5D768"
 2) "7037"
 3) "8dfeb5aaafc027d89349ac9a20b3930f"
 4) "6992"
 5) "011BBF43B89BFBF266C865DF0397AA71"
 6) "7021"
 7) "97edfc08311c70143401745a03a50706"
 8) "6874"
 9) "d7f141563005d1b5d0d3dd30138f3f62"
10) "7057"
11) "a95f22eabc4fd4b580c011a3161a9d9d"
12) "7092"
13) "6b67c8c700427dee7552f81f3228c927"
14) "7266"
15) "f2a8474bf7bd94f0aabbd4cdd2c06dcf"
16) "7188"
17) "c8ee90aade1671a21336c721512b817a"
18) "6950"
19) "068b746ed4620d25e26055a9f804385f"
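
  As noted earlier, any other system that needs these numbers in real time only has to read this hash. Below is a minimal standalone reader sketch using Jedis; it is my illustration under the same host, port, and key assumptions as above, not part of the original application:

import scala.collection.JavaConverters._

import redis.clients.jedis.Jedis

object ClickCountReader {
  def main(args: Array[String]): Unit = {
    val jedis = new Jedis("10.10.4.130", 6379)
    jedis.select(1) // same dbIndex the streaming job writes to
    // HGETALL app::users::click returns a java.util.Map[String, String]
    val counts = jedis.hgetAll("app::users::click").asScala
    counts.foreach { case (uid, clicks) => println(uid + " -> " + clicks) }
    jedis.disconnect()
  }
}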

POM File and Related Dependencies

  For reference, here are the dependencies used by the application developed above and the Maven configuration used to package the Spark Streaming application. If the maven-shade-plugin is misconfigured, submitting the packaged Application to the Spark cluster may fail with "Invalid signature file digest for Manifest main attributes". A reference Maven configuration is shown below:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
     <modelVersion>4.0.0</modelVersion>
     <groupId>org.shirdrn.spark</groupId>
     <artifactId>spark</artifactId>
     <version>0.0.1-SNAPSHOT</version>

     <dependencies>
          <dependency>
               <groupId>org.apache.spark</groupId>
               <artifactId>spark-core_2.10</artifactId>
               <version>1.3.0</version>
          </dependency>
          <dependency>
               <groupId>org.apache.spark</groupId>
               <artifactId>spark-streaming_2.10</artifactId>
               <version>1.3.0</version>
          </dependency>
          <dependency>
               <groupId>net.sf.json-lib</groupId>
               <artifactId>json-lib</artifactId>
               <version>2.3</version>
          </dependency>
          <dependency>
               <groupId>org.apache.spark</groupId>
               <artifactId>spark-streaming-kafka_2.10</artifactId>
               <version>1.3.0</version>
          </dependency>
          <dependency>
               <groupId>redis.clients</groupId>
               <artifactId>jedis</artifactId>
               <version>2.5.2</version>
          </dependency>
          <dependency>
               <groupId>org.apache.commons</groupId>
               <artifactId>commons-pool2</artifactId>
               <version>2.2</version>
          </dependency>
     </dependencies>

     <build>
          <sourceDirectory>${basedir}/src/main/scala</sourceDirectory>
          <testSourceDirectory>${basedir}/src/test/scala</testSourceDirectory>
          <resources>
               <resource>
                    <directory>${basedir}/src/main/resources</directory>
               </resource>
          </resources>
          <testResources>
               <testResource>
                    <directory>${basedir}/src/test/resources</directory>
               </testResource>
          </testResources>
          <plugins>
               <plugin>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>3.1</version>
                    <configuration>
                         <source>1.6</source>
                         <target>1.6</target>
                    </configuration>
               </plugin>
               <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-shade-plugin</artifactId>
                    <version>2.2</version>
                    <configuration>
                         <createDependencyReducedPom>true</createDependencyReducedPom>
                    </configuration>
                    <executions>
                         <execution>
                              <phase>package</phase>
                              <goals>
                                   <goal>shade</goal>
                              </goals>
                              <configuration>
                                   <artifactSet>
                                        <includes>
                                             <include>*:*</include>
                                        </includes>
                                   </artifactSet>
                                   <filters>
                                        <filter>
                                             <artifact>*:*</artifact>
                                             <excludes>
                                                  <exclude>META-INF/*.SF</exclude>
                                                  <exclude>META-INF/*.DSA</exclude>
                                                  <exclude>META-INF/*.RSA</exclude>
                                             </excludes>
                                        </filter>
                                   </filters>
                                   <transformers>
                                        <transformer
                                             implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                                        <transformer
                                             implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                                             <resource>reference.conf</resource>
                                        </transformer>
                                        <transformer
                                             implementation="org.apache.maven.plugins.shade.resource.DontIncludeResourceTransformer">
                                             <resource>log4j.properties</resource>
                                        </transformer>
                                   </transformers>
                              </configuration>
                         </execution>
                    </executions>
               </plugin>
          </plugins>
     </build>
</project>