Kafka + Spark Streaming + Redis: A Real-Time System in Practice

http://www.iteblog.com/archives/1378


  Spark is a general-purpose computing platform that scales well to many kinds of applications, in particular through its built-in libraries: Spark Streaming, Spark SQL, MLlib, and GraphX. These libraries provide high-level abstractions, so complex computation logic can be expressed in very little code, helped along by the conciseness of Scala. Here we built our platform on Spark 1.3.0 and implemented real-time computation with Spark Streaming.

  Our application scenario is analyzing how users behave in a mobile app, described as follows:

  1. The mobile client collects user behavior events (we use click events as the example) and sends them to the data server; for simplicity we assume they go straight into a Kafka message queue.
  2. A backend real-time service consumes the data from Kafka and analyzes it as it arrives. We use Spark Streaming here because it ships with built-in Kafka integration.
  3. The Spark Streaming job writes its results to Redis, so user behavior data can be fetched in real time and can also be exported for offline, aggregate analysis.

Kafka + Spark Streaming + Redis Programming Practice

  Next, based on the scenario above, we implement this real-time computation application. First we wrote a Kafka producer that simulates writing user behavior events into Kafka in real time. The data is JSON, for example:

{
  "uid": "068b746ed4620d25e26055a9f804385f",
  "event_time": "1430204612405",
  "os_type": "Android",
  "click_count": 6
}

An event contains 4 fields (a typed sketch follows the list):
  1. uid: user ID
  2. event_time: timestamp of when the event occurred
  3. os_type: operating system type of the mobile app
  4. click_count: number of clicks
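
For reference, the same event can be modeled as a Scala case class on the consumer side. This is only an illustrative sketch; the class name and field types are our own assumption, not part of the original code:

// Hypothetical typed view of one click event (illustration only)
case class ClickEvent(
  uid: String,        // user ID
  eventTime: String,  // event timestamp in milliseconds, kept as a string as in the JSON
  osType: String,     // mobile OS type, e.g. "Android"
  clickCount: Int     // number of clicks
)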
Below is the producer code we implemented:

package com.iteblog.spark.streaming.utils

import java.util.Properties
import scala.util.Random

import org.codehaus.jettison.json.JSONObject

import kafka.javaapi.producer.Producer
import kafka.producer.KeyedMessage
import kafka.producer.ProducerConfig

object KafkaEventProducer {

  // Sample user IDs used to simulate click events
  private val users = Array(
    "4A4D769EB9679C054DE81B973ED5D768", "8dfeb5aaafc027d89349ac9a20b3930f",
    "011BBF43B89BFBF266C865DF0397AA71", "f2a8474bf7bd94f0aabbd4cdd2c06dcf",
    "068b746ed4620d25e26055a9f804385f", "97edfc08311c70143401745a03a50706",
    "d7f141563005d1b5d0d3dd30138f3f62", "c8ee90aade1671a21336c721512b817a",
    "6b67c8c700427dee7552f81f3228c927", "a95f22eabc4fd4b580c011a3161a9d9d")

  private val random = new Random()

  private var pointer = -1

  // Cycle through the sample user IDs round-robin
  def getUserID(): String = {
    pointer = pointer + 1
    if (pointer >= users.length) {
      pointer = 0
    }
    users(pointer)
  }

  def click(): Int = {
    random.nextInt(10)
  }

  // bin/kafka-topics.sh --zookeeper zk1:2181,zk2:2181,zk3:2181/kafka --create --topic user_events --replication-factor 2 --partitions 2
  // bin/kafka-topics.sh --zookeeper zk1:2181,zk2:2181,zk3:2181/kafka --list
  // bin/kafka-topics.sh --zookeeper zk1:2181,zk2:2181,zk3:2181/kafka --describe user_events
  // bin/kafka-console-consumer.sh --zookeeper zk1:2181,zk2:2181,zk3:2181/kafka --topic test_json_basis_event --from-beginning
  def main(args: Array[String]): Unit = {
    val topic = "user_events"
    val brokers = "10.10.4.126:9092,10.10.4.127:9092"
    val props = new Properties()
    props.put("metadata.broker.list", brokers)
    props.put("serializer.class", "kafka.serializer.StringEncoder")

    val kafkaConfig = new ProducerConfig(props)
    val producer = new Producer[String, String](kafkaConfig)

    while (true) {
      // prepare event data
      val event = new JSONObject()
      event
        .put("uid", getUserID)
        .put("event_time", System.currentTimeMillis.toString)
        .put("os_type", "Android")
        .put("click_count", click)

      // produce event message
      producer.send(new KeyedMessage[String, String](topic, event.toString))
      println("Message sent: " + event)

      Thread.sleep(200)
    }
  }
}

  You can control the simulated write rate by adjusting the sleep interval on the last line of the loop above. Next, let's implement the real-time statistic of each user's click count: events are grouped by user and the click counts accumulated. The logic is simple; the key is to watch out for a few implementation details, such as object serialization. Here is the code first (the package declaration and imports are omitted; they appear in the full listing later), and we discuss the details afterwards:

object UserClickCountAnalytics {

  def main(args: Array[String]): Unit = {
    var masterUrl = "local[1]"
    if (args.length > 0) {
      masterUrl = args(0)
    }

    // Create a StreamingContext with the given master URL
    val conf = new SparkConf().setMaster(masterUrl).setAppName("UserClickCountStat")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Kafka configurations
    val topics = Set("user_events")
    val brokers = "10.10.4.126:9092,10.10.4.127:9092"
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> brokers, "serializer.class" -> "kafka.serializer.StringEncoder")

    val dbIndex = 1
    val clickHashKey = "app::users::click"

    // Create a direct stream
    val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)

    val events = kafkaStream.flatMap(line => {
      val data = JSONObject.fromObject(line._2)
      Some(data)
    })

    // Compute user click times
    val userClicks = events.map(x => (x.getString("uid"), x.getInt("click_count"))).reduceByKey(_ + _)
    userClicks.foreachRDD(rdd => {
      rdd.foreachPartition(partitionOfRecords => {
        partitionOfRecords.foreach(pair => {
          val uid = pair._1
          val clickCount = pair._2
          val jedis = RedisClient.pool.getResource
          jedis.select(dbIndex)
          jedis.hincrBy(clickHashKey, uid, clickCount)
          RedisClient.pool.returnResource(jedis)
        })
      })
    })

    ssc.start()
    ssc.awaitTermination()

  }
}

  The code above uses the Jedis client to write the grouped, accumulated counts into Redis. Any other system that needs this data in real time can simply read it back from Redis (a small reading sketch follows the RedisClient code below). The RedisClient implementation is as follows:

object RedisClient extends Serializable {
  val redisHost = "10.10.4.130"
  val redisPort = 6379
  val redisTimeout = 30000
  lazy val pool = new JedisPool(new GenericObjectPoolConfig(), redisHost, redisPort, redisTimeout)

  lazy val hook = new Thread {
    override def run = {
      println("Execute hook thread: " + this)
      pool.destroy()
    }
  }
  sys.addShutdownHook(hook.run)
}
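
As a reference for downstream consumers, here is a minimal sketch of reading the accumulated counts back through the same pool. The object name ClickCountReader is our own illustration, not part of the original code; the DB index and hash key match the streaming job above:

import scala.collection.JavaConverters._

// Hypothetical reader for the counts written by the streaming job (illustration only)
object ClickCountReader {
  def main(args: Array[String]): Unit = {
    val jedis = RedisClient.pool.getResource
    try {
      jedis.select(1)                                  // same DB index the job writes to
      val counts = jedis.hgetAll("app::users::click")  // uid -> accumulated click count
      counts.asScala.foreach { case (uid, clicks) => println(s"$uid -> $clicks") }
    } finally {
      RedisClient.pool.returnResource(jedis)
    }
  }
}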

  We ran the code above successfully both in local[K] mode and on a Spark Standalone cluster.

  When debugging in a development environment, that is, with the local[K] deploy mode, K worker threads run locally inside a single JVM instance. The code above defaults to local[K] when no argument is passed, so creating the Redis connection pool or connections usually just works in that mode. But under the Spark Standalone, YARN Client (or YARN Cluster), or Mesos cluster deploy modes it can fail, mainly because the Redis connection pool or connections are mishandled. Take a look at the Spark architecture, as shown in the diagram from the official site:

  Whether running in local mode, Standalone mode, or on Mesos or YARN, the structure of the whole Spark cluster can be abstracted by that diagram; only each component's runtime environment differs, which makes a component distributed, local, or confined to a single JVM instance. In local mode, for instance, all the components run inside a single process on one node; in YARN Client mode, the Driver program submits the Spark Application from a node outside the YARN cluster, while all other components run on nodes managed by YARN.

  Once an Application is deployed on a Spark cluster, the functions that operate on the RDD datasets (in Spark Streaming, the operations on DStreams) are shipped to the Executors on the cluster's Workers, so the objects those functions act on must be serializable; otherwise, after being serialized and transferred across nodes, they cannot be deserialized back into usable objects. In Scala, a lazy reference can also be used to work around this. The code above takes the lazy reference approach, as shown here:

// lazy pool reference
lazy val pool = new JedisPool(new GenericObjectPoolConfig(), redisHost, redisPort, redisTimeout)
...
partitionOfRecords.foreach(pair => {
  val uid = pair._1
  val clickCount = pair._2
  val jedis = RedisClient.pool.getResource
  jedis.select(dbIndex)
  jedis.hincrBy(clickHashKey, uid, clickCount)
  RedisClient.pool.returnResource(jedis)
})
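
For contrast, the kind of code that typically fails in cluster deploy modes is creating the connection on the driver and letting the closure capture it: a Jedis connection is not serializable, so the task closure cannot be shipped to the Executors. A sketch of that anti-pattern (our own illustration, not from the original code):

// ANTI-PATTERN (illustration only): the connection is created on the driver
// and captured by the closure; it cannot be serialized to the Executors.
val jedis = RedisClient.pool.getResource   // lives in the driver JVM
userClicks.foreachRDD(rdd => {
  rdd.foreach { case (uid, clickCount) =>
    jedis.hincrBy(clickHashKey, uid, clickCount)  // fails: closure references a non-serializable object
  }
})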

  Another approach is to move the management of the Redis connection inside the scope of the DStream output operation, because we know that code is initialized on a specific Executor; we use a singleton object to manage it, as shown below:

package org.shirdrn.spark.streaming

import org.apache.commons.pool2.impl.GenericObjectPoolConfig
import org.apache.spark.SparkConf
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream.toPairDStreamFunctions
import org.apache.spark.streaming.kafka.KafkaUtils

import kafka.serializer.StringDecoder
import net.sf.json.JSONObject
import redis.clients.jedis.JedisPool

object UserClickCountAnalytics {

  def main(args: Array[String]): Unit = {
    var masterUrl = "local[1]"
    if (args.length > 0) {
      masterUrl = args(0)
    }

    // Create a StreamingContext with the given master URL
    val conf = new SparkConf().setMaster(masterUrl).setAppName("UserClickCountStat")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Kafka configurations
    val topics = Set("user_events")
    val brokers = "10.10.4.126:9092,10.10.4.127:9092"
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> brokers, "serializer.class" -> "kafka.serializer.StringEncoder")

    val dbIndex = 1
    val clickHashKey = "app::users::click"

    // Create a direct stream
    val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)

    val events = kafkaStream.flatMap(line => {
      val data = JSONObject.fromObject(line._2)
      Some(data)
    })

    // Compute user click times
    val userClicks = events.map(x => (x.getString("uid"), x.getInt("click_count"))).reduceByKey(_ + _)
    userClicks.foreachRDD(rdd => {
      rdd.foreachPartition(partitionOfRecords => {
        partitionOfRecords.foreach(pair => {

          /**
           * Internal Redis client for managing Redis connection {@link Jedis} based on {@link RedisPool}
           */
          object InternalRedisClient extends Serializable {

            @transient private var pool: JedisPool = null

            def makePool(redisHost: String, redisPort: Int, redisTimeout: Int,
                maxTotal: Int, maxIdle: Int, minIdle: Int): Unit = {
              makePool(redisHost, redisPort, redisTimeout, maxTotal, maxIdle, minIdle, true, false, 10000)
            }

            def makePool(redisHost: String, redisPort: Int, redisTimeout: Int,
                maxTotal: Int, maxIdle: Int, minIdle: Int, testOnBorrow: Boolean,
                testOnReturn: Boolean, maxWaitMillis: Long): Unit = {
              if (pool == null) {
                val poolConfig = new GenericObjectPoolConfig()
                poolConfig.setMaxTotal(maxTotal)
                poolConfig.setMaxIdle(maxIdle)
                poolConfig.setMinIdle(minIdle)
                poolConfig.setTestOnBorrow(testOnBorrow)
                poolConfig.setTestOnReturn(testOnReturn)
                poolConfig.setMaxWaitMillis(maxWaitMillis)
                pool = new JedisPool(poolConfig, redisHost, redisPort, redisTimeout)

                val hook = new Thread {
                  override def run = pool.destroy()
                }
                sys.addShutdownHook(hook.run)
              }
            }

            def getPool: JedisPool = {
              assert(pool != null)
              pool
            }
          }

          // Redis configurations
          val maxTotal = 10
          val maxIdle = 10
          val minIdle = 1
          val redisHost = "10.10.4.130"
          val redisPort = 6379
          val redisTimeout = 30000
          val dbIndex = 1
          InternalRedisClient.makePool(redisHost, redisPort, redisTimeout, maxTotal, maxIdle, minIdle)

          val uid = pair._1
          val clickCount = pair._2
          val jedis = InternalRedisClient.getPool.getResource
          jedis.select(dbIndex)
          jedis.hincrBy(clickHashKey, uid, clickCount)
          InternalRedisClient.getPool.returnResource(jedis)
        })
      })
    })

    ssc.start()
    ssc.awaitTermination()

  }
}

  The implementation above takes advantage of a Scala language feature: a class or object can be defined anywhere in the code. By placing the code that manages the Redis connection inside the output operation itself, we avoid the problem of serializing a transient object across nodes. Doing this well also requires understanding how Spark operates on RDD datasets internally; see the RDD and Spark documentation for more details.
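
As a design note, one common refinement (our own variant, not part of the original code) is to borrow a single Jedis connection per partition rather than per record, which cuts down pool checkout overhead when partitions contain many records. A minimal sketch reusing the RedisClient pool, dbIndex, and clickHashKey from the earlier listing:

userClicks.foreachRDD(rdd => {
  rdd.foreachPartition(partitionOfRecords => {
    // Borrow one connection for the whole partition
    val jedis = RedisClient.pool.getResource
    try {
      jedis.select(dbIndex)
      partitionOfRecords.foreach { case (uid, clickCount) =>
        jedis.hincrBy(clickHashKey, uid, clickCount)
      }
    } finally {
      RedisClient.pool.returnResource(jedis)
    }
  })
})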

  To run on the cluster in Standalone mode, execute the following commands:

cd /usr/local/spark
./bin/spark-submit --class org.shirdrn.spark.streaming.UserClickCountAnalytics \
    --master spark://hadoop1:7077 \
    --executor-memory 1G \
    --total-executor-cores 2 \
    ~/spark-0.0.SNAPSHOT.jar spark://hadoop1:7077

  You can check the status of the computation tasks executing on each Worker node in the cluster, which is also easy to do through the Web UI.

  Now let's look at the computation results stored in Redis:

127.0.0.1:6379[1]> HGETALL app::users::click
 1) "4A4D769EB9679C054DE81B973ED5D768"
 2) "7037"
 3) "8dfeb5aaafc027d89349ac9a20b3930f"
 4) "6992"
 5) "011BBF43B89BFBF266C865DF0397AA71"
 6) "7021"
 7) "97edfc08311c70143401745a03a50706"
 8) "6874"
 9) "d7f141563005d1b5d0d3dd30138f3f62"
10) "7057"
11) "a95f22eabc4fd4b580c011a3161a9d9d"
12) "7092"
13) "6b67c8c700427dee7552f81f3228c927"
14) "7266"
15) "f2a8474bf7bd94f0aabbd4cdd2c06dcf"
16) "7188"
17) "c8ee90aade1671a21336c721512b817a"
18) "6950"
19) "068b746ed4620d25e26055a9f804385f"

POM File and Dependencies

  Finally, here are the dependencies used by the application above and the Maven configuration for packaging the Spark Streaming application, for reference. If you use the maven-shade-plugin and its configuration is wrong, submitting the packaged Application to the Spark cluster may fail with the error "Invalid signature file digest for Manifest main attributes". The reference Maven configuration is as follows:

<project xmlns="http://maven.apache.org/POM/4.0.0"
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
     <modelVersion>4.0.0</modelVersion>
     <groupId>org.shirdrn.spark</groupId>
     <artifactId>spark</artifactId>
     <version>0.0.1-SNAPSHOT</version>

     <dependencies>
          <dependency>
               <groupId>org.apache.spark</groupId>
               <artifactId>spark-core_2.10</artifactId>
               <version>1.3.0</version>
          </dependency>
          <dependency>
               <groupId>org.apache.spark</groupId>
               <artifactId>spark-streaming_2.10</artifactId>
               <version>1.3.0</version>
          </dependency>
          <dependency>
               <groupId>net.sf.json-lib</groupId>
               <artifactId>json-lib</artifactId>
               <version>2.3</version>
          </dependency>
          <dependency>
               <groupId>org.apache.spark</groupId>
               <artifactId>spark-streaming-kafka_2.10</artifactId>
               <version>1.3.0</version>
          </dependency>
          <dependency>
               <groupId>redis.clients</groupId>
               <artifactId>jedis</artifactId>
               <version>2.5.2</version>
          </dependency>
          <dependency>
               <groupId>org.apache.commons</groupId>
               <artifactId>commons-pool2</artifactId>
               <version>2.2</version>
          </dependency>
     </dependencies>

     <build>
          <sourceDirectory>${basedir}/src/main/scala</sourceDirectory>
          <testSourceDirectory>${basedir}/src/test/scala</testSourceDirectory>
          <resources>
               <resource>
                    <directory>${basedir}/src/main/resources</directory>
               </resource>
          </resources>
          <testResources>
               <testResource>
                    <directory>${basedir}/src/test/resources</directory>
               </testResource>
          </testResources>
          <plugins>
               <plugin>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>3.1</version>
                    <configuration>
                         <source>1.6</source>
                         <target>1.6</target>
                    </configuration>
               </plugin>
               <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-shade-plugin</artifactId>
                    <version>2.2</version>
                    <configuration>
                         <createDependencyReducedPom>true</createDependencyReducedPom>
                    </configuration>
                    <executions>
                         <execution>
                              <phase>package</phase>
                              <goals>
                                   <goal>shade</goal>
                              </goals>
                              <configuration>
                                   <artifactSet>
                                        <includes>
                                             <include>*:*</include>
                                        </includes>
                                   </artifactSet>
                                   <filters>
                                        <filter>
                                             <artifact>*:*</artifact>
                                             <excludes>
                                                  <exclude>META-INF/*.SF</exclude>
                                                  <exclude>META-INF/*.DSA</exclude>
                                                  <exclude>META-INF/*.RSA</exclude>
                                             </excludes>
                                        </filter>
                                   </filters>
                                   <transformers>
                                        <transformer
                                             implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
                                        <transformer
                                             implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                                             <resource>reference.conf</resource>
                                        </transformer>
                                        <transformer
                                             implementation="org.apache.maven.plugins.shade.resource.DontIncludeResourceTransformer">
                                             <resource>log4j.properties</resource>
                                        </transformer>
                                   </transformers>
                              </configuration>
                         </execution>
                    </executions>
               </plugin>
          </plugins>
     </build>
</project>
