Kafka + Spark Streaming + Redis Real-Time System in Practice

Spark is a general-purpose computing platform that extends well to many kinds of computational workloads, in particular through its built-in libraries such as Spark Streaming, Spark SQL, MLlib, and GraphX. These libraries provide high-level abstractions, so complex computation logic can be expressed in very concise code, something the conciseness of the Scala language also contributes to. Here we built our computing platform on Spark 1.3.0 and implemented real-time computation with Spark Streaming.

  Our application scenario is analyzing how users behave in a mobile app, described as follows:

  1. The mobile client collects user behavior events (we use click events as an example) and sends the data to a data server; here we assume the data goes directly into a Kafka message queue.
  2. A backend real-time service consumes the data from Kafka and analyzes it in real time. We chose Spark Streaming because it ships built-in support for Kafka integration.
  3. The Spark Streaming job analyzes the data and writes the results to Redis, so user behavior data can be fetched in real time and can also be exported for offline aggregate analysis.

Kafka + Spark Streaming + Redis Programming in Practice

  Below, we implement this real-time computation application for the scenario described above. First, we wrote a Kafka Producer simulator that writes simulated user behavior events to Kafka in real time. The data is in JSON format, for example:

{
    "uid": "068b746ed4620d25e26055a9f804385f",
    "event_time": "1430204612405",
    "os_type": "Android",
    "click_count": 6
}

An event contains 4 fields (a small typed sketch follows this list):
  1. uid: the user ID
  2. event_time: the timestamp at which the event occurred
  3. os_type: the operating system type of the mobile app
  4. click_count: the number of clicks
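
For readers who prefer a typed view of the payload, here is a small illustrative case class mirroring these fields. It is an assumption of mine added for clarity and is not used by the code in this article, which works with raw JSON:

// Hypothetical model of one event; not referenced by the code below.
case class UserClickEvent(
  uid: String,        // user ID
  eventTime: String,  // event timestamp in milliseconds, kept as a string as in the JSON
  osType: String,     // mobile OS type, e.g. "Android"
  clickCount: Int     // number of clicks
)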
The code we implemented is shown below:

package com.iteblog.spark.streaming.utils

import java.util.Properties

import scala.util.Random

import org.codehaus.jettison.json.JSONObject

import kafka.javaapi.producer.Producer
import kafka.producer.KeyedMessage
import kafka.producer.ProducerConfig

object KafkaEventProducer {

  private val users = Array(
      "4A4D769EB9679C054DE81B973ED5D768", "8dfeb5aaafc027d89349ac9a20b3930f",
      "011BBF43B89BFBF266C865DF0397AA71", "f2a8474bf7bd94f0aabbd4cdd2c06dcf",
      "068b746ed4620d25e26055a9f804385f", "97edfc08311c70143401745a03a50706",
      "d7f141563005d1b5d0d3dd30138f3f62", "c8ee90aade1671a21336c721512b817a",
      "6b67c8c700427dee7552f81f3228c927", "a95f22eabc4fd4b580c011a3161a9d9d")

  private val random = new Random()

  private var pointer = -1

  // Cycle through the predefined user IDs.
  def getUserID(): String = {
    pointer = pointer + 1
    if (pointer >= users.length) {
      pointer = 0
      users(pointer)
    } else {
      users(pointer)
    }
  }

  // Random click count between 0 and 9.
  def click(): Int = {
    random.nextInt(10)
  }

  // bin/kafka-topics.sh --zookeeper zk1:2181,zk2:2181,zk3:2181/kafka --create --topic user_events --replication-factor 2 --partitions 2
  // bin/kafka-topics.sh --zookeeper zk1:2181,zk2:2181,zk3:2181/kafka --list
  // bin/kafka-topics.sh --zookeeper zk1:2181,zk2:2181,zk3:2181/kafka --describe user_events
  // bin/kafka-console-consumer.sh --zookeeper zk1:2181,zk2:2181,zk3:2181/kafka --topic test_json_basis_event --from-beginning
  def main(args: Array[String]): Unit = {
    val topic = "user_events"
    val brokers = "10.10.4.126:9092,10.10.4.127:9092"
    val props = new Properties()
    props.put("metadata.broker.list", brokers)
    props.put("serializer.class", "kafka.serializer.StringEncoder")

    val kafkaConfig = new ProducerConfig(props)
    val producer = new Producer[String, String](kafkaConfig)

    while (true) {
      // prepare event data
      val event = new JSONObject()
      event
        .put("uid", getUserID)
        .put("event_time", System.currentTimeMillis.toString)
        .put("os_type", "Android")
        .put("click_count", click)

      // produce event message
      producer.send(new KeyedMessage[String, String](topic, event.toString))
      println("Message sent: " + event)

      Thread.sleep(200)
    }
  }
}

  The sleep interval in the last line of the program above controls the simulated write rate. Next we implement a real-time count of each user's clicks: events are grouped by user and their click counts accumulated. The logic is simple; the key is to watch out for a few issues during implementation, such as object serialization. Let's look at the code first; we will discuss the details afterwards:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

import kafka.serializer.StringDecoder
import net.sf.json.JSONObject

object UserClickCountAnalytics {

  def main(args: Array[String]): Unit = {
    var masterUrl = "local[1]"
    if (args.length > 0) {
      masterUrl = args(0)
    }

    // Create a StreamingContext with the given master URL
    val conf = new SparkConf().setMaster(masterUrl).setAppName("UserClickCountStat")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Kafka configurations
    val topics = Set("user_events")
    val brokers = "10.10.4.126:9092,10.10.4.127:9092"
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> brokers, "serializer.class" -> "kafka.serializer.StringEncoder")

    val dbIndex = 1
    val clickHashKey = "app::users::click"

    // Create a direct stream
    val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)

    val events = kafkaStream.flatMap(line => {
      val data = JSONObject.fromObject(line._2)
      Some(data)
    })

    // Compute user click times
    val userClicks = events.map(x => (x.getString("uid"), x.getInt("click_count"))).reduceByKey(_ + _)
    userClicks.foreachRDD(rdd => {
      rdd.foreachPartition(partitionOfRecords => {
        partitionOfRecords.foreach(pair => {
          val uid = pair._1
          val clickCount = pair._2
          val jedis = RedisClient.pool.getResource
          jedis.select(dbIndex)
          jedis.hincrBy(clickHashKey, uid, clickCount)
          RedisClient.pool.returnResource(jedis)
        })
      })
    })

    ssc.start()
    ssc.awaitTermination()
  }
}

  The code above uses the Jedis client to operate on Redis, accumulating the grouped counts into Redis storage. If another system needs this data in real time, it can simply read it directly from Redis. The RedisClient implementation is shown below:

import org.apache.commons.pool2.impl.GenericObjectPoolConfig

import redis.clients.jedis.JedisPool

object RedisClient extends Serializable {
  val redisHost = "10.10.4.130"
  val redisPort = 6379
  val redisTimeout = 30000
  lazy val pool = new JedisPool(new GenericObjectPoolConfig(), redisHost, redisPort, redisTimeout)

  lazy val hook = new Thread {
    override def run = {
      println("Execute hook thread: " + this)
      pool.destroy()
    }
  }
  sys.addShutdownHook(hook.run)
}

  We have run the code above successfully in both local[K] mode and on a Spark Standalone cluster.

  When debugging in a development environment we use the local[K] deployment mode, which starts K worker threads on the local machine, all inside a single JVM instance. The code above defaults to local[K] when no argument is passed, so creating the Redis connection pool or the connections themselves tends to work on the first try in this mode. However, under the Spark Standalone, YARN Client (or YARN Cluster), or Mesos deployment modes, the job fails, and the error comes from how the Redis connection pool or connections are handled. Let's look at the Spark architecture, as shown in the figure (taken from the official website):

  Whether we run in local mode, Standalone mode, or on Mesos or YARN, the structure of the whole Spark cluster can be represented abstractly by the figure above; only the runtime environments of the components differ, which makes them distributed, local, or confined to a single JVM instance. In local mode, for example, the figure collapses to multiple components inside a single process on one node; in YARN Client mode, the Driver program submits the Spark Application from a node outside the YARN cluster, while the other components run on nodes managed by YARN.

  After an Application is deployed in a Spark cluster, the functions applied to the RDD data sets (in Spark Streaming, the operations applied to the DStreams) are shipped to the Executors on the cluster's Workers when the computation runs. The objects those functions reference must therefore be serializable; otherwise they cannot be correctly deserialized and reconstructed into usable objects after being transferred across nodes. In Scala this can also be solved with a lazy reference, which is what the code above uses:

// lazy pool reference
lazy val pool = new JedisPool(new GenericObjectPoolConfig(), redisHost, redisPort, redisTimeout)
...
partitionOfRecords.foreach(pair => {
  val uid = pair._1
  val clickCount = pair._2
  val jedis = RedisClient.pool.getResource
  jedis.select(dbIndex)
  jedis.hincrBy(clickHashKey, uid, clickCount)
  RedisClient.pool.returnResource(jedis)
})
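
  To make the failure mode concrete, here is a minimal sketch (mine, not taken from the article's code) of the anti-pattern: a Jedis connection created on the driver is captured by the closure passed to the DStream output operation, so Spark must serialize it together with the closure when shipping tasks to the Executors, and the job aborts with a "Task not serializable" error caused by java.io.NotSerializableException. The names userClicks, dbIndex, and clickHashKey are the ones defined in the code above.

// Anti-pattern sketch: do NOT do this when running on a cluster.
// `driverSideJedis` is created on the driver; the closures below capture it,
// and Jedis instances are not serializable, so task serialization fails.
val driverSideJedis = new redis.clients.jedis.Jedis("10.10.4.130", 6379)

userClicks.foreachRDD(rdd => {
  rdd.foreachPartition(partitionOfRecords => {
    partitionOfRecords.foreach { case (uid, clickCount) =>
      driverSideJedis.select(dbIndex)                        // never runs on a cluster:
      driverSideJedis.hincrBy(clickHashKey, uid, clickCount) // Task not serializable
    }
  })
})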

  Alternatively, we can modify the code to put the management of the Redis connection inside the scope of the DStream output operation, because we know it is then initialized on a specific Executor, and manage it with a singleton object, as shown below:

package org.shirdrn.spark.streaming

import org.apache.commons.pool2.impl.GenericObjectPoolConfig
import org.apache.spark.SparkConf
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream.toPairDStreamFunctions
import org.apache.spark.streaming.kafka.KafkaUtils

import kafka.serializer.StringDecoder
import net.sf.json.JSONObject
import redis.clients.jedis.JedisPool

object UserClickCountAnalytics {

  def main(args: Array[String]): Unit = {
    var masterUrl = "local[1]"
    if (args.length > 0) {
      masterUrl = args(0)
    }

    // Create a StreamingContext with the given master URL
    val conf = new SparkConf().setMaster(masterUrl).setAppName("UserClickCountStat")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Kafka configurations
    val topics = Set("user_events")
    val brokers = "10.10.4.126:9092,10.10.4.127:9092"
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> brokers, "serializer.class" -> "kafka.serializer.StringEncoder")

    val dbIndex = 1
    val clickHashKey = "app::users::click"

    // Create a direct stream
    val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)

    val events = kafkaStream.flatMap(line => {
      val data = JSONObject.fromObject(line._2)
      Some(data)
    })

    // Compute user click times
    val userClicks = events.map(x => (x.getString("uid"), x.getInt("click_count"))).reduceByKey(_ + _)
    userClicks.foreachRDD(rdd => {
      rdd.foreachPartition(partitionOfRecords => {
        partitionOfRecords.foreach(pair => {

          /**
           * Internal Redis client for managing Redis connection {@link Jedis} based on {@link RedisPool}
           */
          object InternalRedisClient extends Serializable {

            @transient private var pool: JedisPool = null

            def makePool(redisHost: String, redisPort: Int, redisTimeout: Int,
                maxTotal: Int, maxIdle: Int, minIdle: Int): Unit = {
              makePool(redisHost, redisPort, redisTimeout, maxTotal, maxIdle, minIdle, true, false, 10000)
            }

            def makePool(redisHost: String, redisPort: Int, redisTimeout: Int,
                maxTotal: Int, maxIdle: Int, minIdle: Int, testOnBorrow: Boolean,
                testOnReturn: Boolean, maxWaitMillis: Long): Unit = {
              if (pool == null) {
                val poolConfig = new GenericObjectPoolConfig()
                poolConfig.setMaxTotal(maxTotal)
                poolConfig.setMaxIdle(maxIdle)
                poolConfig.setMinIdle(minIdle)
                poolConfig.setTestOnBorrow(testOnBorrow)
                poolConfig.setTestOnReturn(testOnReturn)
                poolConfig.setMaxWaitMillis(maxWaitMillis)
                pool = new JedisPool(poolConfig, redisHost, redisPort, redisTimeout)

                val hook = new Thread {
                  override def run = pool.destroy()
                }
                sys.addShutdownHook(hook.run)
              }
            }

            def getPool: JedisPool = {
              assert(pool != null)
              pool
            }
          }

          // Redis configurations
          val maxTotal = 10
          val maxIdle = 10
          val minIdle = 1
          val redisHost = "10.10.4.130"
          val redisPort = 6379
          val redisTimeout = 30000
          val dbIndex = 1
          InternalRedisClient.makePool(redisHost, redisPort, redisTimeout, maxTotal, maxIdle, minIdle)

          val uid = pair._1
          val clickCount = pair._2
          val jedis = InternalRedisClient.getPool.getResource
          jedis.select(dbIndex)
          jedis.hincrBy(clickHashKey, uid, clickCount)
          InternalRedisClient.getPool.returnResource(jedis)
        })
      })
    })

    ssc.start()
    ssc.awaitTermination()
  }
}

  This implementation benefits from a Scala language feature: a class or object can be defined anywhere in the code. By placing the code that manages the Redis connection inside the specific operation, we avoid the problem of serializing a transient object across nodes. Doing this also requires understanding how Spark operates on RDD data sets internally; see the RDD and Spark documentation for more details.
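
  One refinement worth considering (my variation, not part of the original article): the version above calls makePool and borrows a connection for every single record. Since foreachPartition already gives a per-partition scope that runs on the Executor, a common pattern is to borrow one connection per partition and reuse it for every record in that partition. A minimal sketch, assuming InternalRedisClient is lifted out of the per-record closure (for example, defined as a top-level object shipped in the application jar):

// Sketch: one Redis connection per partition instead of one per record.
userClicks.foreachRDD(rdd => {
  rdd.foreachPartition(partitionOfRecords => {
    // Same Redis configuration values as above, now at partition scope.
    InternalRedisClient.makePool("10.10.4.130", 6379, 30000, 10, 10, 1)

    val jedis = InternalRedisClient.getPool.getResource
    jedis.select(1) // dbIndex
    partitionOfRecords.foreach { case (uid, clickCount) =>
      jedis.hincrBy("app::users::click", uid, clickCount)
    }
    InternalRedisClient.getPool.returnResource(jedis)
  })
})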

  To run on the cluster in Standalone mode, execute the following commands:

cd /usr/local/spark
./bin/spark-submit --class org.shirdrn.spark.streaming.UserClickCountAnalytics \
            --master spark://hadoop1:7077 \
            --executor-memory 1G \
            --total-executor-cores 2 \
            ~/spark-0.0.SNAPSHOT.jar spark://hadoop1:7077

  You can check the status of the computation tasks running on each Worker node in the cluster, which is also very convenient to do through the Web UI.

  Now let's look at the computation results stored in Redis:

127.0.0.1:6379[1]> HGETALL app::users::click
 1) "4A4D769EB9679C054DE81B973ED5D768"
 2) "7037"
 3) "8dfeb5aaafc027d89349ac9a20b3930f"
 4) "6992"
 5) "011BBF43B89BFBF266C865DF0397AA71"
 6) "7021"
 7) "97edfc08311c70143401745a03a50706"
 8) "6874"
 9) "d7f141563005d1b5d0d3dd30138f3f62"
10) "7057"
11) "a95f22eabc4fd4b580c011a3161a9d9d"
12) "7092"
13) "6b67c8c700427dee7552f81f3228c927"
14) "7266"
15) "f2a8474bf7bd94f0aabbd4cdd2c06dcf"
16) "7188"
17) "c8ee90aade1671a21336c721512b817a"
18) "6950"
19) "068b746ed4620d25e26055a9f804385f"
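
  As noted earlier, any other system that needs these numbers in real time only has to read this hash. Below is a minimal standalone reader sketch using Jedis; it is my illustration under the same host, port, and key assumptions as above, not part of the original application:

import scala.collection.JavaConverters._

import redis.clients.jedis.Jedis

object ClickCountReader {
  def main(args: Array[String]): Unit = {
    val jedis = new Jedis("10.10.4.130", 6379)
    jedis.select(1) // same dbIndex the streaming job writes to
    // HGETALL app::users::click returns a java.util.Map[String, String]
    val counts = jedis.hgetAll("app::users::click").asScala
    counts.foreach { case (uid, clicks) => println(uid + " -> " + clicks) }
    jedis.disconnect()
  }
}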

POM File and Related Dependencies

  For reference, here are the dependencies used by the application developed above and the Maven configuration used to package the Spark Streaming application. If the maven-shade-plugin is misconfigured, submitting the packaged Application to the Spark cluster may fail with "Invalid signature file digest for Manifest main attributes". A reference Maven configuration is shown below:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
     <modelVersion>4.0.0</modelVersion>
     <groupId>org.shirdrn.spark</groupId>
     <artifactId>spark</artifactId>
     <version>0.0.1-SNAPSHOT</version>

     <dependencies>
          <dependency>
               <groupId>org.apache.spark</groupId>
               <artifactId>spark-core_2.10</artifactId>
               <version>1.3.0</version>
          </dependency>
          <dependency>
               <groupId>org.apache.spark</groupId>
               <artifactId>spark-streaming_2.10</artifactId>
               <version>1.3.0</version>
          </dependency>
          <dependency>
               <groupId>net.sf.json-lib</groupId>
               <artifactId>json-lib</artifactId>
               <version>2.3</version>
          </dependency>
          <dependency>
               <groupId>org.apache.spark</groupId>
               <artifactId>spark-streaming-kafka_2.10</artifactId>
               <version>1.3.0</version>
          </dependency>
          <dependency>
               <groupId>redis.clients</groupId>
               <artifactId>jedis</artifactId>
               <version>2.5.2</version>
          </dependency>
          <dependency>
               <groupId>org.apache.commons</groupId>
               <artifactId>commons-pool2</artifactId>
               <version>2.2</version>
          </dependency>
     </dependencies>

     <build>
          <sourceDirectory>${basedir}/src/main/scala</sourceDirectory>
          <testSourceDirectory>${basedir}/src/test/scala</testSourceDirectory>
          <resources>
               <resource>
                    <directory>${basedir}/src/main/resources</directory>
               </resource>
          </resources>
          <testResources>
               <testResource>
                    <directory>${basedir}/src/test/resources</directory>
               </testResource>
          </testResources>
          <plugins>
               <plugin>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>3.1</version>
                    <configuration>
                         <source>1.6</source>
                         <target>1.6</target>
                    </configuration>
               </plugin>
               <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-shade-plugin</artifactId>
                    <version>2.2</version>
                    <configuration>
                         <createDependencyReducedPom>true</createDependencyReducedPom>
                    </configuration>
                    <executions>
                         <execution>
                              <phase>package</phase>
                              <goals>
                                   <goal>shade</goal>
                              </goals>
                              <configuration>
                                   <artifactSet>
                                        <includes>
                                             <include>*:*</include>
                                        </includes>
                                   </artifactSet>
                                   <filters>
                                        <filter>
                                             <artifact>*:*</artifact>
                                             <excludes>
                                                  <exclude>META-INF/*.SF</exclude>
                                                  <exclude>META-INF/*.DSA</exclude>
                                                  <exclude>META-INF/*.RSA</exclude>
                                             </excludes>
                                        </filter>
                                   </filters>
                                   <transformers>
                                        <transformer
                                             implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                                        <transformer
                                             implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                                             <resource>reference.conf</resource>
                                        </transformer>
                                        <transformer
                                             implementation="org.apache.maven.plugins.shade.resource.DontIncludeResourceTransformer">
                                             <resource>log4j.properties</resource>
                                        </transformer>
                                   </transformers>
                              </configuration>
                         </execution>
                    </executions>
               </plugin>
          </plugins>
     </build>
</project>