Applications of Spark Streaming
1. Introduction to Spark Streaming
1.1. Spark Streaming Overview
1.1.1. What Is Spark Streaming
Spark Streaming, similar to Apache Storm, is used for processing streaming data. According to its official documentation, Spark Streaming offers high throughput and strong fault tolerance. It supports many input sources, such as Kafka, Flume, Twitter, ZeroMQ, and plain TCP sockets. Once data is ingested, it can be processed with Spark's high-level primitives such as map, reduce, join, and window, and the results can be written to many destinations, such as HDFS or databases. Spark Streaming also integrates seamlessly with MLlib (machine learning) and GraphX.
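To get a feel for these primitives, here is a minimal local-mode sketch (not part of the demo built later in this guide) that applies flatMap, map, and reduceByKeyAndWindow to a DStream read from a TCP socket; it assumes a netcat source started with nc -lk 9999.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Word count over a 30-second sliding window, recomputed every 10 seconds.
object SocketWindowWordCount {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread receives the socket data, one runs the computation
    val conf = new SparkConf().setAppName("SocketWindowWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}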
1.1.2. Why Use Spark Streaming
1. Easy to use
2. Fault tolerant
3. Easy to integrate into the Spark ecosystem
1.1.3. Spark vs. Storm
Spark: development language Scala; programming model DStream
Storm: development language Clojure; programming model Spout/Bolt
1.2. Setting Up the Flume + Kafka + Spark Streaming + Redis Environment
1.2.1. Installing ZooKeeper (single-node example)
1. Download zookeeper-3.3.6.tar.gz and upload it to the server
2. Extract it: tar -zxvf zookeeper-3.3.6.tar.gz -C /usr/local
3. Set up the ZooKeeper configuration file (a minimal zoo.cfg sketch follows this list):
4. mv /usr/local/zookeeper-3.3.6/conf/zoo_sample.cfg /usr/local/zookeeper-3.3.6/conf/zoo.cfg
5. Configure the environment variable: in vi /etc/profile add export ZOOKEEPER_HOME=/usr/local/zookeeper-3.3.6
6. Reload the environment variables: source /etc/profile
7. Start the ZooKeeper service: /usr/local/zookeeper-3.3.6/bin/zkServer.sh start
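For reference, only a few settings in zoo.cfg matter for a single node. This is a hedged sketch; the dataDir path is an assumption, not from the original notes, and the directory should be created first:

tickTime=2000
dataDir=/usr/local/zookeeper-3.3.6/data
clientPort=2181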
1.2.2. Installing Kafka (single-node example)
1. Download Kafka (the kafka_2.11-1.1.0.tgz release)
2. Extract it: tar -xvf kafka_2.11-1.1.0.tgz -C /usr/local
3. Edit the Kafka configuration file: vi /usr/local/kafka_2.11-1.1.0/config/server.properties
Change the listener/host settings to your own server IP (the original screenshot is not included; a sketch of the relevant lines follows at the end of this section)
4. Edit vi /usr/local/kafka_2.11-1.1.0/config/producer.properties and change the IP to your server's IP
5. Start the Kafka service
i. Create a start-up script: vi kafkastart.sh and add the following
/usr/local/kafka_2.11-1.1.0/bin/zookeeper-server-start.sh /usr/local/kafka_2.11-1.1.0/config/zookeeper.properties &
sleep 3
/usr/local/kafka_2.11-1.1.0/bin/kafka-server-start.sh /usr/local/kafka_2.11-1.1.0/config/server.properties &
ii. Make the script executable and run it
chmod a+x kafkastart.sh && ./kafkastart.sh
6. Test Kafka
a. Create a topic
bin/kafka-topics.sh --create --zookeeper node1:2181 --replication-factor 1 --partitions 1 --topic test
b. List the topics
bin/kafka-topics.sh --list --zookeeper node1:2181
c. Start a console producer and send a few test messages
bin/kafka-console-producer.sh --broker-list 192.168.1.135:9092 --topic test
d. Start a console consumer
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
If messages typed into the producer show up in the consumer, the Kafka pipeline is working.
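Since the screenshot referenced in step 3 is not included, the following is a hedged sketch of the server.properties lines that typically need the server IP; 192.168.1.135 matches the IP used elsewhere in this guide, and the log.dirs path is an assumption:

broker.id=0
listeners=PLAINTEXT://192.168.1.135:9092
advertised.listeners=PLAINTEXT://192.168.1.135:9092
log.dirs=/usr/local/kafka_2.11-1.1.0/kafka-logs
zookeeper.connect=192.168.1.135:2181

In producer.properties, the corresponding setting is bootstrap.servers=192.168.1.135:9092.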
1.2.3. Installing Spark (cluster example: three machines, one master and two workers)
1. Download the Spark package spark-2.0.0-bin-hadoop2.6.tgz
2. Extract it: tar -xvf spark-2.0.0-bin-hadoop2.6.tgz -C /usr/local
3. Install Scala: tar -zxvf scala-2.11.0.tgz
4. Configure the Scala environment variable: export PATH=/usr/local/scala/bin:$PATH
5. Apply the environment variables: source /etc/profile
6. Go to Spark's conf directory and rename/edit the spark-env.sh.template file
mv spark-env.sh.template spark-env.sh
7. Add the following to the file (node1 is the hostname mapped to the master's IP in /etc/hosts)
export JAVA_HOME=/usr/local/jdk1.7.0_80
export SPARK_MASTER_HOST=node1
export SPARK_MASTER_PORT=7077
8. Rename and edit the slaves.template file
mv slaves.template slaves
9. Add the worker hostnames to this file
node2
node3
10. Copy the configured Spark directory to the other nodes (the workers must also have Java and Scala installed)
scp -r spark-2.0.0-bin-hadoop2.6/ node2:/usr/local/
scp -r spark-2.0.0-bin-hadoop2.6/ node3:/usr/local/
11. The Spark cluster is now configured with one Master and two Workers. Start the cluster on node1:
/usr/local/spark-2.0.0-bin-hadoop2.6/sbin/start-all.sh
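A quick sanity check (not in the original notes): after start-all.sh, the master web UI at http://node1:8080 should list the two workers as ALIVE, and each machine should show the expected daemon:

jps    # node1 should show a Master process; node2 and node3 should each show a Worker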
1.2.4. Installing Flume (single-node example)
1. Download apache-flume-1.7.0-bin.tar.gz from the official site
2. Extract it: tar -zxvf apache-flume-1.7.0-bin.tar.gz -C /usr/local
3. Go to Flume's conf directory and rename/edit the flume-env.sh.template file
mv flume-env.sh.template flume-env.sh
Set the JDK path in this file:
export JAVA_HOME=/home/soft/jdk1.7.0_80
4. Create a new file flume-kafka.conf in the conf directory
with the following contents:
agent002.sources = sources002
agent002.channels = channels002
agent002.sinks = sinks002
## define sources: tail the file to be monitored
agent002.sources.sources002.type = exec
agent002.sources.sources002.command = tail -F /usr/local/apache-flume-1.7.0/moinit/log.out
## define channels: buffer events in memory
agent002.channels.channels002.type = memory
agent002.channels.channels002.capacity = 10000
agent002.channels.channels002.transactionCapacity = 10000
agent002.channels.channels002.byteCapacityBufferPercentage = 20
agent002.channels.channels002.byteCapacity = 800000
## define sinks: deliver the events to Kafka
agent002.sinks.sinks002.type = org.apache.flume.sink.kafka.KafkaSink
agent002.sinks.sinks002.brokerList=node1:9092
agent002.sinks.sinks002.topic=test
##relationship
agent002.sources.sources002.channels = channels002
agent002.sinks.sinks002.channel = channels002
5. Start Flume from its bin directory, specifying the agent name, the configuration file, and the log level: ./flume-ng agent --name agent002 --conf-file ../conf/flume-kafka.conf --conf ../conf/ -Dflume.root.logger=DEBUG,console
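To verify the Flume-to-Kafka leg (a quick test not in the original notes), append a line to the tailed file and watch it arrive in the console consumer from section 1.2.2; the path comes from the source configured above:

echo '{"id":"1","name":"flume-test"}' >> /usr/local/apache-flume-1.7.0/moinit/log.out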
1.2.5. Installing Redis (single-node example)
1. Download Redis
wget http://download.redis.io/releases/redis-3.0.0.tar.gz
2. Redis is built from source, so install the gcc toolchain first
yum install gcc-c++
3. Extract the Redis source package
tar -zxf redis-3.0.0.tar.gz
4. Build it
make
5. Install Redis
make install PREFIX=/usr/local/redis
6. Configure Redis to start as a daemon
a. Copy redis.conf from the Redis source directory to /usr/local/redis/bin/
b. Edit redis.conf and change daemonize from no to yes
7. Go to Redis's bin directory and start Redis: ./redis-server redis.conf
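Before wiring Redis into the streaming job, a small connectivity check can save debugging time. This is a hedged sketch (not part of the original notes) using the jedis client declared in the demo's pom.xml; the host and port match the RedisConnector settings used later:

import redis.clients.jedis.Jedis

// Minimal smoke test against the Redis server started above.
object RedisSmokeTest {
  def main(args: Array[String]): Unit = {
    val jedis = new Jedis("192.168.1.135", 6379)
    println(jedis.ping())       // expected: PONG
    jedis.set("demo", "hello")
    println(jedis.get("demo"))  // expected: hello
    jedis.close()
  }
}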
1.2.6. Configuring a Scala Development Environment in IDEA
1. In IDEA's settings, open the Plugins section and click Browse Repositories
2. Search for scala and install the plugin
3. If the online installation fails, download the plugin package from the Scala plugin page and install it manually
Find the matching version at:
http://plugins.jetbrains.com/plugin/1347-scala
When choosing a version, match the plugin's supported IDEA version and release date shown on that page; otherwise the plugin will not install (version compatibility is strict)
4. Install the Scala SDK. For development on Windows, download scala-2.11.11.msi and click through the installer
1.2.7. Writing the Flume + Kafka + Spark Streaming + Redis Demo
1. Core class
package com.hm.sparkstreaming.demo

import com.alibaba.fastjson.JSON
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Durations, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaWordCount {
  def main(args: Array[String]): Unit = {
    // In local mode at least 2 threads are needed: one to receive the data, one for the computation.
    // val sparkConf = new SparkConf().setAppName("KafkaWordCount").setMaster("local[2]")
    val sparkConf = new SparkConf().setAppName("KafkaWordCount").setMaster("spark://node1:7077")
    val sc = new StreamingContext(sparkConf, Durations.seconds(5))

    // Kafka connection parameters and the set of topics to read;
    // the direct stream can read several topics in parallel.
    val kafkaParams = Map[String, String]("metadata.broker.list" -> "192.168.1.135:9092")
    val topics = Set[String]("test")

    // Kafka returns key/value pairs; only the value needs to be split.
    val linerdd = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](sc, kafkaParams, topics)
    val wordrdd = linerdd.flatMap { _._2.split(" ") }

    // Write each record (a JSON string) into Redis as id -> name.
    // collect() pulls the batch to the driver, which is fine only for a small demo like this one.
    wordrdd.foreachRDD(rdd => {
      rdd.collect().foreach(item => {
        val jedis = RedisClient.clients
        val json = JSON.parseObject(item)
        val id = json.getString("id")
        val name = json.getString("name")
        jedis.set(id, name)
      })
      println("read from topic " + topics + ", records in this batch: " + rdd.count())
    })
    wordrdd.print()

    // Classic word count over each 5-second batch.
    val resultrdd = wordrdd.map { x => (x, 1) }.reduceByKey { _ + _ }
    resultrdd.print()

    sc.start()
    sc.awaitTermination()
    sc.stop()
  }
}
2. Redis helper classes
package com.hm.sparkstreaming.demo

import com.alibaba.fastjson.JSON
import redis.clients.jedis.Jedis

import scala.collection.JavaConversions._
import scala.collection.mutable

/**
 * User: wangzhijun
 */
object RedisClient {

  val clients = RedisConnector.clients

  def set(key: String, value: String): Unit = {
    clients.set(key, value)
  }

  def get(key: String): Option[String] = {
    val value = clients.get(key)
    if (value == null) None else Some(value)
  }

  def del(key: String): Unit = {
    clients.del(key)
  }

  def hset(hkey: String, key: String, value: String): Boolean = {
    clients.hset(hkey, key, value) == 1
  }

  def hget(hkey: String, key: String): Option[String] = {
    val value = clients.hget(hkey, key)
    if (value == null) None else Some(value)
  }

  def hdel(hkey: String, key: String): Option[Long] = {
    Some(clients.hdel(hkey, key))
  }

  def hmset(hkey: String, map: mutable.Map[String, String]): Unit = {
    clients.hmset(hkey, mapAsJavaMap(map))
  }

  def rpush(key: String, value: String): Option[Long] = {
    Some(clients.rpush(key, value))
  }

  def lpop(key: String): Option[String] = {
    val value = clients.lpop(key)
    if (value == null) None else Some(value)
  }

  def lhead(key: String): Option[String] = {
    val head = clients.lindex(key, 0)
    if (head == null) None else Some(head)
  }

  def incr(key: String): Option[Long] = {
    val inc = clients.incr(key)
    if (inc == null) None else Some(inc)
  }

  def expire(key: String, time: Int) = {
    clients.expire(key, time)
  }

  def ttl(key: String): Option[Long] = {
    Some(clients.ttl(key))
  }
}

object RedisConnector {
  /*
  private val jedisClusterNodes = new util.HashSet[HostAndPort]()
  jedisClusterNodes.add(new HostAndPort("192.168.1.135", 6379))
  */
  val clients = new Jedis("192.168.1.135", 6379)
}

object MainClass {
  def main(args: Array[String]): Unit = {
    val str = "{\"id\":\"1234\",\"userName\":\"wangzhijun\"}"
    val json = JSON.parseObject(str)
    val id = json.getString("id")
    val name = json.getString("userName")
    print(id)
    print(name)
  }
}
3. Maven pom.xml configuration
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.hm</groupId>
  <artifactId>SparkStreaming-Demo</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <name>${project.artifactId}</name>

  <properties>
    <maven.compiler.source>1.7</maven.compiler.source>
    <maven.compiler.target>1.7</maven.compiler.target>
    <encoding>UTF-8</encoding>
    <scala.version>2.11.0</scala.version>
    <scala.compat.version>2.11.8</scala.compat.version>
  </properties>

  <dependencies>
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-log4j12</artifactId>
      <version>1.7.8</version>
    </dependency>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>2.0.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.11</artifactId>
      <version>2.0.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
      <version>2.0.0</version>
    </dependency>
    <dependency>
      <groupId>log4j</groupId>
      <artifactId>log4j</artifactId>
      <version>1.2.17</version>
    </dependency>
    <dependency>
      <groupId>redis.clients</groupId>
      <artifactId>jedis</artifactId>
      <version>2.9.0</version>
    </dependency>
    <dependency>
      <groupId>com.alibaba</groupId>
      <artifactId>fastjson</artifactId>
      <version>1.2.47</version>
    </dependency>
  </dependencies>

  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <resources>
      <resource>
        <directory>src/main/resources</directory>
        <targetPath>${basedir}/target/classes</targetPath>
        <includes>
          <include>**/*.properties</include>
          <include>**/*.xml</include>
        </includes>
        <filtering>true</filtering>
      </resource>
      <resource>
        <directory>src/main/resources</directory>
        <targetPath>${basedir}/target/resources</targetPath>
        <includes>
          <include>**/*.properties</include>
          <include>**/*.xml</include>
        </includes>
        <filtering>true</filtering>
      </resource>
    </resources>
    <plugins>
      <plugin>
        <!-- see http://davidb.github.com/scala-maven-plugin -->
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.2.0</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
            <configuration>
              <args>
                <!-- <arg>-make:transitive</arg> -->
                <arg>-dependencyfile</arg>
                <arg>${project.build.directory}/.scala_dependencies</arg>
              </args>
            </configuration>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-surefire-plugin</artifactId>
        <version>2.18.1</version>
        <configuration>
          <useFile>false</useFile>
          <disableXmlReport>true</disableXmlReport>
          <includes>
            <include>**/*Test.*</include>
            <include>**/*Suite.*</include>
          </includes>
        </configuration>
      </plugin>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>2.6</version>
        <configuration>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>
1.2.8. Building and Submitting in Cluster Mode
1. Package the code as a jar and upload it to the server, for example to the /root directory
2. Go to /usr/local/spark-2.0.0-bin-hadoop2.6/bin and run the following script
./spark-submit --class com.hm.sparkstreaming.demo.KafkaWordCount --executor-memory 2G --total-executor-cores 4 /root/SparkStreaming-Redis-Demo.jar
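Once the job is running, an end-to-end check (not in the original notes) is to push one JSON record through the whole pipeline and read it back from Redis. The paths and IPs follow the examples above, and the record format matches what KafkaWordCount parses:

echo '{"id":"1001","name":"alice"}' >> /usr/local/apache-flume-1.7.0/moinit/log.out
/usr/local/redis/bin/redis-cli -h 192.168.1.135 get 1001

The second command should print "alice" once the next 5-second batch has been processed.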
1.2.9. The Spark Web UI