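Spark Advanced Data Sources
I. Receiving Flume Data with Spark Streaming
1.1 Flume-style Push Mode
Flume is designed to push data between Flume agents. In this mode, Spark Streaming sets up a receiver that acts as an Avro agent for Flume, and Flume pushes data to that receiver.
(1) Step 1: The Flume configuration file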
#bin/flume-ng agent -n a4 -f myagent/option_Push -c conf -Dflume.root.logger=INFO,console
# Define the agent name and the names of the source, channel, and sink
a4.sources = r1
a4.channels = c1
a4.sinks = k1
# Define the source
a4.sources.r1.type = spooldir
a4.sources.r1.spoolDir = /usr/local/tmp_files/logs
# Define the channel
a4.channels.c1.type = memory
a4.channels.c1.capacity = 10000
a4.channels.c1.transactionCapacity = 100
# Define the sink
a4.sinks.k1.type = avro
a4.sinks.k1.hostname = 192.168.1.121
a4.sinks.k1.port = 1234
# Wire the source, channel, and sink together
a4.sources.r1.channels = c1
a4.sinks.k1.channel = c1
(2) Step 2: The Spark Streaming program
import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.storage.StorageLevel
object FlumeLogPush {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FlumeLogPush").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(3))
    // Create the DStream of Flume events. Push mode uses createStream, which starts
    // an Avro receiver on the host and port that the Flume avro sink points to.
    val flumeEvent = FlumeUtils.createStream(ssc, "192.168.1.121", 1234, StorageLevel.MEMORY_ONLY)
    // Convert the body of each Flume event to a string
    val lineDStream = flumeEvent.map(e => {
      new String(e.event.getBody.array, "UTF-8")
    })
    // Print the result
    lineDStream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
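The string conversion in the program above reads the whole backing array of the event body and relies on a charset. As a minimal sketch, assuming the bodies are UTF-8 text, the conversion could instead decode only the readable bytes of the buffer. The object FlumeBodyDecoder and its method decodeBody are hypothetical names introduced for illustration; they are not part of Spark or Flume.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

object FlumeBodyDecoder {
  // Hypothetical helper (not a Spark or Flume API): decodes only the bytes between
  // the buffer's position and limit, assuming the body is UTF-8 text.
  def decodeBody(body: ByteBuffer): String = {
    val view = body.duplicate()                  // leave the caller's position/limit untouched
    val bytes = new Array[Byte](view.remaining())
    view.get(bytes)                              // copy the readable bytes out of the view
    new String(bytes, StandardCharsets.UTF_8)
  }
}
```

In the streaming job, this could be used as flumeEvent.map(e => FlumeBodyDecoder.decodeBody(e.event.getBody)).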
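(3) Step 3: Note that, in addition to the jars in Flume's lib directory, the following jars are also required:
Link: https://pan.baidu.com/s/1v7jhZ4A1tK-GKUNH-lCwHw
Extraction code: 5kw8
(4) Step 4: Test
Start the Spark Streaming program
Start Flume
Copy log files into the /usr/local/tmp_files/logs directory
Watch the output to confirm that the data has been collected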
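1.2 Pull Mode Based on a Custom Sink
Instead of Flume pushing data directly to Spark Streaming, the second mode runs a custom Flume sink that works as follows. Flume pushes data into the sink, where it stays buffered. Spark Streaming uses a reliable Flume receiver and transactions to pull data from the sink. A transaction succeeds only after the data has been received and replicated by Spark Streaming. Compared with the first mode, this gives stronger reliability and fault-tolerance guarantees. However, it requires configuring Flume to run a custom sink.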
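The buffering and transaction behaviour described above can be sketched with a toy model. This is only an illustration of the idea, not how Flume or Spark implement it: the class BufferingSink and its methods put, poll, ack, and nack are hypothetical names, and events are modelled as plain strings.

```scala
import scala.collection.mutable

// Toy model only: this class is hypothetical and is not a Flume or Spark API.
final class BufferingSink {
  private val buffer = mutable.Queue[String]()
  private var inFlight: Vector[String] = Vector.empty

  // Flume side: events pushed into the sink stay buffered.
  def put(event: String): Unit = buffer.enqueue(event)

  // Receiver side: hand out up to `max` events, but keep them buffered for now.
  def poll(max: Int): Vector[String] = {
    require(inFlight.isEmpty, "previous batch not yet acknowledged")
    inFlight = buffer.take(max).toVector
    inFlight
  }

  // Commit: the receiver has stored the batch, so the events can be dropped.
  def ack(): Unit = {
    inFlight.foreach(_ => buffer.dequeue())
    inFlight = Vector.empty
  }

  // Roll back: the events stay buffered and will be handed out again.
  def nack(): Unit = inFlight = Vector.empty

  def pending: Int = buffer.size
}
```

In this model a failed batch (nack) loses nothing, because events leave the buffer only on ack. That is the property that makes the pull mode more robust than the push mode.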
The configuration steps are as follows:
(1) Step 1: The Flume configuration file
#bin/flume-ng agent -n a1 -f myagent/option_Pull -c conf -Dflume.root.logger=INFO,console
a1.channels = c1
a1.sinks = k1
a1.sources = r1
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /usr/local/tmp_files/logs
a1.channels.c1.type = memory
a1.channels.c1.capacity = 100000
a1.channels.c1.transactionCapacity = 100000
a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
a1.sinks.k1.hostname = 192.168.1.121
a1.sinks.k1.port = 1234
# Wire the source, channel, and sink together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
(2) Step 2: The Spark Streaming program
import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.storage.StorageLevel
object FlumeLogPull {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FlumeLogPull").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    // Create the DStream of Flume events by polling the SparkSink on the Flume agent
    val flumeEvent = FlumeUtils.createPollingStream(ssc, "192.168.1.121", 1234, StorageLevel.MEMORY_ONLY_SER_2)
    // Convert the body of each Flume event to a string
    val lineDStream = flumeEvent.map(e => {
      new String(e.event.getBody.array, "UTF-8")
    })
    // Print the result
    lineDStream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
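(3) Step 3: Required jars
Copy Spark's jars into Flume's lib directory
The jar below must also be copied into Flume's lib directory and added to the classpath of the IDEA project
Link: https://pan.baidu.com/s/1v7jhZ4A1tK-GKUNH-lCwHw
Extraction code: 5kw8
(4) Step 4: Test
Start Flume
Start FlumeLogPull in IDEA
Copy the test data into /usr/local/tmp_files/logs
Watch the output in IDEA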
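II. Receiving Kafka Data with Spark Streaming
Apache Kafka is a high-throughput distributed publish-subscribe messaging system.
2.1 Setting up ZooKeeper (Standalone):
(1) Configure the /root/training/zookeeper-3.4.10/conf/zoo.cfg file
dataDir=/root/training/zookeeper-3.4.10/tmp
server.1=spark81:2888:3888
(2) In the /root/training/zookeeper-3.4.10/tmp directory, create a file named myid that contains the server ID (1, matching server.1 above)
echo 1 > /root/training/zookeeper-3.4.10/tmp/myid
2.2 Setting up Kafka (single machine, single broker):
(1) Edit the server.properties file
(2) Start Kafka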
bin/kafka-server-start.sh config/server.properties &
(3) Test Kafka
# Create a topic
bin/kafka-topics.sh --create --zookeeper spark81:2181 --replication-factor 1 --partitions 3 --topic mydemo1
# Send messages
bin/kafka-console-producer.sh --broker-list spark81:9092 --topic mydemo1
# Receive messages
bin/kafka-console-consumer.sh --zookeeper spark81:2181 --topic mydemo1
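2.3 Setting up a Development Environment for Spark Streaming and Kafka
Integrating Spark Streaming with Kafka pulls in many jars, and some of them conflict with each other. Building the project with Maven is strongly recommended.
The pom.xml with the required dependencies is shown below: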
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>ZDemo5</groupId>
<artifactId>ZDemo5</artifactId>
<version>1.0-SNAPSHOT</version>
<properties>
<spark.version>2.1.0</spark.version>
<scala.version>2.11</scala.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_${scala.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-8_${scala.version}</artifactId>
<version>${spark.version}</version>
</dependency>
</dependencies>
</project>
2.4 Receiver-based Approach
This approach uses receivers to receive the data. The receivers are implemented with Kafka's high-level consumer API. As with all receivers, the data received from Kafka is stored in Spark executors, and jobs launched by Spark Streaming then process it.
(1) Develop the Spark Streaming Kafka receiver program
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
object KafkaWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    // Topic name -> number of consumer threads used to read that topic
    val topics = Map("mydemo1" -> 1)
    // Create the Kafka input stream, giving the ZooKeeper address and the consumer group
    val kafkaStream = KafkaUtils.createStream(ssc, "192.168.1.121:2181", "mygroup", topics)
    // Each record is a (key, value) pair; keep only the message value
    val lineDStream = kafkaStream.map(e => e._2)
    // Print the result
    lineDStream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
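The program above only prints the received lines, although it is named KafkaWordCount. As a minimal sketch of the counting step for a single batch, written against plain Scala collections so that it runs without a cluster: the object BatchWordCount and the method countWords are hypothetical names, and splitting on whitespace is an assumption about the message format.

```scala
object BatchWordCount {
  // Counts the words in the lines of one batch.
  // Assumption: words are separated by whitespace.
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))    // split every line into words
      .filter(_.nonEmpty)          // drop empty tokens caused by leading whitespace
      .groupBy(identity)           // group equal words together
      .map { case (word, occurrences) => word -> occurrences.size }
}
```

On a DStream, the same steps are usually written as flatMap, then map(word => (word, 1)), then reduceByKey(_ + _).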
(2) Test
Start the Kafka message producer
bin/kafka-console-producer.sh --broker-list spark81:9092 --topic mydemo1
Start the job in IDEA to receive the Kafka messages
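2.5 Direct Approach
Unlike the receiver-based approach, this approach periodically queries Kafka for the latest offsets of each topic+partition and uses them to define the offset range that each batch will process. When the jobs that process the data are launched, Spark reads the defined offset ranges from Kafka through Kafka's simple consumer API.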
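The way offset ranges are derived for a batch can be sketched in plain Scala. This is a simplified model, not Spark's implementation: the object OffsetPlanner, the case class BatchRange, and the method nextBatch are hypothetical names, and starting a previously unseen partition at offset 0 is an assumption made for the example.

```scala
object OffsetPlanner {
  // Illustrative type only; it does not replace Spark's own offset range class.
  final case class BatchRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long) {
    // Number of records in the half-open range [fromOffset, untilOffset)
    def count: Long = untilOffset - fromOffset
  }

  // For every partition, the next batch covers [last processed offset, latest offset).
  // Assumption: a partition without a processed offset starts at 0.
  def nextBatch(topic: String, processed: Map[Int, Long], latest: Map[Int, Long]): Seq[BatchRange] =
    latest.toSeq.sortBy(_._1).map { case (partition, until) =>
      BatchRange(topic, partition, processed.getOrElse(partition, 0L), until)
    }
}
```

For example, with processed offsets Map(0 -> 5L) and latest offsets Map(0 -> 9L), the next batch for partition 0 covers offsets 5 to 8, which is four records.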
(1) Develop the Spark Streaming program
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
object DirectKafkaWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DirectKafkaWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    // The set of topics to read from
    val topics = Set("mydemo1")
    // Specify the Kafka broker address
    val kafkaParams = Map[String, String]("metadata.broker.list" -> "192.168.1.121:9092")
    // Create the DStream that reads data directly from Kafka
    val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
    // Each record is a (key, value) pair; keep only the message value
    val lineDStream = kafkaStream.map(e => e._2)
    // Print the result
    lineDStream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
(2) Test
Start the Kafka message producer
bin/kafka-console-producer.sh --broker-list spark81:9092 --topic mydemo1
Start the job in IDEA to receive the Kafka messages