Flume pushes data to Spark Streaming
1. Flume configuration file
# Define the names of the source, channel, and sink
a1.sources = s1
a1.channels = c1
a1.sinks = avroSink
# Source settings
a1.sources.s1.type = netcat
a1.sources.s1.bind = localhost
a1.sources.s1.port = 5678
a1.sources.s1.channels = c1
# Channel settings
a1.channels.c1.type = memory
# Sink settings
a1.sinks.avroSink.type = avro
a1.sinks.avroSink.channel = c1
a1.sinks.avroSink.hostname = singleNode
a1.sinks.avroSink.port = 9999
2. Spark program
package SparkStreaming // matches the --class path used in the spark-submit command below

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FlumePushSparkStream {
  def main(args: Array[String]): Unit = {
    // local is set to 4 here; it was local[*] at first, but the jar would not run with that setting
    val conf = new SparkConf().setMaster("local[4]").setAppName("demo")
    val ssc = new StreamingContext(conf, Seconds(5))
    // In push mode the Spark receiver acts as the Avro server, so this host/port
    // must match the avro sink's hostname/port (singleNode:9999)
    val flumeStream: ReceiverInputDStream[SparkFlumeEvent] = FlumeUtils.createStream(
      ssc,
      "192.168.136.20",
      9999
    )
    val lines: DStream[String] = flumeStream.map(x => new String(x.event.getBody.array()).trim)
    val result: DStream[(String, Int)] = lines
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _)
    result.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
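Each SparkFlumeEvent wraps an Avro event, so the Flume headers can be read alongside the body when needed. A minimal sketch, reusing the flumeStream above:
val withHeaders: DStream[(String, String)] = flumeStream.map { e =>
  // getHeaders returns a java.util.Map[CharSequence, CharSequence]; toString is enough for a quick look
  (e.event.getHeaders.toString, new String(e.event.getBody.array()).trim)
}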
3. pom file
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>cn.kgc</groupId>
<artifactId>SparkLearn</artifactId>
<version>1.0-SNAPSHOT</version>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
<hadoop.version>2.6.0-cdh5.14.2</hadoop.version>
<hive.version>1.1.0-cdh5.14.2</hive.version>
<hbase.version>1.2.0-cdh5.14.2</hbase.version>
<scala.version>2.11.8</scala.version>
<spark.version>2.4.5</spark.version>
</properties>
<repositories>
<repository>
<id>cloudera</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
</repositories>
<dependencies>
<!--scala-->
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.11.8</version>
</dependency>
<!-- spark-core -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.4.5</version>
</dependency>
<!-- spark-sql -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.4.5</version>
</dependency>
<!-- spark-hive -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.11</artifactId>
<version>2.4.5</version>
</dependency>
<!-- spark-graphx -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-graphx_2.11</artifactId>
<version>2.4.5</version>
</dependency>
<!-- spark-streaming -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.4.5</version>
</dependency>
<!-- spark-streaming-flume -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-flume_2.11</artifactId>
<version>2.4.5</version>
</dependency>
<!--kafka-->
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>0.11.0.2</version>
</dependency>
<!-- mysql-connector-java -->
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.38</version>
</dependency>
<!-- hadoop -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.6.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.6.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>2.6.0</version>
</dependency>
<!-- log4j -->
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.17</version>
</dependency>
<!-- junit -->
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
</dependency>
</dependencies>
<build>
<plugins>
<!-- Java compiler plugin -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.2</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
<encoding>UTF-8</encoding>
</configuration>
<executions>
<execution>
<phase>compile</phase>
<goals>
<goal>compile</goal>
</goals>
</execution>
</executions>
</plugin>
<!-- Scala compiler plugin -->
<plugin>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<version>2.15.2</version>
<executions>
<execution>
<id>scala-compile-first</id>
<goals>
<goal>compile</goal>
</goals>
<configuration>
<includes>
<include>**/*.scala</include>
</includes>
</configuration>
</execution>
</executions>
</plugin>
<!-- Bundle all dependencies into the jar (fat jar) -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<version>2.6</version>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
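A hedged aside: when the jar is submitted to a cluster that already ships Spark, the spark-core/spark-sql/spark-streaming dependencies are commonly marked provided so the fat jar stays small; spark-streaming-flume must keep compile scope because it is not bundled with Spark. For example:
<!-- spark-core, provided by the cluster's Spark installation -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.4.5</version>
<scope>provided</scope>
</dependency>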
4. spark-submit
Build the fat jar and upload it to the VM, for example as sketched below.
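A minimal sketch of the build-and-upload step, assuming the assembly plugin configured above; the user, host, and target path are placeholders:
mvn clean package
scp target/SparkLearn-1.0-SNAPSHOT-jar-with-dependencies.jar root@singleNode:~/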
spark-submit --class SparkStreaming.FlumePushSparkStream SparkLearn-1.0-SNAPSHOT-jar-with-dependencies.jar
5. Run Flume (the Spark application must already be running: in push mode the avro sink connects out to the Spark receiver)
flume-ng agent --name a1 --conf /opt/install/flume/conf/ --conf-file ./flume-push.conf -Dflume.root.logger=INFO,console
6. Connect with telnet
telnet 192.168.136.20 5678
(Note: the netcat source above binds to localhost, so if this connection is refused, run telnet localhost 5678 on the agent's machine instead.)
After sending a few messages, the processed word counts appear in the output of the running job:
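For example, typing hello world hello into the telnet session would make one batch print something like this (the timestamp is illustrative):
-------------------------------------------
Time: 1596243900000 ms
-------------------------------------------
(hello,2)
(world,1)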
Spark Streaming pulls data from Flume
1. Flume configuration file
a2.sources=s1
a2.channels=c1
a2.sinks=k1
a2.sources.s1.type=netcat
a2.sources.s1.bind=localhost
a2.sources.s1.port=7777
a2.sources.s1.channels=c1
a2.channels.c1.type=memory
a2.sinks.k1.type=org.apache.spark.streaming.flume.sink.SparkSink
a2.sinks.k1.hostname=hadoop01
a2.sinks.k1.port=9999
a2.sinks.k1.channel=c1
2. Upload jars to Flume
avro-1.8.2.jar
avro-ipc-1.8.2.jar
commons-lang3-3.5.jar
scala-library-2.11.8.jar
spark-streaming-flume-sink_2.11-2.4.4.jar
spark-streaming-flume_2.11-2.4.4.jar
If you need them, they can be downloaded from the link below:
Flume Sink jar (extraction code: 7gz0)
Six jars in total. Then move the old versions out of the way (renamed to .bak rather than deleted):
mv scala-library-2.10.5.jar scala-library-2.10.5.jar.bak
mv avro-1.7.4.jar avro-1.7.4.jar.bak
mv avro-ipc-1.7.4.jar avro-ipc-1.7.4.jar.bak
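A minimal sketch of placing the six new jars, assuming Flume is installed under /opt/install/flume (the path used by the flume-ng commands in this post):
cp avro-1.8.2.jar avro-ipc-1.8.2.jar commons-lang3-3.5.jar \
scala-library-2.11.8.jar spark-streaming-flume-sink_2.11-2.4.4.jar \
spark-streaming-flume_2.11-2.4.4.jar /opt/install/flume/lib/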
3. Write the Spark program
package SparkStreaming // matches the --class path used in the spark-submit command below

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkPull {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[4]").setAppName("SparkPull")
    val ssc = new StreamingContext(conf, Seconds(5))
    // 1. Load the data source: in pull mode the Spark receiver polls the SparkSink on hadoop01:9999
    val flumeStream: ReceiverInputDStream[SparkFlumeEvent] = FlumeUtils.createPollingStream(ssc, "hadoop01", 9999)
    val lines: DStream[String] = flumeStream.map(x => new String(x.event.getBody.array()).trim)
    // 2. Process the data: word count per 5-second batch
    val result: DStream[(String, Int)] = lines.flatMap(_.split(" "))
      .map((_, 1)).reduceByKey(_ + _)
    result.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
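createPollingStream also has an overload that takes several sink addresses plus batching knobs, which is useful when more than one agent runs a SparkSink. A hedged sketch, reusing the ssc above (values illustrative):
import java.net.InetSocketAddress
import org.apache.spark.storage.StorageLevel

val tuned: ReceiverInputDStream[SparkFlumeEvent] = FlumeUtils.createPollingStream(
  ssc,
  Seq(new InetSocketAddress("hadoop01", 9999)),
  StorageLevel.MEMORY_AND_DISK_SER_2,
  maxBatchSize = 1000, // max events pulled per polling request
  parallelism = 5      // concurrent connections polling the sink
)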
4. Start Flume (in pull mode the agent hosts the SparkSink, so it should be running before the Spark job begins polling)
flume-ng agent --name a2 --conf /opt/install/flume/conf/ --conf-file ./flume-pull.conf -Dflume.root.logger=INFO,console
5. Package the code as a fat jar and upload it to the cluster to run
spark-submit --class SparkStreaming.SparkPull flume-sparkStreaming-1.0-SNAPSHOT-jar-with-dependencies.jar
6. Start telnet and enter data
telnet localhost 7777
The word-count output appears on the running job's console, just as in the push example.