Spark Streaming can be integrated with Flume in two ways:
Approach 1: Flume-style Push-based Approach
Approach 2: Pull-based Approach using a Custom Sink
This post covers the first approach.
In push mode, Spark Streaming essentially acts as an Avro agent for Flume: Flume pushes events to it through an avro sink. Because the data is pushed, the Spark Streaming application must be started first, so that its receiver is already listening when the Flume agent begins sending.
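On the Spark side, the receiver is created with FlumeUtils.createStream from the spark-streaming-flume module. A minimal sketch of the core call (the complete, runnable program appears in the Code section below):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

// Push approach: Spark Streaming listens on the host/port that the
// Flume avro sink is configured to push to (192.168.2.100:41414 here).
val conf = new SparkConf().setMaster("local[2]").setAppName("FlumePushSketch")
val ssc = new StreamingContext(conf, Seconds(5))
val flumeStream = FlumeUtils.createStream(ssc, "192.168.2.100", 41414)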
On node1, go to the Flume conf directory
cd /app/flume/flume/conf
Create the Flume agent configuration file
vi test-flume-push-streaming.conf
#flume-push-streaming
flume-push-streaming.sources = netcat-source
flume-push-streaming.sinks = avro-sink
flume-push-streaming.channels = memory-channel
flume-push-streaming.sources.netcat-source.type = netcat
flume-push-streaming.sources.netcat-source.bind = node1
flume-push-streaming.sources.netcat-source.port = 44444
# The host and port that Flume pushes data to.
# 192.168.2.100 is the machine that runs the Spark Streaming program from IDEA.
flume-push-streaming.sinks.avro-sink.type = avro
flume-push-streaming.sinks.avro-sink.hostname = 192.168.2.100
flume-push-streaming.sinks.avro-sink.port = 41414
flume-push-streaming.channels.memory-channel.type = memory
flume-push-streaming.sources.netcat-source.channels = memory-channel
flume-push-streaming.sinks.avro-sink.channel = memory-channel
Project structure
pom.xml
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.sid.spark</groupId>
    <artifactId>spark-train</artifactId>
    <version>1.0</version>
    <inceptionYear>2008</inceptionYear>

    <properties>
        <scala.version>2.11.8</scala.version>
        <kafka.version>0.9.0.0</kafka.version>
        <spark.version>2.2.0</spark.version>
        <hadoop.version>2.9.0</hadoop.version>
        <hbase.version>1.4.4</hbase.version>
    </properties>

    <repositories>
        <repository>
            <id>scala-tools.org</id>
            <name>Scala-Tools Maven2 Repository</name>
            <url>http://scala-tools.org/repo-releases</url>
        </repository>
    </repositories>

    <pluginRepositories>
        <pluginRepository>
            <id>scala-tools.org</id>
            <name>Scala-Tools Maven2 Repository</name>
            <url>http://scala-tools.org/repo-releases</url>
        </pluginRepository>
    </pluginRepositories>

    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka_2.11</artifactId>
            <version>${kafka.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
            <exclusions>
                <exclusion>
                    <artifactId>servlet-api</artifactId>
                    <groupId>javax.servlet</groupId>
                </exclusion>
            </exclusions>
        </dependency>
        <!--<dependency>-->
            <!--<groupId>org.apache.hbase</groupId>-->
            <!--<artifactId>hbase-client</artifactId>-->
            <!--<version>${hbase.version}</version>-->
        <!--</dependency>-->
        <!--<dependency>-->
            <!--<groupId>org.apache.hbase</groupId>-->
            <!--<artifactId>hbase-server</artifactId>-->
            <!--<version>${hbase.version}</version>-->
        <!--</dependency>-->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <!-- Spark Streaming / Flume integration (the key dependency for this post) -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-flume_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>net.jpountz.lz4</groupId>
            <artifactId>lz4</artifactId>
            <version>1.3.0</version>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.31</version>
        </dependency>
    </dependencies>

    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <testSourceDirectory>src/test/scala</testSourceDirectory>
        <plugins>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <scalaVersion>${scala.version}</scalaVersion>
                    <args>
                        <arg>-target:jvm-1.8</arg>
                    </args>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-eclipse-plugin</artifactId>
                <configuration>
                    <downloadSources>true</downloadSources>
                    <buildcommands>
                        <buildcommand>ch.epfl.lamp.sdt.core.scalabuilder</buildcommand>
                    </buildcommands>
                    <additionalProjectnatures>
                        <projectnature>ch.epfl.lamp.sdt.core.scalanature</projectnature>
                    </additionalProjectnatures>
                    <classpathContainers>
                        <classpathContainer>org.eclipse.jdt.launching.JRE_CONTAINER</classpathContainer>
                        <classpathContainer>ch.epfl.lamp.sdt.launching.SCALA_CONTAINER</classpathContainer>
                    </classpathContainers>
                </configuration>
            </plugin>
        </plugins>
    </build>

    <reporting>
        <plugins>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <configuration>
                    <scalaVersion>${scala.version}</scalaVersion>
                </configuration>
            </plugin>
        </plugins>
    </reporting>
</project>
Code
package com.sid.spark

import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * Created by jy02268879 on 2018/7/18.
  *
  * Spark Streaming integrated with Flume via the push approach: a stateless word count.
  */
object FlumePushSparkStreaming {

  def main(args: Array[String]): Unit = {

    // The host and port on which Spark Streaming receives the pushed data
    if (args.length != 2) {
      System.err.println("Usage: FlumePushSparkStreaming <hostname> <port>")
      System.exit(1)
    }
    val Array(hostname, port) = args

    // Run locally
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("FlumePushSparkStreaming")
    // When submitting to a cluster, don't hard-code setMaster and setAppName:
    // val sparkConf = new SparkConf()

    val ssc = new StreamingContext(sparkConf, Seconds(5))

    // This code runs on 192.168.2.100; Flume pushes the data to this machine
    val flumeStream = FlumeUtils.createStream(ssc, hostname, port.toInt)

    // A Flume event carries headers and a body; only the body is needed here
    flumeStream.map(x => new String(x.event.getBody.array()).trim)
      .flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
Run the project in IDEA, passing the hostname and port as program arguments (192.168.2.100 41414, matching the avro sink configuration).
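The avro sink can only deliver events once something is listening on 192.168.2.100:41414, so it is worth confirming the receiver is up before starting the agent. A quick check, assuming netstat is available on the machine running the streaming job:

# On 192.168.2.100: confirm the Avro receiver is listening on 41414
netstat -an | grep 41414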
Start the agent on node1
cd /app/flume/flume
bin/flume-ng agent --name flume-push-streaming -c conf -f conf/test-flume-push-streaming.conf -Dflume.root.logger=INFO,console
On node1, connect to the netcat source:
telnet node1 44444
This works because the Flume source is configured to receive data on port 44444 of node1.
Type some lines to send data.
The IDEA console prints the word counts for each batch.
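For illustration only: with a hypothetical input of "hello world hello sid" typed into the telnet session, a 5-second batch prints something like:

-------------------------------------------
Time: 1531908000000 ms
-------------------------------------------
(hello,2)
(world,1)
(sid,1)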
Once the local run works, package the project and run it on the server.
Modify the code
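Concretely, swap the two SparkConf lines, as the comments in the code above indicate, so that spark-submit supplies the master URL and application name:

// Local run (comment this out before packaging):
// val sparkConf = new SparkConf().setMaster("local[2]").setAppName("FlumePushSparkStreaming")
// Cluster submission: master and app name come from spark-submit
val sparkConf = new SparkConf()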
Package it
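A standard Maven packaging command works, for example:

mvn clean package -DskipTests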
Copy the jar from the target directory to the server, then submit it to Spark:
cd /app/spark/spark-2.2.0-bin-2.9.0/bin
./spark-submit --class com.sid.spark.FlumePushSparkStreaming --master local[2] --name FlumePushSparkStreaming --packages org.apache.spark:spark-streaming-flume_2.11:2.2.0 /app/spark/test_data/spark-train-1.0-SNAPSHOT.jar node1 41414
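The --packages flag tells spark-submit to resolve org.apache.spark:spark-streaming-flume_2.11:2.2.0 and its transitive dependencies from a Maven repository at submit time, so the server needs network access to one. If it has none, one alternative is to ship the jars explicitly with --jars; a sketch (the jar path is a placeholder, and the flume integration jar's own dependencies must be listed too):

./spark-submit --class com.sid.spark.FlumePushSparkStreaming --master local[2] --name FlumePushSparkStreaming --jars /path/to/spark-streaming-flume_2.11-2.2.0.jar /app/spark/test_data/spark-train-1.0-SNAPSHOT.jar node1 41414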
Point the Flume avro sink at node1 (the only change from the earlier configuration is the sink hostname), then restart Flume.
cd /app/flume/flume/conf
vi test-flume-push-streaming.conf
#flume-push-streaming
flume-push-streaming.sources = netcat-source
flume-push-streaming.sinks = avro-sink
flume-push-streaming.channels = memory-channel
flume-push-streaming.sources.netcat-source.type = netcat
flume-push-streaming.sources.netcat-source.bind = node1
flume-push-streaming.sources.netcat-source.port = 44444
# The host and port that Flume pushes data to.
# The Spark Streaming job now runs on node1 itself, so the sink points at node1 instead of 192.168.2.100.
flume-push-streaming.sinks.avro-sink.type = avro
flume-push-streaming.sinks.avro-sink.hostname = node1
flume-push-streaming.sinks.avro-sink.port = 41414
flume-push-streaming.channels.memory-channel.type = memory
flume-push-streaming.sources.netcat-source.channels = memory-channel
flume-push-streaming.sinks.avro-sink.channel = memory-channel
Start the agent on node1
cd /app/flume/flume
bin/flume-ng agent --name flume-push-streaming -c conf -f conf/test-flume-push-streaming.conf -Dflume.root.logger=INFO,console
telnet node1 44444
Send some data and check the Spark job running on the server; the word counts appear in the spark-submit console in the same batch format as before.