A Simple Flume -> Kafka -> Spark Streaming -> MySQL Pipeline
This article implements a simple Spark Streaming pipeline: Flume watches a directory for files and pushes their contents into Kafka; Spark Streaming consumes the Kafka topic, processes the records, and writes the results to MySQL.
1. Flume and Kafka Configuration
1.1 Flume configuration
Create flume-file-kafka.conf under Flume's conf/group/ directory:
agent.sources = r1
agent.channels = c1
agent.sinks = s1
agent.sources.r1.type = spooldir
# Directory monitored for new files
agent.sources.r1.spoolDir = /opt/module/data/log/
agent.sources.r1.fileHeader = true
# Each sink's type must be defined
#agent.sinks.s1.type = logger
agent.sinks.s1.type = org.apache.flume.sink.kafka.KafkaSink
# Kafka topic to write to
agent.sinks.s1.topic = expr
agent.sinks.s1.brokerList = hadoop101:9092
agent.sinks.s1.requiredAcks = 1
agent.sinks.s1.batchSize = 2
# Each channel's type is defined.
agent.channels.c1.type = memory
agent.channels.c1.capacity = 100
agent.sources.r1.channels = c1
agent.sinks.s1.channel = c1
1.2 Create the Kafka Topic, Producer, and Consumer
(1) Create the topic
./bin/kafka-topics.sh --create --bootstrap-server hadoop101:9092 --replication-factor 3 --partitions 2 --topic expr
(2) Create a console producer
./kafka-console-producer.sh --broker-list hadoop101:9092 --topic expr
(3) Create a console consumer
./kafka-console-consumer.sh --bootstrap-server hadoop101:9092 --topic expr
1.3 Place vegetable.txt in /opt/module/data/log/
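The exact layout of vegetable.txt is not reproduced here, but the Spark program in section 4 assumes each line is space-separated, with the timestamp in field 0, the type in field 1, and the price in field 5. A minimal sketch of that assumption (the object name is only illustrative):
object VegetableLineFormat {
  // Mirrors the field indices used by the Spark job in section 4:
  // space-separated line, field 0 = time, field 1 = type, field 5 = price
  def parseLine(line: String): (String, String, String) = {
    val fields = line.split(" ")
    (fields(0), fields(1), fields(5))
  }
}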
2. Start Flume
2.1 Start Flume and observe
./bin/flume-ng agent --conf conf -f ./conf/group/flume-file-kafka.conf -n agent -Dflume.root.logger=INFO,console
The Kafka console consumer should now show the records from vegetable.txt.
2.2 Rename the completed file to re-ingest the data
At this point the contents of vegetable.txt have been consumed, and the spooling directory source has renamed the file with a .COMPLETED suffix; removing the suffix makes Flume pick it up again:
mv vegetable.txt.COMPLETED vegetable.txt
Kafka then receives the data again.
3. Create the MySQL Database and Table
Create the table that will hold the results of the Spark processing:
show databases;
drop database if exists test_1;
create database test_1 default charset utf8;
use test_1;
create table price_test(
time timestamp,
type varchar(100),
price double
);
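Before starting the streaming job, a quick JDBC sanity check can confirm the table is reachable. A minimal sketch, assuming the same MySQL host, user, and password that the Spark program in section 4 uses (the object name is only illustrative):
import java.sql.DriverManager

object MysqlCheck {
  def main(args: Array[String]): Unit = {
    // Same connection settings as the Spark program in section 4
    Class.forName("com.mysql.jdbc.Driver")
    val conn = DriverManager.getConnection(
      "jdbc:mysql://192.168.226.40:3306/test_1?useSSL=false", "root", "123456")
    try {
      val rs = conn.createStatement().executeQuery("select count(*) from price_test")
      while (rs.next()) println("rows in price_test: " + rs.getInt(1))
    } finally {
      conn.close()
    }
  }
}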
4. Write and Run the Spark Program
pom.xml dependencies:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.example</groupId>
<artifactId>sparktest</artifactId>
<version>1.0-SNAPSHOT</version>
<inceptionYear>2008</inceptionYear>
<properties>
<scala.version>2.12.7</scala.version>
<spark.version>3.0.0</spark.version>
</properties>
<repositories>
<repository>
<id>scala-tools.org</id>
<name>Scala-Tools Maven2 Repository</name>
<url>http://scala-tools.org/repo-releases</url>
</repository>
</repositories>
<pluginRepositories>
<pluginRepository>
<id>scala-tools.org</id>
<name>Scala-Tools Maven2 Repository</name>
<url>http://scala-tools.org/repo-releases</url>
</pluginRepository>
</pluginRepositories>
<dependencies>
<!-- <dependency>-->
<!-- <groupId>org.scala-lang</groupId>-->
<!-- <artifactId>scala-library</artifactId>-->
<!-- <version>${scala.version}</version>-->
<!-- </dependency>-->
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.49</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.12</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-core</artifactId>
<version>2.10.1</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
<version>2.10.1</version>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.2.83</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.4</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.specs</groupId>
<artifactId>specs</artifactId>
<version>1.2.5</version>
<scope>test</scope>
</dependency>
</dependencies>
<build>
<sourceDirectory>src/main/scala</sourceDirectory>
<testSourceDirectory>src/test/scala</testSourceDirectory>
<plugins>
<plugin>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
<configuration>
<scalaVersion>${scala.version}</scalaVersion>
<args>
<arg>-target:jvm-1.8</arg>
</args>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-eclipse-plugin</artifactId>
<configuration>
<downloadSources>true</downloadSources>
<buildcommands>
<buildcommand>ch.epfl.lamp.sdt.core.scalabuilder</buildcommand>
</buildcommands>
<additionalProjectnatures>
<projectnature>ch.epfl.lamp.sdt.core.scalanature</projectnature>
</additionalProjectnatures>
<classpathContainers>
<classpathContainer>org.eclipse.jdt.launching.JRE_CONTAINER</classpathContainer>
<classpathContainer>ch.epfl.lamp.sdt.launching.SCALA_CONTAINER</classpathContainer>
</classpathContainers>
</configuration>
</plugin>
</plugins>
</build>
<reporting>
<plugins>
<plugin>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<configuration>
<scalaVersion>${scala.version}</scalaVersion>
</configuration>
</plugin>
</plugins>
</reporting>
</project>
test.scala
package org.example
import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import java.sql.{Connection, DriverManager, PreparedStatement}
object spark_test {
  def main(args: Array[String]): Unit = {
    // 1. Create the SparkConf
    val sparkConf: SparkConf = new SparkConf().setAppName("ReceiverWordCount").setMaster("local[*]")
    // 2. Create the StreamingContext with a 3-second batch interval
    val ssc = new StreamingContext(sparkConf, Seconds(3))
    // 3. Define the Kafka consumer parameters
    val kafkaPara: Map[String, Object] = Map[String, Object](
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "192.168.226.40:9092",
      ConsumerConfig.GROUP_ID_CONFIG -> "expr",
      "key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
      "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer"
    )
    // 4. Create a DStream from the Kafka topic
    val kafkaDStream: InputDStream[ConsumerRecord[String, String]] =
      KafkaUtils.createDirectStream[String, String](ssc, LocationStrategies.PreferConsistent,
        ConsumerStrategies.Subscribe[String, String](Set("expr"), kafkaPara))
    // 5. Keep only the value of each message
    val valueDStream: DStream[String] = kafkaDStream.map(record => record.value())
    // 6. Split each record into fields: space-separated, with the timestamp in
    //    field 0, the type in field 1 and the price in field 5
    val value: DStream[(String, String, String)] = valueDStream.flatMap(_.split("\n")).map {
      line =>
        val fields = line.split(" ")
        (fields(0), fields(1), fields(5))
    }
    value.print()
    // Write each partition to MySQL over its own connection (JDBC connections are
    // not serializable, so they must be created inside foreachPartition)
    value.foreachRDD(rdd => rdd.foreachPartition(partition => {
      Class.forName("com.mysql.jdbc.Driver")
      val conn = DriverManager.getConnection(
        "jdbc:mysql://192.168.226.40:3306/test_1?useSSL=false&useUnicode=true&characterEncoding=UTF-8",
        "root", "123456")
      try {
        // A parameterized statement avoids quoting problems in concatenated SQL
        val stmt: PreparedStatement = conn.prepareStatement(
          "insert into price_test(time, type, price) values(?, ?, ?)")
        for (row <- partition) {
          stmt.setString(1, row._1)
          stmt.setString(2, row._2)
          stmt.setString(3, row._3)
          stmt.executeUpdate()
        }
        stmt.close()
      } finally {
        conn.close()
      }
    }))
    // 7. Start the streaming job
    ssc.start()
    ssc.awaitTermination()
  }
}
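The per-row executeUpdate above keeps the example simple. For larger partitions, a JDBC batch per partition usually performs better; a rough sketch of that variant, written as a drop-in replacement for the value.foreachRDD block above (same connection settings, illustrative only):
// Drop-in replacement for the write step above, using one JDBC batch per partition
value.foreachRDD(rdd => rdd.foreachPartition(partition => {
  if (partition.nonEmpty) {
    Class.forName("com.mysql.jdbc.Driver")
    val conn = java.sql.DriverManager.getConnection(
      "jdbc:mysql://192.168.226.40:3306/test_1?useSSL=false&useUnicode=true&characterEncoding=UTF-8",
      "root", "123456")
    try {
      val stmt = conn.prepareStatement("insert into price_test(time, type, price) values(?, ?, ?)")
      for (row <- partition) {
        stmt.setString(1, row._1)
        stmt.setString(2, row._2)
        stmt.setString(3, row._3)
        stmt.addBatch()
      }
      stmt.executeBatch() // one round trip per partition instead of one per row
      stmt.close()
    } finally {
      conn.close()
    }
  }
}))
Opening the connection inside foreachPartition rather than on the driver is deliberate: the connection cannot be serialized, so each partition on the executors creates and closes its own.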
5. Check the Results
The startup order matters: start the Kafka consumer first, then Flume, then immediately the Scala program. (If the program exits right away without errors, simply start it again, then rename the file suffix as in section 2.2 so the data is ingested again.)
That is about all for this article. The example is quite basic; if you would like the data set so you can work through it yourself, follow me and send a private message. Good luck.