Today's post covers a simple integration of Kafka and Spark Streaming, implemented by following an example found online.
1. Environment: Spark 2.2.0, Scala 2.11.8, kafka_2.10-0.10.2.1, JDK 1.8
2. Here is my pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>make</groupId>
  <artifactId>Spark_code_hive</artifactId>
  <version>1.0-SNAPSHOT</version>
  <inceptionYear>2008</inceptionYear>

  <properties>
    <scala.version>2.11.8</scala.version>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <encoding>UTF-8</encoding>
    <spark.version>2.2.0</spark.version>
    <hadoop.version>2.9.1</hadoop.version>
    <kafka.version>0.10.2.1</kafka.version>
  </properties>

  <repositories>
    <repository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </repository>
  </repositories>

  <pluginRepositories>
    <pluginRepository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </pluginRepository>
  </pluginRepositories>

  <dependencies>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.4</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.specs</groupId>
      <artifactId>specs</artifactId>
      <version>1.2.5</version>
      <scope>test</scope>
    </dependency>
    <!-- Spark core -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>${spark.version}</version>
      <scope>compile</scope>
    </dependency>
    <!-- Spark SQL -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.11</artifactId>
      <version>${spark.version}</version>
      <scope>compile</scope>
    </dependency>
    <!-- Spark Streaming -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.11</artifactId>
      <version>${spark.version}</version>
      <scope>compile</scope>
    </dependency>
    <!-- Spark Streaming + Kafka 0.10 integration -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
      <version>${spark.version}</version>
      <scope>compile</scope>
    </dependency>
    <!-- Kafka -->
    <dependency>
      <groupId>org.apache.kafka</groupId>
      <artifactId>kafka_2.11</artifactId>
      <version>${kafka.version}</version>
      <scope>compile</scope>
    </dependency>
    <!-- HDFS Client -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
      <scope>compile</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-hive_2.11</artifactId>
      <version>${spark.version}</version>
      <scope>compile</scope>
    </dependency>
    <dependency>
      <groupId>org.spark-project.hive</groupId>
      <artifactId>hive-jdbc</artifactId>
      <version>1.2.1.spark2</version>
    </dependency>
    <dependency>
      <groupId>com.databricks</groupId>
      <artifactId>spark-csv_2.11</artifactId>
      <version>1.5.0</version>
    </dependency>
    <dependency>
      <groupId>mysql</groupId>
      <artifactId>mysql-connector-java</artifactId>
      <version>5.1.27</version>
    </dependency>
    <!-- fastjson -->
    <dependency>
      <groupId>com.alibaba</groupId>
      <artifactId>fastjson</artifactId>
      <version>1.2.47</version>
    </dependency>
  </dependencies>

  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <scalaVersion>${scala.version}</scalaVersion>
          <args>
            <arg>-target:jvm-1.8</arg>
          </args>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-eclipse-plugin</artifactId>
        <configuration>
          <downloadSources>true</downloadSources>
          <buildcommands>
            <buildcommand>ch.epfl.lamp.sdt.core.scalabuilder</buildcommand>
          </buildcommands>
          <additionalProjectnatures>
            <projectnature>ch.epfl.lamp.sdt.core.scalanature</projectnature>
          </additionalProjectnatures>
          <classpathContainers>
            <classpathContainer>org.eclipse.jdt.launching.JRE_CONTAINER</classpathContainer>
            <classpathContainer>ch.epfl.lamp.sdt.launching.SCALA_CONTAINER</classpathContainer>
          </classpathContainers>
        </configuration>
      </plugin>
    </plugins>
  </build>

  <reporting>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <configuration>
          <scalaVersion>${scala.version}</scalaVersion>
        </configuration>
      </plugin>
    </plugins>
  </reporting>
</project>
3. Create a configuration file, my.properties, with the Kafka broker and topic settings for this demo:
# kafka configs
kafka.bootstrap.servers=make.spark.com:9092,make.spark.com:9093,make.spark.com:9094
kafka.topic.source=spark-kafka-demo
kafka.topic.sink=spark-sink-test
kafka.group.id=spark_demo_gid1
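Note that my.properties needs to sit under src/main/resources so that it ends up on the runtime classpath, which is where the utility class in section 4.1 looks for it. A minimal sanity check, assuming that location (PropsCheck is just a throwaway name for illustration):

// Quick classpath check for my.properties (assumes the file lives under src/main/resources)
object PropsCheck extends App {
  val in = getClass.getResourceAsStream("/my.properties")
  require(in != null, "my.properties was not found on the classpath")
  val props = new java.util.Properties()
  props.load(in)
  in.close()
  // Print the broker list that both the consumer and the producer will use
  println(props.getProperty("kafka.bootstrap.servers"))
}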
4. Write the code
4.1 First, create the utility class that reads the my.properties configuration file:
package Utils

import java.util.Properties

/**
  * Utility class for reading Properties
  * Created by make on 2017-08-08 18:39
  */
object PropertiesUtil {

  /**
    * Load the configuration file and return a Properties object
    * @author make
    * @return java.util.Properties
    */
  def getProperties(): Properties = {
    val properties = new Properties()
    // Read my.properties from the resources folder on the classpath
    val reader = getClass.getResourceAsStream("/my.properties")
    properties.load(reader)
    reader.close()
    properties
  }

  /**
    * Get the String value for a key in the configuration file
    * @author make
    * @return String
    */
  def getPropString(key: String): String = {
    getProperties().getProperty(key)
  }

  /**
    * Get the Int value for a key (other typed getters may be added later as needed)
    * @author yore
    * @return Int
    */
  def getPropInt(key: String): Int = {
    getProperties().getProperty(key).toInt
  }

  /**
    * Get the Boolean value for a key in the configuration file
    * @author make
    * @return Boolean
    */
  def getPropBoolean(key: String): Boolean = {
    getProperties().getProperty(key).toBoolean
  }
}
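Usage is straightforward; a quick sketch of reading the keys defined in my.properties above (the demo object name is made up for illustration):

import Utils.PropertiesUtil

// Read the Kafka settings defined in my.properties
object PropertiesUtilDemo extends App {
  val brokers  = PropertiesUtil.getPropString("kafka.bootstrap.servers")
  val srcTopic = PropertiesUtil.getPropString("kafka.topic.source")
  val outTopic = PropertiesUtil.getPropString("kafka.topic.sink")
  println(s"brokers=$brokers, source=$srcTopic, sink=$outTopic")
}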
4.2 Next, create a KafkaSink class that instantiates the producer and sends data to Kafka:
package spark_stream

import java.util.concurrent.Future

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord, RecordMetadata}

/**
  * Hand-rolled KafkaSink class that wraps a producer and sends data to the target Kafka topic.
  * Passing a producer factory function instead of the producer itself is the key idea that
  * allows us to work around running into NotSerializableExceptions.
  * Created by make on 2018-08-08 18:50
  */
class KafkaSink[K, V](createProducer: () => KafkaProducer[K, V]) extends Serializable {

  // Create the producer lazily, i.e. on the executor rather than on the driver
  lazy val producer = createProducer()

  /** Send a message with a key; this simply delegates to producer.send */
  def send(topic: String, key: K, value: V): Future[RecordMetadata] =
    producer.send(new ProducerRecord[K, V](topic, key, value))

  /** Send a message without a key */
  def send(topic: String, value: V): Future[RecordMetadata] =
    producer.send(new ProducerRecord[K, V](topic, value))
}

// Companion object with convenience constructors for KafkaSink
object KafkaSink {
  import scala.collection.JavaConversions._

  def apply[K, V](config: Map[String, Object]): KafkaSink[K, V] = {
    val createProducerFunc = () => {
      val producer = new KafkaProducer[K, V](config)
      sys.addShutdownHook {
        // Ensure that, on executor JVM shutdown, the Kafka producer sends
        // any buffered messages to Kafka before shutting down.
        producer.close()
      }
      producer
    }
    // Return a KafkaSink that will build its producer on first use
    new KafkaSink(createProducerFunc)
  }

  def apply[K, V](config: java.util.Properties): KafkaSink[K, V] = apply(config.toMap)
}
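The reason for passing a factory function rather than the producer itself is that KafkaProducer is not serializable: only the function is shipped to the executors, and the lazy val means each executor builds its own producer on first use. A minimal sketch of the intended usage pattern, which is exactly what the main class in 4.3 does (ssc, stream and props are assumed to exist in the surrounding code):

// Broadcast the KafkaSink once on the driver, then reuse it on the executors inside foreachRDD
val sink = ssc.sparkContext.broadcast(KafkaSink[String, String](props))
stream.foreachRDD { rdd =>
  rdd.foreach { record =>
    // The producer is created lazily, once per executor, on the first send
    sink.value.send("spark-sink-test", record.value)
  }
}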
4.3 Create the main class:
package spark_stream

import java.util.Properties

import Utils.PropertiesUtil
import com.alibaba.fastjson.{JSON, JSONObject}
import org.apache.commons.lang3.StringUtils
import org.apache.kafka.common.serialization.{StringDeserializer, StringSerializer}
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.{Milliseconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies._
import org.apache.spark.streaming.kafka010.LocationStrategies._

object SparkKafkaDemo extends App {

  // default a Logger Object
  val LOG = org.slf4j.LoggerFactory.getLogger(SparkKafkaDemo.getClass)

  /*if (args.length < 2) {
    System.err.println(s"""
      |Usage: DirectKafkaWordCount <brokers> <topics>
      |  <brokers> is a list of one or more Kafka brokers
      |  <topics> is a list of one or more kafka topics to consume from
      |
      """.stripMargin)
    System.exit(1)
  }*/

  // Set the log level
  Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
  Logger.getLogger("org.apache.spark.sql").setLevel(Level.WARN)

  val Array(brokers, topics, outTopic) = /*args*/ Array(
    PropertiesUtil.getPropString("kafka.bootstrap.servers"),
    PropertiesUtil.getPropString("kafka.topic.source"),
    PropertiesUtil.getPropString("kafka.topic.sink")
  )

  // Create context
  /* Option 1 */
  val sparkConf = new SparkConf().setMaster("local[2]")
    .setAppName("spark-kafka-demo1")
  val ssc = new StreamingContext(sparkConf, Milliseconds(1000))

  /* Option 2 */
  /*val spark = SparkSession.builder()
    .appName("spark-kafka-demo1")
    .master("local[2]")
    .getOrCreate()
  // Import the implicit conversions so Scala objects can be converted to DataFrames
  import spark.implicits._
  val ssc = new StreamingContext(spark.sparkContext, Seconds(1))*/

  // Set the checkpoint directory
  ssc.checkpoint("spark_demo_cp1")

  // Create a direct Kafka stream from the brokers and topics
  // Note: the topics should be passed as an Array here; the Set form does not match
  //var topicSet = topics.split(",")/*.toSet*/
  val topicsArr: Array[String] = topics.split(",")

  // Set the Kafka consumer properties
  val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> brokers,
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> PropertiesUtil.getPropString("kafka.group.id"),
    "auto.offset.reset" -> "latest",
    "enable.auto.commit" -> (false: java.lang.Boolean)
  )

  /**
    * createStream is the method from the 0.8 integration package; it hands offset management over to ZooKeeper.
    *
    * The 0.10 integration package uses createDirectStream, which manages offsets itself. With this version,
    * zkCli no longer shows which offset each partition has been consumed up to, whereas the old version did.
    * It is much faster than leaving offsets to ZooKeeper, but offsets cannot be monitored there.
    * This method takes only 3 parameters and is the most convenient to use, but by default each start reads
    * from the latest offset; setting auto.offset.reset="earliest" makes it read from the earliest offset instead.
    *
    * Official docs: @see <a href="http://spark.apache.org/docs/2.1.2/streaming-kafka-0-10-integration.html">Spark Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher)</a>
    */
  val stream = KafkaUtils.createDirectStream[String, String](
    ssc,
    PreferConsistent,
    Subscribe[String, String](topicsArr, kafkaParams)
  )

  /** Kafka sink */
  // Set the producer config
  val kafkaProducer: Broadcast[KafkaSink[String, String]] = {
    val kafkaProducerConfig = {
      val p = new Properties()
      p.setProperty("bootstrap.servers", brokers)
      p.setProperty("key.serializer", classOf[StringSerializer].getName)
      p.setProperty("value.serializer", classOf[StringSerializer].getName)
      p
    }
    LOG.info("kafka producer init done!")
    // Broadcast the KafkaSink, passing kafkaProducerConfig; the producer is instantiated inside KafkaSink
    ssc.sparkContext.broadcast(KafkaSink[String, String](kafkaProducerConfig))
  }

  var jsonObject = new JSONObject()

  // Filter and process the records coming in on the stream
  stream.filter(record => {
    // Drop records that do not meet our requirements
    try {
      jsonObject = JSON.parseObject(record.value)
    } catch {
      case e: Exception => {
        LOG.error("Failed to parse the record as JSON!\t{}", e.getMessage)
        jsonObject = null // mark the record as invalid so it gets filtered out
      }
    }
    // Keep the record only if the value is non-empty and was parsed successfully
    StringUtils.isNotEmpty(record.value) && null != jsonObject
  }).map(record => {
    // Real business logic would go here; since this is a test, we simply return a tuple
    jsonObject = JSON.parseObject(record.value)
    // Return a tuple: (current timestamp, the date field from the JSON, the relater name from the JSON)
    (System.currentTimeMillis(),
      jsonObject.getString("date_dt"),
      jsonObject.getString("relater_name")
    )
  }).foreachRDD(rdd => {
    if (!rdd.isEmpty()) {
      rdd.foreach(kafkaTuple => {
        // Send the data to outTopic via the second send method of KafkaSink:
        // take the broadcast value and call send for every record
        kafkaProducer.value.send(
          outTopic,
          kafkaTuple._1 + "\t" + kafkaTuple._2 + "\t" + kafkaTuple._3
        )
        // Also print the record to the console so we can watch it
        println(kafkaTuple._1 + "\t" + kafkaTuple._2 + "\t" + kafkaTuple._3)
      })
    }
  })

  // Start the StreamingContext
  ssc.start()
  // Wait for data until terminated
  ssc.awaitTermination()
}
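One thing to note: enable.auto.commit is false and the code above never commits offsets back to Kafka, so consumption progress only lives in the checkpoint directory. If you also want the offsets visible in Kafka (for example to monitor consumer lag), the 0-10 integration lets you commit them yourself after each batch; a hedged sketch of how that would slot into the foreachRDD above:

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

// Sketch only: commit the processed offsets back to Kafka after each micro-batch.
// `stream` is the DStream returned by KafkaUtils.createDirectStream above.
stream.foreachRDD { rdd =>
  // Capture this batch's offset ranges before doing anything else with the RDD
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... filter / map / send to the sink topic as in the code above ...

  // Asynchronously commit the offsets for this batch under the configured group.id
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}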
5. On the Kafka cluster, create the topics together with a console producer and consumer
Create the two topics:
bin/kafka-topics.sh --create --zookeeper make.spark.com:2181/kafka_10 --topic spark-kafka-demo --partitions 3 --replication-factor 2
bin/kafka-topics.sh --create --zookeeper make.spark.com:2181/kafka_10 --partitions 3 --replication-factor 1 --topic spark-sink-test
Start a console producer to feed data to our program:
bin/kafka-console-producer.sh --broker-list make.spark.com:9092,make.spark.com:9093,make.spark.com:9094 --topic spark-kafka-demo
Start a console consumer to read the data our program sends:
bin/kafka-console-consumer.sh --bootstrap-server make.spark.com:9092,make.spark.com:9093,make.spark.com:9094 --from-beginning --topic spark-sink-test
6. Start the producer and our application, then type the test data below into the producer window:
{"date_dt": "201808081823","relater_name": "make"}
{"date_dt": "201808081823","relater_name": "make"}
{"date_dt": "201808081823","relater_name": "make"}
{"date_dt": "201808081823","relater_name": "make"}
{"date_dt": "201808081823","relater_name": "make"}
{"date_dt": "201808081823","relater_name": "make"}
{"date_dt": "201808081823","relater_name": "make"}
{"date_dt": "201808081823","relater_name": "make"}
{"date_dt": "201808081823","relater_name": "make"}
{"date_dt": "201808081823","relater_name": "make"}
{"date_dt": "201808081823","relater_name": "make"}
Switch back to IDEA and you can see the records being printed, which means our data has been sent out as well.
Switch to the consumer window and the data has arrived there too.
That completes a Kafka -> Spark Streaming -> Kafka pipeline that receives, processes, and re-sends data. I learned quite a bit putting it together.
Reference: the original article this post follows. Thanks to its author!