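Introduction to the Spark Streaming Framework
Spark Streaming is an extension of the Spark core API that supports processing of live data streams and is scalable, high-throughput, and fault-tolerant. Data can be ingested from many sources, such as Kafka, Flume, Kinesis, or TCP sockets, and processed with complex algorithms expressed through high-level functions such as map, reduce, join, and window. Finally, the processed data can be pushed out to file systems, databases, and so on. In fact, you can also apply Spark's machine learning and graph processing algorithms to data streams.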
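To make these high-level functions more concrete, here is a minimal sketch of a windowed word count built from flatMap, map, and a windowed reduce. The object name WindowedWordCount, the helper words, the socket source on localhost:4444, and the 9-second window sliding every 3 seconds are all assumptions chosen for illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedWordCount {
  // Hypothetical helper: split a line on spaces and drop empty tokens
  def words(line: String): Seq[String] = line.split(" ").filter(_.nonEmpty).toSeq

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("WindowedWordCount")
    // 3-second micro-batches
    val ssc = new StreamingContext(conf, Seconds(3))
    // Assumed source: a local TCP socket, e.g. one opened with netcat
    val lines = ssc.socketTextStream("localhost", 4444)
    val counts = lines
      .flatMap(line => words(line))
      .map(word => (word, 1))
      // Sum the counts over the last 9 seconds, recomputed every 3 seconds
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(9), Seconds(3))
    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

The window length and the slide interval must both be multiples of the batch interval, which is why 9 and 3 seconds are paired with a 3-second batch here.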
Framework Integration
Modify the pom file to add the dependencies
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
         http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.starnet.es</groupId>
    <artifactId>sparkstreaming-elasticsearch</artifactId>
    <version>1.0</version>
    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>3.0.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.12</artifactId>
            <version>3.0.0</version>
        </dependency>
        <dependency>
            <groupId>org.elasticsearch</groupId>
            <artifactId>elasticsearch</artifactId>
            <version>7.8.0</version>
        </dependency>
        <!-- Elasticsearch client -->
        <dependency>
            <groupId>org.elasticsearch.client</groupId>
            <artifactId>elasticsearch-rest-high-level-client</artifactId>
            <version>7.8.0</version>
        </dependency>
        <!-- Elasticsearch depends on log4j 2.x -->
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-api</artifactId>
            <version>2.8.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-core</artifactId>
            <version>2.8.2</version>
        </dependency>
    </dependencies>
</project>
The code is as follows:
package com.starnet.es

import org.apache.http.HttpHost
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.elasticsearch.action.index.{IndexRequest, IndexResponse}
import org.elasticsearch.client.{RequestOptions, RestClient, RestHighLevelClient}
import org.elasticsearch.common.xcontent.XContentType

object SparkStreamingEsTest {
  def main(args: Array[String]): Unit = {
    val sparkConf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("ESTest")
    // Process the stream in 3-second micro-batches
    val ssc = new StreamingContext(sparkConf, Seconds(3))
    // Read lines of text from the local socket opened by netcat
    val ds: ReceiverInputDStream[String] = ssc.socketTextStream("localhost", 4444)
    ds.foreachRDD(
      rdd => {
        // Create one client per partition rather than one per record
        rdd.foreachPartition(
          iter => {
            val client = new RestHighLevelClient(RestClient.builder(new HttpHost("hadoop113", 9200)))
            try {
              iter.foreach(
                data => {
                  // Expected input format: "<id> <value>"
                  val str = data.split(" ")
                  // Skip lines that do not contain both fields
                  if (str.length >= 2) {
                    val request = new IndexRequest()
                    request.index("product").id(str(0))
                    val json =
                      s"""
                         | { "data" : "${str(1)}" }
                         |""".stripMargin
                    request.source(json, XContentType.JSON)
                    val response: IndexResponse = client.index(request, RequestOptions.DEFAULT)
                    println(response.getResult)
                  }
                }
              )
            } finally {
              // Release the client even if indexing fails
              client.close()
            }
          }
        )
      }
    )
    ssc.start()
    ssc.awaitTermination()
  }
}
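The job above interpolates the second field of each line directly into a JSON string, so a value containing a double quote or a backslash would produce invalid JSON. One possible way to guard against that is sketched below. The object IndexDocBuilder and its methods escapeJson and toIndexDoc are hypothetical helpers, not part of Spark or Elasticsearch. The sketch also splits on the first space only, so everything after the id is kept as the value, which is a design choice that differs from the job above.

```scala
object IndexDocBuilder {
  // Escape a single character so it is safe inside a JSON string literal
  private def escapeChar(c: Char): String = c match {
    case '"'  => "\\\""
    case '\\' => "\\\\"
    case '\n' => "\\n"
    case '\r' => "\\r"
    case '\t' => "\\t"
    // Remaining control characters become \uXXXX escapes
    case ch if ch < ' ' => "\\" + "u%04x".format(ch.toInt)
    case ch => ch.toString
  }

  def escapeJson(value: String): String =
    value.map(c => escapeChar(c)).mkString

  // Parse "<id> <value>" into (id, JSON body); None if a field is missing
  def toIndexDoc(line: String): Option[(String, String)] =
    line.trim.split(" ", 2) match {
      case Array(id, value) if id.nonEmpty =>
        Some((id, s"""{ "data" : "${escapeJson(value)}" }"""))
      case _ => None
    }
}
```

Inside the streaming job, the result could be passed to request.id(...) and request.source(..., XContentType.JSON), and lines that yield None could simply be skipped.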
To test, first use netcat to open local port 4444
nc -lk 4444
Then start the application and type the following into the nc session
1234 test
Query ES to check the document
GET http://10.10.10.113:9200/product/_doc/1234
The result below shows that the integration works
{
"_index": "product",
"_type": "_doc",
"_id": "1234",
"_version": 1,
"_seq_no": 7,
"_primary_term": 1,
"found": true,
"_source": {
"data": "test"
}
}
The above covers only integration with the Spark Streaming framework, but integration with Spark itself is almost identical.
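As a rough illustration of that similarity, here is a minimal batch sketch that writes an RDD to the same product index. The object name SparkEsBatchTest, the helper parse, and the sample records are assumptions made for this example. The host hadoop113:9200 is taken from the streaming job.

```scala
import org.apache.http.HttpHost
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.action.index.IndexRequest
import org.elasticsearch.client.{RequestOptions, RestClient, RestHighLevelClient}
import org.elasticsearch.common.xcontent.XContentType

object SparkEsBatchTest {
  // Hypothetical helper: parse "<id> <value>" into (id, value)
  def parse(line: String): Option[(String, String)] =
    line.split(" ") match {
      case Array(id, value, _*) => Some((id, value))
      case _ => None
    }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("ESBatchTest")
    val sc = new SparkContext(conf)
    // Made-up sample data in the same format as the streaming input
    val rdd = sc.parallelize(Seq("1001 apple", "1002 banana"))
    rdd.foreachPartition { iter =>
      // One client per partition, closed when the partition is done
      val client = new RestHighLevelClient(RestClient.builder(new HttpHost("hadoop113", 9200)))
      try {
        iter.flatMap(line => parse(line)).foreach { case (id, value) =>
          val request = new IndexRequest("product").id(id)
          request.source(s"""{ "data" : "$value" }""", XContentType.JSON)
          println(client.index(request, RequestOptions.DEFAULT).getResult)
        }
      } finally {
        client.close()
      }
    }
    sc.stop()
  }
}
```

The only structural difference from the streaming version is that the RDD is created directly from a SparkContext instead of arriving through foreachRDD on a DStream.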