I recently needed to write Spark data into Elasticsearch. Having tested it myself, here are the results.
Straight to the code.
Create a Maven project with the following pom file:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.example</groupId>
<artifactId>estest</artifactId>
<version>1.0-SNAPSHOT</version>
<properties>
<spark.version>2.4.4</spark.version>
<scala.version>2.11.12</scala.version>
<hadoop.version>2.6.0</hadoop.version>
<elasticsearch.version>7.12.0</elasticsearch.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.elasticsearch</groupId>
<artifactId>elasticsearch-spark-20_2.11</artifactId>
<version>${elasticsearch.version}</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.json4s/json4s-native -->
<dependency>
<groupId>org.json4s</groupId>
<artifactId>json4s-native_2.11</artifactId>
<version>3.6.11</version>
<scope>provided</scope>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.2</version>
<configuration>
<recompileMode>incremental</recompileMode>
</configuration>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<version>2.2.1</version>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
The connector offers two main ways to read and write: one passes the ES source to DataFrameReader/DataFrameWriter, the other reads and writes the DataFrame directly (the esDF()/saveToEs() methods shown below). Before implementing either, a few configuration options are worth knowing:
Parameter | Description |
---|---|
es.nodes.wan.only | true or false; in this mode the connector disables node discovery and routes all operations through the declared es.nodes |
es.nodes | ES nodes |
es.port | ES port |
es.index.auto.create | true or false; whether to create the index automatically |
es.resource | Resource path (see the sketch after this table) |
es.mapping.id | ES assigns an id to every document. If this parameter is not set, the id is auto-generated; if set, the value of the specified field is used as the document id |
es.batch.size.bytes | Size (in bytes) of the bulk writes sent through the ES bulk API |
es.batch.write.refresh | Whether to call an index refresh after a bulk update completes |
es.read.field.as.array.include | When reading from ES, which fields should be treated as arrays |
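For reference, below is a minimal sketch of wiring several of these options through the DataFrame reader, with the resource supplied via es.resource instead of a load() path; the node addresses and the es_test/_doc index are placeholders reused from the examples that follow:
import org.apache.spark.sql.SparkSession
object Es_Options_Sketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("Es_Options_Sketch").getOrCreate()
    val df = spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes.wan.only", "true")
      .option("es.nodes", "190.191.200.141,190.191.200.142,190.191.200.143")
      .option("es.port", "9200")
      .option("es.resource", "es_test/_doc") // resource passed as an option rather than load(path)
      .load()
    df.show()
  }
}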
Note: the ES connector jar is required at runtime. You can bundle elasticsearch-spark-20_2.11-7.12.0.jar into your application jar, upload it to the server and pass it in the shell with --jars (e.g. --jars /home/pro/muzili/applications/sbin/elasticsearch-spark-20_2.11-7.12.0.jar; see the submit script in section 6), or drop it into the jars directory of the Spark installation. Otherwise the job fails with a ClassNotFoundException. When es.index.auto.create is set to true, the target index is created automatically on write if it does not already exist.
1. Reading from ES with DataFrameReader
Approach 1:
package com.muzili.applications
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}
object Spark_Read_Es2 {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("Spark_Read_Es2").setMaster("local[*]")
conf.set("es.index.auto.create","true")
conf.set("es.nodes","190.191.200.141,190.191.200.142,190.191.200.143")
conf.set("es.port","9200")
conf.set("es.nodes.wan.only","true")
conf.set("es.read.field.as.array.include","array名字")//
val spark: SparkSession = SparkSession.builder().config(conf).getOrCreate()
val ess: DataFrame = spark.sqlContext.read.format("org.elasticsearch.spark.sql")
.option("inferSchema", "true")
.load("es_test/_doc")
ess.show(false)
}
}
Approach 2:
package com.muzili.applications
import org.apache.spark.sql.{DataFrame, SparkSession}
object Spark_Read_Es1 {
def main(args: Array[String]): Unit = {
val spark: SparkSession = SparkSession.builder().appName("Spark_Read_Es1").getOrCreate()
val options = Map(
"es.nodes.wan.only" -> "true",
"es.nodes" -> "190.191.200.141,190.191.200.142,190.191.200.143",
"es.port" -> "9200",
"es.read.field.as.array.include" -> "arr1, arr2"
)
val df: DataFrame = spark
.read
.format("org.elasticsearch.spark.sql")
.options(options)
.load("es_test/_doc")
df.show()
}
}
2. Writing to ES with DataFrameWriter
package com.muzili.applications
import org.apache.spark.sql.{SaveMode, SparkSession}
object Spark_To_Es1 {
def main(args: Array[String]): Unit = {
val spark: SparkSession = SparkSession.builder().appName("Spark_To_Es1").getOrCreate()
val options = Map(
"es.index.auto.create" -> "true",
"es.nodes.wan.only" -> "true",
"es.nodes" -> "190.191.200.141,190.191.200.142,190.191.200.143",
"es.port" -> "9200",
"es.mapping.id" -> "id"
)
val sourceDF = spark.read.parquet("/user/pro/tmp/20200521")
sourceDF
.write
.format("org.elasticsearch.spark.sql")
.options(options)
.mode(SaveMode.Append)
.save("es_test/_doc")
}
}
3. Reading ES into a DataFrame with esDF()
The connector jar also provides the esDF() method, which reads ES data directly into a DataFrame. The relevant source is:
class SparkSessionFunctions(ss: SparkSession) extends Serializable {
def esDF() = EsSparkSQL.esDF(ss)
def esDF(resource: String) = EsSparkSQL.esDF(ss, resource)
def esDF(resource: String, query: String) = EsSparkSQL.esDF(ss, resource, query)
def esDF(cfg: scala.collection.Map[String, String]) = EsSparkSQL.esDF(ss, cfg)
def esDF(resource: String, cfg: scala.collection.Map[String, String]) = EsSparkSQL.esDF(ss, resource, cfg)
def esDF(resource: String, query: String, cfg: scala.collection.Map[String, String]) = EsSparkSQL.esDF(ss, resource, query, cfg)
}
A quick note on the parameters:
resource: resource path, e.g. hive_table/docs
cfg: ES configuration options, much like the options maps in the code above
query: a query that filters the data being read; for example "?q=user_group_id:3" reads only documents whose user_group_id is 3 (a full query DSL string also works, see the sketch after the snippet below)
import org.elasticsearch.spark.sql._ // brings the implicit esDF() method into scope
val options = Map(
"pushdown" -> "true",
"es.nodes.wan.only" -> "true",
"es.nodes" -> "190.191.200.141,190.191.200.142,190.191.200.143",
"es.port" -> "9200"
)
val df = spark.esDF("es_test/docs", "?q=user_group_id:3", options)
df.show()
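The query argument is not limited to the URI form; as far as I know es-hadoop also accepts a full JSON query DSL string, so the same read could be sketched as:
val dslQuery = """{"query": {"term": {"user_group_id": 3}}}"""
val dfDsl = spark.esDF("es_test/docs", dslQuery, options)
dfDsl.show()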
4. Writing a DataFrame to ES with saveToEs()
The connector jar provides the saveToEs() method, which writes a DataFrame to ES. The relevant source is:
// the sparkDatasetFunctions already takes care of this
// but older clients might still import it hence why it's still here
implicit def sparkDataFrameFunctions(df: DataFrame) = new SparkDataFrameFunctions(df)
class SparkDataFrameFunctions(df: DataFrame) extends Serializable {
def saveToEs(resource: String): Unit = { EsSparkSQL.saveToEs(df, resource) }
def saveToEs(resource: String, cfg: scala.collection.Map[String, String]): Unit = { EsSparkSQL.saveToEs(df, resource, cfg) }
def saveToEs(cfg: scala.collection.Map[String, String]): Unit = { EsSparkSQL.saveToEs(df, cfg) }
}
resource: resource path, e.g. hive_table/docs
cfg: ES configuration options, much like the options maps in the code above
package com.muzili.applications
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._
object Spark_To_Es2 {
def main(args: Array[String]): Unit = {
val spark: SparkSession = SparkSession.builder().appName("Spark_To_Es2").getOrCreate()
val options = Map(
"es.index.auto.create" -> "true",
"es.nodes.wan.only" -> "true",
"es.nodes" -> "190.191.200.141,190.191.200.142,190.191.200.143",
"es.port" -> "9200",
"es.mapping.id" -> "id"
)
val df = spark.read.parquet("/user/pro/tmp/20200521")
df.saveToEs("es_test/docs", options)
}
}
5.Structured Streaming - ES
ES also integrates with Structured Streaming, so you can write to ES in real time from a streaming query.
import org.apache.spark.sql.streaming.OutputMode
import org.elasticsearch.spark.sql._
val options = Map(
"es.index.auto.create" -> "true",
"es.nodes.wan.only" -> "true",
"es.nodes" -> "190.191.200.141,190.191.200.142,190.191.200.143",
"es.port" -> "9200",
"es.mapping.id" -> "id"
)
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "a:9092,b:9092,c:9092")
.option("subscribe", "test")
.option("failOnDataLoss", "false")
.load()
df
.writeStream
.outputMode(OutputMode.Append())
.format("es")
.option("checkpointLocation", s"hdfs://hadoop:8020/checkpoint/test01")
.options(options)
.start("test_streaming/docs")
.awaitTermination()
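Note that the Kafka source only exposes binary key/value columns (plus metadata), so the raw df above has no id field for es.mapping.id to pick up. In practice the value is parsed into the target document schema before writing; here is a minimal sketch assuming the Kafka value is a JSON string with id and name fields (both field names are assumptions for illustration):
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Assumed document schema; adjust to the actual JSON payload
val schema = StructType(Seq(
  StructField("id", StringType),
  StructField("name", StringType)
))

// Parse the binary Kafka value into typed columns before writing
val parsed = df
  .select(from_json(col("value").cast("string"), schema).as("doc"))
  .select("doc.*")

// then write parsed with the same writeStream chain as above, e.g.
// parsed.writeStream.outputMode(OutputMode.Append()).format("es").options(options)...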
6. Hands-on examples
1. Writing data from HDFS to ES:
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.elasticsearch.spark.sql._
object Spark_To_Es {
def main(args: Array[String]): Unit = {
val spark: SparkSession = SparkSession.builder().appName("Spark_To_Es") //.master("local[*]")
.config("spark.es.nodes", "190.191.200.141,190.191.200.142,190.191.200.143")
.config("spark.es.port", "9200")
.config("spark.es.mapping.id","id")
.config("es.batch.size.bytes","0.5mb")
.config("es.batch.size.entries","500")
.config("es.batch.write.retry.count","5")
.config("es.write.operation","upsert") // update/upsert/default
.getOrCreate()
val EsReadPath1 = "/user/pro/muzili/picture_code_in_one/full_update/01/result/face_archive.json/*"
val sourceDF1: DataFrame = spark.read.json(EsReadPath1)
sourceDF1.printSchema()
sourceDF1.repartition(1).saveToEs("figure_code/_doc")
println("------------数据写入成功----------------")
}
}
2. Writing data from MySQL to ES:
import java.util.Properties
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.elasticsearch.spark.sql.EsSparkSQL
/**
* author:muzili
* name:ESDemo
*/
object ESDemo {
def main(args: Array[String]): Unit = {
val sparkConf = new SparkConf().setAppName(ESDemo.getClass.getName).setMaster("local")
sparkConf.set("es.nodes","190.191.200.141,190.191.200.142,190.191.200.143")
sparkConf.set("es.port","9200")
sparkConf.set("es.index.auto.create", "true")
sparkConf.set("es.write.operation", "index")
val sparkSession: SparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
val url: String = "jdbc:mysql://localhost:3306/testdb"
val table: String = "courses"
val properties: Properties = new Properties()
properties.put("user","root")
properties.put("password","123456")
properties.put("driver","com.mysql.jdbc.Driver")
val course: DataFrame = sparkSession.read.jdbc(url,table,properties)
course.show()
EsSparkSQL.saveToEs(course,"course")
sparkSession.stop()
}
}
3. Writing data from PostgreSQL to ES:
package com.muzili.applications
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.elasticsearch.spark.sql.EsSparkSQL
/**
* author:muzili
* name:ESDemo
*/
object ESDemo {
def main(args: Array[String]): Unit = {
val sparkConf = new SparkConf().setAppName(ESDemo.getClass.getName)//.setMaster("local")
sparkConf.set("es.nodes","190.191.200.141,190.191.200.142,190.191.200.143")
sparkConf.set("es.port","9200")
sparkConf.set("es.index.auto.create", "true")
sparkConf.set("es.write.operation", "index")
val spark: SparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
val df: DataFrame = read_pgsql(spark, "device_report_20210413")
df.show()
EsSparkSQL.saveToEs(df,"test") //resource:资源路径,例如hive_table/docs
spark.stop()
}
def read_pgsql(spark:SparkSession,table_name:String): DataFrame = {
import java.util.Properties
val url = "jdbc:postgresql://190.176.35.140:5432/data_governance_db?user=root&password=123456"
val connectionProperties = new Properties()
connectionProperties.setProperty("Driver","org.postgresql.Driver")
val df = spark.read.jdbc(url, table_name, connectionProperties)
df
}
}
4. Writing JSON data to ES:
package com.muzili.applications
import org.apache.spark.sql.{SaveMode, SparkSession}
object Spark_To_Es_Test {
//com.muzili.applications.Spark_To_Es_Test
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder()
.appName("Spark_To_Es")
.getOrCreate()
val options = Map(
"es.index.auto.create" -> "true",
"es.nodes.wan.only" -> "true",
"es.nodes" -> "190.191.200.141,190.191.200.142,190.191.200.143",
"es.port" -> "9200",
"es.mapping.id" -> "aid"
)
val sourceDF = spark.read.format("json").load("/user/xdata/aid/picture_code_in_one/test/result.json/*")
sourceDF.show(10,false)
sourceDF
.write
.format("org.elasticsearch.spark.sql")
.options(options)
.mode(SaveMode.Append)
.save("es_test/_doc")
}
}
Submit script:
#!/bin/bash
# deployed on node 102 under /home/pro/muzili
BASE_HOME=/home/pro/muzili/DataFusion
LOGDIR=$BASE_HOME/logs/Spark_To_Es.out
spark-submit --master spark://190.176.35.102:7079 \
--conf spark.cores.max=71 \
--driver-memory 18g \
--jars /home/pro/muzili/DataFusion/sbin/elasticsearch-spark-20_2.11-7.12.0.jar \
--class com.muzili.applications.Spark_To_Es \
/home/pro/muzili/DataFusion/sbin/estest-1.0-SNAPSHOT.jar > $LOGDIR 2>&1 &
echo "Script executed successfully!"
7.References
ES Spark Support documentation:
https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html#spark
ES Spark Configuration:
https://www.elastic.co/guide/en/elasticsearch/hadoop/current/configuration.html