1. Bulk writing to ES from Spark
Normally, when a Spark job needs to write to Elasticsearch, we use ES-Hadoop. Refer to the official documentation and pick the version that matches your stack; if Hive, Spark, and the rest are all involved, you can simply depend on the all-in-one artifact:
<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch-hadoop</artifactId>
    <version>7.1.1</version>
</dependency>
Since only Spark is used here (Spark 2.3, Scala 2.11, Elasticsearch 7.1.1), it is enough to pull in the Spark-specific artifact:
<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch-spark-20_2.11</artifactId>
    <version>7.1.1</version>
</dependency>
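If the build uses sbt rather than Maven, the equivalent dependency line (same coordinates as above) would be:

```scala
libraryDependencies += "org.elasticsearch" % "elasticsearch-spark-20_2.11" % "7.1.1"
```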
2. Writing to Elasticsearch from Java Spark
The Java code for writing to ES can look like this:
@Data
public class UserProfileRecord {
    public String uid;
    public String want_val;
}
SparkConf sparkConf = new SparkConf()
.setAppName(JOB_NAME)
.set(ConfigurationOptions.ES_NODES, esHost)
.set(ConfigurationOptions.ES_PORT, esPort)
.set(ConfigurationOptions.ES_NET_HTTP_AUTH_USER, esUser)
.set(ConfigurationOptions.ES_NET_HTTP_AUTH_PASS, esPass)
.set(ConfigurationOptions.ES_BATCH_SIZE_ENTRIES, "500")
.set(ConfigurationOptions.ES_MAPPING_ID, "uid");
SparkSession sparkSession = SparkSession.builder().config(sparkConf).getOrCreate();
Dataset<Row> wantedCols = sparkSession.read().parquet(path);
Dataset<UserProfileRecord> searchUserProfile = wantedCols.mapPartitions(
    new MapPartitionsFunction<Row, UserProfileRecord>() {
        @Override
        public Iterator<UserProfileRecord> call(Iterator<Row> input) throws Exception {
            List<UserProfileRecord> cleanProfileList = new LinkedList<>();
            while (input.hasNext()) {
                UserProfileRecord aRecord = new UserProfileRecord();
                // ... populate aRecord from the input Row ...
                cleanProfileList.add(aRecord);
            }
            return cleanProfileList.iterator();
        }
    },
    Encoders.bean(UserProfileRecord.class));
EsSparkSQL.saveToEs(searchUserProfile.repartition(3), this.writeIndex);
Because the ES cluster currently has only 3 nodes, repartition(3) is used to reduce the number of tasks writing to ES to 3, easing the pressure on the cluster. In practice the primary shards sustain an average write rate of about 30k docs/s, but when the job produces a large amount of data the write phase runs long, still puts significant load on the cluster, and causes some queries to time out.
After digging through the official documentation, I found very little that can be tuned: generally you adjust the number of partitions and ConfigurationOptions.ES_BATCH_SIZE_ENTRIES to throttle the write rate. I tried every combination I could think of, with little effect.
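For reference, these are the batching and back-off settings ES-Hadoop exposes (names per the official configuration documentation; the values here are illustrative). They size bulk requests per task and retry on rejection, but none of them rate-limits writes over time, which is why tuning them helped so little here:

```properties
# Per-task bulk sizing: a bulk request is flushed once either threshold is hit
es.batch.size.entries = 500
es.batch.size.bytes = 1mb
# Back off and retry when ES rejects bulk items under load
es.batch.write.retry.count = 3
es.batch.write.retry.wait = 10s
```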
I originally planned to use the Elasticsearch Java client to issue REST requests directly (that would make rate control straightforward), but after reading the es_hadoop source it was clear its writer is already well optimized: it batches documents into bulk requests and maintains a mapping from Spark partitions to the index's shards and replicas. So it made more sense to keep es_hadoop and, with no better option left, modify its source.
3. Extending the es_hadoop source
Two Scala files were added (forcing Scala into the project 😂):
MyEsSparkSQL
MyEsDataFrameWriter
Note that the package name must be org.elasticsearch.spark.sql, so that the code can still reach the package-private helpers it reuses (such as the writer classes referenced below).
1. MyEsSparkSQL
package org.elasticsearch.spark.sql
import org.apache.commons.logging.LogFactory
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SparkSession
import org.elasticsearch.hadoop.EsHadoopIllegalArgumentException
import org.elasticsearch.hadoop.cfg.ConfigurationOptions.ES_QUERY
import org.elasticsearch.hadoop.cfg.ConfigurationOptions.ES_RESOURCE_READ
import org.elasticsearch.hadoop.cfg.ConfigurationOptions.ES_RESOURCE_WRITE
import org.elasticsearch.hadoop.cfg.PropertiesSettings
import org.elasticsearch.hadoop.rest.InitializationUtils
import org.elasticsearch.hadoop.util.ObjectUtils
import org.elasticsearch.spark.cfg.SparkSettingsManager
import scala.collection.JavaConverters.mapAsJavaMapConverter
import scala.collection.JavaConverters.propertiesAsScalaMapConverter
import scala.collection.Map
object MyEsSparkSQL {

  private val init = { ObjectUtils.loadClass("org.elasticsearch.spark.rdd.CompatUtils", classOf[ObjectUtils].getClassLoader) }

  @transient private[this] val LOG = LogFactory.getLog(MyEsSparkSQL.getClass)

  //
  // Read
  //

  def esDF(sc: SQLContext): DataFrame = esDF(sc, Map.empty[String, String])
  def esDF(sc: SQLContext, resource: String): DataFrame = esDF(sc, Map(ES_RESOURCE_READ -> resource))
  def esDF(sc: SQLContext, resource: String, query: String): DataFrame = esDF(sc, Map(ES_RESOURCE_READ -> resource, ES_QUERY -> query))

  def esDF(sc: SQLContext, cfg: Map[String, String]): DataFrame = {
    val esConf = new SparkSettingsManager().load(sc.sparkContext.getConf).copy()
    esConf.merge(cfg.asJava)
    sc.read.format("org.elasticsearch.spark.sql").options(esConf.asProperties.asScala.toMap).load
  }

  def esDF(sc: SQLContext, resource: String, query: String, cfg: Map[String, String]): DataFrame = {
    esDF(sc, collection.mutable.Map(cfg.toSeq: _*) += (ES_RESOURCE_READ -> resource, ES_QUERY -> query))
  }

  def esDF(sc: SQLContext, resource: String, cfg: Map[String, String]): DataFrame = {
    esDF(sc, collection.mutable.Map(cfg.toSeq: _*) += (ES_RESOURCE_READ -> resource))
  }

  // SparkSession variants
  def esDF(ss: SparkSession): DataFrame = esDF(ss.sqlContext, Map.empty[String, String])
  def esDF(ss: SparkSession, resource: String): DataFrame = esDF(ss.sqlContext, Map(ES_RESOURCE_READ -> resource))
  def esDF(ss: SparkSession, resource: String, query: String): DataFrame = esDF(ss.sqlContext, Map(ES_RESOURCE_READ -> resource, ES_QUERY -> query))
  def esDF(ss: SparkSession, cfg: Map[String, String]): DataFrame = esDF(ss.sqlContext, cfg)
  def esDF(ss: SparkSession, resource: String, query: String, cfg: Map[String, String]): DataFrame = esDF(ss.sqlContext, resource, query, cfg)
  def esDF(ss: SparkSession, resource: String, cfg: Map[String, String]): DataFrame = esDF(ss.sqlContext, resource, cfg)

  //
  // Write
  //

  def saveToEs(srdd: Dataset[_], resource: String): Unit = {
    saveToEs(srdd, Map(ES_RESOURCE_WRITE -> resource))
  }

  def saveToEs(srdd: Dataset[_], resource: String, cfg: Map[String, String]): Unit = {
    saveToEs(srdd, collection.mutable.Map(cfg.toSeq: _*) += (ES_RESOURCE_WRITE -> resource))
  }

  def saveToEs(srdd: Dataset[_], cfg: Map[String, String]): Unit = {
    if (srdd != null) {
      if (srdd.isStreaming) {
        throw new EsHadoopIllegalArgumentException("Streaming Datasets should not be saved with 'saveToEs()'. Instead, use " +
          "the 'writeStream().format(\"es\").save()' methods.")
      }
      val sparkCtx = srdd.sqlContext.sparkContext
      val sparkCfg = new SparkSettingsManager().load(sparkCtx.getConf)
      val esCfg = new PropertiesSettings().load(sparkCfg.save())
      esCfg.merge(cfg.asJava)

      // Need to discover the ES version before checking index existence
      InitializationUtils.discoverClusterInfo(esCfg, LOG)
      InitializationUtils.checkIdForOperation(esCfg)
      InitializationUtils.checkIndexExistence(esCfg)

      sparkCtx.runJob(srdd.toDF().rdd, new MyEsDataFrameWriter(srdd.schema, esCfg.save()).write _)
    }
  }
}
This class is a straight copy of EsSparkSQL; the only change is the last statement of the
def saveToEs(srdd: Dataset[_], cfg: Map[String, String]): Unit
method, which goes from
sparkCtx.runJob(srdd.toDF().rdd, new EsDataFrameWriter(srdd.schema, esCfg.save()).write _)
to
sparkCtx.runJob(srdd.toDF().rdd, new MyEsDataFrameWriter(srdd.schema, esCfg.save()).write _)
2. MyEsDataFrameWriter
package org.elasticsearch.spark.sql

import org.apache.spark.TaskContext
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType
import org.elasticsearch.hadoop.rest.RestService
import org.elasticsearch.hadoop.serialization.{BytesConverter, JdkBytesConverter}
import org.elasticsearch.hadoop.serialization.builder.ValueWriter
import org.elasticsearch.hadoop.serialization.field.FieldExtractor
import org.elasticsearch.spark.rdd.EsRDDWriter

/**
 * Created by chencc on 2020/8/31.
 */
class MyEsDataFrameWriter(schema: StructType, override val serializedSettings: String)
  extends EsRDDWriter[Row](serializedSettings) {

  override protected def valueWriter: Class[_ <: ValueWriter[_]] = classOf[DataFrameValueWriter]
  override protected def bytesConverter: Class[_ <: BytesConverter] = classOf[JdkBytesConverter]
  override protected def fieldExtractor: Class[_ <: FieldExtractor] = classOf[DataFrameFieldExtractor]
  override protected def processData(data: Iterator[Row]): Any = { (data.next, schema) }

  override def write(taskContext: TaskContext, data: Iterator[Row]): Unit = {
    // `log` is inherited from EsRDDWriter, so no extra logging setup is needed
    val writer = RestService.createWriter(settings, taskContext.partitionId.toLong, -1, log)
    taskContext.addTaskCompletionListener((_: TaskContext) => writer.close())
    if (runtimeMetadata) {
      writer.repository.addRuntimeFieldExtractor(metaExtractor)
    }
    // Throttle: after every 500 documents, pause for 100 ms
    var counter = 0
    while (data.hasNext) {
      counter += 1
      writer.repository.writeToIndex(processData(data))
      if (counter >= 500) {
        Thread.sleep(100)
        counter = 0
        log.info("wrote a batch of 500 docs, sleeping 100 milliseconds")
      }
    }
  }
}
MyEsDataFrameWriter overrides EsRDDWriter's write method to insert a sleep between batches; in practice, tune this to match your cluster. The hard-coded 500 and 100 could also be read from SparkConf, which would make this far more flexible.
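The throttling loop can be isolated into a small, self-contained helper to show the mechanics; the class and method names below are illustrative (not part of es_hadoop), with the batch size and pause passed in as parameters as the SparkConf idea suggests.

```java
import java.util.Iterator;
import java.util.function.Consumer;

// Minimal sketch of the throttling logic in MyEsDataFrameWriter.write:
// after every `batchSize` records handed to the sink, sleep `pauseMillis`.
class ThrottledSink {
    private final int batchSize;
    private final long pauseMillis;

    ThrottledSink(int batchSize, long pauseMillis) {
        this.batchSize = batchSize;
        this.pauseMillis = pauseMillis;
    }

    // Drains `data` into `sink`, pausing after each full batch;
    // returns the number of records delivered.
    <T> int drain(Iterator<T> data, Consumer<T> sink) {
        int counter = 0;
        int total = 0;
        while (data.hasNext()) {
            sink.accept(data.next());
            total++;
            counter++;
            if (counter >= batchSize) {
                counter = 0;
                try {
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        return total;
    }
}
```

In the real writer the two numbers (500 and 100) would come from SparkConf instead of being hard-coded, and the sink would be writer.repository.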
Configured this way, the rate at which Spark writes to ES can be throttled quite effectively.