Throttling Spark writes to Elasticsearch

1. Bulk-writing to ES from Spark

Normally, when a Spark job needs to write to ES, we use ES-Hadoop. Pick the version that matches your stack from the official documentation; if Hive, Spark, and friends are all in play you can simply declare the umbrella dependency:

<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch-hadoop</artifactId>
  <version>7.1.1</version>
</dependency>

Since we only use Spark here (Spark 2.3, Scala 2.11, Elasticsearch 7.1.1), pulling in just the Spark artifact is enough:


<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch-spark-20_2.11</artifactId>
  <version>7.1.1</version>
</dependency>

2. Writing to Elasticsearch from Spark in Java

The Java code for writing to ES can look like this:

@Data
public class UserProfileRecord {
    public String uid;
    public String want_val;
}

SparkConf sparkConf = new SparkConf()
        .setAppName(JOB_NAME)
        .set(ConfigurationOptions.ES_NODES, esHost)
        .set(ConfigurationOptions.ES_PORT, esPort)
        .set(ConfigurationOptions.ES_NET_HTTP_AUTH_USER, esUser)
        .set(ConfigurationOptions.ES_NET_HTTP_AUTH_PASS, esPass)
        .set(ConfigurationOptions.ES_BATCH_SIZE_ENTRIES, "500")   // documents per bulk request
        .set(ConfigurationOptions.ES_MAPPING_ID, "uid");          // use the uid field as the document _id

SparkSession sparkSession = SparkSession.builder().config(sparkConf).getOrCreate();
Dataset<Row> wantedCols = sparkSession.read().parquet(path);

Dataset<UserProfileRecord> searchUserProfile = wantedCols.mapPartitions(
        new MapPartitionsFunction<Row, UserProfileRecord>() {
            @Override
            public Iterator<UserProfileRecord> call(Iterator<Row> input) throws Exception {
                List<UserProfileRecord> cleanProfileList = new LinkedList<>();
                while (input.hasNext()) {
                    UserProfileRecord aRecord = new UserProfileRecord();
                    ...
                    ...
                    ...
                    cleanProfileList.add(aRecord);
                }
                return cleanProfileList.iterator();
            }
        },
        Encoders.bean(UserProfileRecord.class));

EsSparkSQL.saveToEs(searchUserProfile.repartition(3), this.writeIndex);


  Since the ES cluster currently has only 3 nodes, a repartition brings the number of writing tasks down to 3, which reduces the pressure on ES. In practice the primaries sustain an average write rate of around 30k docs/s, but when a job produces a lot of data the write phase drags on, still putting noticeable load on the cluster and causing some queries to time out.
  After combing through the official documentation, there is not much to tune: essentially the number of partitions and ConfigurationOptions.ES_BATCH_SIZE_ENTRIES are the knobs for throttling writes to ES (they are summarized in the sketch below). I experimented with various combinations, with little effect.
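
For reference, a minimal sketch of those stock knobs through the es-hadoop Scala API; the index name and the values are placeholders, not what we ran in production:

import org.elasticsearch.spark.sql.EsSparkSQL

// Fewer partitions mean fewer concurrent writer tasks; smaller batches mean smaller bulk requests.
EsSparkSQL.saveToEs(
  searchUserProfile.repartition(3),           // one write task per ES data node
  "my_index",                                 // placeholder for the real target index
  Map(
    "es.batch.size.entries"     -> "500",     // ConfigurationOptions.ES_BATCH_SIZE_ENTRIES
    "es.batch.size.bytes"       -> "1mb",     // size cap on each bulk request
    "es.batch.write.retry.wait" -> "30s"))    // wait between bulk retries on rejections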
  My first thought was to use the Elasticsearch Java client and issue the REST requests myself (that would give me full control over the rate). But reading the es_hadoop source, its writer already does a lot of heavy lifting: bulk requests, retries, and the mapping between Spark partitions and the index's shards/replicas, presumably with plenty of optimization behind it. Re-implementing all of that did not seem worth it, so the pragmatic option was to keep es_hadoop and make a small change to its source.

3. Extending the es_hadoop source

I added two Scala files (yes, forcing Scala into the project 😂):

MyEsSparkSQL
MyEsDataFrameWriter

Note that the package must be org.elasticsearch.spark.sql, because these classes rely on pieces that are package-private to org.elasticsearch.spark (EsRDDWriter and friends).

1. MyEsSparkSQL

package org.elasticsearch.spark.sql


import org.apache.commons.logging.LogFactory
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SparkSession
import org.elasticsearch.hadoop.EsHadoopIllegalArgumentException
import org.elasticsearch.hadoop.cfg.ConfigurationOptions.ES_QUERY
import org.elasticsearch.hadoop.cfg.ConfigurationOptions.ES_RESOURCE_READ
import org.elasticsearch.hadoop.cfg.ConfigurationOptions.ES_RESOURCE_WRITE
import org.elasticsearch.hadoop.cfg.PropertiesSettings
import org.elasticsearch.hadoop.rest.InitializationUtils
import org.elasticsearch.hadoop.util.ObjectUtils
import org.elasticsearch.spark.cfg.SparkSettingsManager

import scala.collection.JavaConverters.mapAsJavaMapConverter
import scala.collection.JavaConverters.propertiesAsScalaMapConverter
import scala.collection.Map

object MyEsSparkSQL {

  private val init = { ObjectUtils.loadClass("org.elasticsearch.spark.rdd.CompatUtils", classOf[ObjectUtils].getClassLoader) }

  @transient private[this] val LOG = LogFactory.getLog(EsSparkSQL.getClass)

  //
  // Read
  //

  def esDF(sc: SQLContext): DataFrame = esDF(sc, Map.empty[String, String])
  def esDF(sc: SQLContext, resource: String): DataFrame = esDF(sc, Map(ES_RESOURCE_READ -> resource))
  def esDF(sc: SQLContext, resource: String, query: String): DataFrame = esDF(sc, Map(ES_RESOURCE_READ -> resource, ES_QUERY -> query))
  def esDF(sc: SQLContext, cfg: Map[String, String]): DataFrame = {
    val esConf = new SparkSettingsManager().load(sc.sparkContext.getConf).copy()
    esConf.merge(cfg.asJava)

    sc.read.format("org.elasticsearch.spark.sql").options(esConf.asProperties.asScala.toMap).load
  }

  def esDF(sc: SQLContext, resource: String, query: String, cfg: Map[String, String]): DataFrame = {
    esDF(sc, collection.mutable.Map(cfg.toSeq: _*) += (ES_RESOURCE_READ -> resource, ES_QUERY -> query))
  }

  def esDF(sc: SQLContext, resource: String, cfg: Map[String, String]): DataFrame = {
    esDF(sc, collection.mutable.Map(cfg.toSeq: _*) += (ES_RESOURCE_READ -> resource))
  }

  // SparkSession variant
  def esDF(ss: SparkSession): DataFrame = esDF(ss.sqlContext, Map.empty[String, String])
  def esDF(ss: SparkSession, resource: String): DataFrame = esDF(ss.sqlContext, Map(ES_RESOURCE_READ -> resource))
  def esDF(ss: SparkSession, resource: String, query: String): DataFrame = esDF(ss.sqlContext, Map(ES_RESOURCE_READ -> resource, ES_QUERY -> query))
  def esDF(ss: SparkSession, cfg: Map[String, String]): DataFrame = esDF(ss.sqlContext, cfg)
  def esDF(ss: SparkSession, resource: String, query: String, cfg: Map[String, String]): DataFrame = esDF(ss.sqlContext, resource, query, cfg)
  def esDF(ss: SparkSession, resource: String, cfg: Map[String, String]): DataFrame = esDF(ss.sqlContext, resource, cfg)

  //
  // Write
  //

  def saveToEs(srdd: Dataset[_], resource: String): Unit = {
    saveToEs(srdd, Map(ES_RESOURCE_WRITE -> resource))
  }
  def saveToEs(srdd: Dataset[_], resource: String, cfg: Map[String, String]): Unit = {
    saveToEs(srdd, collection.mutable.Map(cfg.toSeq: _*) += (ES_RESOURCE_WRITE -> resource))
  }
  def saveToEs(srdd: Dataset[_], cfg: Map[String, String]): Unit = {
    if (srdd != null) {
      if (srdd.isStreaming) {
        throw new EsHadoopIllegalArgumentException("Streaming Datasets should not be saved with 'saveToEs()'. Instead, use " +
          "the 'writeStream().format(\"es\").save()' methods.")
      }
      val sparkCtx = srdd.sqlContext.sparkContext
      val sparkCfg = new SparkSettingsManager().load(sparkCtx.getConf)
      val esCfg = new PropertiesSettings().load(sparkCfg.save())
      esCfg.merge(cfg.asJava)

      // Need to discover ES Version before checking index existence
      InitializationUtils.discoverClusterInfo(esCfg, LOG)
      InitializationUtils.checkIdForOperation(esCfg)
      InitializationUtils.checkIndexExistence(esCfg)

      sparkCtx.runJob(srdd.toDF().rdd, new MyEsDataFrameWriter(srdd.schema, esCfg.save()).write _)
    }
  }
}



This class is a wholesale copy of EsSparkSQL; the only change is the last line of def saveToEs(srdd: Dataset[_], cfg: Map[String, String]): Unit, which goes from

      sparkCtx.runJob(srdd.toDF().rdd, new EsDataFrameWriter(srdd.schema, esCfg.save()).write _)

to

      sparkCtx.runJob(srdd.toDF().rdd, new MyEsDataFrameWriter(srdd.schema, esCfg.save()).write _)

2. MyEsDataFrameWriter

package org.elasticsearch.spark.sql

import java.util.concurrent.atomic.AtomicInteger

import org.apache.spark.TaskContext
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType
import org.elasticsearch.hadoop.rest.RestService
import org.elasticsearch.hadoop.serialization.{BytesConverter, JdkBytesConverter}
import org.elasticsearch.hadoop.serialization.builder.ValueWriter
import org.elasticsearch.hadoop.serialization.field.FieldExtractor
import org.elasticsearch.spark.rdd.EsRDDWriter

/**
  * Created by chencc on 2020/8/31.
  */
class MyEsDataFrameWriter(schema: StructType, override val serializedSettings: String)
  extends EsRDDWriter[Row](serializedSettings) {

  // Same serialization plumbing as the stock EsDataFrameWriter.
  override protected def valueWriter: Class[_ <: ValueWriter[_]] = classOf[DataFrameValueWriter]
  override protected def bytesConverter: Class[_ <: BytesConverter] = classOf[JdkBytesConverter]
  override protected def fieldExtractor: Class[_ <: FieldExtractor] = classOf[DataFrameFieldExtractor]

  override protected def processData(data: Iterator[Row]): Any = { (data.next, schema) }

  // settings, log, runtimeMetadata and metaExtractor are all inherited from EsRDDWriter.
  override def write(taskContext: TaskContext, data: Iterator[Row]): Unit = {
    val writer = RestService.createWriter(settings, taskContext.partitionId.toLong, -1, log)

    taskContext.addTaskCompletionListener((_: TaskContext) => writer.close())

    if (runtimeMetadata) {
      writer.repository.addRuntimeFieldExtractor(metaExtractor)
    }

    // Throttle: pause for 100 ms after every 500 documents.
    val counter = new AtomicInteger(0)
    while (data.hasNext) {
      counter.incrementAndGet()
      writer.repository.writeToIndex(processData(data))
      if (counter.get() >= 500) {
        Thread.sleep(100)
        counter.set(0)
        log.info("wrote 500 docs, sleeping 100 milliseconds")
      }
    }
  }
}

MyEsDataFrameWriter overrides EsRDDWriter's write method and adds a sleep; tune the pause to whatever your cluster can tolerate.
The hard-coded 500 (documents per pause) and 100 (milliseconds of sleep) could be made configurable through the SparkConf, which would be far more flexible; a possible sketch follows.
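
A minimal sketch of that idea, for illustration only: the two property names (throttle.write.entries and throttle.write.millis) are invented here, not es-hadoop options, and any mechanism that gets them into the job's settings works, for example the cfg Map of MyEsSparkSQL.saveToEs.

package org.elasticsearch.spark.sql

import org.apache.spark.TaskContext
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType
import org.elasticsearch.hadoop.rest.RestService

// Sketch: same writer as above, but the batch size and the pause are read from the
// settings instead of being hard-coded (the property names are invented for this example).
class MyConfigurableEsDataFrameWriter(schema: StructType, settingsStr: String)
  extends MyEsDataFrameWriter(schema, settingsStr) {

  private lazy val throttleEntries: Int =
    Option(settings.getProperty("throttle.write.entries")).map(_.toInt).getOrElse(500)
  private lazy val throttleMillis: Long =
    Option(settings.getProperty("throttle.write.millis")).map(_.toLong).getOrElse(100L)

  override def write(taskContext: TaskContext, data: Iterator[Row]): Unit = {
    val writer = RestService.createWriter(settings, taskContext.partitionId.toLong, -1, log)
    taskContext.addTaskCompletionListener((_: TaskContext) => writer.close())

    if (runtimeMetadata) {
      writer.repository.addRuntimeFieldExtractor(metaExtractor)
    }

    var sinceLastPause = 0
    while (data.hasNext) {
      writer.repository.writeToIndex(processData(data))
      sinceLastPause += 1
      if (sinceLastPause >= throttleEntries) {
        Thread.sleep(throttleMillis)   // back off so queries can keep up
        sinceLastPause = 0
      }
    }
  }
}

To wire it in, point the runJob(...) line in MyEsSparkSQL.saveToEs at this class instead of MyEsDataFrameWriter.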

With this in place, the rate at which Spark writes to ES can be throttled nicely.
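
For completeness, on the driver side the only change is calling the new object instead of EsSparkSQL. A Scala sketch reusing the names from section 2; the two throttle keys are the invented ones from the previous sketch and only take effect if that configurable writer is wired in:

import org.elasticsearch.spark.sql.MyEsSparkSQL

MyEsSparkSQL.saveToEs(
  searchUserProfile.repartition(3),        // still 3 write tasks, one per ES data node
  writeIndex,                              // same target index as before
  Map(
    "throttle.write.entries" -> "500",     // invented key: documents between pauses
    "throttle.write.millis"  -> "100"))    // invented key: pause length in ms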
