Usage Example
import org.apache.spark.{SparkConf, SparkContext}

object Save2EsLocalTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("save2eslocal").setMaster("local[*]")
    conf.set("spark.streaming.stopGracefullyOnShutdown", "true")
    conf.set("es.index.auto.create", "false")
    conf.set("es.nodes", "127.0.0.1")
    conf.set("es.port", "9200")
    val sc = new SparkContext(conf)
    /*
     * Relevant es settings:
     *
     * es.resource.write : the index/type to write to
     * es.write.operation :
     *   index  - add new documents
     *   upsert - insert when the document does not exist, update when it does
     *
     * es.mapping.id : maps a document field to the document id
     */
    val config = scala.collection.mutable.Map(
      "es.resource.write"  -> "test/students",
      "es.mapping.id"      -> "sid",
      "es.write.operation" -> "upsert")
    // This import is required: it provides the implicit conversion that adds saveToEs to RDD
    import org.elasticsearch.spark._
    val students = sc.makeRDD(Seq(Map("sid" -> "7", "sname" -> "hhy", "sage" -> 100)))
    students.saveToEs(config)
    sc.stop()
  }
}
The full list of configuration options is documented at: https://www.elastic.co/guide/en/elasticsearch/hadoop/5.5/configuration.html
Source Code Analysis
We start from the main entry point, saveToEs. RDD itself has no saveToEs method, so why can we call it here? Because we imported org.elasticsearch.spark._: that package contains a package object spark that defines the required implicit conversions (see any reference on Scala implicits for background).
The relevant code is:
implicit def sparkRDDFunctions[T : ClassTag](rdd: RDD[T]) = new SparkRDDFunctions[T](rdd)

class SparkRDDFunctions[T : ClassTag](rdd: RDD[T]) extends Serializable {
  def saveToEs(resource: String): Unit = { EsSpark.saveToEs(rdd, resource) }
  def saveToEs(resource: String, cfg: scala.collection.Map[String, String]): Unit = { EsSpark.saveToEs(rdd, resource, cfg) }
  def saveToEs(cfg: scala.collection.Map[String, String]): Unit = { EsSpark.saveToEs(rdd, cfg) }
}
As the code shows, the implicit conversion simply delegates to the saveToEs methods of EsSpark. If you call saveToEs(resource: String), the code creates a cfg map and wraps the resource into it as "es.resource.write" -> resource, where resource has the format index/type. Whichever of the three overloads you call, the resulting cfg is then merged with the settings from SparkConf (keys starting with spark. are included with that prefix stripped), producing the final settings.
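As an illustration (a sketch reusing the students RDD from the example above), the three overloads end up with equivalent configurations:

import org.elasticsearch.spark._

// 1) resource only: wrapped internally as "es.resource.write" -> "test/students"
students.saveToEs("test/students")

// 2) resource plus extra settings
students.saveToEs("test/students", Map("es.mapping.id" -> "sid"))

// 3) everything in the cfg map
students.saveToEs(Map(
  "es.resource.write" -> "test/students",
  "es.mapping.id"     -> "sid"))

// Settings placed on SparkConf with a "spark." prefix are merged in with the
// prefix stripped, e.g. conf.set("spark.es.nodes", "...") becomes "es.nodes".

Inside EsSpark, the cfg-based overload looks like this: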
def saveToEs(rdd: RDD[_], cfg: Map[String, String]) {
  doSaveToEs(rdd, cfg, false)
}
EsSpark.saveToEs mainly delegates to the doSaveToEs method:
private[spark] def doSaveToEs(rdd: RDD[_], cfg: Map[String, String], hasMeta: Boolean) {
  CompatUtils.warnSchemaRDD(rdd, LogFactory.getLog("org.elasticsearch.spark.rdd.EsSpark"))

  if (rdd == null || rdd.partitions.length == 0) {
    return
  }

  val sparkCfg = new SparkSettingsManager().load(rdd.sparkContext.getConf)
  val config = new PropertiesSettings().load(sparkCfg.save())
  config.merge(cfg.asJava)

  // Need to discover the EsVersion here before checking if the index exists
  InitializationUtils.discoverEsVersion(config, LOG)
  InitializationUtils.checkIdForOperation(config)
  InitializationUtils.checkIndexExistence(config)

  rdd.sparkContext.runJob(rdd, new EsRDDWriter(config.save(), hasMeta).write _)
}
runJob is Spark's job-submission entry point; its implementation is standard Spark and worth reading up on separately. Our focus here is new EsRDDWriter(config.save(), hasMeta).write _. The code above first discovers the cluster version over REST, then checks the operation: an update/upsert requires es.mapping.id to be set (its value should name a field of your document). In addition, if es.index.auto.create is set to no/false, it checks whether the index exists and throws an exception if it does not; if set to yes/true, a missing index is created automatically, so no existence check is performed.
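For illustration (a hypothetical configuration, not from the source), here is a cfg that would fail these checks fast:

val badCfg = Map(
  "es.resource.write"  -> "missing_index/doc",
  "es.write.operation" -> "upsert")   // no es.mapping.id -> checkIdForOperation throws
// And with es.index.auto.create=false, checkIndexExistence throws
// if missing_index does not already exist in the cluster.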
Now let's step into new EsRDDWriter(config.save(), hasMeta).write.
Constructing the EsRDDWriter initializes the following:
protected def valueWriter: Class[_ <: ValueWriter[_]] = classOf[ScalaValueWriter]
protected def bytesConverter: Class[_ <: BytesConverter] = classOf[JdkBytesConverter]
protected def fieldExtractor: Class[_ <: FieldExtractor] = classOf[ScalaMapFieldExtractor]

lazy val settings = {
  // serializedSettings here is config.save(): the configuration serialized to a String
  // and then loaded back again
  val settings = new PropertiesSettings().load(serializedSettings)
  // set some required defaults; each can be overridden with a custom implementation via
  // es.ser.writer.value.class / es.ser.writer.bytes.class / es.mapping.default.extractor.class
  InitializationUtils.setValueWriterIfNotSet(settings, valueWriter, log)
  InitializationUtils.setBytesConverterIfNeeded(settings, bytesConverter, log)
  InitializationUtils.setFieldExtractorIfNotSet(settings, fieldExtractor, log)
  settings
}

lazy val metaExtractor = new ScalaMetadataExtractor()
Scala's lazy keyword defines a lazy value: it must be a val (immutable), and it is instantiated only on first access.
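A minimal, self-contained illustration of the lazy semantics (not library code):

object LazyDemo extends App {
  lazy val settings = {            // the body does not run at definition time
    println("initializing settings")
    Map("es.nodes" -> "127.0.0.1")
  }
  println("before first access")
  println(settings("es.nodes"))    // "initializing settings" prints only now
  println(settings("es.nodes"))    // already initialized; the body is not run again
}

Next, the write method that runJob invokes for each partition: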
def write(taskContext: TaskContext, data: Iterator[T]) {
  val writer = RestService.createWriter(settings, taskContext.partitionId, -1, log)

  taskContext.addTaskCompletionListener((TaskContext) => writer.close())

  if (runtimeMetadata) {
    writer.repository.addRuntimeFieldExtractor(metaExtractor)
  }

  while (data.hasNext) {
    writer.repository.writeToIndex(processData(data))
  }
}
A closer look at the write method
write first creates a PartitionWriter:
val writer = RestService.createWriter(settings, taskContext.partitionId, -1, log)
Here is the implementation of createWriter:
public static PartitionWriter createWriter(Settings settings, int currentSplit, int totalSplits, Log log) {
    Version.logVersion();

    InitializationUtils.validateSettings(settings);
    InitializationUtils.discoverEsVersion(settings, log);
    InitializationUtils.discoverNodesIfNeeded(settings, log);
    InitializationUtils.filterNonClientNodesIfNeeded(settings, log);
    InitializationUtils.filterNonDataNodesIfNeeded(settings, log);
    InitializationUtils.filterNonIngestNodesIfNeeded(settings, log);

    List<String> nodes = SettingsUtils.discoveredOrDeclaredNodes(settings);

    int selectedNode = (currentSplit < 0) ? new Random().nextInt(nodes.size()) : currentSplit % nodes.size();
    // select the appropriate nodes first, to spread the load before-hand
    SettingsUtils.pinNode(settings, nodes.get(selectedNode));

    Resource resource = new Resource(settings, false);
    log.info(String.format("Writing to [%s]", resource));

    // determine whether this is a single-index or a multi-index write; the difference is that
    // a single index can be pinned to a node, while multiple indices pick nodes randomly
    IndexExtractor iformat = ObjectUtils.instantiate(settings.getMappingIndexExtractorClassName(), settings);
    iformat.compile(resource.toString());

    RestRepository repository = (iformat.hasPattern() ? initMultiIndices(settings, currentSplit, resource, log) : initSingleIndex(settings, currentSplit, resource, log));
    return new PartitionWriter(settings, currentSplit, totalSplits, repository);
}
In the single-index case:
shards have a mapping to nodes; the partitionId is taken modulo the shard count to pick the corresponding node, and that PartitionWriter then writes to that shard (see the sketch below);
if es.nodes.client.only=true, a client node is pinned instead;
if es.nodes.wan.only=true, the behavior is the same as the multi-index case.
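A quick sketch of the modulo selection from createWriter above (hypothetical node list; a negative currentSplit means a random pick):

val nodes = Vector("node-a:9200", "node-b:9200", "node-c:9200")
def selectNode(currentSplit: Int): String = {
  val selected =
    if (currentSplit < 0) scala.util.Random.nextInt(nodes.size) // random node
    else currentSplit % nodes.size                              // round-robin by partition id
  nodes(selected)
}
// selectNode(0) == "node-a:9200", selectNode(1) == "node-b:9200",
// selectNode(3) == "node-a:9200", and so on.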
Registering a task-completion listener:
taskContext.addTaskCompletionListener((TaskContext) => writer.close())
The actual write loop:
while (data.hasNext) {
  writer.repository.writeToIndex(processData(data))
}
public void writeToIndex(Object object) {
    Assert.notNull(object, "no object data given");

    lazyInitWriting();
    doWriteToIndex(command.write(object));
}
First, lazyInitWriting():
private void lazyInitWriting() {
    if (!writeInitialized) {
        writeInitialized = true;

        autoFlush = !settings.getBatchFlushManual();
        ba.bytes(new byte[settings.getBatchSizeInBytes()], 0);
        trivialBytesRef = new BytesRef();
        bufferEntriesThreshold = settings.getBatchSizeInEntries();
        requiresRefreshAfterBulk = settings.getBatchRefreshAfterWrite();

        this.command = BulkCommands.create(settings, metaExtractor, client.internalVersion);
    }
}
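The buffer and flush behavior initialized here is driven by the batch settings; a hedged configuration sketch (key names as documented for es-hadoop, defaults may vary by version):

val batchCfg = Map(
  "es.batch.size.bytes"    -> "1mb",   // size of ba, the backing byte buffer
  "es.batch.size.entries"  -> "1000",  // bufferEntriesThreshold
  "es.batch.write.refresh" -> "true")  // requiresRefreshAfterBulk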
The important part is BulkCommands.create:
public static BulkCommand create(Settings settings, MetadataExtractor metaExtractor, EsMajorVersion version) {
    String operation = settings.getOperation();
    BulkFactory factory = null;

    if (ConfigurationOptions.ES_OPERATION_CREATE.equals(operation)) {
        factory = new CreateBulkFactory(settings, metaExtractor);
    }
    else if (ConfigurationOptions.ES_OPERATION_INDEX.equals(operation)) {
        factory = new IndexBulkFactory(settings, metaExtractor);
    }
    else if (ConfigurationOptions.ES_OPERATION_UPDATE.equals(operation)) {
        factory = new UpdateBulkFactory(settings, metaExtractor, version);
    }
    else if (ConfigurationOptions.ES_OPERATION_UPSERT.equals(operation)) {
        factory = new UpdateBulkFactory(settings, true, metaExtractor, version);
    }
    else {
        throw new EsHadoopIllegalArgumentException("Unknown operation " + operation);
    }

    return factory.createBulk();
}
As the code shows, a different concrete AbstractBulkFactory is created depending on the value of es.write.operation, but factory.createBulk() always runs AbstractBulkFactory's implementation; none of the concrete subclasses override it.
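Restating the dispatch as a Scala sketch (assuming the usual string values of the ES_OPERATION_* constants):

def factoryFor(operation: String): String = operation match {
  case "create" => "CreateBulkFactory"
  case "index"  => "IndexBulkFactory"
  case "update" => "UpdateBulkFactory"
  case "upsert" => "UpdateBulkFactory with the upsert flag set to true"
  case other    => throw new IllegalArgumentException(s"Unknown operation $other")
}

Here is AbstractBulkFactory.createBulk: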
public BulkCommand createBulk() {
    List<Object> before = new ArrayList<Object>();
    List<Object> after = new ArrayList<Object>();

    if (!isStatic) {
        before.add(new DynamicHeaderRef());
        after.add(new DynamicEndRef());
    }
    else {
        writeObjectHeader(before);
        before = compact(before);
        writeObjectEnd(after);
        after = compact(after);
    }

    boolean isScriptUpdate = settings.hasUpdateScript();

    // is the RDD data already JSON?
    if (jsonInput) {
        // JSON input combined with a script
        if (isScriptUpdate) {
            return new JsonScriptTemplateBulk(before, after, jsonExtractors, settings);
        }
        // JSON input without a script
        return new JsonTemplatedBulk(before, after, jsonExtractors, settings);
    }
    // non-JSON input combined with a script
    if (isScriptUpdate) {
        return new ScriptTemplateBulk(settings, before, after, valueWriter);
    }
    // everything else
    return new TemplatedBulk(before, after, valueWriter);
}
before holds the metadata written ahead of each document, such as _version, _type, and _routing;
after holds the trailing "\n".
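For orientation (an illustrative example, not taken from the source), a single entry in the bulk payload for the index operation, using the document from the usage example, would look roughly like this: before contributes the action line, the serialized document follows, and after contributes the newline.

val bulkEntry =
  """{"index":{"_id":"7"}}
    |{"sid":"7","sname":"hhy","sage":100}
    |""".stripMargin

Both lists are post-processed by compact: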
private List<Object> compact(List<Object> list) {
    if (list == null || list.isEmpty()) {
        return null;
    }

    List<Object> compacted = new ArrayList<Object>();
    StringBuilder stringAccumulator = new StringBuilder();
    for (Object object : list) {
        if (object instanceof FieldExtractor) {
            if (stringAccumulator.length() > 0) {
                compacted.add(new BytesArray(stringAccumulator.toString()));
                stringAccumulator.setLength(0);
            }
            compacted.add(new FieldWriter((FieldExtractor) object));
        }
        else {
            stringAccumulator.append(object.toString());
        }
    }

    if (stringAccumulator.length() > 0) {
        compacted.add(new BytesArray(stringAccumulator.toString()));
    }

    return compacted;
}
compact further wraps the values accumulated in before/after: FieldExtractor instances are wrapped in FieldWriter, and strings are wrapped in BytesArray; these two lists only ever contain those two kinds of objects.
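A Scala restatement of compact (a sketch, not the library code): adjacent strings collapse into a single chunk, while extractor markers split the accumulation.

case class Extractor(field: String)  // stands in for a FieldExtractor

def compactSketch(list: List[Any]): List[Any] = {
  val out = scala.collection.mutable.ListBuffer.empty[Any]
  val acc = new StringBuilder
  list.foreach {
    case s: String => acc.append(s)
    case extractor =>
      if (acc.nonEmpty) { out += acc.toString; acc.setLength(0) }
      out += extractor
  }
  if (acc.nonEmpty) out += acc.toString
  out.toList
}

// compactSketch(List("{\"index\":{", "\"_id\":\"", Extractor("sid"), "\"}}"))
//   == List("{\"index\":{\"_id\":\"", Extractor("sid"), "\"}}")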
Back in writeToIndex, look at the doWriteToIndex(command.write(object)) call. Assuming our data is a Map, command is a TemplatedBulk instance; here is its write implementation:
public BytesRef write(Object object) {
    ref.reset();
    scratchPad.reset();

    Object processed = preProcess(object, scratchPad);

    // write before object
    writeTemplate(beforeObject, processed);
    // write object
    doWriteObject(processed, scratchPad, valueWriter);
    ref.add(scratchPad);
    // write after object
    writeTemplate(afterObject, processed);

    return ref;
}
The code above packs before, the serialized document, and after into a single BytesRef.
private void doWriteToIndex(BytesRef payload) {
    // check space first
    // ba is the backing array for data
    if (payload.length() > ba.available()) {
        if (autoFlush) {
            flush();
        }
        else {
            throw new EsHadoopIllegalStateException(
                String.format("Auto-flush disabled and bulk buffer full; disable manual flush or increase capacity [current size %s]; bailing out", ba.capacity()));
        }
    }

    data.copyFrom(payload);
    payload.reset();

    dataEntries++;
    if (bufferEntriesThreshold > 0 && dataEntries >= bufferEntriesThreshold) {
        if (autoFlush) {
            flush();
        }
        else {
            // handle the corner case of manual flush that occurs only after the buffer is completely full (think size of 1)
            if (dataEntries > bufferEntriesThreshold) {
                throw new EsHadoopIllegalStateException(
                    String.format(
                        "Auto-flush disabled and maximum number of entries surpassed; disable manual flush or increase capacity [current size %s]; bailing out",
                        bufferEntriesThreshold));
            }
        }
    }
}
The key piece above is flush(), whose main work is done by the following method:
public BulkResponse tryFlush() {
    BulkResponse bulkResult;

    try {
        // double check data - it might be a false flush (called on clean-up)
        if (data.length() > 0) {
            if (log.isDebugEnabled()) {
                log.debug(String.format("Sending batch of [%d] bytes/[%s] entries", data.length(), dataEntries));
            }

            bulkResult = client.bulk(resourceW, data);
            executedBulkWrite = true;
        } else {
            bulkResult = BulkResponse.ok(0);
        }
    } catch (EsHadoopException ex) {
        hadWriteErrors = true;
        throw ex;
    }

    // always discard data since there's no code path that uses the in flight data
    discard();

    return bulkResult;
}
The key line is:
bulkResult = client.bulk(resourceW, data);
Here client is a RestClient instance; it was initialized when the RestRepository was constructed, and the RestRepository itself was created while creating the PartitionWriter:
public RestRepository(Settings settings) {
    this.settings = settings;

    if (StringUtils.hasText(settings.getResourceRead())) {
        this.resourceR = new Resource(settings, true);
    }

    if (StringUtils.hasText(settings.getResourceWrite())) {
        this.resourceW = new Resource(settings, false);
    }

    Assert.isTrue(resourceR != null || resourceW != null, "Invalid configuration - No read or write resource specified");

    this.client = new RestClient(settings);
}
Now look at client.bulk:
public BulkResponse bulk(Resource resource, TrackingBytesArray data) {
    Retry retry = retryPolicy.init();
    BulkResponse processedResponse;

    boolean isRetry = false;

    do {
        // NB: dynamically get the stats since the transport can change
        long start = network.transportStats().netTotalTime;
        Response response = execute(PUT, resource.bulk(), data);
        long spent = network.transportStats().netTotalTime - start;

        stats.bulkTotal++;
        stats.docsSent += data.entries();
        stats.bulkTotalTime += spent;
        // bytes will be counted by the transport layer

        if (isRetry) {
            stats.docsRetried += data.entries();
            stats.bytesRetried += data.length();
            stats.bulkRetries++;
            stats.bulkRetriesTotalTime += spent;
        }

        isRetry = true;

        processedResponse = processBulkResponse(response, data);
    } while (data.length() > 0 && retry.retry(processedResponse.getHttpStatus()));

    return processedResponse;
}
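The do/while retry loop above is governed by the bulk retry policy; a hedged configuration sketch (these are the documented es-hadoop retry settings, defaults may vary by version):

val retryCfg = Map(
  "es.batch.write.retry.count" -> "3",   // how many times a rejected bulk request is retried
  "es.batch.write.retry.wait"  -> "10s") // how long to wait between retries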
Focus on this line: Response response = execute(PUT, resource.bulk(), data);
protected Response execute(Method method, String path, ByteSequence buffer) {
    return execute(new SimpleRequest(method, null, path, null, buffer), true);
}

// the method above delegates to:
protected Response execute(Request request, boolean checkStatus) {
    Response response = network.execute(request);
    if (checkStatus) {
        checkResponse(request, response);
    }
    return response;
}
Here network is a NetworkClient; the core of network.execute is response = currentTransport.execute(routedRequest), where currentTransport is a Transport. That interface has two main implementations: CommonsHttpTransport and LeasedTransport.
public NetworkClient(Settings settings) {
    this(settings, (!SettingsUtils.hasJobTransportPoolingKey(settings) ? new CommonsHttpTransportFactory() : PooledTransportManager.getTransportFactory(settings)));
}
If es.internal.transport.pooling.key is not set, a CommonsHttpTransportFactory creates a CommonsHttpTransport; otherwise the pooled variant is used. When the Transport is created, if a username and password are configured via es.net.http.auth.user and es.net.http.auth.pass, authentication is performed.
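A minimal sketch of enabling basic authentication, reusing the SparkConf from the usage example (the values are placeholders):

conf.set("es.net.http.auth.user", "elastic")    // hypothetical user name
conf.set("es.net.http.auth.pass", "changeme")   // hypothetical password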
Next, response = currentTransport.execute(routedRequest) is invoked:
public Response execute(Request request) throws IOException {
    HttpMethod http = null;

    switch (request.method()) {
    case DELETE:
        http = new DeleteMethodWithBody();
        break;
    case HEAD:
        http = new HeadMethod();
        break;
    case GET:
        http = (request.body() == null ? new GetMethod() : new GetMethodWithBody());
        break;
    case POST:
        http = new PostMethod();
        break;
    case PUT:
        http = new PutMethod();
        break;
    default:
        throw new EsHadoopTransportException("Unknown request method " + request.method());
    }

    CharSequence uri = request.uri();
    if (StringUtils.hasText(uri)) {
        if (String.valueOf(uri).contains("?")) {
            throw new EsHadoopInvalidRequest("URI has query portion on it: [" + uri + "]");
        }
        http.setURI(new URI(escapeUri(uri.toString(), sslEnabled), false));
    }

    // NB: initialize the path _after_ the URI otherwise the path gets reset to /
    // add node prefix (if specified)
    String path = pathPrefix + addLeadingSlashIfNeeded(request.path().toString());
    if (path.contains("?")) {
        throw new EsHadoopInvalidRequest("Path has query portion on it: [" + path + "]");
    }

    path = HttpEncodingTools.encodePath(path);

    http.setPath(path);

    try {
        // validate new URI
        uri = http.getURI().toString();
    } catch (URIException uriex) {
        throw new EsHadoopTransportException("Invalid target URI " + request, uriex);
    }

    CharSequence params = request.params();
    if (StringUtils.hasText(params)) {
        http.setQueryString(params.toString());
    }

    ByteSequence ba = request.body();
    if (ba != null && ba.length() > 0) {
        if (!(http instanceof EntityEnclosingMethod)) {
            throw new IllegalStateException(String.format("Method %s cannot contain body - implementation bug", request.method().name()));
        }
        EntityEnclosingMethod entityMethod = (EntityEnclosingMethod) http;
        entityMethod.setRequestEntity(new BytesArrayRequestEntity(ba));
        entityMethod.setContentChunked(false);
    }

    // apply request headers, configured through es.net.http.header.xxx settings
    // (headers = new HeaderProcessor(settings))
    headers.applyTo(http);

    // when tracing, log everything
    if (log.isTraceEnabled()) {
        log.trace(String.format("Tx %s[%s]@[%s][%s]?[%s] w/ payload [%s]", proxyInfo, request.method().name(), httpInfo, request.path(), request.params(), request.body()));
    }

    long start = System.currentTimeMillis();
    try {
        // the actual HTTP call, via commons-httpclient
        client.executeMethod(http);
    } finally {
        stats.netTotalTime += (System.currentTimeMillis() - start);
    }

    if (log.isTraceEnabled()) {
        Socket sk = ReflectionUtils.invoke(GET_SOCKET, conn, (Object[]) null);
        String addr = sk.getLocalAddress().getHostAddress();
        log.trace(String.format("Rx %s@[%s] [%s-%s] [%s]", proxyInfo, addr, http.getStatusCode(), HttpStatus.getStatusText(http.getStatusCode()), http.getResponseBodyAsString()));
    }

    // the request URI is not set (since it is retried across hosts), so use the http info instead for source
    return new SimpleResponse(http.getStatusCode(), new ResponseInputStream(http), httpInfo);
}
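As the headers.applyTo(http) comment notes, custom request headers can be supplied through settings prefixed with es.net.http.header.; a minimal sketch on the SparkConf from the usage example (the header value here is just an illustration):

conf.set("es.net.http.header.Accept-Encoding", "gzip")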
Since the bulk method executed Response response = execute(PUT, resource.bulk(), data);, request.method() matches PUT, and therefore HttpMethod http = new PutMethod();.