Setting the document _id when writing from HDFS to Elasticsearch (without es.mapping.id)

qq_35701300

Published 2023-11-20 14:18:34

Tags: elasticsearch, big data, search engine

Article link: https://blog.csdn.net/qq_35701300/article/details/134506142

I. Pain points of setting the document _id when writing from HDFS to Elasticsearch

1. In elasticsearch-hadoop, if no metadata is supplied, Elasticsearch generates the _id automatically. That generated _id carries no business meaning and is not what we want.

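2. The usual way to set a custom _id is the es.mapping.id setting, which names a document field to use as the id. The id then also remains in the document body as a redundant field.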
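To make the redundancy in point 2 concrete, the sketch below prints what a bulk index entry looks like in each case: an action line carrying the _id, followed by the document source. It uses only the JDK and does not call elasticsearch-hadoop. The class name BulkIdDemo, the field names user_id and name, and the hand-rolled JSON rendering are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class BulkIdDemo {

    // Minimal JSON rendering for flat, string-valued maps (illustration only, no escaping).
    static String toJson(Map<String, String> doc) {
        return doc.entrySet().stream()
                .map(e -> "\"" + e.getKey() + "\":\"" + e.getValue() + "\"")
                .collect(Collectors.joining(",", "{", "}"));
    }

    // One bulk entry: an action line carrying the _id, then the document source.
    static String bulkEntry(String id, Map<String, String> source) {
        return "{\"index\":{\"_id\":\"" + id + "\"}}\n" + toJson(source) + "\n";
    }

    public static void main(String[] args) {
        // With es.mapping.id=user_id the id is read from a document field,
        // so the same value appears in the action line and in the source.
        Map<String, String> withIdField = new LinkedHashMap<>();
        withIdField.put("user_id", "u-1001");
        withIdField.put("name", "alice");
        System.out.print(bulkEntry(withIdField.get("user_id"), withIdField));

        // Goal of this article: the id is supplied separately from the document,
        // so the source carries no id field.
        Map<String, String> withoutIdField = new LinkedHashMap<>();
        withoutIdField.put("name", "alice");
        System.out.print(bulkEntry("u-1001", withoutIdField));
    }
}
```

The first entry repeats u-1001 in both lines; the second keeps it only in the action line, which is the result the rest of this post works toward.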

II. Source code analysis (elasticsearch-hadoop 7.15.2)

job.setOutputFormatClass(EsOutputFormat.class); sets the output format. Inside EsOutputFormat, the actual writing to ES is done by the inner class EsRecordWriter, whose write method performs the write.

 protected static class EsRecordWriter extends RecordWriter implements org.apache.hadoop.mapred.RecordWriter {

        protected final Configuration cfg;
        protected boolean initialized = false;

        protected RestRepository repository;
        private String uri;
        private Resource resource;

        private HeartBeat beat;
        private final Progressable progressable;

        public EsRecordWriter(Configuration cfg, Progressable progressable) {
            this.cfg = cfg;
            this.progressable = progressable;
        }

        @Override
        public void write(Object key, Object value) throws IOException {
            if (!initialized) {
                initialized = true;
                init();
            }
            repository.writeToIndex(value);
        }

        protected void init() throws IOException {
            //int instances = detectNumberOfInstances(cfg);
            int currentInstance = detectCurrentInstance(cfg);

            if (log.isTraceEnabled()) {
                log.trace(String.format("EsRecordWriter instance [%s] initiating discovery of target shard...",
                        currentInstance));
            }

            Settings settings = HadoopSettingsManager.loadFrom(cfg).copy();

            if (log.isTraceEnabled()) {
                log.trace(String.format("Init shard writer from cfg %s", HadoopCfgUtils.asProperties(cfg)));
            }

            InitializationUtils.setValueWriterIfNotSet(settings, WritableValueWriter.class, log);
            InitializationUtils.setBytesConverterIfNeeded(settings, WritableBytesConverter.class, log);
            InitializationUtils.setFieldExtractorIfNotSet(settings, MapWritableFieldExtractor.class, log);
            InitializationUtils.setUserProviderIfNotSet(settings, HadoopUserProvider.class, log);

            PartitionWriter pw = RestService.createWriter(settings, currentInstance, -1, log);

            this.repository = pw.repository;

            if (progressable != null) {
                this.beat = new HeartBeat(progressable, cfg, settings.getHeartBeatLead(), log);
                this.beat.start();
            }
        }

        // remaining members omitted
    }

In write, the call repository.writeToIndex(value); writes both the metadata and the data. The key question is when the metadata gets passed in.

public class RestRepository implements Closeable, StatsAware {

    private static Log log = LogFactory.getLog(RestRepository.class);

    // wrapper around existing BA (for cases where the serialization already occurred)
    private BytesRef trivialBytesRef;
    private boolean writeInitialized = false;

    private RestClient client;
    // optional extractor passed lazily to BulkCommand
    private MetadataExtractor metaExtractor;

    private BulkEntryWriter bulkEntryWriter;
    private BulkProcessor bulkProcessor;

    // ... intermediate methods omitted ...

    // the metadata extractor is set through this method
    public void addRuntimeFieldExtractor(MetadataExtractor metaExtractor) {
        this.metaExtractor = metaExtractor;
    }

}

The addRuntimeFieldExtractor method supplies the metadata extractor. If no metadata is provided, ES generates the _id itself.
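The fallback described above can be modelled with a few lines of plain Java. This is a simplified stand-in, not the library's code: IdFallbackDemo, actionLine and the use of java.util.function.Function in place of MetadataExtractor are all hypothetical names chosen for illustration. It only shows the rule that a missing extractor (or a null value) leaves the action line without an _id, in which case Elasticsearch assigns one.

```java
import java.util.function.Function;

public class IdFallbackDemo {

    // Hypothetical stand-in for a metadata extractor: maps a record to its _id, or null.
    // When idExtractor is null or yields null, the action line carries no _id
    // and Elasticsearch assigns an auto-generated one at index time.
    static String actionLine(Function<Object, String> idExtractor, Object record) {
        String id = (idExtractor == null) ? null : idExtractor.apply(record);
        return (id == null)
                ? "{\"index\":{}}"
                : "{\"index\":{\"_id\":\"" + id + "\"}}";
    }

    public static void main(String[] args) {
        // No extractor registered: the _id is left to Elasticsearch.
        System.out.println(actionLine(null, "row-1"));
        // Extractor registered: the _id is derived from the record.
        System.out.println(actionLine(r -> "id-" + r, "row-1"));
    }
}
```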

III. Implementation

The method takes a MetadataExtractor, which turns out to be an interface. Its only implementation, PerEntityPoolingMetadataExtractor, is an abstract class, so we have to implement its getValue method ourselves.

import org.elasticsearch.hadoop.serialization.bulk.PerEntityPoolingMetadataExtractor;
import org.elasticsearch.hadoop.util.EsMajorVersion;

public class CustomMetadataExtractor extends PerEntityPoolingMetadataExtractor {

    // The _id for one document; an immutable String needs no locking
    private final String id;

    public CustomMetadataExtractor(String id) {
        this.id = id;
        this.version = EsMajorVersion.V_7_X; // specify the ES major version here
    }

    @Override
    public Object getValue(Metadata metadata) {
        // Return a value depending on the metadata type; only _id is supplied here
        switch (metadata) {
            case ID:
                return id; // replace with your own logic for obtaining the _id
            default:
                return null; // no value for any other metadata type
        }
    }
}

To pass the id in, we also need to override the write method of EsOutputFormat's inner EsRecordWriter so that it hands over the custom _id.

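import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.util.Progressable;
import org.elasticsearch.hadoop.mr.EsOutputFormat;

public class CustomEsOutputFormat extends EsOutputFormat {

    @Override
    public RecordWriter getRecordWriter(FileSystem ignored, JobConf job, String name, Progressable progress) {
        return new CustomEsRecordWriter(job, progress);
    }

    protected static class CustomEsRecordWriter extends EsRecordWriter {

        public CustomEsRecordWriter(Configuration cfg, Progressable progressable) {
            super(cfg, progressable);
        }

        @Override
        public void write(Object key, Object value) throws IOException {
            if (!initialized) {
                initialized = true;
                init();
            }
            // The output key of the job is used as the document _id
            String documentId = key.toString();

            // Register an extractor carrying this record's _id, then write the value
            CustomMetadataExtractor customMetadataExtractor = new CustomMetadataExtractor(documentId);
            repository.addRuntimeFieldExtractor(customMetadataExtractor);
            repository.writeToIndex(value);
        }
    }
}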