Setting the document _id when writing from HDFS to Elasticsearch (without es.mapping.id)

I. Pain points of setting the document _id when writing from HDFS to Elasticsearch

1. If no metadata is configured in elasticsearch-hadoop, Elasticsearch generates the _id automatically, but that _id carries no business meaning and is not what we need.

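2. The usual way to set a custom _id is the es.mapping.id setting, which names a document field to use as the id. The id therefore also has to live inside the document, which produces a redundant field.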
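
To make the redundancy concrete, the sketch below hand-builds the two lines the Elasticsearch Bulk API expects for one index operation: an action line, then the document source. It is a simplified illustration that uses only the JDK, not elasticsearch-hadoop's own serializer. The class BulkLineSketch, its methods, and the field names user_id and name are hypothetical, and values are assumed to need no JSON escaping.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class BulkLineSketch {

    // Builds the bulk action line; a null id lets Elasticsearch generate one.
    static String actionLine(String id) {
        return id == null
                ? "{\"index\":{}}"
                : "{\"index\":{\"_id\":\"" + id + "\"}}";
    }

    // Serializes a flat map of string fields as a JSON object (no escaping).
    static String sourceLine(Map<String, String> doc) {
        StringJoiner joiner = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, String> e : doc.entrySet()) {
            joiner.add("\"" + e.getKey() + "\":\"" + e.getValue() + "\"");
        }
        return joiner.toString();
    }

    // es.mapping.id style: the id is read from a field that stays in the document.
    static String withMappingId(Map<String, String> doc, String idField) {
        return actionLine(doc.get(idField)) + "\n" + sourceLine(doc) + "\n";
    }

    // Key-based style: the id travels next to the document, not inside it.
    static String withExternalId(String id, Map<String, String> doc) {
        return actionLine(id) + "\n" + sourceLine(doc) + "\n";
    }

    public static void main(String[] args) {
        Map<String, String> withField = new LinkedHashMap<>();
        withField.put("user_id", "u42");
        withField.put("name", "alice");
        System.out.print(withMappingId(withField, "user_id"));

        Map<String, String> withoutField = new LinkedHashMap<>();
        withoutField.put("name", "alice");
        System.out.print(withExternalId("u42", withoutField));
    }
}
```

Both variants produce the same action line, but only the first one repeats the id inside the stored document. The rest of this article is about getting the second shape out of elasticsearch-hadoop.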

II. Source code analysis (elasticsearch-hadoop 7.15.2)

job.setOutputFormatClass(EsOutputFormat.class); sets the output format. Inside the EsOutputFormat class, the actual writing to ES is done by the inner class EsRecordWriter, whose write method performs the write.

 protected static class EsRecordWriter extends RecordWriter implements org.apache.hadoop.mapred.RecordWriter {

        protected final Configuration cfg;
        protected boolean initialized = false;

        protected RestRepository repository;
        private String uri;
        private Resource resource;

        private HeartBeat beat;
        private final Progressable progressable;

        public EsRecordWriter(Configuration cfg, Progressable progressable) {
            this.cfg = cfg;
            this.progressable = progressable;
        }

        @Override
        public void write(Object key, Object value) throws IOException {
            if (!initialized) {
                initialized = true;
                init();
            }
            repository.writeToIndex(value);
        }

        protected void init() throws IOException {
            //int instances = detectNumberOfInstances(cfg);
            int currentInstance = detectCurrentInstance(cfg);

            if (log.isTraceEnabled()) {
                log.trace(String.format("EsRecordWriter instance [%s] initiating discovery of target shard...",
                        currentInstance));
            }

            Settings settings = HadoopSettingsManager.loadFrom(cfg).copy();

            if (log.isTraceEnabled()) {
                log.trace(String.format("Init shard writer from cfg %s", HadoopCfgUtils.asProperties(cfg)));
            }

            InitializationUtils.setValueWriterIfNotSet(settings, WritableValueWriter.class, log);
            InitializationUtils.setBytesConverterIfNeeded(settings, WritableBytesConverter.class, log);
            InitializationUtils.setFieldExtractorIfNotSet(settings, MapWritableFieldExtractor.class, log);
            InitializationUtils.setUserProviderIfNotSet(settings, HadoopUserProvider.class, log);

            PartitionWriter pw = RestService.createWriter(settings, currentInstance, -1, log);

            this.repository = pw.repository;

            if (progressable != null) {
                this.beat = new HeartBeat(progressable, cfg, settings.getHeartBeatLead(), log);
                this.beat.start();
            }
        }

Inside write, the call repository.writeToIndex(value); writes both the metadata and the data. The key question is: at what point is the metadata passed in?

public class RestRepository implements Closeable, StatsAware {

    private static Log log = LogFactory.getLog(RestRepository.class);

    // wrapper around existing BA (for cases where the serialization already occurred)
    private BytesRef trivialBytesRef;
    private boolean writeInitialized = false;

    private RestClient client;
    // optional extractor passed lazily to BulkCommand
    private MetadataExtractor metaExtractor;

    private BulkEntryWriter bulkEntryWriter;
    private BulkProcessor bulkProcessor;

    // intermediate methods omitted



    // the metadata extractor is set through this method
    public void addRuntimeFieldExtractor(MetadataExtractor metaExtractor) {
        this.metaExtractor = metaExtractor;
    }

}
The addRuntimeFieldExtractor method is what supplies the metadata values. If no metadata is supplied, ES generates the _id itself.
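
The "optional hook with a fallback" idea can be pictured with a small toy model. This is only a sketch of the pattern, written against the JDK alone. It is not RestRepository's actual control flow, and OptionalIdHookSketch, setIdExtractor and write are hypothetical stand-ins for the repository, addRuntimeFieldExtractor and writeToIndex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

final class OptionalIdHookSketch {

    // Stand-in for a metadata extractor: supplies the _id for the next document.
    private Supplier<String> idExtractor;

    // Collected bulk lines: an action line followed by a source line per document.
    private final List<String> bulkLines = new ArrayList<>();

    // Plays the role of addRuntimeFieldExtractor in this model.
    void setIdExtractor(Supplier<String> idExtractor) {
        this.idExtractor = idExtractor;
    }

    // Plays the role of writeToIndex: emits the action line, then the source.
    void write(String jsonSource) {
        String id = (idExtractor == null) ? null : idExtractor.get();
        if (id == null) {
            // No metadata available: Elasticsearch will generate the _id.
            bulkLines.add("{\"index\":{}}");
        } else {
            bulkLines.add("{\"index\":{\"_id\":\"" + id + "\"}}");
        }
        bulkLines.add(jsonSource);
    }

    List<String> lines() {
        return bulkLines;
    }
}
```

In this model, a write made before any extractor is registered produces an empty action line, while a write made afterwards carries the supplied _id. That is the behavior the custom extractor in the next section relies on.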

III. Implementation

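The method takes a MetadataExtractor, which turns out to be an interface. The only implementation I found is PerEntityPoolingMetadataExtractor, but it is an abstract class, so we have to implement its getValue method ourselves.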

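import org.elasticsearch.hadoop.serialization.bulk.MetadataExtractor.Metadata;
import org.elasticsearch.hadoop.serialization.bulk.PerEntityPoolingMetadataExtractor;
import org.elasticsearch.hadoop.util.EsMajorVersion;

public class CustomMetadataExtractor extends PerEntityPoolingMetadataExtractor {

    // An immutable String is safe to read from any thread, so no lock is needed
    private final String id;

    public CustomMetadataExtractor(String id) {
        this.id = id;
        super.version = EsMajorVersion.V_7_X; // specify the ES version here
    }

    @Override
    public Object getValue(Metadata metadata) {
        // Return a value according to the requested metadata type
        switch (metadata) {
            case ID:
                // The document _id; replace with your own lookup logic if needed
                return id;
            default:
                // No value for any other metadata type
                return null;
        }
    }
}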

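To pass the id in, we also have to override the write method of EsRecordWriter, the inner class of EsOutputFormat, and hand it the custom _id. Here the record key is used as the _id.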

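import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.util.Progressable;
import org.elasticsearch.hadoop.mr.EsOutputFormat;

public class CustomEsOutputFormat extends EsOutputFormat {

    // Return our own writer instead of the default EsRecordWriter
    @Override
    public RecordWriter getRecordWriter(FileSystem ignored, JobConf job, String name, Progressable progress) {
        return new CustomEsRecordWriter(job, progress);
    }

    protected static class CustomEsRecordWriter extends EsRecordWriter {
        public CustomEsRecordWriter(Configuration cfg, Progressable progressable) {
            super(cfg, progressable);
        }

        @Override
        public void write(Object key, Object value) throws IOException {
            // Same lazy initialization as the parent class
            if (!initialized) {
                initialized = true;
                init();
            }
            // The record key is used as the document _id
            String documentId = key.toString();

            // Register an extractor carrying this record's _id, then write the document
            CustomMetadataExtractor customMetadataExtractor = new CustomMetadataExtractor(documentId);
            repository.addRuntimeFieldExtractor(customMetadataExtractor);
            repository.writeToIndex(value);
        }
    }
}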
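
For completeness, here is one way the job could be wired up. This is a sketch under assumptions rather than a tested driver. It assumes a map-only job whose input lines look like "<id>\t<name>", and the class names HdfsToEsJob and IdKeyMapper, the node address and the index name are all made up for illustration. es.nodes and es.resource are the standard elasticsearch-hadoop settings. The point is that the mapper emits the desired _id as the output key and leaves it out of the document body.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class HdfsToEsJob {

    // Hypothetical mapper: assumes tab-separated lines of the form "<id>\t<name>".
    public static class IdKeyMapper extends Mapper<LongWritable, Text, Text, MapWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\t", 2);
            if (parts.length < 2) {
                return; // skip malformed lines
            }
            MapWritable doc = new MapWritable();
            doc.put(new Text("name"), new Text(parts[1]));
            // The key becomes the _id; it is not added to the document body.
            context.write(new Text(parts[0]), doc);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("es.nodes", "localhost:9200"); // assumed cluster address
        conf.set("es.resource", "my-index");    // assumed target index
        // Avoid duplicate writes from speculative task attempts
        conf.setBoolean("mapreduce.map.speculative", false);

        Job job = Job.getInstance(conf, "hdfs-to-es");
        job.setJarByClass(HdfsToEsJob.class);
        job.setMapperClass(IdKeyMapper.class);
        job.setNumReduceTasks(0); // map-only: mapper output goes straight to ES
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(CustomEsOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(MapWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Note that CustomEsRecordWriter calls key.toString(), so every record needs a non-null key. A job that used to emit NullWritable keys for the stock EsOutputFormat would have to change its output key type.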
