I. Pain points of setting the document _id when writing from HDFS to Elasticsearch
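1. When elasticsearch-hadoop is used without any metadata settings, Elasticsearch generates the _id automatically. That _id carries no business meaning, which is not what we need.
2. The usual way to set a custom _id is the es.mapping.id setting, which names a document field to use as the id. The field therefore has to be part of the document, so the id value ends up as a redundant field in the stored source.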
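To make the two pain points concrete, the sketch below models a bulk `index` entry as two NDJSON lines: an action line followed by the document source. It is a simplified stand-in written with the JDK only, not the connector's own serializer. The class `BulkEntrySketch`, its methods, and the sample fields `order_id` and `amount` are hypothetical names chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Simplified model of a bulk "index" entry: an action line followed by the document source.
// Illustration only; this is not how elasticsearch-hadoop serializes documents.
final class BulkEntrySketch {

    // Renders a flat map of string fields as a JSON object (no escaping; sample data only).
    static String toJson(Map<String, String> fields) {
        StringJoiner json = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            json.add("\"" + e.getKey() + "\":\"" + e.getValue() + "\"");
        }
        return json.toString();
    }

    // Case 1: no id is supplied, so the action line is empty and Elasticsearch assigns an _id.
    static String autoId(Map<String, String> doc) {
        return "{\"index\":{}}\n" + toJson(doc);
    }

    // Case 2: the id is read from a document field (the es.mapping.id approach).
    // The field supplies _id but also stays in the source, so the value is stored twice.
    static String idFromField(Map<String, String> doc, String idField) {
        return "{\"index\":{\"_id\":\"" + doc.get(idField) + "\"}}\n" + toJson(doc);
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("order_id", "A-1001");
        doc.put("amount", "42");
        System.out.println(autoId(doc));
        System.out.println(idFromField(doc, "order_id"));
    }
}
```

In the second case `A-1001` appears both in the action line and in the source, which is the redundancy described in point 2.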
II. Source code analysis (elasticsearch-hadoop 7.15.2)
job.setOutputFormatClass(EsOutputFormat.class); sets the output format. Inside EsOutputFormat, the actual writing to Elasticsearch is done by the inner class EsRecordWriter, in its write method.
protected static class EsRecordWriter extends RecordWriter implements org.apache.hadoop.mapred.RecordWriter {
    protected final Configuration cfg;
    protected boolean initialized = false;
    protected RestRepository repository;
    private String uri;
    private Resource resource;
    private HeartBeat beat;
    private final Progressable progressable;

    public EsRecordWriter(Configuration cfg, Progressable progressable) {
        this.cfg = cfg;
        this.progressable = progressable;
    }

    @Override
    public void write(Object key, Object value) throws IOException {
        if (!initialized) {
            initialized = true;
            init();
        }
        repository.writeToIndex(value);
    }

    protected void init() throws IOException {
        //int instances = detectNumberOfInstances(cfg);
        int currentInstance = detectCurrentInstance(cfg);
        if (log.isTraceEnabled()) {
            log.trace(String.format("EsRecordWriter instance [%s] initiating discovery of target shard...",
                    currentInstance));
        }
        Settings settings = HadoopSettingsManager.loadFrom(cfg).copy();
        if (log.isTraceEnabled()) {
            log.trace(String.format("Init shard writer from cfg %s", HadoopCfgUtils.asProperties(cfg)));
        }
        InitializationUtils.setValueWriterIfNotSet(settings, WritableValueWriter.class, log);
        InitializationUtils.setBytesConverterIfNeeded(settings, WritableBytesConverter.class, log);
        InitializationUtils.setFieldExtractorIfNotSet(settings, MapWritableFieldExtractor.class, log);
        InitializationUtils.setUserProviderIfNotSet(settings, HadoopUserProvider.class, log);
        PartitionWriter pw = RestService.createWriter(settings, currentInstance, -1, log);
        this.repository = pw.repository;
        if (progressable != null) {
            this.beat = new HeartBeat(progressable, cfg, settings.getHeartBeatLead(), log);
            this.beat.start();
        }
    }
    // remaining members omitted
}
The write method calls repository.writeToIndex(value); to write both the metadata and the data. The key question is: at what point is the metadata passed in?
public class RestRepository implements Closeable, StatsAware {
    private static Log log = LogFactory.getLog(RestRepository.class);
    // wrapper around existing BA (for cases where the serialization already occurred)
    private BytesRef trivialBytesRef;
    private boolean writeInitialized = false;
    private RestClient client;
    // optional extractor passed lazily to BulkCommand
    private MetadataExtractor metaExtractor;
    private BulkEntryWriter bulkEntryWriter;
    private BulkProcessor bulkProcessor;

    // intermediate methods omitted

    // this method sets the metadata extractor
    public void addRuntimeFieldExtractor(MetadataExtractor metaExtractor) {
        this.metaExtractor = metaExtractor;
    }
}
The addRuntimeFieldExtractor method is where the metadata extractor is set. If no metadata is provided, Elasticsearch generates the _id itself.
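The flow described above can be modeled without the connector on the classpath. In this minimal sketch, `RepositorySketch`, its nested `IdExtractor` interface, and `actionLine` are hypothetical stand-ins for `RestRepository`, `MetadataExtractor`, and the bulk-entry writer. Only the shape of the logic follows the analysis: with no extractor registered, no `_id` is sent; once one is registered, its value is used.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for RestRepository: it keeps an optional extractor
// and consults it each time a document is written.
final class RepositorySketch {

    // Hypothetical stand-in for MetadataExtractor, reduced to the _id case.
    interface IdExtractor {
        String id();
    }

    private IdExtractor extractor;                        // null until one is registered
    private final List<String> bulk = new ArrayList<>();  // pending bulk lines

    // Mirrors the role of addRuntimeFieldExtractor: remember the extractor for later writes.
    void addRuntimeFieldExtractor(IdExtractor extractor) {
        this.extractor = extractor;
    }

    // Builds the action line; without an extractor (or without an id) no _id is sent,
    // which leaves id generation to Elasticsearch.
    String actionLine() {
        String id = (extractor == null) ? null : extractor.id();
        if (id == null) {
            return "{\"index\":{}}";
        }
        return "{\"index\":{\"_id\":\"" + id + "\"}}";
    }

    // Appends the action line and the document source to the pending bulk request.
    void writeToIndex(String jsonSource) {
        bulk.add(actionLine());
        bulk.add(jsonSource);
    }

    List<String> pending() {
        return bulk;
    }

    public static void main(String[] args) {
        RepositorySketch repo = new RepositorySketch();
        repo.writeToIndex("{\"amount\":\"42\"}");       // no extractor registered yet
        repo.addRuntimeFieldExtractor(() -> "A-1001");  // register a custom id
        repo.writeToIndex("{\"amount\":\"42\"}");
        repo.pending().forEach(System.out::println);
    }
}
```

The second document gets its `_id` from the extractor while its source stays free of any id field, which is the behavior the rest of this post aims for.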
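III. Implementation
addRuntimeFieldExtractor takes a MetadataExtractor, which turns out to be an interface. The only implementation found is PerEntityPoolingMetadataExtractor, but it is an abstract class, so we have to implement its getValue method ourselves.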
public class CustomMetadataExtractor extends PerEntityPoolingMetadataExtractor {
    // The id is an immutable String, so no StringBuffer or locking is needed for thread safety.
    private final String id;

    public CustomMetadataExtractor(String id) {
        this.id = id;
        super.version = EsMajorVersion.V_7_X; // specify the ES major version here
    }

    @Override
    public Object getValue(Metadata metadata) {
        // Return the value for the requested metadata type; other types can be handled here as needed.
        switch (metadata) {
            case ID:
                // the value to use as the document _id
                return id;
            default:
                // metadata types that are not handled return null
                return null;
        }
    }
}
To pass the id in, we also have to override the write method of EsRecordWriter, the inner class of EsOutputFormat, and hand it the custom _id.
public class CustomEsOutputFormat extends EsOutputFormat {
    @Override
    public org.apache.hadoop.mapred.RecordWriter getRecordWriter(FileSystem ignored, JobConf job, String name, Progressable progress) {
        return new CustomEsRecordWriter(job, progress);
    }

    protected static class CustomEsRecordWriter extends EsRecordWriter {
        public CustomEsRecordWriter(Configuration cfg, Progressable progressable) {
            super(cfg, progressable);
        }

        @Override
        public void write(Object key, Object value) throws IOException {
            if (!initialized) {
                initialized = true;
                init();
            }
            // the record key is used as the document _id
            String documentId = key.toString();
            // register an extractor carrying this record's _id before writing the document
            CustomMetadataExtractor customMetadataExtractor = new CustomMetadataExtractor(documentId);
            repository.addRuntimeFieldExtractor(customMetadataExtractor);
            repository.writeToIndex(value);
        }
    }
}
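Because write(key, value) now takes the _id from the key, whatever feeds this output format has to emit the id as the key and leave it out of the value. A minimal JDK-only sketch of that split follows. The tab-separated `id`, `name`, `city` layout and the `RecordSplitter` class are assumptions made for illustration; in a real job the key and value would be Hadoop Writable types (for example Text and MapWritable) rather than plain Java strings and maps.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.LinkedHashMap;
import java.util.Map;

// Splits one input line into the pair handed to write(key, value):
// the key carries the _id, the value carries only the document fields.
final class RecordSplitter {
    // Hypothetical column layout of the HDFS input: id, name, city (tab-separated).
    private static final String[] COLUMNS = {"id", "name", "city"};

    static Map.Entry<String, Map<String, String>> split(String line) {
        String[] parts = line.split("\t", -1);
        if (parts.length != COLUMNS.length) {
            throw new IllegalArgumentException("Expected " + COLUMNS.length + " columns: " + line);
        }
        Map<String, String> doc = new LinkedHashMap<>();
        // Start at index 1 so the id column is kept out of the document body.
        for (int i = 1; i < COLUMNS.length; i++) {
            doc.put(COLUMNS[i], parts[i]);
        }
        return new SimpleImmutableEntry<>(parts[0], doc);
    }

    public static void main(String[] args) {
        Map.Entry<String, Map<String, String>> kv = split("A-1001\tAlice\tBerlin");
        System.out.println(kv.getKey() + " -> " + kv.getValue());
    }
}
```

Keeping the id out of the value is what removes the redundant field: the id travels only in the key and reaches Elasticsearch as metadata.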