Elasticsearch (ES) is a powerful search engine, while HDFS is a distributed file system. ES can export its own documents to HDFS as a backup, and it can also import structured files stored on HDFS as ES documents. ES-Hadoop is the connector that bridges the two.
1. Exporting data from ES to HDFS
1.1 Data preparation. Create an index and a type in ES, and add some documents. In this example, the index is mydata, the type is person, and two documents are created as shown below.
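For reference, the test data can be created through the ES REST API. This is a sketch: the endpoint (localhost:9200) and the document fields (name, age) are assumptions, only the index/type names (mydata/person) come from the example above.

```shell
# Assumed ES endpoint; index and type names follow the example (mydata/person).
# The document fields below are illustrative.
curl -XPOST 'http://localhost:9200/mydata/person' \
     -H 'Content-Type: application/json' \
     -d '{"name": "Alice", "age": 25}'
curl -XPOST 'http://localhost:9200/mydata/person' \
     -H 'Content-Type: application/json' \
     -d '{"name": "Bob", "age": 30}'
```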
1.2 Add the ES-Hadoop library to the project
```xml
<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch-hadoop</artifactId>
    <version>5.5.2</version>
</dependency>
```
Note that the dependency above only pulls in the ES-Hadoop jars. The Hadoop-side dependencies, such as hadoop-common and hadoop-hdfs, still have to be added separately.
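For example, the Hadoop-side dependencies might look like the following. This is a sketch: the version number is an assumption and should match the Hadoop version of your cluster.

```xml
<!-- Versions are illustrative; use the version of your Hadoop cluster -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.7.3</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.7.3</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-core</artifactId>
    <version>2.7.3</version>
</dependency>
```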
1.3 Create the Mapper class for the ES-to-HDFS data migration
```java
package com.wjm.es_hadoop.example1;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.elasticsearch.hadoop.mr.LinkedMapWritable;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;

class E2HMapper01 extends Mapper<Text, LinkedMapWritable, Text, LinkedMapWritable> {

    private static final Logger LOG = LoggerFactory.getLogger(E2HMapper01.class);

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
    }

    // Each input record is one ES document: the key is the document id and
    // the value holds the document fields. Pass both through unchanged.
    @Override
    protected void map(Text key, LinkedMapWritable value, Context context)
            throws IOException, InterruptedException {
        LOG.info("key {} value {}", key, value);
        context.write(key, value);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        super.cleanup(context);
    }
}
```
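The mapper alone does not run; it needs a driver that wires it to ES-Hadoop's EsInputFormat. The following is a minimal sketch of such a driver: the class name E2HJob01, the ES address, and the HDFS output path are assumptions, while the es.nodes/es.resource settings and EsInputFormat come from the ES-Hadoop MapReduce API.

```java
package com.wjm.es_hadoop.example1;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.elasticsearch.hadoop.mr.EsInputFormat;
import org.elasticsearch.hadoop.mr.LinkedMapWritable;

// Sketch of a driver for the export job; adjust addresses and paths for your cluster.
public class E2HJob01 {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("es.nodes", "localhost:9200");   // ES cluster address (assumed)
        conf.set("es.resource", "mydata/person"); // index/type to read, per the example

        Job job = Job.getInstance(conf, "es-to-hdfs");
        job.setJarByClass(E2HJob01.class);
        job.setInputFormatClass(EsInputFormat.class);
        job.setMapperClass(E2HMapper01.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LinkedMapWritable.class);
        job.setNumReduceTasks(0); // map-only job: documents are written out directly

        // HDFS output directory (assumed path)
        FileOutputFormat.setOutputPath(job, new Path("/tmp/es_export"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Setting the number of reduce tasks to zero keeps the job a straight copy: each mapper's output lands directly in the HDFS output directory without a shuffle.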