MapReduce Integration

This document assumes the ES and Hadoop clusters are already installed and working, and follows:

https://www.elastic.co/guide/en/elasticsearch/hadoop/5.6/mapreduce.html

Version information:

HDP version: HDP-2.6.1.0

ES version: 5.6.0

MR version: 2.7.3

Maven version: 3.5


1. Depending on elasticsearch-hadoop.jar

1.1. Adding the elasticsearch-hadoop environment to the Hadoop cluster

This article adds the dependency in the Maven project's pom file:

<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch-hadoop</artifactId>
    <version>5.6.0</version>
</dependency>

*Pitfall: the dependency version in the pom must match the ES version.

2. Preparing the data

2.1. Save the following content to blog.json

{"id":"1","title":"git简介","posttime":"2016-06-11","content":"svngit的最主要区别..."}
{"id":"2","title":"ava中泛型的介绍与简单使用","posttime":"2016-06-12","content":"基本操作:CRUD"}
{"id":"3","title":"SQL基本操作","posttime":"2016-06-13","content":"svngit的最主要区别..."}
{"id":"4","title":"Hibernate框架基础","posttime":"2016-06-14","content":"Hibernate框架基础..."}
{"id":"5","title":"Shell基本知识","posttime":"2016-06-15","content":"Shell是什么..."}


2.2. Start Hadoop and upload blog.json to HDFS
hadoop fs -put blog.json /work
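If /work already exists as a directory on HDFS, the file is stored as /work/blog.json; the upload can be verified with hadoop fs -cat /work/blog.json.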

3. Reading documents from HDFS and indexing them into ES

3.1. The code is as follows

package com.hand.es.hadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.elasticsearch.hadoop.cfg.ConfigurationOptions;
import org.elasticsearch.hadoop.mr.EsOutputFormat;

import java.io.IOException;

/**
 * Created by bee on 4/1/17.
 */
public class HdfsToES {

    // The mapper passes each input line (one JSON document) through unchanged.
    public static class MyMapper extends Mapper<Writable, Writable, NullWritable, Text> {
        @Override
        public void map(Writable key, Writable value, Mapper<Writable, Writable, NullWritable, Text>.Context context)
                throws IOException, InterruptedException {
            Text docVal = new Text();
            docVal.set(value.toString());
            context.write(NullWritable.get(), docVal);
        }
    }


    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        // Disable speculative execution so documents are not written to ES more than once
        conf.setBoolean("mapred.map.tasks.speculative.execution", false);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);

        conf.set(ConfigurationOptions.ES_NODES, "10.211.55.241");      // ES node address
        conf.set(ConfigurationOptions.ES_PORT, "9200");                // ES HTTP port
        conf.set(ConfigurationOptions.ES_INPUT_JSON, "yes");           // map output values are raw JSON documents
//        conf.set(ConfigurationOptions.ES_MAPPING_ID, "id");         // optional: use the "id" field as the document id
        conf.set(ConfigurationOptions.ES_RESOURCE, "company/input01"); // target index/type

        Job job = Job.getInstance(conf);
        job.setJobName("hdfs es write test");
        job.setMapperClass(HdfsToES.MyMapper.class);
        job.setJarByClass(HdfsToES.class);
        // Map-only job: indexing needs no reduce phase
        job.setNumReduceTasks(0);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(EsOutputFormat.class);

        job.setMapOutputKeyClass(NullWritable.class);
        job.setMapOutputValueClass(Text.class);

        // Set the input path (the HDFS directory that holds blog.json)
        FileInputFormat.setInputPaths(job, new Path("hdfs://hdfs01.edcs.org:8020/work"));
        job.waitForCompletion(true);
    }
}
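After packaging the project into a jar (for example with mvn package) and submitting it with hadoop jar, the five documents should appear in the company/input01 index; a quick check is curl 'http://10.211.55.241:9200/company/input01/_search?pretty' (host and index as configured above).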


3.2. Walkthrough of the code

The map phase reads the input line by line; the key and value are declared generically as Writable (with TextInputFormat the key is the line's byte offset and the value the line itself). The output key is NullWritable, a special Writable whose implementation is empty: it reads nothing from the data stream and writes nothing to it, serving purely as a placeholder. In MapReduce, a key or value that is not needed can be declared as NullWritable, which is why the output key is set to NullWritable here. The output value is of type Text and carries the serialized JSON string of the document.

Since we only need to write to ES, there is no reduce phase. In the main method we first create a Configuration object conf and use it to set a few parameters (the plain string keys below are the same settings accessed through the ConfigurationOptions constants in the code above):

conf.setBoolean("mapred.map.tasks.speculative.execution", false);//关闭mapper阶段的执行推测

conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);//关闭reducer阶段的执行推测

conf.set("es.nodes", "192.168.1.111:9200");//配置ElasticsearchIP和端口

conf.set("es.resource", "blog/csdn");//设置索引到Elasticsearch的索引名和类型名。

conf.set("es.mapping.id", "id");//设置文档id,这个参数”id”是文档中的id字段

conf.set("es.input.json", "yes");//指定输入的文件类型为json

job.setInputFormatClass(TextInputFormat.class);/设置输入流为文本类型

job.setOutputFormatClass(EsOutputFormat.class);//设置输出为EsOutputFormat类型。

job.setMapOutputKeyClass(NullWritable.class);//设置Map的输出key类型为NullWritable类型

job.setMapOutputValueClass(Test.class);//设置Map的输出value类型为BytesWritable类型

*Notes

1. The ES write API accepts JSON by default. If es.input.json is left unset or set to false, es-hadoop serializes each map output value into fields of the generated document instead of treating it as a ready-made JSON document. Many examples online set job.setMapOutputValueClass(BytesWritable.class), but in version 5.6.0 this leads to a conversion exception in ES and the request fails with a 400 error; a sketch of the structured-field alternative follows these notes.

2. To take the document id from the JSON, set conf.set("es.mapping.id", "id"); the field named here must exist in the document, and it must not be _id or collide with a metadata field already configured in ES.
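
For reference, when es.input.json is disabled, the pattern shown in the es-hadoop documentation is to emit a MapWritable of fields instead of a JSON string, and let es-hadoop serialize the fields itself. A minimal sketch (the class name and hard-coded field values are illustrative only; the job would then use job.setMapOutputValueClass(MapWritable.class)):

import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

// Sketch: mapper for the es.input.json=false case; each document is
// emitted as a MapWritable of fields rather than a ready-made JSON string.
public class StructMapper extends Mapper<Object, Text, NullWritable, MapWritable> {
    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // A real job would parse the input line into fields here;
        // the values below are hard-coded for illustration.
        MapWritable doc = new MapWritable();
        doc.put(new Text("id"), new Text("1"));
        doc.put(new Text("title"), new Text("git简介"));
        context.write(NullWritable.get(), doc);
    }
}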


4. Exporting from ES to HDFS

4.1. The code is as follows:

package com.hand.es.hadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.elasticsearch.hadoop.cfg.ConfigurationOptions;
import org.elasticsearch.hadoop.mr.EsInputFormat;

import java.io.IOException;

public class EsToHdfs {

    // The mapper writes each document read from ES (a JSON string, since es.output.json=true) out unchanged.
    public static class ESMap extends Mapper<Writable, Writable, NullWritable, Text> {
        @Override
        public void map(Writable key, Writable value, Mapper<Writable, Writable, NullWritable, Text>.Context context)
                throws IOException, InterruptedException {
            Text docVal = new Text();
            docVal.set(value.toString());
            context.write(NullWritable.get(), docVal);
        }
    }

    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();
        conf.set(ConfigurationOptions.ES_NODES, "10.211.55.241");   // ES node address
        conf.set(ConfigurationOptions.ES_PORT, "9200");             // ES HTTP port
        conf.set(ConfigurationOptions.ES_OUTPUT_JSON, "true");      // return each hit as a raw JSON string
        conf.set(ConfigurationOptions.ES_RESOURCE, "company/info"); // index/type to read from

        Job job = Job.getInstance(conf, EsToHdfs.class.getSimpleName());
        job.setJobName("es to hdfs test");
        // Specify the custom Mapper class for the map phase
        job.setMapperClass(ESMap.class);
        job.setJarByClass(EsToHdfs.class);
        // Map-only job: no reduce phase
        job.setNumReduceTasks(0);
        // Set the map output key/value types
        job.setMapOutputKeyClass(NullWritable.class);
        job.setMapOutputValueClass(Text.class);
        // Read the input from ES
        job.setInputFormatClass(EsInputFormat.class);
        // Set the HDFS output path (the directory must not already exist)
        FileOutputFormat.setOutputPath(job, new Path("hdfs://hdfs01.edcs.org:8020/data"));
        // Run the MR job
        job.waitForCompletion(true);
    }
}
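After the job completes, the exported documents end up in part-m-* files under /data (the default naming for a map-only job). If only part of the index should be exported, es-hadoop also accepts a query through es.query, which takes a URI query, a query DSL string, or an external resource. A minimal sketch (the URI query itself is illustrative):

        // Optional: restrict the export to matching documents
        conf.set(ConfigurationOptions.ES_QUERY, "?q=title:git");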


Pitfall notes:

1. The job writes into the specified HDFS directory, which must not already exist when the job runs (see the sketch after these notes).


2. When writing the files, make sure job.setJarByClass(EsToHdfs.class) is set.
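
A common way to avoid the first pitfall is to delete the output directory before submitting the job. A minimal sketch, placed before FileOutputFormat.setOutputPath in main (it additionally needs import org.apache.hadoop.fs.FileSystem;):

        // Remove the output directory if it already exists so the job does not abort
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("hdfs://hdfs01.edcs.org:8020/data");
        if (fs.exists(out)) {
            fs.delete(out, true); // true = delete recursively
        }
        FileOutputFormat.setOutputPath(job, out);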
