Hadoop InputFormat Classes --- NLineInputFormat Example

andrewgb

Link to this article: https://blog.csdn.net/andrewgb/article/details/49558827

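Introduction to NLineInputFormat

When a job reads text input, it needs an input format that defines how the data is read. NLineInputFormat is a concrete subclass of InputFormat, and the read format it defines is as follows:

Each line is one record;
Each record is represented as a (key, value) pair once it has been read;
As with the default TextInputFormat, the key is the byte offset of the line within the file and the value is the entire content of the line;
N is the number of records a single Map task processes, that is, the number of lines per Map;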
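
To make the role of N concrete, here is a minimal standalone sketch of the grouping idea: consecutive lines are handed out in chunks of at most N, one chunk per map task. It is plain Java, not Hadoop code, and it does not claim to reproduce how NLineInputFormat is implemented internally; the class name NLineSplitSketch and the method groupLines are hypothetical names chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Standalone illustration only: not Hadoop code, and not the actual
// NLineInputFormat implementation. All names here are hypothetical.
class NLineSplitSketch {

    // Groups the input lines into consecutive chunks of at most n lines,
    // mirroring the idea that each map task receives n lines.
    static List<List<String>> groupLines(List<String> lines, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<List<String>> splits = new ArrayList<>();
        for (int start = 0; start < lines.size(); start += n) {
            int end = Math.min(start + n, lines.size());
            splits.add(new ArrayList<>(lines.subList(start, end)));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("line1", "line2", "line3", "line4", "line5");
        List<List<String>> splits = groupLines(lines, 2);
        for (int i = 0; i < splits.size(); i++) {
            System.out.println("map task " + i + " -> " + splits.get(i));
        }
    }
}
```

When the line count is not a multiple of N, the last chunk simply holds the remaining lines.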

Example

1. The data to process, the tradeinfoIn file (tab-separated fields: account, income, expense, date)

zhangsan@163.com    6000    0   2014-02-20
lisi@163.com    2000    0   2014-02-20
lisi@163.com    0   100 2014-02-20
zhangsan@163.com    3000    0   2014-02-20
wangwu@126.com  9000    0   2014-02-20
wangwu@126.com  0   200     2014-02-20

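2. The records as the Job reads them (key = byte offset of the line, value = the whole line; the offsets assume single-tab separators and \n line endings):

<0,zhangsan@163.com    6000    0   2014-02-20>
<35,lisi@163.com    2000    0   2014-02-20>
<66,lisi@163.com    0   100 2014-02-20>
<96,zhangsan@163.com    3000    0   2014-02-20>
<131,wangwu@126.com  9000    0   2014-02-20>
<164,wangwu@126.com  0   200     2014-02-20>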
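
The keys are cumulative byte counts: each line starts where the previous one ended, including its line terminator. The sketch below computes them for the sample data. It assumes UTF-8 text, a single tab between fields, and a single '\n' after every line; a file with '\r\n' endings or trailing whitespace would give different numbers. LineOffsetSketch and lineOffsets are hypothetical names, and this is plain Java rather than Hadoop's record reader.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Standalone illustration only; names are hypothetical.
class LineOffsetSketch {

    // Returns the byte offset at which each line starts, assuming every line
    // is UTF-8 encoded and terminated by exactly one '\n'.
    static List<Long> lineOffsets(List<String> lines) {
        List<Long> offsets = new ArrayList<>();
        long pos = 0;
        for (String line : lines) {
            offsets.add(pos);
            pos += line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for '\n'
        }
        return offsets;
    }

    // The sample records from tradeinfoIn, with fields joined by single tabs
    // (an assumption about the real file).
    static List<String> sampleLines() {
        return Arrays.asList(
            "zhangsan@163.com\t6000\t0\t2014-02-20",
            "lisi@163.com\t2000\t0\t2014-02-20",
            "lisi@163.com\t0\t100\t2014-02-20",
            "zhangsan@163.com\t3000\t0\t2014-02-20",
            "wangwu@126.com\t9000\t0\t2014-02-20",
            "wangwu@126.com\t0\t200\t2014-02-20");
    }

    public static void main(String[] args) {
        List<String> lines = sampleLines();
        List<Long> offsets = lineOffsets(lines);
        for (int i = 0; i < lines.size(); i++) {
            System.out.println("<" + offsets.get(i) + "," + lines.get(i) + ">");
        }
    }
}
```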

3. Code

Set the mapreduce.input.lineinputformat.linespermap property to specify how many lines each Map task should process

conf.setInt("mapreduce.input.lineinputformat.linespermap", 2);

Configure the Job to read its input files with NLineInputFormat

job.setInputFormatClass(NLineInputFormat.class);

package mapreduce.mr;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import mapreduce.bean.InfoBeanMy;

public class SumStepByTool extends Configured implements Tool{

    public static class SumStepByToolMapper extends Mapper<LongWritable, Text, Text, InfoBeanMy>{

        private InfoBeanMy outBean = new InfoBeanMy();
        private Text k = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException{

            // Each input line is tab-separated: account, income, expense, date.
            String line = value.toString();
            String[] fields = line.split("\t");

            String account = fields[0];
            double income = Double.parseDouble(fields[1]);
            double expense = Double.parseDouble(fields[2]);

            outBean.setFields(account, income, expense);
            k.set(account);

            context.write(k, outBean);
        }
    }

    public static class SumStepByToolReducer extends Reducer<Text, InfoBeanMy, Text, InfoBeanMy>{

        private InfoBeanMy outBean = new InfoBeanMy();
        @Override
        protected void reduce(Text key, Iterable<InfoBeanMy> values, Context context) throws IOException, InterruptedException{
            double income_sum = 0;
            double expense_sum = 0;

            for(InfoBeanMy infoBeanMy : values)
            {
                income_sum += infoBeanMy.getIncome();
                expense_sum += infoBeanMy.getExpense();
            }
            outBean.setFields("", income_sum, expense_sum);
            context.write(key, outBean);
        }

    }


    public static class SumStepByToolPartitioner extends Partitioner<Text, InfoBeanMy>{

        private static Map<String, Integer> accountMap = new HashMap<String, Integer>(); 

        static {
            accountMap.put("zhangsan", 1);
            accountMap.put("lisi", 2);
            accountMap.put("wangwu", 3);
        }

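        @Override
        public int getPartition(Text key, InfoBeanMy value, int numPartitions) {
            // Use the part of the account before '@' as the lookup name;
            // fall back to the whole key if it contains no '@'.
            String keyString = key.toString();
            int at = keyString.indexOf("@");
            String name = at >= 0 ? keyString.substring(0, at) : keyString;
            Integer part = accountMap.get(name);
            // Accounts not listed in accountMap go to partition 0.
            if (part == null )
            {
                part = 0;
            }
            // The returned value must be less than numPartitions (the number of reduce tasks).
            return part;
        }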

    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        conf.setInt("mapreduce.input.lineinputformat.linespermap", 2);
        Job job = Job.getInstance(conf);
        job.setJarByClass(this.getClass());
        job.setJobName("SumStepByTool");

        //job.setInputFormatClass(TextInputFormat.class); // the default input format
        //job.setInputFormatClass(KeyValueTextInputFormat.class); // treats the first field of each line as the key and the remaining fields as the value
        job.setInputFormatClass(NLineInputFormat.class);

        job.setMapperClass(SumStepByToolMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(InfoBeanMy.class);

        job.setReducerClass(SumStepByToolReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(InfoBeanMy.class);
        job.setNumReduceTasks(4); // the partitioner returns 0..3, so 4 reduce tasks are needed

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));


        return job.waitForCompletion(true) ? 0:-1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new SumStepByTool(),args);
        System.exit(exitCode);
    }
}
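
For readers without a cluster at hand, the following standalone sketch mirrors what the job computes on the sample data: the mapper's tab-split parsing, the partitioner's name lookup, and the reducer's per-account sums. It is plain Java, not a MapReduce program; SumByAccountSketch, partitionFor, and sumByAccount are hypothetical names, and the tab-separated sample lines are an assumption about the real file. It also makes one constraint explicit: Hadoop rejects a partition number outside the range 0 to numPartitions - 1, so the partitioner's return values and the number of reduce tasks have to agree.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Standalone illustration only; not a MapReduce job. Names are hypothetical.
class SumByAccountSketch {

    // Same lookup idea as the partitioner in the listing: known account names
    // get fixed partitions, everything else goes to partition 0.
    static int partitionFor(String account, int numPartitions) {
        int at = account.indexOf('@');
        String name = at >= 0 ? account.substring(0, at) : account;
        int part;
        switch (name) {
            case "zhangsan": part = 1; break;
            case "lisi":     part = 2; break;
            case "wangwu":   part = 3; break;
            default:         part = 0;
        }
        // Make the range requirement explicit instead of failing inside the framework.
        if (part >= numPartitions) {
            throw new IllegalStateException(
                "partition " + part + " is out of range for " + numPartitions + " reduce tasks");
        }
        return part;
    }

    // Parses tab-separated lines (account, income, expense, date) and returns
    // {incomeSum, expenseSum} per account, sorted by account name.
    static Map<String, double[]> sumByAccount(List<String> lines) {
        Map<String, double[]> totals = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.split("\t");
            double[] t = totals.computeIfAbsent(fields[0], k -> new double[2]);
            t[0] += Double.parseDouble(fields[1]);
            t[1] += Double.parseDouble(fields[2]);
        }
        return totals;
    }

    static List<String> sampleLines() {
        return Arrays.asList(
            "zhangsan@163.com\t6000\t0\t2014-02-20",
            "lisi@163.com\t2000\t0\t2014-02-20",
            "lisi@163.com\t0\t100\t2014-02-20",
            "zhangsan@163.com\t3000\t0\t2014-02-20",
            "wangwu@126.com\t9000\t0\t2014-02-20",
            "wangwu@126.com\t0\t200\t2014-02-20");
    }

    public static void main(String[] args) {
        for (Map.Entry<String, double[]> e : sumByAccount(sampleLines()).entrySet()) {
            System.out.println(e.getKey()
                + " partition=" + partitionFor(e.getKey(), 4)
                + " income=" + e.getValue()[0]
                + " expense=" + e.getValue()[1]);
        }
    }
}
```

Calling partitionFor("wangwu@126.com", 3) throws in this sketch, which corresponds to the illegal-partition failure a real job would hit with too few reduce tasks.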

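Note

The number of records each map task processes is controlled by the "mapreduce.input.lineinputformat.linespermap" property on the conf object. The static helper NLineInputFormat.setNumLinesPerSplit(job, 2) sets the same property.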

conf.setInt("mapreduce.input.lineinputformat.linespermap", 2);
