6. FileInputFormat Implementation Classes


喵先生呢


Category: MapReduce. Tags: big data, mapreduce

Article link: https://blog.csdn.net/weixin_45267102/article/details/107271320


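The input to a MapReduce program can come in many formats: line-based log files, binary files, database tables, and so on. How does MapReduce read each of these different kinds of data?

Common implementations of the abstract class FileInputFormat include TextInputFormat, KeyValueTextInputFormat, NLineInputFormat, CombineTextInputFormat, and custom InputFormat classes.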

The sections below introduce these FileInputFormat implementations one by one.

1. TextInputFormat

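TextInputFormat is the default FileInputFormat implementation.

It reads the input line by line, one record per line.

The key is the byte offset of the start of the line within the whole file, of type LongWritable.
The value is the content of the line, excluding any line terminators (newline and carriage return), of type Text.

Example

Suppose a split contains the following four text records.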

Rich learning form
Intelligent learning engine
Learning more convenient
From the real demand for more close to the enterprise

Each record is represented as the following key/value pair:

(0,Rich learning form)
(19,Intelligent learning engine)
(47,Learning more convenient)
(72,From the real demand for more close to the enterprise)
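The offsets above can be reproduced with a few lines of plain Java. This is only a minimal sketch of the arithmetic, not Hadoop code: the class LineOffsetSketch and the method startOffsets are hypothetical names, and it assumes UTF-8 text in which every line ends with a single '\n' byte.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Hadoop): mimics the keys TextInputFormat assigns.
final class LineOffsetSketch {

    // Returns the starting byte offset of each line, assuming UTF-8 text
    // and exactly one '\n' byte after every line.
    static List<Long> startOffsets(List<String> lines) {
        List<Long> offsets = new ArrayList<>();
        long position = 0;
        for (String line : lines) {
            offsets.add(position);
            // Advance past the bytes of the line plus one byte for '\n'.
            position += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return offsets;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Rich learning form",
                "Intelligent learning engine",
                "Learning more convenient",
                "From the real demand for more close to the enterprise");
        List<Long> offsets = startOffsets(lines);
        for (int i = 0; i < lines.size(); i++) {
            System.out.println("(" + offsets.get(i) + "," + lines.get(i) + ")");
        }
    }
}
```

The first line is 18 bytes long, so with its newline the second line starts at offset 19. With Windows line endings (\r\n) every subsequent offset would shift by one extra byte per line.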

2. KeyValueTextInputFormat

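Each line is one record, split into a key and a value at the first occurrence of a separator.

The separator can be set in the driver class:
conf.set(KeyValueLineRecordReader.KEY_VALUE_SEPERATOR, "\t");
The default separator is a tab (\t).

Example

The input is a split containing four records, where ——> stands for a (horizontal) tab character.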

line1 ——>Rich learning form
line2 ——>Intelligent learning engine
line3 ——>Learning more convenient
line4 ——>From the real demand for more close to the enterprise

Each record is represented as the following key/value pair:

(line1,Rich learning form)
(line2,Intelligent learning engine)
(line3,Learning more convenient)
(line4,From the real demand for more close to the enterprise)

Here the key is the Text sequence that precedes the tab on each line.
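The splitting rule can be illustrated without a cluster. The following is a minimal sketch in plain Java, not the Hadoop implementation: KeyValueSplitSketch and split are hypothetical names. It assumes that a line is cut at the first occurrence of the separator and that a line without a separator becomes a key with an empty value.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical helper (not part of Hadoop): splits one line into key and value.
final class KeyValueSplitSketch {

    // Cuts the line at the FIRST occurrence of the separator.
    // Assumption: if the separator is absent, the whole line is the key
    // and the value is empty.
    static Map.Entry<String, String> split(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            return new SimpleEntry<>(line, "");
        }
        return new SimpleEntry<>(line.substring(0, pos), line.substring(pos + 1));
    }

    public static void main(String[] args) {
        String[] lines = {
                "line1\tRich learning form",
                "line2\tIntelligent learning engine"
        };
        for (String line : lines) {
            Map.Entry<String, String> kv = split(line, '\t');
            System.out.println("(" + kv.getKey() + "," + kv.getValue() + ")");
        }
    }
}
```

Because only the first separator counts, any later separators stay inside the value.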

3. NLineInputFormat

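With NLineInputFormat, the InputSplit handled by each map task is no longer divided by block but by the number of lines N configured for NLineInputFormat. That is, number of splits = total number of lines in the input file / N; if the division is not exact, number of splits = quotient + 1.

Example

Take the same four input lines as above.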

Rich learning form
Intelligent learning engine
Learning more convenient
From the real demand for more close to the enterprise

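For example, if N is 2, each input split contains two lines, and two MapTasks are started. The first mapper receives the first two lines: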

(0,Rich learning form)
(19,Intelligent learning engine)

The other mapper receives the last two lines:

(47,Learning more convenient)
(72,From the real demand for more close to the enterprise)

The keys and values here are the same as those produced by TextInputFormat.
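The split-count rule for a single file is a ceiling division. A minimal sketch in plain Java follows; NLineSplitSketch and splitCount are hypothetical names used only for illustration, and the sketch covers the arithmetic only, not how Hadoop locates line boundaries.

```java
// Hypothetical helper (not part of Hadoop): number of splits for one input file.
final class NLineSplitSketch {

    // Ceiling division: quotient, plus one when the division is not exact.
    static int splitCount(int totalLines, int linesPerSplit) {
        if (totalLines < 0 || linesPerSplit <= 0) {
            throw new IllegalArgumentException("totalLines must be >= 0 and linesPerSplit > 0");
        }
        return (totalLines + linesPerSplit - 1) / linesPerSplit;
    }

    public static void main(String[] args) {
        // 4 lines, N = 2: the division is exact
        System.out.println(splitCount(4, 2));
        // 5 lines, N = 2: the last split holds the single leftover line
        System.out.println(splitCount(5, 2));
    }
}
```

For the four-line example this gives two splits; a fifth line would add a third split containing only that line.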

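4. Hands-on: KeyValueTextInputFormat example

Requirement: for each distinct first word, count how many lines in the input file start with it.

Input data:
banzhang ni hao
xihuan hadoop banzhang
banzhang ni hao
xihuan hadoop banzhang

Expected output:
banzhang	2
xihuan	2

Analysis:

The separator must be set to a space, and the input format to KeyValueTextInputFormat.

Implementation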

KVTextMapper

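import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * @Date 2020/7/9 22:34
 * @Version 10.21
 * @Author DuanChaojie
 */
public class KVTextMapper extends Mapper<Text, Text, Text, IntWritable> {
    // Every input line contributes a count of 1 for its key
    private final IntWritable v = new IntWritable(1);

    @Override
    protected void map(Text key, Text value, Context context) throws IOException, InterruptedException {
        // The key is the text before the first separator, i.e. the first word of the line
        context.write(key, v);
    }
}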

KVTextReducer

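import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class KVTextReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable v = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Sum the counts emitted for this key
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        v.set(sum);

        context.write(key, v);
    }
}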

KVTextDriver

conf.set(KeyValueLineRecordReader.KEY_VALUE_SEPERATOR, " "); sets the separator.

job.setInputFormatClass(KeyValueTextInputFormat.class); sets the input format.

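import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueLineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KVTextDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        // Set the separator (must be done before the Job is created)
        conf.set(KeyValueLineRecordReader.KEY_VALUE_SEPERATOR, " ");

        Job job = Job.getInstance(conf);

        job.setJarByClass(KVTextDriver.class);
        job.setMapperClass(KVTextMapper.class);
        job.setReducerClass(KVTextReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        // Set the input format
        job.setInputFormatClass(KeyValueTextInputFormat.class);

        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        boolean result = job.waitForCompletion(true);

        System.exit(result ? 0 : 1);
    }
}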

Remember to supply the input and output paths as program arguments. The result matches the expected output.
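To sanity-check the expected output without running a job, the map and reduce logic can be imitated locally. This is only a rough sketch in plain Java: FirstWordCountSketch and countByFirstWord are hypothetical names, and a sorted in-memory map stands in for the shuffle and reduce phases.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical local stand-in for the KVText job (no Hadoop involved).
final class FirstWordCountSketch {

    // "Map": take the text before the first space as the key, emit 1.
    // "Reduce": add up the 1s per key. TreeMap keeps the keys sorted.
    static Map<String, Integer> countByFirstWord(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            int pos = line.indexOf(' ');
            String key = pos < 0 ? line : line.substring(0, pos);
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "banzhang ni hao",
                "xihuan hadoop banzhang",
                "banzhang ni hao",
                "xihuan hadoop banzhang");
        for (Map.Entry<String, Integer> e : countByFirstWord(lines).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Note that the trailing banzhang in the xihuan lines is never counted, because only the text before the first space becomes the key.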

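5. Hands-on: NLineInputFormat example

Requirement: count the occurrences of each word, with the number of splits determined by the number of lines in each input file. In this case, every three lines go into one split.

Input data:
banzhang ni hao
xihuan hadoop banzhang
banzhang ni hao
xihuan hadoop banzhang
banzhang ni hao
xihuan hadoop banzhang
banzhang ni hao
xihuan hadoop banzhang
banzhang ni hao
xihuan hadoop banzhang banzhang ni hao
xihuan hadoop banzhang

Expected output:
Number of splits:4

Analysis:

The input has 11 lines. At three lines per split, 11 / 3 gives a quotient of 3 with a remainder, so the job uses 3 + 1 = 4 splits.

Implementation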

NLineMapper

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NLineMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    private final Text k = new Text();
    private final LongWritable v = new LongWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        // 1 Get one line
        String line = value.toString();

        // 2 Split it on spaces
        String[] words = line.split(" ");

        // 3 Write out each word with a count of 1
        for (String word : words) {
            k.set(word);
            context.write(k, v);
        }
    }
}

NLineReducer


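import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class NLineReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

    private final LongWritable v = new LongWritable();

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {

        long sum = 0;

        // 1 Sum the counts for this word
        for (LongWritable value : values) {
            sum += value.get();
        }

        v.set(sum);

        // 2 Write out the total
        context.write(key, v);
    }
}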

NLineDriver

job.setInputFormatClass(NLineInputFormat.class); uses NLineInputFormat as the input format.

NLineInputFormat.setNumLinesPerSplit(job, 3); puts three records into each InputSplit.


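import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NLineDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // 1 Get the job object
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);
        // Paths are hard-coded here, so no command-line arguments are needed
        args = new String[]{"E:\\file\\test.txt", "E:\\file\\output1"};

        // 2 Set the jar location and associate the mapper and reducer
        job.setJarByClass(NLineDriver.class);
        job.setMapperClass(NLineMapper.class);
        job.setReducerClass(NLineReducer.class);

        // 3 Set the map output key/value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        // 4 Set the final output key/value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // 5 Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // 6 Put three records into each InputSplit
        NLineInputFormat.setNumLinesPerSplit(job, 3);

        // 7 Use NLineInputFormat as the input format
        job.setInputFormatClass(NLineInputFormat.class);

        // 8 Submit the job and exit with a status that reflects its result
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}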