Applying Hadoop MapReduce in Practice, Experiment 2: Sorting Tab-Separated Text

magina507

Categories: Big Data Processing, Lab Reports. Tags: Big Data, Study Notes

Article link: https://blog.csdn.net/magina507/article/details/51592777

1. Experiment Topic

Write a MapReduce program that sorts text separated by tab characters.

2. Objective

Traverse the whole text, find the sentences separated by tab characters, and sort them.

3. Task Analysis

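As in the previous experiment, text processing starts with inspecting the input document. Because line endings are represented differently on different platforms, the file should be viewed on Linux (the original post shows a screenshot here).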

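The document contains 21 sentences in total, separated by tab characters. The goal of the experiment is to separate these 21 sentences and then sort them.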
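Inspecting the file is mostly about making invisible characters visible. On Linux, `cat -A` does this for tabs (`^I`), carriage returns (`^M`), and line ends (`$`). The sketch below reproduces that idea in plain Java for illustration only; the class name `ControlCharViewer` and the method `visualize` are hypothetical and not part of the original experiment.

```java
public class ControlCharViewer {
    // Renders tab as ^I, carriage return as ^M, and marks each line end with $,
    // loosely following the conventions of `cat -A`.
    public static String visualize(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '\t': out.append("^I"); break;
                case '\r': out.append("^M"); break;
                case '\n': out.append("$\n"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A Windows-style line: two sentences separated by a tab, ended by CR LF.
        System.out.print(visualize("first sentence\tsecond sentence\r\n"));
    }
}
```

A file saved on Windows would show `^M$` at each line end, while a Unix file shows only `$`, which is the difference the paragraph above refers to.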

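The mapper is therefore easy to write: it reads the file line by line and skips empty lines. The code is as follows (shared by 广州-Carl, a member of the discussion group):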

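package com.apress.hadoop.examples.ch2;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits every non-empty input line as the map output key, so the shuffle phase sorts the lines.
public class SortingMapper extends Mapper<LongWritable, Text, Text, Text> {
	// Placeholder value; only the key matters for sorting.
	private static final Text PLACEHOLDER = new Text("temp");

	@Override
	public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
		String line = value.toString();
		// Skip empty lines. A line holding only whitespace (e.g. a lone tab) is not empty and is still emitted.
		if (!line.isEmpty()) {
			context.write(new Text(line), PLACEHOLDER);
		}
	}
}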

The text of each line is stored in the map output key, so the reducer only needs to work with the key.
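The mapper emits each input line as a whole. If the input file instead placed several sentences on one line with tabs between them (an assumption; the original screenshot is not available to confirm the layout), each line would have to be split before emitting. A minimal sketch of that splitting step in plain Java, with the hypothetical names `TabSplitSketch` and `splitOnTabs`:

```java
import java.util.ArrayList;
import java.util.List;

public class TabSplitSketch {
    // Splits one input line on tab characters and drops blank pieces.
    public static List<String> splitOnTabs(String line) {
        List<String> sentences = new ArrayList<>();
        for (String piece : line.split("\t")) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                sentences.add(trimmed);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Consecutive tabs and stray spaces are tolerated.
        System.out.println(splitOnTabs("banana\t\tapple\t cherry"));
    }
}
```

Inside a mapper, each returned piece would be written as its own output key in place of the whole line.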

Because the MapReduce framework sorts the map output by key during the shuffle phase, the reducer is very simple:

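package com.apress.hadoop.examples.ch2;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Numbers the sorted keys. The numbering is global only when the job runs a single reducer.
public class SortingReducer extends Reducer<Text, Text, IntWritable, Text> {
	// Running position of the current key within this reducer's sorted input.
	private int index = 0;

	@Override
	protected void reduce(Text key, Iterable<Text> values, Context context)
			throws IOException, InterruptedException {
		// The values are ignored, so identical lines are written only once.
		index++;
		context.write(new IntWritable(index), key);
	}
}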

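The custom variable index serves as the output key, and the output value is the key received from the mapper, which is the sentence itself. The keys arrive at the reducer already sorted, so numbering them in arrival order is enough. This numbering is globally correct only when the job runs with a single reducer.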
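The whole flow of mapping, sorting by key, and numbering can be simulated in memory without a cluster. The sketch below is an illustration in plain Java, not the Hadoop implementation: the class `ShuffleSortSketch` and the method `run` are hypothetical, and a `TreeMap` stands in for the shuffle phase. Java's `String` ordering matches Hadoop's byte-wise `Text` ordering for ASCII input, which is assumed here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShuffleSortSketch {
    // Simulates map -> shuffle (group and sort by key) -> reduce in memory.
    public static List<String> run(List<String> lines) {
        // Map + shuffle: a TreeMap keeps keys sorted and groups values per key.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : lines) {
            if (!line.isEmpty()) {
                grouped.computeIfAbsent(line, k -> new ArrayList<>()).add("temp");
            }
        }
        // Reduce: one call per distinct key, numbered in arrival order.
        List<String> output = new ArrayList<>();
        int index = 0;
        for (String key : grouped.keySet()) {
            index++;
            // TextOutputFormat separates key and value with a tab by default.
            output.add(index + "\t" + key);
        }
        return output;
    }

    public static void main(String[] args) {
        for (String record : run(Arrays.asList("pear", "", "apple", "pear", "fig"))) {
            System.out.println(record);
        }
    }
}
```

With the sample input, the duplicate "pear" collapses into a single record because the reduce step runs once per distinct key and ignores the values.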

Finally, the driver program:

package com.apress.hadoop.examples.ch2;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class SortingDriver {
	public static void main(String[] args) throws Exception {
		if (args.length != 2) {
			System.err.println("Usage: SortingDriver <input path> <output path>");
			System.exit(2);
		}

		Configuration conf = new Configuration();
		// Job.getInstance replaces the deprecated Job constructor.
		Job job = Job.getInstance(conf, "sortingdata");
		job.setJarByClass(SortingDriver.class);
		job.setMapperClass(SortingMapper.class);
		job.setReducerClass(SortingReducer.class);
		// A single reducer keeps the output globally sorted and the index continuous.
		job.setNumReduceTasks(1);

		// The map output types differ from the final output types, so both are declared.
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(Text.class);
		job.setOutputKeyClass(IntWritable.class);
		job.setOutputValueClass(Text.class);
		job.setInputFormatClass(TextInputFormat.class);
		job.setOutputFormatClass(TextOutputFormat.class);

		// args[0] is the input path; args[1] is the output directory, which must not exist yet.
		FileInputFormat.addInputPath(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));

		boolean result = job.waitForCompletion(true);
		System.exit(result ? 0 : 1);
	}
}