MapReduce Walkthrough: Grep (Sort)

yangtom249

Category: Hadoop

Article link: https://blog.csdn.net/weixin_44153121/article/details/85283281

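GrepSort shows that the Mapper and Reducer classes shipped with Hadoop are enough to run a complete job. It also shows how to chain two jobs, with the output of the first job serving as the input of the second.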
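To make the two-stage data flow concrete, here is a minimal in-memory sketch in plain Java, with no Hadoop dependency. It is not the author's code: the class name `GrepSortSketch`, the method names, and the sample input are all made up for illustration. Stage 1 stands in for the first job (emit each regex match with a count of 1, then sum the counts), and stage 2 stands in for the second job (swap each pair to count/match and sort by count in decreasing order).

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GrepSortSketch {

    // Stage 1 (hypothetical stand-in for the first job): count every regex match.
    static Map<String, Long> grep(List<String> lines, String regex) {
        Pattern pattern = Pattern.compile(regex);
        Map<String, Long> counts = new HashMap<>();
        for (String line : lines) {
            Matcher matcher = pattern.matcher(line);
            while (matcher.find()) {
                // "map" emits (match, 1); merge() plays the role of the summing reducer
                counts.merge(matcher.group(0), 1L, Long::sum);
            }
        }
        return counts;
    }

    // Stage 2 (hypothetical stand-in for the second job): invert to (count, match)
    // and sort by count, largest first.
    static List<Map.Entry<Long, String>> sortByCount(Map<String, Long> counts) {
        List<Map.Entry<Long, String>> inverted = new ArrayList<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            inverted.add(new AbstractMap.SimpleEntry<>(e.getValue(), e.getKey()));
        }
        inverted.sort((a, b) -> Long.compare(b.getKey(), a.getKey()));
        return inverted;
    }

    public static void main(String[] args) {
        // Made-up sample input
        List<String> lines = Arrays.asList(
                "dfs.replication dfs.blocksize",
                "dfs.replication",
                "no match here");
        // The output of stage 1 is the input of stage 2, as in the chained jobs
        Map<String, Long> counts = grep(lines, "dfs\\.[a-z]+");
        for (Map.Entry<Long, String> e : sortByCount(counts)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

In the real program the intermediate result is written to a temporary directory between the two jobs instead of being passed in memory.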

1. The GrepSort class

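import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.CounterGroup;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.InverseMapper;
import org.apache.hadoop.mapreduce.lib.map.RegexMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * Extracts the strings that match a regex from the input files and counts them
 * 
 * @author cent
 *
 */
public class GrepSort extends Configured implements Tool {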

	private GrepSort() {
	}

	public static void main(String[] args) {
		int res = 0;// normal exit
		try {
			Configuration conf = new Configuration();
			// Address of the NameNode's HDFS file system; this can be a host plus port or a NameNode service name
			conf.set("fs.defaultFS", "hdfs://cos6743:9000");
			res = ToolRunner.run(conf, new GrepSort(), args);
		} catch (Exception e) {
			res = 1;
			e.printStackTrace();
		}
		// Terminate the running JVM; a non-zero argument signals an abnormal exit
		System.exit(res);
	}

	@Override
	public int run(String[] args) throws Exception {
		if (args.length < 3) {
			System.out.println("GrepSort <inDir> <outDir> <regex> [<group>]");
			ToolRunner.printGenericCommandUsage(System.out);
			return 2;
		}

		Configuration conf = getConf();
		// Set the regular expression
		conf.set(RegexMapper.PATTERN, args[2]);

		Path tempDir = new Path(args[1] + "/greptemp");
		// PathUtil is a helper class that is not shown in this post
		PathUtil.deletePath(conf, tempDir);
		// Optional capture group; GROUP is not used in this walkthrough
		if (args.length == 4)
			conf.set(RegexMapper.GROUP, args[3]);

		Job grepJob = Job.getInstance(conf);
		try {
			grepJob.setJobName("grep-search");
			grepJob.setJarByClass(GrepSort.class);

			FileInputFormat.setInputPaths(grepJob, args[0]);

			grepJob.setMapperClass(RegexMapper.class);
			grepJob.setCombinerClass(LongSumReducer.class);
			grepJob.setReducerClass(LongSumReducer.class);

			FileOutputFormat.setOutputPath(grepJob, tempDir);
			grepJob.setOutputFormatClass(SequenceFileOutputFormat.class);
			grepJob.setOutputKeyClass(Text.class);
			grepJob.setOutputValueClass(LongWritable.class);
			// true: print the progress to the user; stop here if the first job fails
			if (!grepJob.waitForCompletion(true)) {
				return 1;
			}

			Job sortJob = Job.getInstance(conf);
			sortJob.setJobName("grep-sort");
			sortJob.setJarByClass(GrepSort.class);

			// FileInputFormat is the base class of all file-based InputFormat implementations.
			// For text files it is extended by TextInputFormat, KeyValueTextInputFormat and NLineInputFormat;
			// for binary files, by SequenceFileInputFormat and others
			FileInputFormat.setInputPaths(sortJob, tempDir);
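			/**
			 * An InputFormat describes the format of the input data. It provides two functions.
			 * Splitting: divide the input into splits according to some policy, which determines the number of Map Tasks and the split each one reads.
			 * Feeding the Mapper: given a split, parse it into individual key/value pairs.
			 */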
			sortJob.setInputFormatClass(SequenceFileInputFormat.class);

			sortJob.setMapperClass(InverseMapper.class);

			sortJob.setNumReduceTasks(1); // write a single file
			Path resOut = new Path(args[1] + "/grepsort");
			PathUtil.deletePath(conf, resOut);
			/**
			 * An OutputFormat describes the format of the output data; it writes the key/value pairs supplied by the user to files in a specific format.
			 */
			FileOutputFormat.setOutputPath(sortJob, resOut);
			// sort by decreasing freq
			sortJob.setSortComparatorClass(LongWritable.DecreasingComparator.class);

			boolean sortOk = sortJob.waitForCompletion(true);

			// Print the value of every counter of the sort job
			Counters counters = sortJob.getCounters();
			for (CounterGroup group : counters) {
				System.out.println("--Group--" + group.getDisplayName()
						+ ": " + group.getName());
				for (Counter counter : group) {
					System.out.println(counter.getDisplayName() + ": "
							+ counter.getName() + ": " + counter.getValue());
				}
			}
			// Report failure through the exit code instead of always returning 0
			return sortOk ? 0 : 1;
		} catch (Exception ex) {
			ex.printStackTrace();
			return 1;
		} finally {
			// Delete the temporary directory that held the first job's output
			FileSystem.get(conf).delete(tempDir, true);
		}
	}
}

2. Running

2.1 Usage

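If fewer than three arguments are supplied, the program prints its own usage line, followed by the generic options section printed by ToolRunner.printGenericCommandUsage.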
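The usage line lists an optional `<group>` argument next to `<regex>`. As far as I know, RegexMapper emits the capture group with that index for every match and defaults to group 0, the whole match. The sketch below illustrates that selection with plain `java.util.regex` only. The class `RegexGroupSketch`, the method `extract`, and the sample line are hypothetical and not part of Hadoop or the author's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexGroupSketch {

    // Collect the chosen capture group of every match in the line.
    // group 0 is the whole match; group 1 is the first parenthesised part, and so on.
    static List<String> extract(String line, String regex, int group) {
        List<String> out = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(line);
        while (matcher.find()) {
            out.add(matcher.group(group));
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "user=alice user=bob"; // made-up sample input
        String regex = "user=(\\w+)";

        // Without <group> (group 0): the keys would be the full matches
        System.out.println(extract(line, regex, 0)); // [user=alice, user=bob]

        // With <group> = 1: the keys would be only the captured names
        System.out.println(extract(line, regex, 1)); // [alice, bob]
    }
}
```

Picking a group changes what gets counted: with group 1 in this sketch, the counts would be per user name rather than per `user=...` string.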

2.2 Run arguments

[Screenshot: regex]

3. Viewing the results

Command: hadoop fs -cat