Getting started with MapReduce: the "Hello World" program, WordCount

kk(●￣(ｴ)￣●)

Article link: https://blog.csdn.net/draught_bear/article/details/88698308

Project structure

[Image: screenshot of the project structure]

Code

WordCount.java

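In the driver, the input path is hard-coded in this line:

FileInputFormat.setInputPaths(job, new Path("/input/input.txt"));

The paths can instead be supplied as runtime arguments, that is, through the String[] args passed to main. Change the driver as follows (this also requires importing org.apache.hadoop.util.GenericOptionsParser):

String[] otherArgs = (new GenericOptionsParser(conf, args)).getRemainingArgs();
if (otherArgs.length < 2) {
    System.err.println("Usage: wordcount <in> [<in>...] <out>");
    System.exit(2);
}
// ... rest of the job setup omitted ...
// Every argument except the last one is an input path
for (int i = 0; i < otherArgs.length - 1; ++i) {
    FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
}
// The last argument is the output path
FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));

This way you no longer have to keep typing long path strings into the code. Another benefit is that when the files in HDFS change, you do not have to modify the path in the source either.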
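The argument convention above (every argument but the last is an input, the last is the output) can be tried without a cluster. The following is only a minimal plain-Java sketch of that convention: the class ArgSplitSketch and its two helper methods are hypothetical names introduced for illustration, they are not part of Hadoop, and the second input path in the example is made up.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, for illustration only (not part of Hadoop).
final class ArgSplitSketch {

    /** Returns every argument except the last one; these are treated as input paths. */
    static List<String> inputPaths(String[] args) {
        return Arrays.asList(args).subList(0, args.length - 1);
    }

    /** Returns the last argument, which is treated as the output path. */
    static String outputPath(String[] args) {
        return args[args.length - 1];
    }

    public static void main(String[] args) {
        // Same usage check as in the driver: at least one input and one output
        if (args.length < 2) {
            System.err.println("Usage: wordcount <in> [<in>...] <out>");
            System.exit(2);
        }
        System.out.println("inputs = " + inputPaths(args));
        System.out.println("output = " + outputPath(args));
    }
}
```

Called with the arguments /input/input.txt /input/other.txt /output, the sketch treats the first two as inputs and /output as the output directory, which mirrors the loop over otherArgs in the driver.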


package com.jxufe.xzy.wordcount;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;

public class WordCount {

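	/**
	 * Configures the WordCount job and submits it to the cluster.
	 *
	 * @param args command-line arguments (unused here, the paths are hard-coded)
	 * @throws IOException if the job cannot be set up or its paths cannot be accessed
	 * @throws InterruptedException if the thread is interrupted while waiting for the job
	 * @throws ClassNotFoundException if a job class cannot be found
	 */
	public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
		// Configuration: point the client at the HDFS NameNode
		Configuration conf = new Configuration();
		conf.set("fs.defaultFS", "hdfs://Master:9000");
		conf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
		Job job = Job.getInstance(conf);

		// Class used to locate the JAR that contains the job
		job.setJarByClass(WordCount.class);
		job.setMapperClass(MMapper.class);   // mapper class
		job.setReducerClass(RRducer.class);  // reducer class
		job.setCombinerClass(RRducer.class); // reuse the reducer as combiner to pre-sum counts on the map side
		job.setOutputKeyClass(Text.class);          // output key type
		job.setOutputValueClass(IntWritable.class); // output value type
		// Input file and output directory (the output directory must not exist yet)
		FileInputFormat.setInputPaths(job, new Path("/input/input.txt"));
		FileOutputFormat.setOutputPath(job, new Path("/output"));
		// Exit with 0 if the job succeeded, 1 otherwise
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}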

}

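MMapper.java

The Mapper's task is to process the <key,value> records of the input file and produce a series of <k1,v1>, <k2,v2> … <kn,vn> pairs. These pairs are passed on to the Reducer through the Context.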

package com.jxufe.xzy.wordcount;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Mapper;

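public class MMapper extends Mapper<Object, Text, Text, IntWritable> {
	// Constant 1, emitted once for every word occurrence
	public static final IntWritable one = new IntWritable(1);
	// Reused for every token instead of allocating a new Text per word
	private Text word = new Text();
	// Text can be thought of as Hadoop's counterpart of Java's String

	@Override
	public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
		// Convert value to a String and split it into individual words.
		// Without a delimiter argument, StringTokenizer splits on whitespace
		// (space, tab, newline, carriage return, form feed).
		/*
			Typical usage:
			StringTokenizer st = new StringTokenizer(str);
			while (st.hasMoreTokens()) {
				System.out.println(st.nextToken());
			}

			StringTokenizer(String str, String delim, boolean returnDelims)
			The first argument is the string to split, the second is the set of
			delimiter characters. If returnDelims is true, the delimiter
			characters are also returned as tokens.
		*/

		StringTokenizer itr = new StringTokenizer(value.toString());
		while (itr.hasMoreTokens()) {
			this.word.set(itr.nextToken());
			context.write(this.word, one);
			// context is loosely comparable to a session in web programming: it collects
			// the <word, 1> pairs emitted by map, which the framework later groups into
			// <k1, <v1, v2, ..., vn>> for the reducer, and it also exposes runtime
			// information such as the job configuration.
		}
	}

}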
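To see what the mapper emits for one line without running Hadoop, the tokenizing step can be reproduced in plain Java. This is a rough sketch under stated assumptions: MapStepSketch and its methods are hypothetical names, and a list of Map.Entry objects stands in for the calls to context.write. Only java.util.StringTokenizer is the same class the mapper uses.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical stand-in for MMapper.map, for illustration only.
final class MapStepSketch {

    /** Emulates map for a single line: one (word, 1) pair per whitespace-separated token. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line); // default whitespace delimiters
        while (itr.hasMoreTokens()) {
            emitted.add(new SimpleEntry<>(itr.nextToken(), 1)); // stands in for context.write(word, one)
        }
        return emitted;
    }

    /** Shows the three-argument constructor: explicit delimiters, optionally returned as tokens. */
    static List<String> tokens(String text, String delims, boolean returnDelims) {
        List<String> result = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delims, returnDelims);
        while (st.hasMoreTokens()) {
            result.add(st.nextToken());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(map("hello world hello")); // prints [hello=1, world=1, hello=1]
        System.out.println(tokens("a,b", ",", false)); // prints [a, b]
        System.out.println(tokens("a,b", ",", true));  // prints [a, ,, b]
    }
}
```

Note that the map step does no counting at all: the repeated word appears twice in the output, each time with the value 1. Adding the values up is left to the combiner and the reducer.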

RRducer.java

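The Reducer's task is to fetch its own share of the data from the Mappers and merge these <k1,v1>, <k2,v2> … <kn,vn> pairs: all values that share the same key are handed to a single reduce call, which here adds them up.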

package com.jxufe.xzy.wordcount;

import java.io.IOException;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Reducer;

public class RRducer extends Reducer<Text, IntWritable, Text, IntWritable> {

	@Override
	public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
		// Add up all counts received for this word
		int sum = 0;
		for (IntWritable val : values) {
			sum += val.get();
		}
		// Emit <word, total count>
		context.write(key, new IntWritable(sum));
	}

}
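The grouping and summing that happen between the mapper output and the final result can also be imitated in plain Java. The sketch below is an approximation, not how Hadoop implements the shuffle: ReduceStepSketch and its methods are hypothetical names, a TreeMap stands in for the sorted grouping by key, and plain Integer values stand in for IntWritable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the shuffle and RRducer.reduce, for illustration only.
final class ReduceStepSketch {

    /** Emulates the shuffle: collect a 1 per occurrence under each word, keys kept sorted. */
    static Map<String, List<Integer>> shuffle(List<String> words) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String w : words) {
            grouped.computeIfAbsent(w, k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    /** Emulates reduce for one key: sum all values that arrived for it. */
    static int reduce(Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    /** Runs shuffle and reduce over a list of already tokenized words. */
    static Map<String, Integer> wordCount(List<String> words) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(words).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("hello", "world", "hello");
        System.out.println(shuffle(words));   // prints {hello=[1, 1], world=[1]}
        System.out.println(wordCount(words)); // prints {hello=2, world=1}
    }
}
```

Because addition is associative and commutative, partial sums computed early give the same final totals, which is why the same summing logic can serve as both combiner and reducer in the driver.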