Hadoop: WordsCount, a First MapReduce Example

fuzuxian

Tags: Hadoop, MapReduce, WordsCount

Article link: https://blog.csdn.net/image_fzx/article/details/79393675

1. Creating your first project in Eclipse

Create a new Java project in Eclipse.

2. The wordscount.java program

package wordscount.demo;

import java.io.IOException;  
import java.util.StringTokenizer;   
import org.apache.hadoop.conf.Configuration;  
import org.apache.hadoop.fs.Path;  
import org.apache.hadoop.io.IntWritable;  
import org.apache.hadoop.io.LongWritable;  
import org.apache.hadoop.io.Text;  
import org.apache.hadoop.mapreduce.Job;  
import org.apache.hadoop.mapreduce.Mapper;  
import org.apache.hadoop.mapreduce.Reducer;  
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;  
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;  
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;  
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; 


public class wordscount {
	
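	// Mapper: for every input line, emit (word, 1) for each whitespace-separated token.
	public static class WordCountMap extends
			Mapper<LongWritable, Text, Text, IntWritable> {

		// Reused across calls so that no new objects are allocated per token.
		private final IntWritable one = new IntWritable(1);
		private final Text word = new Text();

		@Override
		public void map(LongWritable key, Text value, Context context)
				throws IOException, InterruptedException {
			// key is the line's byte offset (unused here); value is the line's text.
			String line = value.toString();
			StringTokenizer token = new StringTokenizer(line);
			while (token.hasMoreTokens()) {
				word.set(token.nextToken());
				context.write(word, one);
			}
		}
	}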

	// Reducer: receives one word together with all of its 1s and writes (word, total).
	public static class WordCountReduce extends
			Reducer<Text, IntWritable, Text, IntWritable> {

		@Override
		public void reduce(Text key, Iterable<IntWritable> values,
				Context context) throws IOException, InterruptedException {
			int sum = 0;
			for (IntWritable val : values) {
				sum += val.get();
			}
			context.write(key, new IntWritable(sum));
		}
	}

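	public static void main(String[] args) throws Exception {
		// Use the paths given on the command line; otherwise fall back to the
		// HDFS paths hard-coded in the original example.
		String input = args.length > 0 ? args[0] : "hdfs://Master:9000/wordcount/input";
		String output = args.length > 1 ? args[1] : "hdfs://Master:9000/wordcount/output";

		Configuration conf = new Configuration();
		Job job = Job.getInstance(conf);

		job.setJarByClass(wordscount.class);
		job.setJobName("wordcount");

		// Types of the final (reduce) output; the map output uses the same types here.
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);

		job.setMapperClass(WordCountMap.class);
		job.setReducerClass(WordCountReduce.class);

		job.setInputFormatClass(TextInputFormat.class);
		job.setOutputFormatClass(TextOutputFormat.class);

		FileInputFormat.setInputPaths(job, input);
		// The output directory must not exist yet, otherwise job submission fails.
		FileOutputFormat.setOutputPath(job, new Path(output));

		// Wait for the job to finish and report success or failure through the exit code.
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}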
}

3. Importing the Hadoop JAR files into Eclipse

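Right-click the WC project, then choose Build Path -> Configure Build Path... -> Libraries -> Add External JARs... and add the required JARs. The JARs needed for Hadoop programming are the ones directly inside each subdirectory of hadoop-2.7.3\share\hadoop\ (where there are any), plus the JARs in hadoop-2.7.3\share\hadoop\common\lib.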

Alternatively, you can copy all the JARs from hadoop-2.7.3 into a single _lib folder beforehand.

4. Packaging wordscount.java as a .jar file

To export the program as a JAR file in Eclipse:

Right-click the WC project, then choose Export -> Java -> JAR file ...

5. Running wordscount.jar on Hadoop

1) Prepare the test data files wordtest.txt and words.txt on the local machine, in the local directory /home/hadoop/.

vi /home/hadoop/wordtest.txt

Hello tom
Hello jim
Hello ketty
Hello world
Ketty tom

vi /home/hadoop/words.txt

Hello tom
Hello jim
Hello ketty
Hello world
Ketty tom

2) Create the input directory /wordcount/input on HDFS. There are two ways to do this:

Method 1: the command line

Create the input data folder on HDFS:

hadoop fs -mkdir -p /wordcount/input

Method 2: create the folder directly from Eclipse

a. Download the hadoop-eclipse-plugin-2.7.3.jar plugin and copy it into /usr/lib/eclipse/plugins/:

sudo cp .......~/hadoop-eclipse-plugin-2.7.3.jar /usr/lib/eclipse/plugins/

b. Restart Eclipse and open Window -> Preferences.

c. Go to Window -> Open Perspective -> Other and select Map/Reduce;

then go to Window -> Show View -> Other and select Map/Reduce.

d. Fill in the two port fields using the values configured in core-site.xml and hdfs-site.xml.

e. Eclipse now shows the DFS Locations entry.

3) Upload the local words.txt and wordtest.txt to HDFS

hadoop fs -put /home/hadoop/words.txt /wordcount/input

Check the upload with:

hadoop fs -cat /wordcount/input/words.txt

If the file contents are printed, the upload succeeded.

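4) Run the wordcount JAR from the command line. This form relies on the main class having been set when the JAR was exported; if it was not, add the class name wordscount.demo.wordscount right after wordscount.jar.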

hadoop jar wordscount.jar /wordcount/input /wordcount/output

5) View the results

Once the job has completed successfully, list the output directory and print the result file:

hadoop fs -ls /wordcount/output/

hadoop fs -cat /wordcount/output/part-r-00000
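The counts to expect in part-r-00000 can be worked out locally. The following is a minimal plain-Java sketch that imitates the map and reduce steps without any Hadoop classes. It assumes that only words.txt is in the input directory and that the job runs with a single reducer. The class name WordCountLocalSketch and its two helper methods are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Local stand-in for the job; no Hadoop classes are used here.
class WordCountLocalSketch {

    // Imitates map (split on whitespace, count 1 per token) and reduce (sum per word).
    // The TreeMap keeps the words sorted, as the keys reaching a single reducer are.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer token = new StringTokenizer(line);
            while (token.hasMoreTokens()) {
                counts.merge(token.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Formats each entry as "word<TAB>count", one record per line.
    static List<String> toOutputLines(Map<String, Integer> counts) {
        List<String> out = new ArrayList<>();
        counts.forEach((word, n) -> out.add(word + "\t" + n));
        return out;
    }

    public static void main(String[] args) {
        // The contents of words.txt from step 1.
        List<String> words = Arrays.asList(
                "Hello tom", "Hello jim", "Hello ketty", "Hello world", "Ketty tom");
        toOutputLines(count(words)).forEach(System.out::println);
    }
}
```

Because tokens are compared exactly as written, Ketty and ketty are counted as two different words, and the uppercase words sort before the lowercase ones.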

6. Program analysis

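1. The WordCountMap class extends org.apache.hadoop.mapreduce.Mapper. Its four generic type parameters are, in order, the map function's input key type, input value type, output key type, and output value type.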

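2. The WordCountReduce class extends org.apache.hadoop.mapreduce.Reducer. Its four generic type parameters have the same meaning as for the Mapper, this time applied to the reduce function.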

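3. The map output types must be the same as the reduce input types. In most jobs, including this one, the map output types are also the same as the reduce output types, so the reduce input and output types coincide. This is why the driver only calls setOutputKeyClass and setOutputValueClass; if the map output types were different, they would have to be declared separately with setMapOutputKeyClass and setMapOutputValueClass.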
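The type relationship in point 3 can be made visible with a small plain-Java sketch. The interfaces MiniMapper and MiniReducer and the run method below are hypothetical stand-ins written for this illustration, not Hadoop classes. They keep only the four type parameters, and run reuses the same type variables for the mapper's output and the reducer's input, so a mismatched pair would not compile.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.function.BiConsumer;

class TypeChainSketch {

    // Hypothetical stand-ins for Mapper and Reducer, reduced to their four type parameters.
    interface MiniMapper<KI, VI, KO, VO> {
        void map(KI key, VI value, BiConsumer<KO, VO> out);
    }

    interface MiniReducer<KI, VI, KO, VO> {
        void reduce(KI key, Iterable<VI> values, BiConsumer<KO, VO> out);
    }

    // K2 and V2 are both the mapper's output types and the reducer's input types,
    // so the compiler rejects a mapper/reducer pair whose types do not line up.
    static <K1, V1, K2 extends Comparable<K2>, V2, K3, V3> Map<K3, V3> run(
            Map<K1, V1> input,
            MiniMapper<K1, V1, K2, V2> mapper,
            MiniReducer<K2, V2, K3, V3> reducer) {
        // Group the map output values by key, sorted by key (a simplified shuffle).
        SortedMap<K2, List<V2>> grouped = new TreeMap<>();
        input.forEach((k, v) -> mapper.map(k, v,
                (k2, v2) -> grouped.computeIfAbsent(k2, x -> new ArrayList<>()).add(v2)));

        Map<K3, V3> result = new LinkedHashMap<>();
        grouped.forEach((k2, values) -> reducer.reduce(k2, values, result::put));
        return result;
    }

    static Map<String, Integer> wordCount(Map<Long, String> lines) {
        // Same shape as WordCountMap: (Long, String) in, (String, Integer) out.
        MiniMapper<Long, String, String, Integer> mapper = (offset, line, out) -> {
            for (String w : line.trim().split("\\s+")) {
                if (!w.isEmpty()) {
                    out.accept(w, 1);
                }
            }
        };
        // Same shape as WordCountReduce: input and output are both (String, Integer).
        MiniReducer<String, Integer, String, Integer> reducer = (word, ones, out) -> {
            int sum = 0;
            for (int one : ones) {
                sum += one;
            }
            out.accept(word, sum);
        };
        return run(lines, mapper, reducer);
    }

    public static void main(String[] args) {
        Map<Long, String> lines = new LinkedHashMap<>();
        lines.put(0L, "Hello tom");
        lines.put(10L, "Ketty tom");
        System.out.println(wordCount(lines));
    }
}
```

Changing the reducer's first two type arguments to anything other than String and Integer makes the call to run fail to compile. Hadoop enforces the same rule, but it checks the declared classes at run time rather than at compile time.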

4. Hadoop determines the format of the input from the following line:

job.setInputFormatClass(TextInputFormat.class);

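TextInputFormat is Hadoop's default input format and extends FileInputFormat. It divides the input data into smaller pieces called InputSplits, each of which is processed by one mapper. The InputFormat also supplies a RecordReader implementation that parses an InputSplit into <key,value> pairs and passes them to the map function: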

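key: the byte offset of the start of the line within the file, of type LongWritable.

value: the content of the line, of type Text.

Therefore, in this example the map function's input key/value types are LongWritable and Text.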
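To make these (offset, line) records concrete, the following plain-Java sketch computes them for the contents of words.txt. It does not use Hadoop's RecordReader. The class name LineOffsetSketch is made up for this illustration, and the sketch assumes UTF-8 text with a single "\n" at the end of each line.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class LineOffsetSketch {

    // Splits file content into (byte offset of the line start, line text) pairs,
    // the same shape as the (LongWritable, Text) records passed to map().
    static Map<Long, String> records(String fileContent) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : fileContent.split("\n")) {
            records.put(offset, line);
            // Advance by the line's UTF-8 length plus one byte for the newline.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        String content = "Hello tom\nHello jim\nHello ketty\nHello world\nKetty tom\n";
        records(content).forEach((offset, line) -> System.out.println(offset + "\t" + line));
    }
}
```

For this input the keys are 0, 10, 20, 32 and 44. WordCountMap ignores the key and only tokenizes the value.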

5. Hadoop determines the format of the output from the following line:

job.setOutputFormatClass(TextOutputFormat.class);

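TextOutputFormat is Hadoop's default output format. It writes each record as one line of a text file, with the key and the value separated by a tab by default, for example: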

the 30

happy 23

…