Hadoop 2.x MapReduce (MR V1) WordCounting Example

Before reading this post, please go through my previous post, “How MapReduce Algorithm Works”, to get some idea about the MapReduce algorithm. That post has already explained “How MapReduce performs WordCounting” theoretically.

And if you are not familiar with basic HDFS commands, please go through my post “Hadoop HDFS Basic Developer Commands” to get some basic knowledge about how to execute HDFS commands in the Cloudera environment.

In this post, we are going to develop the same WordCounting program using the Hadoop 2 MapReduce API and test it in the Cloudera environment.

MapReduce WordCounting Example

We need to write the following three programs to develop and test the MapReduce WordCount example:

  1. Mapper Program

  2. Reducer Program

  3. Client Program


NOTE:-
To develop MapReduce programs, there are two versions of the MR API:

  1. One from Hadoop 1.x (MapReduce Old API)

  2. Another from Hadoop 2.x (MapReduce New API)


In Hadoop 2.x, the MapReduce Old API is deprecated, so we are going to concentrate on the MapReduce New API to develop this WordCount example. The quickest way to tell the two APIs apart is the package name, as sketched below.

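For reference only (these two imports would clash if placed in a single source file, so this is not meant as a compilable snippet):

import org.apache.hadoop.mapred.Mapper;      // Hadoop 1.x old API (deprecated)
import org.apache.hadoop.mapreduce.Mapper;   // Hadoop 2.x new API (used in this post)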

The Cloudera environment already provides an Eclipse IDE set up with the Hadoop 2.x API, so it is very easy to develop and test MapReduce programs using this setup.

To develop the WordCount MapReduce application, please use the following steps:

  • Open the default Eclipse IDE provided by the Cloudera environment.

  • We can use an already created project or create a new Java project.

  • For simplicity, I’m going to use the existing “training” Java project. All required Hadoop 2.x JARs have already been added to this project’s classpath, so it is a ready-to-use Eclipse Java project.

  • Create WordCount Mapper Program

  • Create WordCount Reducer Program

  • Create WordCount Client Program to test this application

Let us start developing these three programs in the next sections.

Mapper Program

Create a “WordCountMapper” Java class which extends the Mapper class as shown below:

package com.journaldev.hadoop.mrv1.wordcount;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

	@Override
	public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
		// Split the incoming line on whitespace and emit <word, 1>
		// for every word, so the reducer can sum counts per unique word.
		for (String w : value.toString().split("\\s+")) {
			if (!w.isEmpty()) {
				context.write(new Text(w), new IntWritable(1));
			}
		}
	}

}

Code Explanation:

  • Our WordCountMapper class extends the Hadoop 2 MapReduce API class “Mapper”.

  • The Mapper class is parameterized with generic types as Mapper<LongWritable, Text, Text, IntWritable>.

  • Here, in <LongWritable, Text, Text, IntWritable>:

  1. The first two, <LongWritable, Text>, represent the input data types of our WordCount’s Mapper program.

  2. For example: in our case, we provide a file (a huge amount of data, in any text format). The mapper reads each line from this file along with a unique number (the line’s starting byte offset), as shown below:

    <Unique_Long_Number, Line_Read_From_Input_File>

    In the Hadoop MapReduce API, this corresponds to <LongWritable, Text>.

  3. The last two, <Text, IntWritable>, represent the output data types of our WordCount’s Mapper program.

  4. For example: in our case, WordCount’s Mapper program gives output as shown below:

    <Unique_Word_From_Input_File, Word_Count>

    In the Hadoop MapReduce API, this corresponds to <Text, IntWritable>.

  • We have implemented the Mapper’s map() method and provided our map function logic here. A small worked example follows this list.

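To make the data flow concrete, here is a small, hypothetical illustration (the sample line is made up, not taken from a real input file):

    Input to map():   <0, "to be or not to be">
    Output of map():  <to, 1>, <be, 1>, <or, 1>, <not, 1>, <to, 1>, <be, 1>
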
Reducer Program

Create a “WordCountReducer” Java class which extends the Reducer class as shown below:

package com.journaldev.hadoop.mrv1.wordcount;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

	@Override
	public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
		// Sum all the 1s emitted by the mappers for this word.
		int sum = 0;
		for (IntWritable val : values) {
			sum += val.get();
		}
		// Emit <word, total count>.
		context.write(key, new IntWritable(sum));
	}

}

Code Explanation:

  • Our WordCountReducer class extends the Hadoop 2 MapReduce API class “Reducer”.
  • The Reducer class is parameterized with generic types as Reducer<Text, IntWritable, Text, IntWritable>.
  • Here, in <Text, IntWritable, Text, IntWritable>:

  1. The first two, <Text, IntWritable>, represent the input data types of our WordCount’s Reducer program.
  2. For example: in our case, our Mapper program produces <Text, IntWritable> output, which becomes the input of the Reducer program:

    <Unique_Word_From_Input_File, Word_Count>

    In the Hadoop MapReduce API, this corresponds to <Text, IntWritable>.

  3. The last two, <Text, IntWritable>, represent the output data types of our WordCount’s Reducer program.
  4. For example: in our case, WordCount’s Reducer program gives output as shown below:

    <Unique_Word_From_Input_File, Total_Word_Count>

    In the Hadoop MapReduce API, this corresponds to <Text, IntWritable>.

  • We have implemented the Reducer’s reduce() method and provided our reduce function logic here. A small worked example follows this list.

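Continuing the hypothetical illustration from the Mapper section (the framework groups the mapper output by key before calling reduce(); the words are made up):

    Input to reduce():   <be, [1, 1]>, <not, [1]>, <or, [1]>, <to, [1, 1]>
    Output of reduce():  <be, 2>, <not, 1>, <or, 1>, <to, 2>
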
Client Program

Create a “WordCountClient” Java class with a main() method as shown below:

package com.journaldev.hadoop.mrv1.wordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountClient {

	public static void main(String[] args) throws Exception {
		Job job = Job.getInstance(new Configuration());
		job.setJarByClass(WordCountClient.class);
		// Output key/value types of the job.
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		// Wire in our Mapper and Reducer classes.
		job.setMapperClass(WordCountMapper.class);
		job.setReducerClass(WordCountReducer.class);
		// Read plain text lines in, write "key<TAB>value" lines out.
		job.setInputFormatClass(TextInputFormat.class);
		job.setOutputFormatClass(TextOutputFormat.class);
		// Input file and output directory come from the command line.
		FileInputFormat.setInputPaths(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		// Submit the job and wait; exit 0 on success, 1 on failure.
		boolean status = job.waitForCompletion(true);
		if (status) {
			System.exit(0);
		}
		else {
			System.exit(1);
		}
	}

}

Code Explanation:

  • The Hadoop 2 MapReduce API has a “Job” class in the “org.apache.hadoop.mapreduce” package.
  • The Job class is used to create jobs (Map/Reduce jobs) that perform our WordCounting tasks.
  • The client program uses the Job object’s setter methods to configure all MapReduce components: the Mapper, the Reducer, the input data type, the output data type, and so on.
  • These jobs will perform our WordCounting mapping and reducing tasks.

NOTE:-

  • As we discussed in my previous post, the MapReduce algorithm uses 3 functions: a Map function, a Combine function, and a Reduce function.
  • By observing these 3 programs, we can notice one thing: we have developed only two functions, Map and Reduce. Then what about the Combine function?
  • The answer is that no Combine function runs in this job: Hadoop applies a combiner only when one is explicitly set on the Job, which we have not done (see the sketch after this list).
  • We will discuss “How to develop a Combine Function” in my coming posts.

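For WordCount specifically, the summing logic is associative and commutative, so the reducer class itself can double as the combiner. One extra line in the client program (my addition, not part of the listing above) would enable it:

job.setCombinerClass(WordCountReducer.class);

With this set, each mapper’s output is pre-summed locally before being shuffled across the network to the reducers.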

Now we have developed all required components (programs). It’s time to test them.

Test MapReduce WordCounting Example

Our WordCounting project’s final structure consists of the three classes developed above, WordCountMapper, WordCountReducer, and WordCountClient, in the com.journaldev.hadoop.mrv1.wordcount package.

Please use the following steps to test our MapReduce application.

  • Create our WordCount application JAR file using the Eclipse IDE.
  • Execute the following “hadoop” command to run our WordCounting application.
  • Syntax:-

    hadoop jar <our-Jar-file-path> <Client-program>  <Input-Path> <Output-Path>

Let us assume that we have already created the “/ram/mrv1” folder structure in the Hadoop HDFS filesystem; note that the output directory itself (“/ram/mrv1/output”) must not already exist when the job starts, or the job will fail. If you have not done that, please go through my previous post at “Hadoop HDFS Basic Developer Commands” to create it, or use the command sketched below.

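A standard HDFS command like the following (using this example’s paths) creates the parent folder:

hadoop fs -mkdir -p /ram/mrv1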

Example:-

    hadoop jar /home/cloudera/JDWordCountMapReduceApp.jar  
           com.journaldev.hadoop.mrv1.wordcount.WordCountClient 
           /ram/mrv1/NASDAQ_daily_prices_C.csv
           /ram/mrv1/output

NOTE:-
Just for readability, I have split the command across multiple lines. Please type this command on a single line.

By going through the job’s console log, we can observe how the Map and Reduce tasks work to solve our WordCounting problem.

  • Execute the following “hadoop” command to view the output directory content:
  • hadoop fs -ls /ram/mrv1/output/

    It shows the content of the “/ram/mrv1/output/” directory; a successful run typically contains a _SUCCESS marker file and a part-r-00000 file holding the results.

  • Execute the following “hadoop” command to view our WordCounting application output:
  • hadoop fs -cat /ram/mrv1/output/part-r-00000

    This command displays the WordCounting application output. As my output file is too big, I’m not able to show my file output here, but its shape is illustrated below.

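TextOutputFormat writes one “key<TAB>value” line per unique word. The words and counts below are purely hypothetical, just to show the shape of the output:

    some_word	1042
    another_word	87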

NOTE:-
Here we have used some Hadoop HDFS commands to run and test our WordCounting application. If you are not familiar with HDFS commands, please go through my “Hadoop HDFS Basic Developer Commands” post.

That’s all about the Hadoop 2.x MapReduce WordCounting example. We will develop some more useful MapReduce programs in my coming posts.

Please drop me a comment if you like my post or have any issues/suggestions.

Translated from: https://www.journaldev.com/8921/hadoop2-mapreduce-wordcounting-example
