Hadoop 2.x MapReduce (MR V1) WordCounting Example

Before reading this post, please go through my previous post, “How MapReduce Algorithm Works”, to get some idea about the MapReduce algorithm. That post has already explained “How MapReduce performs WordCounting” theoretically.

And if you are not familiar with basic HDFS commands, please go through my post “Hadoop HDFS Basic Developer Commands” to get some basic knowledge about how to execute HDFS commands in the Cloudera environment.

In this post, we are going to develop the same WordCounting program using the Hadoop 2 MapReduce API and test it in the Cloudera environment.

MapReduce WordCounting Example

We need to write the following three programs to develop and test the MapReduce WordCount example:

  1. Mapper Program

  2. Reducer Program

  3. Client Program


NOTE:-
To develop MapReduce programs, there are two versions of the MR API:

  1. One from Hadoop 1.x (MapReduce Old API)

  2. Another from Hadoop 2.x (MapReduce New API)


In Hadoop 2.x, the MapReduce Old API is deprecated, so we are going to concentrate on the MapReduce New API to develop this WordCount example. The quickest way to tell the two APIs apart is the package name, as sketched below.

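For reference only (these two imports would clash if placed in a single source file, so this is not meant as a compilable snippet):

import org.apache.hadoop.mapred.Mapper;      // Hadoop 1.x old API (deprecated)
import org.apache.hadoop.mapreduce.Mapper;   // Hadoop 2.x new API (used in this post)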

The Cloudera environment already provides an Eclipse IDE set up with the Hadoop 2.x API, so it is very easy to develop and test MapReduce programs using this setup.

To develop the WordCount MapReduce application, please use the following steps:

  • Open the default Eclipse IDE provided by the Cloudera environment.

  • We can use an already created project or create a new Java project.

  • For simplicity, I’m going to use the existing “training” Java project. All required Hadoop 2.x JARs have already been added to this project’s classpath, so it is a ready-to-use Eclipse Java project.

  • Create WordCount Mapper Program

  • Create WordCount Reducer Program

  • Create WordCount Client Program to test this application

Let us start developing these three programs in the next sections.

Mapper Program

Create a “WordCountMapper” Java class which extends the Mapper class as shown below:

package com.journaldev.hadoop.mrv1.wordcount;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

	@Override
	public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
		// Split the incoming line on whitespace and emit <word, 1>
		// for every word, so the reducer can sum counts per unique word.
		for (String w : value.toString().split("\\s+")) {
			if (!w.isEmpty()) {
				context.write(new Text(w), new IntWritable(1));
			}
		}
	}

}

Code Explanation:

  • Our WordCountMapper class extends the Hadoop 2 MapReduce API class “Mapper”.

  • The Mapper class is parameterized with generic types as Mapper<LongWritable, Text, Text, IntWritable>.

  • Here, in <LongWritable, Text, Text, IntWritable>:

  1. The first two, <LongWritable, Text>, represent the input data types of our WordCount’s Mapper program.

  2. For example: in our case, we provide a file (a huge amount of data, in any text format). The mapper reads each line from this file along with a unique number (the line’s starting byte offset), as shown below:

    <Unique_Long_Number, Line_Read_From_Input_File>

    In the Hadoop MapReduce API, this corresponds to <LongWritable, Text>.

  3. The last two, <Text, IntWritable>, represent the output data types of our WordCount’s Mapper program.

  4. For example: in our case, WordCount’s Mapper program gives output as shown below:

    <Unique_Word_From_Input_File, Word_Count>

    In the Hadoop MapReduce API, this corresponds to <Text, IntWritable>.

  • We have implemented the Mapper’s map() method and provided our map function logic here. A small worked example follows this list.

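To make the data flow concrete, here is a small, hypothetical illustration (the sample line is made up, not taken from a real input file):

    Input to map():   <0, "to be or not to be">
    Output of map():  <to, 1>, <be, 1>, <or, 1>, <not, 1>, <to, 1>, <be, 1>
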
Reducer Program

Create a “WordCountReducer” Java class which extends the Reducer class as shown below:

package com.journaldev.hadoop.mrv1.wordcount;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

	@Override
	public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
		// Sum all the 1s emitted by the mappers for this word.
		int sum = 0;
		for (IntWritable val : values) {
			sum += val.get();
		}
		// Emit <word, total count>.
		context.write(key, new IntWritable(sum));
	}

}

Code Explanation:

  • Our WordCountReducer class extends the Hadoop 2 MapReduce API class “Reducer”.
  • The Reducer class is parameterized with generic types as Reducer<Text, IntWritable, Text, IntWritable>.
  • Here, in <Text, IntWritable, Text, IntWritable>:

  1. The first two, <Text, IntWritable>, represent the input data types of our WordCount’s Reducer program.
  2. For example: in our case, our Mapper program produces <Text, IntWritable> output, which becomes the input of the Reducer program:

    <Unique_Word_From_Input_File, Word_Count>

    In the Hadoop MapReduce API, this corresponds to <Text, IntWritable>.

  3. The last two, <Text, IntWritable>, represent the output data types of our WordCount’s Reducer program.
  4. For example: in our case, WordCount’s Reducer program gives output as shown below:

    <Unique_Word_From_Input_File, Total_Word_Count>

    In the Hadoop MapReduce API, this corresponds to <Text, IntWritable>.

  • We have implemented the Reducer’s reduce() method and provided our reduce function logic here. A small worked example follows this list.

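Continuing the hypothetical illustration from the Mapper section (the framework groups the mapper output by key before calling reduce(); the words are made up):

    Input to reduce():   <be, [1, 1]>, <not, [1]>, <or, [1]>, <to, [1, 1]>
    Output of reduce():  <be, 2>, <not, 1>, <or, 1>, <to, 2>
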
Client Program

Create a “WordCountClient” Java class with a main() method as shown below:

package com.journaldev.hadoop.mrv1.wordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountClient {

	public static void main(String[] args) throws Exception {
		Job job = Job.getInstance(new Configuration());
		job.setJarByClass(WordCountClient.class);
		// Output key/value types of the job.
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		// Wire in our Mapper and Reducer classes.
		job.setMapperClass(WordCountMapper.class);
		job.setReducerClass(WordCountReducer.class);
		// Read plain text lines in, write "key<TAB>value" lines out.
		job.setInputFormatClass(TextInputFormat.class);
		job.setOutputFormatClass(TextOutputFormat.class);
		// Input file and output directory come from the command line.
		FileInputFormat.setInputPaths(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		// Submit the job and wait; exit 0 on success, 1 on failure.
		boolean status = job.waitForCompletion(true);
		if (status) {
			System.exit(0);
		}
		else {
			System.exit(1);
		}
	}

}

Code Explanation:

  • The Hadoop 2 MapReduce API has a “Job” class in the “org.apache.hadoop.mapreduce” package.
  • The Job class is used to create jobs (Map/Reduce jobs) that perform our WordCounting tasks.
  • The client program uses the Job object’s setter methods to configure all MapReduce components: the Mapper, the Reducer, the input data type, the output data type, and so on.
  • These jobs will perform our WordCounting mapping and reducing tasks.

NOTE:-

  • As we discussed in my previous post, the MapReduce algorithm uses 3 functions: a Map function, a Combine function, and a Reduce function.
  • By observing these 3 programs, we can notice one thing: we have developed only two functions, Map and Reduce. Then what about the Combine function?
  • The answer is that no Combine function runs in this job: Hadoop applies a combiner only when one is explicitly set on the Job, which we have not done (see the sketch after this list).
  • We will discuss “How to develop a Combine Function” in my coming posts.

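For WordCount specifically, the summing logic is associative and commutative, so the reducer class itself can double as the combiner. One extra line in the client program (my addition, not part of the listing above) would enable it:

job.setCombinerClass(WordCountReducer.class);

With this set, each mapper’s output is pre-summed locally before being shuffled across the network to the reducers.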

Now we have developed all required components (programs). It’s time to test them.

Test MapReduce WordCounting Example

Our WordCounting project’s final structure consists of the three classes developed above, WordCountMapper, WordCountReducer, and WordCountClient, in the com.journaldev.hadoop.mrv1.wordcount package.

Please use the following steps to test our MapReduce application.

  • Create our WordCount application JAR file using the Eclipse IDE.
  • Execute the following “hadoop” command to run our WordCounting application.
  • Syntax:-

    hadoop jar <our-Jar-file-path> <Client-program>  <Input-Path> <Output-Path>

Let us assume that we have already created the “/ram/mrv1” folder structure in the Hadoop HDFS filesystem; note that the output directory itself (“/ram/mrv1/output”) must not already exist when the job starts, or the job will fail. If you have not done that, please go through my previous post at “Hadoop HDFS Basic Developer Commands” to create it, or use the command sketched below.

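A standard HDFS command like the following (using this example’s paths) creates the parent folder:

hadoop fs -mkdir -p /ram/mrv1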

Example:-

    hadoop jar /home/cloudera/JDWordCountMapReduceApp.jar  
           com.journaldev.hadoop.mrv1.wordcount.WordCountClient 
           /ram/mrv1/NASDAQ_daily_prices_C.csv
           /ram/mrv1/output

NOTE:-
Just for readability, I have split the command across multiple lines. Please type this command on a single line.

By going through the job’s console log, we can observe how the Map and Reduce tasks work to solve our WordCounting problem.

  • Execute the following “hadoop” command to view the output directory content:
  • hadoop fs -ls /ram/mrv1/output/

    It shows the content of the “/ram/mrv1/output/” directory; a successful run typically contains a _SUCCESS marker file and a part-r-00000 file holding the results.

  • Execute the following “hadoop” command to view our WordCounting application output:
  • hadoop fs -cat /ram/mrv1/output/part-r-00000

    This command displays the WordCounting application output. As my output file is too big, I’m not able to show my file output here, but its shape is illustrated below.

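TextOutputFormat writes one “key<TAB>value” line per unique word. The words and counts below are purely hypothetical, just to show the shape of the output:

    some_word	1042
    another_word	87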

NOTE:-
Here we have used some Hadoop HDFS commands to run and test our WordCounting application. If you are not familiar with HDFS commands, please go through my “Hadoop HDFS Basic Developer Commands” post.

That’s all about the Hadoop 2.x MapReduce WordCounting example. We will develop some more useful MapReduce programs in my coming posts.

Please drop me a comment if you like my post or have any issues/suggestions.

Translated from: https://www.journaldev.com/8921/hadoop2-mapreduce-wordcounting-example
