1. Deploying a Pseudo-Distributed Hadoop Cluster
- Cluster version: 2.10.1, download link: hadoop-2.10.1
- Deployment follows the official documentation: Pseudo-Distributed Operation. Points where this setup deviates from the official doc:
- Before starting HDFS, configure a real JAVA_HOME path in `hadoop-env.sh` instead of the `${JAVA_HOME}` reference, for example:
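  A minimal sketch, assuming OpenJDK 8 at the Debian/Ubuntu default location (the same path this setup uses for yarn-env.sh below):

  ```bash
  # In etc/hadoop/hadoop-env.sh, replace the ${JAVA_HOME} reference with a literal path
  export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
  ```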
- After HDFS starts, three processes should be running: NameNode, DataNode, and SecondaryNameNode; see the `jps` check below.
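  A quick sanity check with the JDK's `jps` tool (PIDs will differ):

  ```bash
  jps
  # Expected to list, among others:
  #   NameNode
  #   DataNode
  #   SecondaryNameNode
  ```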
- `yarn-site.xml` needs an explicit ResourceManager web port, in case the default port 8088 is already occupied:

  ```xml
  <configuration>
      <property>
          <name>yarn.nodemanager.aux-services</name>
          <value>mapreduce_shuffle</value>
      </property>
      <property>
          <name>yarn.resourcemanager.webapp.address</name>
          <value>hadoop:23030</value>
      </property>
  </configuration>
  ```
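  To check whether 8088 is actually occupied before picking another port, one option (assuming the iproute2 `ss` utility is available):

  ```bash
  # List listening TCP sockets on the default ResourceManager web port
  ss -ltn | grep 8088
  ```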
- Before starting YARN, a real JAVA_HOME path likewise needs to be configured, this time in `yarn-env.sh`. JAVA_HOME is commented out in that file, so the following can be added directly:

  ```bash
  export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
  ```
- After YARN starts, two new Java processes appear: NodeManager and ResourceManager
- Configure the Hadoop environment variables so commands can be run from anywhere, without first changing into the installation directory:

  ```bash
  export HADOOP_HOME=/home/hadoop/hadoop-2.10.1
  export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
  export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
  export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib:$HADOOP_COMMON_LIB_NATIVE_DIR"
  ```
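  For these variables to apply to every new shell, one option (assuming bash is the login shell) is to append them to ~/.bashrc and reload it:

  ```bash
  # Reload the shell configuration after appending the exports above
  source ~/.bashrc
  # Verify: this should print the Hadoop version from any directory
  hadoop version
  ```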
2. Writing the MR Program
- Configure the Maven dependencies:

  ```xml
  <properties>
      <hadoop.version>2.10.1</hadoop.version>
  </properties>

  <dependencies>
      <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-client</artifactId>
          <version>${hadoop.version}</version>
      </dependency>
      <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-common</artifactId>
          <version>${hadoop.version}</version>
      </dependency>
      <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-hdfs</artifactId>
          <version>${hadoop.version}</version>
      </dependency>
  </dependencies>

  <build>
      <plugins>
          <plugin>
              <groupId>org.apache.maven.plugins</groupId>
              <artifactId>maven-compiler-plugin</artifactId>
              <version>3.8.1</version>
              <configuration>
                  <source>1.8</source>
                  <target>1.8</target>
              </configuration>
          </plugin>
      </plugins>
  </build>
  ```
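  With the POM in place, the job jar can be built with Maven. The jar name below assumes an artifactId of `score-mr` and version `1.0-SNAPSHOT`, matching the jar used in the run command later:

  ```bash
  # Compile and package the job jar into target/
  mvn clean package
  # The resulting artifact, given artifactId score-mr and version 1.0-SNAPSHOT
  ls target/score-mr-1.0-SNAPSHOT.jar
  ```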
- Write the mapper:

  ```java
  package com.hadoop.score.mr;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  import java.io.IOException;

  public class ScoreMap extends Mapper<LongWritable, Text, Text, IntWritable> {

      @Override
      protected void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
          // Each input value is one line; split it into subject and score
          String line = value.toString();
          String[] data = line.split(",");
          int score = Integer.parseInt(data[1]);

          // Emit <subject, score>
          context.write(new Text(data[0]), new IntWritable(score));
      }
  }
  ```
- Write the reducer:

  ```java
  package com.hadoop.score.mr;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  import java.io.IOException;

  public class ScoreReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
          // Track the maximum score seen for this subject
          int maxScore = Integer.MIN_VALUE;
          for (IntWritable score : values) {
              maxScore = Math.max(maxScore, score.get());
          }

          // Emit <subject, max score>
          context.write(key, new IntWritable(maxScore));
      }
  }
  ```
- Write the main class that creates and configures the job:

  ```java
  package com.hadoop.score.mr;

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class ScoreMR {

      public static void main(String[] args) throws Exception {
          if (args.length != 2) {
              System.out.println("Usage: ScoreMR <input_path> <output_path>");
              System.exit(-1);
          }

          // Create an MR job: the client-side unit of work to be submitted
          // (Job.getInstance() replaces the deprecated new Job() constructor)
          Job job = Job.getInstance();
          job.setJarByClass(ScoreMR.class);
          job.setJobName("score mapreduce");

          // Set the input/output directories
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));

          // Set the mapper and reducer
          job.setMapperClass(ScoreMap.class);
          job.setReducerClass(ScoreReducer.class);

          // Set the key-value types of the output
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);

          // Exit once the job has finished
          System.exit(job.waitForCompletion(true) ? 0 : -1);
      }
  }
  ```
3. Running the MR Program
- First prepare the data file, score.txt, with one `subject,score` record per line:

  ```text
  math,86
  english,124
  english,80
  math,93
  math,77
  ```
- Upload the file to HDFS; the full destination path is /user/hadoop/input:

  ```bash
  hdfs dfs -put score.txt input/
  ```
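  To confirm the upload landed where expected (relative paths resolve against the user's HDFS home, here /user/hadoop):

  ```bash
  # List the uploaded file; expects /user/hadoop/input/score.txt
  hdfs dfs -ls input/
  ```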
- Run the MR program:

  ```bash
  hadoop jar score-mr-1.0-SNAPSHOT.jar com.hadoop.score.mr.ScoreMR input/score.txt output/score_2
  ```
- At first the program kept failing; recording the error here for reference:

  ```text
  [2021-04-06 01:16:31.227]Container exited with a non-zero exit code 255. Error file: prelaunch.err.
  Last 4096 bytes of prelaunch.err :
  Last 4096 bytes of stderr :
  Apr 06, 2021 1:16:26 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
  INFO: Registering org.apache.hadoop.mapreduce.v2.app.webapp.JAXBContextResolver as a provider class
  Apr 06, 2021 1:16:26 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
  INFO: Registering org.apache.hadoop.yarn.webapp.GenericExceptionHandler as a provider class
  Apr 06, 2021 1:16:26 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
  INFO: Registering org.apache.hadoop.mapreduce.v2.app.webapp.AMWebServices as a root resource class
  Apr 06, 2021 1:16:26 AM com.sun.jersey.server.impl.application.WebApplicationImpl _initiate
  INFO: Initiating Jersey application, version 'Jersey: 1.9 09/02/2011 11:17 AM'
  Apr 06, 2021 1:16:27 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
  INFO: Binding org.apache.hadoop.mapreduce.v2.app.webapp.JAXBContextResolver to GuiceManagedComponentProvider with the scope "Singleton"
  Apr 06, 2021 1:16:27 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
  INFO: Binding org.apache.hadoop.yarn.webapp.GenericExceptionHandler to GuiceManagedComponentProvider with the scope "Singleton"
  Apr 06, 2021 1:16:28 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
  INFO: Binding org.apache.hadoop.mapreduce.v2.app.webapp.AMWebServices to GuiceManagedComponentProvider with the scope "PerRequest"
  log4j:WARN No appenders could be found for logger (org.apache.hadoop.mapreduce.v2.app.MRAppMaster).
  log4j:WARN Please initialize the log4j system properly.
  log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
  ```
- Cause of the error: the machine's hostname contained an underscore. Fix: change the hostname (both /etc/hostname and /etc/hosts), reboot the machine, then restart the Hadoop cluster; a sketch follows below.
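  As a sketch, on a systemd-based machine the rename can be done with hostnamectl; the name `hadoop` is an assumption here, chosen to match the host used in yarn-site.xml above:

  ```bash
  # Check the current hostname for underscores
  hostname
  # Rename the machine (assumes systemd; pick a name without underscores)
  sudo hostnamectl set-hostname hadoop
  # /etc/hosts also needs a matching entry, e.g.:
  #   127.0.0.1   hadoop
  ```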
- Rerunning the job then produces the following output:

  ```text
  21/04/06 09:50:46 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
  21/04/06 09:50:47 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
  21/04/06 09:50:48 INFO input.FileInputFormat: Total input files to process : 1
  21/04/06 09:50:49 INFO mapreduce.JobSubmitter: number of splits:1
  21/04/06 09:50:49 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1617673573380_0002
  21/04/06 09:50:49 INFO conf.Configuration: resource-types.xml not found
  21/04/06 09:50:49 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
  21/04/06 09:50:49 INFO resource.ResourceUtils: Adding resource type - name = memory-mb, units = Mi, type = COUNTABLE
  21/04/06 09:50:49 INFO resource.ResourceUtils: Adding resource type - name = vcores, units = , type = COUNTABLE
  21/04/06 09:50:50 INFO impl.YarnClientImpl: Submitted application application_1617673573380_0002
  21/04/06 09:50:50 INFO mapreduce.Job: The url to track the job: http://hadoop:23030/proxy/application_1617673573380_0002/
  21/04/06 09:50:50 INFO mapreduce.Job: Running job: job_1617673573380_0002
  21/04/06 09:51:01 INFO mapreduce.Job: Job job_1617673573380_0002 running in uber mode : false
  21/04/06 09:51:01 INFO mapreduce.Job:  map 0% reduce 0%
  21/04/06 09:51:10 INFO mapreduce.Job:  map 100% reduce 0%
  21/04/06 09:51:18 INFO mapreduce.Job:  map 100% reduce 100%
  21/04/06 09:51:19 INFO mapreduce.Job: Job job_1617673573380_0002 completed successfully
  21/04/06 09:51:19 INFO mapreduce.Job: Counters: 49
      File System Counters
          FILE: Number of bytes read=67
          FILE: Number of bytes written=416357
          FILE: Number of read operations=0
          FILE: Number of large read operations=0
          FILE: Number of write operations=0
          HDFS: Number of bytes read=158
          HDFS: Number of bytes written=20
          HDFS: Number of read operations=6
          HDFS: Number of large read operations=0
          HDFS: Number of write operations=2
      Job Counters
          Launched map tasks=1
          Launched reduce tasks=1
          Data-local map tasks=1
          Total time spent by all maps in occupied slots (ms)=6625
          Total time spent by all reduces in occupied slots (ms)=5080
          Total time spent by all map tasks (ms)=6625
          Total time spent by all reduce tasks (ms)=5080
          Total vcore-milliseconds taken by all map tasks=6625
          Total vcore-milliseconds taken by all reduce tasks=5080
          Total megabyte-milliseconds taken by all map tasks=6784000
          Total megabyte-milliseconds taken by all reduce tasks=5201920
      Map-Reduce Framework
          Map input records=5
          Map output records=5
          Map output bytes=51
          Map output materialized bytes=67
          Input split bytes=111
          Combine input records=0
          Combine output records=0
          Reduce input groups=2
          Reduce shuffle bytes=67
          Reduce input records=5
          Reduce output records=2
          Spilled Records=10
          Shuffled Maps =1
          Failed Shuffles=0
          Merged Map outputs=1
          GC time elapsed (ms)=257
          CPU time spent (ms)=2470
          Physical memory (bytes) snapshot=469913600
          Virtual memory (bytes) snapshot=4328771584
          Total committed heap usage (bytes)=326631424
      Shuffle Errors
          BAD_ID=0
          CONNECTION=0
          IO_ERROR=0
          WRONG_LENGTH=0
          WRONG_MAP=0
          WRONG_REDUCE=0
      File Input Format Counters
          Bytes Read=47
      File Output Format Counters
          Bytes Written=20
  ```
- Analysis of the output:
  ① The input data forms a single split, so one map task is allocated
  ② The whole job consists of one map task and one reduce task
  ③ The Map-Reduce Framework counters show the record counts into and out of the map and reduce phases (Map input records=5, Reduce output records=2)
- The result matches expectations: the maximum for english is max(124, 80) = 124, and for math it is max(86, 93, 77) = 93:

  ```text
  english	124
  math	93
  ```
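  The result can be read directly from HDFS; for a single-reducer job the output file is named part-r-00000 by default:

  ```bash
  # Print the reducer output
  hdfs dfs -cat output/score_2/part-r-00000
  ```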
4. Some Notes
- When specifying the job's inputPath and outputPath, the directory at outputPath must not already exist. If it does, the job fails with an error like this:

  ```text
  21/04/06 09:47:18 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
  org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://hadoop:9000/user/hadoop/output/wordcount_1 already exists
      at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:146)
      ...
  ```
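  A simple workaround before rerunning a job with the same outputPath is to delete the stale directory first (destructive, so only for throwaway results):

  ```bash
  # Remove the old output directory so FileOutputFormat's existence check passes
  hdfs dfs -rm -r output/score_2
  ```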
- A job's run details can be viewed in the ResourceManager web UI. The default port is 8088; this setup uses 23030, as configured in yarn-site.xml above.