Writing Your Own MapReduce Program - Based on Hadoop 2.10.1

This article walks through deploying a pseudo-distributed Hadoop cluster, including configuring environment variables, starting HDFS and YARN, and resolving a hostname problem. It then shows how to write a MapReduce program (Mapper, Reducer, and the main driver), with dependencies managed by Maven. Finally, the MR job is run against sample data and produces the expected output. A few runtime caveats are also covered, such as making sure the output directory does not already exist and checking the ResourceManager web UI.
1. Deploying a Pseudo-Distributed Hadoop Cluster

Points where my setup differs from the official documentation:

  1. Before starting HDFS, hadoop-env.sh must be given a real JAVA_HOME path rather than the placeholder ${JAVA_HOME}.

  2. After starting HDFS, three processes should be running: DataNode, NameNode, and SecondaryNameNode (see the jps check after this list).

  3. yarn-site.xml needs an explicit ResourceManager web port, to avoid a clash when the default port 8088 is already in use:

    <configuration>
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
        <property>
           <name>yarn.resourcemanager.webapp.address</name>
           <value>hadoop:23030</value>
        </property>
    </configuration>
    

  4. Before starting YARN, a real JAVA_HOME path is likewise required. In this file (yarn-env.sh) JAVA_HOME is commented out, so the following can simply be added:

    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
    
  5. After starting YARN, two new Java processes appear: NodeManager and ResourceManager.

  6. Configure the Hadoop environment variables so commands can be run from anywhere, without first entering the Hadoop installation directory:

    export HADOOP_HOME=/home/hadoop/hadoop-2.10.1
    export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
    export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
    export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib:$HADOOP_COMMON_LIB_NATIVE_DIR"
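
A quick sanity check after the steps above (a sketch; it assumes the export lines were added to ~/.bashrc and that both HDFS and YARN have been started):

    # Reload the shell profile so the new variables take effect
    source ~/.bashrc

    # Confirm the Hadoop binaries are now on PATH
    hadoop version

    # jps should list all five daemons mentioned above:
    # NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager
    jps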
    
2. Writing the MR Program
  • Configure the Maven dependencies

    <properties>
        <hadoop.version>2.10.1</hadoop.version>
    </properties>
    
    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
    </dependencies>
    
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.8.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
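
  • With this POM, the job jar used in section 3 (score-mr-1.0-SNAPSHOT.jar) can be built with Maven; this assumes the project's artifactId is score-mr. As an aside, hadoop-client should already pull in the common and HDFS client pieces transitively, so the two extra dependencies may be redundant but are harmless:

    mvn clean package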
    
  • Write the Mapper

    package com.hadoop.score.mr;
    
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    
    import java.io.IOException;
    
    public class ScoreMap extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            
            // Process one line at a time: split out the subject and the score
            String[] data = line.split(",");
            int score = Integer.parseInt(data[1]);
    
            // Emit subject -> score
            context.write(new Text(data[0]), new IntWritable(score));
        }
    }
    
  • Write the Reducer

    package com.hadoop.score.mr;
    
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    
    import java.io.IOException;
    
    public class ScoreReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            // Track the maximum score seen for this subject
            int maxScore = Integer.MIN_VALUE;
            for (IntWritable score : values) {
                maxScore = Math.max(maxScore, score.get());
            }
    
            // Emit subject -> max score
            context.write(key, new IntWritable(maxScore));
        }
    }
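
  • Optional: because taking a maximum is associative and commutative, the same reducer class can also serve as a combiner, pre-aggregating map output before the shuffle. This is a tweak of my own, not part of the original program; it would go in the driver next to setReducerClass:

    // Run ScoreReducer map-side as a combiner to shrink the shuffled data
    job.setCombinerClass(ScoreReducer.class);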
    
  • Write the main function that creates and configures the job

    package com.hadoop.score.mr;
    
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    
    public class ScoreMR {
        public static void main(String[] args) throws Exception {
            if (args.length != 2) {
                System.out.println("Usage: ScoreMR <input_path> <output_path>");
                System.exit(-1);
            }
    
            // Create an MR job: the client's unit of execution.
            // Job.getInstance() replaces the deprecated new Job() constructor.
            Job job = Job.getInstance();
            job.setJarByClass(ScoreMR.class);
            job.setJobName("score mapreduce");
    
            // Set the input and output paths
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
    
            // Set the mapper and reducer classes
            job.setMapperClass(ScoreMap.class);
            job.setReducerClass(ScoreReducer.class);
    
            // Set the output key/value types
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
    
            // Exit once the job completes, reporting success or failure
            System.exit(job.waitForCompletion(true) ? 0 : -1);
        }
    }
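
  • Optional: a ToolRunner variant of the driver. When the job runs, Hadoop logs a WARN asking you to "Implement the Tool interface and execute your application with ToolRunner" (visible in the output in section 3). Below is a minimal sketch of that variant; the class name ScoreMRTool is my own, not from the original code:

    package com.hadoop.score.mr;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class ScoreMRTool extends Configured implements Tool {
        @Override
        public int run(String[] args) throws Exception {
            if (args.length != 2) {
                System.err.println("Usage: ScoreMRTool <input_path> <output_path>");
                return -1;
            }

            // getConf() carries any -D options already parsed by ToolRunner
            Job job = Job.getInstance(getConf(), "score mapreduce");
            job.setJarByClass(ScoreMRTool.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            job.setMapperClass(ScoreMap.class);
            job.setReducerClass(ScoreReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            // ToolRunner strips the generic options before handing args to run()
            System.exit(ToolRunner.run(new Configuration(), new ScoreMRTool(), args));
        }
    }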
    
3. Running the MR Program
  • First prepare the data file, score.txt:

    math,86
    english,124
    english,80
    math,93
    math,77
    
  • Upload the file to HDFS; its full path becomes /user/hadoop/input

    hdfs dfs -put score.txt input/
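
  • If the input directory does not exist yet, create it first (a sketch; relative paths like input/ resolve under the HDFS home /user/hadoop):

    hdfs dfs -mkdir -p input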
    
  • Run the MR program

    hadoop jar score-mr-1.0-SNAPSHOT.jar com.hadoop.score.mr.ScoreMR input/score.txt output/score_2
    
  • At first the program kept failing; recording the error here for reference:

    [2021-04-06 01:16:31.227]Container exited with a non-zero exit code 255. Error file: prelaunch.err.
    Last 4096 bytes of prelaunch.err :
    Last 4096 bytes of stderr :
    Apr 06, 2021 1:16:26 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
    INFO: Registering org.apache.hadoop.mapreduce.v2.app.webapp.JAXBContextResolver as a provider class
    Apr 06, 2021 1:16:26 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
    INFO: Registering org.apache.hadoop.yarn.webapp.GenericExceptionHandler as a provider class
    Apr 06, 2021 1:16:26 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
    INFO: Registering org.apache.hadoop.mapreduce.v2.app.webapp.AMWebServices as a root resource class
    Apr 06, 2021 1:16:26 AM com.sun.jersey.server.impl.application.WebApplicationImpl _initiate
    INFO: Initiating Jersey application, version 'Jersey: 1.9 09/02/2011 11:17 AM'
    Apr 06, 2021 1:16:27 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
    INFO: Binding org.apache.hadoop.mapreduce.v2.app.webapp.JAXBContextResolver to GuiceManagedComponentProvider with the scope "Singleton"
    Apr 06, 2021 1:16:27 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
    INFO: Binding org.apache.hadoop.yarn.webapp.GenericExceptionHandler to GuiceManagedComponentProvider with the scope "Singleton"
    Apr 06, 2021 1:16:28 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
    INFO: Binding org.apache.hadoop.mapreduce.v2.app.webapp.AMWebServices to GuiceManagedComponentProvider with the scope "PerRequest"
    log4j:WARN No appenders could be found for logger (org.apache.hadoop.mapreduce.v2.app.MRAppMaster).
    log4j:WARN Please initialize the log4j system properly.
    log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
    
    
  • Root cause: the machine's hostname contained an underscore. Fix: change the hostname (in both /etc/hostname and /etc/hosts), reboot the machine, and restart the Hadoop cluster.

  • Thanks to this blog post: Hadoop运行job程序报错 exitCode=255 ("Hadoop job fails with exitCode=255")

  • After re-running, the output was as follows:

    21/04/06 09:50:46 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
    21/04/06 09:50:47 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
    21/04/06 09:50:48 INFO input.FileInputFormat: Total input files to process : 1
    21/04/06 09:50:49 INFO mapreduce.JobSubmitter: number of splits:1
    21/04/06 09:50:49 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1617673573380_0002
    21/04/06 09:50:49 INFO conf.Configuration: resource-types.xml not found
    21/04/06 09:50:49 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
    21/04/06 09:50:49 INFO resource.ResourceUtils: Adding resource type - name = memory-mb, units = Mi, type = COUNTABLE
    21/04/06 09:50:49 INFO resource.ResourceUtils: Adding resource type - name = vcores, units = , type = COUNTABLE
    21/04/06 09:50:50 INFO impl.YarnClientImpl: Submitted application application_1617673573380_0002
    21/04/06 09:50:50 INFO mapreduce.Job: The url to track the job: http://hadoop:23030/proxy/application_1617673573380_0002/
    21/04/06 09:50:50 INFO mapreduce.Job: Running job: job_1617673573380_0002
    21/04/06 09:51:01 INFO mapreduce.Job: Job job_1617673573380_0002 running in uber mode : false
    21/04/06 09:51:01 INFO mapreduce.Job:  map 0% reduce 0%
    21/04/06 09:51:10 INFO mapreduce.Job:  map 100% reduce 0%
    21/04/06 09:51:18 INFO mapreduce.Job:  map 100% reduce 100%
    21/04/06 09:51:19 INFO mapreduce.Job: Job job_1617673573380_0002 completed successfully
    21/04/06 09:51:19 INFO mapreduce.Job: Counters: 49
            File System Counters
                    FILE: Number of bytes read=67
                    FILE: Number of bytes written=416357
                    FILE: Number of read operations=0
                    FILE: Number of large read operations=0
                    FILE: Number of write operations=0
                    HDFS: Number of bytes read=158
                    HDFS: Number of bytes written=20
                    HDFS: Number of read operations=6
                    HDFS: Number of large read operations=0
                    HDFS: Number of write operations=2
            Job Counters 
                    Launched map tasks=1
                    Launched reduce tasks=1
                    Data-local map tasks=1
                    Total time spent by all maps in occupied slots (ms)=6625
                    Total time spent by all reduces in occupied slots (ms)=5080
                    Total time spent by all map tasks (ms)=6625
                    Total time spent by all reduce tasks (ms)=5080
                    Total vcore-milliseconds taken by all map tasks=6625
                    Total vcore-milliseconds taken by all reduce tasks=5080
                    Total megabyte-milliseconds taken by all map tasks=6784000
                    Total megabyte-milliseconds taken by all reduce tasks=5201920
            Map-Reduce Framework
                    Map input records=5
                    Map output records=5
                    Map output bytes=51
                    Map output materialized bytes=67
                    Input split bytes=111
                    Combine input records=0
                    Combine output records=0
                    Reduce input groups=2
                    Reduce shuffle bytes=67
                    Reduce input records=5
                    Reduce output records=2
                    Spilled Records=10
                    Shuffled Maps =1
                    Failed Shuffles=0
                    Merged Map outputs=1
                    GC time elapsed (ms)=257
                    CPU time spent (ms)=2470
                    Physical memory (bytes) snapshot=469913600
                    Virtual memory (bytes) snapshot=4328771584
                    Total committed heap usage (bytes)=326631424
            Shuffle Errors
                    BAD_ID=0
                    CONNECTION=0
                    IO_ERROR=0
                    WRONG_LENGTH=0
                    WRONG_MAP=0
                    WRONG_REDUCE=0
            File Input Format Counters 
                    Bytes Read=47
            File Output Format Counters 
                    Bytes Written=20
    
  • Analysis of the output:
    ① The input is divided into a single split, so one map task is allocated.
    ② The whole job consists of one map task and one reduce task.
    ③ The map and reduce record counts line up with the data: 5 map input/output records, 5 reduce input records, and 2 reduce output records (one per subject).

  • The result matches expectations:

    english 124
    math    93
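
  • The output can also be read back from HDFS directly (assuming the reducer's default output file name, part-r-00000):

    hdfs dfs -cat output/score_2/part-r-00000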
    
4. Some Caveats
  • When specifying the job's inputPath and outputPath, the directory given as outputPath must not already exist. If it does, the job fails with:

    21/04/06 09:47:18 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
    org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://hadoop:9000/user/hadoop/output/wordcount_1 already exists
            at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:146)
            ...
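
  • To rerun a job, either point it at a fresh output directory or remove the old one first:

    hdfs dfs -rm -r output/wordcount_1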
    
  • A job's run details can be inspected in the ResourceManager web UI. The default port is 8088; here it is 23030, as configured in yarn-site.xml above.
