Hadoop Cluster Big Data Solutions: Implementing a MapReduce Program with an IDE and Maven (Part 5)

Latest recommended article published on 2021-12-07 23:46:34

╭⌒若隐_RowYet——大数据

Category: Hadoop  Tags: mapreduce

Article link: https://blog.csdn.net/LXWalaz1s1s/article/details/104107022

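Preparation

No matter how eloquently the principles of MapReduce are described, they are still just principles (see Figure 1). This post puts them into practice with a simple word count program, put together quickly with an IDE and Maven; the next post can tackle something more interesting. Maven really is convenient: importing dependency JARs by hand, as I used to, was miserable.

Figure 1. MapReduce word count diagram

The test data is shown below. It is stored on the cluster's HDFS as the file /tmp/wcinput/wordcount.xt: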

hello myname hadoop world lau hello map
hdfs spark hive reduce map zookeeper
world word log hdfs
myname hive hi lau

Writing the Project

Table 1. Java data types compared with Writable types

Java primitive type	Writable implementation	Serialized size (bytes)
boolean	BooleanWritable	1
byte	ByteWritable	1
short	ShortWritable	2
int	IntWritable	4
int	VIntWritable (variable-length)	1-5
float	FloatWritable	4
long	LongWritable	8
long	VLongWritable (variable-length)	1-9
double	DoubleWritable	8
String	Text	up to 2 GB

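First, some background. Hadoop ships with the org.apache.hadoop.io package, which contains a broad set of Writable classes, that is, Hadoop's own data types (a bit heavy-handed for something written in Java, if you ask me). Table 1 compares them with the Java types. Two points deserve special mention. A char is stored in an IntWritable. And a Writable object cannot be assigned directly to a Java primitive variable: use .get() to read the wrapped value into a primitive and .set() to store a primitive value into the Writable. For an example, see the statement "sum=sum+i.get();" in the Reducer class below.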
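
The fixed sizes in Table 1 can be checked without a cluster. The following is a minimal sketch, assuming that the fixed-width Writables serialize their value through the standard java.io.DataOutput methods (for example, IntWritable writing its value with writeInt). It uses DataOutputStream directly as a stand-in rather than the Hadoop classes, and the class and method names (FixedWidthSizes, serializedSize, WriteAction) are hypothetical. The variable-length types and Text are not covered here.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class FixedWidthSizes {
    /** A write action against a DataOutputStream, so each case can be passed as a lambda. */
    interface WriteAction {
        void writeTo(DataOutputStream out) throws IOException;
    }

    /** Runs the write action against an in-memory buffer and returns the number of bytes produced. */
    static int serializedSize(WriteAction action) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            action.writeTo(out);
        }
        return buffer.size();
    }

    public static void main(String[] args) throws IOException {
        // Each line prints the byte count for one fixed-width row of Table 1.
        System.out.println("boolean -> " + serializedSize(out -> out.writeBoolean(true)));
        System.out.println("byte    -> " + serializedSize(out -> out.writeByte(7)));
        System.out.println("short   -> " + serializedSize(out -> out.writeShort(7)));
        System.out.println("int     -> " + serializedSize(out -> out.writeInt(7)));
        System.out.println("float   -> " + serializedSize(out -> out.writeFloat(1.5f)));
        System.out.println("long    -> " + serializedSize(out -> out.writeLong(7L)));
        System.out.println("double  -> " + serializedSize(out -> out.writeDouble(1.5)));
    }
}
```

Running main prints 1, 1, 2, 4, 4, 8 and 8 bytes respectively, matching the fixed-width rows of the table.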

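Figure 2. Creating a new Maven project

1) Open the IDE and create a new Maven project, as in Figure 2. The Maven pom.xml is configured as follows. Only two dependencies need to be declared; with a network connection, Maven automatically downloads the JARs that MapReduce depends on, which is a real pleasure.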

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>

<groupId>org.example</groupId>
<artifactId>hadoop_mapreduce</artifactId>
<version>1.0-SNAPSHOT</version>

<dependencies>
   <dependency>
       <groupId>org.apache.hadoop</groupId>
       <artifactId>hadoop-common</artifactId>
       <version>2.8.5</version>
   </dependency>

   <dependency>
       <groupId>org.apache.hadoop</groupId>
       <artifactId>hadoop-mapreduce-client-core</artifactId>
       <version>2.8.5</version>
   </dependency>
</dependencies>

</project>

2) Create a package named wordcount, then create a Java class named WcMapper in that package and start writing the Map program. The code is as follows:

package wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;
import java.util.StringTokenizer;

// Every custom Mapper extends the Mapper base class and declares its input key/value and output key/value types.
public class WcMapper extends Mapper<LongWritable, Text, Text, IntWritable>
{
  // Override map(). With the default input format, map() is called once per line of the split:
  // key is the position (byte offset) of the line within the file, value is the line itself.
  @Override
  protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
  {
      String line = value.toString();  // value is a Text, so convert it to a String
      StringTokenizer st = new StringTokenizer(line);  // split the line; StringTokenizer splits on whitespace by default
      while (st.hasMoreTokens())
      {
          String word = st.nextToken();
          context.write(new Text(word), new IntWritable(1)); // for each word, emit the word with a count of 1
      }
  }
}

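3) The focus this time is on getting a program working, so we write only the Map and Reduce parts and leave the entire Shuffle phase to the MapReduce framework's defaults. Next comes the Reduce part: in the same wordcount package, create a new Java class file named WcReducer with the following code: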

package wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

// Every custom reducer extends the Reducer base class and declares its input and output types.
public class WcReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Override reduce(). Before reduce runs, the shuffle merges the output for identical keys,
    // so the value is a potentially very large collection, e.g. key=aaa, value=[1,2,4,6,7].
    // Passing the whole collection to reduce would be impractical, so an Iterable is passed instead.
    @Override
    protected void reduce(Text key, Iterable<IntWritable> iterable, Context context) throws IOException, InterruptedException
    {
        int sum=0;
        for(IntWritable i:iterable)
        {
            sum=sum+i.get(); // i is an IntWritable; i.get() converts it to an int
        }
        context.write(key, new IntWritable(sum)); // emit the word and its total count
    }
}
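
To see what WcMapper and WcReducer compute together, the logic can be tried locally on the test data. The sketch below is plain Java and does not use the Hadoop API: a sorted in-memory map stands in for the shuffle, and the class and method names (LocalWordCount, map, group, reduce, wordCount) are hypothetical. It only mirrors the data flow and is not how the framework is implemented.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class LocalWordCount {
    /** Map step: one (word, 1) pair per whitespace-separated token, as in WcMapper. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line);
        while (st.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(st.nextToken(), 1));
        }
        return pairs;
    }

    /** Shuffle stand-in: collect the values of each key, with keys kept in sorted order. */
    static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce step: sum the values of one key, as in WcReducer. */
    static int reduce(Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum = sum + v;
        }
        return sum;
    }

    /** Runs map, group and reduce over all lines and returns word -> count. */
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : group(pairs).entrySet()) {
            counts.put(e.getKey(), reduce(e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // The four lines of test data from the Preparation section.
        List<String> lines = Arrays.asList(
                "hello myname hadoop world lau hello map",
                "hdfs spark hive reduce map zookeeper",
                "world word log hdfs",
                "myname hive hi lau");
        wordCount(lines).forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

On the test data this yields 14 distinct words, with hello, hdfs, hive, lau, map, myname and world counted twice and the rest once.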

4) With Map and Reduce written, the next step is the main function that gets the Job running. In the same package, create a new Java class file named WcJobRun with the following code:

package wordcount;


import org.apache.hadoop.conf.Configuration;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;


public class WcJobRun {
    public static void main(String[] args)
    {
        // Load the Hadoop configuration used to submit the Job.
        Configuration conf = new Configuration();

        // Set mapreduce.job.tracker; keep the value consistent with
        // the property in the cluster's mapred-site.xml.
        // This line is optional and can simply be omitted.
        conf.set("mapreduce.job.tracker","dw-cluster-master:9001");

        try
        {
            // MapReduce creates the output folder automatically,
            // but it reports an error if the target output folder already exists.
            // Deleting it first makes the program safe to rerun.
            Path outputPath = new Path(args[2]);
            FileSystem fileSystem = FileSystem.get(conf);
            if (fileSystem.exists(outputPath)) {
                fileSystem.delete(outputPath, true);
                System.out.println("outputPath is exist,but has deleted!");
            }

            Job myjob = Job.getInstance(conf);
            myjob.setJarByClass(WcJobRun.class); // find the JAR to run by the class it contains
            myjob.setMapperClass(WcMapper.class); // the Map class
            myjob.setReducerClass(WcReducer.class); // the Reduce class
            myjob.setCombinerClass(WcReducer.class); // the combiner; this can be omitted
            myjob.setMapOutputKeyClass(Text.class); // the Map output key type
            myjob.setMapOutputValueClass(IntWritable.class); // the Map output value type
            myjob.setNumReduceTasks(1); // number of reduce tasks; defaults to 1 if not set
            // Why args[1]? In the invocation used in this post, args[0] holds the class containing main.
            // This assumes the JAR manifest declares a Main-Class; without one, hadoop jar consumes
            // the class name itself and the two paths arrive as args[0] and args[1].
            // Input path of the whole Job: the second argument after the JAR path.
            FileInputFormat.addInputPath(myjob, new Path(args[1]));
            //FileInputFormat.addInputPath(myjob,new Path("/tmp/wcinput/wordcount.xt"));
            // Output path of the whole Job: the third argument after the JAR path.
            FileOutputFormat.setOutputPath(myjob, outputPath);
            //FileOutputFormat.setOutputPath(myjob,new Path("/tmp/wcoutput"));
            System.exit(myjob.waitForCompletion(true) ? 0 : 1); // wait for the Job; exit 0 on success, 1 on failure
        }
        catch (Exception e)
        {
            e.printStackTrace();
            System.exit(1); // exit non-zero so that a failed setup or submission is not reported as success
        }
    }
}

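Packaging and Deployment

1) Package the program as a JAR. As in Figure 3, click File in the menu bar and choose Project Structure…, which opens Figure 4. Select Artifacts, click the "+" in the middle column, choose JAR, then choose From modules with dependencies, which opens the dialog in Figure 5. Three points about Figure 5 deserve attention:

First: under JAR files from libraries, selecting "extract to the target JAR" produces a single JAR (library classes are unpacked into it), while selecting "copy to the output directory and link via manifest" also copies out every JAR that Maven depends on, such as the whole series of Hadoop JARs. We clearly do not need that series of JARs here, because the cluster already has them, so select "extract to the target JAR";
Second: for the packaging path, it is best to stop at …/src/, for example "D:\workspace\hadoop_mapreduce\src\", and not go any deeper. If your files are not laid out conventionally, a path that is too specific will leave files out of the package, and tracking that down is painful;
Third: packaging repeatedly raises the error shown in Figure 6. Find the file named in the error and delete it.

2) Once all of the above is ready, click OK to reach Figure 7. Note the part marked in red; change the Name if you do not like it. When everything is in order, click Apply and then OK to return to the main window;
3) Then, as in Figure 8, click Build in the menu bar and choose Build Artifacts, which opens Figure 9. Choose Build and the job is done;

4) As in Figure 10, the project now has an additional out folder, and the JAR inside it is the one you want (in Figure 10 I had tested many features and packaged several times; normally there is only one JAR and one folder);
5) Upload the final JAR to the cluster to complete the deployment.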

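Figure 3. Building the JAR, step 1

Figure 4. Building the JAR, step 2

Figure 5. Building the JAR, step 3

Figure 6. IDE error on repeated packaging

Figure 7. Building the JAR, step 4

Figure 8. Building the JAR, step 5

Figure 9. Building the JAR, step 6

Figure 10. The final generated JAR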

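Running on the Cluster

1) Running the job on the cluster is simple and takes a single command (my hadoop is on the PATH through environment variables):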

hadoop jar /home/liuxiaowei/Java/wc_count/hadoop_mapreduce_v6.jar wordcount.WcJobRun /tmp/wcinput/wordcount.xt /tmp/wcoutput3

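2) Breaking down the command:

hadoop jar jar_path args[0] args[1] args[2] 
hadoop jar is the general form for running a JAR;
it is followed by the path of the JAR;
the first argument after that should be the class containing the main function. The second argument is what WcJobRun reads as args[1], the input file of the whole MapReduce job;
the third argument is what WcJobRun reads as args[2], the output path of the whole MapReduce job;
This is why the code uses args[1] and args[2] and skips args[0]. Note that this indexing holds when the JAR's manifest declares a Main-Class, in which case hadoop jar passes every argument after the JAR path to main. If the manifest has no Main-Class, hadoop jar takes the first argument as the class to run and passes only the remaining ones, so the two paths would arrive as args[0] and args[1].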
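
The argument indexing can be made concrete with a small model. This is a simplified sketch of how, to my understanding, hadoop jar decides what to hand to main; it is not Hadoop's own code, and the class and method names (HadoopJarArgs, argsSeenByMain) are hypothetical. The parameter manifestMainClass stands for the Main-Class entry of the JAR manifest, or null when there is none.

```java
import java.util.Arrays;

public class HadoopJarArgs {
    /**
     * Simplified model: returns the array that main(String[] args) would receive.
     * manifestMainClass: the Main-Class from the JAR manifest, or null if absent.
     * argsAfterJar: everything typed on the command line after the JAR path.
     */
    static String[] argsSeenByMain(String manifestMainClass, String[] argsAfterJar) {
        if (manifestMainClass != null) {
            // The manifest already names the class, so every argument is passed through.
            return argsAfterJar.clone();
        }
        if (argsAfterJar.length == 0) {
            throw new IllegalArgumentException("a main class name is required when the manifest has none");
        }
        // No Main-Class: the first argument names the class to run and is consumed.
        return Arrays.copyOfRange(argsAfterJar, 1, argsAfterJar.length);
    }

    public static void main(String[] args) {
        String[] typed = {"wordcount.WcJobRun", "/tmp/wcinput/wordcount.xt", "/tmp/wcoutput3"};
        // With a Main-Class in the manifest, the input path is at index 1 and the output path at index 2.
        System.out.println(Arrays.toString(argsSeenByMain("wordcount.WcJobRun", typed)));
        // Without one, the class name is consumed and the paths shift to index 0 and 1.
        System.out.println(Arrays.toString(argsSeenByMain(null, typed)));
    }
}
```

If a job fails with an ArrayIndexOutOfBoundsException on args[2], checking whether the JAR manifest contains a Main-Class entry is a reasonable first step.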

3) The full output is as follows:

[liuxiaowei@dw-cluster-master wc_count]$ hadoop jar /home/liuxiaowei/Java/wc_count/hadoop_mapreduce_v6.jar wordcount.WcJobRun /tmp/wcinput/wordcount.xt /tmp/wcoutput3
outputPath is exist,but has deleted!
20/02/01 17:11:11 INFO client.RMProxy: Connecting to ResourceManager at dw-cluster-master/10.216.10.141:8032
20/02/01 17:11:12 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
20/02/01 17:11:12 INFO input.FileInputFormat: Total input files to process : 1
20/02/01 17:11:12 INFO mapreduce.JobSubmitter: number of splits:1
20/02/01 17:11:13 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1578394893972_0051
20/02/01 17:11:13 INFO impl.YarnClientImpl: Submitted application application_1578394893972_0051
20/02/01 17:11:13 INFO mapreduce.Job: The url to track the job: http://dw-cluster-master:8088/proxy/application_1578394893972_0051/
20/02/01 17:11:13 INFO mapreduce.Job: Running job: job_1578394893972_0051
20/02/01 17:11:19 INFO mapreduce.Job: Job job_1578394893972_0051 running in uber mode : false
20/02/01 17:11:19 INFO mapreduce.Job:  map 0% reduce 0%
20/02/01 17:11:24 INFO mapreduce.Job:  map 100% reduce 0%
20/02/01 17:11:30 INFO mapreduce.Job:  map 100% reduce 100%
20/02/01 17:11:30 INFO mapreduce.Job: Job job_1578394893972_0051 completed successfully
20/02/01 17:11:30 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=248
                FILE: Number of bytes written=321577
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=234
                HDFS: Number of bytes written=107
                HDFS: Number of read operations=6
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Launched map tasks=1
                Launched reduce tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=3195
                Total time spent by all reduces in occupied slots (ms)=3417
                Total time spent by all map tasks (ms)=3195
                Total time spent by all reduce tasks (ms)=3417
                Total vcore-milliseconds taken by all map tasks=3195
                Total vcore-milliseconds taken by all reduce tasks=3417
                Total megabyte-milliseconds taken by all map tasks=3271680
                Total megabyte-milliseconds taken by all reduce tasks=3499008
        Map-Reduce Framework
                Map input records=4
                Map output records=21
                Map output bytes=200
                Map output materialized bytes=248
                Input split bytes=116
                Combine input records=0
                Combine output records=0
                Reduce input groups=14
                Reduce shuffle bytes=248
                Reduce input records=21
                Reduce output records=14
                Spilled Records=42
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=96
                CPU time spent (ms)=1920
                Physical memory (bytes) snapshot=1478197248
                Virtual memory (bytes) snapshot=12747206656
                Total committed heap usage (bytes)=1520959488
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=118
        File Output Format Counters
                Bytes Written=107
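
Several of the counters above can be cross-checked against the four lines of test data. The sketch below is plain Java with hypothetical names (CounterCheck and its methods). The byte counts rest on stated assumptions: a short Text value serializes as one length byte plus its UTF-8 bytes, an IntWritable takes 4 bytes, and the job's text output writes each result as the word, a tab, the count and a newline.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class CounterCheck {
    /** The four lines of test data from the Preparation section. */
    static final List<String> LINES = Arrays.asList(
            "hello myname hadoop world lau hello map",
            "hdfs spark hive reduce map zookeeper",
            "world word log hdfs",
            "myname hive hi lau");

    /** All whitespace-separated tokens; the mapper emits one record per token. */
    static List<String> tokens() {
        List<String> result = new ArrayList<>();
        for (String line : LINES) {
            StringTokenizer st = new StringTokenizer(line);
            while (st.hasMoreTokens()) {
                result.add(st.nextToken());
            }
        }
        return result;
    }

    /** Word -> number of occurrences, in sorted key order. */
    static Map<String, Integer> counts() {
        Map<String, Integer> result = new TreeMap<>();
        for (String token : tokens()) {
            result.merge(token, 1, Integer::sum);
        }
        return result;
    }

    /** Assumed record layout: 1 length byte + UTF-8 bytes of the word + 4 bytes for the int. */
    static int mapOutputBytes() {
        int total = 0;
        for (String token : tokens()) {
            total += 1 + token.getBytes(StandardCharsets.UTF_8).length + 4;
        }
        return total;
    }

    /** Assumed output layout: word, tab, decimal count, newline. */
    static int bytesWritten() {
        int total = 0;
        for (Map.Entry<String, Integer> e : counts().entrySet()) {
            total += e.getKey().getBytes(StandardCharsets.UTF_8).length + 1
                    + String.valueOf(e.getValue()).length() + 1;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("Map input records=" + LINES.size());
        System.out.println("Map output records=" + tokens().size());
        System.out.println("Map output bytes=" + mapOutputBytes());
        System.out.println("Reduce input groups=" + counts().size());
        System.out.println("Bytes Written=" + bytesWritten());
    }
}
```

Under these assumptions the sketch reproduces Map input records=4, Map output records=21, Map output bytes=200, Reduce input groups=14 and Bytes Written=107 from the log.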