14. Writing a Simple WordCount Example


Preface: to test MR programs locally you need a local Hadoop installation, the same idea as installing a JDK locally: configure HADOOP_HOME the way you would JAVA_HOME, then run the hadoop command in cmd to verify the installation, just like checking a JDK setup.
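For example, on Windows the check might look like this (the install path below is illustrative; on Linux/macOS export the same variables in your shell profile instead):

set HADOOP_HOME=C:\hadoop-2.7.2
set PATH=%PATH%;%HADOOP_HOME%\bin
hadoop version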

Hadoop package: https://pan.baidu.com/s/1bPlkKnYLXsfOjMtcK1Nq8g
Password: nzqg

I. Task Analysis

1. Requirement: count how many times each word appears in a given text file.
The file content (tab separated) is:
java	java	hello	scala	java
baidu	alib	meituan	scala	baidu
alib	java	scala	java	alib
2. Write the program

Count the occurrences of each word in the text above.

2.1 Project Setup

1. Create a new Maven project named mr-study.

2. Add the dependencies to the pom:

<dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>RELEASE</version>
        </dependency>
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-core</artifactId>
            <version>2.8.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.7.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.7.2</version>
        </dependency>
    </dependencies>
3. Create log4j.properties under resources with the content below, so that job logs are visible when running locally. (The file uses the log4j 1.x format; Hadoop 2.7.2 pulls in log4j 1.2 transitively, so this works even though the pom also lists log4j-core 2.8.2.)
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spring.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n


2.2 Writing the Mapper
package com.cjy.mr.wordcount;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * WordCount Mapper: handles the data input side.
 *
 * The default FileInputFormat implementation is TextInputFormat, so input is read
 * line by line: the key is the byte offset of the line, the value is the line text.
 */
public class WordcountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Output key/value objects (the types the reducer receives), reused across calls
    Text k = new Text();
    IntWritable v = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        // Get the text of the current line
        String line = value.toString();

        // Split the line on tabs ("\t")
        String[] words = line.split("\t");

        // Write out each word; the shuffle gathers all values with the same key to one reduce() call
        for (String word : words) {
            k.set(word);
            // emits pairs such as (java, 1)
            context.write(k, v);
        }
    }
}
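To make the map output concrete, here is a tiny standalone sketch (plain Java, no Hadoop required; the class name is made up for illustration) of what map() emits for the first line of wd.txt:

public class MapDemo {
    public static void main(String[] args) {
        // First line of wd.txt, tab separated
        String line = "java\tjava\thello\tscala\tjava";
        for (String word : line.split("\t")) {
            // Mirrors context.write(k, v): one (word, 1) pair per token
            System.out.println("(" + word + ", 1)");
        }
    }
}

Running it prints (java, 1), (java, 1), (hello, 1), (scala, 1), (java, 1); after the shuffle, the (java, 1) pairs from this line are grouped with the java pairs from the other lines and handed to a single reduce() call.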


2.3 Writing the Reducer
package com.cjy.mr.wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * WordCount Reducer: sums the occurrence count of each word.
 * Input from the map phase arrives grouped by key, e.g.:
 *
 * (java,(1,1,1,1,1))
 */
public class WordcountReducer extends Reducer<Text,IntWritable,Text, IntWritable> {

    // Reused across reduce() calls to avoid creating objects per key
    int sum;
    IntWritable v = new IntWritable();
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {

        // 1. Sum the counts for this key
        sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }

        // 2. Write the result, e.g. (java, 5)
        v.set(sum);
        context.write(key, v);
    }
}
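The shuffle delivers each key together with all of its 1s, so one reduce() call behaves like this minimal simulation (plain Java, illustrative only; the values match the five occurrences of java in wd.txt):

import java.util.Arrays;
import java.util.List;

public class ReduceDemo {
    public static void main(String[] args) {
        // What the reducer sees for key "java": (java,(1,1,1,1,1))
        List<Integer> values = Arrays.asList(1, 1, 1, 1, 1);
        int sum = 0;
        for (int count : values) {
            sum += count;
        }
        System.out.println("java\t" + sum); // prints: java	5
    }
}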



2.4 Writing the Driver
package com.cjy.mr.wordcount;


import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordcountDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // 1. Load the configuration and create the job
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);

        // 2. Set the jar location via the driver class
        job.setJarByClass(WordcountDriver.class);

        // 3. Set the Mapper and Reducer classes
        job.setMapperClass(WordcountMapper.class);
        job.setReducerClass(WordcountReducer.class);

        // 4. Set the map output key/value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // 5. Set the final (reducer) output key/value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // 6. Set the input and output paths.
        // The commented-out lines read the paths from command-line arguments,
        // which is what the cluster run in Part II uses:
//        FileInputFormat.setInputPaths(job, new Path(args[0]));
//        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        FileInputFormat.setInputPaths(job, new Path("/Users/chenjunying/Downloads/wd.txt")); // change to your local input file
        FileOutputFormat.setOutputPath(job, new Path("/Users/chenjunying/Downloads/out/"));  // output directory (must not already exist)

        // 7. Submit the job and wait for completion
        boolean result = job.waitForCompletion(true);

        System.exit(result ? 0 : 1);
    }
}
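A common pitfall when re-running locally: FileOutputFormat throws an exception if the output directory already exists. An optional guard, shown here as a sketch (not part of the original code; it also needs import org.apache.hadoop.fs.FileSystem), can clear it before step 6:

        // Optional: remove a leftover output directory before submitting
        Path output = new Path("/Users/chenjunying/Downloads/out/");
        FileSystem fs = FileSystem.get(configuration);
        if (fs.exists(output)) {
            fs.delete(output, true); // true = recursive
        }
        FileOutputFormat.setOutputPath(job, output);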
2.5 Results
alib	3
baidu	2
hello	1
java	5
meituan	1
scala	3
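The counts land in a file named part-r-00000 inside the output directory; the keys come out sorted because MapReduce sorts keys during the shuffle. To view the file directly (path matches the driver's output setting):

cat /Users/chenjunying/Downloads/out/part-r-00000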
2.6 Log Analysis

Only the important parts of the log are shown.

One input file to process:
2020-05-09 21:42:14,255 INFO [org.apache.hadoop.mapreduce.lib.input.FileInputFormat] - Total input paths to process : 1 
One split. With the default TextInputFormat, splits follow the 128 MB block size, and wd.txt is far smaller than one block, so there is exactly one split:
2020-05-09 21:42:14,310 INFO [org.apache.hadoop.mapreduce.JobSubmitter] - number of splits:1
The job ID: this is a local run, so the ID contains local:
2020-05-09 21:42:14,390 INFO [org.apache.hadoop.mapreduce.JobSubmitter] - Submitting tokens for job: job_local751775058_0001
Merging the sorted map output segments:
2020-05-09 21:42:14,766 INFO [org.apache.hadoop.mapred.Merger] - Merging 1 sorted segments
Only when the log shows map 100% reduce 100% has the job completed successfully:
2020-05-09 21:42:15,522 INFO [org.apache.hadoop.mapreduce.Job] -  map 100% reduce 100%


II. Packaging and Running on the Cluster

1. Add the packaging plugins to the pom:
<build>
        <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>2.3.2</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <descriptorRefs>
                        <!-- Bundle all runtime dependencies into a single jar -->
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <mainClass>com.cjy.mr.wordcount.WordcountDriver</mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
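Packaging is then a single command run from the project root (the jar names below follow Maven's artifactId-version convention):

mvn clean package

This should leave both mr-study-1.0-SNAPSHOT.jar and mr-study-1.0-SNAPSHOT-jar-with-dependencies.jar under target/.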
2. Build the jar with Maven and upload it, together with wd.txt, to the cluster.
The jar goes onto the local filesystem of a cluster node; wd.txt must be uploaded to HDFS.
1. Upload the input file:
[root@hadoop102 hadoop-2.7.2]# hadoop fs -put wd.txt /

2. Run the job:
[root@hadoop102 hadoop-2.7.2]# hadoop jar mr-study-1.0-SNAPSHOT-jar-with-dependencies.jar com.cjy.mr.wordcount.WordcountDriver /wd.txt /out/
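Once the job finishes, the result can be checked straight from HDFS (the output file name follows the usual part-r-00000 convention):

[root@hadoop102 hadoop-2.7.2]# hadoop fs -cat /out/part-r-00000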


3. Cluster Run Log


[root@hadoop102 hadoop-2.7.2]# hadoop jar mr-study-1.0-SNAPSHOT-jar-with-dependencies.jar /wd.txt /out/
(The driver class can be omitted here because the assembly plugin wrote it into the jar's manifest.)
Requesting resources from the ResourceManager:
20/05/09 22:17:40 INFO client.RMProxy: Connecting to ResourceManager at hadoop103/10.211.55.103:8032

20/05/09 22:17:41 INFO mapreduce.JobSubmitter: number of splits:1

The job ID no longer contains local; this run goes through YARN:

20/05/09 22:17:41 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1589033576096_0003
20/05/09 22:17:42 INFO impl.YarnClientImpl: Submitted application application_1589033576096_0003
20/05/09 22:17:42 INFO mapreduce.Job: The url to track the job: http://hadoop103:8088/proxy/application_1589033576096_0003/
20/05/09 22:17:42 INFO mapreduce.Job: Running job: job_1589033576096_0003
20/05/09 22:17:48 INFO mapreduce.Job: Job job_1589033576096_0003 running in uber mode : false
20/05/09 22:17:48 INFO mapreduce.Job:  map 0% reduce 0%
20/05/09 22:17:52 INFO mapreduce.Job:  map 100% reduce 0%
20/05/09 22:18:00 INFO mapreduce.Job:  map 100% reduce 100%
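The log also prints a tracking URL (http://hadoop103:8088/proxy/...), where progress and counters can be watched in the ResourceManager web UI. After the job finishes, the aggregated container logs can usually be pulled with (assuming log aggregation is enabled on the cluster):

[root@hadoop102 hadoop-2.7.2]# yarn logs -applicationId application_1589033576096_0003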

Demo source: https://github.com/chenjy512/bigdata_study/tree/master/mr-zhdemo/mr-study
