Hadoop Study Notes (6): WordCount Example Program

梧桐1233

Article link: https://blog.csdn.net/qq_40432544/article/details/121364233

WordCount Example Program

Requirements

Count how many times each word occurs in a given text file and output the totals.

(1) Input data: hello.txt

ss ss
cls cls
jiao
banzhang
xue
hadoop

(2) Expected output

banzhang 1
cls 2
hadoop 1
jiao 1
ss 2
xue 1
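
Before moving to MapReduce, the requirement can be checked with a minimal plain-Java sketch. This is not part of the Hadoop job; the class name WordCountSketch and the method count are hypothetical names chosen for illustration. It uses a TreeMap so the keys come out sorted, which matches the ordering of the expected output above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {
    // Counts words across all lines; TreeMap keeps the keys in sorted order.
    static Map<String, Long> count(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // a blank line produces one empty token
                }
                counts.merge(word, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Same content as hello.txt, one list element per line
        List<String> lines = Arrays.asList("ss ss", "cls cls", "jiao", "banzhang", "xue", "hadoop");
        // Print one "word<TAB>count" line per word
        count(lines).forEach((word, n) -> System.out.println(word + "\t" + n));
    }
}
```

The MapReduce version below splits this same computation into a map step (emit each word with a 1) and a reduce step (sum the 1s per word).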

1. Create a Maven project and add the required dependencies:

<dependencies>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
    </dependency>
    <dependency>
        <groupId>org.apache.logging.log4j</groupId>
        <artifactId>log4j-slf4j-impl</artifactId>
        <version>2.12.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.1.3</version>
    </dependency>
</dependencies>

2. In the resources directory (src/main/resources), create the file `log4j2.xml` with the following content:

<?xml version="1.0" encoding="UTF-8"?>
<Configuration status="error" strict="true" name="XMLConfig">
    <Appenders>
        <!-- Appender of type Console; the name attribute is required -->
        <Appender type="Console" name="STDOUT">
            <!-- Uses a PatternLayout; sample output:
            [INFO] [2018-01-22 17:34:01][org.test.Console]I'm here -->
            <Layout type="PatternLayout"
                    pattern="[%p] [%d{yyyy-MM-dd HH:mm:ss}][%c{10}]%m%n" />
        </Appender>
    </Appenders>
    <Loggers>
        <!-- additivity is set to false -->
        <Logger name="test" level="info" additivity="false">
            <AppenderRef ref="STDOUT" />
        </Logger>
        <!-- Root logger configuration -->
        <Root level="info">
            <AppenderRef ref="STDOUT" />
        </Root>
    </Loggers>
</Configuration>

3. Following the MapReduce programming conventions, write the Mapper, the Reducer, and the Driver.

3.1 WCMapper

WCMapper prepares the data: it reads one line of input at a time and emits every word in that line with a count of 1, for example:

aaa 1

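package com.atguigu.wordcount;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Implements the business logic that runs inside a MapTask.
 * Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> takes two pairs of types:
 *   KEYIN:    type of the offset of the line being read
 *   VALUEIN:  type of the line of data itself
 *   KEYOUT:   type of the emitted key (here, the word)
 *   VALUEOUT: type of the emitted value (here, the word's count)
 */
public class WCMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    // Reused across calls so that no new objects are allocated per record
    private final Text outKey = new Text();
    private final LongWritable outValue = new LongWritable(1);

    /**
     * Called by the framework in a loop, once per line of input.
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // 1. Convert the Text to a String so that the String API can be used
        String line = value.toString();
        // 2. Split the line on whitespace and store the words in an array
        String[] words = line.trim().split("\\s+");
        // 3. Emit (word, 1) for every word in the array
        for (String word : words) {
            if (word.isEmpty()) {
                continue; // a blank line yields a single empty token
            }
            outKey.set(word);
            context.write(outKey, outValue);
        }
    }
}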
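
The map step can be tried without a cluster using a minimal plain-Java sketch. MapStepSketch and mapLine are hypothetical names, and the "word<TAB>1" strings merely stand in for the (Text, LongWritable) pairs that a real Mapper writes to its Context. The sketch also shows why the choice of split pattern matters when a line contains consecutive spaces.

```java
import java.util.ArrayList;
import java.util.List;

public class MapStepSketch {
    // Mimics one map() call: one input line in, zero or more "word<TAB>1" pairs out.
    static List<String> mapLine(String line) {
        List<String> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(word + "\t1");
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Two pairs, one per occurrence of "ss"
        System.out.println(mapLine("ss ss"));
        // Splitting on a single space keeps an empty token between two spaces
        System.out.println("a  b".split(" ").length);    // 3
        // Splitting on runs of whitespace does not
        System.out.println("a  b".split("\\s+").length); // 2
    }
}
```

An empty token emitted as a key would show up in the final result as a word with no characters, so filtering it in the map step keeps the output clean.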

3.2 WCReducer

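WCReducer (implemented below as the class WCReduce) receives one group of map output per call. For example, if two records have the key "aaa", the reducer reads all records with the key "aaa" in a single call:

aaa 1
aaa 1

The reducer's job is to total the records whose key is "aaa".

Output:

aaa 2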

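package com.atguigu.wordcount;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * Implements the business logic that runs inside a ReduceTask.
 * Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> takes two pairs of types:
 *   KEYIN:    type of the key read (the key type written by the Mapper)
 *   VALUEIN:  type of the value read (the value type written by the Mapper)
 *   KEYOUT:   type of the emitted key (here, the word)
 *   VALUEOUT: type of the emitted value (here, the word's count)
 */
public class WCReduce extends Reducer<Text, LongWritable, Text, LongWritable> {
    // Reused across calls so that no new object is allocated per key
    private final LongWritable outValue = new LongWritable();

    /**
     * Called by the framework in a loop, once per group of records sharing a key.
     *
     * @param key     the key of the group
     * @param values  all values that belong to that key
     * @param context the context, used here to write the result
     *
     * aaa   1
     * aaa   1    ========>            aaa  2
     */
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
        // Sum all counts for this key
        long sum = 0;
        for (LongWritable value : values) {
            sum += value.get();
        }
        outValue.set(sum);
        context.write(key, outValue);
    }
}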
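
The grouping and summing described above can be sketched in plain Java. ReduceStepSketch, group, and reduce are hypothetical names; group is only a stand-in for the framework's shuffle, which in a real job collects the values of identical keys and presents the keys to the reducer in sorted order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReduceStepSketch {
    // Stand-in for the shuffle: collects one 1 per occurrence of a key, sorted by key.
    static Map<String, List<Long>> group(List<String> keys) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (String key : keys) {
            grouped.computeIfAbsent(key, k -> new ArrayList<>()).add(1L);
        }
        return grouped;
    }

    // Mimics one reduce() call: sums all values that share a key.
    static long reduce(Iterable<Long> values) {
        long sum = 0;
        for (long value : values) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Keys as the map step would emit them, each carrying the value 1
        List<String> mapOutputKeys = Arrays.asList("aaa", "bbb", "aaa");
        for (Map.Entry<String, List<Long>> entry : group(mapOutputKeys).entrySet()) {
            System.out.println(entry.getKey() + "\t" + reduce(entry.getValue()));
        }
    }
}
```

Summing the values, rather than counting how many there are, keeps the logic valid even if some upstream step has already merged several 1s into a larger partial count.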

3.3 WCDriver

3.3.1 Version for packaging as a jar and running on the server (WCDriver)

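package com.atguigu.wordcount;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WCDriver {
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        // 1. Load the configuration and wrap it in a Job
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // 2. Set the jar by passing the class that contains main (can be omitted for local runs)
        job.setJarByClass(WCDriver.class);
        // 3. Set the Mapper and Reducer classes
        job.setMapperClass(WCMapper.class);
        job.setReducerClass(WCReduce.class);
        // 4. Set the key/value types of the map output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        // 5. Set the key/value types of the final output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // 6. Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // 7. Submit the job and wait for it to finish
        boolean succeeded = job.waitForCompletion(true);
        System.exit(succeeded ? 0 : 1);
    }
}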
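
The driver reads args[0] and args[1] directly, so starting it without both paths fails with an ArrayIndexOutOfBoundsException before the job is even configured. A minimal sketch of an argument check follows; DriverArgsSketch and validate are hypothetical names and not part of the original driver.

```java
public class DriverArgsSketch {
    // Returns null when the arguments look usable, otherwise a usage message.
    static String validate(String[] args) {
        if (args == null || args.length != 2) {
            return "Usage: <driver class> <input path> <output path>";
        }
        for (String arg : args) {
            if (arg.trim().isEmpty()) {
                return "Input and output paths must not be empty";
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String problem = validate(args);
        if (problem != null) {
            System.err.println(problem);
            return;
        }
        System.out.println("input=" + args[0] + ", output=" + args[1]);
    }
}
```

A check of this kind would sit at the top of main, before the Job is created. Note that the job also fails at submission if the output directory already exists, so a fresh output path is needed for every run.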

Continue with Section 4: build the jar and run it on the server.

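3.3.2 Submitting the job to the cluster from Windows (WCDriver2)

Compared with WCDriver above, WCDriver2 sets several additional properties on the conf variable and points the job at the built jar file.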

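package com.atguigu.wordcount;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WCDriver2 {
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        // 1. Load the configuration and wrap it in a Job
        Configuration conf = new Configuration();
        // Cluster settings: address of HDFS (the NameNode)
        conf.set("fs.defaultFS", "hdfs://hadoop102:8020");
        // Run MapReduce on YARN
        conf.set("mapreduce.framework.name", "yarn");
        // Allow the job to be submitted to a remote cluster from another platform
        conf.set("mapreduce.app-submission.cross-platform", "true");
        // Location of the YARN ResourceManager
        conf.set("yarn.resourcemanager.hostname", "hadoop103");

        Job job = Job.getInstance(conf);

        // 2. Set the jar: point to the built jar file instead of using setJarByClass
        // job.setJarByClass(WCDriver2.class);
        job.setJar("D:\\Study_Code\\MRDemo\\target\\MRDemo-1.0-SNAPSHOT.jar");
        // 3. Set the Mapper and Reducer classes
        job.setMapperClass(WCMapper.class);
        job.setReducerClass(WCReduce.class);
        // 4. Set the key/value types of the map output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        // 5. Set the key/value types of the final output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // 6. Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // 7. Submit the job and wait for it to finish
        boolean succeeded = job.waitForCompletion(true);
        System.exit(succeeded ? 0 : 1);
    }
}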

Note: the port set in fs.defaultFS (8020 here) must match the one in the Hadoop configuration files on the server.

4. Run the jar on the server

4.1 Build the jar.

4.2 Upload the jar to the server.

4.3 Run the WCDriver class from the jar:

hadoop jar MRDemo-1.0-SNAPSHOT.jar com.atguigu.wordcount.WCDriver /input/hello.txt /output

4.4 Open the HDFS web UI at hadoop102:9870 and check the result.
