1. Getting Started with MapReduce: MapReduce Processes, Common Serialization Types, and a WordCount Example

Author: 页川叶川

Category: MapReduce Study Notes. Tags: mapreduce, hadoop

Article link: https://blog.csdn.net/affluent6/article/details/118540021

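Chapter 1 MapReduce Overview

1.1 MapReduce Processes

A complete MapReduce program running in distributed mode has three kinds of instance processes:
(1) MRAppMaster: schedules the whole job and coordinates the state of its tasks
(2) MapTask: handles the entire data-processing flow of the Map phase
(3) ReduceTask: handles the entire data-processing flow of the Reduce phase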

1.2 Common Data Serialization Types

Java type	Hadoop Writable type
boolean	BooleanWritable
byte	ByteWritable
int	IntWritable
float	FloatWritable
long	LongWritable
double	DoubleWritable
String	Text
Map	MapWritable
array	ArrayWritable
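
To make the idea behind these wrapper types concrete, here is a minimal JDK-only sketch of the write/readFields pattern that Hadoop's Writable interface follows. IntBox and its helper methods are hypothetical names invented for this illustration; it is not org.apache.hadoop.io.IntWritable, only a stand-in that serializes an int to a compact binary form and reads it back into a reusable object.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a Writable-style wrapper around int (not a Hadoop class).
final class IntBox {
    private int value;

    IntBox() { }

    IntBox(int value) { this.value = value; }

    int get() { return value; }

    // Serialize the wrapped value as 4 big-endian bytes.
    void write(DataOutput out) throws IOException {
        out.writeInt(value);
    }

    // Overwrite this instance's state from the stream, so one object can be reused.
    void readFields(DataInput in) throws IOException {
        value = in.readInt();
    }

    // Helper: serialize a box into a byte array.
    static byte[] toBytes(IntBox box) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            box.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Helper: deserialize bytes into an existing box.
    static void fromBytes(byte[] bytes, IntBox target) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            target.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = toBytes(new IntBox(42));
        IntBox copy = new IntBox();
        fromBytes(bytes, copy);
        System.out.println(bytes.length + " bytes, value " + copy.get());
    }
}
```

The point of the pattern is that the object controls its own binary layout and can be refilled in place, which keeps serialized records small and avoids allocating a new object per record.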

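1.3 MapReduce Programming Conventions

A user program consists of three parts: the Mapper, the Reducer, and the Driver.

1.3.1 Mapper Phase

(1) A user-defined Mapper must extend the framework's Mapper parent class
(2) The Mapper's input is a key/value (KV) pair (the KV types can be customized)
(3) The Mapper's business logic goes in the map() method
(4) The Mapper's output is also a KV pair (the KV types can be customized)
(5) The MapTask process calls map() once for every input <K,V> pair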
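
The Mapper contract above can be sketched in plain Java without any Hadoop classes. MapPhaseSketch and its methods are hypothetical names for this illustration. It assumes line-oriented text input where the key is the offset at which a line starts and the value is the line itself; Hadoop's text input uses byte offsets, which coincide with the character offsets used here only for ASCII text.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for the Mapper contract; no Hadoop classes are used.
final class MapPhaseSketch {

    // Build the input pairs: key = offset where the line starts, value = the line.
    static List<Map.Entry<Long, String>> inputPairs(String text) {
        List<Map.Entry<Long, String>> pairs = new ArrayList<>();
        long offset = 0;
        for (String line : text.split("\n")) {
            pairs.add(new SimpleEntry<Long, String>(offset, line));
            offset += line.length() + 1; // +1 for the newline separator
        }
        return pairs;
    }

    // Stand-in for map(): sees exactly one input pair and may emit any number of output pairs.
    static void map(long key, String value, List<Map.Entry<String, Integer>> out) {
        for (String token : value.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new SimpleEntry<String, Integer>(token, 1));
            }
        }
    }

    // Stand-in for the framework loop: one map() call per input pair.
    static List<Map.Entry<String, Integer>> runMapPhase(String text) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (Map.Entry<Long, String> pair : inputPairs(text)) {
            map(pair.getKey(), pair.getValue(), out);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(runMapPhase("hello world\nhello hadoop"));
    }
}
```

Note that the map step does no aggregation: the word "hello" is emitted twice, once per occurrence, and combining the counts is left to the reduce side.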

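1.3.2 Reducer Phase

(1) A user-defined Reducer must extend the framework's Reducer parent class
(2) The Reducer's input types match the Mapper's output types, so the input is also a KV pair
(3) The Reducer's business logic goes in the reduce() method
(4) The ReduceTask process calls reduce() once for each group of <K,V> pairs that share the same key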
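
As a rough illustration of point (4), the following JDK-only sketch groups map output by key and then calls a reduce function once per group. ReducePhaseSketch, pair, and runReducePhase are hypothetical names; the TreeMap merely imitates the sorted, grouped view a reducer receives and says nothing about how Hadoop implements its shuffle.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for grouping by key and reducing; no Hadoop classes are used.
final class ReducePhaseSketch {

    // Convenience factory for a (key, value) pair.
    static Map.Entry<String, Integer> pair(String key, int value) {
        return new SimpleEntry<String, Integer>(key, value);
    }

    // Stand-in for reduce(): receives one key and all values emitted for that key.
    static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Group the pairs by key in sorted order, then call reduce() once per group.
    static Map<String, Integer> runReducePhase(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> entry : mapOutput) {
            groups.computeIfAbsent(entry.getKey(), k -> new ArrayList<>()).add(entry.getValue());
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : groups.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = Arrays.asList(
                pair("hello", 1), pair("world", 1), pair("hello", 1), pair("hadoop", 1));
        System.out.println(runReducePhase(mapOutput));
    }
}
```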

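1.3.3 Driver Phase

The Driver acts as the client of the YARN cluster. It submits the whole program to YARN in the form of a Job object that encapsulates the MapReduce program's runtime parameters.

1.4 WordCount Example

Requirement: count the total number of occurrences of each word in a given text file (hello.txt) and output the result.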

1.4.1 Create a Maven project named MapReduce-0100-WordCount

Omitted.

1.4.2 Add the required dependencies

<dependencies>
	<dependency>
		<groupId>junit</groupId>
		<artifactId>junit</artifactId>
		<version>RELEASE</version>
	</dependency>
	<dependency>
		<groupId>org.apache.logging.log4j</groupId>
		<artifactId>log4j-core</artifactId>
		<version>2.8.2</version>
	</dependency>
	<dependency>
		<groupId>org.apache.hadoop</groupId>
		<artifactId>hadoop-common</artifactId>
		<version>2.7.2</version>
	</dependency>
	<dependency>
		<groupId>org.apache.hadoop</groupId>
		<artifactId>hadoop-client</artifactId>
		<version>2.7.2</version>
	</dependency>
	<dependency>
		<groupId>org.apache.hadoop</groupId>
		<artifactId>hadoop-hdfs</artifactId>
		<version>2.7.2</version>
	</dependency>
</dependencies>

1.4.3 Configure logging

In the project's src/main/resources directory, create a file named "log4j.properties" with the following content:

log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spring.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n

1.4.4 Write the program

(1) Write the Mapper class

package com.xqzhao.wordcount;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * KEYIN:    type of the map-phase input key: LongWritable
 * VALUEIN:  type of the map-phase input value: Text
 * KEYOUT:   type of the map-phase output key: Text
 * VALUEOUT: type of the map-phase output value: IntWritable
 */
public class WcMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Reused across map() calls so no new objects are allocated per record
    private final Text word = new Text();
    private final IntWritable one = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Get the current line
        String line = value.toString();

        // Split the line on runs of whitespace
        String[] tokens = line.trim().split("\\s+");

        // Hand each word to the framework as a (word, 1) pair
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // a blank line yields a single empty token
            }
            word.set(token);
            context.write(word, one);
        }
    }
}
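
How a line is tokenized affects the final counts. Splitting on a single space returns an empty string for every extra space, and a word count would tally those empty strings as a word. The JDK-only sketch below compares the two approaches; TokenizeSketch and its method names are hypothetical and exist only for this comparison.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Compares two ways of tokenizing a line; uses only java.lang.String.
final class TokenizeSketch {

    // Splits on every single space, so consecutive spaces produce empty tokens.
    static List<String> splitOnSingleSpace(String line) {
        return Arrays.asList(line.split(" "));
    }

    // Splits on runs of whitespace and drops the empty token a blank line produces.
    static List<String> splitOnWhitespace(String line) {
        List<String> tokens = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "hello  world"; // two spaces between the words
        System.out.println(splitOnSingleSpace(line).size() + " tokens with split(\" \")");
        System.out.println(splitOnWhitespace(line).size() + " tokens with split(\"\\\\s+\")");
    }
}
```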

(2) Write the Reducer class

package com.xqzhao.wordcount;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WcReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    // Reused output value
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {

        // 1 Sum the counts for this key
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }

        // 2 Wrap the result and emit it
        total.set(sum);
        context.write(key, total);
    }
}

(3) Write the Driver class:

package com.xqzhao.wordcount;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WcDriver {

    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {

        // 1 Load the configuration and create the job
        Job job = Job.getInstance(new Configuration());

        // 2 Tell Hadoop which jar to ship by pointing at a class inside it
        job.setJarByClass(WcDriver.class);

        // 3 Set the Mapper and Reducer classes
        job.setMapperClass(WcMapper.class);
        job.setReducerClass(WcReducer.class);

        // 4 Set the map output key/value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // 5 Set the final output key/value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // 6 Set the input and output paths. Note: the output path must be a directory that does not exist yet
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // 7 Submit the job and wait for it to finish
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}
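
Because the job fails if the output directory already exists, it can be convenient to check the two arguments before submitting. The sketch below is a hypothetical pre-flight helper (JobArgsCheck is not part of Hadoop) and it assumes the paths are on the local file system, as in a local-mode run. For HDFS paths the equivalent check would have to go through Hadoop's own FileSystem API instead of java.nio.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical pre-flight check for local-mode runs; not a Hadoop class.
final class JobArgsCheck {

    // Returns a problem description, or an empty Optional when the arguments look usable.
    static Optional<String> validate(String[] args) {
        if (args == null || args.length != 2) {
            return Optional.of("usage: <input path> <output path>");
        }
        Path input = Paths.get(args[0]);
        Path output = Paths.get(args[1]);
        if (!Files.exists(input)) {
            return Optional.of("input path does not exist: " + input);
        }
        if (Files.exists(output)) {
            return Optional.of("output path already exists: " + output);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(validate(args).orElse("arguments look usable"));
    }
}
```

A driver could call validate(args) first and exit with a non-zero status when a problem is reported, which gives a clearer message than an array index error or a failed job.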

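hello.txt contains many lines, so how does the framework process them one line at a time? The answer is in the run() method of the Mapper class, shown below:

public void run(Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context context) throws IOException, InterruptedException {
    this.setup(context);

    try {
        while (context.nextKeyValue()) {
            this.map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
    } finally {
        this.cleanup(context);
    }
}

The while loop in this code is the answer: it calls map() once for every key/value pair until the input is exhausted.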
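
The same lifecycle can be reproduced in a few lines of plain Java, which makes the order of calls easy to observe. MiniMapper and RecordingMapper are hypothetical classes written for this illustration; they imitate only the setup, per-record map, and cleanup structure of run(), not Hadoop's Context or its record reader.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical miniature of the setup / map-per-record / cleanup lifecycle.
abstract class MiniMapper {
    protected void setup() { }

    protected abstract void map(String record);

    protected void cleanup() { }

    // Mirrors the structure of run(): setup once, map once per record, cleanup always.
    final void run(Iterator<String> records) {
        setup();
        try {
            while (records.hasNext()) {
                map(records.next());
            }
        } finally {
            cleanup();
        }
    }
}

// Records every lifecycle call so the order can be inspected.
final class RecordingMapper extends MiniMapper {
    final List<String> calls = new ArrayList<>();

    @Override
    protected void setup() { calls.add("setup"); }

    @Override
    protected void map(String record) { calls.add("map:" + record); }

    @Override
    protected void cleanup() { calls.add("cleanup"); }

    public static void main(String[] args) {
        RecordingMapper mapper = new RecordingMapper();
        mapper.run(Arrays.asList("line 1", "line 2").iterator());
        System.out.println(mapper.calls);
    }
}
```

Because cleanup() sits in a finally block, it runs even when map() throws, which is why resources opened in setup() can be released there.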

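1.4.5 Package and test on the cluster

(0) To build the jar with Maven, add the following build plugins to pom.xml.
Note: replace the main class name below with your own project's main class.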

<build>
	<plugins>
		<plugin>
			<artifactId>maven-compiler-plugin</artifactId>
			<version>2.3.2</version>
			<configuration>
				<source>1.8</source>
				<target>1.8</target>
			</configuration>
		</plugin>
		<plugin>
			<artifactId>maven-assembly-plugin</artifactId>
			<configuration>
				<descriptorRefs>
					<descriptorRef>jar-with-dependencies</descriptorRef>
				</descriptorRefs>
				<archive>
					<manifest>
						<mainClass>com.xqzhao.wordcount.WcDriver</mainClass>
					</manifest>
				</archive>
			</configuration>
			<executions>
				<execution>
					<id>make-assembly</id>
					<phase>package</phase>
					<goals>
						<goal>single</goal>
					</goals>
				</execution>
			</executions>
		</plugin>
	</plugins>
</build>

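Note: if the project shows a red cross (error marker), right-click the project -> Maven -> Update Project.

(1) Package the program as a jar and copy it to the Hadoop cluster
Steps: right-click the project -> Run As -> Maven install. When the build finishes, the jar is generated in the project's target folder. If it does not appear, right-click the project -> Refresh. Rename the jar without dependencies to wc.jar and copy it to the Hadoop cluster.

(2) Start the Hadoop cluster
See section 4.8.2 of the notes on Hadoop fully distributed mode.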

(3) Run the WordCount program

[xqzhao@hadoop100 software]$ hadoop jar wc.jar com.xqzhao.wordcount.WcDriver /user/xqzhao/input /user/xqzhao/output

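Disclaimer: these are notes taken while studying. If they infringe on any rights, please let me know and I will remove them.
Original video: https://www.bilibili.com/video/BV1Me411W7PV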
