WordCount Case Study
1. The Official WordCount Source
Decompiling the official WordCount example with a decompiler shows that it consists of a Map class, a Reduce class, and a driver class, and that the data types involved are Hadoop's own serialization types.
2. Common Data Serialization Types
Hadoop serialization types corresponding to common Java data types:

Java type | Hadoop Writable type
---|---
boolean | BooleanWritable
byte | ByteWritable
int | IntWritable
float | FloatWritable
long | LongWritable
double | DoubleWritable
String | Text
Map | MapWritable
Array | ArrayWritable
3. MapReduce Programming Conventions
A user-written program consists of three parts: a Mapper, a Reducer, and a Driver.
3.1 The Mapper Phase
- A user-defined Mapper must extend the Mapper parent class
- The Mapper's input data takes the form of KV pairs (the KV types are user-definable)
- The Mapper's business logic goes in the map() method
- The Mapper's output data takes the form of KV pairs (the KV types are user-definable)
- The map() method (run by a MapTask process) is called once for each input <K,V> pair
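The map step described above can be sketched in plain Java, without the Hadoop classes, as a function that turns one input line into (word, 1) pairs. The `WordPair` record here is purely an illustrative stand-in for Hadoop's `<Text, IntWritable>` output pair:

```java
import java.util.ArrayList;
import java.util.List;

public class MapSketch {
    // Illustrative stand-in for a Hadoop <Text, IntWritable> output pair
    record WordPair(String word, int count) {}

    // Mimics one map() call: invoked once per input line,
    // emitting a (word, 1) pair for every word in the line
    static List<WordPair> map(String line) {
        List<WordPair> out = new ArrayList<>();
        for (String word : line.split(" ")) {
            out.add(new WordPair(word, 1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(map("hello world hello"));
    }
}
```

Note that duplicates are not summed here: the map phase only emits raw (word, 1) pairs, and the summing happens later in the reduce phase.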
3.2 The Reducer Phase
- A user-defined Reducer must extend the Reducer parent class
- The Reducer's input data types match the Mapper's output data types, and are also KV pairs
- The Reducer's business logic goes in the reduce() method
- The ReduceTask process calls reduce() once for each group of <k,v> pairs sharing the same key
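Conceptually, after the shuffle groups the Mapper output by key, each reduce() call sums the values of one group. A minimal plain-Java sketch of that per-group call, again without the Hadoop classes:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReduceSketch {
    // Mimics one reduce() call: receives one key and all of its
    // values, and returns the summed count for that key
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Grouped (word -> counts) input, as the shuffle would deliver it
        Map<String, List<Integer>> grouped = new TreeMap<>();
        grouped.put("hello", List.of(1, 1));
        grouped.put("world", List.of(1));
        grouped.forEach((k, v) -> System.out.println(k + "\t" + reduce(k, v)));
    }
}
```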
3.3 The Driver Phase
The Driver acts as a client of the YARN cluster: it submits the entire program to YARN as a job object that encapsulates the MapReduce program's runtime parameters.
4. Implementing WordCount
Goal: count the total number of occurrences of each word in a given text file.
Following the MapReduce programming conventions, we write a Mapper, a Reducer, and a Driver.
4.1 Setting Up the Environment
Create a Maven project named MapReduceWordCount and add the dependencies:
<dependencies>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.13</version>
    </dependency>
    <dependency>
        <groupId>org.apache.logging.log4j</groupId>
        <artifactId>log4j-core</artifactId>
        <version>2.8.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.7.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.7.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.7.2</version>
    </dependency>
</dependencies>
Add a log4j.properties file:
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spring.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n
4.2 WordCountMapper
Be careful to import the correct packages!
package com.atguigu.mapreduce;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
/**
 * @Date 2020/7/7 18:17
 * @Version 10.21
 * @Author DuanChaojie
 * 1. LongWritable: the input KEY type; by default, the byte offset at which the line starts
 * 2. Text: the input VALUE type; by default, the text content of the line
 * 3. Text: the KEY type of the results produced by our map method
 * 4. IntWritable: the VALUE type of the results produced by our map method
 */
public class WordCountMapper extends Mapper<LongWritable,Text, Text, IntWritable> {
Text k = new Text();
IntWritable v = new IntWritable(1);
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
// 1. Get one line of input
String line = value.toString();
// 2. Split it into words
String[] words = line.split(" ");
// 3. Emit (word, 1) for each word
for (String word : words) {
k.set(word);
context.write(k,v);
}
}
}
4.3 WordCountReduce
package com.atguigu.mapreduce;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
/**
* @Date 2020/7/7 18:17
* @Version 10.21
* @Author DuanChaojie
*/
public class WordCountReduce extends Reducer<Text, IntWritable, Text,IntWritable> {
int sum;
IntWritable v = new IntWritable();
@Override
protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
// 1. Sum the counts for this key
sum = 0;
for (IntWritable value : values) {
sum += value.get();
}
// 2. Emit (key, sum)
v.set(sum);
context.write(key,v);
}
}
4.4 WordCountDriver
package com.atguigu.mapreduce;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
/**
* @Date 2020/7/7 18:17
* @Version 10.21
* @Author DuanChaojie
*/
public class WordCountDriver {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
// 1. Get the configuration info and create the job
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
// 2. Set the jar load path
job.setJarByClass(WordCountDriver.class);
// 3. Set the Mapper and Reducer classes
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(WordCountReduce.class);
// 4. Set the Mapper output KV types
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
// 5. Set the final output KV types
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
// 6. Set the input and output paths
FileInputFormat.setInputPaths(job,new Path(args[0]));
FileOutputFormat.setOutputPath(job,new Path(args[1]));
// 7. Submit the job and wait for it to finish
boolean result = job.waitForCompletion(true);
System.exit(result ? 0:1);
}
}
5. Testing
Local test
Set the program arguments, run, and check the result.
Cluster test
Build a jar with Maven; the following packaging plugins need to be added:
<build>
    <plugins>
        <plugin>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>2.3.2</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
            </configuration>
        </plugin>
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
                <archive>
                    <manifest>
                        <!-- your project's main class -->
                        <mainClass>com.atguigu.mapreduce.WordCountDriver</mainClass>
                    </manifest>
                </archive>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
Rename MapReduceWordCount-1.0-SNAPSHOT.jar to wc.jar and copy it to the Hadoop cluster.
Start the Hadoop cluster, then run the WordCount program:
hadoop jar wc.jar com.atguigu.mapreduce.WordCountDriver /user/atguigu/input /user/atguigu/output
After the job finishes, check the result, or download the output files and view their contents. Note that the output directory must not already exist, or the job will fail with a FileAlreadyExistsException.