1. Overview
MapReduce is a distributed computing model proposed by Google. It was originally used mainly in the search domain to solve computation problems over massive data sets.
MapReduce runs in a distributed fashion and consists of two phases, Map and Reduce. The Map phase is an independent program that runs on many nodes at the same time, each node processing a portion of the data. The Reduce phase is likewise an independent program that runs on many nodes in parallel, each node processing a portion of the data (for now, you can simply think of Reduce as a separate aggregation program).
The MapReduce framework provides default implementations for everything else; users only need to override the map() and reduce() functions to obtain a distributed computation, which makes it very easy to use.
Both functions take and emit <key, value> pairs, so take care to construct the <k, v> pairs correctly.
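Taking the WordCount example later in this article as an illustration, the <k, v> flow for one line of input looks roughly like this (sample data):
Map input:              <0, "hello world hello">
Map output:             <hello, 1>  <world, 1>  <hello, 1>
After shuffle/grouping: <hello, [1, 1]>  <world, [1]>
Reduce output:          <hello, 2>  <world, 1>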
2. How it works
3. Implementing MapReduce in Java (WordCount example)
| Java primitive type | Writable | Serialized size (bytes) |
| --- | --- | --- |
| boolean | BooleanWritable | 1 |
| byte | ByteWritable | 1 |
| int | IntWritable | 4 |
| int (variable-length) | VIntWritable | 1~5 |
| float | FloatWritable | 4 |
| long | LongWritable | 8 |
| long (variable-length) | VLongWritable | 1~9 |
| double | DoubleWritable | 8 |
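Here is a small local example (a sketch; the class name WritableSizeDemo is made up) that contrasts the serialized sizes of IntWritable and VIntWritable:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.VIntWritable;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
public class WritableSizeDemo {
    public static void main(String[] args) throws IOException {
        // IntWritable always serializes to 4 bytes
        ByteArrayOutputStream fixed = new ByteArrayOutputStream();
        new IntWritable(1).write(new DataOutputStream(fixed));
        // VIntWritable uses variable-length encoding; the value 1 takes a single byte
        ByteArrayOutputStream variable = new ByteArrayOutputStream();
        new VIntWritable(1).write(new DataOutputStream(variable));
        System.out.println(fixed.size() + " vs " + variable.size()); // prints: 4 vs 1
    }
}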
Map phase
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
/**
 * Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
 * <p>
 * LongWritable: corresponds to Java's long; here it holds the byte offset at which the current line starts
 * Text: one line read from the file, corresponds to Java's String
 * <p>
 * Text: the output key is of type Text
 * IntWritable: the output value is of type IntWritable, which corresponds to Java's int
 */
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, IntWritable>.Context context) throws IOException, InterruptedException {
// No need to call the parent class's map method
// super.map(key, value, context);
String[] words = value.toString().split(" ");
// System.out.println(words);
for (String word : words) {
context.write(new Text(word), new IntWritable(1));
}
}
}
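An optional refinement (just a sketch, not required): map() is invoked once per input record, so the output objects can be hoisted into fields and reused to avoid creating new objects on every call. The following is equivalent to the implementation above:
// Sketch: reusing the output objects inside WordCountMapper
private final Text outKey = new Text();
private final IntWritable one = new IntWritable(1);

@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    for (String word : value.toString().split(" ")) {
        outKey.set(word);
        // context.write serializes the <k, v> pair immediately, so reusing the objects is safe
        context.write(outKey, one);
    }
}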
Reduce phase
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
/**
* Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
*/
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
/**
* The concrete implementation of the reduce side
*
* @param key     the key passed from the map side
* @param values  the collection of values that share this key, passed from the map side
* @param context the context object
*/
@Override
protected void reduce(Text key, Iterable<IntWritable> values, Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
// No need to call the parent class's reduce method
// super.reduce(key, values, context);
int count = 0;
for (IntWritable value : values) {
count += value.get();
}
context.write(key, new IntWritable(count));
}
}
Driver implementation
FileInputFormat is the base class for all InputFormat implementations that use files as the data source. FileInputFormat keeps track of all files used as the job's input and implements the computation of input splits over them; how individual records are obtained is left to its subclasses, such as TextInputFormat.
TextInputFormat is the default implementation and handles plain text files: each line of the file becomes one record, with the line's starting byte offset in the file as the key and the line's content as the value. By default, a record ends at a newline (\n) or carriage return.
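For example (sample data): if the input file contains just the two lines "hello world" and "hello hadoop", the records TextInputFormat hands to map() look roughly like:
<0, "hello world">
<12, "hello hadoop">
where the key is the byte offset of the first character of the line ("hello world" plus the newline is 12 bytes).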
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
// The driver is used to submit our job
public class WordCountDriver {
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
Configuration conf = new Configuration();
// conf.set();
// getInstance(Configuration conf) obtains a Job instance
Job job = Job.getInstance(conf);
// Set the driver class
job.setJarByClass(WordCountDriver.class);
// job.setJar();
// Set the concrete Mapper and Reducer implementation classes
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(WordCountReducer.class);
// Set the output data types of the map side
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
// Set the output data types of the reduce side
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
/**
* setInputPaths(Job job, String commaSeparatedPaths)
*
* void setOutputPath(Job job, Path outputDir)
*/
// Set the input and output paths
FileInputFormat.setInputPaths(job,new Path("D:\\上课资料\\hadoop\\day53\\2022年4月1日\\代码\\HadoopCode15\\input\\wordcount.txt"));
FileOutputFormat.setOutputPath(job, new Path("D:\\上课资料\\hadoop\\day53\\2022年4月1日\\代码\\HadoopCode15\\output"));
// Submit the job
boolean res = job.waitForCompletion(true);
// System.exit(res ? 0 : 1);
}
}
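Note that MapReduce requires the output directory not to exist when the job is submitted; otherwise a FileAlreadyExistsException is thrown. For repeated local test runs, something like the following can be added before submitting the job (a sketch; it additionally requires importing org.apache.hadoop.fs.FileSystem):
// Sketch: delete a pre-existing output directory before submitting the job
FileSystem fs = FileSystem.get(conf);
Path output = new Path("D:\\...\\output");   // the same path passed to setOutputPath above
if (fs.exists(output)) {
    fs.delete(output, true);   // true = delete recursively
}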
4. Serialization
Converting an object into a byte sequence is called serialization.
Restoring an object from a byte sequence is called deserialization. Put more plainly, the purpose of serialization is to pass structured data across processes.
Requirements:
1. Read the student information table.
2. Split each record of the student table and wrap it in a custom serializable class.
3. On the reduce side, write that class straight to disk, converting the object to text via its toString() method.
Entity class
import org.apache.hadoop.io.Writable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
public class Student implements Writable {
String id = "";
String name = "";
String age = "";
String clazz = "";
public Student() {
}
public Student(String id, String name, String age, String clazz) {
this.id = id;
this.name = name;
this.age = age;
this.clazz = clazz;
}
@Override
public String toString() {
return id + "\t" + name + "\t" + age + "\t" + clazz;
}
public String getId() {
return id;
}
public void setId(String id) {
this.id = id;
}
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}
public String getAge() {
return age;
}
public void setAge(String age) {
this.age = age;
}
public String getClazz() {
return clazz;
}
public void setClazz(String clazz) {
this.clazz = clazz;
}
@Override
public void write(DataOutput out) throws IOException {
out.writeUTF(id);
out.writeUTF(name);
out.writeUTF(age);
out.writeUTF(clazz);
}
@Override
public void readFields(DataInput in) throws IOException {
id = in.readUTF();
name = in.readUTF();
age = in.readUTF();
clazz = in.readUTF();
}
}
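The small local test below (a sketch; the class name StudentWritableTest and the sample record are made up) verifies that write() and readFields() are inverse operations:
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
public class StudentWritableTest {
    public static void main(String[] args) throws IOException {
        // Serialization: object -> byte sequence
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        Student s1 = new Student("1001", "张三", "22", "文科六班");
        s1.write(new DataOutputStream(bos));
        // Deserialization: byte sequence -> object
        Student s2 = new Student();
        s2.readFields(new DataInputStream(new ByteArrayInputStream(bos.toByteArray())));
        System.out.println(s2); // prints the same text as s1.toString()
    }
}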
map
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class WritableMapper extends Mapper<LongWritable, Text, Text, Student> {
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, Student>.Context context) throws IOException, InterruptedException {
String[] columns = value.toString().split(",");
Student student = new Student(columns[0], columns[1], columns[2], columns[4]);
context.write(new Text(columns[0]),student);
}
}
reduce
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class WritableReducer extends Reducer<Text, Student, Text, Student> {
@Override
protected void reduce(Text key, Iterable<Student> values, Reducer<Text, Student, Text, Student>.Context context) throws IOException, InterruptedException {
for (Student student : values) {
context.write(key,student);
}
}
}
driver
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class WritableDriver {
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
job.setJarByClass(WritableDriver.class);
job.setMapperClass(WritableMapper.class);
job.setReducerClass(WritableReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Student.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Student.class);
FileInputFormat.setInputPaths(job,new Path("D:\\CodeSpace\\HadoopCode15\\JoinInput\\students.txt"));
FileOutputFormat.setOutputPath(job, new Path("D:\\CodeSpace\\HadoopCode15\\WritableOutput"));
boolean res = job.waitForCompletion(true);
System.exit(res ? 0 : 1);
}
}
5. Join: combining multiple files in MapReduce and writing the result to local files
student
import org.apache.hadoop.io.Writable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
public class Student implements Writable {
// Sample records this class represents:
// 406                          (total score)
// 施笑槐,22,女,文科六班          (name, age, gender, class)
String name = "";
String age = "";
String gender = "";
String clazz = "";
String score = "";
@Override
public String toString() {
return name + "\t" + age + "\t" + gender + "\t" + clazz + "\t" + score;
}
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}
public String getAge() {
return age;
}
public void setAge(String age) {
this.age = age;
}
public String getGender() {
return gender;
}
public void setGender(String gender) {
this.gender = gender;
}
public String getClazz() {
return clazz;
}
public void setClazz(String clazz) {
this.clazz = clazz;
}
public String getScore() {
return score;
}
public void setScore(String score) {
this.score = score;
}
public Student() {
}
public Student(String name, String age, String gender, String clazz, String score) {
this.name = name;
this.age = age;
this.gender = gender;
this.clazz = clazz;
this.score = score;
}
@Override
public void write(DataOutput out) throws IOException {
out.writeUTF(name);
out.writeUTF(age);
out.writeUTF(gender);
out.writeUTF(clazz);
out.writeUTF(score);
}
@Override
public void readFields(DataInput in) throws IOException {
name = in.readUTF();
age = in.readUTF();
gender = in.readUTF();
clazz = in.readUTF();
score = in.readUTF();
}
}
map
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import java.io.IOException;
public class ReduceJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, Text>.Context context) throws IOException, InterruptedException {
FileSplit inputSplit = (FileSplit)context.getInputSplit();
String pathName = inputSplit.getPath().getName(); // get the name of the file this split comes from
if(pathName.contains("students.txt")){
String[] columns = value.toString().split(",");
context.write(new Text(columns[0]),new Text(columns[1]+','+columns[2]+','+columns[3]+','+columns[4]));
}else {
String[] columns = value.toString().split("\t");
context.write(new Text(columns[0]),new Text(columns[1]));
}
}
}
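For example (the file names and the student id are sample data; the score file is assumed to be tab-separated): with a student file students.txt and a score file score.txt in the input directory, the map side roughly does the following:
a line of students.txt: 1500100001,施笑槐,22,女,文科六班   -> emits <1500100001, "施笑槐,22,女,文科六班">
a line of score.txt:    1500100001\t406                   -> emits <1500100001, "406">
After the shuffle, both kinds of values for the same student id arrive in a single reduce() call, which is what makes the join possible.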
reduce
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class ReduceJoinReducer extends Reducer<Text, Text, Text, Student> {
@Override
protected void reduce(Text key, Iterable<Text> values, Reducer<Text, Text, Text, Student>.Context context) throws IOException, InterruptedException {
// Each key's values contain two kinds of records, for example:
// 406                  (score only)
// 施笑槐,22,女,文科六班  (name, age, gender, class)
Student student = new Student();
for (Text value : values) {
String[] columns = value.toString().split(",");
if (columns.length > 1){
student.name = columns[0];
student.age = columns[1];
student.gender = columns[2];
student.clazz = columns[3];
}else {
student.score = value.toString();
}
}
context.write(key,student);
}
}
For the driver, start from the driver of the serialization example and change only the Mapper/Reducer implementation classes and the map-side output value type, roughly as sketched below.
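// Sketch: the parts of the section-4 driver that need to change
job.setMapperClass(ReduceJoinMapper.class);
job.setReducerClass(ReduceJoinReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);      // the map-side value changes from Student to Text
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Student.class);
// The input path should point at the directory that contains both input files, not a single file
FileInputFormat.setInputPaths(job, new Path("D:\\CodeSpace\\HadoopCode15\\JoinInput"));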
6. Partitioning: classify the map output and write it to separate result files
Here we partition the data from the serialization example in section 4.
The entity class, the map side and the reduce side stay unchanged; only a Partitioner implementation is added.
Because the partitioner receives the map-side output, its generic types are the same as the map output <key, value> types.
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
public class PartitionOperator extends Partitioner<Text,Student> {
/**
* Get the partition number for a given key (hence record) given the total
* number of partitions i.e. number of reduce-tasks for the job.
*
* <p>Typically a hash function on all or a subset of the key.</p>
*
* @param text the key to be partitioned.
* @param student the entry value.
* @param numPartitions the total number of partitions.
* @return the partition number for the <code>key</code>.
*/
@Override
public int getPartition(Text text, Student student, int numPartitions) {
// Split the students into science ("理科") and liberal-arts ("文科") groups ==> the number of reduce tasks is 2
if(student.clazz.contains("理科")){
return 0;
}else {
return 1;
}
}
}
In the driver, add the following lines to the job configuration to enable partitioning:
// Setting this to 2 means two reduce tasks, and therefore two partitions
// If the number of reduce tasks exceeds the number of partitions, some reduce tasks receive no data, which shows up as empty result files
job.setNumReduceTasks(2);
job.setPartitionerClass(PartitionOperator.class);
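With this configuration, the output directory will contain two result files (roughly speaking): part-r-00000 holds the records for which getPartition returned 0 (science-class students), and part-r-00001 holds those for which it returned 1 (liberal-arts and any other students).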