MapReduce

1. Overview

MapReduce is a distributed computing model proposed by Google. It was first used in the search domain to handle computation over massive amounts of data.

MapReduce runs in a distributed fashion and consists of two stages: Map and Reduce. The Map stage is an independent program that runs on many nodes at the same time, each node processing a portion of the data. The Reduce stage is likewise an independent program that runs on many nodes at the same time, each processing a portion of the data (for now it is enough to think of reduce as a separate aggregation program).

The MapReduce framework provides default implementations for everything else; the user only needs to override the two functions map() and reduce() to obtain a distributed computation, which makes it very simple to use.

Both functions take and emit <key, value> pairs, so take care to construct the <k, v> pairs correctly when using them.
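
For intuition, here is roughly how the <k, v> pairs flow through a word-count job (a conceptual sketch; the sample lines are made up, and the concrete code appears in section 3):

map input:     (0, "hello world")          <- (byte offset, one line of text)
map output:    ("hello", 1), ("world", 1)
reduce input:  ("hello", [1, 1, ...])      <- values grouped by key
reduce output: ("hello", 2)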

2. How It Works

3. Implementing MapReduce in Java

Java type | Writable class  | Serialized size (bytes)
----------|-----------------|------------------------
boolean   | BooleanWritable | 1
byte      | ByteWritable    | 1
int       | IntWritable     | 4
int       | VIntWritable    | 1~5
float     | FloatWritable   | 4
long      | LongWritable    | 8
long      | VLongWritable   | 1~9
double    | DoubleWritable  | 8
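
As a quick check of the size column, a value can be serialized into Hadoop's DataOutputBuffer and the byte count inspected. This is a minimal, self-contained sketch (it assumes the Hadoop client jars are on the classpath):

import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.VIntWritable;

public class WritableSizeDemo {
    public static void main(String[] args) throws Exception {
        DataOutputBuffer buf = new DataOutputBuffer();

        new IntWritable(42).write(buf);       // fixed-length encoding
        System.out.println(buf.getLength());  // prints 4

        buf.reset();                          // clear the buffer
        new VIntWritable(42).write(buf);      // variable-length encoding
        System.out.println(buf.getLength());  // prints 1 for a small value
    }
}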

Map stage

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;


/**
 * Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
 * <p>
 * LongWritable: corresponds to Java's Long; its value is the byte offset at which the line was read
 * Text: one line read from the input file; corresponds to Java's String
 * <p>
 * Text: the output key is of type Text
 * IntWritable: the output value is of type IntWritable, which corresponds to Java's int
 */
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, IntWritable>.Context context) throws IOException, InterruptedException {
        // No need to call the parent class's map() method
//        super.map(key, value, context);
        String[] words = value.toString().split(" ");
//        System.out.println(words);
        for (String word : words) {
            context.write(new Text(word), new IntWritable(1));
        }
    }
}

Reduce stage

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;


import java.io.IOException;

/**
 * Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
 */
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    /**
     * The concrete reduce-side implementation
     *
     * @param key     the key passed over from the map side
     * @param values  the collection of values from the map side that share this key
     * @param context the context object used to write the output
     */
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
        // No need to call the parent class's reduce() method
//        super.reduce(key, values, context);
        int count = 0;
        for (IntWritable value : values) {
            count += value.get();
        }
        context.write(key, new IntWritable(count));
    }
}

Driver

FileInputFormat is the base class of all InputFormat implementations that use files as their data source. It keeps track of all the files that make up the job's input and implements the computation of input splits over those files. How individual records are extracted is left to its subclasses, such as TextInputFormat.

TextInputFormat is the default input format and handles plain text files. Each line of the file becomes one record: the key is the starting byte offset of the line within the file, and the value is the content of the line. By default, a line is terminated by \n or a carriage return.
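
For example, given a hypothetical two-line input file, map() would be invoked with the following key/value pairs ("hello world" plus the trailing '\n' is 12 bytes, hence the second offset):

// wordcount.txt (made-up contents):
//   hello world
//   hello hadoop
//
// records handed to map():
//   (0,  "hello world")
//   (12, "hello hadoop")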

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

// The driver class is used to submit our Job
public class WordCountDriver {
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
//        conf.set();
        // getInstance(Configuration conf) obtains a Job instance
        Job job = Job.getInstance(conf);

        // Set the driver class
        job.setJarByClass(WordCountDriver.class);
//        job.setJar();

        // Set the concrete Mapper and Reducer implementations
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Set the map-side output key/value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Set the reduce-side (final) output key/value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        /**
         * setInputPaths(Job job, String commaSeparatedPaths)
         *
         * void setOutputPath(Job job, Path outputDir)
         */
        // Set the input and output paths
        FileInputFormat.setInputPaths(job,new Path("D:\\上课资料\\hadoop\\day53\\2022年4月1日\\代码\\HadoopCode15\\input\\wordcount.txt"));
        FileOutputFormat.setOutputPath(job, new Path("D:\\上课资料\\hadoop\\day53\\2022年4月1日\\代码\\HadoopCode15\\output"));

        // Submit the job and wait for it to finish
        boolean res = job.waitForCompletion(true);
//        System.exit(res ? 0 : 1);
    }
}
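
To run this outside the IDE, the usual approach is to package the classes into a jar and submit it with the hadoop command, for example (the jar name here is only an illustration):

hadoop jar HadoopCode15.jar WordCountDriver

With the hard-coded local paths above, the driver can also simply be run from the IDE, in which case the job executes in Hadoop's local mode.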

4. Serialization

Converting an object into a byte sequence is called serialization.

Restoring an object from a byte sequence is called deserialization. Put more directly, the purpose of serialization is to pass structured data between processes.

Requirements:
1. Read the student information file.
2. Split each record and wrap it in a custom serializable (Writable) class.
3. On the reduce side, write the Writable object directly to the output; its toString() method produces the text form of the record.

The Writable object class

import org.apache.hadoop.io.Writable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class Student implements Writable {
    String id = "";
    String name = "";
    String age = "";
    String clazz = "";

    public Student() {
    }

    public Student(String id, String name, String age, String clazz) {
        this.id = id;
        this.name = name;
        this.age = age;
        this.clazz = clazz;
    }

    @Override
    public String toString() {
        return id + "\t" + name + "\t" + age + "\t" + clazz;
    }

    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public String getAge() {
        return age;
    }

    public void setAge(String age) {
        this.age = age;
    }

    public String getClazz() {
        return clazz;
    }

    public void setClazz(String clazz) {
        this.clazz = clazz;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(id);
        out.writeUTF(name);
        out.writeUTF(age);
        out.writeUTF(clazz);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        id = in.readUTF();
        name = in.readUTF();
        age = in.readUTF();
        clazz = in.readUTF();
    }
}
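
To see write() and readFields() in action outside of a job, the Student can be round-tripped through Hadoop's in-memory buffers. This is a minimal sketch; the field values here are made up:

import org.apache.hadoop.io.DataInputBuffer;
import org.apache.hadoop.io.DataOutputBuffer;

public class StudentRoundTrip {
    public static void main(String[] args) throws Exception {
        // Serialize: this is what Hadoop does with write() when it ships the object between tasks
        DataOutputBuffer out = new DataOutputBuffer();
        new Student("1001", "Tom", "22", "Class1").write(out);

        // Deserialize: readFields() rebuilds the object from the same bytes
        DataInputBuffer in = new DataInputBuffer();
        in.reset(out.getData(), out.getLength());
        Student copy = new Student();
        copy.readFields(in);

        System.out.println(copy);   // prints the tab-separated fields: 1001 Tom 22 Class1
    }
}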

map

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;


public class WritableMapper extends Mapper<LongWritable, Text, Text, Student> {
    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, Student>.Context context) throws IOException, InterruptedException {
        String[] columns = value.toString().split(",");
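        // The file layout is assumed to be id,name,age,gender,clazz; the gender column (index 3) is skipped here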
        Student student = new Student(columns[0], columns[1], columns[2], columns[4]);
        context.write(new Text(columns[0]),student);
    }
}

reduce

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;

public class WritableReducer extends Reducer<Text, Student, Text, Student> {

   
    @Override
    protected void reduce(Text key, Iterable<Student> values, Reducer<Text, Student, Text, Student>.Context context) throws IOException, InterruptedException {
        for (Student student : values) {
            context.write(key,student);
        }
    }
}

driver

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;


public class WritableDriver {
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();

        Job job = Job.getInstance(conf);

        job.setJarByClass(WritableDriver.class);

        job.setMapperClass(WritableMapper.class);
        job.setReducerClass(WritableReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Student.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Student.class);

        FileInputFormat.setInputPaths(job,new Path("D:\\CodeSpace\\HadoopCode15\\JoinInput\\students.txt"));
        FileOutputFormat.setOutputPath(job, new Path("D:\\CodeSpace\\HadoopCode15\\WritableOutput"));

        boolean res = job.waitForCompletion(true);
        System.exit(res ? 0 : 1);
    }
}

5. Join: merging multiple input files in one MapReduce job and writing the result to a local file

student

import org.apache.hadoop.io.Writable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class Student implements Writable {
    // Sample score value: 406
    // Sample student record: 施笑槐,22,女,文科六班
    String name = "";
    String age = "";
    String gender = "";
    String clazz = "";
    String score = "";

    @Override
    public String toString() {
        return name + "\t" + age + "\t" + gender + "\t" + clazz + "\t" + score;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public String getAge() {
        return age;
    }

    public void setAge(String age) {
        this.age = age;
    }

    public String getGender() {
        return gender;
    }

    public void setGender(String gender) {
        this.gender = gender;
    }

    public String getClazz() {
        return clazz;
    }

    public void setClazz(String clazz) {
        this.clazz = clazz;
    }

    public String getScore() {
        return score;
    }

    public void setScore(String score) {
        this.score = score;
    }

    public Student() {
    }

    public Student(String name, String age, String gender, String clazz, String score) {
        this.name = name;
        this.age = age;
        this.gender = gender;
        this.clazz = clazz;
        this.score = score;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(name);
        out.writeUTF(age);
        out.writeUTF(gender);
        out.writeUTF(clazz);
        out.writeUTF(score);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        name = in.readUTF();
        age = in.readUTF();
        gender = in.readUTF();
        clazz = in.readUTF();
        score = in.readUTF();
    }
}

map

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import java.io.IOException;


public class ReduceJoinMapper extends Mapper<LongWritable, Text, Text, Text> {


    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, Text>.Context context) throws IOException, InterruptedException {
        FileSplit inputSplit = (FileSplit)context.getInputSplit();
        String pathName = inputSplit.getPath().getName(); // get the name of the file this split comes from
        if(pathName.contains("students.txt")){
            String[] columns = value.toString().split(",");
            context.write(new Text(columns[0]),new Text(columns[1]+','+columns[2]+','+columns[3]+','+columns[4]));
        }else {
            String[] columns = value.toString().split("\t");
            context.write(new Text(columns[0]),new Text(columns[1]));
        }
    }
}

reduce

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;

public class ReduceJoinReducer extends Reducer<Text, Text, Text, Student> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Reducer<Text, Text, Text, Student>.Context context) throws IOException, InterruptedException {
        // For one key, values contains both the student record ("施笑槐,22,女,文科六班") and the score ("406")
        Student student = new Student();

        for (Text value : values) {
            String[] columns = value.toString().split(",");
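            // A value with several comma-separated columns is the student record; a single column is the score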
            if (columns.length > 1){
                student.name = columns[0];
                student.age = columns[1];
                student.gender = columns[2];
                student.clazz = columns[3];
            }else {
                student.score = value.toString();
            }
        }

        context.write(key,student);

    }
}
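
For one student id, the reducer therefore sees both tagged records grouped under the same key and merges them into a single Student (a conceptual sketch; the id is made up):

// reduce input:  "1500100001" -> ["施笑槐,22,女,文科六班", "406"]
// reduce output: 1500100001    施笑槐  22  女  文科六班  406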

driver — the same as before, except that the Mapper/Reducer classes and the map/reduce output types change accordingly
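
Concretely, only the class registrations and the output types differ from the earlier drivers. This is a sketch of the changed lines, based on the generic types of ReduceJoinMapper and ReduceJoinReducer (the input path is just an example directory holding students.txt and the score file):

job.setMapperClass(ReduceJoinMapper.class);
job.setReducerClass(ReduceJoinReducer.class);

job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Student.class);

// setInputPaths accepts a directory or several paths, so both source files can be read in one job
FileInputFormat.setInputPaths(job, new Path("JoinInput"));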

6. Partitioning: splitting the map output into categorized output files

Here we add partitioning to the serialization example above.

The Writable class, the map side, and the reduce side stay unchanged; we only add a partitioner.

Because the partitioner receives the map-side output, its key/value types must match the map output types.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class PartitionOperator extends Partitioner<Text,Student> {
    /**
     * Get the partition number for a given key (hence record) given the total
     * number of partitions i.e. number of reduce-tasks for the job.
     *
     * <p>Typically a hash function on all or a subset of the key.</p>
     *
     * @param text          the key to be partitioned.
     * @param student       the entry value.
     * @param numPartitions the total number of partitions.
     * @return the partition number for the <code>key</code>.
     */
    @Override
    public int getPartition(Text text, Student student, int numPartitions) {
        // Split the students into science (理科) and liberal-arts (文科) partitions => 2 ReduceTasks
        if(student.clazz.contains("理科")){
            return 0;
        }else {
            return 1;
        }
    }
}

In the driver, add the following lines (before the job is submitted) to wire in the partitioner:

        // Setting this to 2 sets the number of ReduceTasks to 2, which also makes the number of partitions 2.
        // If there are more ReduceTasks than partitions, some ReduceTasks receive no data, which shows up as empty result files.
        job.setNumReduceTasks(2);
        job.setPartitionerClass(PartitionOperator.class);
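
For comparison, when no partitioner is configured, Hadoop falls back to HashPartitioner, whose getPartition() is essentially:

public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}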
