Hadoop Architecture
HDFS - distributed file system
MapReduce - distributed computing framework
YARN - distributed resource management system
Common
What is MapReduce?
MapReduce is a distributed computing framework.
It decomposes large data-processing jobs into individual tasks that can execute in parallel across a cluster of servers.
It originated at Google.
It fits large-scale data-processing scenarios.
Each node processes the data stored on that node (data locality).
Each job consists of a Map phase and a Reduce phase.
MapReduce Design Philosophy
Divide and conquer: a programming model that simplifies parallel computation
Build abstractions, Map and Reduce: developers focus on implementing the Mapper and Reducer functions
Hide system-level details: developers focus on implementing the business logic
MapReduce Characteristics
Advantages:
Easy to program
Scalable
Highly fault-tolerant
High throughput
Limitations:
Hard to use for real-time computation
Not suited to stream processing
Implementing WordCount with MapReduce
For the principles behind MapReduce, the article linked here is a good reference; the analysis is thorough.
MapReduce Execution Process
Data format definitions:
map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
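Instantiated for WordCount, K1 is a line's byte offset, V1 the line text, K2 a word, and V2 a count. On a made-up sample line, the two phases look like this:
map: (0, "hello world hello") → [("hello",1), ("world",1), ("hello",1)]
reduce: ("hello", [1,1]) → [("hello",2)]
reduce: ("world", [1]) → [("world",1)]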
Execution stages:
Mapper
Combiner
Partitioner (a sketch follows this list)
Shuffle and Sort
Reducer
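To make the Partitioner stage concrete, here is a minimal sketch that mirrors the logic of Hadoop's default HashPartitioner (MyPartitioner is a hypothetical class name; the WordCount job in this post relies on the default and does not register one):

package services.wc;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
public class MyPartitioner extends Partitioner<Text, LongWritable> {
    @Override
    public int getPartition(Text key, LongWritable value, int numPartitions) {
        // Mask off the sign bit so the result is non-negative, then take the
        // modulus by the reducer count; every occurrence of the same key is
        // routed to the same reducer.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

A custom partitioner would be registered with job.setPartitionerClass(MyPartitioner.class).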
WordCount Code
Create the project from the maven-archetype-quickstart template.
Before running the project on Windows, Hadoop must be configured in the Windows environment (typically HADOOP_HOME pointing at a directory containing winutils.exe); alternatively, build a fat jar and run it in a Linux environment with Hadoop installed.
The pom file:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>org.example</groupId>
  <artifactId>mymr</artifactId>
  <version>1.0-SNAPSHOT</version>

  <name>mymr</name>
  <!-- FIXME change it to the project's website -->
  <url>http://www.example.com</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
  </properties>

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
      <scope>test</scope>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-hdfs</artifactId>
      <version>2.6.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.6.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>2.6.0</version>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>2.3.2</version>
        <configuration>
          <source>1.8</source>
          <target>1.8</target>
        </configuration>
      </plugin>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
          <archive>
            <manifest>
              <mainClass>services.wc.MyDriver</mainClass>
            </manifest>
          </archive>
        </configuration>
        <executions>
          <execution>
            <id>make-assembly</id>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
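With the assembly plugin bound to the package phase as above, mvn clean package produces target/mymr-1.0-SNAPSHOT-jar-with-dependencies.jar (the name follows Maven's artifactId-version convention) with services.wc.MyDriver recorded as the main class, so on a Linux machine with Hadoop installed the job can be launched with:

hadoop jar target/mymr-1.0-SNAPSHOT-jar-with-dependencies.jar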
The Mapper implementation:

package services.wc;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    // Reused across map() calls to avoid allocating a new object per record
    private final LongWritable one = new LongWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the input line on spaces and emit (word, 1) for every token
        String[] wds = value.toString().split(" ");
        for (String wd : wds) {
            word.set(wd);
            context.write(word, one);
        }
    }
}
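A note on the split: splitting on a single space misses tabs and yields empty tokens for consecutive spaces, so if the input is not strictly single-space delimited, splitting on the whitespace regex, value.toString().split("\\s+"), is a common hardening.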
Note in particular that Text must be org.apache.hadoop.io.Text; it is easy to import the wrong Text class here, which typically results in a job that produces no output.
The Reducer implementation:

package services.wc;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class MyReduce extends Reducer<Text, LongWritable, Text, LongWritable> {
    // Reused across reduce() calls
    private final LongWritable res = new LongWritable();

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the 1s emitted by the mappers for this word
        long ressum = 0;
        for (LongWritable one : values) {
            ressum += one.get();
        }
        res.set(ressum);
        context.write(key, res);
    }
}
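Optional, and not wired into the driver below: because MyReduce's input types (Text, LongWritable) are identical to its output types, the same class can double as a Combiner, pre-aggregating counts on the map side to shrink shuffle traffic:

job.setCombinerClass(MyReduce.class);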
The job driver:

package services.wc;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Work around the default FileSystem configuration failing to read
        // files properly (common when running from a jar-with-dependencies)
        conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
        conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());

        // The job fails if the output directory already exists, so delete it
        // up front; java.io.File.delete() cannot remove a non-empty directory,
        // so use the Hadoop FileSystem API to delete it recursively
        Path outPath = new Path("d://eee");
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(outPath)) {
            fs.delete(outPath, true);
        }

        // Prepare an empty job
        Job job = Job.getInstance(conf);
        // Set the job's main driver class
        job.setJarByClass(MyDriver.class);
        // Set the job's input source
        FileInputFormat.addInputPath(job, new Path("D://study/abc.txt"));
        // Set the Mapper class
        job.setMapperClass(MyMapper.class);
        // Set the Mapper's output key/value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        // Set the Reducer class
        job.setReducerClass(MyReduce.class);
        // Set the final (Reducer) output key/value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        // Set the job's output target
        FileOutputFormat.setOutputPath(job, outPath);
        // Submit the job and wait for completion
        job.waitForCompletion(true);
    }
}
After the program finishes successfully, four files are generated in the output directory (for a local run, typically _SUCCESS, part-r-00000, and their hidden .crc checksum files); the final WordCount result is stored in part-r-00000.
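For illustration only, with made-up input: if abc.txt contained the single line "hello world hello", part-r-00000 would hold the following (TextOutputFormat separates key and value with a tab):

hello	2
world	1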