MapReduce Enhancements
1. Partitioning
In MapReduce, a partitioner assigns each map output record to a partition, and all records in the same partition are processed by the same reduce task; the number of partitions must not exceed the number of reduce tasks.
Note: a job that uses custom partitions has to be packaged as a jar and submitted to the cluster; it cannot be run locally.
To partition data, write a custom partitioner class that extends Partitioner with the map output key/value types and override its getPartition method; the return value decides which partition (and therefore which reduce task) a record is sent to. A minimal sketch follows below.
In the main method, register the partitioner class and set the number of reduce tasks, making sure the number of partitions matches the number of reduce tasks.
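A minimal sketch of such a partitioner, assuming Text keys and NullWritable values (the class name AlphabetPartitioner and the two-way split rule are illustrative; the notes' own phone-number partitioner appears in Requirement 3 below):
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
public class AlphabetPartitioner extends Partitioner<Text, NullWritable> {
    @Override
    public int getPartition(Text key, NullWritable value, int numReduceTasks) {
        // Keys starting with a–m go to partition 0, everything else to partition 1.
        String k = key.toString();
        if (k.isEmpty()) {
            return 0;
        }
        char first = Character.toLowerCase(k.charAt(0));
        return first <= 'm' ? 0 : 1;
    }
}
In the driver, the partitioner and a matching number of reduce tasks would then be registered:
job.setPartitionerClass(AlphabetPartitioner.class);
job.setNumReduceTasks(2);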
2. Sorting and serialization in MapReduce
Serialization: converting a structured object into a byte stream.
Deserialization: converting a byte stream back into a structured object.
When objects need to be passed between processes or persisted, they are serialized into byte streams; conversely, a received byte stream is deserialized back into an object.
Writable is Hadoop's serialization format; a class becomes serializable by implementing the Writable interface. Writable has a sub-interface, WritableComparable, which lets a key be both serialized and compared/sorted.
Example:
Data:
a 1
a 9
b 3
a 7
b 8
b 10
a 5
Requirement: sort the first column in dictionary order; when the first column is equal, sort the second column in ascending order.
Step 1: define a custom data type and comparator
import org.apache.hadoop.io.WritableComparable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
public class PairWritable implements WritableComparable<PairWritable> {
private String first;
private int second;
public String getFirst() {
return first;
}
public void setFirst(String first) {
this.first = first;
}
public int getSecond() {
return second;
}
public void setSecond(int second) {
this.second = second;
}
@Override
public String toString() {
return first + "-----" + second;
}
@Override
public int compareTo(PairWritable o) {
// Sort by the first field in dictionary order; break ties by the second field, ascending.
int i = this.first.compareTo(o.first);
if (i != 0) {
return i;
}
return Integer.compare(this.second, o.second);
}
@Override
public void write(DataOutput output) throws IOException {
output.writeUTF(first);
output.writeInt(second);
}
@Override
public void readFields(DataInput input) throws IOException {
this.first = input.readUTF();
this.second = input.readInt();
}
}
Step 2: write the map logic
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class PairWritableMapper extends Mapper<LongWritable, Text, PairWritable, NullWritable> {
private PairWritable pairWritable = new PairWritable();
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String[] split = value.toString().split("\t");
pairWritable.setFirst(split[0]);
pairWritable.setSecond(Integer.valueOf(split[1]));
context.write(pairWritable, NullWritable.get());
}
}
Step 3: write the reduce class
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class PairWritableReducer extends Reducer<PairWritable, NullWritable, PairWritable, NullWritable> {
@Override
protected void reduce(PairWritable key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
for (NullWritable value : values) {
context.write(key, NullWritable.get());
}
}
}
Step 4: write the main method that runs the job
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class PairWritableJobMain extends Configured implements Tool {
@Override
public int run(String[] strings) throws Exception {
Job job = Job.getInstance(super.getConf(), "job");
job.setJarByClass(PairWritableJobMain.class);
job.setInputFormatClass(TextInputFormat.class);
TextInputFormat.addInputPath(job, new Path("file:///D:\\排序\\input"));
job.setMapperClass(PairWritableMapper.class);
job.setMapOutputKeyClass(PairWritable.class);
job.setMapOutputValueClass(NullWritable.class);
job.setReducerClass(PairWritableReducer.class);
job.setOutputKeyClass(PairWritable.class);
job.setOutputValueClass(NullWritable.class);
job.setOutputFormatClass(TextOutputFormat.class);
TextOutputFormat.setOutputPath(job, new Path("file:///D:\\排序\\output"));
boolean b = job.waitForCompletion(true);
return b ? 0 : 1;
}
public static void main(String[] args) throws Exception {
Configuration configuration = new Configuration();
Tool tool = new PairWritableJobMain();
int run = ToolRunner.run(configuration, tool, args);
System.exit(run);
}
}
3. Counters in MapReduce
Hadoop's built-in counter groups:
Counter group | Class |
---|---|
MapReduce task counters | org.apache.hadoop.mapreduce.TaskCounter |
File system counters | org.apache.hadoop.mapreduce.FileSystemCounter |
FileInputFormat counters | org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter |
FileOutputFormat counters | org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter |
Job counters | org.apache.hadoop.mapreduce.JobCounter |
Counters can be used in two ways.
Option 1: obtain a counter from the context object:
Counter counter = context.getCounter("MR_COUNT", "MapRecordCounter");
counter.increment(1L);
Option 2: define counters with an enum type (see the sketch after the snippet):
public static enum Counter{
REDUCE_INPUT_RECORDS, REDUCE_INPUT_VAL_NUMS,
}
context.getCounter(Counter.REDUCE_INPUT_RECORDS).increment(1L);
context.getCounter(Counter.REDUCE_INPUT_VAL_NUMS).increment(1L);
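A minimal sketch of where the enum-based counter might live, assuming a reducer shaped like the PairWritableReducer above (the class name CountingReducer and the counter semantics are illustrative):
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;
public class CountingReducer extends Reducer<PairWritable, NullWritable, PairWritable, NullWritable> {
    // Each enum constant becomes one counter; Hadoop aggregates it across all tasks.
    public static enum Counter {
        REDUCE_INPUT_RECORDS, REDUCE_INPUT_VAL_NUMS
    }
    @Override
    protected void reduce(PairWritable key, Iterable<NullWritable> values, Context context)
            throws IOException, InterruptedException {
        // One increment per distinct key.
        context.getCounter(Counter.REDUCE_INPUT_RECORDS).increment(1L);
        for (NullWritable value : values) {
            // One increment per value belonging to that key.
            context.getCounter(Counter.REDUCE_INPUT_VAL_NUMS).increment(1L);
            context.write(key, NullWritable.get());
        }
    }
}
The counter totals are printed along with the job's other counters when the job finishes.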
4. The combiner in MapReduce
Every map task produces a large amount of output. A combiner performs a local merge of the map output before it is transferred, which reduces the amount of data shipped between map and reduce nodes and improves network I/O; it is one of MapReduce's optimization techniques.
1. The combiner is a MapReduce component.
2. The parent class of a combiner is Reducer.
3. The difference between a combiner and a reducer is where they run.
4. The purpose of a combiner is to locally aggregate the output of each map task.
5. Implementation steps:
Write a custom combiner that extends Reducer and override its reduce method (a sketch follows below).
Register it in the job: job.setCombinerClass(CustomCombiner.class)
A combiner may only be used when it does not affect the final business logic, and its output key/value types must match the reducer's input key/value types.
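A minimal word-count-style sketch of such a combiner, assuming the map output is Text keys with LongWritable counts (this pairing is illustrative and not part of the flow example below):
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class CustomCombiner extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the counts produced by one map task so only one record per key crosses the network.
        long sum = 0;
        for (LongWritable value : values) {
            sum += value.get();
        }
        context.write(key, new LongWritable(sum));
    }
}
Because its input and output types are identical, the same class could in principle also be used as the reducer.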
5. A comprehensive MapReduce exercise
Data file:
1363157985066 13726230503 00-FD-07-A4-72-B8:CMCC 120.196.100.82 i02.c.aliimg.com 游戏娱乐 24 27 2481 24681 200
1363157995052 13826544101 5C-0E-8B-C7-F1-E0:CMCC 120.197.40.4 jd.com 京东购物 4 0 264 0 200
1363157991076 13926435656 20-10-7A-28-CC-0A:CMCC 120.196.100.99 taobao.com 淘宝购物 2 4 132 1512 200
1363154400022 13926251106 5C-0E-8B-8B-B1-50:CMCC 120.197.40.4 cnblogs.com 技术门户 4 0 240 0 200
1363157993044 18211575961 94-71-AC-CD-E6-18:CMCC-EASY 120.196.100.99 iface.qiyi.com 视频网站 15 12 1527 2106 200
1363157995074 84138413 5C-0E-8B-8C-E8-20:7DaysInn 120.197.40.4 122.72.52.12 未知 20 16 4116 1432 200
1363157993055 13560439658 C4-17-FE-BA-DE-D9:CMCC 120.196.100.99 sougou.com 综合门户 18 15 1116 954 200
1363157995033 15920133257 5C-0E-8B-C7-BA-20:CMCC 120.197.40.4 sug.so.360.cn 信息安全 20 20 3156 2936 200
1363157983019 13719199419 68-A1-B7-03-07-B1:CMCC-EASY 120.196.100.82 baidu.com 综合搜索 4 0 240 0 200
1363157984041 13660577991 5C-0E-8B-92-5C-20:CMCC-EASY 120.197.40.4 s19.cnzz.com 站点统计 24 9 6960 690 200
1363157973098 15013685858 5C-0E-8B-C7-F7-90:CMCC 120.197.40.4 rank.ie.sogou.com 搜索引擎 28 27 3659 3538 200
1363157986029 15989002119 E8-99-C4-4E-93-E0:CMCC-EASY 120.196.100.99 www.umeng.com 站点统计 3 3 1938 180 200
1363157992093 13560439658 C4-17-FE-BA-DE-D9:CMCC 120.196.100.99 zhilian.com 招聘门户 15 9 918 4938 200
1363157986041 13480253104 5C-0E-8B-C7-FC-80:CMCC-EASY 120.197.40.4 csdn.net 技术门户 3 3 180 180 200
1363157984040 13602846565 5C-0E-8B-8B-B6-00:CMCC 120.197.40.4 2052.flash2-http.qq.com 综合门户 15 12 1938 2910 200
1363157995093 13922314466 00-FD-07-A2-EC-BA:CMCC 120.196.100.82 img.qfc.cn 图片大全 12 12 3008 3720 200
1363157982040 13502468823 5C-0A-5B-6A-0B-D4:CMCC-EASY 120.196.100.99 y0.ifengimg.com 综合门户 57 102 7335 110349 200
1363157986072 18320173382 84-25-DB-4F-10-1A:CMCC-EASY 120.196.100.99 input.shouji.sogou.com 搜索引擎 21 18 9531 2412 200
1363157990043 13925057413 00-1F-64-E1-E6-9A:CMCC 120.196.100.55 t3.baidu.com 搜索引擎 69 63 11058 48243 200
1363157988072 13760778710 00-FD-07-A4-7B-08:CMCC 120.196.100.82 http://youku.com/ 视频网站 2 2 120 120 200
1363157985079 13823070001 20-7C-8F-70-68-1F:CMCC 120.196.100.99 img.qfc.cn 图片浏览 6 3 360 180 200
1363157985069 13600217502 00-1F-64-E2-E8-B1:CMCC 120.196.100.55 www.baidu.com 综合门户 18 138 1080 186852 200
Internet traffic statistics
Requirement 1: sum per phone number
For each phone number, compute the sums of the upstream flow, downstream flow, total upstream flow, and total downstream flow fields.
Step 1: define FlowBean, the custom object used as the map output value
import org.apache.hadoop.io.Writable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
public class FlowBean implements Writable {
private Integer upFlow;
private Integer downFlow;
private Integer upCountFlow;
private Integer downCountFlow;
@Override
public String toString() {
// Tab-separated so that this job's output can be re-parsed with split("\t") by the sort job in Requirement 2.
return upFlow + "\t" + downFlow + "\t" + upCountFlow + "\t" + downCountFlow;
}
public Integer getUpFlow() {
return upFlow;
}
public void setUpFlow(Integer upFlow) {
this.upFlow = upFlow;
}
public Integer getDownFlow() {
return downFlow;
}
public void setDownFlow(Integer downFlow) {
this.downFlow = downFlow;
}
public Integer getUpCountFlow() {
return upCountFlow;
}
public void setUpCountFlow(Integer upCountFlow) {
this.upCountFlow = upCountFlow;
}
public Integer getDownCountFlow() {
return downCountFlow;
}
public void setDownCountFlow(Integer downCountFlow) {
this.downCountFlow = downCountFlow;
}
@Override
public void write(DataOutput output) throws IOException {
output.writeInt(upFlow);
output.writeInt(downFlow);
output.writeInt(upCountFlow);
output.writeInt(downCountFlow);
}
@Override
public void readFields(DataInput input) throws IOException {
// Fields must be read back in exactly the same order they were written.
this.upFlow = input.readInt();
this.downFlow = input.readInt();
this.upCountFlow = input.readInt();
this.downCountFlow = input.readInt();
}
}
Step 2: write the custom map logic
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class FlowMapper extends Mapper<LongWritable, Text, Text, FlowBean> {
private FlowBean flowBean = new FlowBean();
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
// split[1] is the phone number; split[6]..split[9] are the upstream flow, downstream flow, total upstream flow and total downstream flow columns.
String[] split = value.toString().split("\t");
flowBean.setUpFlow(Integer.parseInt(split[6]));
flowBean.setDownFlow(Integer.parseInt(split[7]));
flowBean.setUpCountFlow(Integer.parseInt(split[8]));
flowBean.setDownCountFlow(Integer.parseInt(split[9]));
context.write(new Text(split[1]), flowBean);
}
}
Step 3: write the custom reducer logic
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class FlowReducer extends Reducer<Text, FlowBean, Text, FlowBean> {
private FlowBean flowBean = new FlowBean();
@Override
protected void reduce(Text key, Iterable<FlowBean> values, Context context) throws IOException, InterruptedException {
Integer upFlow = 0;
Integer downFlow = 0;
Integer upCountFlow = 0;
Integer downCountFlow = 0;
for (FlowBean value : values) {
upFlow += value.getUpFlow();
downFlow += value.getDownFlow();
upCountFlow += value.getUpCountFlow();
downCountFlow += value.getDownCountFlow();
}
flowBean.setUpFlow(upFlow);
flowBean.setDownFlow(downFlow);
flowBean.setUpCountFlow(upCountFlow);
flowBean.setDownCountFlow(downCountFlow);
context.write(key, flowBean);
}
}
Step 4: the main program
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class FlowJobMain extends Configured implements Tool {
@Override
public int run(String[] strings) throws Exception {
Job job = Job.getInstance(super.getConf(), "job");
job.setJarByClass(FlowBean.class);
job.setInputFormatClass(TextInputFormat.class);
TextInputFormat.addInputPath(job, new Path("file:///D:\\流量统计\\input"));
job.setMapperClass(FlowMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(FlowBean.class);
job.setReducerClass(FlowReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(FlowBean.class);
job.setOutputFormatClass(TextOutputFormat.class);
TextOutputFormat.setOutputPath(job, new Path("file:///D:\\流量统计\\output"));
boolean b = job.waitForCompletion(true);
return b ? 0 : 1;
}
public static void main(String[] args) throws Exception {
Configuration configuration = new Configuration();
Tool tool = new FlowJobMain();
int run = ToolRunner.run(configuration, tool, args);
System.exit(run);
}
}
Requirement 2: sort by upstream flow in descending order
Use the output of Requirement 1 as the input of the sort job.
Step 1: make FlowBean implement WritableComparable so it can be compared and sorted
o1.compareTo(o2): a positive return value places o1 after o2; a negative return value places o1 before o2.
import org.apache.hadoop.io.WritableComparable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
public class FlowBean implements WritableComparable<FlowBean> {
private Integer upFlow;
private Integer downFlow;
private Integer upCountFlow;
private Integer downCountFlow;
@Override
public String toString() {
return upFlow + "-" + downFlow + "-" + upCountFlow + "-" + downCountFlow;
}
public Integer getUpFlow() {
return upFlow;
}
public void setUpFlow(Integer upFlow) {
this.upFlow = upFlow;
}
public Integer getDownFlow() {
return downFlow;
}
public void setDownFlow(Integer downFlow) {
this.downFlow = downFlow;
}
public Integer getUpCountFlow() {
return upCountFlow;
}
public void setUpCountFlow(Integer upCountFlow) {
this.upCountFlow = upCountFlow;
}
public Integer getDownCountFlow() {
return downCountFlow;
}
public void setDownCountFlow(Integer downCountFlow) {
this.downCountFlow = downCountFlow;
}
@Override
public int compareTo(FlowBean o) {
// Descending order by total upstream flow: reverse the natural comparison.
return o.upCountFlow.compareTo(this.upCountFlow);
}
@Override
public void write(DataOutput output) throws IOException {
output.writeInt(upFlow);
output.writeInt(downFlow);
output.writeInt(upCountFlow);
output.writeInt(downCountFlow);
}
@Override
public void readFields(DataInput input) throws IOException {
// Fields must be read back in exactly the same order they were written.
this.upFlow = input.readInt();
this.downFlow = input.readInt();
this.upCountFlow = input.readInt();
this.downCountFlow = input.readInt();
}
}
Step 2: write the custom map logic
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class FlowSortMapper extends Mapper<LongWritable, Text, FlowBean, Text> {
private FlowBean flowBean = new FlowBean();
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
// Each input line is Requirement 1's output: phone \t upFlow \t downFlow \t upCountFlow \t downCountFlow
String[] split = value.toString().split("\t");
flowBean.setUpFlow(Integer.parseInt(split[1]));
flowBean.setDownFlow(Integer.parseInt(split[2]));
flowBean.setUpCountFlow(Integer.parseInt(split[3]));
flowBean.setDownCountFlow(Integer.parseInt(split[4]));
context.write(flowBean, new Text(split[0]));
}
}
Step 3: write the custom reducer logic
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class FlowSortReducer extends Reducer<FlowBean, Text, Text, FlowBean> {
@Override
protected void reduce(FlowBean key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
for (Text value : values) {
context.write(value, key);
}
}
}
Step 4: the main method that runs the job
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class FlowSortJobMain extends Configured implements Tool {
@Override
public int run(String[] strings) throws Exception {
Job job = Job.getInstance(super.getConf(), "job");
job.setJarByClass(FlowSortJobMain.class);
job.setInputFormatClass(TextInputFormat.class);
TextInputFormat.addInputPath(job, new Path("file:///D:\\流量统计\\inputsort"));
job.setMapperClass(FlowSortMapper.class);
job.setMapOutputKeyClass(FlowBean.class);
job.setMapOutputValueClass(Text.class);
job.setReducerClass(FlowSortReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(FlowBean.class);
job.setOutputFormatClass(TextOutputFormat.class);
TextOutputFormat.setOutputPath(job, new Path("file:///D:\\流量统计\\outputsort"));
boolean b = job.waitForCompletion(true);
return b ? 0 : 1;
}
public static void main(String[] args) throws Exception {
Configuration configuration = new Configuration();
Tool tool = new FlowSortJobMain();
int run = ToolRunner.run(configuration, tool, args);
System.exit(run);
}
}
Requirement 3: partition by phone number
Building on Requirement 1, partition the output by phone number, using its first three digits.
Custom partitioning rule:
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
public class FlowPartition extends Partitioner<Text, FlowBean> {
@Override
public int getPartition(Text text, FlowBean flowBean, int i) {
String line = text.toString();
if (line.startsWith("135")) {
return 0;
} else if (line.startsWith("136")) {
return 1;
} else if (line.startsWith("137")) {
return 2;
} else if (line.startsWith("138")) {
return 3;
} else if (line.startsWith("139")) {
return 4;
} else {
return 5;
}
}
}
In the job's main class, add the partitioner setting and the number of reduce tasks:
job.setPartitionerClass(FlowPartition.class);
job.setNumReduceTasks(6);
Change the input and output paths, then package the job into a jar and run it on the cluster:
TextInputFormat.addInputPath(job, new Path("hdfs://node01:8020/input"));
TextOutputFormat.setOutputPath(job, new Path("hdfs://node01:8020/output"));
6. How MapTask works and map-task parallelism
The MapTask execution flow:
- TextInputFormat reads the data.
- The map logic is invoked; by default one input split (that is, one block) corresponds to one map task.
- The map output is written into a circular in-memory buffer; the default buffer size is 100 MB, and the buffer behaves like an array.
- As data keeps flowing into the buffer, it is partitioned, sorted, combined and grouped there.
- When the buffer reaches 80% of its capacity (80 MB), a spill thread starts and writes those 80 MB of in-memory data to disk.
- When the map task finishes there may be many small spill files on disk, each already locally sorted, partitioned and combined; these small files are then merged into one large file.
- The merged file waits for the reduce phase to pull it.
Basic MapTask configuration settings (mapred-site.xml); they can also be overridden per job, as sketched below:
Setting | Property | Default |
---|---|---|
Size of the circular in-memory sort buffer | mapreduce.task.io.sort.mb | 100 MB |
Spill threshold (fraction of the buffer) | mapreduce.map.sort.spill.percent | 0.80 |
Local directory for spill files | mapreduce.cluster.local.dir | ${hadoop.tmp.dir}/mapred/local |
Maximum number of spill files merged at once | mapreduce.task.io.sort.factor | 10 |
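A hedged sketch of overriding these settings for a single job through the Configuration object, following the main-method pattern of the FlowJobMain example above (the values 200 and 0.90 are only examples, not recommendations):
public static void main(String[] args) throws Exception {
    Configuration configuration = new Configuration();
    // Enlarge the in-memory sort buffer from the 100 MB default (example value).
    configuration.set("mapreduce.task.io.sort.mb", "200");
    // Spill when the buffer is 90% full instead of the 0.80 default (example value).
    configuration.set("mapreduce.map.sort.spill.percent", "0.90");
    System.exit(ToolRunner.run(configuration, new FlowJobMain(), args));
}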
7. How ReduceTask works
- Threads are started to copy data from the map tasks, fetching the portion that belongs to each reduce task.
- The fetched data is merged; the merge may happen in memory, on disk, or in both at the same time, and grouping is performed while merging.
- The reduce logic is invoked.
- The results are written out.
Note: the number of map tasks is determined by the number of blocks (input splits); the number of reduce tasks cannot be derived automatically and must be set explicitly with job.setNumReduceTasks().
8. Compression in Hadoop
File compression has two main benefits: it saves disk space and it speeds up data transfer over the network and to and from disk.
Run bin/hadoop checknative to see which compression codecs the local Hadoop installation supports.
Java codec classes for the common compression formats:
Compression format | Java class |
---|---|
DEFLATE | org.apache.hadoop.io.compress.DefaultCodec |
gzip | org.apache.hadoop.io.compress.GzipCodec |
bzip2 | org.apache.hadoop.io.compress.BZip2Codec |
LZO | com.hadoop.compression.lzo.LzopCodec |
LZ4 | org.apache.hadoop.io.compress.Lz4Codec |
Snappy | org.apache.hadoop.io.compress.SnappyCodec |
Setting compression in code:
Enable compression for the map output:
configuration.set("mapreduce.map.output.compress","true");
configuration.set("mapreduce.map.output.compress.codec","org.apache.hadoop.io.compress.SnappyCodec");
Enable compression for the job (reduce) output:
configuration.set("mapreduce.output.fileoutputformat.compress","true");
configuration.set("mapreduce.output.fileoutputformat.compress.type","RECORD");
configuration.set("mapreduce.output.fileoutputformat.compress.codec","org.apache.hadoop.io.compress.SnappyCodec");