0 Why reduce also has to group:
file1 ---> map1 grouping ---> one group for Zhang San, one group for Li Si
file2 ---> map2 grouping ---> one group for Zhang San, one group for Li Si
In the map phase, file1 and file2 are grouped only inside their own map task; map1 and map2 never group across each other. Only in the reduce phase can all of the data be merged and grouped globally.
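An illustrative flow (a word-count style sketch, with each occurrence counted as 1):
file1 ---> map1 ---> <Zhang San,1> <Li Si,1>
file2 ---> map2 ---> <Zhang San,1> <Zhang San,1>
The shuffle merges the outputs of all map tasks by key, so reduce is called once per key:
reduce(Zhang San, [1,1,1]) and reduce(Li Si, [1])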
0.1
Number of map tasks ---> determined by the number of HDFS blocks of the input file.
map function: called once for each line of the input file.
Number of reduce tasks ---> determined by the partitions; the partitioning code must be implemented yourself, and by default there is a single partition (see the Partitioner sketch after this list).
For details see "hadoop partition 分区简介和自定义".
reduce function: called as many times as there are groups produced from the map output (i.e. once per distinct key).
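A minimal sketch of a custom Partitioner, assuming LongWritable map-output key/value types as in the reducer examples below; the class name, the modulo scheme and the reducer count of 3 are illustrative:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Partitioner;

public class MyPartitioner extends Partitioner<LongWritable, LongWritable> {
    @Override
    public int getPartition(LongWritable key, LongWritable value, int numPartitions) {
        // records with the same partition number go to the same reduce task
        return (int) (key.get() % numPartitions); // assumes non-negative keys
    }
}

// in the driver:
// job.setPartitionerClass(MyPartitioner.class);
// job.setNumReduceTasks(3); // default is 1, i.e. a single partition and a single reducer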
1 When writing a custom reducer in Eclipse,
either the Context parameter carries the generics:
class MyReducer2 extends Reducer<LongWritable, LongWritable, LongWritable, LongWritable> {
    protected void reduce(LongWritable k2, Iterable<LongWritable> v2s,
            org.apache.hadoop.mapreduce.Reducer<LongWritable, LongWritable, LongWritable, LongWritable>.Context context)
            throws IOException, InterruptedException {
        System.out.println("reduce2");
    }
}
or omit the generics, in which case the package path must not be given either:
class MyReducer1 extends Reducer<LongWritable, LongWritable, LongWritable, LongWritable> {
    protected void reduce(LongWritable k2, java.lang.Iterable<LongWritable> v2s, Context context)
            throws java.io.IOException, InterruptedException {
        System.out.println("reduce");
    }
}
If you give the package path but leave out the generics, reduce() is never entered. Eclipse marks this variant with a yellow warning telling you to add the generics: with the raw Reducer.Context parameter the method no longer overrides Reducer.reduce(), so the framework falls back to the inherited default reduce() and your code is skipped; adding @Override would turn the mismatch into a compile error.
class MyReducer2 extends Reducer<LongWritable, LongWritable, LongWritable, LongWritable> {
    protected void reduce(LongWritable k2, Iterable<LongWritable> v2s,
            org.apache.hadoop.mapreduce.Reducer.Context context)
            throws IOException, InterruptedException {
        System.out.println("reduce2");
    }
}
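A safe pattern, as a minimal sketch (the class name MyReducer3 is illustrative): keep the plain Context type and add @Override, so the compiler rejects any signature that does not actually override reduce():

class MyReducer3 extends Reducer<LongWritable, LongWritable, LongWritable, LongWritable> {
    @Override
    protected void reduce(LongWritable k2, Iterable<LongWritable> v2s, Context context)
            throws IOException, InterruptedException {
        // with @Override a signature mismatch becomes a compile error instead of a silently skipped reduce()
        System.out.println("reduce3");
    }
}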
2 When defining a custom key in the map stage (usually an entity class), if the class has a field of type String, the way that field is written to and read from the stream differs from fields of type long, int, etc. Details below:
public static class MyUser implements Writable, DBWritable {
    int id;
    String name;

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(id);
        Text.writeString(out, name); // serialize the String field with org.apache.hadoop.io.Text
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.id = in.readInt();
        this.name = Text.readString(in); // deserialize the String field with org.apache.hadoop.io.Text
    }

    // the DBWritable methods write(PreparedStatement) / readFields(ResultSet) are omitted here
}
Otherwise deserialization fails with a stack trace containing:
java.io.DataInputStream.readFully(Unknown Source)