Commonly Overridden Functions in Hadoop MapReduce
(Code samples are collected from around the web; defer to the official API docs for the exact signatures.)
I have been learning Hadoop for a short while now: I have a basic grasp of how it works and have tried writing code against some real problems. This post collects the hooks that are commonly overridden and tuned in practice in MapReduce jobs.
1. Partitioner
The map-side heart of the shuffle stage: it determines which reducer will process each key/value pair a mapper emits.
public static class MyPartitionerPar extends Partitioner<Text, Text> {
    /**
     * getPartition()
     * Inputs: the <key, value> pair and the number of reducers, numPartitions.
     * Output: the index of the reducer this pair is assigned to (result).
     */
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        int result = 0;
        if (key.toString().equals("long")) {
            result = 0 % numPartitions;
        } else if (key.toString().equals("short")) {
            result = 1 % numPartitions;
        } else if (key.toString().equals("right")) {
            result = 2 % numPartitions;
        }
        return result;
    }
}
After defining the partitioner, register it on the job in the driver:
job.setPartitionerClass(MyPartitionerPar.class);
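Since getPartition() above spreads keys over three buckets, the driver would normally also request three reducers; a minimal sketch (the Job instance job is assumed):
// With fewer than three reducers, the modulo arithmetic above folds
// several buckets onto the same reducer.
job.setNumReduceTasks(3);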
2. setup
Called once before any map() calls in the task. It is a good place to load side data such as header or lookup tables, and can also be used to pre-filter data.
// deptMap is assumed to be a field of the mapper, e.g.
// private Map<String, String> deptMap = new HashMap<String, String>();
@Override
public void setup(Context context) throws IOException, InterruptedException {
BufferedReader in = null;
try {
// Fetch the files that were cached for the current job
Path[] paths = DistributedCache.getLocalCacheFiles(context.getConfiguration());
String deptIdName = null;
for (Path path : paths) {
// Only handle the cached department file
if (path.toString().contains("dept")) {
in = new BufferedReader(new FileReader(path.toString()));
while (null != (deptIdName = in.readLine())) {
// Split each department record and cache it in deptMap,
// keyed by department id with the department name as the value
String[] fields = deptIdName.split(",");
deptMap.put(fields[0], fields[1]);
}
}
}
} catch (IOException e) {
e.printStackTrace();
} finally {
try {
if (in != null) {
in.close();
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
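For the lookup above to work, the driver must first have registered the department file with the distributed cache. A minimal sketch using the same (now deprecated) DistributedCache API; the HDFS path is hypothetical:
// Driver side: ship the department file to every task
// (new URI(...) throws URISyntaxException, so declare or catch it)
DistributedCache.addCacheFile(new URI("hdfs://namenode:9000/data/dept.txt"), job.getConfiguration());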
3. run
Called by the framework to drive the whole task: it invokes setup(), then map() for every input record, then cleanup(). Overriding it is also the hook for multithreaded mapping, as shown after the snippet below.
public void run(Context context) throws IOException, InterruptedException {
setup(context); // runs exactly once; override it for your own needs, e.g. reading parameters from the Configuration
while (context.nextKeyValue()) {
map(context.getCurrentKey(), context.getCurrentValue(), context);
}
cleanup(context);
}
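Rather than hand-rolling threads inside run(), note that Hadoop already ships org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper, which does exactly this. A sketch of wiring it up (MyMapper is a hypothetical mapper class):
// map() calls run on a pool of threads; only safe when map() shares no mutable state
job.setMapperClass(MultithreadedMapper.class);
MultithreadedMapper.setMapperClass(job, MyMapper.class);
MultithreadedMapper.setNumberOfThreads(job, 4);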
4. IntWritable.Comparator
Override the comparator used to sort keys during the shuffle to impose a different ordering. The example below negates the default comparison so that IntWritable keys come out in descending order.
private static class IntWritableDecreasingComparator extends IntWritable.Comparator {
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        return -super.compare(a, b); // negate the ascending default
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        return -super.compare(b1, s1, l1, b2, s2, l2);
    }
}
job.setNumReduceTasks(1); // one reducer yields a single, globally sorted output file
job.setSortComparatorClass(IntWritableDecreasingComparator.class);
5. WritableComparable
(a custom serializable data type)
For example, a pair type wrapping two LongWritable fields:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.WritableComparable;

public class NumPair implements WritableComparable<NumPair> {
    private LongWritable line;
    private LongWritable location;

    // ---- custom constructors ----
    public NumPair() {
        set(new LongWritable(1), new LongWritable(0));
    }
    public NumPair(LongWritable first, LongWritable second) {
        set(first, second);
    }
    public NumPair(int first, int second) {
        set(new LongWritable(first), new LongWritable(second));
    }
    public void set(LongWritable first, LongWritable second) {
        this.line = first;
        this.location = second;
    }

    // ---- the five methods to override ----
    @Override
    public void readFields(DataInput in) throws IOException {
        line.readFields(in);
        location.readFields(in);
    }
    @Override
    public void write(DataOutput out) throws IOException {
        line.write(out);
        location.write(out);
    }
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof NumPair)) {
            return false;
        }
        NumPair other = (NumPair) o;
        return line.equals(other.line) && location.equals(other.location);
    }
    @Override
    public int hashCode() {
        return line.hashCode() * 13 + location.hashCode();
    }
    @Override
    public int compareTo(NumPair o) {
        // Compare field by field so the ordering is a proper total order
        int cmp = line.compareTo(o.line);
        if (cmp != 0) {
            return cmp;
        }
        return location.compareTo(o.location);
    }
}
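With the type defined, it can be used wherever a Writable key or value class is expected; for instance, as a map output key (a sketch, assuming a Job named job):
// The no-argument constructor above is required, because Hadoop
// instantiates keys and values reflectively during deserialization.
job.setMapOutputKeyClass(NumPair.class);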
6. Data compression
Compression speeds up disk I/O and network transfer and can shorten job execution time considerably.
1) Compressing the output of the map (set on the job's Configuration, since Job itself has no setBoolean()):
conf.setBoolean("mapreduce.map.output.compress", true);
2) Compressing the output of the reduce:
conf.setBoolean("mapreduce.output.fileoutputformat.compress", true);
conf.setClass("mapreduce.output.fileoutputformat.compress.codec", GzipCodec.class, CompressionCodec.class);
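The same settings can be made through the typed helpers of the newer API, which avoids hard-coding property names; a sketch assuming a Job named job (MRJobConfig and FileOutputFormat come from org.apache.hadoop.mapreduce and its lib.output package):
Configuration conf = job.getConfiguration();
conf.setBoolean(MRJobConfig.MAP_OUTPUT_COMPRESS, true);          // map output
FileOutputFormat.setCompressOutput(job, true);                   // job output
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);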
7. SequenceFile
Writing:
SequenceFile.Writer writer = null;
writer = SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass(), CompressionType.BLOCK);
writer.append(key, value);
writer.close();
Reading:
SequenceFile.Reader reader = null;
reader = new SequenceFile.Reader(fs, path, conf);
// Instantiate key/value holders of the types recorded in the file header
Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
reader.next(key, value);
Benefits
- Supports compression, configurable per record or per block (block compression generally performs better).
- Good task locality: the file is splittable, so MapReduce tasks can run close to their data.
- Low effort: the format is provided by the Hadoop API, so changes on the business-logic side are small.
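Putting the two fragments together, a minimal self-contained sketch that writes ten records and reads them back (the path /tmp/demo.seq is hypothetical):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;

public class SeqFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/demo.seq"); // hypothetical location

        IntWritable key = new IntWritable();
        Text value = new Text();

        // Write ten block-compressed records
        SequenceFile.Writer writer = null;
        try {
            writer = SequenceFile.createWriter(fs, conf, path,
                    IntWritable.class, Text.class, CompressionType.BLOCK);
            for (int i = 0; i < 10; i++) {
                key.set(i);
                value.set("record-" + i);
                writer.append(key, value);
            }
        } finally {
            IOUtils.closeStream(writer);
        }

        // Read them back in order
        SequenceFile.Reader reader = null;
        try {
            reader = new SequenceFile.Reader(fs, path, conf);
            while (reader.next(key, value)) {
                System.out.printf("%s\t%s%n", key, value);
            }
        } finally {
            IOUtils.closeStream(reader);
        }
    }
}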
8. MapFile
A MapFile is a sorted SequenceFile with an index. On disk it is a directory holding two files:
- data: the sorted key/value records
- index: a sample of keys mapped to their offsets in the data file
Writing:
MapFile.Writer writer = null;
writer = new MapFile.Writer(conf, fs, uri, key.getClass(), value.getClass(), CompressionType.BLOCK);
writer.append(key, value); // keys must be appended in sorted order
writer.close();
Reading:
MapFile.Reader reader = null;
reader = new MapFile.Reader(fs, uri, conf);
WritableComparable key = (WritableComparable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
while (reader.next(key, value)) {
    System.out.printf("%s\n", value);
}
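The index is what sets a MapFile apart from a plain sorted SequenceFile: Reader.get() can look up a single key without a full scan. A sketch, assuming the file's keys and values are Text (the probe key is hypothetical):
// get() positions via the index and returns null when the key is absent
Text probe = new Text("some-key");
Text found = new Text();
if (reader.get(probe, found) != null) {
    System.out.println(found);
}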
9. RecordReader
Custom RecordReader:
A RecordReader defines how input data is read; the framework calls it once for every record it hands to map().
The default is LineRecordReader, used by TextInputFormat, while SequenceFileInputFormat uses SequenceFileRecordReader.
Override both of the following (a skeleton follows below):
public class MyRecordReader extends RecordReader
and
public class MyFileInputFormat extends FileInputFormat
then register the input format on the job:
job.setInputFormatClass(MyFileInputFormat.class);
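A hedged skeleton of the two classes, delegating to LineRecordReader so that it compiles and runs; the actual record-parsing logic is application-specific:
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hands records out through MyRecordReader and treats each file as unsplittable
public class MyFileInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new MyRecordReader();
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // e.g. when a logical record may span block boundaries
    }
}

// Delegates to LineRecordReader; replace the delegation with custom parsing
class MyRecordReader extends RecordReader<LongWritable, Text> {
    private final LineRecordReader delegate = new LineRecordReader();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        delegate.initialize(split, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        return delegate.nextKeyValue();
    }

    @Override
    public LongWritable getCurrentKey() {
        return delegate.getCurrentKey();
    }

    @Override
    public Text getCurrentValue() {
        return delegate.getCurrentValue();
    }

    @Override
    public float getProgress() throws IOException {
        return delegate.getProgress();
    }

    @Override
    public void close() throws IOException {
        delegate.close();
    }
}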