Hadoop: Commonly Overridden Functions and Tunable Settings

Commonly overridden functions

(The code samples are collected from around the web; refer to the official API for the exact signatures.)

I have been learning Hadoop for a short while now: I have a basic grasp of how it works and have tried writing code to solve some practical problems. This post collects the functions and settings that are commonly overridden or tuned in practice.

1. Partitioner
The core of the map-side shuffle: it decides which reducer will process each record emitted by a mapper.

public static class MyPartitionerPar extends Partitioner<Text, Text> {
    /**
     * getPartition()
     * Input : the <key, value> pair and the number of reducers (numPartitions)
     * Output: the index of the reducer this record is assigned to (result)
     */
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        int result = 0;
        if (key.toString().equals("long")) {
            result = 0 % numPartitions;
        } else if (key.toString().equals("short")) {
            result = 1 % numPartitions;
        } else if (key.toString().equals("right")) {
            result = 2 % numPartitions;
        }
        return result;
    }
}

After overriding it, register the class in the driver:

job.setPartitionerClass(MyPartitionerPar.class);
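Since getPartition() above only ever returns the indices 0 through 2, the job would normally run with a matching number of reducers. A minimal sketch of the remaining driver-side wiring, assuming job is an org.apache.hadoop.mapreduce.Job instance:

// Assumed driver context: job is an org.apache.hadoop.mapreduce.Job.
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setNumReduceTasks(3); // one reducer per partition index returned above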

2. setup
Called once before the map() calls of a Mapper. It is a good place to load header/lookup data, and it can also be used to set up data filtering.

// deptMap is a field of the Mapper, e.g. private Map<String, String> deptMap = new HashMap<String, String>();
public void setup(Context context) throws IOException, InterruptedException {
    BufferedReader in = null;
    try {
        // Get the files cached for the current job
        Path[] paths = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        String deptIdName = null;
        for (Path path : paths) {
            // Only parse the department file
            if (path.toString().contains("dept")) {
                in = new BufferedReader(new FileReader(path.toString()));
                while (null != (deptIdName = in.readLine())) {
                    // Split each line of the department file and cache it in deptMap:
                    // the key is the department id, the value is the department name
                    deptMap.put(deptIdName.split(",")[0], deptIdName.split(",")[1]);
                }
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        try {
            if (in != null) {
                in.close();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
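The setup() above can only see files that were registered for caching on the driver side. A minimal sketch of that registration, assuming the department file sits at the hypothetical HDFS path /input/dept.txt:

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

// Register the file before submitting the job; each task node then gets a local copy.
DistributedCache.addCacheFile(new Path("/input/dept.txt").toUri(), job.getConfiguration());

On newer Hadoop versions the same thing can be done with job.addCacheFile(uri).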

3. run
Called by the framework when the task runs; the default implementation calls setup(), then map() for every input record, then cleanup(). Overriding it is one way to customise execution, for example to run the map work with multiple threads.

public void run(Context context) throws IOException, InterruptedException {
    setup(context); // runs once; can be overridden, e.g. to read parameters from the Configuration
    while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
}
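Instead of writing a multithreaded run() by hand, Hadoop already ships a MultithreadedMapper that wraps an ordinary Mapper and drives it with a thread pool. A minimal sketch of the wiring (MyMapper and the thread count of 4 are placeholders):

import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

// Run the hypothetical MyMapper with 4 threads per map task.
job.setMapperClass(MultithreadedMapper.class);
MultithreadedMapper.setMapperClass(job, MyMapper.class);
MultithreadedMapper.setNumberOfThreads(job, 4);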

4. IntWritable.Comparator

Override the comparator used during the shuffle/sort phase to change the sort order.

private static class IntWritableDecreasingComparator extends IntWritable.Comparator {
    // Negate the default comparison to sort keys in decreasing order.
    public int compare(WritableComparable a, WritableComparable b) {
        return -super.compare(a, b);
    }

    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        return -super.compare(b1, s1, l1, b2, s2, l2);
    }
}

Then register it in the driver:

job.setNumReduceTasks(1);
job.setSortComparatorClass(IntWritableDecreasingComparator.class);

5. WritableComparable
(a custom serializable data type)
For example, a class that packs two LongWritable values:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.WritableComparable;

public class NumPair implements WritableComparable<NumPair> {
    private LongWritable line;
    private LongWritable location;

    // Constructors
    public NumPair() {
        set(new LongWritable(1), new LongWritable(0));
    }

    public NumPair(LongWritable first, LongWritable second) {
        set(first, second);
    }

    public NumPair(int first, int second) {
        set(new LongWritable(first), new LongWritable(second));
    }

    public void set(LongWritable first, LongWritable second) {
        this.line = first;
        this.location = second;
    }

    // The five methods to override
    public void readFields(DataInput in) throws IOException {
        line.readFields(in);
        location.readFields(in);
    }

    public void write(DataOutput out) throws IOException {
        line.write(out);
        location.write(out);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof NumPair)) {
            return false;
        }
        NumPair other = (NumPair) o;
        return line.equals(other.line) && location.equals(other.location);
    }

    @Override
    public int hashCode() {
        return line.hashCode() * 13 + location.hashCode();
    }

    @Override
    public int compareTo(NumPair o) {
        // Compare by line first, then by location
        int cmp = line.compareTo(o.line);
        if (cmp != 0) {
            return cmp;
        }
        return location.compareTo(o.location);
    }
}
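If NumPair is used as the map output key, it still has to be registered in the driver. A minimal sketch, assuming a Job instance named job:

job.setMapOutputKeyClass(NumPair.class);
// The shuffle then sorts keys with NumPair.compareTo() unless a custom
// sort comparator is set via job.setSortComparatorClass(...).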

6. Compression
Compression speeds up disk I/O and network transfer and can significantly shorten job run time.
(1) Compress the output of the map phase:

conf.setBoolean("mapreduce.map.output.compress", true);

(2) Compress the output of the reduce phase (the job output):

conf.setBoolean("mapreduce.output.fileoutputformat.compress", true);
conf.setClass("mapreduce.output.fileoutputformat.compress.codec", GzipCodec.class, CompressionCodec.class);
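Equivalently, the output-side settings can be made through FileOutputFormat's helper methods; a short sketch, assuming the new mapreduce API and a Job instance named job:

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);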

7. SequenceFile

How to write:

SequenceFile.Writer writer = null;
writer = SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass(), CompressionType.BLOCK);
writer.append(key, value);

How to read:

SequenceFile.Reader reader = null;
reader = new SequenceFile.Reader(fs, path, conf);
Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
reader.next(key, value);
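Putting the fragments together, a minimal end-to-end sketch; the path /tmp/seqfile-demo and the sample records are made up for illustration:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;

public class SequenceFileDemo {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/seqfile-demo"); // hypothetical path

        // Write a few key/value pairs, block-compressed
        SequenceFile.Writer writer = null;
        try {
            writer = SequenceFile.createWriter(fs, conf, path,
                    IntWritable.class, Text.class, CompressionType.BLOCK);
            for (int i = 0; i < 3; i++) {
                writer.append(new IntWritable(i), new Text("record-" + i));
            }
        } finally {
            IOUtils.closeStream(writer);
        }

        // Read them back
        SequenceFile.Reader reader = null;
        try {
            reader = new SequenceFile.Reader(fs, path, conf);
            IntWritable key = new IntWritable();
            Text value = new Text();
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
            }
        } finally {
            IOUtils.closeStream(reader);
        }
    }
}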

Advantages:

  • Supports compression, configurable per record or per block (block-level compression generally performs better).
  • Good data locality: the files are splittable, so MapReduce tasks usually end up reading data that is local to them.
  • Low effort: it is an API provided by the Hadoop framework, so the changes on the business-logic side are small.

8. MapFile

A MapFile is a sorted SequenceFile with an index.

It is stored as a directory containing two files: a data file holding the sorted key/value records, and an index file that records keys and their offsets into the data file.

How to write:

MapFile.Writer writer = null;
writer = new MapFile.Writer(conf, fs, uri, key.getClass(), value.getClass(), CompressionType.BLOCK);
writer.append(key, value);

How to read:

MapFile.Reader reader = null;
reader = new MapFile.Reader(fs, uri, conf);
WritableComparable key = (WritableComparable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);

while (reader.next(key, value)) {
    System.out.printf("%s\n", value);
}
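Because the index is loaded into memory, a MapFile also supports random lookup by key. A minimal sketch, assuming Text keys and values (the key "some-key" is made up):

// Look up a single record by key; get() returns null if the key is absent.
Text val = new Text();
reader.get(new Text("some-key"), val);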

9. RecordReader

Custom RecordReader:

The RecordReader defines how input data is read; every record handed to the map function is produced by it.

The default RecordReader is LineRecordReader, used for example by TextInputFormat; SequenceFileInputFormat's RecordReader is SequenceFileRecordReader.

To override it, define your own RecordReader together with a FileInputFormat that returns it (a fuller sketch follows below):

public class MyRecordReader extends RecordReader<LongWritable, Text> { ... }

public class MyFileInputFormat extends FileInputFormat<LongWritable, Text> { ... }

Finally, register it in the driver:

job.setInputFormatClass(MyFileInputFormat.class);
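A skeleton of what the two classes need to provide, assuming line-oriented records with LongWritable/Text keys and values; this is a hypothetical example that simply delegates to LineRecordReader, while a real implementation would parse its own format inside nextKeyValue():

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// A RecordReader that delegates to LineRecordReader.
public class MyRecordReader extends RecordReader<LongWritable, Text> {
    private final LineRecordReader lineReader = new LineRecordReader();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        lineReader.initialize(split, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        return lineReader.nextKeyValue();
    }

    @Override
    public LongWritable getCurrentKey() throws IOException, InterruptedException {
        return lineReader.getCurrentKey();
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return lineReader.getCurrentValue();
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return lineReader.getProgress();
    }

    @Override
    public void close() throws IOException {
        lineReader.close();
    }
}

// The InputFormat only has to hand out the custom RecordReader.
class MyFileInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new MyRecordReader();
    }
}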