Hadoop: Commonly Overridden Functions and Tunable Settings

Commonly overridden functions

(The code samples are collected from around the web; refer to the official API for the exact signatures.)

I have been learning Hadoop for a short while now: I have a basic grasp of how it works and have tried writing code to solve some practical problems. This post collects the functions and settings that are commonly overridden or tuned in practice.

1. Partitioner
The core of the map-side shuffle: it decides which reducer will process each record emitted by a mapper.

public static class MyPartitionerPar extends Partitioner<Text, Text> {
    /**
     * getPartition()
     * Input : the <key, value> pair and the number of reducers (numPartitions)
     * Output: the index of the reducer this record is assigned to (result)
     */
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        int result = 0;
        if (key.toString().equals("long")) {
            result = 0 % numPartitions;
        } else if (key.toString().equals("short")) {
            result = 1 % numPartitions;
        } else if (key.toString().equals("right")) {
            result = 2 % numPartitions;
        }
        return result;
    }
}

After overriding it, register the class in the driver:

job.setPartitionerClass(MyPartitionerPar.class);
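Since getPartition() above only ever returns the indices 0 through 2, the job would normally run with a matching number of reducers. A minimal sketch of the remaining driver-side wiring, assuming job is an org.apache.hadoop.mapreduce.Job instance:

// Assumed driver context: job is an org.apache.hadoop.mapreduce.Job.
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setNumReduceTasks(3); // one reducer per partition index returned above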

2. setup
Called once before the map() calls of a Mapper. It is a good place to load header/lookup data, and it can also be used to set up data filtering.

// deptMap is a field of the Mapper, e.g. private Map<String, String> deptMap = new HashMap<String, String>();
public void setup(Context context) throws IOException, InterruptedException {
    BufferedReader in = null;
    try {
        // Get the files cached for the current job
        Path[] paths = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        String deptIdName = null;
        for (Path path : paths) {
            // Only parse the department file
            if (path.toString().contains("dept")) {
                in = new BufferedReader(new FileReader(path.toString()));
                while (null != (deptIdName = in.readLine())) {
                    // Split each line of the department file and cache it in deptMap:
                    // the key is the department id, the value is the department name
                    deptMap.put(deptIdName.split(",")[0], deptIdName.split(",")[1]);
                }
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        try {
            if (in != null) {
                in.close();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
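The setup() above can only see files that were registered for caching on the driver side. A minimal sketch of that registration, assuming the department file sits at the hypothetical HDFS path /input/dept.txt:

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

// Register the file before submitting the job; each task node then gets a local copy.
DistributedCache.addCacheFile(new Path("/input/dept.txt").toUri(), job.getConfiguration());

On newer Hadoop versions the same thing can be done with job.addCacheFile(uri).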

3. run
Called by the framework when the task runs; the default implementation calls setup(), then map() for every input record, then cleanup(). Overriding it is one way to customise execution, for example to run the map work with multiple threads.

public void run(Context context) throws IOException, InterruptedException {
    setup(context); // runs once; can be overridden, e.g. to read parameters from the Configuration
    while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
}
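Instead of writing a multithreaded run() by hand, Hadoop already ships a MultithreadedMapper that wraps an ordinary Mapper and drives it with a thread pool. A minimal sketch of the wiring (MyMapper and the thread count of 4 are placeholders):

import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

// Run the hypothetical MyMapper with 4 threads per map task.
job.setMapperClass(MultithreadedMapper.class);
MultithreadedMapper.setMapperClass(job, MyMapper.class);
MultithreadedMapper.setNumberOfThreads(job, 4);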

4. IntWritable.Comparator

Override the comparator used during the shuffle/sort phase to change the sort order.

private static class IntWritableDecreasingComparator extends IntWritable.Comparator {
    // Negate the default comparison to sort keys in decreasing order.
    public int compare(WritableComparable a, WritableComparable b) {
        return -super.compare(a, b);
    }

    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        return -super.compare(b1, s1, l1, b2, s2, l2);
    }
}

Then register it in the driver:

job.setNumReduceTasks(1);
job.setSortComparatorClass(IntWritableDecreasingComparator.class);

5. WritableComparable
(a custom serializable data type)
For example, a class that packs two LongWritable values:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.WritableComparable;

public class NumPair implements WritableComparable<NumPair> {
    private LongWritable line;
    private LongWritable location;

    // Constructors
    public NumPair() {
        set(new LongWritable(1), new LongWritable(0));
    }

    public NumPair(LongWritable first, LongWritable second) {
        set(first, second);
    }

    public NumPair(int first, int second) {
        set(new LongWritable(first), new LongWritable(second));
    }

    public void set(LongWritable first, LongWritable second) {
        this.line = first;
        this.location = second;
    }

    // The five methods to override
    public void readFields(DataInput in) throws IOException {
        line.readFields(in);
        location.readFields(in);
    }

    public void write(DataOutput out) throws IOException {
        line.write(out);
        location.write(out);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof NumPair)) {
            return false;
        }
        NumPair other = (NumPair) o;
        return line.equals(other.line) && location.equals(other.location);
    }

    @Override
    public int hashCode() {
        return line.hashCode() * 13 + location.hashCode();
    }

    @Override
    public int compareTo(NumPair o) {
        // Compare by line first, then by location
        int cmp = line.compareTo(o.line);
        if (cmp != 0) {
            return cmp;
        }
        return location.compareTo(o.location);
    }
}
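If NumPair is used as the map output key, it still has to be registered in the driver. A minimal sketch, assuming a Job instance named job:

job.setMapOutputKeyClass(NumPair.class);
// The shuffle then sorts keys with NumPair.compareTo() unless a custom
// sort comparator is set via job.setSortComparatorClass(...).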

6. Compression
Compression speeds up disk I/O and network transfer and can significantly shorten job run time.
(1) Compress the output of the map phase:

conf.setBoolean("mapreduce.map.output.compress", true);

(2) Compress the output of the reduce phase (the job output):

conf.setBoolean("mapreduce.output.fileoutputformat.compress", true);
conf.setClass("mapreduce.output.fileoutputformat.compress.codec", GzipCodec.class, CompressionCodec.class);
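Equivalently, the output-side settings can be made through FileOutputFormat's helper methods; a short sketch, assuming the new mapreduce API and a Job instance named job:

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);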

7. SequenceFile

How to write:

SequenceFile.Writer writer = null;
writer = SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass(), CompressionType.BLOCK);
writer.append(key, value);

How to read:

SequenceFile.Reader reader = null;
reader = new SequenceFile.Reader(fs, path, conf);
Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
reader.next(key, value);
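Putting the fragments together, a minimal end-to-end sketch; the path /tmp/seqfile-demo and the sample records are made up for illustration:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;

public class SequenceFileDemo {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/seqfile-demo"); // hypothetical path

        // Write a few key/value pairs, block-compressed
        SequenceFile.Writer writer = null;
        try {
            writer = SequenceFile.createWriter(fs, conf, path,
                    IntWritable.class, Text.class, CompressionType.BLOCK);
            for (int i = 0; i < 3; i++) {
                writer.append(new IntWritable(i), new Text("record-" + i));
            }
        } finally {
            IOUtils.closeStream(writer);
        }

        // Read them back
        SequenceFile.Reader reader = null;
        try {
            reader = new SequenceFile.Reader(fs, path, conf);
            IntWritable key = new IntWritable();
            Text value = new Text();
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
            }
        } finally {
            IOUtils.closeStream(reader);
        }
    }
}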

Advantages:

  • Supports compression, configurable per record or per block (block-level compression generally performs better).
  • Good data locality: the files are splittable, so MapReduce tasks usually end up reading data that is local to them.
  • Low effort: it is an API provided by the Hadoop framework, so the changes on the business-logic side are small.

8. MapFile

A MapFile is a sorted SequenceFile with an index.

It is stored as a directory containing two files: a data file holding the sorted key/value records, and an index file that records keys and their offsets into the data file.

How to write:

MapFile.Writer writer = null;
writer = new MapFile.Writer(conf, fs, uri, key.getClass(), value.getClass(), CompressionType.BLOCK);
writer.append(key, value);

How to read:

MapFile.Reader reader = null;
reader = new MapFile.Reader(fs, uri, conf);
WritableComparable key = (WritableComparable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);

while (reader.next(key, value)) {
    System.out.printf("%s\n", value);
}
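Because the index is loaded into memory, a MapFile also supports random lookup by key. A minimal sketch, assuming Text keys and values (the key "some-key" is made up):

// Look up a single record by key; get() returns null if the key is absent.
Text val = new Text();
reader.get(new Text("some-key"), val);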

9. RecordReader

Custom RecordReader:

The RecordReader defines how input data is read; every record handed to the map function is produced by it.

The default RecordReader is LineRecordReader, used for example by TextInputFormat; SequenceFileInputFormat's RecordReader is SequenceFileRecordReader.

To override it, define your own RecordReader together with a FileInputFormat that returns it (a fuller sketch follows below):

public class MyRecordReader extends RecordReader<LongWritable, Text> { ... }

public class MyFileInputFormat extends FileInputFormat<LongWritable, Text> { ... }

Finally, register it in the driver:

job.setInputFormatClass(MyFileInputFormat.class);
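A skeleton of what the two classes need to provide, assuming line-oriented records with LongWritable/Text keys and values; this is a hypothetical example that simply delegates to LineRecordReader, while a real implementation would parse its own format inside nextKeyValue():

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// A RecordReader that delegates to LineRecordReader.
public class MyRecordReader extends RecordReader<LongWritable, Text> {
    private final LineRecordReader lineReader = new LineRecordReader();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        lineReader.initialize(split, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        return lineReader.nextKeyValue();
    }

    @Override
    public LongWritable getCurrentKey() throws IOException, InterruptedException {
        return lineReader.getCurrentKey();
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return lineReader.getCurrentValue();
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return lineReader.getProgress();
    }

    @Override
    public void close() throws IOException {
        lineReader.close();
    }
}

// The InputFormat only has to hand out the custom RecordReader.
class MyFileInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new MyRecordReader();
    }
}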