Hadoop custom input format (InputFormat)

  1. In my experience, you cannot understand deeply how a MapReduce program runs until you have implemented an input format yourself. The goal here: use a custom input format to count word occurrences in a local text file. Sounds familiar, right?
  2. The first step is to extend the abstract class InputFormat. It has two methods to implement: one returns the split information, the other returns the record reader (RecordReader). Here is the source:
public abstract class InputFormat<K, V> {

  public abstract List<InputSplit> getSplits(JobContext context)
      throws IOException, InterruptedException;

  public abstract RecordReader<K, V> createRecordReader(InputSplit split,
      TaskAttemptContext context) throws IOException, InterruptedException;

}

Next, the custom input format class:


public class TextInputFormat extends InputFormat<IntWritable, Text> {
    // Note: this name shadows Hadoop's built-in TextInputFormat; make sure you import this class in the driver.

    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException, InterruptedException {
        // For simplicity, hard-code two splits: records 1-3 form one split, records 4-6 the other.
        ArrayList<InputSplit> splits = new ArrayList<>();
        splits.add(new TestSplit(1, 3));
        splits.add(new TestSplit(4, 6));
        return splits;
    }

    @Override
    public RecordReader<IntWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        return new TestReader();
    }

}

3. First, the split information. The return type of getSplits() is a List of InputSplit, so a custom split should be a subclass of InputSplit. We extend the abstract class InputSplit, which has two methods to implement. getLength() returns the size of the split; the framework later sorts splits by this size. getLocations() returns the names of the nodes the split's data comes from; if it returns null, an exception is thrown, so return an arbitrary array of host-name strings. Here is the code:


public class TestSplit extends InputSplit implements Writable {
    private int start = 0;
    private int end = 0;

    // A split must have a no-argument constructor, or deserialization fails at runtime.
    public TestSplit() {
    }

    public TestSplit(int start, int end) {
        this.start = start;
        this.end = end;
    }

    // Return the split size (number of records here).
    @Override
    public long getLength() throws IOException, InterruptedException {
        return end - start + 1;
    }

    // Return the location hints (host names); must not return null.
    @Override
    public String[] getLocations() throws IOException, InterruptedException {
        return new String[]{"hadoop1", "hadoop2"};
    }

    // Writable serialization: write and read the fields in the same order.
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(start);
        out.writeInt(end);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.start = in.readInt();
        this.end = in.readInt();
    }

    public int getStart() {
        return start;
    }

    public void setStart(int start) {
        this.start = start;
    }

    public int getEnd() {
        return end;
    }

    public void setEnd(int end) {
        this.end = end;
    }
}
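The write/readFields pair is what lets Hadoop ship the split from the client to the task node. A minimal sketch of that round trip, using plain java.io streams in place of Hadoop's serialization machinery (no Hadoop classes needed, so it runs standalone):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class SplitRoundTrip {
    public static void main(String[] args) throws IOException {
        int start = 1, end = 3;

        // "write": serialize the two fields in a fixed order, as TestSplit.write does
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(start);
        out.writeInt(end);

        // "readFields": read them back in exactly the same order
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
        int start2 = in.readInt();
        int end2 = in.readInt();

        System.out.println(start2 + " " + end2);  // prints "1 3"
    }
}
```

If write and readFields disagree on field order, the deserialized split is silently corrupted, which is why the two methods must mirror each other.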

4. Now the RecordReader class, which breaks a split into key/value records. Here the key is the line number and the value is the line's content. The core is nextKeyValue(): it returns true as long as there are records left to split off. First, though, we need a way to get the records: I wrote a utility class that reads a local file and stores each line in a map keyed by line number. Here is the code:


public class TestReader extends RecordReader<IntWritable, Text> {
    private TestSplit testSplit;
    private int start;                            // first record of the split
    private int end;                              // last record of the split
    private IntWritable key = new IntWritable();  // key to return
    private Text value = new Text();              // value to return
    private Map<Integer, String> map = new HashMap<>();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        testSplit = (TestSplit) split;
        start = testSplit.getStart();
        end = testSplit.getEnd();
        map = TestText.getText();  // load the file contents
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (start <= end) {
            key.set(start);            // set the key and value for the current record
            value.set(map.get(start));
            start++;
            return true;
        }
        return false;
    }

    @Override
    public IntWritable getCurrentKey() throws IOException, InterruptedException {
        return key;
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return value;
    }

    // The last two methods can be left as no-ops for this example.
    @Override
    public float getProgress() throws IOException, InterruptedException {
        return 0;
    }

    @Override
    public void close() throws IOException {
    }

}
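To see how the framework consumes this reader, here is a stripped-down sketch of the driving loop it runs per split: initialize, then nextKeyValue/getCurrentKey/getCurrentValue until exhausted. The Hadoop types are replaced by plain Java so it runs standalone, and the line contents are made-up sample data:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReaderLoop {
    // Plain-Java stand-in for TestReader over a split covering records start..end.
    static class FakeReader {
        int start, end;
        Map<Integer, String> map;
        int key;        // stands in for getCurrentKey()
        String value;   // stands in for getCurrentValue()

        FakeReader(int start, int end, Map<Integer, String> map) {  // initialize()
            this.start = start;
            this.end = end;
            this.map = map;
        }

        boolean nextKeyValue() {  // same logic as TestReader.nextKeyValue()
            if (start <= end) {
                key = start;
                value = map.get(start);
                start++;
                return true;
            }
            return false;
        }
    }

    public static void main(String[] args) {
        Map<Integer, String> lines = new HashMap<>();
        lines.put(1, "hello world");
        lines.put(2, "hello hadoop");
        lines.put(3, "world");

        FakeReader reader = new FakeReader(1, 3, lines);
        List<String> records = new ArrayList<>();
        // the loop the MapReduce framework runs around a RecordReader,
        // feeding each key/value pair to one map() call
        while (reader.nextKeyValue()) {
            records.add(reader.key + ":" + reader.value);
        }
        System.out.println(records);  // prints [1:hello world, 2:hello hadoop, 3:world]
    }
}
```

Each pair produced by this loop becomes one invocation of the mapper's map() method.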
 The utility class:

public class TestText {

    public static Map<Integer, String> getText() {
        Map<Integer, String> map = new HashMap<>();
        File file = new File("D:/Test.txt");
        BufferedReader reader = null;
        try {
            reader = new BufferedReader(new FileReader(file));
            String tempString;
            int line = 1;
            while ((tempString = reader.readLine()) != null) {
                map.put(line, tempString);
                line++;
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (reader != null) {
                try {
                    reader.close();
                } catch (IOException e1) {
                    // ignore
                }
            }
        }
        // return after the finally block; a return inside finally would swallow exceptions
        return map;
    }

}
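As an aside, the same utility can be written more concisely with java.nio, which closes the file automatically. This is a sketch, not the project's actual code; the hard-coded D:/Test.txt is replaced by a path parameter, demonstrated with a temporary file:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class TestTextNio {
    // Same behavior as TestText.getText(): read each line of a text file
    // into a map keyed by 1-based line number.
    public static Map<Integer, String> getText(Path path) throws IOException {
        Map<Integer, String> map = new HashMap<>();
        int lineNo = 1;
        for (String line : Files.readAllLines(path)) {
            map.put(lineNo++, line);
        }
        return map;
    }

    public static void main(String[] args) throws IOException {
        // demo with a temporary file instead of a hard-coded path
        Path tmp = Files.createTempFile("test", ".txt");
        Files.write(tmp, Arrays.asList("hello world", "hello hadoop"));
        Map<Integer, String> map = getText(tmp);
        System.out.println(map.get(1));  // prints "hello world"
        Files.deleteIfExists(tmp);
    }
}
```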

5. The Mapper class


public class TestMapper extends Mapper<IntWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(IntWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the line on spaces and emit (word, 1) for each token.
        String line = value.toString();
        StringTokenizer st = new StringTokenizer(line, " ");
        while (st.hasMoreTokens()) {
            word.set(st.nextToken());
            context.write(word, one);
        }
    }

}
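The tokenizing step in map() can be tried in isolation. This standalone sketch mimics what the mapper emits for one input line (the line itself is made-up sample data):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizeDemo {
    public static void main(String[] args) {
        String line = "hello hadoop hello";
        StringTokenizer st = new StringTokenizer(line, " ");

        // collect the (word, 1) pairs the mapper would emit via context.write(word, one)
        List<String> emitted = new ArrayList<>();
        while (st.hasMoreTokens()) {
            emitted.add(st.nextToken() + "\t1");
        }
        System.out.println(emitted);  // prints [hello	1, hadoop	1, hello	1]
    }
}
```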

6. The Reducer class


public class TestReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the counts for each word.
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
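The reduce step is plain summation per key. After the shuffle, each word arrives with all the 1s the mappers emitted for it; that grouping can be simulated directly (the sample words are made up):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReduceDemo {
    public static void main(String[] args) {
        // shuffled input: each word mapped to the list of 1s emitted for it
        Map<String, List<Integer>> grouped = new HashMap<>();
        grouped.put("hello", Arrays.asList(1, 1));
        grouped.put("hadoop", Arrays.asList(1));

        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;  // same summation loop as TestReduce.reduce()
            }
            counts.put(e.getKey(), sum);  // stands in for context.write(key, result)
        }
        System.out.println(counts.get("hello"));  // prints 2
    }
}
```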

7. The driver class


public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job = Job.getInstance(conf, "wordcount");  // new Job(conf, ...) is deprecated
        job.setJarByClass(WordCount.class);
        job.setInputFormatClass(TextInputFormat.class);  // our custom input format
        job.setMapperClass(TestMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setReducerClass(TestReduce.class);
        FileOutputFormat.setOutputPath(job, new Path("01/"));  // output is written under the local project directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

8. GitHub project: https://github.com/iareuniqe/InputFormat
