Input and Output Formats
Overview of Input and Output Formats
In a MapReduce job, there are two configurable components that determine how data is read and written:
InputFormat -- The InputFormat of a job is responsible for reading data from an InputSplit and generating <key, value> pairs for the Mapper.
OutputFormat -- The OutputFormat is responsible for writing the <key, value> pairs from the Reducer to an output file.
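As a minimal sketch, both components are set on the Job in the driver (the job name and paths here are illustrative, assuming a standard Hadoop 2.x driver):

```java
//Hypothetical driver fragment: configures which InputFormat reads the
//input and which OutputFormat writes the output (paths are examples)
Job job = Job.getInstance(new Configuration(), "customer-job");
job.setInputFormatClass(TextInputFormat.class);   //how input is read
job.setOutputFormatClass(TextOutputFormat.class); //how output is written
FileInputFormat.addInputPath(job, new Path("/in"));
FileOutputFormat.setOutputPath(job, new Path("/out"));
```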
The Built-in Input Formats
TextInputFormat<LongWritable, Text> -- The default InputFormat.
SequenceFileInputFormat<K,V> -- For reading data from a sequence file.
CombineFileInputFormat<K,V> -- For controlling input splits.
MultipleInputs -- For specifying multiple input paths, with a different InputFormat and Mapper for each path.
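For example, a job can read two datasets with different formats and mappers via MultipleInputs (a sketch; the paths and mapper class names are hypothetical):

```java
//Each input path is paired with its own InputFormat and Mapper;
//CustomerMapper and OrderMapper are assumed user-defined classes
MultipleInputs.addInputPath(job, new Path("/data/customers"),
    TextInputFormat.class, CustomerMapper.class);
MultipleInputs.addInputPath(job, new Path("/data/orders"),
    SequenceFileInputFormat.class, OrderMapper.class);
```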
A custom InputFormat is a great tool when working with custom data types.
The InputFormat class has two methods:
getSplits -- Determines the input splits.
createRecordReader -- Provides a RecordReader instance for iterating through an input split and generating <key, value> pairs.
To write a custom InputFormat, extend the FileInputFormat class and let it determine the input splits, allowing you to only worry about the RecordReader instance.
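A minimal sketch of such a subclass (assuming the CustomerReader shown later in this section):

```java
//FileInputFormat computes the splits; the subclass only has to
//supply the RecordReader that parses each split
public class CustomerInputFormat
        extends FileInputFormat<CustomerKey, Customer> {
    @Override
    public RecordReader<CustomerKey, Customer> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new CustomerReader();
    }
}
```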
To understand how a RecordReader works, it helps to look at the Hadoop source code of the Mapper's run method, which invokes methods on the RecordReader (via the Context):
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
}
nextKeyValue -- Is invoked to determine whether there is another <key, value> pair.
getCurrentKey and getCurrentValue -- Retrieve the current <key, value> pair.
An example of a custom RecordReader:
public class CustomerReader extends RecordReader<CustomerKey, Customer> {
    private BufferedReader in;
    private FSDataInputStream fsInput;
    private CustomerKey key = new CustomerKey();
    private Customer value = new Customer();
    private long start;
    private long end;
    private long pos;

    public void initialize(InputSplit inputSplit, TaskAttemptContext context)
            throws IOException {
        FileSplit split = (FileSplit) inputSplit;
        Configuration conf = context.getConfiguration();
        Path path = split.getPath();
        FileSystem fs = path.getFileSystem(conf);
        fsInput = fs.open(path);
        in = new BufferedReader(new InputStreamReader(fsInput));
        start = split.getStart();
        end = start + split.getLength();
        pos = start;
    }

    public boolean nextKeyValue() throws IOException {
        String line = in.readLine();
        if (line == null) {
            return false;
        }
        //Track how far into the split we have read (used by getProgress)
        pos += line.length();
        String[] words = StringUtils.split(line, ',');
        key.setCustomerId(Integer.parseInt(words[0]));
        key.setZipCode(words[4]);
        value.setFirst(words[1]);
        value.setLast(words[2]);
        value.setStreetAddress(words[3]);
        return true;
    }

    public CustomerKey getCurrentKey() {
        return key;
    }

    public Customer getCurrentValue() {
        return value;
    }

    public float getProgress() {
        return Math.min(1.0f, (pos - start) / (float) (end - start));
    }

    public void close() throws IOException {
        fsInput.close();
    }
}
Handling Records that Span Splits
Considering the CustomerReader defined previously, it does not handle records that span split boundaries. To fix this, the CustomerReader must keep track of where it is reading within the split.
public void initialize(InputSplit inputSplit, TaskAttemptContext context)
        throws IOException {
    FileSplit split = (FileSplit) inputSplit;
    Configuration conf = context.getConfiguration();
    Path path = split.getPath();
    FileSystem fs = path.getFileSystem(conf);
    fsInput = fs.open(path);
    in = new BufferedReader(new InputStreamReader(fsInput));
    //Hang on to the start and end of the split
    start = split.getStart();
    end = start + split.getLength();
    //Seek to the start of the split
    fsInput.seek(start);
    //If we are in the middle of the file, skip the first (partial)
    //line since it is the tail of a record from the previous split;
    //an org.apache.hadoop.util.LineReader is used for the skip
    if (start != 0) {
        start += new LineReader(fsInput).readLine(new Text(), 0,
            (int) Math.min(Integer.MAX_VALUE, end - start));
    }
    currentPos = start;
}
public boolean nextKeyValue() throws IOException {
    //Here "in" is assumed to be a LineReader and "line" a reusable Text
    //Stop once we have read one line beyond the end of the split
    if (currentPos > end) {
        return false;
    }
    currentPos += in.readLine(line);
    if (line.getLength() == 0) {
        return false;
    }
    //...remainder of nextKeyValue method
}
Processing Many Small Files
The CombineFileInputFormat packs many small files into each input split, avoiding the overhead of one map task per file:
public class CustomerCombineFileInputFormat
        extends CombineFileInputFormat<CustomerKey, Customer> {
    public RecordReader<CustomerKey, Customer> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException {
        return new CombineFileRecordReader<CustomerKey, Customer>(
            (CombineFileSplit) split, context, CustomerCombineReader.class);
    }
}
public class CustomerCombineReader extends RecordReader<CustomerKey, Customer> {
    private int index;
    private CustomerReader in;

    public CustomerCombineReader(CombineFileSplit split,
            TaskAttemptContext context, Integer index) throws IOException {
        this.index = index;
        in = new CustomerReader();
    }

    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException {
        //Turn the current piece of the combined split into a FileSplit
        //and delegate to the CustomerReader defined earlier
        CombineFileSplit cfsplit = (CombineFileSplit) split;
        FileSplit fileSplit = new FileSplit(cfsplit.getPath(index),
            cfsplit.getOffset(index), cfsplit.getLength(index),
            cfsplit.getLocations());
        in.initialize(fileSplit, context);
    }

    public boolean nextKeyValue() throws IOException {
        return in.nextKeyValue();
    }

    public CustomerKey getCurrentKey() {
        return in.getCurrentKey();
    }

    public Customer getCurrentValue() {
        return in.getCurrentValue();
    }

    public float getProgress() {
        return in.getProgress();
    }

    public void close() throws IOException {
        in.close();
    }
}
To make this combine file input work, the size of the input splits must be specified when the job is run. There are two options:
1. Use the setMaxSplitSize, setMinSplitSizeNode and setMinSplitSizeRack methods of the CombineFileInputFormat class to specify a split size range.
2. Set the mapreduce.input.fileinputformat.split.maxsize, mapreduce.input.fileinputformat.split.minsize.per.node and mapreduce.input.fileinputformat.split.minsize.per.rack properties to the desired input split size.
public class CustomerCombineFileInputFormat
        extends CombineFileInputFormat<CustomerKey, Customer> {
    public CustomerCombineFileInputFormat() {
        setMaxSplitSize(67108864); //64 MB
    }
}
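The second option can be sketched in the driver by setting the property directly (the value shown mirrors the 64 MB example above):

```java
//Equivalent split-size configuration via a job property (option 2)
Configuration conf = job.getConfiguration();
conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 67108864L);
```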
The Built-in Output Formats
FileOutputFormat<K,V> -- The abstract parent class of OutputFormats that write to a file.
TextOutputFormat<K,V> -- For writing text; this is the default OutputFormat of a MapReduce job.
SequenceFileOutputFormat<K,V> -- For generating sequence files.
MultipleOutputs<K,V> -- For sending output to multiple destinations.
NullOutputFormat<K,V> -- Sends all output to /dev/null, which essentially means no output is generated.
LazyOutputFormat<K,V> -- The output file does not get created until a call to write. Useful if write may never be called and users do not want an empty file generated.
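Enabling lazy output in the driver is a one-line wrap of the real OutputFormat; a sketch:

```java
//LazyOutputFormat wraps another OutputFormat; output files are
//only created on the first call to write
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
```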
The steps for writing a custom OutputFormat look similar to those for a custom InputFormat:
1. Write a class that extends OutputFormat, which is typically accomplished by extending FileOutputFormat if writing output to files.
2. Implement the getRecordWriter method, which needs to return a RecordWriter instance.
3. Write a class that extends RecordWriter and define the write method.
4. The write method is invoked for each <key, value> pair.
public class CustomerOutputFormat extends FileOutputFormat<CustomerKey, Customer> {
    @Override
    public RecordWriter<CustomerKey, Customer> getRecordWriter(
            TaskAttemptContext context) throws IOException, InterruptedException {
        //Create a file inside the job's output directory to write to
        Path outputDir = FileOutputFormat.getOutputPath(context);
        Path file = new Path(outputDir, "Customers_" +
            context.getTaskAttemptID().getTaskID());
        FileSystem fs = file.getFileSystem(context.getConfiguration());
        FSDataOutputStream fileOut = fs.create(file);
        //Return the RecordWriter
        return new CustomerRecordWriter(fileOut);
    }
}
public class CustomerRecordWriter extends RecordWriter<CustomerKey, Customer> {
    private PrintWriter out;

    public CustomerRecordWriter(DataOutputStream fileOut) {
        out = new PrintWriter(fileOut);
    }

    public void write(CustomerKey key, Customer value) {
        //Use the String "\t" so the customer id is concatenated with a
        //tab rather than added to the char '\t' numerically
        out.println(key.getCustomerId() + "\t" + key.getZipCode());
    }

    public void close(TaskAttemptContext context) {
        out.close();
    }
}
The MultipleOutputs Class
public class CustomerReducer extends Reducer<CustomerKey, Customer, Text, Text> {
    private MultipleOutputs<Text, Text> outs;
    private Text outputKey = new Text();
    private Text outputValue = new Text();

    @Override
    protected void setup(Context context)
            throws IOException, InterruptedException {
        outs = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void reduce(CustomerKey key, Iterable<Customer> values,
            Context context) throws IOException, InterruptedException {
        //Iterate with for-each rather than requesting a new iterator
        //on every loop test
        for (Customer e : values) {
            outputKey.set(String.valueOf(e.getCustomerId()));
            outputValue.set(e.getLast());
            outs.write("lastnames", outputKey, outputValue, "outputPath1");
            outputValue.set(e.getStreetAddress());
            outs.write("addresses", outputKey, outputValue, "outputPath2");
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        //MultipleOutputs must be closed or output may be lost
        outs.close();
    }
}
The "lastnames" output will contain the last name of each customer.
The "addresses" output will contain the street address of each customer.
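For the named outputs above to work, each name must be registered in the job driver before the job runs; a sketch:

```java
//Each named output used by MultipleOutputs.write must be declared
//with its OutputFormat and key/value types
MultipleOutputs.addNamedOutput(job, "lastnames",
    TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, "addresses",
    TextOutputFormat.class, Text.class, Text.class);
```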