Hadoop Map/Reduce InputFormat基础

最新推荐文章于 2021-06-05 14:18:19 发布

jokes000

最新推荐文章于 2021-06-05 14:18:19 发布

阅读量3.6k

点赞数

分类专栏： Hadoop学习文章标签： hadoop mapreduce url interface input class

Hadoop学习专栏收录该内容

19 篇文章 0 订阅

订阅专栏

有时候你可能想要用不同的方法从input data中读取数据。那么你就需要创建一个自己的InputFormat类。

InputFormat是一个只有两个函数的接口。

public interface InputFormat<K, V> {
    InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
    RecordReader<K, V> getRecordReader(InputSplit split,
                                       JobConf job,
                                       Reporter reporter) throws IOException;
}

getSplits()：标记所有的输入数据，然后将他们切分为小的输入数据块，每个Map任务处理一个数据块；

getRecordReader()：提供一个RecordReader来从给定的数据块中迭代处理数据，然后将数据处理为<key,value>格式。

由于没有人愿意关心怎样将数据块分为小的数据块，你应该继承FileInputFormat类，它用来处理数据的分块。

大部分已知的InputFormat就是FileInputFormat的子类。

InputFormat	Description
TextInputFormat	输入文件中的每一行就是一个记录，Key是这一行的byte offset，而value是这一行的内容。 Key: LongWritable Value: Text
KeyValueTextInputFormat	输入文件中每一行就是一个记录，第一个分隔符字符切分每行。在分隔符字符之前的内容为Key，在之后的为Value。分隔符变量通过key.value.separator.in.input.line变量设置，默认为(\t)字符。 Key: Text Value: Text
SequenceFileInputFormat<K,V>	一个用来读取字符流数据的InputFormat，<Key,Value>为用户自定义的。字符流数据是Hadoop自定义的压缩的二进制数据格式。它用来优化从一个MapReduce任务的输出到另一个MapReduce任务的输入之间的数据传输过程。 Key: K(用户自定义) Value: V(用户自定义)
NLineInputFormat	与TextInputFormat一样，但每个数据块必须保证有且只有Ｎ行，mapred.line.input.format.linespermap属性，默认为１，设置为Ｎ。 Key: LongWritable value: Text

FileInputFormat实现getSplits()方法，但是仍然保留getRecordReader()方法为abstract以使其子类实现。

　FileInputFormat的getSplits()实现试着将输入数据分块大小限制在numSplits值之上，numSplits<数据块<hdfs block size

FileInputFormat有一些子类可以重载的protected函数，例如isSplitable()，它用来确定你是否可以切分一个块，默认返回为true，表示只要数据块大于hdfs block size，那么它将会被切分。但有时候你不希望切分一个文件，例如某些二进制序列文件不能被切分时，你就需要重载该函数使其返回false。

　在用FileInputFormat时，你主要的精力应该集中在数据块分解为记录，并且生成<key,value>键值对的RecordReader方法上。

public interface RecordReader<K, V> {
　　boolean next(K key, V value) throws IOException;
　　
　　K createKey();
　　V createValue();
　　
　　long getPos() throws IOException;
　　public void close() throws IOException;
　　float getProgress() throws IOException;
}

　　同样的，我们将再次使用Hadoop提供的类，而不使用自己写的RecordReader。

　　例如LineRecordReader实现RecordReader<LongWritable,Text>，用在TextInputFormat中。

　　　　KeyValueLineRecordReader用在KeyValueTextInputFormat中。

　　大多数时候，你自己写的RecordReader只是现有的RecordReader的变形实现而已。

　　实例：

　　假如有一组数据：

　　17:16:18　http://hadoop.apache.org/core/docs/r0.19.0/api/index.html
　　17:16:19　http://hadoop.apache.org/core/docs/r0.19.0/mapred_tutorial.html
　　17:16:20　http://wiki.apache.org/hadoop/GettingStartedWithHadoop
　　17:16:20　http://www.maxim.com/hotties/2008/finalist_gallery.aspx
　　17:16:25　http://wiki.apache.org/hadoop/
　　...

　　让我们创建一个TimeUrlTextInputFormat类来实现将URL视为URLWritable类型。

　　根据之前提到的知识，我们创建我们TimeUrlTextInputFormat类，继承FileInputFormat，并实现getRecordReader来返回我们自己的RecordReader。　　

public class TimeUrlTextInputFormat extends FileInputFormat<Text, URLWritable> {
	public RecordReader<Text, URLWritable> getRecordReader(InputSplit input, JobConf job, Reporter reporter)
		throws IOException {
		return new TimeUrlLineRecordReader(job, (FileSplit)input);
	}
}

　　URLWritable类：

public class URLWritable implements Writable {
	protected URL url;
	public URLWritable() { }
	public URLWritable(URL url) {
		this.url = url;
	}
	public void write(DataOutput out) throws IOException {
		out.writeUTF(url.toString());
	}
	public void readFields(DataInput in) throws IOException {
		url = new URL(in.readUTF());
	}
	public void set(String s) throws MalformedURLException {
		url = new URL(s);
	}
}

　　我们的TimeUrlRecordReader会RecordReader接口中的6个方法。

class TimeUrlLineRecordReader implements RecordReader<Text, URLWritable> {
	private KeyValueLineRecordReader lineReader;
	private Text lineKey, lineValue;

	public TimeUrlLineRecordReader(JobConf job, FileSplit split) throws IOException {
		lineReader = new KeyValueLineRecordReader(job, split);
		lineKey = lineReader.createKey();
		lineValue = lineReader.createValue();
	}

	public boolean next(Text key, URLWritable value) throws IOException {
		if (!lineReader.next(lineKey, lineValue)) {
			return false;
		}
		key.set(lineKey);
		value.set(lineValue.toString());
		return true;
	}

	public Text createKey() {
		return new Text(“”);
	}
	public URLWritable createValue() {
		return new URLWritable();
	}
	public long getPos() throws IOException {
		return lineReader.getPos();
	}
	public float getProgress() throws IOException {
		return lineReader.getProgress();
	}
	public void close() throws IOException {
		lineReader.close();
	}
}

jokes000

关注

0
点赞
踩
2

收藏

觉得还不错? 一键收藏
0
评论
Hadoop Map/Reduce InputFormat基础

有时候你可能想要用不同的方法从input data中读取数据。那么你就需要创建一个自己的InputFormat类。 InputFormat是一个只有两个函数的接口。public interface InputFormat { InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
复制链接

扫一扫

专栏目录