Custom RecordReader in Hadoop

License: CC 4.0 BY-SA

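The default LineRecordReader uses the byte offset of each line as the map input key and the content of the line as the map input value. By default, lines are delimited by carriage return and line feed.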
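To make the default behavior concrete, here is a minimal plain-Java sketch of how (byte offset, line) pairs are derived from a text buffer. It does not use Hadoop at all; the class `LineOffsetDemo` and its `Record` type are hypothetical names introduced only for this illustration, and the real LineRecordReader handles more cases (split boundaries, custom delimiters, compression).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class LineOffsetDemo {
    /** One (key, value) pair in the shape the default reader hands to map(). */
    public static final class Record {
        public final long offset;  // byte offset of the first byte of the line
        public final String line;  // line content without its terminator

        Record(long offset, String line) {
            this.offset = offset;
            this.line = line;
        }
    }

    /** Splits data on "\n", "\r" or "\r\n" and records where each line starts. */
    public static List<Record> records(byte[] data) {
        List<Record> out = new ArrayList<>();
        int lineStart = 0;
        int i = 0;
        while (i < data.length) {
            if (data[i] == '\n' || data[i] == '\r') {
                out.add(new Record(lineStart,
                        new String(data, lineStart, i - lineStart, StandardCharsets.UTF_8)));
                // Treat "\r\n" as a single terminator.
                if (data[i] == '\r' && i + 1 < data.length && data[i + 1] == '\n') {
                    i++;
                }
                i++;
                lineStart = i;
            } else {
                i++;
            }
        }
        if (lineStart < data.length) {
            // Last line without a trailing terminator.
            out.add(new Record(lineStart,
                    new String(data, lineStart, data.length - lineStart, StandardCharsets.UTF_8)));
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = "ab\ncde\r\nf".getBytes(StandardCharsets.UTF_8);
        for (Record r : records(data)) {
            System.out.println(r.offset + "\t" + r.line);
        }
    }
}
```

For the sample input the keys are 0, 3 and 8: the offsets count bytes, including the line terminators of the preceding lines.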

Here we want to change the <key,value> pair that map receives: the key should be the file path (or the file name), and the value should be the file content.

To do that we need to override both InputFormat and RecordReader, because the RecordReader is created by the InputFormat. The RecordReader is where the real work happens.

First, the InputFormat:

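import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobConfigurable;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class chDicInputFormat extends FileInputFormat<Text,Text>
	implements JobConfigurable{
	// Kept for a possible codec check; unused while isSplitable always returns false.
	private CompressionCodecFactory compressionCodecs = null;

	public void configure(JobConf conf) {
		compressionCodecs = new CompressionCodecFactory(conf);
	}

	/**
	 * Never split a file: each file must be processed as a whole.
	 *
	 * @param fs   file system that holds the file
	 * @param file path of the input file
	 * @return always false
	 */
	@Override
	protected boolean isSplitable(FileSystem fs, Path file) {
	//	CompressionCodec codec = compressionCodecs.getCodec(file);
		// One file is one split, even when the file is larger than
		// one HDFS block (64 MB by default in Hadoop 1.x).
		return false;
	}

	@Override
	public RecordReader<Text,Text> getRecordReader(InputSplit genericSplit,
							JobConf job, Reporter reporter) throws IOException{
		reporter.setStatus(genericSplit.toString());
		return new chDicRecordReader(job,(FileSplit)genericSplit);
	}

}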
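To see what returning false from isSplitable changes, the following plain-Java sketch estimates how many splits one file produces in each case. It is a simplified model and not Hadoop's actual code: the class `SplitCountSketch` and the method `countSplits` are hypothetical, and the 1.1 slop factor mirrors what I understand FileInputFormat to use when deciding whether a short tail gets its own split.

```java
public class SplitCountSketch {
    // Assumption: a tail of up to 10% of the split size is merged into the last split.
    private static final double SPLIT_SLOP = 1.1;

    /**
     * Estimates the number of splits created for a single file.
     *
     * @param fileSize  file length in bytes
     * @param splitSize target split size in bytes (often the HDFS block size)
     * @param splitable what isSplitable() returns for this file
     */
    public static int countSplits(long fileSize, long splitSize, boolean splitable) {
        if (!splitable || fileSize == 0) {
            return 1; // the whole file (even an empty one) becomes a single split
        }
        int splits = 0;
        long remaining = fileSize;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits++; // the final, possibly shorter, split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println(countSplits(200 * mb, 64 * mb, true));  // splittable file
        System.out.println(countSplits(200 * mb, 64 * mb, false)); // chDicInputFormat behavior
    }
}
```

With a 200 MB file and a 64 MB split size, the splittable case yields 4 splits while the non-splittable case yields 1, which means a single map task reads the entire file.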

Next, the RecordReader:

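import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.RecordReader;

public class chDicRecordReader implements RecordReader<Text,Text> {
	private static final Log LOG = LogFactory.getLog(chDicRecordReader.class.getName());
	private long start;
	private long pos;
	private long end;
	private byte[] buffer;
	private String keyName;
	private FSDataInputStream fileIn;

	public chDicRecordReader(Configuration job,FileSplit split) throws IOException{
		start = split.getStart(); // each file is a single split, so this is the start of the file
		end = split.getLength() + start;
		final Path path = split.getPath();
		// The whole split is buffered in memory, so it must fit in one byte array.
		if (end - start > Integer.MAX_VALUE) {
			throw new IOException("File too large to buffer in memory: " + path);
		}
		keyName = path.toString();
		LOG.info("filename in hdfs is : " + keyName);
		final FileSystem fs = path.getFileSystem(job);
		fileIn = fs.open(path);
		fileIn.seek(start);
		buffer = new byte[(int)(end - start)];
		this.pos = start;
	}

	public Text createKey() {
		return new Text();
	}

	public Text createValue() {
		return new Text();
	}

	public long getPos() throws IOException{
		return pos;
	}

	public float getProgress() {
		if (start == end) {
			return 0.0f;
		} else {
			return Math.min(1.0f, (pos - start) / (float)(end - start));
		}
	}

	public boolean next(Text key, Text value) throws IOException{
		// The whole file is returned as one record, so this succeeds at most once.
		// An empty file (start == end) produces no record at all.
		if (pos < end) {
			key.set(keyName);
			fileIn.readFully(pos,buffer); // positional read of the entire split
			value.set(buffer);            // Text copies the bytes and treats them as UTF-8
	//		LOG.info("---content: " + value.toString());
			pos += buffer.length;
			LOG.info("end is : " + end  + " pos is : " + pos);
			return true;
		}
		return false;
	}

	public void close() throws IOException{
		if(fileIn != null) {
			fileIn.close();
		}
	}

}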

With the two classes above in place, set the job's InputFormat to chDicInputFormat in the main function and the new input format takes effect.
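A minimal driver sketch for that step, using the old `org.apache.hadoop.mapred` API that the classes above are written against. The class name `WholeFileDriver`, the job name, the map-only setup with IdentityMapper, and the use of args[0] and args[1] as input and output paths are all assumptions for illustration; the only line the article actually requires is the setInputFormat call. It needs Hadoop and chDicInputFormat on the classpath.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class WholeFileDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WholeFileDriver.class);
        conf.setJobName("whole-file-input"); // hypothetical job name

        // The key step: each map() call now receives (file path, file content).
        conf.setInputFormat(chDicInputFormat.class);

        // Assumption: a map-only job that just passes the pairs through.
        conf.setMapperClass(IdentityMapper.class);
        conf.setNumReduceTasks(0);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
```

In a real job the IdentityMapper would be replaced by a mapper whose input types are Text and Text, matching the types declared by chDicInputFormat.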

This works well for jobs that need to process each document as a whole.
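The one-record-per-file behavior can be tried locally without a cluster. The sketch below mirrors the idea of a next() that succeeds once per file, using only the JDK on a temporary local file; `WholeFileReaderSketch` and `countRecords` are hypothetical names, and `Path` here is `java.nio.file.Path`, not Hadoop's.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class WholeFileReaderSketch implements AutoCloseable {
    private final RandomAccessFile in;
    private final String keyName;
    private final long end;
    private long pos = 0;
    private String key;   // path of the file, set by next()
    private byte[] value; // full content of the file, set by next()

    public WholeFileReaderSketch(Path path) throws IOException {
        this.keyName = path.toString();
        this.in = new RandomAccessFile(path.toFile(), "r");
        this.end = in.length();
    }

    /** Reads the whole file as one record; returns false once it has been consumed. */
    public boolean next() throws IOException {
        if (pos >= end) {
            return false;
        }
        byte[] buffer = new byte[(int) (end - pos)];
        in.seek(pos);
        in.readFully(buffer);
        key = keyName;
        value = buffer;
        pos = end;
        return true;
    }

    public String getKey() { return key; }

    public byte[] getValue() { return value; }

    @Override
    public void close() throws IOException {
        in.close();
    }

    /** Writes content to a temp file and counts how many records the reader yields. */
    public static int countRecords(byte[] content) {
        try {
            Path tmp = Files.createTempFile("whole-file", ".txt");
            try {
                Files.write(tmp, content);
                try (WholeFileReaderSketch reader = new WholeFileReaderSketch(tmp)) {
                    int n = 0;
                    while (reader.next()) {
                        n++;
                    }
                    return n;
                }
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countRecords("line one\nline two\n".getBytes(StandardCharsets.UTF_8)));
        System.out.println(countRecords(new byte[0]));
    }
}
```

A two-line file yields exactly one record and an empty file yields none, so a mapper fed this way is called once per non-empty file regardless of how many lines the file has.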

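OK. The next task does require splitting the input file.

The requirement is as follows: a text file stores the paths and names of a set of files, one file per line.

For example: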

/mkbootimg/mkbootimg.c
/logwrapper/logwrapper.c
/adb/log_service.c
/adb/adb_client.c
/adb/usb_windows.c
/adb/get_my_path_darwin.c
/adb/usb_osx.c
/adb/file_sync_service.c
/adb/file_sync_client.c
/adb/usb_linux.c
/adb/fdevent.c
/adb/usb_linux_client.c
/adb/commandline.c
/adb/remount_service.c
/adb/sockets.c

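Each of these files has to be processed separately, but the text file lists hundreds of thousands of such paths, so I would like the processing to be distributed.

Any suggestions are welcome.

My current idea is to control the split size, but that is not ideal. I would prefer something better, such as treating every 10000 lines as one split so that the Hadoop platform can distribute the work. I am still working on it, so pointers are appreciated!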
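As a starting point for the "every N lines is one split" idea, here is a plain-Java sketch that computes (start offset, length) ranges covering N lines each. It only shows the boundary calculation on an in-memory buffer; `LinesPerSplitSketch`, `Split` and `splitByLines` are hypothetical names, and a real InputFormat would wrap each range in a FileSplit inside getSplits.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class LinesPerSplitSketch {
    /** Byte range of one split within the path-list file. */
    public static final class Split {
        public final long start;
        public final long length;

        Split(long start, long length) {
            this.start = start;
            this.length = length;
        }
    }

    /** Groups '\n'-terminated lines into ranges of linesPerSplit lines each. */
    public static List<Split> splitByLines(byte[] data, int linesPerSplit) {
        if (linesPerSplit <= 0) {
            throw new IllegalArgumentException("linesPerSplit must be positive");
        }
        List<Split> splits = new ArrayList<>();
        long begin = 0;
        int lines = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\n') {
                lines++;
                if (lines == linesPerSplit) {
                    // Close the current split right after this newline.
                    splits.add(new Split(begin, i + 1 - begin));
                    begin = i + 1;
                    lines = 0;
                }
            }
        }
        if (begin < data.length) {
            // Remaining lines that did not fill a complete group.
            splits.add(new Split(begin, data.length - begin));
        }
        return splits;
    }

    public static void main(String[] args) {
        String list = "/adb/sockets.c\n/adb/fdevent.c\n/adb/usb_osx.c\n";
        for (Split s : splitByLines(list.getBytes(StandardCharsets.UTF_8), 2)) {
            System.out.println(s.start + " +" + s.length);
        }
    }
}
```

Because every split ends on a line boundary, a line-based reader can process each range independently. Hadoop also ships an NLineInputFormat class (in `org.apache.hadoop.mapred.lib` for the old API) that builds splits of N lines each; whether it fits this job should be checked against the Hadoop version in use.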