InputFormat接口实现类（选择题）

最新推荐文章于 2022-06-19 08:21:53 发布

静默安然

最新推荐文章于 2022-06-19 08:21:53 发布

阅读量393

点赞数

分类专栏： hadoop

版权声明：本文为博主原创文章，遵循 CC 4.0 BY-SA 版权协议，转载请附上原文出处链接和本声明。

本文链接：https://blog.csdn.net/zhao2chen3/article/details/110703313

版权

hadoop 专栏收录该内容

45 篇文章 1 订阅

订阅专栏

MapReduce任务的输入文件一般是存储在HDFS里面。输入的文件格式包括：基于行的日志文件、二进制格式文件等。这些文件一般会很大，达到数十GB，甚至更大。那么MapReduce是如何读取这些数据的呢？下面我们首先学习InputFormat接口。

InputFormat常见的接口实现类包括：TextInputFormat、KeyValueTextInputFormat、NLineInputFormat、CombineTextInputFormat和自定义InputFormat等。

1）TextInputFormat

TextInputFormat是默认的InputFormat。每条记录是一行输入。键是LongWritable类型，存储该行在整个文件中的字节偏移量。值是这行的内容，不包括任何行终止符（换行符和回车符）。

以下是一个示例，比如，一个分片包含了如下4条文本记录。

Rich learning form

Intelligent learning engine

Learning more convenient

From the real demand for more close to the enterprise

每条记录表示为以下键/值对：

(0,Rich learning form)

(19,Intelligent learning engine)

(47,Learning more convenient)

(72,From the real demand for more close to the enterprise)

很明显，键并不是行号。一般情况下，很难取得行号，因为文件按字节而不是按行切分为分片。

2）KeyValueTextInputFormat

每一行均为一条记录，被分隔符分割为key，value。可以通过在驱动类中设置conf.set(KeyValueLineRecordReader.KEY_VALUE_SEPERATOR, " ");来设定分隔符。默认分隔符是tab（\t）。

以下是一个示例，输入是一个包含4条记录的分片。其中——>表示一个（水平方向的）制表符。

line1 ——>Rich learning form

line2 ——>Intelligent learning engine

line3 ——>Learning more convenient

line4 ——>From the real demand for more close to the enterprise

每条记录表示为以下键/值对：

(line1,Rich learning form)

(line2,Intelligent learning engine)

(line3,Learning more convenient)

(line4,From the real demand for more close to the enterprise)

此时的键是每行排在制表符之前的Text序列。

3）NLineInputFormat

如果使用NlineInputFormat，代表每个map进程处理的InputSplit不再按block块去划分，而是按NlineInputFormat指定的行数N来划分。即输入文件的总行数/N=切片数，如果不整除，切片数=商+1。

以下是一个示例，仍然以上面的4行输入为例。

Rich learning form

Intelligent learning engine

Learning more convenient

From the real demand for more close to the enterprise

例如，如果N是2，则每个输入分片包含两行。开启2个maptask。

(0,Rich learning form)

(19,Intelligent learning engine)

另一个 mapper 则收到后两行：

(47,Learning more convenient)

(72,From the real demand for more close to the enterprise)

这里的键和值与TextInputFormat生成的一样。

3.2.5 自定义InputFormat

1）概述

（1）自定义一个类继承FileInputFormat。

（2）改写RecordReader，实现一次读取一个完整文件封装为KV。

（3）在输出时使用SequenceFileOutPutFormat输出合并文件。

2）案例实操

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
复制链接

分享到 QQ

分享到新浪微博

扫一扫

专栏目录

评论

被折叠的条评论为什么被折叠?

到【灌水乐园】发言

查看更多评论

添加红包

成就一亿技术人!

hope_wisdom

发出的红包

实付元

使用余额支付

点击重新获取

扫码支付

钱包余额 0

抵扣说明：

1.余额是钱包充值的虚拟货币，按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载，可以购买VIP、付费专栏及课程。