Advanced MapReduce Programming: Writing a Custom InputFormat

天边tbdp

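InputFormat is one of the most frequently used concepts in MapReduce. What role does it actually play while a job runs?

In the old org.apache.hadoop.mapred API, InputFormat is an interface with two methods:

    public interface InputFormat<K, V> {
        InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
        RecordReader<K, V> getRecordReader(InputSplit split,
                                           JobConf job,
                                           Reporter reporter) throws IOException;
    }

The two methods do the following:

getSplits cuts the input data into splits. The number of splits is the number of map tasks, and by default a split is the size of one HDFS block (64 MB in Hadoop 1.x, 128 MB in Hadoop 2.x and later).

getRecordReader returns the RecordReader that parses each split into records and then turns each record into a <K,V> pair.

In other words, InputFormat does the following:

InputFile --> splits --> <K,V>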
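The first step, cutting a file into splits, can be sketched in plain Java without Hadoop. This is a deliberate simplification: it only cuts a file of a given length into fixed-size byte ranges, whereas Hadoop's FileInputFormat.getSplits also takes configured minimum and maximum split sizes and block locations into account. The names SplitSketch and computeSplits are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified illustration of cutting a file into splits; not Hadoop's actual algorithm. */
class SplitSketch {

    /** Returns {offset, length} pairs that together cover a file of fileLength bytes. */
    static List<long[]> computeSplits(long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<long[]> splits = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += splitSize) {
            // The last split may be shorter than splitSize.
            long length = Math.min(splitSize, fileLength - offset);
            splits.add(new long[] {offset, length});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 200 MB file with 64 MB splits: three full splits plus one 8 MB remainder.
        for (long[] s : computeSplits(200 * mb, 64 * mb)) {
            System.out.println("offset=" + s[0] + " length=" + s[1]);
        }
    }
}
```

Each {offset, length} pair corresponds to one split, and therefore to one map task.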

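Which InputFormat implementations ship with Hadoop?

Of the built-in implementations, TextInputFormat is the most commonly used. Its <K,V> is <byte offset of the line within the file, contents of the line>, that is, a LongWritable key and a Text value.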
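To make the <byte offset, line> pairs concrete, here is a small plain-Java sketch that computes them for an in-memory string. It assumes '\n' line endings and UTF-8 encoding, and it does not use Hadoop at all; OffsetLines and offsetsAndLines are hypothetical names used only here.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrates the kind of <offset, line> records that TextInputFormat produces. */
class OffsetLines {

    /** Maps the byte offset at which each line starts to the line's text (without '\n'). */
    static Map<Long, String> offsetsAndLines(String text) {
        Map<Long, String> records = new LinkedHashMap<>();
        if (text.isEmpty()) {
            return records;
        }
        long offset = 0;
        for (String line : text.split("\n")) {
            records.put(offset, line);
            // Advance by the line's UTF-8 byte length plus one byte for the '\n'.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        String text = "ball 3.5,12.7,9.0\ncar 15,23.76,42.23\ndevice 0.0,12.4,-67.1\n";
        offsetsAndLines(text).forEach((k, v) -> System.out.println("<" + k + ", " + v + ">"));
    }
}
```

The key is a position in the file, not a line number, which is why a mapper rarely uses it for anything other than identification.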

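However, these fixed, built-in ways of turning an InputFile into <K,V> pairs do not always meet our needs.

In that case we define our own InputFormat, so that the Hadoop framework parses the InputFile into <K,V> pairs in the way we specify.

Before writing a custom InputFormat, it helps to understand the following abstract classes and interfaces and how they relate to each other: InputFormat (interface), FileInputFormat (abstract class), TextInputFormat (class), RecordReader (interface), and LineRecordReader (class):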

FileInputFormat implements InputFormat

TextInputFormat extends FileInputFormat

TextInputFormat.getRecordReader calls LineRecordReader

LineRecordReader implements RecordReader

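The InputFormat interface has already been described above.

Now look at FileInputFormat. It implements the getSplits method of the InputFormat interface and leaves getRecordReader to concrete classes such as TextInputFormat. It also provides isSplitable, which returns true by default and usually does not need to be changed. A custom InputFormat therefore only has to implement getRecordReader.

In TextInputFormat, the core of getRecordReader is to create a LineRecordReader. It is the LineRecordReader class that does the work of "parsing each split into records and then turning each record into a <K,V> pair", and it implements the RecordReader interface:

    public interface RecordReader<K, V> {
        boolean next(K key, V value) throws IOException;
        K createKey();
        V createValue();
        long getPos() throws IOException;
        void close() throws IOException;
        float getProgress() throws IOException;
    }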

So the core of a custom InputFormat is a custom class that, like LineRecordReader, implements the RecordReader interface, and the core of that class is its implementation of the interface's methods.
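The way the framework drives such a reader can be sketched without Hadoop. The interface below only mirrors the shape of the old-API RecordReader (createKey, createValue, next); SimpleRecordReader, NameRestReader, and ReaderLoop are hypothetical stand-ins, and StringBuilder plays the role of a mutable type such as Text. The point of the sketch is that the key and value objects are allocated once and overwritten by every call to next.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in that mirrors the shape of the old-API RecordReader. */
interface SimpleRecordReader<K, V> {
    K createKey();
    V createValue();
    boolean next(K key, V value); // fills key and value; returns false at end of input
}

/** Reads "name rest" lines, reusing one key object and one value object. */
class NameRestReader implements SimpleRecordReader<StringBuilder, StringBuilder> {
    private final Iterator<String> lines;

    NameRestReader(List<String> lines) {
        this.lines = lines.iterator();
    }

    public StringBuilder createKey() { return new StringBuilder(); }

    public StringBuilder createValue() { return new StringBuilder(); }

    public boolean next(StringBuilder key, StringBuilder value) {
        if (!lines.hasNext()) {
            return false;
        }
        // Split into the first word and the remainder of the line.
        String[] parts = lines.next().trim().split("\\s+", 2);
        key.setLength(0);
        key.append(parts[0]);
        value.setLength(0);
        value.append(parts.length > 1 ? parts[1] : "");
        return true;
    }
}

class ReaderLoop {
    /** Drives a reader the way a map task would, collecting "key=value" strings. */
    static <K, V> List<String> drain(SimpleRecordReader<K, V> reader) {
        List<String> seen = new ArrayList<>();
        K key = reader.createKey();     // allocated once
        V value = reader.createValue(); // allocated once
        while (reader.next(key, value)) {
            seen.add(key + "=" + value); // copy out before the objects are overwritten
        }
        return seen;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("ball 3.5,12.7,9.0", "car 15,23.76,42.23");
        drain(new NameRestReader(input)).forEach(System.out::println);
    }
}
```

Because the same objects are reused, code that needs to keep a record beyond one iteration has to copy it, as drain does by building a new string.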

Example: each line of the data holds an object name followed by its x, y, and z coordinates:

ball 3.5,12.7,9.0

car 15,23.76,42.23

device 0.0,12.4,-67.1

Each line is to be parsed into a <Text, Point3D> pair (Point3D is the custom data type we defined in the previous post).
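Point3D itself is not shown in this post. The sketch below is only a guess at its shape, inferred from how the code in this post uses it (public float fields x, y, z and a three-argument set method); it is named Point3DSketch to make clear that it is not the class from the previous post. In a real job the class would implement org.apache.hadoop.io.Writable, whose write(DataOutput) and readFields(DataInput) methods have the signatures used here; the interface is left out so that the sketch compiles without Hadoop.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Assumed shape of a 3D point value type; a stand-in, not the original Point3D. */
class Point3DSketch {
    public float x;
    public float y;
    public float z;

    public void set(float x, float y, float z) {
        this.x = x;
        this.y = y;
        this.z = z;
    }

    /** Serializes the three coordinates in a fixed order. */
    public void write(DataOutput out) throws IOException {
        out.writeFloat(x);
        out.writeFloat(y);
        out.writeFloat(z);
    }

    /** Reads the coordinates back in the same order they were written. */
    public void readFields(DataInput in) throws IOException {
        x = in.readFloat();
        y = in.readFloat();
        z = in.readFloat();
    }

    @Override
    public String toString() {
        return x + "," + y + "," + z;
    }

    /** Serializes a point to bytes and reads it back into a fresh object. */
    static Point3DSketch roundTrip(Point3DSketch p) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            p.write(new DataOutputStream(bytes));
            Point3DSketch copy = new Point3DSketch();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Point3DSketch p = new Point3DSketch();
        p.set(3.5f, 12.7f, 9.0f);
        System.out.println(roundTrip(p));
    }
}
```

The mutable fields and the set method matter for the readers in this post, because a RecordReader overwrites one value object per record instead of allocating a new one.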

Approach 1: the custom RecordReader wraps a LineRecordReader (old org.apache.hadoop.mapred API):

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.LineRecordReader;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    public class ObjectPositionInputFormat extends FileInputFormat<Text, Point3D> {

        @Override
        public RecordReader<Text, Point3D> getRecordReader(
                InputSplit input, JobConf job, Reporter reporter) throws IOException {
            reporter.setStatus(input.toString());
            return new ObjPosRecordReader(job, (FileSplit) input);
        }
    }

    class ObjPosRecordReader implements RecordReader<Text, Point3D> {

        // The wrapped reader does the line splitting; we only re-parse each line.
        private LineRecordReader lineReader;
        private LongWritable lineKey;
        private Text lineValue;

        public ObjPosRecordReader(JobConf job, FileSplit split) throws IOException {
            lineReader = new LineRecordReader(job, split);
            lineKey = lineReader.createKey();
            lineValue = lineReader.createValue();
        }

        @Override
        public boolean next(Text key, Point3D value) throws IOException {
            // get the next line
            if (!lineReader.next(lineKey, lineValue)) {
                return false;
            }

            // Parse the line: objName followed by x, y, z. Splitting on commas
            // and/or whitespace accepts both "ball 3.5,12.7,9.0" (the sample data
            // above) and "ball, 3.5, 12.7, 9.0".
            String[] pieces = lineValue.toString().trim().split("[\\s,]+");
            if (pieces.length != 4) {
                throw new IOException("Invalid record received");
            }

            // try to parse floating point components of value
            float fx, fy, fz;
            try {
                fx = Float.parseFloat(pieces[1]);
                fy = Float.parseFloat(pieces[2]);
                fz = Float.parseFloat(pieces[3]);
            } catch (NumberFormatException nfe) {
                throw new IOException("Error parsing floating point value in record");
            }

            // now that we know we'll succeed, overwrite the output objects
            key.set(pieces[0]); // objName is the output key.
            value.x = fx;
            value.y = fy;
            value.z = fz;
            return true;
        }

        @Override
        public Text createKey() {
            return new Text("");
        }

        @Override
        public Point3D createValue() {
            return new Point3D();
        }

        @Override
        public long getPos() throws IOException {
            return lineReader.getPos();
        }

        @Override
        public void close() throws IOException {
            lineReader.close();
        }

        @Override
        public float getProgress() throws IOException {
            return lineReader.getProgress();
        }
    }

Approach 2: the custom RecordReader uses a LineReader directly (new org.apache.hadoop.mapreduce API):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.util.LineReader;

    public class ObjectPositionInputFormat extends FileInputFormat<Text, Point3D> {

        @Override
        protected boolean isSplitable(JobContext context, Path filename) {
            // The reader below always starts at the beginning of the file,
            // so a file must not be divided into several splits.
            return false;
        }

        @Override
        public RecordReader<Text, Point3D> createRecordReader(InputSplit inputsplit,
                TaskAttemptContext context) throws IOException, InterruptedException {
            return new ObjPosRecordReader();
        }

        public static class ObjPosRecordReader extends RecordReader<Text, Point3D> {

            private LineReader in;
            private Text line;
            private Text lineKey;
            private Point3D lineValue;

            @Override
            public void close() throws IOException {
                if (in != null) {
                    in.close();
                }
            }

            @Override
            public Text getCurrentKey() throws IOException, InterruptedException {
                return lineKey;
            }

            @Override
            public Point3D getCurrentValue() throws IOException, InterruptedException {
                return lineValue;
            }

            @Override
            public float getProgress() throws IOException, InterruptedException {
                // Progress is not tracked in this simple example.
                return 0;
            }

            @Override
            public void initialize(InputSplit input, TaskAttemptContext context)
                    throws IOException, InterruptedException {
                FileSplit split = (FileSplit) input;
                Configuration job = context.getConfiguration();
                Path file = split.getPath();
                FileSystem fs = file.getFileSystem(job);
                FSDataInputStream filein = fs.open(file);
                in = new LineReader(filein, job);
                line = new Text();
                lineKey = new Text();
                lineValue = new Point3D();
            }

            @Override
            public boolean nextKeyValue() throws IOException, InterruptedException {
                int linesize = in.readLine(line);
                if (linesize == 0) {
                    return false; // end of input
                }

                // Expected format: "objName x,y,z" (name, whitespace, then
                // three comma-separated floats).
                StringTokenizer token = new StringTokenizer(line.toString());
                if (token.countTokens() != 2) {
                    throw new IOException("Invalid record received: " + line);
                }
                String name = token.nextToken();
                String[] points = token.nextToken().split(",");
                if (points.length != 3) {
                    throw new IOException("Invalid record received: " + line);
                }

                try {
                    lineValue.set(Float.parseFloat(points[0]),
                            Float.parseFloat(points[1]),
                            Float.parseFloat(points[2]));
                } catch (NumberFormatException nfe) {
                    throw new IOException("Error parsing floating point value in record", nfe);
                }
                lineKey.set(name);
                return true;
            }
        }
    }

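As the two approaches show, the core of a custom InputFormat is your own RecordReader, modeled on LineRecordReader, and that reader will often build on LineReader, LineRecordReader, or KeyValueLineRecordReader.

So to write custom InputFormats, you need to know the source code of these three classes well.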
