Hadoop Java exercise: how to read a file at an offset from Hadoop using Java

Problem: I want to read a section of a file from HDFS and return it, such as lines 101-120 from a file of 1000 lines.

I don't want to use seek because I have read that it is expensive.

I have log files which I am using PIG to process down into meaningful sets of data. I've been writing an API to return the data for consumption and display by a front end. Those processed data sets can be large enough that I don't want to read the entire file out of Hadoop in one slurp to save wire time and bandwidth. (Let's say 5 - 10MB)

Currently I am using a BufferedReader to return small summary files, which works fine:

ArrayList<String[]> lines = new ArrayList<String[]>();

...

for (FileStatus item : items) {
    // ignore files like _SUCCESS
    if (item.getPath().getName().startsWith("_")) {
        continue;
    }

    in = fs.open(item.getPath());
    BufferedReader br = new BufferedReader(new InputStreamReader(in));

    String line = br.readLine();
    while (line != null) {
        line = line.replaceAll("(\\r|\\n)", "");
        lines.add(line.split("\t"));
        line = br.readLine();
    }
    br.close();
}

I've poked around the interwebs quite a bit as well as Stack but haven't found exactly what I need.

Perhaps this is completely the wrong way to go about doing it and I need a completely separate set of code and different functions to manage this. Open to any suggestions.

Thanks!

Solution

I think seek is the best option for reading files with huge volumes. It did not cause any problems for me, and the volume of data I was reading was in the range of 2-3 GB. I have not encountered any issues to date, though we did use file splitting to handle the large data set. Below is the code you can use for reading and test:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.json.JSONObject;

public class HDFSClientTesting {

    public static void main(String[] args) {
        try {
            // Load the cluster configuration before asking for the FileSystem,
            // so core-site.xml actually takes effect
            Configuration conf = new Configuration();
            conf.addResource(new Path("core-site.xml"));
            FileSystem fs = FileSystem.get(conf);

            String filename = "/dir/00000027";
            long byteOffset = 3185041;

            SequenceFile.Reader rdr = new SequenceFile.Reader(fs, new Path(filename), conf);

            Text key = new Text();
            Text value = new Text();

            rdr.seek(byteOffset);  // jump to the record boundary at this offset
            rdr.next(key, value);  // read the record found there

            // The stored value is JSON; pull out its "body" field as plain text
            JSONObject jso = new JSONObject(value.toString());
            String content = jso.getString("body");
            System.out.println("\n\n\n" + content + "\n\n\n");

            rdr.close();
        }
        catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
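If the files are plain text rather than sequence files, a byte-level seek is not strictly necessary for a window as small as 20 lines: you can read forward, skip everything before the window, and stop as soon as the last wanted line is reached, so the rest of the file is never pulled over the wire. Below is a minimal sketch of that pattern; the class and method names (LineRangeReader, readLineRange) are made up for illustration, and it is demonstrated on an in-memory reader, but with HDFS you would construct the reader as new BufferedReader(new InputStreamReader(fs.open(path))):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class LineRangeReader {

    // Returns lines firstLine..lastLine (1-based, inclusive) from the reader.
    // Stops reading as soon as lastLine has been consumed.
    public static List<String> readLineRange(BufferedReader br, int firstLine, int lastLine)
            throws IOException {
        List<String> out = new ArrayList<String>();
        String line;
        int lineNo = 0;
        while ((line = br.readLine()) != null) {
            lineNo++;
            if (lineNo < firstLine) {
                continue;          // skip lines before the window
            }
            out.add(line);
            if (lineNo == lastLine) {
                break;             // stop early; don't read the rest of the file
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Demo on an in-memory "file" of 10 lines; request lines 4-6
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i <= 10; i++) {
            sb.append("line").append(i).append("\n");
        }
        BufferedReader br = new BufferedReader(new StringReader(sb.toString()));
        List<String> window = readLineRange(br, 4, 6);
        System.out.println(window);  // [line4, line5, line6]
    }
}
```

The skip is still O(n) in the start offset, so for windows deep inside multi-gigabyte text files a byte-offset seek followed by discarding the first partial line (the way Hadoop's own input splits work) would scale better.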
