Processing a Large Text File Group by Group with a Grouping Cursor

润乾软件

Published 2022-10-09 09:48:26

Article link: https://blog.csdn.net/raqsoft/article/details/127221014

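Summary (auto-generated by CSDN): A 100GB text file holds one comma-separated pair of values per line, and the lines are to be grouped by the first value. The file is too large to load into memory at once, so it has to be processed one group at a time. The asker's current approach scans the file line by line with `BufferedReader`; when a line from a new group appears, it holds that line aside, processes the previous group, clears the buffer, and continues reading. The asker is looking for a better solution. The answer recommends SPL, which reads and processes the file group by group through a cursor.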

【Question】

I have a large (~100GB) text file structured like this:

A,foobar
A,barfoo
A,foobar
B,barfoo
B,barfoo
C,foobar

Each line is a comma-separated pair of values. The file is sorted by the first value in the pair. The lines are variable length. Define a group as being all lines with a common first value, i.e. with the example quoted above all lines starting with "A," would be a group, all lines starting with "B," would be another group.

The entire file is too large to fit into memory, but all the lines from any individual group will always fit into memory.

I have a routine for processing a single such group of lines and writing to a text file. My problem is that I don't know how best to read the file a group at a time. All the groups are of arbitrary, unknown size. I have considered two ways:

1) Scan the file using a `BufferedReader`, accumulating the lines from a group in a String or array. Whenever a line is encountered that belongs to a new group, hold that line in a temporary variable and process the previous group. Clear the accumulator, add the temporary, and then continue reading the new group starting from the second line.

2) Scan the file using a `BufferedReader`. Whenever a line is encountered that belongs to a new group, somehow reset the cursor so that when `readLine()` is next invoked it starts from the first line of the group instead of the second. I have looked into `mark()` and `reset()` but these require knowing the byte-position of the start of the line.

I'm going to go with (1) at the moment, but I would be very grateful if someone could suggest a method that smells less.
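
For reference, option (1) could look roughly like the sketch below. This is only an illustration under assumptions: the class name `GroupScan`, the helpers `keyOf` and `forEachGroup`, and the callback-style handler are hypothetical and do not come from the original question. The `main` method runs on the six sample lines in memory; for the real file, a reader obtained from `Files.newBufferedReader(Paths.get(path))` would be passed in instead.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class GroupScan {
    // Key of a line: the text before the first comma (the whole line if there is no comma).
    static String keyOf(String line) {
        int comma = line.indexOf(',');
        return comma < 0 ? line : line.substring(0, comma);
    }

    // Reads input that is sorted by key and hands each complete group to the handler.
    // Only one group is held in memory at a time.
    static void forEachGroup(Reader in, Consumer<List<String>> handler) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        List<String> group = new ArrayList<>();
        String currentKey = null;
        String line;
        while ((line = reader.readLine()) != null) {
            String key = keyOf(line);
            if (currentKey != null && !key.equals(currentKey)) {
                // The line just read belongs to a new group: flush the previous one.
                handler.accept(group);
                group = new ArrayList<>();
            }
            currentKey = key;
            group.add(line); // the first line of a new group lands in the fresh list
        }
        if (!group.isEmpty()) {
            handler.accept(group); // the last group has no following line to trigger a flush
        }
    }

    public static void main(String[] args) throws IOException {
        // Sample data taken from the question, kept in memory for the demo.
        String sample = "A,foobar\nA,barfoo\nA,foobar\nB,barfoo\nB,barfoo\nC,foobar\n";
        forEachGroup(new StringReader(sample),
                g -> System.out.println(keyOf(g.get(0)) + " -> " + g.size() + " lines"));
    }
}
```

Because the line that opens a new group is added directly to a fresh list, no separate temporary variable is needed in this variant; the one detail that is easy to miss is flushing the final group after the loop ends.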

【Answer】

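The logic of the algorithm is clear: while (thereAreGroupsRemaining) { String s = readNextGroup(); process(s); }. Implementing it in Java, however, involves many details, and the code will not be very simple. SPL is recommended instead: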

	A	B
1	=file("e:\\bigfile.txt").cursor@c()
2	for A1 ;_1
3		=A2.select(like(_2,"foo*"))

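A1: Opens the file bigfile.txt as a cursor.

A2: Reads the file. Each pass of the loop reads the rows that share the same value in the first column into memory as one group; _1 refers to the first column.

B3: Query step. Selects the records of the current group whose second column starts with "foo".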

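This script follows the same logic as the algorithm above. A2 corresponds to while (thereAreGroupsRemaining) together with readNextGroup(); B3 corresponds to process(s); and inside the loop A2 also plays the role of s, that is, all records of the current group.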
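
To make the correspondence concrete, the same three roles can be written out in plain Java as a pull-style reader. This is a hedged sketch, not the author's implementation and not how SPL works internally: the class `GroupReader` and its methods `hasNextGroup`, `readNextGroup`, and `process` are hypothetical names chosen to match the pseudocode, and `startsWith("foo")` is assumed to be a reasonable stand-in for `like(_2,"foo*")`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class GroupReader {
    private final BufferedReader reader;
    private String pending; // first line of the next group, already read from the input

    GroupReader(Reader in) throws IOException {
        reader = new BufferedReader(in);
        pending = reader.readLine();
    }

    // Counterpart of while (thereAreGroupsRemaining).
    boolean hasNextGroup() {
        return pending != null;
    }

    // Counterpart of readNextGroup(): all rows sharing the next first-column value,
    // each row split into [first column, rest of line].
    List<String[]> readNextGroup() throws IOException {
        List<String[]> group = new ArrayList<>();
        String key = null;
        while (pending != null) {
            String[] row = pending.split(",", 2);
            if (key == null) {
                key = row[0];
            } else if (!row[0].equals(key)) {
                break; // pending now holds the first line of the following group
            }
            group.add(row);
            pending = reader.readLine();
        }
        return group;
    }

    // Counterpart of B3: keep the rows whose second column starts with "foo".
    static List<String[]> process(List<String[]> group) {
        List<String[]> hits = new ArrayList<>();
        for (String[] row : group) {
            if (row.length > 1 && row[1].startsWith("foo")) {
                hits.add(row);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Sample data from the question, kept in memory for the demo.
        String sample = "A,foobar\nA,barfoo\nA,foobar\nB,barfoo\nB,barfoo\nC,foobar\n";
        GroupReader groups = new GroupReader(new StringReader(sample));
        while (groups.hasNextGroup()) {
            List<String[]> s = groups.readNextGroup();
            List<String[]> hits = process(s);
            System.out.println(s.get(0)[0] + ": " + hits.size() + " of " + s.size() + " rows match");
        }
    }
}
```

The bookkeeping around `pending` is the part that the `for A1;_1` loop takes care of in the SPL script.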

The script above is easy to integrate into Java; see "How to call an SPL script from Java" (Java 如何调用 SPL 脚本).
