An Efficient Approach to Analyzing Large Text Files

润乾软件 (Raqsoft)

Published on 2022-04-29 09:35:22

Article link: https://blog.csdn.net/raqsoft/article/details/124488817

【Question】

What’s the most efficient way to parse a large text file?

I’m trying to optimize some code. I have to open a large text file, match each line against a regular expression, and then process the results.

I’ve tried the simple approaches:

    for line in my_file:
        match = my_regx.match(line)
        process(match.groups())

and

    data = my_file.read().splitlines()
    for line in data:
        # etc.

Neither is terribly speedy. Does anyone have a better method?

【Answer】

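Multithreaded parallel computation can speed up the matching, but multithreaded code in Python is fairly complex, and the file has to be split into segments by byte count, which is not easy to get right. SPL makes this much simpler: file1.txt is split into several segments, each thread processes one segment and uses a regular expression to pick out the lines that match, and the results are then merged and written to a file. The code is as follows: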

	A
1	=file("D:\\file1.txt")
2	=A1.cursor@m(;4).(~.array().concat())
3	=A2.regex(".*smile.*")
4	=file("D:\\result.txt").export(A3)
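For comparison, the byte-based segmentation mentioned above can be sketched in Python. This is a minimal illustration rather than a description of how SPL works internally: the helper names `segment_bounds`, `match_segment`, and `parallel_grep` are hypothetical, and the file is assumed to be UTF-8 text with newline-terminated lines. Threads keep the sketch short; because of CPython's global interpreter lock, a real speedup on regex-heavy work would probably need a process pool (with a module-level worker function in place of the lambda).

```python
import os
import re
from concurrent.futures import ThreadPoolExecutor


def segment_bounds(path, n_segments):
    """Split a file into byte ranges [start, end) aligned to line boundaries."""
    size = os.path.getsize(path)
    bounds = [0]
    with open(path, "rb") as f:
        for i in range(1, n_segments):
            f.seek(size * i // n_segments)
            f.readline()  # move forward to the start of the next line
            bounds.append(min(f.tell(), size))
    bounds.append(size)
    return list(zip(bounds[:-1], bounds[1:]))


def match_segment(path, start, end, pattern):
    """Return the lines inside [start, end) that the compiled pattern matches."""
    hits = []
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            line = f.readline()
            if pattern.search(line):
                hits.append(line.rstrip(b"\r\n").decode("utf-8"))
    return hits


def parallel_grep(path, regex, n_segments=4):
    """Scan each segment in its own worker and merge the hits in file order."""
    pattern = re.compile(regex.encode("utf-8"))  # bytes pattern for binary reads
    segments = segment_bounds(path, n_segments)
    with ThreadPoolExecutor(max_workers=n_segments) as pool:
        parts = pool.map(
            lambda seg: match_segment(path, seg[0], seg[1], pattern), segments
        )
        return [line for part in parts for line in part]
```

A call such as `parallel_grep("D:\\file1.txt", ".*smile.*", 4)` would return the matching lines as a list, which can then be written to a result file. Even in this reduced form, the boundary alignment in `segment_bounds` is the part that has to be handled by hand.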

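Regular expressions are powerful but comparatively slow. When the matching rule is simple, the faster like function can be used instead, for example: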

	A
1	=file("D:\\file1.txt")
2	=A1.cursor@m(;4)
3	=A2.select(like(#1,"*smile*"))
4	=file("D:\\result.txt").export(A3)
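The same trade-off can be illustrated in Python: when the rule is just "the line contains a given word", a plain substring test avoids the regex engine altogether. The sketch below uses two hypothetical helpers, `filter_regex` and `filter_substring`, and assumes the lines are already available as strings; how much faster the substring version is depends on the data and should be measured.

```python
import re


def filter_regex(lines, regex):
    """Keep the lines in which the regular expression finds a match."""
    pattern = re.compile(regex)
    return [line for line in lines if pattern.search(line)]


def filter_substring(lines, needle):
    """Keep the lines that contain the literal text, without using a regex."""
    return [line for line in lines if needle in line]
```

For a simple rule, `filter_substring(lines, "smile")` selects the same lines as `filter_regex(lines, ".*smile.*")`.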

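esProc (集算器) also provides a rich set of computing functions, such as grouping and aggregation, ranking and sorting, join calculations, multi-file queries, and merge lookups, which make it easy to implement all kinds of complex algorithmic logic.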
