Removing duplicate words with Java: how do you remove duplicate words in Java when there are more than 200 million of them?

Latest recommended article published on 2021-10-26 10:50:53

一碗面条v

Article tags: removing duplicate words with Java

I have a file (size = ~1.9 GB) which contains ~220,000,000 (~220 million) words / strings. It contains duplicates: roughly 1 duplicate word in every 100 words.

In my second program, I want to read the file. I can successfully read it line by line using BufferedReader.

Now, to remove duplicates, we can use a Set (and its implementations), but Set has problems, as described in the following 3 scenarios:

With the default JVM size, the Set can hold up to 0.7-0.8 million words, and then an OutOfMemoryError is thrown.

With a 512M JVM size, the Set can hold up to 5-6 million words, and then an OOM error occurs.

With a 1024M JVM size, the Set can hold up to 12-13 million words, and then an OOM error occurs. Here, once about 10 million records have been added to the Set, operations become extremely slow. For example, adding the next ~4000 records took 60 seconds.

I am restricted from increasing the JVM size any further, and I want to remove the duplicate words from the file.

Please let me know if you have any idea about any other ways/approaches to remove duplicate words using Java from such a gigantic file. Many Thanks :)

Additional info for the question: my words are basically alphanumeric, and they are IDs that are unique in our system. Hence they are not plain English words.

Solution

Use merge sort and remove the duplicates in a second pass. You could even remove the duplicates while merging (just keep the latest word added to output in RAM and compare the candidates to it as well).
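
One way this could look in practice is an external merge sort: sort the file in chunks that fit in memory, write each chunk to disk, then merge the chunk files while skipping any word equal to the last one written. The following is only a minimal sketch, not code from the original answer. The class name `ExternalDedup`, its method names, and the `maxWordsPerChunk` parameter are hypothetical, and it assumes one word per line, UTF-8 text, and few enough chunks that all chunk files can be open at the same time.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.TreeSet;

// Hypothetical helper class: all names here are illustrative assumptions.
final class ExternalDedup {

    /** One sorted chunk file plus the line currently at its head. */
    private static final class Run implements Comparable<Run> {
        final BufferedReader reader;
        String head;

        Run(Path file) throws IOException {
            reader = Files.newBufferedReader(file, StandardCharsets.UTF_8);
            head = reader.readLine();
        }

        /** Moves to the next line; returns false when the chunk is exhausted. */
        boolean advance() throws IOException {
            head = reader.readLine();
            return head != null;
        }

        @Override
        public int compareTo(Run other) {
            return head.compareTo(other.head);
        }
    }

    /** Phase 1: split the input into sorted, locally de-duplicated chunk files. */
    static List<Path> writeSortedChunks(Path input, int maxWordsPerChunk) throws IOException {
        List<Path> chunks = new ArrayList<>();
        TreeSet<String> buffer = new TreeSet<>(); // sorted, drops duplicates within a chunk
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String word;
            while ((word = in.readLine()) != null) {
                buffer.add(word);
                if (buffer.size() >= maxWordsPerChunk) {
                    chunks.add(flush(buffer));
                }
            }
        }
        if (!buffer.isEmpty()) {
            chunks.add(flush(buffer));
        }
        return chunks;
    }

    /** Writes the buffer to a temporary file in sorted order and empties it. */
    private static Path flush(TreeSet<String> buffer) throws IOException {
        Path chunk = Files.createTempFile("dedup-chunk-", ".txt");
        Files.write(chunk, buffer, StandardCharsets.UTF_8);
        buffer.clear();
        return chunk;
    }

    /** Phase 2: k-way merge of the chunks, skipping words equal to the last one written. */
    static void mergeUnique(List<Path> chunks, Path output) throws IOException {
        PriorityQueue<Run> queue = new PriorityQueue<>();
        for (Path chunk : chunks) {
            Run run = new Run(chunk);
            if (run.head != null) queue.add(run); else run.reader.close();
        }
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            String last = null; // the only word kept in RAM for duplicate detection
            while (!queue.isEmpty()) {
                Run run = queue.poll();
                if (!run.head.equals(last)) {
                    last = run.head;
                    out.write(last);
                    out.newLine();
                }
                if (run.advance()) queue.add(run); else run.reader.close();
            }
        }
        for (Path chunk : chunks) Files.deleteIfExists(chunk);
    }

    /** Runs both phases: the output holds each distinct word once, in sorted order. */
    static void dedup(Path input, Path output, int maxWordsPerChunk) throws IOException {
        mergeUnique(writeSortedChunks(input, maxWordsPerChunk), output);
    }
}
```

A call such as `ExternalDedup.dedup(inputPath, outputPath, 1_000_000)` would keep at most about a million words in memory at a time; the right chunk size depends on the available heap and would need tuning. Note that the output is sorted, so the original order of the words is not preserved.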
