Removing duplicate words from a large text file in Java

I have a text file that is over 50 GB in size.

Now I want to delete the duplicate words.

But I have heard that I would need a lot of RAM to load every word from the text file into a HashSet.

Can you tell me a good way to delete every duplicate word from the text file?

The words are separated by whitespace, like this:

word1 word2 word3 ... ...

Solution

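The H2 answer is good, but maybe overkill. A set holds only the distinct words, so its memory use depends on the size of the vocabulary, not on the size of the file. Assuming the file contains ordinary English words, all the words in the English language won't take more than a few MB. Just use a set. You could use this in RAnders00's program: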

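    import java.io.FileInputStream;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;
    import java.util.HashSet;
    import java.util.Scanner;
    import java.util.Set;

    public static void read50Gigs(String fileLocation, String newFileLocation) {
        // Only distinct words are kept in memory.
        Set<String> words = new HashSet<>();
        try (FileInputStream fileInputStream = new FileInputStream(fileLocation);
             Scanner scanner = new Scanner(fileInputStream)) {
            // Scanner splits on whitespace by default, so no line breaks are needed.
            while (scanner.hasNext()) {
                words.add(scanner.next());
            }
            System.out.println("words size " + words.size());
            // Writes one word per line, in no particular order.
            // TRUNCATE_EXISTING avoids leaving stale data in an existing output file.
            Files.write(Paths.get(newFileLocation), words,
                    StandardOpenOption.CREATE,
                    StandardOpenOption.TRUNCATE_EXISTING,
                    StandardOpenOption.WRITE);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }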
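The method above collects every distinct word first and writes them out at the end, one per line and in arbitrary order. If the output should instead keep the first occurrence of each word in its original position, separated by spaces as in the input, a variant can write each word the moment it is first seen. The following is only a sketch under that assumption; the class name `StreamingDedup` and the method `dedup` are made up for illustration, and UTF-8 input is assumed.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Scanner;
import java.util.Set;

// Hypothetical helper, not part of the original answer.
final class StreamingDedup {

    /**
     * Copies each whitespace-separated word from in to out the first time
     * it is seen. Returns the number of distinct words.
     */
    static int dedup(Path in, Path out) throws IOException {
        Set<String> seen = new HashSet<>();
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             Scanner scanner = new Scanner(reader);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            boolean first = true;
            while (scanner.hasNext()) {
                String word = scanner.next();
                // Set.add returns true only when the word was not present before.
                if (seen.add(word)) {
                    if (!first) {
                        writer.write(' ');
                    }
                    writer.write(word);
                    first = false;
                }
            }
            // Scanner swallows read errors; surface them explicitly.
            IOException readError = scanner.ioException();
            if (readError != null) {
                throw readError;
            }
        }
        return seen.size();
    }
}
```

Memory use is still bounded by the number of distinct words, since only the set is held in memory while input and output are streamed.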

As an estimate of the number of common words, I ran this variant on War and Peace (from Project Gutenberg):

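    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;
    import java.util.Set;
    import java.util.regex.Pattern;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    public static void read50Gigs(String fileLocation, String newFileLocation) {
        // fileLocation was "war and peace.txt" and newFileLocation "out.txt" in my test.
        // try-with-resources closes the underlying file when the stream is done.
        try (Stream<String> lines = Files.lines(Paths.get(fileLocation))) {
            Set<String> words = lines
                    .map(s -> s.replaceAll("[^a-zA-Z\\s]", ""))      // keep letters and whitespace only
                    .flatMap(Pattern.compile("\\s")::splitAsStream) // split each line into words
                    .collect(Collectors.toSet());
            System.out.println("words size " + words.size()); // 22100 for War and Peace
            Files.write(Paths.get(newFileLocation), words,
                    StandardOpenOption.CREATE,
                    StandardOpenOption.TRUNCATE_EXISTING,
                    StandardOpenOption.WRITE);
        } catch (IOException e) {
            // Do not swallow the error silently.
            throw new UncheckedIOException(e);
        }
    }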

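It completed in 0 seconds. Note that Files.lines only helps if your huge source file has line breaks: it reads the file lazily, line by line, so memory use stays low. If the whole file is a single line, that one line would have to be loaded into memory at once, so Files.lines is not suitable in that case.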
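For a file with no line breaks at all, one option is to read fixed-size chunks of characters and split words by hand, carrying a partial word over from one chunk to the next. The sketch below illustrates that idea only; `ChunkedWordReader`, `uniqueWords` and the `chunkSize` parameter are hypothetical names, and a `LinkedHashSet` is used here so that the words keep their first-seen order.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of the original answer.
final class ChunkedWordReader {

    /** Collects distinct whitespace-separated words, reading chunkSize chars at a time. */
    static Set<String> uniqueWords(Reader reader, int chunkSize) throws IOException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        Set<String> words = new LinkedHashSet<>();
        char[] buffer = new char[chunkSize];
        // Word currently being assembled; it may span two or more chunks.
        StringBuilder current = new StringBuilder();
        int read;
        while ((read = reader.read(buffer)) != -1) {
            for (int i = 0; i < read; i++) {
                char c = buffer[i];
                if (Character.isWhitespace(c)) {
                    if (current.length() > 0) {
                        words.add(current.toString());
                        current.setLength(0);
                    }
                } else {
                    current.append(c);
                }
            }
        }
        // Flush the last word if the input does not end with whitespace.
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }
}
```

For a real file, the reader could be obtained with `Files.newBufferedReader(path)`; only one chunk and the set of distinct words are in memory at any time, regardless of how long a single line is.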
