Counting words in a text with Java 8

I am trying to implement a word count program in Java 8, but I cannot get it to work. The method must take a String as its parameter and return a Map.

When I do it the old Java way, everything works fine. But when I try to do it in Java 8 style, it returns a map whose keys are blank, although the occurrence counts are correct.

Here is my code in Java 8 style:

public Map<String, Integer> countJava8(String input) {
    return Pattern.compile("(\\w+)")
            .splitAsStream(input)
            .collect(Collectors.groupingBy(e -> e.toLowerCase(),
                    Collectors.reducing(0, e -> 1, Integer::sum)));
}

Here is the code I would normally use:

public Map<String, Integer> count(String input) {
    Map<String, Integer> wordcount = new HashMap<>();
    Pattern compile = Pattern.compile("(\\w+)");
    Matcher matcher = compile.matcher(input);
    while (matcher.find()) {
        String word = matcher.group().toLowerCase();
        if (wordcount.containsKey(word)) {
            Integer count = wordcount.get(word);
            wordcount.put(word, ++count);
        } else {
            wordcount.put(word, 1);
        }
    }
    return wordcount;
}

The main program:

public static void main(String[] args) {
    WordCount wordCount = new WordCount();
    Map<String, Integer> phrase = wordCount.countJava8("one fish two fish red fish blue fish");
    Map<String, Integer> count = wordCount.count("one fish two fish red fish blue fish");
    System.out.println(phrase);
    System.out.println();
    System.out.println(count);
}

When I run this program, I get the following output:

{ =7, =1}

{red=1, blue=1, one=1, fish=4, two=1}

I thought that splitAsStream would stream the elements matching the regex as a Stream. How can I correct that?

Solution

The problem seems to be that you are in fact splitting by words, i.e. you are streaming over everything that is not a word, or that is in between words. Unfortunately, there seems to be no equivalent method for streaming the actual match results (hard to believe, but I did not find any; feel free to comment if you know one).
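To see where the blank keys come from, it can help to look at what splitAsStream actually yields. The following is only a small sketch for illustration; the class SplitTokensDemo and the method tokens are made-up names, not part of the original code.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical demo class, not part of the original question or answer.
class SplitTokensDemo {

    // Collects the pieces that splitAsStream produces for the given regex.
    static List<String> tokens(String regex, String input) {
        return Pattern.compile(regex)
                .splitAsStream(input)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String input = "one fish two fish";
        // With \w+ the words are the delimiters, so only the text around them remains.
        System.out.println(tokens("\\w+", input));
        // With \W+ the gaps are the delimiters, so the words remain.
        System.out.println(tokens("\\W+", input));
    }
}
```

With \w+ as the pattern, the stream holds one empty string (the text before the first word) followed by the single spaces between the words. Grouping those is what produces the two blank keys in the question's output.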

Instead, you could just split by non-words, using \W instead of \w. Also, as noted in comments, you can make it a bit more readable by using String::toLowerCase instead of a lambda and Collectors.summingInt.

public static Map<String, Integer> countJava8(String input) {
    return Pattern.compile("\\W+")
            .splitAsStream(input)
            .collect(Collectors.groupingBy(String::toLowerCase,
                    Collectors.summingInt(s -> 1)));
}
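One caveat with splitting on \W+: if the input starts with a non-word character (a space or a quote, say), the stream begins with an empty string, which would show up as an empty key. A possible variant that filters those out might look like the sketch below; the names WordCountFiltered and countSkippingEmpty are assumptions for illustration.

```java
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical variant of countJava8; class and method names are made up.
class WordCountFiltered {

    static Map<String, Integer> countSkippingEmpty(String input) {
        return Pattern.compile("\\W+")
                .splitAsStream(input)
                // A delimiter at the very start yields an empty first token; drop it.
                .filter(s -> !s.isEmpty())
                .collect(Collectors.groupingBy(String::toLowerCase,
                        Collectors.summingInt(s -> 1)));
    }

    public static void main(String[] args) {
        System.out.println(countSkippingEmpty("  One fish, two fish!"));
    }
}
```

Trailing delimiters need no special handling, since splitAsStream discards trailing empty strings on its own.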

But IMHO this is still very hard to comprehend, not only because of the "inverse" lookup, and it's also difficult to generalize to other, more complex patterns. Personally, I would just go with the "old school" solution, maybe making it a bit more compact using the new getOrDefault.

public static Map<String, Integer> countOldschool(String input) {
    Map<String, Integer> wordcount = new HashMap<>();
    Matcher matcher = Pattern.compile("\\w+").matcher(input);
    while (matcher.find()) {
        String word = matcher.group().toLowerCase();
        wordcount.put(word, wordcount.getOrDefault(word, 0) + 1);
    }
    return wordcount;
}

The result seems to be the same in both cases.
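As a follow-up for newer JDKs: Java 9 added Matcher.results(), which streams the actual matches as MatchResult objects, so the "inverse" pattern is no longer needed there. This does not help on Java 8 itself. A minimal sketch, assuming Java 9 or later (the names WordCountResults and countJava9 are made up for illustration):

```java
import java.util.Map;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Requires Java 9 or later; class and method names are assumptions.
class WordCountResults {

    static Map<String, Integer> countJava9(String input) {
        return Pattern.compile("\\w+")
                .matcher(input)
                .results()                 // Stream<MatchResult>, one element per match
                .map(MatchResult::group)   // the matched word itself
                .collect(Collectors.groupingBy(String::toLowerCase,
                        Collectors.summingInt(s -> 1)));
    }

    public static void main(String[] args) {
        System.out.println(countJava9("one fish two fish red fish blue fish"));
    }
}
```

Because this streams the matches rather than the gaps, it should also carry over to more complex patterns without having to invert the regex.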
