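Word Frequency Counting with Java Streams
Word frequency counting is a classic example when learning big data. It can be implemented with MapReduce, Scala, Spark, Hive, and similar technologies. The Stream API introduced in Java 8 also makes this example easy to write.
Here is the code: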
@Test
public void test() {
    List<String> list = Arrays.asList("hello", "hadoop", "hive", "hadoop", "hadoop", "hello");
    list.stream()
        .collect(Collectors.groupingBy(x -> x)) // word -> list of its occurrences
        .values().stream()
        .map(x -> { Map<String, Integer> map = new HashMap<>(); map.put(x.get(0), x.size()); return map; })
        .forEach(System.out::println);
}
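Here is the output. Each word prints as its own single-entry map, and the order follows HashMap iteration order, so it is not guaranteed: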
{hive=1}
{hadoop=3}
{hello=2}
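So a single statement is enough to finish this small word count example. Doesn't this coding style look a lot like Scala?
Next, let me split the code above into separate steps: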
@Test
public void test() {
    List<String> list = Arrays.asList("hello", "hadoop", "hive", "hadoop", "hadoop", "hello");
    // Step 1: group identical words, e.g. hadoop -> [hadoop, hadoop, hadoop]
    Map<String, List<String>> collect = list.stream().collect(Collectors.groupingBy(x -> x));
    // Step 2: turn each group into a single-entry map of word -> group size
    Stream<Map<String, Integer>> stream = collect.values().stream().map(x -> {
        Map<String, Integer> map = new HashMap<>();
        map.put(x.get(0), x.size());
        return map;
    });
    // Step 3: print each single-entry map
    stream.forEach(System.out::println);
}
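To see what each step produces, the intermediate values can be printed. The following is only a sketch, not the code from this post: the class name WordCountSteps and its two helper methods are made up for illustration, TreeMap is swapped in purely to make the print order deterministic, and Collections.singletonMap stands in for the hand-built HashMap.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper class, used only to illustrate the two steps.
class WordCountSteps {

    // Step 1: group identical words. TreeMap keeps the keys sorted,
    // which makes the printed order predictable (an assumption of this sketch).
    static Map<String, List<String>> group(List<String> words) {
        return words.stream()
                .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.toList()));
    }

    // Step 2: turn every group into a single-entry map of word -> group size.
    static List<Map<String, Integer>> toCounts(Map<String, List<String>> groups) {
        return groups.values().stream()
                .map(group -> Collections.singletonMap(group.get(0), group.size()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("hello", "hadoop", "hive", "hadoop", "hadoop", "hello");

        Map<String, List<String>> groups = group(words);
        System.out.println(groups);
        // prints {hadoop=[hadoop, hadoop, hadoop], hello=[hello, hello], hive=[hive]}

        System.out.println(toCounts(groups));
        // prints [{hadoop=3}, {hello=2}, {hive=1}]
    }
}
```

Printing the grouped map makes it clear that the count is simply the size of each group's list.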
A more concise version:
List<String> words = Arrays.asList("hello", "hadoop", "hive", "hadoop", "hadoop", "hello");
System.out.println(words.parallelStream().collect(Collectors.groupingBy(it -> it, Collectors.counting())));
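Two details of the groupingBy plus counting approach are easy to miss: Collectors.counting() produces Long values, and the resulting map has no defined order. As a rough sketch (the class WordCountByFrequency and its method names are invented for this example, and sorting by frequency is an extra step that the post itself does not perform), the counts could be typed and ordered like this:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper class for illustration only.
class WordCountByFrequency {

    // Counts each word. Note the value type: counting() yields Long, not Integer.
    static Map<String, Long> count(List<String> words) {
        return words.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    // Re-orders the counts from most to least frequent; ties are broken alphabetically.
    static Map<String, Long> sortByCountDescending(Map<String, Long> counts) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                        .thenComparing(Map.Entry.comparingByKey()))
                .collect(Collectors.toMap(
                        Map.Entry::getKey,
                        Map.Entry::getValue,
                        (a, b) -> a,          // keys are already unique, so this merge never runs
                        LinkedHashMap::new)); // LinkedHashMap preserves the sorted order
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("hello", "hadoop", "hive", "hadoop", "hadoop", "hello");
        System.out.println(sortByCountDescending(count(words)));
        // prints {hadoop=3, hello=2, hive=1}
    }
}
```

For a list this small a sequential stream is enough; parallelStream() pays off only on much larger inputs.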
Or, in the MapReduce style, map each word to a (word, 1) pair and sum the ones per word:
List<String> words = Arrays.asList("hello", "hadoop", "hive", "hadoop", "hadoop", "hello");
System.out.println(words.parallelStream().map(it -> new Tuple<>(it, 1)).collect(Collectors.groupingBy(Tuple::getIt1,
Collectors.summingInt(Tuple::getIt2))));
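// @Data and @AllArgsConstructor come from Lombok (lombok.Data, lombok.AllArgsConstructor);
// they generate getIt1(), getIt2() and the two-argument constructor used above.
@Data
@AllArgsConstructor
class Tuple<T, E> {
    private T it1;
    private E it2;
}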
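If Lombok is not available, the same (word, 1) pairing can be expressed with a JDK type. The sketch below is one possible variant, not the post's own code: it assumes java.util.AbstractMap.SimpleEntry as the pair type in place of Tuple, and the class name WordCountPairs is made up for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical helper class for illustration only.
class WordCountPairs {

    // "Map" phase: emit a (word, 1) pair per word.
    // "Reduce" phase: group the pairs by word and sum the ones.
    static Map<String, Integer> count(List<String> words) {
        return words.stream()
                .map(word -> new SimpleEntry<>(word, 1))
                .collect(Collectors.groupingBy(
                        pair -> pair.getKey(),
                        Collectors.summingInt(pair -> pair.getValue())));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("hello", "hadoop", "hive", "hadoop", "hadoop", "hello");
        // Copy into a TreeMap only so that the printed order is deterministic.
        System.out.println(new TreeMap<>(count(words)));
        // prints {hadoop=3, hello=2, hive=1}
    }
}
```

Unlike Collectors.counting(), Collectors.summingInt produces Integer values, so the result here is a Map<String, Integer>.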
The Python version:
words = ["hello", "hadoop", "hive", "hadoop", "hadoop", "hello"]
print({word: words.count(word) for word in words})