For example, given the following text file:
hello spark
hello spark
hello flink
hello spark
we want each word to be output exactly once.
The implementation is as follows:
//Read a text file, split each line into individual words, and output each distinct word once
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.RichFilterFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;
import redis.clients.jedis.Jedis;

public class FlinkTest02 {
    public static void main(String[] args) throws Exception {
        //Get the Flink execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        //Set the parallelism to 1
        env.setParallelism(1);
        //Read the file
        DataStreamSource<String> lineDS = env.readTextFile("input");
        //flatMap: split each line into individual words
        SingleOutputStreamOperator<String> wordDS = lineDS.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public void flatMap(String s, Collector<String> collector) throws Exception {
                String[] words = s.split(" ");
                for (String word : words) {
                    collector.collect(word);
                }
            }
        });
        //Deduplicate with a Redis-backed filter
        SingleOutputStreamOperator<String> redisDS = wordDS.filter(new RichFilterFunction<String>() {
            Jedis jedis = null;
            //The Redis key under which the set of seen words is stored
            String key = "input";

            @Override
            public void open(Configuration parameters) throws Exception {
                //Open the Redis connection
                jedis = new Jedis("hadoop102", 6379);
            }

            @Override
            public void close() throws Exception {
                //Close the connection
                jedis.close();
            }

            @Override
            public boolean filter(String s) throws Exception {
                //Use a Redis set to check whether this word has been seen before
                Boolean seen = jedis.sismember(key, s);
                if (!seen) {
                    jedis.sadd(key, s);
                }
                //Keep the word only on its first occurrence
                return !seen;
            }
        });
        //Print the result
        redisDS.print();
        //Run the job
        env.execute();
    }
}
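The filter's semantics can be sketched in isolation: the Redis set is just a "seen" set, and a word passes the filter only on its first occurrence. Here is a minimal, self-contained sketch with an in-memory `HashSet` standing in for Redis (the class and variable names are illustrative, not from the original job):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DedupSketch {
    public static void main(String[] args) {
        //The words emitted by the flatMap step for the sample input
        String[] words = {"hello", "spark", "hello", "spark",
                          "hello", "flink", "hello", "spark"};
        //In-memory stand-in for the Redis set stored under key "input"
        Set<String> seen = new HashSet<>();
        List<String> out = new ArrayList<>();
        for (String w : words) {
            //Mirrors filter(): keep w only if it has not been seen before
            if (seen.add(w)) {
                out.add(w);
            }
        }
        System.out.println(out); // [hello, spark, flink]
    }
}
```

`Set.add` returns `true` only when the element was absent, which folds the membership check and the insert into one call. In the Redis version the same effect could likely be achieved in a single round trip, since `jedis.sadd` returns the number of elements actually added to the set.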
The output is:
hello
spark
flink