Real-Time User Search Word Frequency with the DataStream API
This is one module of my graduation project; the source code will be provided later.
1. Module Introduction
The project is a question-search application, and this module tokenizes the text that users search for and counts word frequencies.
Let's start with the module's data pipeline diagram.
While users search for questions, the server side publishes the search data to Kafka. Flink consumes the Kafka data, performs Chinese word segmentation, counts the frequency of each word, and sinks the results to Redis.
2. Implementation
The module is implemented with Flink's DataStream API. First, the relevant Maven dependencies:
<dependencies>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-java</artifactId>
        <version>${flink.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-scala_2.11</artifactId>
        <version>${flink.version}</version>
        <scope>provided</scope>
    </dependency>
    <!-- kafka -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-kafka_2.11</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>1.2.9</version>
    </dependency>
    <!-- ik analyzer -->
    <dependency>
        <groupId>com.janeluo</groupId>
        <artifactId>ikanalyzer</artifactId>
        <version>2012_u6</version>
    </dependency>
    <!-- redis sink -->
    <dependency>
        <groupId>com.github.yang69</groupId>
        <artifactId>flink-connector-redis_2.11</artifactId>
        <version>1.0</version>
    </dependency>
    ...
</dependencies>
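The `${flink.version}` property is assumed to be defined in the POM's `<properties>` section; the exact version is not shown in this post, but given the `_2.11` Scala suffix and the `FlinkKafkaConsumer` API below it would be a 1.x release, for example (the version number here is an assumption):

```xml
<properties>
    <!-- version is an assumption; use whichever 1.x release matches the connectors above -->
    <flink.version>1.9.1</flink.version>
</properties>
```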
The Chinese tokenizer used here is IK Analyzer.
A sample message on the Kafka topic:
{"data": "8-1受集度为q的均布载荷作用的矩形杠杆截面简支梁","ts": "2020-06-06T15:01:39.780Z"}
The Java implementation follows; usage details are in the comments.
public class SearchWordCount {

    private static final String KAFKA_TOPIC_WORD_COUNT = "search_data";
    private static final String REDIS_HOST = "redis";
    private static final Integer REDIS_PORT = 6379;

    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Kafka consumer configuration
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "kafka1:9094");
        properties.setProperty("group.id", "group-flink-word_count");
        FlinkKafkaConsumer<String> kafkaConsumer = new FlinkKafkaConsumer<>(
                KAFKA_TOPIC_WORD_COUNT,
                new SimpleStringSchema(), // deserialize each record as a plain string
                properties);
        // start consuming from the earliest offset
        kafkaConsumer.setStartFromEarliest();
        // add the Kafka source
        DataStream<String> input = env.addSource(kafkaConsumer);
        // word frequency counting
        DataStream<Tuple2<String, Integer>> counts = input
                .flatMap(new WordAnalysis()) // tokenize; see WordAnalysis below
                .keyBy(0)                    // key by field 0 (the word), like a SQL GROUP BY
                .sum(1);                     // sum field 1 (the count)
        // sink to Redis
        FlinkJedisPoolConfig config = new FlinkJedisPoolConfig.Builder().setHost(REDIS_HOST).setPort(REDIS_PORT).build();
        RedisSink<Tuple2<String, Integer>> redisSink = new RedisSink<>(config, new WordCountRedisMapper());
        counts.addSink(redisSink);
        env.execute("SearchLite AnalysisModel FlinkJob");
    }

    public static final class WordAnalysis implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String source, Collector<Tuple2<String, Integer>> out) {
            // parse the JSON message and extract the search text
            String searchSentence = JSONObject.parseObject(source).get("data").toString();
            analysis(searchSentence).forEach(word -> out.collect(new Tuple2<>(word, 1)));
        }
    }

    // Redis mapper: how each (word, count) tuple is written to Redis
    public static final class WordCountRedisMapper implements RedisMapper<Tuple2<String, Integer>> {
        @Override
        public RedisCommandDescription getCommandDescription() {
            // store counts in a Redis hash: HSET word_count <word> <count>
            return new RedisCommandDescription(RedisCommand.HSET, "word_count");
        }
        @Override
        public String getKeyFromData(Tuple2<String, Integer> data) {
            return data.f0;
        }
        @Override
        public int getSecondsFromData(Tuple2<String, Integer> data) {
            return 0;
        }
        @Override
        public String getValueFromData(Tuple2<String, Integer> data) {
            return String.valueOf(data.f1);
        }
    }

    // tokenize a sentence with IK Analyzer
    public static List<String> analysis(String source) {
        List<String> list = new ArrayList<>();
        Analyzer analyzer = new IKAnalyzer(true);
        TokenStream ts = null;
        try {
            ts = analyzer.tokenStream(",", source); // the field-name argument is unused for plain tokenization
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                // keep only tokens longer than one character
                if (term.toString().length() > 1) list.add(term.toString());
            }
            ts.end();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (ts != null) {
                try {
                    ts.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        return list;
    }
}
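Stripped of Flink, the flatMap → keyBy → sum pipeline above is an incremental word count. As a hedged illustration only (plain Java on a finite batch, with whitespace splitting standing in for IK Analyzer's Chinese segmentation), the same accumulation logic looks like this:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LocalWordCount {

    // Simulates flatMap(tokenize) -> keyBy(word) -> sum(count) on a finite batch.
    public static Map<String, Integer> count(List<String> sentences) {
        Map<String, Integer> counts = new HashMap<>();
        for (String sentence : sentences) {
            // whitespace splitting stands in for IK Analyzer's Chinese segmentation
            for (String word : sentence.split("\\s+")) {
                // same filter as analysis(): drop single-character tokens
                if (word.length() > 1) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = count(List.of("beam load beam", "load test"));
        System.out.println(counts); // beam=2, load=2, test=1 (hash order unspecified)
    }
}
```

The difference is that Flink's `sum(1)` emits a running total per key on an unbounded stream, whereas this sketch only produces the final totals for a finite input.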
3. Packaging and Deployment
Packaging and deployment were covered in the previous article; see:
Developing Applications with Flink SQL
4. Results
Here is the data in Redis:
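Since the sink writes every (word, count) pair into a Redis hash named `word_count` (the additional key passed to the HSET command description above), the results can be inspected interactively with redis-cli, for example (the looked-up word is just an illustration):

```shell
# connect to the Redis instance used by the sink
redis-cli -h redis -p 6379
# dump every word together with its count
HGETALL word_count
# or fetch the count for a single word (example field name)
HGET word_count 载荷
```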
And finally, a screenshot from the mini program side:
Note: it has been a while since I graduated and I am only now getting around to writing this up, so please forgive anything that is wrong or incomplete.