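The previous post analyzed the goals of this case study in detail. This post covers the overall architecture of the implementation and its real-time preprocessing part. For setting up the real-time processing environment, see here.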
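Overall architecture
Real-time preprocessing
1. Preparation
Download the review data file for Music-category products, Musical_Instruments_5.json, from Stanford's open Amazon dataset. Each line of the file is a record like the following example: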
{
"reviewerID": "A2IBPI20UZIR0U",
"asin": "1384719342",
"reviewerName": "Jake",
"helpful": [0, 0],
"reviewText": "Not much to write about here, but it does exactly what it's supposed to. filters out the pop sounds. now my recordings are much more crisp. it is one of the lowest prices pop filters on amazon so might as well buy it, they honestly work the same despite their pricing,",
"overall": 5.0,
"summary": "good",
"unixReviewTime": 1393545600,
"reviewTime": "02 28, 2014"
}
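The fields we care about, and their meanings, are:
Field | Meaning |
---|---|
reviewerID | ID of the reviewer (user) |
asin | Product ID |
reviewText | Text of the user's review of the product |
overall | Rating the user gave the product |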
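2. Overall flow
Musical_Instruments_5.json is an aggregated dataset. To simulate real-time reviews arriving from the web, you can write a small application that sends one review (one line of the file) every three seconds and submits it to Storm's ReviewTopology for processing. The later offline stage has to convert each review's word sequence into a term-frequency vector, so it needs the number of distinct words across all reviews (the vector dimension) and an index for each word (the column that holds that word's frequency). Both are maintained here with the ZK cluster that has already been set up.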
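To make the role of the word count and the per-word index concrete, here is a minimal sketch of the idea. It is not the implementation used in this post: the mapping lives in an in-memory map instead of ZK, and the class name `WordIndexSketch`, its methods, and the naive tokenizer are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class WordIndexSketch {
    // Hypothetical in-memory stand-in for the word -> index mapping
    // that the post maintains in ZK.
    private final Map<String, Integer> indexByWord = new LinkedHashMap<>();

    /** Returns the word's column index, assigning the next free index on first sight. */
    public int indexOf(String word) {
        Integer existing = indexByWord.get(word);
        if (existing != null) {
            return existing;
        }
        int next = indexByWord.size();
        indexByWord.put(word, next);
        return next;
    }

    /** Number of distinct words seen so far, i.e. the vector dimension. */
    public int dimension() {
        return indexByWord.size();
    }

    /** Naive tokenizer (an assumption): lower-case, split on anything except a-z, 0-9 and '. */
    public static String[] tokenize(String text) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).split("[^a-z0-9']+"))
                .filter(w -> !w.isEmpty())
                .toArray(String[]::new);
    }

    /** Registers every word of a review so that each one has a column index. */
    public void register(String reviewText) {
        for (String word : tokenize(reviewText)) {
            indexOf(word);
        }
    }

    /** Builds the term-frequency vector of a review; words never registered are ignored. */
    public int[] toTermFrequencyVector(String reviewText) {
        int[] vector = new int[dimension()];
        for (String word : tokenize(reviewText)) {
            Integer column = indexByWord.get(word);
            if (column != null) {
                vector[column]++;
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        WordIndexSketch vocab = new WordIndexSketch();
        vocab.register("good filter, good price");
        vocab.register("Works as a pop filter");
        System.out.println(vocab.dimension());
        System.out.println(Arrays.toString(vocab.toTermFrequencyVector("good good filter")));
    }
}
```

In a distributed topology several workers may meet the same new word at the same time, so the index assignment has to be coordinated somewhere shared. That is presumably why the mapping is kept in ZK here rather than in a local map like the one in this sketch.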
3. ReviewTopology implementation
(1) Definition of ReviewTopology
public class ReviewTopology {
private static final String zks = "zk01:2181,zk02:2181,zk03:2181";
private static final String topic = "review-topic";
private static final String zkRoot = "/topics";
private static final String id = "musicReview";
public static void main(String[] args) {
SpoutConfig spoutConf =