Our project needs a recommendation feature, and recommendations are latency-sensitive. Spark would introduce a noticeable delay (Spark Streaming's smallest time window is 1 s), so we decided to use Storm for the real-time processing. The rough architecture is:

Spark SQL can handle the (offline) data cleaning, Kafka acts as the message queue (the data source), and Storm does the stream processing. The recommendation model is item-based collaborative filtering (ItemBased CF). Each user's recommendation results can be stored in Redis as key-value pairs, with the user ID as the key and a JSON document as the value, so the backend can read them straight from Redis.
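As a rough sketch of that last step, assuming the Jedis client and a hypothetical `rec:<userId>` key layout (neither is prescribed by the architecture above), the write and read sides could look like:

```java
import redis.clients.jedis.Jedis;

public class RecommendationStore {
    private final Jedis jedis = new Jedis("localhost", 6379);

    // Written by the model side: the value is the JSON document produced by the
    // item-based CF model for this user.
    public void save(String userId, String recommendationsJson) {
        jedis.set("rec:" + userId, recommendationsJson);
    }

    // Read by the backend when serving recommendations.
    public String load(String userId) {
        return jedis.get("rec:" + userId);
    }
}
```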
* storm-kafka uses Kafka's low-level consumer API and writes the offsets into ZooKeeper. My topic has 5 partitions, which show up in ZooKeeper as nodes 0 through 4. To inspect one partition's data from the ZooKeeper CLI:
```
get /zkkafkaspout/kafkaspout/partition_3
```

The stored data looks like this:

```json
{"topology":{"id":"f4df2f64-5207-4713-800f-9a87633aa37e","name":"Topo"},"offset":101235,"partition":3,"broker":{"host":"localhost","port":9092},"topic":"testPartion"}
```
1. Integrating Storm and Kafka

```java
BrokerHosts brokerHosts = new ZkHosts("localhost:2181");
// SpoutConfig extends KafkaConfig and is serializable; by default the offset is
// written to ZooKeeper every 60 seconds (see the ZkHosts source).
SpoutConfig spoutConfig = new SpoutConfig(brokerHosts, "testPartion", "/zkkafkaspout", "kafkaspout");
spoutConfig.zkPort = 2181;
List<String> zkServers = new ArrayList<String>();
zkServers.add("localhost");
spoutConfig.zkServers = zkServers;

Config config = new Config();
Map<String, String> map = new HashMap<String, String>();
map.put("metadata.broker.list", "localhost:9092");
map.put("serializer.class", "kafka.serializer.StringEncoder");
config.put("kafka.broker.properties", map);

// SchemeAsMultiScheme implements the MultiScheme interface; its constructor takes a
// Scheme (from storm-core). MessageScheme, shown below, is our implementation of it.
spoutConfig.scheme = new SchemeAsMultiScheme(new MessageScheme());
TopologyBuilder builder = new TopologyBuilder();
```
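The snippet stops after creating the TopologyBuilder. As a minimal sketch of the remaining wiring (the component IDs, the parallelism of 1, and the local-mode submission are my assumptions, not from the original post):

```java
// Attach the KafkaSpout driven by spoutConfig, feed it into the bolt defined
// below, and run the topology in local mode for testing.
builder.setSpout("kafkaspout", new KafkaSpout(spoutConfig), 1);
builder.setBolt("senqueceBolt", new SenqueceBolt(), 1).shuffleGrouping("kafkaspout");

LocalCluster cluster = new LocalCluster();
cluster.submitTopology("Topo", config, builder.createTopology());
```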
MessageScheme

```java
import backtype.storm.spout.Scheme;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

import java.io.UnsupportedEncodingException;
import java.util.List;

// Implements the Scheme interface. It does nothing beyond decoding the raw bytes
// read from Kafka as a UTF-8 string and handing them on to the next bolt.
public class MessageScheme implements Scheme {

    @Override
    public List<Object> deserialize(byte[] bytes) {
        try {
            String msg = new String(bytes, "utf-8");
            return new Values(msg);
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
        return null;
    }

    @Override
    public Fields getOutputFields() {
        return new Fields("msg");
    }
}
```

SenqueceBolt

```java
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class SenqueceBolt extends BaseBasicBolt {

    @Override
    public void execute(Tuple tuple, BasicOutputCollector basicOutputCollector) {
        String word = tuple.getStringByField("msg");
        String out = "Message From Kafka:" + word + "!" + "\r\n";
        // Append the data from Kafka to a file; it could just as well be written to
        // Redis, HBase, or Elasticsearch. The idea is that Kafka can carry JSON,
        // and this bolt is the place to parse it.
        try {
            FileOutputStream fileOutputStream = new FileOutputStream(new File("/Users/stormKafka.txt"), true);
            fileOutputStream.write(out.getBytes());
            fileOutputStream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println("out=" + out);
        basicOutputCollector.emit(new Values(out));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
        outputFieldsDeclarer.declare(new Fields("message"));
    }
}
```