Topology
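1. Define two spouts, genderSpout and ageSpout.
Their Fields are ("id", "gender") and ("id", "age"). Conceptually the join combines them into ("id", "gender", "age"), although the code below requests only ("gender", "age") as output fields.
2. When constructing SingleJoinBolt, pass outFields as a parameter to tell the bolt which fields the joined result should contain.
Both spouts are connected with fieldsGrouping on Fields("id"), which guarantees that tuples with the same id are always sent to the same task.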
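import backtype.storm.testing.FeederSpout;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;
import storm.starter.bolt.SingleJoinBolt;

public class SingleJoinExample {
    public static void main(String[] args) {
        FeederSpout genderSpout = new FeederSpout(new Fields("id", "gender"));
        FeederSpout ageSpout = new FeederSpout(new Fields("id", "age"));

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("gender", genderSpout);
        builder.setSpout("age", ageSpout);
        // Both inputs are grouped on "id" so matching tuples reach the same task
        builder.setBolt("join", new SingleJoinBolt(new Fields("gender", "age")))
               .fieldsGrouping("gender", new Fields("id"))
               .fieldsGrouping("age", new Fields("id"));
        // Submitting the topology and feeding the spouts are omitted here
    }
}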
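To see why grouping both streams on "id" brings matching tuples together, here is a minimal sketch that models fields grouping as "hash of the grouping values modulo the number of tasks". This is only an illustration of the idea, not Storm's actual routing code; the class FieldsGroupingSketch and the method chooseTask are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Storm: models fields grouping as hash(key) mod numTasks.
class FieldsGroupingSketch {
    static int chooseTask(List<Object> groupingValues, int numTasks) {
        // floorMod keeps the index non-negative even when hashCode() is negative
        return Math.floorMod(groupingValues.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 4;
        // Only the "id" value is part of the grouping key, so a gender tuple
        // and an age tuple with id 7 produce the same key and the same task.
        int genderTask = chooseTask(Arrays.<Object>asList(7), numTasks);
        int ageTask = chooseTask(Arrays.<Object>asList(7), numTasks);
        System.out.println(genderTask == ageTask); // prints "true"
    }
}
```

Because the other fields (gender, age) are not part of the key, they have no influence on which task receives the tuple.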
SingleJoinBolt
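Because there is no guarantee that the bolt receives all tuples for a given id at the same time, every received tuple has to be cached in memory until all tuples for that id have arrived. Only then can the join be done.
After the join, those tuples can be removed from the cache. But if some tuple for an id is lost, the other tuples for that id would stay cached forever.
The fix is to put a timeout on the cached data: expired entries are deleted, and a fail notification is sent for their tuples.
TimeCacheMap is a good fit for this scenario:
TimeCacheMap<List<Object>, Map<GlobalStreamId, Tuple>>
List<Object>: the values of the join fields, which in the example above is the value of "id". It is a List presumably to support joining on multiple fields.
Map<GlobalStreamId, Tuple>: records which stream each tuple came from.
For this example, the following two entries for the same id are taken out of the TimeCacheMap bucket and then joined: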
{id, {agestream, (id, age)}}
{id, {genderstream, (id, gender)}}
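The shape of this cache can be sketched with plain Java collections. This is a simplified stand-in under stated assumptions: stream names are plain Strings instead of GlobalStreamId, tuples are plain value lists instead of Tuple, and PendingShapeSketch is a hypothetical class name. It also shows why a List works as the key: two lists with equal contents are equal and hash the same.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the pending cache: join key -> (stream name -> tuple values).
class PendingShapeSketch {
    static Map<List<Object>, Map<String, List<Object>>> build() {
        Map<List<Object>, Map<String, List<Object>>> pending = new HashMap<>();

        // Tuple (1, "male") arrives on the gender stream; the join key is [1]
        pending.computeIfAbsent(Arrays.<Object>asList(1), k -> new HashMap<>())
               .put("genderstream", Arrays.<Object>asList(1, "male"));

        // Tuple (1, 20) arrives on the age stream; a separately built key list
        // with equal contents lands in the same entry
        pending.computeIfAbsent(Arrays.<Object>asList(1), k -> new HashMap<>())
               .put("agestream", Arrays.<Object>asList(1, 20));
        return pending;
    }

    public static void main(String[] args) {
        Map<List<Object>, Map<String, List<Object>>> pending = build();
        System.out.println(pending.size());                                // prints "1"
        System.out.println(pending.get(Arrays.<Object>asList(1)).size());  // prints "2"
    }
}
```

A multi-field join would simply use a longer key list, for example [id, region], with no change to the structure.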
1. prepare
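The prepare logic of most bolts is simple, but here it is fairly involved.
a. Set the timeout and the ExpireCallback
The timeout is taken from Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS, which defaults to 30s and can be tuned for the scenario.
You should try to control the order in which the different spouts emit tuples, so that tuples with the same id are received within a short interval. In this example, the data should be sorted by id before being emitted.
Otherwise, if ("id", "gender") is the first tuple emitted by one spout while the matching ("id", "age") is the last tuple emitted by the other, entries will keep timing out.
The ExpireCallback sends a fail notification for every tuple that timed out.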
private class ExpireCallback implements TimeCacheMap.ExpiredCallback<List<Object>, Map<GlobalStreamId, Tuple>> {
    @Override
    public void expire(List<Object> id, Map<GlobalStreamId, Tuple> tuples) {
        for(Tuple tuple: tuples.values()) {
            _collector.fail(tuple);
        }
    }
}
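b. Work out _idFields (the fields that all sources share and that can be used for the join) and _fieldLocations (the mapping from each out field to its source stream, e.g. gender belongs to genderstream)
context.getThisSources() returns the list of spout sources, and getComponentOutputFields returns the fields of each one.
_idFields: the logic is simple. For each source, take its fields and call retainAll against idFields (set intersection). The end result is the set of fields common to all spouts.
_fieldLocations: match _outFields against each spout's fields and record the relation whenever a match is found.
Personally, I think this preparation could simply be specified by the caller through parameters, without going to this much trouble.
For example, the parameters could become ("id", {"gender", genderstream}, {"age", agestream}).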
@Override
public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
    _fieldLocations = new HashMap<String, GlobalStreamId>();
    _collector = collector;
    // Cache entries expire after the topology message timeout
    int timeout = ((Number) conf.get(Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS)).intValue();
    _pending = new TimeCacheMap<List<Object>, Map<GlobalStreamId, Tuple>>(timeout, new ExpireCallback());
    _numSources = context.getThisSources().size();
    Set<String> idFields = null;
    for(GlobalStreamId source: context.getThisSources().keySet()) {
        Fields fields = context.getComponentOutputFields(source.get_componentId(), source.get_streamId());
        Set<String> setFields = new HashSet<String>(fields.toList());
        // Intersect the field sets of all sources to find the join fields
        if(idFields==null) idFields = setFields;
        else idFields.retainAll(setFields);

        // Record which source stream provides each requested out field
        for(String outfield: _outFields) {
            for(String sourcefield: fields) {
                if(outfield.equals(sourcefield)) {
                    _fieldLocations.put(outfield, source);
                }
            }
        }
    }
    _idFields = new Fields(new ArrayList<String>(idFields));

    if(_fieldLocations.size()!=_outFields.size()) {
        throw new RuntimeException("Cannot find all outfields among sources");
    }
}
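The two computations in prepare can be tried out without Storm. The sketch below repeats the retainAll intersection and the out-field lookup on plain collections. Source streams are identified by String names rather than GlobalStreamId, and CommonFieldsSketch with its methods is a hypothetical illustration, not code from storm-starter.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of how the join fields and field locations are derived.
class CommonFieldsSketch {
    // Intersect the field lists of all sources, as the retainAll loop does
    static Set<String> commonFields(Map<String, List<String>> sources) {
        Set<String> common = null;
        for (List<String> fields : sources.values()) {
            Set<String> set = new HashSet<>(fields);
            if (common == null) common = set;
            else common.retainAll(set);
        }
        return common == null ? new HashSet<String>() : common;
    }

    // Map every requested out field to the name of the source that declares it
    static Map<String, String> fieldLocations(List<String> outFields, Map<String, List<String>> sources) {
        Map<String, String> locations = new HashMap<>();
        for (Map.Entry<String, List<String>> source : sources.entrySet()) {
            for (String out : outFields) {
                if (source.getValue().contains(out)) locations.put(out, source.getKey());
            }
        }
        if (locations.size() != outFields.size()) {
            throw new IllegalStateException("Cannot find all outfields among sources");
        }
        return locations;
    }

    public static void main(String[] args) {
        Map<String, List<String>> sources = new LinkedHashMap<>();
        sources.put("genderstream", Arrays.asList("id", "gender"));
        sources.put("agestream", Arrays.asList("id", "age"));

        System.out.println(commonFields(sources)); // prints "[id]"
        Map<String, String> loc = fieldLocations(Arrays.asList("gender", "age"), sources);
        System.out.println(loc.get("gender") + ", " + loc.get("age")); // prints "genderstream, agestream"
    }
}
```

Note that if an out field such as "id" were declared by several sources, the last source visited would win, which matches the behavior of the nested loop in prepare.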
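2. execute
a. Extract the _idFields values and the stream id from the tuple.
If _pending (the TimeCacheMap) has no entry for these _idFields values, create a new HashMap for them and put it into the bucket.
b. Fetch the Map<GlobalStreamId, Tuple> parts for these _idFields values and check whether the tuple just received is invalid (a second tuple with the same id emitted from the same stream).
Put the new tuple into the map for these _idFields values: parts.put(streamId, tuple);
c. Check whether the size of parts equals the number of spout sources (2 in this example), meaning that the tuples from both genderstream and agestream have been received.
If so, remove the cached data for these _idFields values from _pending (the TimeCacheMap), because the join can now be done and there is no need to keep waiting.
Then, following _outFields and _fieldLocations, read the value of each out field from the tuple of the corresponding stream.
Finally emit the result with _collector.emit(new ArrayList<Tuple>(parts.values()), joinResult): the anchors are the input tuples (id, gender) and (id, age), and the values are (gender, age), in the order given by _outFields.
Ack all the tuples.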
@Override
public void execute(Tuple tuple) {
    List<Object> id = tuple.select(_idFields);
    GlobalStreamId streamId = new GlobalStreamId(tuple.getSourceComponent(), tuple.getSourceStreamId());
    if(!_pending.containsKey(id)) {
        _pending.put(id, new HashMap<GlobalStreamId, Tuple>());
    }
    Map<GlobalStreamId, Tuple> parts = _pending.get(id);
    // Each stream may contribute at most one tuple per id
    if(parts.containsKey(streamId)) throw new RuntimeException("Received same side of single join twice");
    parts.put(streamId, tuple);
    if(parts.size()==_numSources) {
        // All sides have arrived: stop caching and build the joined values
        _pending.remove(id);
        List<Object> joinResult = new ArrayList<Object>();
        for(String outField: _outFields) {
            GlobalStreamId loc = _fieldLocations.get(outField);
            joinResult.add(parts.get(loc).getValueByField(outField));
        }
        // Anchor the output to every input tuple, then ack the inputs
        _collector.emit(new ArrayList<Tuple>(parts.values()), joinResult);

        for(Tuple part: parts.values()) {
            _collector.ack(part);
        }
    }
}
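The buffering logic of execute can be exercised in isolation. The following is a minimal sketch assuming plain maps in place of Storm tuples, a regular HashMap in place of TimeCacheMap (so nothing ever expires), and a direct search through the parts instead of the _fieldLocations lookup. JoinSketch, receive and tuple are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, Storm-free model of the join buffering done in execute.
class JoinSketch {
    private final int numSources;
    private final List<String> outFields;
    // join id -> (stream name -> tuple as field/value map)
    private final Map<Object, Map<String, Map<String, Object>>> pending = new HashMap<>();

    JoinSketch(int numSources, List<String> outFields) {
        this.numSources = numSources;
        this.outFields = outFields;
    }

    // Builds a tuple from alternating field names and values
    static Map<String, Object> tuple(Object... kv) {
        Map<String, Object> t = new HashMap<>();
        for (int i = 0; i < kv.length; i += 2) t.put((String) kv[i], kv[i + 1]);
        return t;
    }

    // Returns the joined values once every source has delivered its side, else null
    List<Object> receive(String stream, Object id, Map<String, Object> tuple) {
        Map<String, Map<String, Object>> parts = pending.computeIfAbsent(id, k -> new HashMap<>());
        if (parts.containsKey(stream)) {
            throw new IllegalStateException("Received same side of single join twice");
        }
        parts.put(stream, tuple);
        if (parts.size() < numSources) return null;

        pending.remove(id); // join is possible, stop caching
        List<Object> result = new ArrayList<>();
        for (String field : outFields) {
            for (Map<String, Object> part : parts.values()) {
                if (part.containsKey(field)) { result.add(part.get(field)); break; }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        JoinSketch join = new JoinSketch(2, Arrays.asList("gender", "age"));
        System.out.println(join.receive("genderstream", 1, tuple("id", 1, "gender", "male"))); // prints "null"
        System.out.println(join.receive("agestream", 1, tuple("id", 1, "age", 20)));           // prints "[male, 20]"
    }
}
```

The first call only buffers the gender side; the second call completes the pair, clears the entry, and returns the values in outFields order.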
TimeCacheMap
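What problem does it solve?
It is common to cache key-value pairs in memory, for example to implement a fast lookup table.
But memory is limited, so we want to keep only the most recent cache entries and let expired key-value pairs be deleted. TimeCacheMap solves exactly this problem: it caches a map (a set of key-value pairs) for a bounded amount of time.
1. Constructor parameters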
TimeCacheMap(int expirationSecs, int numBuckets, ExpiredCallback<K, V> callback)
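First, expirationSecs specifies how long an entry lives before it expires.
Next, numBuckets sets the time granularity. For example, with expirationSecs = 60s and numBuckets = 10, each bucket covers a time window of about 6.67s (60s / (numBuckets-1), following the sleepTime formula in the constructor), and expired data is purged once per window.
Finally, ExpiredCallback<K, V> callback: if something needs to be done with the K,V pairs that time out, define this callback, for example to send a fail notification.
2. Data members
The core structure uses a LinkedList as the bucket list and a HashMap<K, V> for each bucket.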
private LinkedList<HashMap<K, V>> _buckets;
Auxiliary members: the lock object and the periodic cleaner thread.
private final Object _lock = new Object();
private Thread _cleaner;
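3. Constructor
The core of the constructor is starting the _cleaner daemon thread.
The _cleaner logic is simple:
periodically remove the last bucket and add a new bucket at the head of the bucket list; if a callback is defined, call it for every key-value pair that timed out.
For thread safety, the rotation is done under synchronized(_lock).
The only point that needs discussion is sleepTime,
that is, how to guarantee that data is deleted once the configured expirationSecs has passed.
Definition: sleepTime = expirationMillis / (numBuckets-1)
a. If the cleaner has just finished removing the last bucket and adding the first one, a K,V put at that moment expires after
expirationSecs / (numBuckets-1) * numBuckets = expirationSecs * (1 + 1 / (numBuckets-1))
It has to wait for a full numBuckets sleepTimes, so the time is slightly more than expirationSecs.
b. Conversely, if the cleaner starts its clean pass right after the put, the K,V expires after
expirationSecs / (numBuckets-1) * numBuckets - expirationSecs / (numBuckets-1) = expirationSecs
This case waits one sleepTime less than a, which is exactly expirationSecs.
So this scheme guarantees that data is deleted within the time interval [b, a].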
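The two cases can be written compactly. The symbols here are introduced only for this note: T stands for expirationSecs, n for numBuckets, and s for sleepTime.

```latex
s = \frac{T}{n-1}, \qquad
t_a = n \cdot s = T\left(1 + \frac{1}{n-1}\right), \qquad
t_b = (n-1) \cdot s = T
\quad\Longrightarrow\quad
T \;\le\; t_{\text{expire}} \;\le\; T\left(1 + \frac{1}{n-1}\right)
```

For T = 60 and n = 10 this gives s of about 6.67s and an expiry time between 60s and roughly 66.67s. A larger n tightens the upper bound at the cost of more frequent cleaner wake-ups.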
public TimeCacheMap(int expirationSecs, int numBuckets, ExpiredCallback<K, V> callback) {
    if(numBuckets<2) {
        throw new IllegalArgumentException("numBuckets must be >= 2");
    }
    _buckets = new LinkedList<HashMap<K, V>>();
    for(int i=0; i<numBuckets; i++) {
        _buckets.add(new HashMap<K, V>());
    }

    _callback = callback;
    final long expirationMillis = expirationSecs * 1000L;
    final long sleepTime = expirationMillis / (numBuckets-1);
    _cleaner = new Thread(new Runnable() {
        public void run() {
            try {
                while(true) {
                    Map<K, V> dead = null;
                    Time.sleep(sleepTime);
                    // Rotate under the lock: drop the oldest bucket, add a fresh one
                    synchronized(_lock) {
                        dead = _buckets.removeLast();
                        _buckets.addFirst(new HashMap<K, V>());
                    }
                    // Callbacks run outside the lock
                    if(_callback!=null) {
                        for(Entry<K, V> entry: dead.entrySet()) {
                            _callback.expire(entry.getKey(), entry.getValue());
                        }
                    }
                }
            } catch (InterruptedException ex) {
                // Interrupted: let the cleaner thread exit
            }
        }
    });
    _cleaner.setDaemon(true);
    _cleaner.start();
}
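To put concrete numbers on the constructor's timing, the sketch below repeats the sleepTime computation, including its integer division, and derives the shortest and longest lifetime of an entry from it. ExpiryTimingSketch and its method names are hypothetical; only the formula comes from the code above.

```java
// Hypothetical helper that reproduces the cleaner interval arithmetic.
class ExpiryTimingSketch {
    static long sleepMillis(int expirationSecs, int numBuckets) {
        if (numBuckets < 2) throw new IllegalArgumentException("numBuckets must be >= 2");
        // Same expression as in the constructor; long division truncates
        return expirationSecs * 1000L / (numBuckets - 1);
    }

    // Put happens right before a rotation: numBuckets-1 further sleeps are needed
    static long minLifetimeMillis(int expirationSecs, int numBuckets) {
        return sleepMillis(expirationSecs, numBuckets) * (numBuckets - 1);
    }

    // Put happens right after a rotation: numBuckets full sleeps are needed
    static long maxLifetimeMillis(int expirationSecs, int numBuckets) {
        return sleepMillis(expirationSecs, numBuckets) * numBuckets;
    }

    public static void main(String[] args) {
        System.out.println(sleepMillis(60, 10));       // prints "6666"
        System.out.println(minLifetimeMillis(60, 10)); // prints "59994"
        System.out.println(maxLifetimeMillis(60, 10)); // prints "66660"
    }
}
```

Because of the truncating division, the shortest lifetime can fall a few milliseconds below expirationSecs (59994 ms here); thread scheduling delays are ignored in this model.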
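4. Other operations
First, every operation uses synchronized(_lock) to guarantee mutual exclusion between threads.
Second, every operation is O(numBuckets): each bucket is a HashMap, so the work done per bucket is O(1).
The most important one is put. It puts the new k,v only into the first (newest) bucket and removes any cached data with the same key from the older buckets.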
public void put(K key, V value) {
    synchronized(_lock) {
        Iterator<HashMap<K, V>> it = _buckets.iterator();
        HashMap<K, V> bucket = it.next();
        bucket.put(key, value);
        while(it.hasNext()) {
            bucket = it.next();
            bucket.remove(key);
        }
    }
}
The following operations are also supported:
public boolean containsKey(K key)
public V get(K key)
public Object remove(K key)
public int size() // sums the sizes of the HashMaps of all buckets
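How these operations could walk the bucket list is sketched below. This is not the TimeCacheMap source: BucketMapSketch is a hypothetical, simplified class in which the cleaner thread is replaced by a rotate() method called by hand, so the behavior is deterministic, and remove returns V rather than Object.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.Map;

// Hypothetical bucketed map with manual rotation instead of a cleaner thread.
class BucketMapSketch<K, V> {
    private final LinkedList<HashMap<K, V>> buckets = new LinkedList<>();

    BucketMapSketch(int numBuckets) {
        for (int i = 0; i < numBuckets; i++) buckets.add(new HashMap<K, V>());
    }

    synchronized void put(K key, V value) {
        Iterator<HashMap<K, V>> it = buckets.iterator();
        it.next().put(key, value);                  // newest bucket gets the entry
        while (it.hasNext()) it.next().remove(key); // stale copies are dropped
    }

    synchronized boolean containsKey(K key) {
        for (HashMap<K, V> b : buckets) if (b.containsKey(key)) return true;
        return false;
    }

    synchronized V get(K key) {
        for (HashMap<K, V> b : buckets) if (b.containsKey(key)) return b.get(key);
        return null;
    }

    synchronized V remove(K key) {
        for (HashMap<K, V> b : buckets) if (b.containsKey(key)) return b.remove(key);
        return null;
    }

    synchronized int size() {
        int total = 0;
        for (HashMap<K, V> b : buckets) total += b.size();
        return total;
    }

    // One cleaner tick: drop the oldest bucket and return its expired entries
    synchronized Map<K, V> rotate() {
        Map<K, V> dead = buckets.removeLast();
        buckets.addFirst(new HashMap<K, V>());
        return dead;
    }

    public static void main(String[] args) {
        BucketMapSketch<String, Integer> map = new BucketMapSketch<>(3);
        map.put("a", 1);
        map.rotate();
        map.put("b", 2);
        System.out.println(map.size());           // prints "2"
        map.rotate();
        System.out.println(map.rotate());         // prints "{a=1}"
        System.out.println(map.containsKey("a")); // prints "false"
        System.out.println(map.get("b"));         // prints "2"
    }
}
```

With three buckets, "a" is expired by the third rotation after its put, while "b", inserted one rotation later, is still readable.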