Topology
1. Define two spouts, genderSpout and ageSpout.
Their output fields are ("id", "gender") and ("id", "age"); the joined result should be ("id", "gender", "age").
2. When setting up SingleJoinBolt, pass the outFields as a parameter, i.e. tell the bolt which fields the joined result should contain.
Both spouts use fieldsGrouping on Fields("id"), which guarantees that tuples with the same id are sent to the same task.
public class SingleJoinExample {
    public static void main(String[] args) {
        FeederSpout genderSpout = new FeederSpout(new Fields("id", "gender"));
        FeederSpout ageSpout = new FeederSpout(new Fields("id", "age"));

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("gender", genderSpout);
        builder.setSpout("age", ageSpout);
        builder.setBolt("join", new SingleJoinBolt(new Fields("gender", "age")))
               .fieldsGrouping("gender", new Fields("id"))
               .fieldsGrouping("age", new Fields("id"));
    }
}
SingleJoinBolt
Since there is no guarantee the bolt receives all tuples for a given id at the same time, it must cache received tuples in memory until all tuples for that id have arrived, and only then perform the join.
After the join, those tuples can be removed from the cache. But if some tuples for an id are lost, the remaining tuples for that id would stay cached forever.
The fix is to put a timeout on the cached data: once it expires, delete it and send fail notifications for those tuples.
Clearly, TimeCacheMap is a perfect fit for this scenario:
TimeCacheMap<List<Object>, Map<GlobalStreamId, Tuple>>
List<Object> is the field(s) being joined on; for the example above, "id". It is a List presumably to support joining on multiple fields.
Map<GlobalStreamId, Tuple> records which tuple came from which stream.
For this example, the join takes the following two k,v entries out of the TimeCacheMap's buckets:
{id, {agestream, (id, age)}}
{id, {genderstream, (id, gender)}}
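The merge of those two cached entries can be sketched with plain collections (an illustrative stand-in for what the bolt does, not the actual bolt code; the stream names and field layout are taken from the example above):

```java
import java.util.*;

public class JoinSketch {
    // Merge the cached per-stream tuples for one id into the output field order
    static List<Object> join(Map<String, Map<String, Object>> parts,
                             Map<String, String> fieldLocations,
                             List<String> outFields) {
        List<Object> result = new ArrayList<>();
        for (String outField : outFields) {
            // fieldLocations tells us which stream's tuple holds this field
            result.add(parts.get(fieldLocations.get(outField)).get(outField));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Object>> parts = new HashMap<>();
        parts.put("genderstream", Map.of("id", 1, "gender", "male"));
        parts.put("agestream", Map.of("id", 1, "age", 30));
        Map<String, String> locations = Map.of("gender", "genderstream", "age", "agestream");
        System.out.println(join(parts, locations, List.of("gender", "age"))); // [male, 30]
    }
}
```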
1. prepare
prepare logic is usually trivial, but here it is fairly involved...
a. Set the timeout and the ExpireCallback
The timeout used is Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS, 30s by default; it can be tuned per scenario.
You should try to keep the emit order across the spouts such that tuples with the same id arrive within a short interval, e.g. in this example, sort by id before emitting.
Otherwise, if ("id", "gender") is emitted first while ("id", "age") is emitted last, you get constant timeouts.
The ExpireCallback sends a fail notification for every tuple that times out:
private class ExpireCallback implements TimeCacheMap.ExpiredCallback<List<Object>, Map<GlobalStreamId, Tuple>> {
    @Override
    public void expire(List<Object> id, Map<GlobalStreamId, Tuple> tuples) {
        for(Tuple tuple: tuples.values()) {
            _collector.fail(tuple);
        }
    }
}
b. Derive _idFields (the fields common to all sources, usable as the join key) and _fieldLocations (which source stream each outField belongs to, e.g. gender comes from the gender stream)
Get the list of spout sources via context.getThisSources(), and each source's fields via getComponentOutputFields.
_idFields: straightforward logic, retainAll (set intersection) the new fields against idFields on each iteration, ending up with the fields common to all spouts.
_fieldLocations: match _outFields against each source's fields and record the mapping when found.
Frankly, I think this preparation work could simply be specified via parameters at call time, instead of being derived so laboriously,
e.g. with parameters like ("id", {"gender", genderstream}, {"age", agestream}).
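A minimal sketch of that alternative (purely hypothetical; JoinSpec and its fields are invented here, not part of Storm):

```java
import java.util.*;

// Hypothetical alternative: pass the join key and each output field's source
// stream explicitly, so prepare() would not have to derive them by scanning
// context.getThisSources(). JoinSpec is invented for illustration only.
public class JoinSpec {
    final List<String> idFields;             // e.g. ["id"]
    final Map<String, String> fieldToStream; // e.g. {"gender" -> gender stream, "age" -> age stream}

    JoinSpec(List<String> idFields, Map<String, String> fieldToStream) {
        this.idFields = idFields;
        this.fieldToStream = fieldToStream;
    }

    public static void main(String[] args) {
        JoinSpec spec = new JoinSpec(List.of("id"),
                Map.of("gender", "gender", "age", "age"));
        System.out.println(spec.fieldToStream.get("age")); // age
    }
}
```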
@Override
public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
_fieldLocations = new HashMap<String, GlobalStreamId>();
_collector = collector;
int timeout = ((Number) conf.get(Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS)).intValue();
_pending = new TimeCacheMap<List<Object>, Map<GlobalStreamId, Tuple>>(timeout, new ExpireCallback());
_numSources = context.getThisSources().size();
Set<String> idFields = null;
for(GlobalStreamId source: context.getThisSources().keySet()) {
Fields fields = context.getComponentOutputFields(source.get_componentId(), source.get_streamId());
Set<String> setFields = new HashSet<String>(fields.toList());
if(idFields==null) idFields = setFields;
else idFields.retainAll(setFields);
for(String outfield: _outFields) {
for(String sourcefield: fields) {
if(outfield.equals(sourcefield)) {
_fieldLocations.put(outfield, source);
}
}
}
}
_idFields = new Fields(new ArrayList<String>(idFields));
if(_fieldLocations.size()!=_outFields.size()) {
throw new RuntimeException("Cannot find all outfields among sources");
}
}
2. execute
a. Extract the _idFields values and the streamId from the tuple.
If _pending (the TimeCacheMap) has no entry for this id yet, create a new HashMap for it and put it into the bucket.
b. Get the Map<GlobalStreamId, Tuple> parts for this id, and check whether the incoming tuple is invalid (a tuple with the same id emitted twice from the same stream).
Put the new tuple into this id's map: parts.put(streamId, tuple);
c. If parts.size() equals the number of spout sources (2 in this example), meaning the tuples from both the gender stream and the age stream have been received:
Remove this id's cached data from _pending, since the join can now proceed and there is no need to keep waiting.
Use _outFields and _fieldLocations to pull each value out of the corresponding stream's tuple.
Finally emit the result, anchored on all the input tuples; for this example the anchors are ((id, gender), (id, age)) and the values follow _outFields order, i.e. (gender, age):
_collector.emit(new ArrayList<Tuple>(parts.values()), joinResult);
Ack all the tuples.
@Override
public void execute(Tuple tuple) {
List<Object> id = tuple.select(_idFields);
GlobalStreamId streamId = new GlobalStreamId(tuple.getSourceComponent(), tuple.getSourceStreamId());
if(!_pending.containsKey(id)) {
_pending.put(id, new HashMap<GlobalStreamId, Tuple>());
}
Map<GlobalStreamId, Tuple> parts = _pending.get(id);
if(parts.containsKey(streamId)) throw new RuntimeException("Received same side of single join twice");
parts.put(streamId, tuple);
if(parts.size()==_numSources) {
_pending.remove(id);
List<Object> joinResult = new ArrayList<Object>();
for(String outField: _outFields) {
GlobalStreamId loc = _fieldLocations.get(outField);
joinResult.add(parts.get(loc).getValueByField(outField));
}
_collector.emit(new ArrayList<Tuple>(parts.values()), joinResult);
for(Tuple part: parts.values()) {
_collector.ack(part);
}
}
}
TimeCacheMap
What problem does it solve?
You often need to cache key-value pairs in memory, e.g. to implement a fast lookup table.
But memory is limited, so you want to keep only the freshest cached entries and let expired key-values be deleted. That is exactly what TimeCacheMap does: it caches a map (a set of k,v) for a bounded time.
1. Constructor parameters
TimeCacheMap(int expirationSecs, int numBuckets, ExpiredCallback<K, V> callback)
First, expirationSecs: how long until entries expire.
Then numBuckets: the time granularity. With expirationSecs = 60s and numBuckets = 10, each bucket covers one sleepTime window of expirationSecs/(numBuckets-1) ≈ 6.7s (see the constructor below), and a purge of expired data happens once per window.
Finally, ExpiredCallback<K, V> callback: define this if something must be done with the timed-out K,V, e.g. sending fail notifications.
2. Data members
The core structure: a LinkedList as the bucket list, with a HashMap<K, V> as each bucket
private LinkedList<HashMap<K, V>> _buckets;
Auxiliary members: the lock object and the periodic cleaner thread
private final Object _lock = new Object();
private Thread _cleaner;
3. Constructor
At its core, it just starts the _cleaner daemon thread.
The cleaner's logic is simple:
periodically remove the last bucket, add a new bucket at the head of the bucket list, and, if a callback is defined, invoke it on every timed-out k,v.
For thread safety, the bucket rotation is done under synchronized(_lock).
The only part worth discussing is sleepTime,
i.e. how to guarantee data is deleted after the configured expirationSecs.
It is defined as: sleepTime = expirationMillis / (numBuckets-1)
a. If a K,V is put right after the cleaner finished removing the last bucket and adding a new first bucket, it must survive a full numBuckets sleepTimes, so it expires after
expirationSecs / (numBuckets-1) * numBuckets = expirationSecs * (1 + 1 / (numBuckets-1))
which is slightly more than expirationSecs.
b. Conversely, if the cleaner starts its clean right after the put, the K,V expires one sleepTime sooner, after
expirationSecs / (numBuckets-1) * numBuckets - expirationSecs / (numBuckets-1) = expirationSecs
which is exactly expirationSecs. So this scheme guarantees data is deleted within the time interval [b, a].
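These bounds can be checked numerically, e.g. for expirationSecs = 60 and numBuckets = 10 (a standalone arithmetic sketch, not Storm code):

```java
public class ExpiryBounds {
    // sleepTime as defined in the constructor: expirationMillis / (numBuckets - 1)
    static long sleepTime(long expirationMillis, int numBuckets) {
        return expirationMillis / (numBuckets - 1);
    }

    public static void main(String[] args) {
        long sleep = sleepTime(60_000, 10);  // 6666 ms per bucket rotation
        // case a: put right after a rotation -> survives numBuckets sleeps
        long maxLifetime = sleep * 10;       // 66660 ms, slightly above 60s
        // case b: rotation right after the put -> one sleep fewer
        long minLifetime = sleep * 9;        // 59994 ms, essentially 60s
        System.out.println(minLifetime + " .. " + maxLifetime);
    }
}
```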
public TimeCacheMap(int expirationSecs, int numBuckets, ExpiredCallback<K, V> callback) {
if(numBuckets<2) {
throw new IllegalArgumentException("numBuckets must be >= 2");
}
_buckets = new LinkedList<HashMap<K, V>>();
for(int i=0; i<numBuckets; i++) {
_buckets.add(new HashMap<K, V>());
}
_callback = callback;
final long expirationMillis = expirationSecs * 1000L;
final long sleepTime = expirationMillis / (numBuckets-1);
_cleaner = new Thread(new Runnable() {
public void run() {
try {
while(true) {
Map<K, V> dead = null;
Time.sleep(sleepTime);
synchronized(_lock) {
dead = _buckets.removeLast();
_buckets.addFirst(new HashMap<K, V>());
}
if(_callback!=null) {
for(Entry<K, V> entry: dead.entrySet()) {
_callback.expire(entry.getKey(), entry.getValue());
}
}
}
} catch (InterruptedException ex) {
}
}
});
_cleaner.setDaemon(true);
_cleaner.start();
}
4. Other operations
First, all operations use synchronized(_lock) for mutual exclusion.
Second, every operation is O(numBuckets), since each bucket is a HashMap and the per-bucket operations are O(1).
The most important is put: it puts the new k,v only into the first (newest) bucket, and removes any cached entry with the same key from the older buckets.
public void put(K key, V value) {
    synchronized(_lock) {
        Iterator<HashMap<K, V>> it = _buckets.iterator();
        HashMap<K, V> bucket = it.next();
        bucket.put(key, value);
        while(it.hasNext()) {
            bucket = it.next();
            bucket.remove(key);
        }
    }
}
The following operations are also supported:
public boolean containsKey(K key)
public V get(K key)
public Object remove(K key)
public int size() // sums the HashMap sizes of all buckets
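containsKey/get/remove follow the same newest-to-oldest bucket scan as put; a simplified stand-in for get (without the real class's cleaner thread or synchronized(_lock)) might look like:

```java
import java.util.*;

public class BucketScan {
    // Simplified stand-in for TimeCacheMap's get: scan buckets from newest to
    // oldest; O(numBuckets) overall, since each HashMap lookup is O(1).
    static <K, V> V get(LinkedList<HashMap<K, V>> buckets, K key) {
        for (HashMap<K, V> bucket : buckets) {
            if (bucket.containsKey(key)) {
                return bucket.get(key);
            }
        }
        return null; // key expired or never inserted
    }

    public static void main(String[] args) {
        LinkedList<HashMap<String, Integer>> buckets = new LinkedList<>();
        buckets.add(new HashMap<>(Map.of("a", 1)));  // newest bucket
        buckets.add(new HashMap<>(Map.of("b", 2)));  // older bucket
        System.out.println(get(buckets, "b")); // 2
    }
}
```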