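Storm is a stream processing framework: a Spout emits one record and a Bolt processes it, which is how it achieves real-time computation.
That per-record model is rarely what the business actually needs. More often we want things like PV and UV for the last 5 minutes, the maximum network latency over the last 10 minutes, or the top 10 most-visited pages in the last 5 minutes.
Problems of this kind can all be abstracted as:
Every M seconds, compute statistics over the data from the last N seconds. In other words, we need a sliding window (a fixed time span) to bound the data we process.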
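Storm (in the backtype.storm releases used here) has no built-in API for this, so we have to implement the sliding window ourselves.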
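Let's first look at a special case, M = N = 5: every 5 seconds, compute the maximum network latency over the last 5 seconds.
In the Bolt we can define a List<Tuple> as a cache, add every tuple to it in execute, and as soon as 5 seconds have passed, take the cached data and run the calculation.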
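List<Tuple> cache = Lists.newArrayList();
long lastTime = System.currentTimeMillis();

public void execute(Tuple tuple) {
    cache.add(tuple);
    long now = System.currentTimeMillis();
    if (now - lastTime >= 5000L) {
        // 5 seconds have passed: compute over the batch, then start a new one
        calculate(cache);
        cache.clear();
        lastTime = now;
    }
}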
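But when M != N, for example computing over the last 20 seconds every 5 seconds, this approach no longer works.
The principle stays the same: we still keep a cache of recent messages, but two more things are needed:
1. A way to submit the tuples as a batch at each configured interval.
2. A storage structure and algorithm that make it easy to fetch the tuples inside the configured sliding window.
For example, suppose we want to compute transaction UV and transaction amount over the last N = 6 seconds, every M = 2 seconds.
The Bolt then has to fetch the tuples of a 6-second window every 2 seconds and process them.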
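To make the target behaviour concrete before building it inside a Bolt, here is a brute-force sketch in plain Java. It is only an illustration: the names `Trade`, `report` and `sample` are hypothetical, the sample data is made up, and the window is assumed to cover the interval (now - N, now].

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration, not part of the Storm code in this post.
class WindowSpecDemo {
    /** One transaction: arrival time in seconds, user id, amount. */
    static final class Trade {
        final int second;
        final String user;
        final long amount;

        Trade(int second, String user, long amount) {
            this.second = second;
            this.user = user;
            this.amount = amount;
        }
    }

    /** Brute force: at time `now`, scan every trade with now - n < second <= now. */
    static String report(List<Trade> trades, int now, int n) {
        Set<String> users = new HashSet<String>();
        long total = 0;
        for (Trade t : trades) {
            if (t.second > now - n && t.second <= now) {
                users.add(t.user);   // UV = number of distinct users
                total += t.amount;
            }
        }
        return "t=" + now + " uv=" + users.size() + " amount=" + total;
    }

    /** Made-up sample data. */
    static List<Trade> sample() {
        List<Trade> trades = new ArrayList<Trade>();
        trades.add(new Trade(1, "alice", 10));
        trades.add(new Trade(3, "bob", 20));
        trades.add(new Trade(4, "alice", 5));
        trades.add(new Trade(8, "carol", 7));
        return trades;
    }

    public static void main(String[] args) {
        int m = 2;  // report every M seconds
        int n = 6;  // over the last N seconds
        for (int now = m; now <= 10; now += m) {
            System.out.println(report(sample(), now, n));
        }
        // t=2 uv=1 amount=10
        // t=4 uv=2 amount=35
        // t=6 uv=2 amount=35
        // t=8 uv=3 amount=32
        // t=10 uv=1 amount=7
    }
}
```

Rescanning everything on each report is fine for a sketch, but a Bolt needs a structure that drops expired data cheaply, which is what the slot-based cache below is for.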
For the first point, Storm provides the tick tuple mechanism:
// in the bolt: ask Storm for a tick tuple every emitFrequencyInSeconds seconds
@Override
public Map<String, Object> getComponentConfiguration() {
    Map<String, Object> conf = new HashMap<String, Object>();
    conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, emitFrequencyInSeconds);
    return conf;
}

// in execute(): a tick tuple triggers the emit
if (TupleHelpers.isTickTuple(tuple)) {
    LOG.debug("Received tick tuple, triggering emit of current window counts");
    emitCurrentWindowCounts();
}
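The second point is the core of the sliding window implementation.
We fix two numbers: the tick frequency, i.e. the interval between batches (2 seconds), and the length of the sliding window (6 seconds).
We divide the sliding window into 6/2 = 3 slots; each slot stores the tuples received during one 2-second interval.
Our cache is a Map<Integer, List<T>>: the key is the slot index and the value is the list of tuples received during that slot's interval.
Each time Storm fires a tick, we fetch the tuples in the current window for the calculation and slide the window along by one slot; the position is tracked with (headSlot, tailSlot).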
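The slot arithmetic can be traced with a small standalone sketch. It is a simplification under stated assumptions: each slot holds only a count instead of a list of tuples, and `SlotRingDemo` and `windowTotals` are hypothetical names used for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the slot ring; slots hold counts, not tuples.
class SlotRingDemo {
    /**
     * arrivals[k] is the number of tuples received during the k-th tick interval.
     * Returns the number of tuples inside the window at each tick.
     */
    static List<Integer> windowTotals(int[] arrivals, int slotNum) {
        int[] slots = new int[slotNum];
        int head = 0;
        List<Integer> totals = new ArrayList<Integer>();
        for (int k = 0; k < arrivals.length; k++) {
            slots[head] += arrivals[k];   // incoming tuples land in the head slot
            int total = 0;
            for (int s : slots) {
                total += s;               // on a tick, the window is all slots together
            }
            totals.add(total);
            head = (head + 1) % slotNum;  // slide: the oldest slot becomes the new head
            slots[head] = 0;              // and its expired contents are wiped
        }
        return totals;
    }

    public static void main(String[] args) {
        // 3 slots, as in the 6-second window with a 2-second tick
        System.out.println(windowTotals(new int[] {3, 1, 4, 1, 5}, 3));
        // prints [3, 4, 8, 6, 10]
    }
}
```

From the fourth tick on, the first interval's 3 tuples have dropped out of the total, because their slot was wiped when it became the head again.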
The sliding-window code is as follows:
SlidingWindowCache.java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SlidingWindowCache<T> implements Serializable {
    private static final long serialVersionUID = 1L;

    // slot index -> tuples received during that slot's interval
    private Map<Integer, List<T>> tupMap = new HashMap<Integer, List<T>>();
    private int headSlot;
    private int tailSlot;
    private int slotNum;

    public SlidingWindowCache(int slotNum) {
        if (slotNum < 2) {
            throw new IllegalArgumentException(
                    "Window length in slots must be at least two (you requested " + slotNum + ")");
        }
        this.slotNum = slotNum;
        for (int i = 0; i < slotNum; i++) {
            tupMap.put(i, null);
        }
        headSlot = 0;
        tailSlot = (headSlot + 1) % this.slotNum;
    }

    // New messages always go into the head slot.
    public void add(T t) {
        List<T> objs = tupMap.get(headSlot);
        if (objs == null) {
            objs = new ArrayList<T>();
        }
        objs.add(t);
        tupMap.put(headSlot, objs);
    }

    /**
     * Returns the messages currently in the window, then advances the window.
     */
    public List<T> getAndAdvanceWindow() {
        int i = headSlot;
        List<T> windowedTuples = new ArrayList<T>();
        if (tupMap.get(i) != null) {
            windowedTuples.addAll(tupMap.get(i));
        }
        // walk the ring once, collecting every other slot
        while ((i = slotAfter(i)) != headSlot) {
            if (tupMap.get(i) != null) {
                windowedTuples.addAll(tupMap.get(i));
            }
        }
        advanceHead();
        return windowedTuples;
    }

    /**
     * Advances the window: the oldest slot (tail) is wiped and becomes the new head.
     */
    private void advanceHead() {
        headSlot = tailSlot;
        wipeSlot(headSlot);
        tailSlot = slotAfter(tailSlot);
    }

    public int slotAfter(int slot) {
        return (slot + 1) % slotNum;
    }

    public void wipeSlot(int slot) {
        tupMap.put(slot, null);
    }
}
WindowedBolt.java
import java.util.HashMap;
import java.util.Map;

import org.apache.log4j.Logger;

import backtype.storm.Config;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

public abstract class WindowedBolt extends BaseRichBolt {
    private static final long serialVersionUID = 8849434942882466073L;
    private static final Logger LOG = Logger.getLogger(WindowedBolt.class);
    private final static int DEFAULT_WINDOW_LEN_IN_SECS = 12;
    private final static int DEFAULT_WINDOW_EMIT_FREQ = 4;

    private int windowLengthInSeconds;
    private int emitFrequencyInSeconds;
    protected SlidingWindowCache<Tuple> cache;

    public WindowedBolt() {
        this(DEFAULT_WINDOW_LEN_IN_SECS, DEFAULT_WINDOW_EMIT_FREQ);
    }

    public WindowedBolt(int windowLenInSecs, int emitFrequency) {
        if (windowLenInSecs % emitFrequency != 0) {
            LOG.warn(String.format(
                    "Window length (%d s) is not a multiple of the emit frequency (%d s)",
                    windowLenInSecs, emitFrequency));
        }
        this.windowLengthInSeconds = windowLenInSecs;
        this.emitFrequencyInSeconds = emitFrequency;
        cache = new SlidingWindowCache<Tuple>(
                getSlots(this.windowLengthInSeconds, this.emitFrequencyInSeconds));
    }

    // Integer division: a remainder is dropped, so the effective window is shorter.
    private int getSlots(int windowLenInSecs, int emitFrequency) {
        return windowLenInSecs / emitFrequency;
    }

    @Override
    public void execute(Tuple tuple) {
        if (TupleHelpers.isTickTuple(tuple)) {
            LOG.info("====>Received tick tuple, triggering emit of current window counts");
            emitCurrentWindowCounts();
        } else {
            emitNormal(tuple);
        }
    }

    // Ordinary tuples are only cached; they are processed on the next tick.
    private void emitNormal(Tuple tuple) {
        cache.add(tuple);
    }

    public abstract void prepare(Map stormConf, TopologyContext context, OutputCollector collector);

    public abstract void emitCurrentWindowCounts();

    public abstract void declareOutputFields(OutputFieldsDeclarer declarer);

    @Override
    public Map<String, Object> getComponentConfiguration() {
        Map<String, Object> conf = new HashMap<String, Object>();
        conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, emitFrequencyInSeconds);
        return conf;
    }
}
WindowedBolt has three abstract methods:
public abstract void prepare(Map stormConf, TopologyContext context, OutputCollector collector);
public abstract void emitCurrentWindowCounts();
public abstract void declareOutputFields(OutputFieldsDeclarer declarer);
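To write a sliding-window Bolt you only need to extend WindowedBolt and implement these three methods. The only difference from writing an ordinary Bolt is emitCurrentWindowCounts. Let's implement a SumupBolt that, every 2 seconds, computes the sum of all the data in a 6-second sliding window.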
SumupBolt.java#emitCurrentWindowCounts
@Override
public void emitCurrentWindowCounts() {
    int sum = 0;
    List<Tuple> windowedTuples = cache.getAndAdvanceWindow();
    Values val = new Values();
    if (windowedTuples != null && windowedTuples.size() != 0) {
        for (Tuple t : windowedTuples) {
            List<Object> objs = t.getValues();
            if (objs != null && objs.size() > 0) {
                val.addAll(objs);
                for (Object obj : objs) {
                    // every field is expected to hold an integer
                    sum += Integer.parseInt(obj.toString());
                }
            }
        }
        LOG.info("array to sum up: " + val.toString());
        collector.emit(new Values(sum + ""));
    }
}
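The summing logic can be checked without a running cluster. The sketch below is a stand-in only: plain lists play the role of what Tuple.getValues() returns, and `SumupCheck`, `sumAll` and `sample` are hypothetical names that do not appear in the Bolt.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in: each inner list mimics the values of one tuple.
class SumupCheck {
    /** Sums every field of every row, parsing each field from its string form. */
    static int sumAll(List<List<Object>> rows) {
        int sum = 0;
        for (List<Object> row : rows) {
            if (row == null) {
                continue;  // skip rows without values, as the Bolt does
            }
            for (Object field : row) {
                sum += Integer.parseInt(field.toString());
            }
        }
        return sum;
    }

    /** Made-up window contents: three tuples, one of them with two fields. */
    static List<List<Object>> sample() {
        List<List<Object>> rows = new ArrayList<List<Object>>();
        rows.add(Arrays.<Object>asList(1));
        rows.add(Arrays.<Object>asList("2"));
        rows.add(Arrays.<Object>asList(3, 4));
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(sumAll(sample()));  // prints 10
    }
}
```

Note that Integer.parseInt throws NumberFormatException for any field that is not an integer, so this approach assumes the spout only emits integer values.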
TupleHelpers.java
package com.bj58.learningstorm;

import backtype.storm.Constants;
import backtype.storm.tuple.Tuple;

public final class TupleHelpers {
    private TupleHelpers() {}

    public static boolean isTickTuple(Tuple tuple) {
        return tuple.getSourceComponent().equals(Constants.SYSTEM_COMPONENT_ID)
                && tuple.getSourceStreamId().equals(Constants.SYSTEM_TICK_STREAM_ID);
    }
}
Building the Topology
TopologyBuilder builder = new TopologyBuilder();
String spoutId = "numberGenerator";
String sumup = "sumup";
builder.setSpout(spoutId, new NumberSpout(), 2);
// window length 6 seconds, emit every 2 seconds
builder.setBolt(sumup, new SumupBolt(6, 2), 1).fieldsGrouping(spoutId, new Fields("number"));
Package it as a jar and run it.
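Looking at the example above, what caught my attention is that Storm uses TupleHelpers.isTickTuple(tuple) to decide whether the current window's data should be emitted. The criterion confused me at first: it checks whether the tuple comes from the "__system" component and the "__tick" stream.
As someone who does not know Storm well, I was really puzzled. Aren't all tuples emitted by the upstream spout? Where would a tuple with a different origin come from?
So I started guessing: is there a hidden spout? Or does RollingCountBolt send itself some special tuple?
Just when I had no clue at all, a miracle happened: I hovered the mouse over the constant Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS and a one-line tooltip appeared: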
How often a tick tuple from the "__system" component and "__tick" stream should be sent to tasks. Meant to be used as a component-specific configuration.
Oh, so in the getComponentConfiguration() method, the line
conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, emitFrequencyInSeconds);
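tells the system to generate tuples from the "__system" component on the "__tick" stream at the frequency given by emitFrequencyInSeconds and deliver them to the tasks.
So now we know that the "__system" component periodically sends tuples on the "__tick" stream, and we can use this to decide when to emit our window data.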
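The idea that tick tuples arrive in the same execute() call as ordinary data, and differ only in where they come from, can be sketched without Storm. Everything here is a stand-in: `Msg` is a hypothetical substitute for Tuple, the two constants copy the "__system" and "__tick" strings quoted in the tooltip, and for brevity the sketch empties its buffer on every tick (the M = N case) instead of keeping a sliding window.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a bolt that receives data and tick messages interleaved.
class TickDemo {
    static final String SYSTEM_COMPONENT_ID = "__system";
    static final String SYSTEM_TICK_STREAM_ID = "__tick";

    /** Minimal substitute for Tuple: where it came from, plus one integer value. */
    static final class Msg {
        final String sourceComponent;
        final String streamId;
        final int value;

        Msg(String sourceComponent, String streamId, int value) {
            this.sourceComponent = sourceComponent;
            this.streamId = streamId;
            this.value = value;
        }
    }

    /** Same test as TupleHelpers.isTickTuple, applied to the stand-in type. */
    static boolean isTick(Msg m) {
        return m.sourceComponent.equals(SYSTEM_COMPONENT_ID)
                && m.streamId.equals(SYSTEM_TICK_STREAM_ID);
    }

    /** Accumulates data values; each tick emits the pending sum and resets it. */
    static List<Integer> run(List<Msg> input) {
        List<Integer> emitted = new ArrayList<Integer>();
        int pending = 0;
        for (Msg m : input) {
            if (isTick(m)) {
                emitted.add(pending);
                pending = 0;
            } else {
                pending += m.value;
            }
        }
        return emitted;
    }

    /** Made-up input: two data messages, a tick, one data message, a tick. */
    static List<Msg> sample() {
        List<Msg> input = new ArrayList<Msg>();
        input.add(new Msg("numberGenerator", "default", 1));
        input.add(new Msg("numberGenerator", "default", 2));
        input.add(new Msg(SYSTEM_COMPONENT_ID, SYSTEM_TICK_STREAM_ID, 0));
        input.add(new Msg("numberGenerator", "default", 5));
        input.add(new Msg(SYSTEM_COMPONENT_ID, SYSTEM_TICK_STREAM_ID, 0));
        return input;
    }

    public static void main(String[] args) {
        System.out.println(run(sample()));  // prints [3, 5]
    }
}
```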