Temporal Convolutional Networks for Action Segmentation and Detection
Lea C, Flynn M D, Vidal R, et al. Temporal convolutional networks for action segmentation and detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 156-165.
Motivation
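Action segmentation methods predict which action occurs at every frame of a video.
Detection methods output a sparse set of action segments, where each segment is defined by a start time, an end time, and a class label.
Traditional approaches split the problem into two steps:
first, local spatiotemporal features are extracted from the video frames; then these features are fed into a temporal classifier that captures high-level temporal patterns.
Recent temporal models for the second step fall into three main groups, each with its own drawback:
- Sliding window action detectors: the windows are too short to capture long-range temporal patterns.
- Segmental models: they capture within-segment properties but ignore long-range latent dependencies.
- Recurrent models: their span of attention is limited and they are hard to train correctly.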
Model
Encoder-Decoder-TCN
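Encoder:
$E^{(l)}\in \mathbb{R}^{F_l\times T_l}$ is the activation of the $l$-th encoder layer, with $F_l$ filters and $T_l$ time steps.
Each encoder layer applies a temporal convolution, a nonlinear activation, and max pooling: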
$E^{(l)}=\mathrm{max\_pooling}(f(W*E^{(l-1)}+b))$
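As a rough illustration of the encoder equation, the NumPy sketch below applies one encoder layer to a random feature sequence. The function name `encoder_layer`, the choice of ReLU for $f$, zero "same" padding, and a pooling width of 2 are assumptions made for this example rather than details stated above.

```python
import numpy as np

def encoder_layer(E_prev, W, b, pool=2):
    """One encoder layer: temporal convolution, ReLU, max pooling over time.

    E_prev: (F_in, T) input activations
    W:      (F_out, F_in, d) filters of odd length d
    b:      (F_out,) bias
    """
    F_out, F_in, d = W.shape
    T = E_prev.shape[1]
    half = d // 2
    # Zero-pad both ends so the convolution keeps T time steps.
    padded = np.pad(E_prev, ((0, 0), (half, half)))
    conv = np.empty((F_out, T))
    for t in range(T):
        window = padded[:, t:t + d]  # (F_in, d) slice centred on frame t
        # Cross-correlation, as deep learning libraries compute "convolution".
        conv[:, t] = np.tensordot(W, window, axes=([1, 2], [0, 1])) + b
    act = np.maximum(conv, 0.0)      # f = ReLU (assumed)
    T_out = T // pool
    # Max over non-overlapping windows of `pool` time steps.
    return act[:, :T_out * pool].reshape(F_out, T_out, pool).max(axis=2)

rng = np.random.default_rng(0)
E0 = rng.standard_normal((4, 16))    # F_0 = 4 features, T_0 = 16 frames
W = rng.standard_normal((8, 4, 5))   # F_1 = 8 filters of length 5
E1 = encoder_layer(E0, W, np.zeros(8))
print(E1.shape)
```

Each layer halves the temporal resolution, so stacking layers lets fixed-length filters cover progressively longer spans of the original video.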
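Decoder:
$D^{(l)}\in \mathbb{R}^{F_l\times T_l}$ is the activation of the $l$-th decoder layer.
Each decoder layer applies upsampling, a temporal convolution, and an activation function. The class probabilities for frame $t$ are then: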
$\hat Y_t=\mathrm{softmax}(UD_t^{(1)}+c)$
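The decoder's upsampling step and the per-frame softmax can be sketched in NumPy as follows. The helper names `upsample` and `frame_softmax` are hypothetical, upsampling by repeating each time step is an assumption (it matches what Keras `UpSampling1D` does), and the decoder's temporal convolution is left out to keep the example short.

```python
import numpy as np

def upsample(D, factor=2):
    # Repeat every time step `factor` times along the time axis.
    return np.repeat(D, factor, axis=1)

def frame_softmax(D1, U, c):
    """Per-frame class probabilities softmax(U D_t + c) for every t.

    D1: (F, T) activations, U: (C, F) weights, c: (C,) bias -> (C, T)
    """
    logits = U @ D1 + c[:, None]
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    expd = np.exp(logits)
    return expd / expd.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
D = rng.standard_normal((8, 4))   # F = 8 features, 4 pooled time steps
D_up = upsample(D)                # back to 8 time steps
U = rng.standard_normal((3, 8))   # C = 3 action classes
Y_hat = frame_softmax(D_up, U, np.zeros(3))
print(D_up.shape, Y_hat.shape)
```

Every column of `Y_hat` is a probability distribution over the action classes for one frame.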
Dilated-TCN
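A Dilated TCN is a stack of blocks, and each block is a sequence of $L$ convolutional layers.
$S^{(j,l)}\in \mathbb{R}^{F_w\times T}$: the activations of the $l$-th layer in the $j$-th block.
Each layer consists of a set of dilated convolutions with dilation rate $s$, a nonlinear activation $f$, and a residual connection.
The result of the dilated convolution at time $t$ is: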
$\hat S_t^{(j,l)}=f(W^{(1)}S_{t-s}^{(j,l-1)}+W^{(2)}S_{t}^{(j,l-1)}+b)$
After adding the residual connection, the output of the layer is:
$S_t^{(j,l)}=S_t^{(j,l-1)}+V\hat S_t^{(j,l)}+\epsilon$
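The two equations above (dilated convolution, then residual connection) can be sketched in NumPy as a single layer. The function name `dilated_layer`, the choice of ReLU for $f$, and zero padding before the start of the sequence are assumptions made for this example.

```python
import numpy as np

def dilated_layer(S_prev, W1, W2, b, V, e, s):
    """One dilated layer with a residual connection.

    S_hat_t = f(W1 S_{t-s} + W2 S_t + b)
    S_t     = S_prev_t + V S_hat_t + e
    S_prev: (F_w, T); W1, W2, V: (F_w, F_w); b, e: (F_w,); s: dilation rate
    """
    F_w, T = S_prev.shape
    # Column t of `shifted` holds S_{t-s}; frames before the start are zero.
    shifted = np.zeros_like(S_prev)
    if s < T:
        shifted[:, s:] = S_prev[:, :T - s]
    S_hat = np.maximum(W1 @ shifted + W2 @ S_prev + b[:, None], 0.0)  # f = ReLU (assumed)
    return S_prev + V @ S_hat + e[:, None]

rng = np.random.default_rng(0)
F_w, T = 4, 16
S0 = rng.standard_normal((F_w, T))
S = S0
for s in (1, 2, 4):  # the dilation rate doubles with depth
    W1, W2, V = (0.1 * rng.standard_normal((F_w, F_w)) for _ in range(3))
    S = dilated_layer(S, W1, W2, np.zeros(F_w), V, np.zeros(F_w), s)
print(S.shape)
```

Because of the residual connection, a layer whose $V$ and $\epsilon$ are zero simply passes its input through unchanged.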
The outputs of the blocks are then combined through a set of skip connections:
$Z_t^{(0)}=\mathrm{ReLU}(\sum_{j=1}^{B}S_t^{(j,L)})$
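A latent state $Z_t^{(1)}$ is obtained from $Z_t^{(0)}$ by one more layer followed by a ReLU (the first tail convolution in the Keras code), and the prediction at each time $t$ is: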
$\hat Y_t=\mathrm{softmax}(UZ_t^{(1)}+c)$
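The skip-connection sum and the final prediction can be sketched as below. The text does not spell out how $Z_t^{(1)}$ is computed from $Z_t^{(0)}$; this sketch assumes a single affine layer followed by a ReLU, mirroring the tail convolution in the Keras code. The names `predict`, `V_r`, and `e_r` are introduced here for illustration only.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def predict(block_outputs, V_r, e_r, U, c):
    """block_outputs: list of B arrays S^{(j,L)}, each of shape (F_w, T)."""
    Z0 = relu(sum(block_outputs))           # Z^(0): ReLU of the summed skip connections
    Z1 = relu(V_r @ Z0 + e_r[:, None])      # Z^(1): assumed affine layer + ReLU
    logits = U @ Z1 + c[:, None]
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=0, keepdims=True) # softmax over classes, per frame

rng = np.random.default_rng(0)
F_w, T, C, B = 4, 10, 3, 2
blocks = [rng.standard_normal((F_w, T)) for _ in range(B)]
Y_hat = predict(blocks,
                rng.standard_normal((F_w, F_w)), np.zeros(F_w),
                rng.standard_normal((C, F_w)), np.zeros(C))
print(Y_hat.shape, Y_hat.argmax(axis=0))
```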
What the two models have in common:
How the two models differ:
ED-TCN:
- Efficiently captures long-range temporal patterns;
- Has a relatively small number of layers;
- Each layer contains a set of long convolutional filters.
Dilated-TCN:
- Is adapted from WaveNet, which was developed for speech synthesis;
- Has more layers;
- Each layer uses dilated filters that only operate on a small number of time steps.
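To see why short dilated filters can still cover long time spans, the sketch below counts how many input frames one output frame depends on for a stack of stride-1 dilated convolutions. The helper `receptive_field` and the configuration (2 stacks with dilation rates 1, 2, 4, 8) are hypothetical; the count covers only the dilated layers and ignores the initial convolution in the Keras code.

```python
def receptive_field(kernel_size, dilations):
    """Input frames seen by one output frame of stacked stride-1 dilated convolutions.

    Each layer with dilation s widens the receptive field by (kernel_size - 1) * s.
    """
    return 1 + sum((kernel_size - 1) * s for s in dilations)

# Hypothetical configuration: 2 stacks, dilation rates 1, 2, 4, 8 in each.
dilations = [2 ** i for i in range(4)] * 2

print(receptive_field(3, dilations))  # acausal 3-tap filters
print(receptive_field(2, dilations))  # causal 2-tap filters
print(receptive_field(3, [1] * 8))    # same depth without dilation
```

With the same number of layers, the dilated stack sees several times more frames than the undilated one, even though each filter touches only two or three time steps.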
Causal vs. acausal convolution:
Causal convolution: the prediction at time $t$ is a function only of the data from time steps 1 to $t$.
ED-TCN:
convolve from $X_{t-d}$ to $X_t$
Dilated-TCN:
$\hat S_t^{(j,l)}=f(W^{(1)}S_{t-s}^{(j,l-1)}+W^{(2)}S_{t}^{(j,l-1)}+b)$
Acausal (non-causal) convolution: the prediction at time $t$ may be a function of the data at any time step in the sequence.
ED-TCN:
convolve from $X_{t-\frac{d}{2}}$ to $X_{t+\frac{d}{2}}$
Dilated-TCN:
$\hat S_t^{(j,l)}=f(W^{(1)}S_{t-s}^{(j,l-1)}+W^{(2)}S_{t}^{(j,l-1)}+W^{(3)}S_{t+s}^{(j,l-1)}+b)$
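The practical difference between the two Dilated-TCN variants can be checked numerically: change one input frame and see which output frames move. In this sketch the function name `dilated_conv`, the use of tanh for $f$, and zero padding outside the sequence are assumptions; passing `W3=None` gives the causal form.

```python
import numpy as np

def dilated_conv(S, W1, W2, b, s, W3=None):
    """f(W1 S_{t-s} + W2 S_t [+ W3 S_{t+s}] + b) for every frame t, with f = tanh."""
    F, T = S.shape
    past = np.zeros_like(S)
    past[:, s:] = S[:, :T - s]        # column t holds S_{t-s}
    pre = W1 @ past + W2 @ S + b[:, None]
    if W3 is not None:
        future = np.zeros_like(S)
        future[:, :T - s] = S[:, s:]  # column t holds S_{t+s}
        pre = pre + W3 @ future
    return np.tanh(pre)

rng = np.random.default_rng(0)
F, T, s, t0 = 4, 16, 2, 10
W1, W2, W3 = (0.3 * rng.standard_normal((F, F)) for _ in range(3))
b = np.zeros(F)
S = rng.standard_normal((F, T))
S_changed = S.copy()
S_changed[:, t0] += 1.0               # perturb a single input frame

causal_diff = np.abs(dilated_conv(S, W1, W2, b, s)
                     - dilated_conv(S_changed, W1, W2, b, s)).max(axis=0)
acausal_diff = np.abs(dilated_conv(S, W1, W2, b, s, W3)
                      - dilated_conv(S_changed, W1, W2, b, s, W3)).max(axis=0)

print("causal frames affected: ", np.nonzero(causal_diff > 1e-12)[0])
print("acausal frames affected:", np.nonzero(acausal_diff > 1e-12)[0])
```

In the causal form only frame `t0` and the later frame `t0 + s` react to the change, while the acausal form also changes the earlier frame `t0 - s`, which is why it cannot be used for online prediction.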
Experiments
The F1 score is used to evaluate both the segmentation task and the detection task:
$P=\frac{TP}{TP+FP},\quad R=\frac{TP}{TP+FN}$
$F_1=2\,\frac{P\cdot R}{P+R}$
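A predicted segment counts as a TP (true positive) if its IoU with the ground-truth segment is above a threshold; otherwise it counts as an FP (false positive).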
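A minimal pure-Python sketch of this segment-level F1 is given below. The helper names `segments` and `f1_at_iou` are hypothetical. The rule that each ground-truth segment can be matched by at most one prediction, and that unmatched ground-truth segments count as false negatives, is an assumption of this sketch rather than something stated above.

```python
def segments(labels):
    """Split per-frame labels into (label, start, end) runs; end is exclusive."""
    segs, start = [], 0
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            segs.append((labels[start], start, t))
            start = t
    return segs

def f1_at_iou(pred, true, threshold=0.5):
    pred_segs, true_segs = segments(pred), segments(true)
    matched = [False] * len(true_segs)
    tp = fp = 0
    for label, ps, pe in pred_segs:
        best_iou, best_j = 0.0, -1
        for j, (tl, ts, te) in enumerate(true_segs):
            if tl != label:
                continue
            inter = max(0, min(pe, te) - max(ps, ts))
            union = max(pe, te) - min(ps, ts)  # equals the union whenever inter > 0
            iou = inter / union
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_iou > threshold and not matched[best_j]:
            tp += 1
            matched[best_j] = True             # assumed: one TP per true segment
        else:
            fp += 1
    fn = matched.count(False)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

true = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
pred = [0, 0, 0, 1, 1, 1, 1, 1, 1, 2]
for k in (0.25, 0.6, 0.7):
    print(k, round(f1_at_iou(pred, true, k), 3))
```

In this toy example the three predicted segments have IoUs of 0.75, about 0.67, and 0.5 with their ground-truth counterparts, so raising the threshold turns true positives into false positives and lowers the score.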
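Datasets:
The models are reported to achieve state-of-the-art results on every dataset.
See the original paper for the remaining experimental details.
Part of the Keras implementation of the models: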
# Note: this listing is written against the Keras 1.x API
# (border_mode, Merge, AtrousConvolution1D, Model(input=..., output=...)).
import numpy as np
from keras.models import Sequential, Model
from keras.layers import Input, Dense, TimeDistributed, merge, Lambda
from keras.layers.core import *
from keras.layers.convolutional import *
from keras.layers.recurrent import *
import tensorflow as tf
from keras import backend as K
from keras.activations import relu
from functools import partial


def channel_normalization(x):
    # Normalize each time step by its largest absolute activation across channels
    max_values = K.max(K.abs(x), 2, keepdims=True) + 1e-5
    out = x / max_values
    return out


def WaveNet_activation(x):
    # WaveNet gated activation: tanh(x) * sigmoid(x)
    tanh_out = Activation('tanh')(x)
    sigm_out = Activation('sigmoid')(x)
    return Merge(mode='mul')([tanh_out, sigm_out])


# -------------------------------------------------------------
def ED_TCN(n_nodes, conv_len, n_classes, n_feat, max_len,
           loss='categorical_crossentropy', causal=False,
           optimizer="rmsprop", activation='norm_relu',
           return_param_str=False):
    n_layers = len(n_nodes)

    inputs = Input(shape=(max_len, n_feat))  # [T, F]
    model = inputs

    # ---- Encoder ----
    for i in range(n_layers):
        # Pad beginning of sequence to prevent usage of future data
        if causal:
            model = ZeroPadding1D((conv_len // 2, 0))(model)
        model = Convolution1D(n_nodes[i], conv_len, border_mode='same')(model)
        if causal:
            model = Cropping1D((0, conv_len // 2))(model)

        model = SpatialDropout1D(0.3)(model)

        if activation == 'norm_relu':
            model = Activation('relu')(model)
            model = Lambda(channel_normalization, name="encoder_norm_{}".format(i))(model)
        elif activation == 'wavenet':
            model = WaveNet_activation(model)
        else:
            model = Activation(activation)(model)

        model = MaxPooling1D(2)(model)

    # ---- Decoder ----
    for i in range(n_layers):
        model = UpSampling1D(2)(model)
        if causal:
            model = ZeroPadding1D((conv_len // 2, 0))(model)
        model = Convolution1D(n_nodes[-i - 1], conv_len, border_mode='same')(model)
        if causal:
            model = Cropping1D((0, conv_len // 2))(model)

        model = SpatialDropout1D(0.3)(model)

        if activation == 'norm_relu':
            model = Activation('relu')(model)
            model = Lambda(channel_normalization, name="decoder_norm_{}".format(i))(model)
        elif activation == 'wavenet':
            model = WaveNet_activation(model)
        else:
            model = Activation(activation)(model)

    # Output FC layer
    model = TimeDistributed(Dense(n_classes, activation="softmax"))(model)

    model = Model(input=inputs, output=model)
    model.compile(loss=loss, optimizer=optimizer, sample_weight_mode="temporal", metrics=['accuracy'])

    if return_param_str:
        param_str = "ED-TCN_C{}_L{}".format(conv_len, n_layers)
        if causal:
            param_str += "_causal"
        return model, param_str
    else:
        return model


def Dilated_TCN(num_feat, num_classes, nb_filters, dilation_depth, nb_stacks, max_len,
                activation="wavenet", tail_conv=1, use_skip_connections=True, causal=False,
                optimizer='adam', return_param_str=False):
    """
    dilation_depth : number of layers per stack
    nb_stacks : number of stacks.
    """

    def residual_block(x, s, i, activation):
        original_x = x

        if causal:
            # Pad on the left and crop on the right so no future frames are used
            x = ZeroPadding1D(((2 ** i) // 2, 0))(x)
            conv = AtrousConvolution1D(nb_filters, 2, atrous_rate=2 ** i, border_mode='same',
                                       name='dilated_conv_%d_tanh_s%d' % (2 ** i, s))(x)
            conv = Cropping1D((0, (2 ** i) // 2))(conv)
        else:
            conv = AtrousConvolution1D(nb_filters, 3, atrous_rate=2 ** i, border_mode='same',
                                       name='dilated_conv_%d_tanh_s%d' % (2 ** i, s))(x)

        conv = SpatialDropout1D(0.3)(conv)
        # x = WaveNet_activation(conv)

        if activation == 'norm_relu':
            x = Activation('relu')(conv)
            x = Lambda(channel_normalization)(x)
        elif activation == 'wavenet':
            x = WaveNet_activation(conv)
        else:
            x = Activation(activation)(conv)

        # res_x = Convolution1D(nb_filters, 1, border_mode='same')(x)
        # skip_x = Convolution1D(nb_filters, 1, border_mode='same')(x)
        x = Convolution1D(nb_filters, 1, border_mode='same')(x)

        res_x = Merge(mode='sum')([original_x, x])

        # return res_x, skip_x
        return res_x, x

    input_layer = Input(shape=(max_len, num_feat))  # [T, F]

    skip_connections = []

    x = input_layer
    if causal:
        x = ZeroPadding1D((1, 0))(x)
        x = Convolution1D(nb_filters, 2, border_mode='same', name='initial_conv')(x)
        x = Cropping1D((0, 1))(x)
    else:
        x = Convolution1D(nb_filters, 3, border_mode='same', name='initial_conv')(x)

    for s in range(nb_stacks):
        for i in range(0, dilation_depth + 1):
            x, skip_out = residual_block(x, s, i, activation)
            skip_connections.append(skip_out)

    if use_skip_connections:
        x = Merge(mode='sum')(skip_connections)
    x = Activation('relu')(x)
    x = Convolution1D(nb_filters, tail_conv, border_mode='same')(x)
    x = Activation('relu')(x)
    x = Convolution1D(num_classes, tail_conv, border_mode='same')(x)
    x = Activation('softmax', name='output_softmax')(x)

    model = Model(input_layer, x)
    model.compile(optimizer, loss='categorical_crossentropy', sample_weight_mode='temporal')

    if return_param_str:
        param_str = "D-TCN_C{}_B{}_L{}".format(2, nb_stacks, dilation_depth)
        if causal:
            param_str += "_causal"
        return model, param_str
    else:
        return model
The full code is available at the link given in the original post.