My reading notes on the DAPs paper.
Abstract
This paper introduces Deep Action Proposals (DAPs), an effective and efficient algorithm for generating temporal action proposals from long videos. The authors show how to use deep learning models and memory cells to retrieve temporal segments from untrimmed videos. A comprehensive evaluation indicates that the approach outperforms previous work on a large-scale action benchmark while running at 134 FPS.
Introduction
In this paper, the authors focus on the task of quickly localizing temporal chunks in videos that are likely to contain human activities of interest. A good temporal action proposal method can facilitate activity detection. In the computer vision community, the idea of extracting regions with semantic content is not new.
DAPs can localize segments of varied duration around actions occurring along a video without exhaustively exploring multiple temporal scales.
Contributions
- Outputs the temporal location and scale of a fixed number of proposals.
- Handles multiple temporal scales in a single pass and generalizes to actions unseen in training.
- Extensive experiments on large-scale benchmarks.
- Runs at 134 FPS.
DAPs
PS: A 2D convolution kernel is 2-dimensional (w*h), a 3D convolution kernel is 3-dimensional (w*h*f), a feature map is 2-dimensional, and each convolution kernel has its own bias.
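The kernel shapes in the note above translate directly into parameter counts. Here is a small sketch, assuming each kernel also spans all input channels (as in standard convolution layers); the helper name `conv_params` is made up for illustration.

```python
def conv_params(in_channels, out_channels, kernel_shape):
    """Count the learnable parameters of a convolution layer.

    kernel_shape is (w, h) for a 2D kernel and (w, h, f) for a 3D kernel.
    Assumption: every kernel spans all input channels and has one bias.
    """
    weights_per_kernel = in_channels
    for k in kernel_shape:
        weights_per_kernel *= k
    # One kernel (plus one bias) per output feature map
    return out_channels * weights_per_kernel + out_channels


# 3x3 2D kernels vs. 3x3x3 3D kernels on RGB input with 64 output maps
print(conv_params(3, 64, (3, 3)))     # 1792
print(conv_params(3, 64, (3, 3, 3)))  # 5248
```

The extra temporal extent f multiplies the weight count, while the number of biases stays at one per kernel.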
We aim to retrieve temporal segments that likely contain actions of interest.
Architecture
Our DAPs network encodes a stream of visual observations of length $T$ frames into discriminative states, from which we infer the temporal location and duration $\{s_i\}_{i=1}^{K}$ of $K$ action proposals inside the stream. Each proposal $s_i$ is associated with a confidence score $c_i$. Our network integrates the following modules:
Visual Encoder: It encodes a small video volume into a meaningful low dimensional feature vector. In practice, we use activations from the top layer of a 3D convolutional network trained for action classification (C3D network [34]).
Sequence Encoder: It encodes the sequence of visual codes as a discriminative sequence of hidden states. Here, we use a long short-term memory (LSTM) network.
Localization Module: It predicts the location of $K$ proposals inside the stream based on a linear combination of the last state in the sequence encoder. In this way, our model can output segments of different lengths in one pass instead of the traditional way of scanning over overlapping segments with multiple window sizes. Each proposal $s_i$ is predicted by the localization module.
Prediction Module: It predicts the confidence $c_i$ that proposal $s_i$ contains an action within its temporal extent. In practice, $c_i$ is the output of a sigmoid function over a linear combination of the last state of the sequence encoder.
Architecture Visualization:
PS: A segment is a proposal. A (visual) stream of $T$ frames is shown in the yellow box, and $K$ is the number of proposals per stream. A video sequence is the whole video volume, so a video sequence can contain more than $K$ segments where actions may be found.
Inference and Learning
Inference: In order to produce several candidate segments where actions are likely within a long video sequence, we slide our DAPs network over it with step size δ. Every time our model scans a video stream of length $T$ frames, it places $K$ segments of varied duration inside it with their respective action likelihoods. Our algorithm scans the whole video sequence in only one pass with one stream (or window) size $T$, while still producing segments of different duration.
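The sliding scheme can be sketched in a few lines. This is only an illustration: the helper name `stream_starts` and the concrete values (a 1000-frame video, T = 512, δ = 128, K = 64) are assumptions, not settings taken from the paper.

```python
def stream_starts(num_frames, T, delta):
    """Return the first frame of every length-T stream visited
    when sliding over a video with step size delta."""
    if T <= 0 or delta <= 0:
        raise ValueError('T and delta must be positive')
    # A stream starting at frame f covers frames f .. f + T - 1
    return list(range(0, num_frames - T + 1, delta))


T, delta, K = 512, 128, 64  # illustrative values only
starts = stream_starts(1000, T, delta)
print(starts)           # [0, 128, 256, 384]
print(len(starts) * K)  # 256 candidate segments for the whole video
```

Each stream is scanned once with the same window size, and the K segments predicted inside it carry the variation in duration.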
Learning: We are interested in learning an appropriate function f such that:
(i) segments produced by our model match the locations of actions $A = \{a_i\}_{i=1}^{M}$ in the sequence (the number of these actions in stream $v$ is assumed to be less than $K$);
(ii) confidence values associated with segments that match an action are higher than those of other segments. This is done by formulating an assignment problem.
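The paper's exact objective is not reproduced in these notes. As a toy illustration of the assignment idea only, the sketch below matches every ground-truth action to a distinct proposal by brute force, using the squared distance between (center, duration) pairs as the cost. Both the cost and the helper name `match_actions` are assumptions made for this example.

```python
from itertools import permutations


def match_actions(proposals, actions):
    """Assign each action to a distinct proposal so that the total
    squared distance between (center, duration) pairs is minimal.

    Brute force over all assignments, so only suitable for tiny inputs.
    Requires len(actions) <= len(proposals).
    """
    best_cost, best_match = float('inf'), None
    for perm in permutations(range(len(proposals)), len(actions)):
        cost = sum((proposals[j][0] - a[0]) ** 2 + (proposals[j][1] - a[1]) ** 2
                   for j, a in zip(perm, actions))
        if cost < best_cost:
            best_cost, best_match = cost, perm
    return list(best_match), best_cost


# Normalized (center, duration) pairs: K = 3 proposals, M = 2 actions
proposals = [(0.2, 0.1), (0.5, 0.3), (0.8, 0.2)]
actions = [(0.75, 0.25), (0.25, 0.1)]
match, cost = match_actions(proposals, actions)
print(match)  # [2, 0]: action 0 -> proposal 2, action 1 -> proposal 0
```

Matched proposals would then be pushed towards their actions and towards high confidence, while unmatched proposals would be pushed towards low confidence.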
Implementation Details: For our visual encoder, we use the publicly available pre-trained C3D model [34], which has a temporal resolution of 16 frames. To shorten the training time of our implementation, we reduce the dimensionality of the activations from the second fully-connected layer (fc7) of our visual encoder from 4096 to 500 dimensions using PCA. By cross-validation, we find that a sequence encoder with one layer and 256 output units achieves a good trade-off between accuracy and runtime. We use back-propagation through time with the Adagrad update rule to find the parameters θ of our sequence encoder and output modules. By hyper-parameter search, a learning rate of $10^{-4}$ and α = 1.0 provide good results. In practice, we predict locations (s) as the duration of the action and the frame index of its center (normalized by $T$). The DAPs network is trained on video streams of length $T$ frames from long untrimmed videos. From a labeled dataset like THUMOS-14, with 11 hours of video and more than 3000 annotations, we are able to generate a large corpus of video streams (over 500 thousand) that might contain multiple actions. In practice, we densely extract video streams and cluster them according to their tIoU with annotations of the video. We sample streams from each cluster, so they are equally represented.
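For reference, tIoU (temporal intersection over union) between two segments can be computed as below. This is a minimal sketch: segments are assumed to be (start, end) pairs in frames, and `best_tiou` is a hypothetical helper giving one quantity that could be used to group streams. How the clusters are actually formed is not described in these notes.

```python
def tiou(seg_a, seg_b):
    """Temporal intersection over union of two (start, end) segments."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0


def best_tiou(stream, annotations):
    """Highest tIoU between a stream and any annotation (0.0 if none)."""
    return max((tiou(stream, a) for a in annotations), default=0.0)


stream = (0, 512)                       # one stream of T = 512 frames
annotations = [(100, 300), (600, 700)]  # made-up action annotations
print(best_tiou(stream, annotations))   # 0.390625
```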
Reference
[34] Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3d convolutional networks. In: IEEE International Conference on Computer Vision, ICCV. pp. 4489–4497 (2015).
Code
The authors provide code based on Lasagne. Below is my code based on PyTorch. The PyTorch LSTM implementation differs from the Lasagne one; here I only use the default LSTM in PyTorch, and you can implement the original computation yourself. Note that this code may still need debugging.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Wed Apr 17 21:14:25 2019
@author: tang
"""
import torch.nn as nn
import torch
import numpy as np
def format(X, mthd='c2b'):
    """Transform between temporal/frame annotations
    Parameters
    ----------
    X : ndarray
        2d-ndarray of size [n, 2] with temporal annotations
    mthd : str
        Type of conversion:
        'c2b': transform [center, duration] onto [f-init, f-end]
        'b2c': inverse of c2b
        'd2b': transform ['f-init', 'n-frames'] into ['f-init', 'f-end']
    Outputs
    -------
    Y : ndarray
        2d-ndarray of size [n, 2] with transformed temporal annotations.
    """
    if X.ndim != 2:
        msg = 'Incorrect number of dimensions. X.shape = {}'
        # The exception must be raised, not just constructed
        raise ValueError(msg.format(X.shape))
    if mthd == 'c2b':
        Xinit = np.ceil(X[:, 0] - 0.5*X[:, 1])
        Xend = Xinit + X[:, 1] - 1.0
        return np.stack([Xinit, Xend], axis=-1)
    elif mthd == 'b2c':
        Xc = np.round(0.5*(X[:, 0] + X[:, 1]))
        d = X[:, 1] - X[:, 0] + 1.0
        return np.stack([Xc, d], axis=-1)
    elif mthd == 'd2b':
        Xinit = X[:, 0]
        Xend = X[:, 0] + X[:, 1] - 1.0
        return np.stack([Xinit, Xend], axis=-1)
    # Fail loudly instead of silently returning None
    raise ValueError('Unknown conversion type: {}'.format(mthd))
class OriginLSTM(nn.Module):
    """Placeholder for an LSTM reproducing the Lasagne computation (not implemented)."""
    def __init__(self):
        super(OriginLSTM, self).__init__()
class DAPs(nn.Module):
    """
    Deep Action Proposal(sequence encoder & proposal generation)
    References:
    -----------
    [1]https://github.com/escorciav/daps/blob/master/daps/sequence_encoder.py
    [2]V. Escorcia, F. C. Heilbron, J. C. Niebles, and B. Ghanem.
    Daps: Deep action proposals for action understanding. In
    European Conference on Computer Vision, pages 768–784.
    Springer, 2016.
    """
    def __init__(self, out_size=64, input_size=500, time_step=32,
                 receptive_field=512, depth=1, width=256, anchors=None, origin=False):
        """
        Initialize DAPs model
        Parameters:
        ----------
        out_size:int,optional
            the number of proposals
        input_size:int,optional
            the dimension of a clip feature ([1,input_size])
        time_step:int,optional
            the number of time steps of the LSTM, in other words, the length
            of a stream in clips
        receptive_field:int,optional
            the length of a stream in frames
        depth:int,optional
            the number of hidden layers in the LSTM
        width:int,optional
            the number of units in each hidden layer of the LSTM
        anchors:ndarray,optional
            2d-array of size [out_size,2] with anchor segment locations
            normalized with respect to receptive_field. The anchor location
            format should be [central frame,duration].
        origin:bool,optional
            use the (unimplemented) OriginLSTM instead of nn.LSTM
        Raises:
        ------
        ValueError:
            anchors.shape[0]!=out_size
        """
        # nn.Module must be initialized before submodules are assigned
        super(DAPs, self).__init__()
        self.out_size = out_size
        self.input_size = input_size
        self.time_step = time_step
        self.receptive_field = receptive_field
        self.depth = depth
        self.width = width
        self.anchors = anchors
        if anchors is not None:
            if anchors.shape[0] != out_size:
                raise ValueError('Mismatch between number of anchors and outputs')
        self.LSTM = self.built_sequense_encoder(origin=origin)
        # Assign the layers as attributes so their parameters are registered
        self.loc_layer, self.conf_layer = self.built_dense()
    def built_sequense_encoder(self, forget_bias=5.0, grad_clip=10.0, origin=False):
        """
        Build the sequence encoder model.
        Parameters:
        -----------
        origin:boolean,optional
        forget_bias and grad_clip are kept from the Lasagne version but are
        not used by the default PyTorch LSTM.
        """
        if not origin:
            return nn.LSTM(input_size=self.input_size, hidden_size=self.width,
                           num_layers=self.depth)
        else:
            return OriginLSTM()
    def built_dense(self):
        """
        Build the dense layers that generate locations and confidences
        """
        return nn.Linear(self.width, self.out_size*2), \
            nn.Linear(self.width, self.out_size)
    def forward(self, x, f_init_array, override=False):
        """
        Forward pass (inference only: proposals are returned as a NumPy array)
        Parameters:
        -----------
        x:torch.Tensor
            3d-tensor of size [sequence length,batch,feature]
        f_init_array:ndarray
            1d-array of size [batch] with the first frame of each stream
        Raises:
        -------
        ValueError
            x.shape[0]!=time_step or x.shape[2]!=self.input_size
        """
        if x.shape[0] != self.time_step:
            raise ValueError('Mismatch between number of time steps of model and x')
        if x.shape[2] != self.input_size:
            raise ValueError('Mismatch between number of features of input and model')
        batch_size = x.shape[1]
        h0 = torch.zeros(self.depth, batch_size, self.width, device=x.device)
        c0 = torch.zeros(self.depth, batch_size, self.width, device=x.device)
        output, (hn, cn) = self.LSTM(x, (h0, c0))
        # Last hidden state of the top LSTM layer: [batch, width]
        hn = hn[-1]
        loc_var = self.loc_layer(hn)
        conf_var = torch.sigmoid(self.conf_layer(hn))
        # The post-processing below is NumPy based, so leave the autograd graph
        loc_var = loc_var.detach().cpu().numpy()
        if override and self.anchors is not None:
            loc_var[:, ...] = self.anchors.reshape(-1)
        # Clip proposals inside receptive field
        np.clip(loc_var, 0, 1, out=loc_var)
        loc_var *= self.receptive_field
        # Shift center to absolute location in the video
        loc_var = loc_var.reshape((batch_size, -1, 2))
        loc_var[:, :, 0] += np.asarray(f_init_array).reshape((batch_size, 1))
        # Transform center 2 boundaries
        proposals = np.reshape(
            format(loc_var.reshape((-1, 2)), 'c2b'),
            (batch_size, -1, 2)).astype(int)
        return proposals, conf_var
For C3D in PyTorch, there is a coding trick: you can use a dictionary and an nn.Sequential(*args) object to build the model dynamically from a dictionary entry that defines the model structure. Different dictionary entries give different models.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Wed Apr 17 01:20:42 2019
@author: tang
"""
dic={'C3D':[64,'m1',128,'m2',256,256,'m2',512,512,'m2',512,512,'m2']}
import torch.nn as nn
import torch
import torch.optim.lr_scheduler as lr_scheduler
class C3D(nn.Module):
    """Implement C3D model"""
    def __init__(self):
        super(C3D, self).__init__()
        self.features = self.get_feature_layer(dic['C3D'])
        self.classifier = nn.Sequential(
            nn.Linear(4608, 4096),
            nn.ReLU(),
            nn.Linear(4096, 4096),
        )
    def get_feature_layer(self, l):
        """Build the convolutional part from a list of channel counts and pooling tags"""
        result = []
        last_c = 3
        for i in l:
            if i == 'm1':
                # Pool only spatially, keep the temporal length
                result += [nn.MaxPool3d((1, 2, 2), (1, 2, 2))]
            elif i == 'm2':
                result += [nn.MaxPool3d(2, 2)]
            else:
                result += [nn.Conv3d(last_c, i, 3, padding=1), nn.ReLU()]
                last_c = i
        return nn.Sequential(*result)
    def forward(self, x):
        o1 = self.features(x)
        # Flatten [batch, C, T, H, W] to [batch, C*T*H*W] for the linear layers
        o1 = o1.view(o1.size(0), -1)
        return self.classifier(o1)
class Utils():
    """Train and test C3D model"""
    def __init__(self, initial_lr=0.003, batch_size=30, steps_size=150000, decay_rate=0.5, totel_steps=1.9e6):
        self.initial_lr = initial_lr
        self.batch_size = batch_size
        self.steps_size = steps_size
        self.decay_rate = decay_rate
        self.totel_steps = totel_steps
    def train(self, model, trainloader):
        """
        Train C3D model from scratch according to the original paper.
        Parameters:
        -----------
        model: nn.Module
            C3D model (assumed to be on the GPU already)
        trainloader:torch.utils.data.DataLoader
            training set containing tensors of size [batch_size,3,16,h,w]
        """
        loss = nn.CrossEntropyLoss()
        optimizer = torch.optim.SGD(model.parameters(), lr=self.initial_lr)
        # The default last_epoch=-1 starts the schedule from scratch
        scheduler = lr_scheduler.StepLR(optimizer, self.steps_size, gamma=self.decay_rate)
        step_count = 0
        acc = 0
        count = 0
        l = 0
        while True:
            for train_data in trainloader:
                optimizer.zero_grad()
                step_count += 1
                data, label = train_data
                data = data.cuda()
                label = label.cuda()
                out = model(data)
                _, indexes = torch.max(out, 1)
                batch_acc = (indexes == label).sum().item()
                batch_count = label.shape[0]
                acc += batch_acc
                count += batch_count
                batch_l = loss(out, label)
                batch_l.backward()
                # .item() keeps the running loss from holding the autograd graph
                l += batch_l.item()
                # Update the weights first, then advance the learning-rate schedule
                optimizer.step()
                scheduler.step()
                if step_count == self.totel_steps:
                    return
                if step_count % 1000 == 0:
                    print('the accuracy of '+str(step_count-1000)+'-'+str(step_count)
                          + ' iterations is %f' % (acc/count))
                    # batch_l is already a per-batch mean, so average over the 1000 batches
                    print('the mean loss of '+str(step_count-1000)+'-'+str(step_count)
                          + ' iterations is %f' % (l/1000))
                    l = 0
                    acc = 0
                    count = 0
    def validate(self, model, validationloader):
        """
        Validate trained model.
        Parameters:
        -----------
        model:nn.Module
            trained C3D model (assumed to be on the GPU already)
        validationloader:torch.utils.data.DataLoader
            validation set containing tensors of size [batch_size,3,16,h,w]
        """
        acc = 0
        count = 0
        model.eval()
        # No gradients are needed for evaluation
        with torch.no_grad():
            for validation_data in validationloader:
                data, label = validation_data
                data = data.cuda()
                label = label.cuda()
                out = model(data)
                _, indexes = torch.max(out, 1)
                batch_acc = (indexes == label).sum().item()
                batch_count = label.shape[0]
                acc += batch_acc
                count += batch_count
        print('the accuracy of validation set '+'is %f' % (acc/count))
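As a sanity check on the input size of the first linear layer (4608) in the C3D listing above, the feature-map shape can be traced through the configuration list in pure Python. This is a sketch under stated assumptions: clips of 16 frames at 112×112 pixels (the 16-frame length comes from the notes above, the 112×112 crop is an assumption), 3×3×3 convolutions with padding 1 that preserve the size, and floor-mode pooling as in PyTorch's default `MaxPool3d`. The names `C3D_CFG` and `trace_shape` are made up for this example.

```python
C3D_CFG = [64, 'm1', 128, 'm2', 256, 256, 'm2', 512, 512, 'm2', 512, 512, 'm2']


def trace_shape(cfg, channels=3, frames=16, height=112, width=112):
    """Follow (channels, frames, height, width) through the config list."""
    for item in cfg:
        if item == 'm1':
            # Spatial-only pooling: kernel and stride (1, 2, 2)
            height, width = height // 2, width // 2
        elif item == 'm2':
            # Pooling over time and space: kernel and stride 2
            frames, height, width = frames // 2, height // 2, width // 2
        else:
            # 3x3x3 convolution with padding 1 only changes the channel count
            channels = item
    return channels, frames, height, width


shape = trace_shape(C3D_CFG)
flat = shape[0] * shape[1] * shape[2] * shape[3]
print(shape, flat)  # (512, 1, 3, 3) 4608
```

With a different crop size or a padded last pooling layer, the flattened size changes and the first linear layer has to be adjusted accordingly.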