DAPs: Deep Action Proposals for Action Understanding

Notes from my reading of the DAPs paper.

Abstract

This paper introduces Deep Action Proposals (DAPs), an effective and efficient algorithm for generating temporal action proposals from long videos. The authors show how to use deep learning models and memory cells to retrieve temporal segments from untrimmed videos. A comprehensive evaluation indicates that this approach outperforms previous work on a large-scale activity benchmark while running at 134 FPS.

Introduction

In this paper, the authors focus on the task of quickly localizing temporal chunks in videos that are likely to contain human activities of interest. A good temporal action proposal method can facilitate activity detection. In the computer vision community, the idea of extracting regions with semantic content is not new.
DAPs can localize segments of varied duration around actions occurring along a video without exhaustively exploring multiple temporal scales.

Contributions

  1. Output the temporal location and scale of a fixed number of proposals.
  2. Handle multiple temporal scales in a single pass, and generalize to unseen actions.
  3. Extensive experiments on large-scale benchmarks.
  4. Run at 134 FPS.

DAPs

PS: a convolution kernel of a 2D conv is 2-dimensional (w×h), a convolution kernel of a 3D conv is 3-dimensional (w×h×f), a feature map of a 2D conv is 2-dimensional, and each convolution kernel corresponds to one bias.
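As a quick check of that note, here is a minimal PyTorch sketch (mine, not from the paper) contrasting the kernel shapes of nn.Conv2d and nn.Conv3d:

import torch
import torch.nn as nn

# 2D conv kernels slide over (height, width); 3D conv kernels add a temporal extent.
conv2d = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3)
conv3d = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=3, padding=1)

print(conv2d.weight.shape)  # torch.Size([64, 3, 3, 3])    -> (out, in, h, w)
print(conv3d.weight.shape)  # torch.Size([64, 3, 3, 3, 3]) -> (out, in, f, h, w)
print(conv2d.bias.shape)    # torch.Size([64]), one bias per output kernel
print(conv3d.bias.shape)    # torch.Size([64])

# A 3D conv consumes a clip shaped (batch, channels, frames, height, width).
clip = torch.randn(1, 3, 16, 112, 112)
print(conv3d(clip).shape)   # torch.Size([1, 64, 16, 112, 112])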
We aim to retrieve temporal segments that likely contain actions of interest.

Architecture

Our DAPs network encodes a stream of visual observations of length $T$ frames into discriminative states, from which we infer the temporal location and duration $\{s_i\}_{i=1}^{K}$ of $K$ action proposals inside the stream. Each proposal $s_i$ is associated with a confidence score $c_i$. Our network integrates the following modules:
Visual Encoder: It encodes a small video volume into a meaningful low dimensional feature vector. In practice, we use activations from the top layer of a 3D convolutional network trained for action classification (C3D network [34]).
Sequence Encoder: It encodes the sequence of visual codes as a discriminative sequence of hidden states. Here, we use a long short-term memory (LSTM) network.
Localization Module: It predicts the location of K proposals inside the stream based on a linear combination of the last state in the sequence encoder. In this way, our model can output segments of different lengths in one pass instead of the traditional way of scanning over overlapping segments with multiple window sizes. Each proposal $s_i$ is predicted by the localization module.
Prediction Module: It predicts the confidence $c_i$ that proposal $s_i$ contains an action within its temporal extent. In practice, $c_i$ is the output of a sigmoid function over a linear combination of the last state of the sequence encoder.
Architecture Visualization:
[Figure: DAPs architecture — a C3D visual encoder feeds an LSTM sequence encoder, whose last state drives the localization and prediction modules.]
PS: a segment is a proposal. The stream of $T$ frames is shown in the yellow box, and $K$ is the number of proposals per stream. A (visual) stream is the yellow box, while a video sequence is the whole video volume. So there can be many segments where it is possible to find actions in a video sequence, not only $K$.
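To make the data flow concrete, here is a shape-level sketch (my illustration, not the authors' code) of how the four modules compose for one stream, using the sizes from the implementation details below (500-d visual codes, 256 hidden units, K = 64 proposals as in the code at the end of this post):

import torch
import torch.nn as nn

T_steps, batch, feat_dim, hidden, K = 32, 1, 500, 256, 64

# Visual encoder output: one (PCA-reduced) C3D fc7 code per 16-frame clip,
# so a 512-frame stream becomes 32 codes of 500 dimensions each.
visual_codes = torch.randn(T_steps, batch, feat_dim)

# Sequence encoder: an LSTM over the stream of visual codes.
lstm = nn.LSTM(input_size=feat_dim, hidden_size=hidden)
_, (h_n, _) = lstm(visual_codes)            # h_n: (1, batch, hidden)
last_state = h_n[-1]                        # (batch, hidden)

# Localization module: K (center, duration) pairs from the last hidden state.
loc_head = nn.Linear(hidden, 2 * K)
locations = loc_head(last_state).view(batch, K, 2)

# Prediction module: one confidence score per proposal.
conf_head = nn.Linear(hidden, K)
confidences = torch.sigmoid(conf_head(last_state))

print(locations.shape, confidences.shape)   # (1, 64, 2) and (1, 64)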

Inference and Learning

Inference: In order to produce several candidate segments where actions are likely within a long video sequence, we slide our DAPs network over it with step size $\delta$. Every time our model scans a video stream of length $T$ frames, it places $K$ segments of varied duration inside it with their respective action likelihoods. Our algorithm scans the whole video sequence in only one pass with a single stream (or window) size $T$, while still producing segments of different duration.
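A rough sketch of that one-pass scan with stride $\delta$ (my illustration; the `daps_net` callable, the default stride, and the simple pooling of results are assumptions, not the authors' exact pipeline):

import numpy as np

def scan_video(video_features, daps_net, T=512, delta=128, clip_len=16):
    """Slide the DAPs network over a long video and pool all proposals.

    video_features: ndarray (n_clips, feat_dim), one C3D code per 16-frame clip.
    daps_net: callable mapping (stream, f_init) -> (K locations, K confidences).
    """
    steps_per_stream = T // clip_len      # clips covered by one stream (e.g. 32)
    stride = delta // clip_len            # step size expressed in clips
    all_proposals, all_scores = [], []
    for start in range(0, video_features.shape[0] - steps_per_stream + 1, stride):
        stream = video_features[start:start + steps_per_stream]
        f_init = start * clip_len         # absolute first frame of this stream
        locations, confidences = daps_net(stream, f_init)
        all_proposals.append(locations)
        all_scores.append(confidences)
    return np.concatenate(all_proposals), np.concatenate(all_scores)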
Learning: We are interested in learning an appropriate function $f$ such that:
(i) segments produced by our model match the locations of actions $A = \{a_i\}_{i=1}^{M}$ in the sequence (the number of these actions in stream $v$ is assumed to be less than $K$);
(ii) confidence values associated with segments that match an action are higher than those of other segments. This is done by formulating an assignment problem:
[Equation: assignment-problem objective that matches each annotated action to one of the K output segments, penalizing the localization error of matched segments and pushing their confidences above those of unmatched segments.]
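As an illustration of the matching step only (a simple greedy assignment by temporal IoU, not necessarily the authors' exact procedure, which is defined by the objective above), matched pairs would then receive the localization and confidence losses:

import numpy as np

def tiou(seg, gt):
    """Temporal IoU between two [f_init, f_end] segments."""
    inter = max(0.0, min(seg[1], gt[1]) - max(seg[0], gt[0]) + 1.0)
    union = (seg[1] - seg[0] + 1.0) + (gt[1] - gt[0] + 1.0) - inter
    return inter / union

def greedy_assign(proposals, annotations):
    """Match each annotation to its best unassigned proposal by tIoU.

    proposals: ndarray (K, 2), annotations: ndarray (M, 2) with M <= K.
    Returns a list of (proposal_index, annotation_index) pairs.
    """
    assigned, pairs = set(), []
    for j, gt in enumerate(annotations):
        scores = [(-1.0 if i in assigned else tiou(p, gt), i)
                  for i, p in enumerate(proposals)]
        best_score, best_i = max(scores)
        assigned.add(best_i)
        pairs.append((best_i, j))
    return pairs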
Implementation Details: for our visual encoder, we use the publicly available pre-trained C3D model [34], which has a temporal resolution of 16 frames. To shorten training time, we reduce the dimensionality of the activations from the second fully-connected layer (fc7) of our visual encoder from 4096 to 500 dimensions using PCA. By cross-validation, we find that one layer with 256 output units achieves a good trade-off between accuracy and runtime. We use back-propagation through time with the ADAGRAD update rule to find the parameters θ of our sequence encoder and output modules. By hyper-parameter search, a learning rate of $10^{-4}$ and α = 1.0 provide good results. In practice, we predict locations ($s$) as the duration of the action and the frame index of its center (normalized by $T$). The DAPs network is trained on video streams of length $T$ frames from long untrimmed videos. From a labeled dataset like THUMOS-14, with 11 hours of video and more than 3000 annotations, we are able to generate a large corpus of video streams (over 500 thousand) that might contain multiple actions. In practice, we densely extract video streams and cluster them according to their tIoU with the annotations of the video. We sample streams from each cluster so that they are equally represented.
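For the fc7 dimensionality reduction mentioned above, here is a minimal sketch with scikit-learn (the paper does not say which PCA implementation was used, and the .npy file names are made up for illustration):

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical matrix of C3D fc7 activations, one row per 16-frame clip.
fc7 = np.load('c3d_fc7_train.npy')           # shape (n_clips, 4096), illustrative file
pca = PCA(n_components=500)
fc7_reduced = pca.fit_transform(fc7)         # shape (n_clips, 500)

# The fitted components are then reused to project features of new videos.
new_fc7 = np.load('c3d_fc7_new_video.npy')   # illustrative file
new_reduced = pca.transform(new_fc7)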

Reference

[34] Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3d convolutional networks. In: IEEE International Conference on Computer Vision, ICCV. pp. 4489–4497 (2015).

Code

The authors offer code based on Lasagne. Below is my code based on PyTorch. PyTorch's LSTM differs from Lasagne's implementation; here I only use the default LSTM in PyTorch, but you could re-implement the original computation (see OriginLSTM below). Note that this code may still need debugging.
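One point worth noting before the code: Lasagne's LSTM layer exposes a forget-gate bias and gradient clipping, which the stock nn.LSTM does not. A hedged sketch of how these could be approximated in PyTorch (the values 5.0 and 10.0 mirror the defaults of build_sequence_encoder below):

import torch
import torch.nn as nn

hidden = 256
lstm = nn.LSTM(input_size=500, hidden_size=hidden, num_layers=1)

# PyTorch packs gate parameters as [input, forget, cell, output] chunks of size
# hidden, and splits the bias into two vectors (ih and hh) that are summed.
with torch.no_grad():
    lstm.bias_ih_l0[hidden:2 * hidden].fill_(5.0)  # forget-gate bias, as in Lasagne
    lstm.bias_hh_l0[hidden:2 * hidden].zero_()

# Gradient clipping applied to the accumulated gradients after backward(); this
# only approximates Lasagne's per-step clipping inside the recurrence.
x = torch.randn(32, 4, 500)
out, _ = lstm(x)
out.sum().backward()
torch.nn.utils.clip_grad_value_(lstm.parameters(), 10.0)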

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Wed Apr 17 21:14:25 2019

@author: tang
"""
import torch.nn as nn
import torch
import numpy as np
def format(X, mthd='c2b'):
    """Transform between temporal/frame annotations

    Parameters
    ----------
    X : ndarray
        2d-ndarray of size [n, 2] with temporal annotations
    mthd : str
        Type of conversion:
        'c2b': transform [center, duration] onto [f-init, f-end]
        'b2c': inverse of c2b
        'd2b': transform ['f-init', 'n-frames'] into ['f-init', 'f-end']

    Outputs
    -------
    Y : ndarray
        2d-ndarray of size [n, 2] with transformed temporal annotations.

    """
    if X.ndim != 2:
        msg = 'Incorrect number of dimensions. X.shape = {}'
        raise ValueError(msg.format(X.shape))

    if mthd == 'c2b':
        Xinit = np.ceil(X[:, 0] - 0.5*X[:, 1])
        Xend = Xinit + X[:, 1] - 1.0
        return np.stack([Xinit, Xend], axis=-1)
    elif mthd == 'b2c':
        Xc = np.round(0.5*(X[:, 0] + X[:, 1]))
        d = X[:, 1] - X[:, 0] + 1.0
        return np.stack([Xc, d], axis=-1)
    elif mthd == 'd2b':
        Xinit = X[:, 0]
        Xend = X[:, 0] + X[:, 1] - 1.0
        return np.stack([Xinit, Xend], axis=-1)
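
# Worked example (illustrative): format(np.array([[8., 6.]]), 'c2b') -> [[5., 10.]]
# because a segment centered at frame 8 with duration 6 spans frames 5..10
# (ceil(8 - 0.5*6) = 5 and 5 + 6 - 1 = 10); 'b2c' maps [[5., 10.]] back to [[8., 6.]].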

class OriginLSTM(nn.Module):
    """Placeholder for re-implementing Lasagne's LSTM variant
    (forget-gate bias and per-step gradient clipping)."""
    def __init__(self):
        super(OriginLSTM,self).__init__()
        pass
class DAPs(nn.Module):
    """
    Deep Action Proposals (sequence encoder & proposal generation)
    
    References:
    -----------
    [1]https://github.com/escorciav/daps/blob/master/daps/sequence_encoder.py
    [2]V. Escorcia, F. C. Heilbron, J. C. Niebles, and B. Ghanem.
       Daps: Deep action proposals for action understanding. In
       European Conference on Computer Vision, pages 768–784.
       Springer, 2016.
    """
    def __init__(self,out_size=64,input_size=500,time_step=32,\
                 receptive_field=512,depth=1,width=256,anchors=None,origin=False):
        """
        Initialize DAPs model
        
        Parameters:
        ----------
        out_size:int,optional
            the number of proposals
        input_size:int,optional
            the dimension of a clip feature ([1,input_size])
        time_step:int,optional
            the number of time steps of the LSTM; in other words, the length
            of a stream in clips
        receptive_field:int,optional
            the length of a stream in frames
        depth:int,optional
            the number of hidden layers in the LSTM
        width:int,optional
            the number of units in a hidden layer of the LSTM
        anchors:ndarray,optional
            2d-array of size [out_size,2] with anchor segment locations
            normalized with respect to receptive_field. The anchor location
            format should be [central frame,duration].
        
        Raises:
        ------
            ValueError:
                anchors.shape[0]!=out_size   
        """
        super(DAPs,self).__init__()
        self.out_size=out_size
        self.input_size=input_size
        self.time_step=time_step
        self.receptive_field=receptive_field
        self.depth=depth
        self.width=width
        self.anchors=anchors
        if anchors is not None:
            if anchors.shape[0]!=out_size:
                raise ValueError('Mismatch between number of anchors and '
                                 'outputs')
        self.LSTM=self.build_sequence_encoder(origin=origin)
        # two linear heads: proposal locations and confidences
        self.loc,self.conf=self.build_dense()
    def build_sequence_encoder(self,forget_bias=5.0,grad_clip=10.0,origin=False):
        """
        Build the sequence encoder model.

        Parameters:
        -----------
        origin:boolean,optional
            if True, return the (unimplemented) Lasagne-style LSTM;
            forget_bias and grad_clip are only meaningful in that case.
        """
        if not origin:
            return nn.LSTM(input_size=self.input_size,hidden_size=self.width,num_layers=self.depth)
        else:
            return OriginLSTM()
    def build_dense(self):
        """
        Build the dense layers that generate proposal locations and confidences.
        """
        return nn.Linear(self.width,self.out_size*2),\
                    nn.Linear(self.width,self.out_size)
    def forward(self,x,f_init_array,override=False):
        """
        forward pass

        Parameters:
        -----------
        x:torch.Tensor
            3d-tensor of size [sequence length,batch,feature]
        f_init_array:ndarray
            1d-array of size [batch] with the absolute first frame of each stream
        override:boolean,optional
            if True, replace the predicted locations with the anchors

        Raises:
        -------
        ValueError
            x.shape[0]!=self.time_step or x.shape[2]!=self.input_size
        """
        if x.shape[0]!=self.time_step:
            raise ValueError('Mismatch between number of time steps of model and x')
        if x.shape[2]!=self.input_size:
            raise ValueError('Mismatch between number of features of input and model')
        batch_size=x.shape[1]
        h0=torch.zeros(self.depth,batch_size,self.width,device=x.device)
        c0=torch.zeros(self.depth,batch_size,self.width,device=x.device)
        output,(hn,cn)=self.LSTM(x,(h0,c0))
        hn=hn[-1]                               # last layer's final state: (batch,width)
        loc_var=self.loc(hn)                    # (batch,2*out_size) [center,duration] pairs
        conf_var=torch.sigmoid(self.conf(hn))   # (batch,out_size) confidences

        # Post-process locations in numpy (no gradient needed for proposal generation)
        loc_np = loc_var.detach().cpu().numpy()
        if override and self.anchors is not None:
            loc_np[:, ...] = self.anchors.reshape(-1)

        # Clip proposals inside receptive field and map to frame units
        loc_np = np.clip(loc_np, 0, 1) * self.receptive_field

        # Shift center to absolute location in the video
        loc_np = loc_np.reshape((batch_size, -1, 2))
        loc_np[:, :, 0] += f_init_array.reshape((batch_size, 1))

        # Transform center 2 boundaries
        proposals = format(loc_np.reshape((-1, 2)), 'c2b').reshape(
            (batch_size, -1, 2)).astype(int)
        return proposals,conf_var
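
A quick smoke test of the class above on random inputs (the anchor values are arbitrary, chosen only to satisfy the expected normalized [center, duration] format):

import numpy as np
import torch

# 64 arbitrary anchors in normalized [center, duration] format.
anchors = np.stack([np.full(64, 0.5), np.linspace(0.1, 1.0, 64)], axis=-1)
model = DAPs(out_size=64, input_size=500, time_step=32, anchors=anchors)

x = torch.randn(32, 2, 500)        # (time_step, batch, feature)
f_init = np.array([0, 512])        # absolute first frame of each stream
proposals, confidences = model(x, f_init)
print(proposals.shape, confidences.shape)  # (2, 64, 2) torch.Size([2, 64])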
        
        
        

For C3D in PyTorch, there is a coding trick: you can use a dictionary together with a Sequential(*args) object to dynamically create a model from a dictionary that defines the model structure. A different dictionary gives a different model.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Wed Apr 17 01:20:42 2019

@author: tang
"""
import torch.nn as nn
import torch
import torch.optim.lr_scheduler as lr_scheduler

# Integers are conv output channels; 'm1' pools spatially only, 'm2' pools
# spatially and temporally.
dic={'C3D':[64,'m1',128,'m2',256,256,'m2',512,512,'m2',512,512,'m2']}
class C3D(nn.Module):
    """Implement C3D model"""
    def __init__(self,num_classes=487):
        # 487 classes matches Sports-1M, the dataset C3D was originally trained on;
        # change num_classes for a different target dataset.
        super(C3D,self).__init__()
        self.features=self.get_feature_layer(dic['C3D'])
        self.classifier=nn.Sequential(
                nn.Linear(4608,4096),   # 4608 = 512*1*3*3 for 3x16x112x112 inputs
                nn.ReLU(),
                nn.Linear(4096,4096),
                nn.ReLU(),
                nn.Linear(4096,num_classes),
                )
    def get_feature_layer(self,l):
        """Build the 3D conv stack from the layer list in `dic`."""
        result=[]
        last_c=3
        for i in l:
            if i=='m1':
                # pool spatially only (keep the temporal length)
                result+=[nn.MaxPool3d((1,2,2),(1,2,2))]
            elif i=='m2':
                # pool spatially and temporally
                result+=[nn.MaxPool3d(2,2)]
            else:
                # integer entries are output channels of a 3x3x3 conv
                result+=[nn.Conv3d(last_c,i,3,padding=1),nn.ReLU()]
                last_c=i
        return nn.Sequential(*result)
    def forward(self,x):
        o1=self.features(x)
        o1=torch.flatten(o1,1)   # flatten conv features before the fully-connected layers
        return self.classifier(o1)
class Utils():
    """Train and test C3D model"""
    def __init__(self,initial_lr=0.003,batch_size=30,steps_size=150000,decay_rate=0.5,total_steps=1.9e6):
        self.initial_lr=initial_lr
        self.batch_size=batch_size
        self.steps_size=steps_size
        self.decay_rate=decay_rate
        self.total_steps=total_steps
        
    def train(self,model,trainloader):
        """
        Train the C3D model from scratch according to the original paper.

        Parameters:
        -----------
        model: nn.Module
            C3D model
        trainloader: torch.utils.data.DataLoader
            training set yielding tensors of size [batch_size, 3, 16, h, w]
        """
        loss=nn.CrossEntropyLoss()
        optimizer=torch.optim.SGD(model.parameters(),lr=self.initial_lr)
        scheduler=lr_scheduler.StepLR(optimizer,self.steps_size,gamma=self.decay_rate)
        step_count=0
        acc=0
        count=0
        l=0
        while True:
            for train_data in trainloader:
                optimizer.zero_grad()
                step_count+=1
                data,label=train_data
                data=data.cuda()
                label=label.cuda()
                out=model(data)
                _,indexes=torch.max(out,1)
                batch_acc=(indexes==label).sum().item()
                batch_count=label.shape[0]
                acc+=batch_acc
                count+=batch_count
                batch_l=loss(out,label)
                batch_l.backward()
                l+=batch_l.item()
                optimizer.step()
                scheduler.step()
                if step_count==self.total_steps:
                    return
                if step_count%1000==0:
                    print('the accuracy of '+str(step_count-1000)+'-'+str(step_count)\
                    +' iterations is %f'%(acc/count))
                    print('the mean loss of '+str(step_count-1000)+'-'+str(step_count)\
                    +' iterations is %f'%(l/1000))
                    l=0
                    acc=0
                    count=0
    def validate(self,model,validationloader):
        """
        Validate the trained model.

        Parameters:
        -----------
        model: nn.Module
            trained C3D model
        validationloader: torch.utils.data.DataLoader
            validation set yielding tensors of size [batch_size, 3, 16, h, w]
        """
        acc=0
        count=0
        model.eval()
        with torch.no_grad():
            for validation_data in validationloader:
                data,label=validation_data
                data=data.cuda()
                label=label.cuda()
                out=model(data)
                _,indexes=torch.max(out,1)
                batch_acc=(indexes==label).sum().item()
                batch_count=label.shape[0]
                acc+=batch_acc
                count+=batch_count
        print('the accuracy of the validation set is %f'%(acc/count))
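
A sketch of how the two classes might be wired together (the tiny random dataset is only a smoke test; a real DataLoader would yield (clip, label) pairs shaped [batch, 3, 16, 112, 112], and a CUDA device is required because Utils moves data to the GPU):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Tiny fake dataset just to exercise the training loop shape-wise.
clips = torch.randn(8, 3, 16, 112, 112)
labels = torch.randint(0, 487, (8,))
trainloader = DataLoader(TensorDataset(clips, labels), batch_size=2)

model = C3D().cuda()
utils = Utils(total_steps=4)    # stop almost immediately for this smoke test
utils.train(model, trainloader)
utils.validate(model, trainloader)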
            
    
    