Basic Information
Title: Temporal Convolutional Networks: A Unified Approach to Action Segmentation
Paper: https://arxiv.org/abs/1608.08242#
Translation
Abstract
Para | Original text
---|---
1 | The dominant paradigm for video-based action segmentation is composed of two steps: first, for each frame, compute low-level features using Dense Trajectories or a Convolutional Neural Network that encode spatiotemporal information locally, and second, input these features into a classifier that captures high-level temporal relationships, such as a Recurrent Neural Network (RNN). While often effective, this decoupling requires specifying two separate models, each with their own complexities, and prevents capturing more nuanced long-range spatiotemporal relationships. We propose a unified approach, as demonstrated by our Temporal Convolutional Network (TCN), that hierarchically captures relationships at low-, intermediate-, and high-level time-scales. Our model achieves superior or competitive performance using video or sensor data on three public action segmentation datasets and can be trained in a fraction of the time it takes to train an RNN.
1. Introduction
Para | Original text
---|---
2 | Action segmentation is crucial for numerous applications ranging from collaborative robotics to modeling activities of daily living. Given a video, the goal is to simultaneously segment every action in time and classify each constituent segment. While recent work has shown strong improvements on this task, models tend to decouple low-level feature representations from high-level temporal models. Within video analysis, these low-level features may be computed by pooling handcrafted features (e.g. Improved Dense Trajectories (IDT)) or concatenating learned features (e.g. Spatiotemporal Convolutional Neural Networks (ST-CNN)) over a short period of time. High-level temporal classifiers capture a local history of these low-level features. In a Conditional Random Field (CRF), the action prediction at one time step is often a function of the prediction at the previous time step, and in a Recurrent Neural Network (RNN), the predictions are a function of a set of latent states at each time step, where the latent states are connected across time. This two-step paradigm has been around for decades and typically goes unquestioned. However, we posit that valuable information is lost between steps.
3 | In this work, we introduce a unified approach to action segmentation that uses a single set of computational mechanisms – 1D convolutions, pooling, and channel-wise normalization – to hierarchically capture low-, intermediate-, and high-level temporal information. For each layer, 1D convolutions capture how features at lower levels change over time, pooling enables efficient computation of long-range temporal patterns, and normalization improves robustness towards various environmental conditions. In contrast with RNN-based models, which compute a set of latent activations that are updated sequentially per-frame, we compute a set of latent activations that are updated hierarchically per-layer. As a byproduct, our model takes much less time to train. Our model can be viewed as a generalization of the recent ST-CNN and is more similar to recent models for semantic segmentation than it is to models for video-analysis. We show this approach is broadly applicable to video and other types of robot sensors.
4 | Due to space limitations, here we will only briefly describe models for time-series and semantic segmentation. See [8] for related work on action segmentation or [20] for a broader overview on action recognition.
5 | RNNs and CRFs are popular high-level temporal classifiers. RNN variations, including Long Short Term Memory (LSTM) and Gated Recurrent Units (GRU), model hidden temporal states via internal gating mechanisms. However, they are hard to introspect and difficult to correctly train. It has been shown that in practice LSTM only keeps a memory of about 4 seconds on some video-based action segmentation datasets. CRFs typically model pairwise transitions between the labels or latent states (e.g., [8]), which are easy to interpret, but over-simplify the temporal dynamics of complex actions. Both of these models suffer from the same fundamental issue: intermediate activations are typically a function of the low-level features at the current time step and the state at the previous time step. Our temporal convolutional filters are a function of raw data across a much longer period of time.
6 | Until recently, the dominant paradigm for semantic segmentation was similar to that of action segmentation. Approaches typically combined low-level texture features (e.g., TextonBoost) with high-level spatial models (e.g., grid-based CRFs) that model the relationships between different regions of an image. This is similar to action segmentation where low-level spatiotemporal features are used in tandem with high-level temporal models. Recently, with the introduction of Fully Convolutional Networks (FCNs), the dominant semantic segmentation paradigm has started to change. Long et al. [11] introduced the first FCN, which leverages typical classification CNNs like AlexNet, to compute per-pixel object labels. This is done by intelligently upsampling the intermediate activations in each region of an image. Our model is more similar to the recent encoder-decoder network by Badrinarayanan et al. [1]. Their encoder step uses the first half of a VGG-like network to capture patterns in different regions of an image and their decoder step takes the activations from the encoder, which are of a reduced image resolution, and uses convolutional filters to upsample back to the original image size. In subsequent sections we describe our temporal variation in detail.
Paragraphs 4 and 5 review how RNNs and CRFs work and point out their limitations, in particular their weakness at capturing spatiotemporal information over long time spans. Unlike these traditional methods, the TCN proposed by the authors handles long temporal ranges of data better through its temporal convolutional filters.
Paragraph 6 discusses how the semantic segmentation field has moved from traditional to newer approaches. The traditional pipeline resembles the action segmentation task: low-level features combined with a high-level model. With the introduction of Fully Convolutional Networks (FCNs), which label images at the pixel level, that paradigm began to change. The authors' model borrows a similar encoder-decoder structure but differs from the semantic segmentation work in that its core goal is to capture multi-scale temporal relationships in time-series data.
2. Temporal Convolutional Networks (TCN)
Figure 1: Our temporal encoder-decoder network hierarchically models actions from video or other time-series data.
Para | Original text
---|---
7 | The input to our Temporal Convolutional Network can be a sensor signal (e.g. accelerometers) or latent encoding of a spatial CNN applied to each frame. Let $X_t \in \mathbb{R}^{F_0}$ be the input feature vector of length $F_0$ for time step $t$ for $0 < t \leq T$. Note that the time $T$ may vary for each sequence, and we denote the number of time steps in each layer as $T_l$. The true action label for each frame is given by $y_t \in \{1, \dots, C\}$, where $C$ is the number of classes.
8 | Our encoder-decoder framework, as depicted in Figure 1, is composed of temporal convolutions, 1D pooling/upsampling, and channel-wise normalization layers.
9 | For each of the $L$ convolutional layers in the encoder, we apply a set of 1D filters that capture how the input signals evolve over the course of an action. The filters for each layer are parameterized by tensor $W^{(l)} \in \mathbb{R}^{F_l \times d \times F_{l-1}}$ and biases $b^{(l)} \in \mathbb{R}^{F_l}$, where $l \in \{1, \dots, L\}$ is the layer index, and $d$ is the filter duration. For the $l$-th layer of the encoder, the $i$-th component of the (unnormalized) activation $\hat{E}^{(l)}_t \in \mathbb{R}^{F_l}$ is a function of the incoming (normalized) activation matrix $E^{(l-1)} \in \mathbb{R}^{F_{l-1} \times T_{l-1}}$ from the previous layer:
 | $\hat{E}^{(l)}_{i,t} = f\left(b^{(l)}_i + \sum_{t'} \left\langle W^{(l)}_{i,t',\cdot},\ E^{(l-1)}_{\cdot,\ t+d-t'} \right\rangle \right)$
10 | for each time step $t$, where $f(\cdot)$ is a Leaky Rectified Linear Unit (Leaky ReLU). The normalization process is described below.
11 | Max pooling is applied with width 2 across time (in 1D) such that $T_l = \frac{1}{2} T_{l-1}$. Pooling enables us to efficiently compute activations over a long period of time.
12 | We apply channel-wise normalization after each pooling step in the encoder. This has been effective in recent CNN methods, including Trajectory-Pooled Deep-Convolutional Descriptors (TDD). We normalize the pooled activation vector $\hat{E}^{(l)}_t$ by the highest response at that time step, $m = \max_i \hat{E}^{(l)}_{i,t}$, with some small $\epsilon = 1 \times 10^{-5}$, such that:
 | $E^{(l)}_t = \frac{1}{m + \epsilon} \hat{E}^{(l)}_t$
13 | Our decoder is similar to the encoder, except that upsampling is used instead of pooling, and the order of the operations is now upsample, convolve, then normalize. Upsampling is performed by simply repeating each entry twice.
14 | The probability that frame $t$ corresponds to one of the $C$ action classes is predicted by vector $\hat{Y}_t \in [0, 1]^C$ using weight matrix $U \in \mathbb{R}^{C \times F_0}$ and bias $c \in \mathbb{R}^C$:
 | $\hat{Y}_t = \text{softmax}(U D^{(1)}_t + c)$
15 | We explored many other mechanisms, such as adding skip connections between layers, using different patterns of convolutional layers, and other normalization schemes. These helped at times and hurt in others. The aforementioned solution was superior in aggregate.
 | **Implementation Details**
16 | Each of the $L = 3$ layers has $F_l = \{32, 64, 96\}$ filters. Filter duration, $d$, is set as the mean segment duration for the shortest class from the training set. For example, $d = 10$ seconds for 50 Salads. Parameters of our model were learned using the cross entropy loss with Stochastic Gradient Descent and ADAM step updates. All models were implemented using Keras and TensorFlow.
17 | For each frame in our video experiments, the input, $X_t$, is the first fully connected layer computed in a spatial CNN trained solely on each dataset. We trained the model of [8], except instead of using Motion History Images (MHI) as input to the CNN, we concatenate the following for image $I_t$ at frame $t$: $[I_t,\ I_{t-d} - I_t,\ I_{t+d} - I_t,\ I_{t-2d} - I_t,\ I_{t+2d} - I_t]$ for $d = 0.5$ seconds. In our experiments, these difference images – which are a simple type of attention mechanism – tend to perform better than MHI or optical flow across these datasets. Furthermore, for each time step, we perform channel-wise normalization before feeding it into the TCN. This helps with large environmental fluctuations, such as changes in lighting.
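To make the architecture described in the table above concrete, here is a minimal Keras/TensorFlow sketch of such an encoder-decoder TCN. The paper states the model was implemented in Keras and TensorFlow, but this is not the authors' code: the layer widths (32, 64, 96) follow paragraph 16, while the kernel size of 25 frames, the `padding="same"` choice, and the decoder filter ordering are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models


def channel_norm(x, eps=1e-5):
    # Normalize each time step by its largest channel response: E_t = E_hat_t / (m + eps).
    m = tf.reduce_max(x, axis=-1, keepdims=True)
    return x / (m + eps)


def build_tcn(num_classes, in_channels, num_filters=(32, 64, 96), kernel_size=25):
    """Hypothetical encoder-decoder TCN; kernel_size stands in for the filter duration d."""
    inputs = layers.Input(shape=(None, in_channels))  # (time, features), variable length
    x = inputs

    # Encoder: temporal convolution -> Leaky ReLU -> max pool (width 2) -> channel-wise norm.
    for f in num_filters:
        x = layers.Conv1D(f, kernel_size, padding="same")(x)
        x = layers.LeakyReLU()(x)
        x = layers.MaxPooling1D(pool_size=2)(x)
        x = layers.Lambda(channel_norm)(x)

    # Decoder: upsample (repeat each entry twice) -> convolve -> normalize.
    for f in reversed(num_filters):
        x = layers.UpSampling1D(size=2)(x)
        x = layers.Conv1D(f, kernel_size, padding="same")(x)
        x = layers.LeakyReLU()(x)
        x = layers.Lambda(channel_norm)(x)

    # Per-frame class probabilities: Y_hat_t = softmax(U D_t + c).
    outputs = layers.TimeDistributed(layers.Dense(num_classes, activation="softmax"))(x)

    # Cross-entropy loss with Adam step updates, as in paragraph 16.
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```

Sequence lengths would need to be padded or cropped to a multiple of 2^3 = 8 so that the pooled and upsampled lengths line up; how the original implementation handles this is not described in the text above.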
In this section, the authors describe the architecture and working principles of their Temporal Convolutional Network (TCN) in detail:
Input features: the TCN accepts either sensor signals (e.g. accelerometer data) or the latent encoding of each video frame produced by a spatial CNN (a sketch of the video input preparation follows this list).
Encoder-decoder framework: the encoder processes the input with 1D convolutions and pooling, capturing spatiotemporal features of the sequence layer by layer; the decoder restores the temporal resolution through upsampling and convolution and predicts the action class of every frame.
Temporal convolution and pooling: 1D convolutional filters capture long-range temporal dependencies, and pooling makes it efficient to compute activations over long time spans.
Channel-wise normalization: after each pooling step, the TCN normalizes the activations across channels, improving robustness.
Training and prediction: the TCN predicts the action class at every time step with a fully connected layer and a softmax, and trains far more efficiently.
Compared with traditional RNN and CRF methods, the TCN does not rely on latent states that are updated step by step; instead, its hierarchical convolutions capture temporal information at multiple scales in a single pass.
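As a companion to the input-feature item above and to paragraph 17, the following sketch shows one plausible way to build the stacked difference images and the per-time-step channel-wise normalization applied before the TCN. The clamping at sequence boundaries and the conversion of d = 0.5 s into a frame offset are assumptions, not details taken from the paper.

```python
import numpy as np


def stacked_difference_images(frames, step):
    """Build [I_t, I_{t-d}-I_t, I_{t+d}-I_t, I_{t-2d}-I_t, I_{t+2d}-I_t] per frame.

    `frames` has shape (T, H, W, C); `step` is d expressed in frames
    (e.g. step = 15 for d = 0.5 s at an assumed 30 fps).
    """
    T = len(frames)
    out = []
    for t in range(T):
        I_t = frames[t].astype(np.float32)
        parts = [I_t]
        for off in (-step, step, -2 * step, 2 * step):
            k = min(max(t + off, 0), T - 1)  # clamp at the sequence boundaries
            parts.append(frames[k].astype(np.float32) - I_t)
        out.append(np.concatenate(parts, axis=-1))
    return np.stack(out)  # (T, H, W, 5 * C)


def channelwise_normalize(X, eps=1e-5):
    """Normalize each time step of a (T, F) feature matrix by its largest response."""
    m = X.max(axis=-1, keepdims=True)
    return X / (m + eps)
```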
3. Evaluation
Table 1: Results on 50 Salads, Georgia Tech Egocentric Activities (GTEA), and the JHU-ISI Gesture and Skill Assessment Working Set. Notes: (1) The results computed with VGG and Improved Dense Trajectories (IDT) intentionally contain no temporal component, for ablation analysis, hence their low edit scores. (2) We recomputed the results of [9] using the authors' public code to keep the setup consistent with [14].
Para | Original text
---|---
18 | We evaluate on three public datasets that contain action segmentation labels, video, and in two cases sensor data.
19 | University of Dundee 50 Salads [18] contains 50 sequences of users making a salad. Each video is 5-10 minutes in duration and contains around 30 action instances such as cutting a tomato or peeling a cucumber. This dataset includes video and synchronized accelerometers attached to ten objects in the scene, such as the bowl, knife, and plate. We performed cross validation with 5 splits on the “eval” action granularity which includes 10 action classes. Our sensor results used the features from [9] which are the absolute values of accelerometer values. Previous results (e.g., [9, 14]) were evaluated using different setups. For example, [9] smoothed out short interstitial background segments. We reran all results to be consistent with [14]. We also included an LSTM baseline for comparison which uses 64 hidden states.
20 | JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS) [5] was introduced to improve quantitative evaluation of robotic surgery training tasks. We used Leave One User Out cross validation on the suturing activity, which consists of 39 sequences performed by 8 users about 5 times each. The dataset includes video and synchronized robot kinematics (position, velocity, and gripper angle) for each robot end effector as well as corresponding action labels with 10 action classes. Sequences are a few minutes long and typically contain around 20 action instances.
21 | Georgia Tech Egocentric Activities (GTEA) [4] contains 28 videos of 7 kitchen activities including making a sandwich and making coffee. For each of the four subjects, there is one instance of each activity. The camera is mounted on the head of the user and is pointing at the area in front of them. On average there are about 30 actions per video and videos are around a minute long. We used the 11 action classes defined in [3] and evaluated using leave one user out. We show results for user 2 to be consistent with [3] and [16].
22 | Metrics: We evaluated using accuracy, which is simply the percent of correctly labeled frames, and segmental edit distance [9], which measures the correctness of the predicted temporal ordering of actions. This edit score is computed by applying the Levenshtein distance to the segmented predictions (e.g. AAABBA → ABA). This is normalized to be in the range 0 to 100 such that higher is better.
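Paragraph 22 defines the segmental edit score as the Levenshtein distance between the run-length-collapsed prediction and ground truth, normalized to the range 0 to 100. A small sketch of that computation is given below; normalizing by the longer of the two segment sequences is an assumption about the exact convention used in [9].

```python
import numpy as np


def segment_labels(frame_labels):
    """Collapse a frame-wise label sequence into its segment sequence, e.g. AAABBA -> ABA."""
    return [lab for i, lab in enumerate(frame_labels)
            if i == 0 or lab != frame_labels[i - 1]]


def levenshtein(a, b):
    """Standard Levenshtein (edit) distance between two segment sequences."""
    D = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    D[:, 0] = np.arange(len(a) + 1)
    D[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            D[i, j] = min(D[i - 1, j] + 1,         # deletion
                          D[i, j - 1] + 1,         # insertion
                          D[i - 1, j - 1] + cost)  # substitution
    return D[len(a), len(b)]


def edit_score(pred_frames, true_frames):
    """Segmental edit score in [0, 100]; higher is better."""
    p, t = segment_labels(pred_frames), segment_labels(true_frames)
    n = max(len(p), len(t))
    return 100.0 * (1.0 - levenshtein(p, t) / n) if n else 100.0
```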
4. Experiments and Discussion
Para | Original text
---|---
23 | Table 1 includes results for all datasets and corresponding sensing modalities. We include results from the spatial CNN which is input into the TCN, the Spatiotemporal CNN of Lea et al. [8] applied to the spatial features, and our TCN.
24 | One of the most interesting findings is that some layers of convolutional filters appear to learn temporal shifts. There are certain actions in each dataset which are not easy to distinguish given the sensor data. By visualizing the activations for each layer, we found our model surmounts this issue by learning temporal offsets from activations in the previous layer. In addition, we find that despite the fact that we do not use a traditional temporal model, such as an RNN or CRF, our predictions do not suffer as heavily from issues like over-segmentation. This is highlighted by the large increase in edit score on most experiments.
25 | Richard et al. [14] evaluated their model on the mid-level action granularity of 50 Salads which has 17 action classes. Their model achieved 54.2% accuracy, 44.8% edit, 0.379 mAP IoU overlap with a threshold of 0.1, and 0.229 mAP with a threshold of 0.5. Our model achieves 59.7% accuracy, 47.3% edit, 0.579 mAP at 0.1, and 0.378 mAP at 0.5.
26 | On GTEA, Singh et al. [16] reported 64.4% accuracy by performing cross validation on users 1 through 3. We achieve 62.5% using this setup. We found performance of our model has high variance between different trials on GTEA – even with the same hyper parameters – thus, the difference in accuracy is not likely to be statistically significant. Our approach could be used in tandem with features from Singh et al. to achieve superior performance.
27 | Our model can be trained much faster than an RNN-LSTM. Using an Nvidia Titan X, it takes on the order of a minute to train a TCN for each split, whereas it takes on the order of an hour to train an RNN-LSTM. The speedup comes from the fact that we compute one set of convolutions for each layer, whereas RNN-LSTM effectively computes one set of convolutions for each time step.
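Paragraph 25 reports mAP at IoU overlap thresholds of 0.1 and 0.5. The full mAP protocol follows [14] and is not spelled out here, but the sketch below shows the standard temporal IoU test that those thresholds refer to, under the assumption that segments are given as (start, end) frame indices.

```python
def segment_iou(seg_a, seg_b):
    """Temporal intersection-over-union between two (start, end) segments."""
    inter = max(0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0


def matches_at_threshold(pred_seg, gt_seg, threshold=0.1):
    """A predicted segment counts as a hit when its IoU with a same-class
    ground-truth segment reaches the threshold (0.1 or 0.5 in paragraph 25)."""
    return segment_iou(pred_seg, gt_seg) >= threshold
```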
- Results and comparisons: the authors compare their model with the baseline models (the spatial CNN and the spatiotemporal CNN); the results show that the proposed TCN substantially outperforms these baselines in both accuracy and edit score.
- Performance on GTEA: on the GTEA dataset, the TCN outperforms the ST-CNN in accuracy but is slightly behind in edit score.
- Training time: the TCN trains significantly faster than an RNN-LSTM, mainly because convolutions are computed once per layer rather than once per time step.
- Benefit of multi-layer temporal convolution: adding temporal convolutional layers to the spatial CNN (i.e., the ST-CNN) already improves performance markedly, but the TCN still performs better.
This part summarizes the experimental results of the different models and discusses the advantages of the TCN across the datasets, especially in training speed and accuracy.
Conclusion
Para | Original text
---|---
28 | We introduced a model for action segmentation that learns a hierarchy of intermediate feature representations, which contrasts with the traditional low- versus high-level paradigm. This model achieves competitive or superior performance on several datasets and can be trained much more quickly than other models. A future version of this manuscript will include more comparisons and insights on the TCN.
- Novelty: the proposed model learns a hierarchy of intermediate feature representations, breaking with the traditional separation of low-level and high-level modeling.
- Strengths: the model performs well on several datasets and trains much faster than traditional models such as RNNs and LSTMs.
- Future work: the authors plan to provide more comparisons and deeper analysis of the TCN in a future version of the paper.