Learned Video Compression




We present a new algorithm for video coding, learned end-to-end for the low-latency mode. In this setting, our approach outperforms all existing video codecs across nearly the entire bitrate range. To our knowledge, this is the first ML-based method to do so.


We evaluate our approach on standard video compression test sets of varying resolutions, and benchmark against all mainstream commercial codecs in the low-latency mode. On standard-definition videos, HEVC/H.265, AVC/H.264 and VP9 typically produce codes up to 60% larger than our algorithm. On high-definition 1080p videos, H.265 and VP9 typically produce codes up to 20% larger, and H.264 up to 35% larger. Furthermore, our approach does not suffer from blocking artifacts and pixelation, and thus produces videos that are more visually pleasing.


We propose two main contributions. The first is a novel architecture for video compression, which (1) generalizes motion estimation to perform any learned compensation beyond simple translations, (2) rather than strictly relying on previously transmitted reference frames, maintains a state of arbitrary information learned by the model, and (3) enables jointly compressing all transmitted signals (such as optical flow and residual).


Secondly, we present a framework for ML-based spatial rate control — a mechanism for assigning variable bitrates across space for each frame. This is a critical component for video coding, which to our knowledge had not been developed within a machine learning setting.


1. Introduction

Video content consumed more than 70% of all internet traffic in 2016, and is expected to grow threefold by 2021 [1]. At the same time, the fundamentals of existing video compression algorithms have not changed considerably over the last 20 years [46, 36, 35, . . . ]. While they have been very well engineered and thoroughly tuned, they are hard-coded, and as such cannot adapt to the growing demand and increasingly versatile spectrum of video use cases such as social media sharing, object detection, VR streaming, and so on.

Meanwhile, approaches based on deep learning have revolutionized many industries and research disciplines. In particular, in the last two years, the field of image compression has made large leaps: ML-based image compression approaches have been surpassing the commercial codecs by significant margins, and are still far from saturating to their full potential (survey in Section 1.3).


The prevalence of deep learning has further catalyzed the proliferation of architectures for neural network acceleration across a spectrum of devices and machines. This hardware revolution has been increasingly improving the performance of deployed ML-based technologies—rendering video compression a prime candidate for disruption.


In this paper, we introduce a new algorithm for video coding. Our approach is learned end-to-end for the low-latency mode, where each frame can only rely on information from the past. This is an important setting for live transmission, and constitutes a self-contained research problem and a stepping stone towards coding in its full generality. In this setting, our approach outperforms all existing video codecs across nearly the entire bitrate range.


We thoroughly evaluate our approach on standard datasets of varying resolutions, and benchmark against all modern commercial codecs in this mode. On standard-definition (SD) videos, HEVC/H.265, AVC/H.264 and VP9 typically produce codes up to 60% larger than our algorithm. On high-definition (HD) 1080p videos, H.265 and VP9 typically produce codes up to 20% larger, and H.264 up to 35% larger. Furthermore, our approach does not suffer from blocking artifacts and pixelation, and thus produces videos that are more visually pleasing (see Figure 1).
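
For readers more used to thinking in terms of bitrate savings, the "codes up to p larger" figures can be converted with a short calculation. The symbols R_base and R_ours (code sizes of a baseline codec and of our algorithm) are notation introduced here for illustration, and the conversion assumes the two codes are compared at matched reconstruction quality.

```latex
R_{\text{base}} = (1 + p)\, R_{\text{ours}}
\quad\Longrightarrow\quad
\text{saving} = 1 - \frac{R_{\text{ours}}}{R_{\text{base}}} = \frac{p}{1 + p}
% p = 0.60 gives 0.60 / 1.60 = 37.5%
% p = 0.35 gives 0.35 / 1.35, about 25.9%
% p = 0.20 gives 0.20 / 1.20, about 16.7%
```

Under this reading, a baseline code that is 60% larger corresponds to our code being roughly 37.5% smaller than the baseline.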


In Section 1.1, we provide a brief introduction to video coding in general. In Section 1.2, we proceed to describe our contributions. In Section 1.3 we discuss related work, and in Section 1.4 we provide an outline of this paper.


1.1. Video coding in a nutshell

1.1.1 Video frame types

Video codecs are designed for high compression efficiency, and achieve this by exploiting spatial and temporal redundancies within and across video frames ([51, 47, 36, 34] provide great overviews of commercial video coding techniques). Existing video codecs feature 3 types of frames:

1. I-frames ("intra-coded"), which are compressed using an image codec and do not depend on any other frames;

2. P-frames ("predicted"), which are extrapolated from frames in the past; and

3. B-frames ("bi-directional"), which are interpolated from previously transmitted frames in both the past and future.

While introducing B-frames enables higher coding efficiency, it increases the latency: to decode a given frame, future frames have to first be transmitted and decoded.
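
The latency cost of B-frames can be illustrated with a minimal Python sketch. It assumes a simplified dependency model that is not taken from any particular codec: frames are listed in display order, and every B-frame waits for the nearest following I- or P-frame. The function name `lookahead_delay` is hypothetical.

```python
from typing import List


def lookahead_delay(frame_types: str) -> List[int]:
    """Frame intervals each frame must wait for its future reference.

    Assumed toy model: `frame_types` is in display order, e.g. "IBBP",
    and a B-frame's future reference is the nearest later I- or P-frame.
    """
    delays = []
    for i, kind in enumerate(frame_types):
        if kind in ("I", "P"):
            # I- and P-frames never depend on future frames.
            delays.append(0)
        elif kind == "B":
            # Indices of later frames that can serve as a future reference.
            later = [j for j in range(i + 1, len(frame_types))
                     if frame_types[j] in ("I", "P")]
            if not later:
                raise ValueError("B-frame at index %d has no future reference" % i)
            delays.append(later[0] - i)
        else:
            raise ValueError("unknown frame type: %r" % kind)
    return delays


print(lookahead_delay("IPPP"))  # low-latency pattern: no frame waits
print(lookahead_delay("IBBP"))  # B-frames wait for the following P-frame
```

In the low-latency mode only patterns of the first kind are allowed, so no frame has to wait for later frames to be transmitted.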


1.1.2 Compression procedure

In all modern video codecs, P-frame coding is invariably accomplished via two separate steps: (1) motion compensation, followed by (2) residual compression.

Motion compensation.
The goal of this step is to leverage temporal redundancy in the form of translations. This is done via block-matching (overview at [30]), which reconstructs the current target, say x_t for time step t, from a handful of previously transmitted reference frames. Specifically, different blocks in the target are compared to ones within the reference frames across a range of possible displacements, and the displacement of the best-matching candidate is assigned to the block. These displacements can be represented as an optical flow map f_t, and block-matching can be written as a special case of the flow estimation problem (see Section 1.3). In order to minimize the bandwidth required to transmit the flow f_t and reduce the complexity of the search, the flows are applied uniformly over large spatial blocks (all pixels in a block share one displacement), and discretized to a precision of half/quarter/eighth-pixel.

Residual compression.
Following motion compensation, the leftover difference between the target and its motion-compensated approximation m_t is then compressed. This difference Δ_t = x_t − m_t is known as the residual, and is independently encoded with an image compression algorithm adapted to the sparsity of the residual.
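
The two steps above can be sketched in pure Python as follows. This is only an illustration under stated assumptions, not the procedure of any specific codec: it uses a single reference frame, integer-pixel displacements, an exhaustive search, and the sum of absolute differences (SAD) as the matching cost. The names `sad` and `block_match` are hypothetical, and the frame dimensions are assumed to be multiples of the block size.

```python
from typing import Dict, List, Tuple

Frame = List[List[int]]


def sad(target: Frame, ref: Frame, y: int, x: int,
        dy: int, dx: int, bs: int) -> int:
    """Sum of absolute differences between a target block and a displaced reference block."""
    return sum(abs(target[y + i][x + j] - ref[y + dy + i][x + dx + j])
               for i in range(bs) for j in range(bs))


def block_match(target: Frame, ref: Frame, bs: int = 2, search: int = 1
                ) -> Tuple[Dict[Tuple[int, int], Tuple[int, int]], Frame, Frame]:
    """Return (flow per block, compensated frame m_t, residual x_t - m_t)."""
    h, w = len(target), len(target[0])
    flow = {}
    m = [[0] * w for _ in range(h)]
    for y in range(0, h, bs):
        for x in range(0, w, bs):
            best = None
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    # Skip candidates that fall outside the reference frame.
                    if not (0 <= y + dy and y + dy + bs <= h
                            and 0 <= x + dx and x + dx + bs <= w):
                        continue
                    cost = sad(target, ref, y, x, dy, dx, bs)
                    if best is None or cost < best[0]:
                        best = (cost, dy, dx)
            _, dy, dx = best  # (0, 0) is always in bounds, so best is set
            flow[(y, x)] = (dy, dx)
            # All pixels of the block are copied with the same displacement.
            for i in range(bs):
                for j in range(bs):
                    m[y + i][x + j] = ref[y + dy + i][x + dx + j]
    residual = [[target[i][j] - m[i][j] for j in range(w)] for i in range(h)]
    return flow, m, residual


ref = [[10 * y + x for x in range(4)] for y in range(4)]
target = [[row[0]] + row[:3] for row in ref]  # content shifted right by one pixel
flow, m, residual = block_match(target, ref, bs=2, search=1)
print(flow[(0, 2)])  # displacement chosen for the top-right block
print(residual)      # what would be handed to the residual coder
```

By construction the target is recovered exactly as the compensated frame plus the residual; a real codec would additionally quantize and entropy-code both the flow and the residual.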

1.2. Contributions

This paper presents several novel contributions to video codec design, and to ML modeling of compression:

Compensation beyond translation.
Traditional codecs are constrained to predicting temporal patterns strictly in the form of motion. However, there exists significant redundancy that cannot be captured via simple translations. Consider, for example, an out-of-plane rotation such as a person turning their head sideways. Traditional codecs will not be able to predict a profile face from a frontal view. In contrast, our system is able to learn arbitrary spatio-temporal patterns, and thus propose more accurate predictions, leading to bitrate savings.


Propagation of a learned state.
In traditional codecs all "prior knowledge" propagated from frame to frame is expressed strictly via reference frames and optical flow maps, both embedded in raw pixel space. These representations are very limited in the class of signals they may characterize, and moreover cannot capture long-term memory. In contrast, we propagate an arbitrary state autonomously learned by the model to maximize information retention.

Joint compression of motion and residual.
Each codec must fundamentally decide how to distribute bandwidth among motion and residual. However, the optimal tradeoff between these is different for each frame. In traditional methods, the motion and residual are compressed separately, and there is no easy way to trade them off. Instead, we jointly compress the compensation and residual signals using the same bottleneck. This allows our network to reduce redundancy by learning how to distribute the bitrate among them as a function of frame complexity.

Flexible motion field representation.
In traditional codecs, optical flow is represented with a hierarchical block structure where all pixels within a block share the same motion. Moreover, the motion vectors are quantized to a particular sub-pixel resolution. While this representation is chosen because it can be compressed efficiently, it does not capture complex and fine motion. In contrast, our algorithm has the full flexibility to distribute the bandwidth so that areas that matter more have arbitrarily sophisticated motion boundaries at an arbitrary flow precision, while unimportant areas are represented very efficiently. See comparisons in Figure 2.

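
To make the constraints of the traditional representation concrete, the sketch below quantizes a motion vector to a sub-pixel grid and counts how many vectors a frame needs when blocks share one vector. It is a simplified illustration: it assumes a fixed block grid rather than a hierarchical one, the 16-pixel block size is an arbitrary example value, and the names `quantize_mv` and `num_motion_vectors` are hypothetical.

```python
import math
from typing import Tuple


def quantize_mv(dy: float, dx: float, denom: int = 4) -> Tuple[float, float]:
    """Round a motion vector to the nearest multiple of 1/denom pixel.

    denom = 2, 4, 8 corresponds to half-, quarter-, eighth-pixel precision.
    Exact ties follow Python's round-half-to-even behaviour.
    """
    return (round(dy * denom) / denom, round(dx * denom) / denom)


def num_motion_vectors(height: int, width: int, block: int) -> int:
    """Number of vectors when every block x block tile shares one vector."""
    return math.ceil(height / block) * math.ceil(width / block)


print(quantize_mv(0.3, -1.6, denom=4))      # quarter-pixel grid
print(num_motion_vectors(1080, 1920, 16))   # one vector per 16x16 block
print(num_motion_vectors(1080, 1920, 1))    # one vector per pixel
```

The gap between the last two counts shows why block-shared motion is cheap to transmit, and also why it cannot follow fine motion boundaries.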
Multi-flow representation.
Consider a video of a train moving behind fine branches of a tree. Such a scene is highly inefficient to represent with traditional systems that use a single flow map, as there are small occlusion patterns that break the flow. Furthermore, the occluded content will have to be synthesized again once it reappears. We propose a representation that allows our method the flexibility to decompose a complex scene into a mixture of multiple simple flows and preserve occluded content.

Spatial rate control.
It is critical for any video compression approach to feature a mechanism for assigning different bitrates at different spatial locations for each frame. In ML-based codec modeling, it has been challenging to construct a single model which supports R multiple bitrates, and achieves the same results as R separate, individual models each trained exclusively for one of the bitrates. In this work we present a framework for ML-driven spatial rate control which meets this requirement.

1.3. Related Work

ML-based image compression.
In the last two years, we have seen a great surge of ML-based image compression approaches [15, 44, 45, 5, 4, 14, 25, 43, 38, 23, 2, 27, 6, 3, 10, 32, 33]. These learned approaches have been reinventing many of the hard-coded techniques developed in traditional image coding: the coding scheme, transformations into and out of a learned codespace, quality assessment, and so on.





