Paper Reading: A Reinforcement Learning Algorithm for HEVC Video Encoder Control

一、Source of the Paper

This post discusses the paper "Reinforcement Learning for Video Encoder Control in HEVC". Paper link: original link (it loads slowly and is prone to errors); an alternative download is also provided: share link.

二、Main Content

The paper proposes a reinforcement-learning-based algorithm for optimizing encoder-side control in HEVC: the encoder's decision process is first analyzed and modeled, and the resulting problem is then solved with reinforcement learning.

1. Key Concepts

① episodes

We view the encoding procedure as a sequence of recurring decision episodes, each covering the search for one CU description.

The paper views the encoding procedure as a sequence of recurring decision episodes, each corresponding to the search for one CU description.

② operations and features

In each episode, a discrete number of operations $v_t$ is available, each of which calculates a vector of features $x_t$.

In each episode a set of operations $v_t$ is available, and each operation computes a feature vector $x_t$.

The operations and associated features are listed in Table.

[Table: the encoder operations and their associated features]

The features are defined as follows.

  • R, D
    R is the description length in bits and D is the sum of squared differences (SSD) between the original and reconstructed pixels.
  • H, V
    The sum of squared differences between original pixels and their neighbors in the horizontal (H) and vertical (V) directions, within the CU.
  • Cbf
    The RQT-root-flag from HEVC syntax. It codes whether the CU has residual coefficient values other than zero or not.
  • Esd
    A flag equal to the "early skip detection" condition.
  • Spl
    A vector of two values calculated from the 8 × 8 sub-block distortions $d_{k,i,j}$ of all tested merge candidates. With $i$, $j$ indexing an 8 × 8 sub-block for candidate $k$, the two values are $SPL^{(0)} = \min_k \sum_{i,j} d_{k,i,j}$ and $SPL^{(1)} = \sum_{i,j} \min_k d_{k,i,j}$. The idea is that their difference is large when there are many different motion regions, which may be indicative of a split decision (see the sketch after this list).
③ decision domain

A variable z = (picType, QP, level, blockSize), where picType ∈ {0, 1} stands for an intra- (1) or inter-predicted (0) picture, QP ∈ {0…51} is the quantization parameter, level ∈ {0…3} is the temporal level in the GOP-8 (group of 8 pictures) temporal prediction structure, and blockSize ∈ {8, 16, 32, 64} is the width of the current CU.

The agent will use the variable z to distinguish between a set of decision domains.

As the feature distribution is expected to strongly depend on z, the idea is to train a separate (expert) policy $\pi_z$ for each domain. At the same time, the role of these domains is to present borders to information flow, meaning that all observations made in a domain will only be used to make decisions within that domain.

$z = (picType, QP, level, blockSize)$: the agent uses the variable z to distinguish between the decision domains.

Since the feature distribution depends heavily on z, a separate expert policy is trained per domain.
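
To make the domain idea concrete, here is a minimal sketch of per-domain expert policies keyed by z; the `Domain` tuple and `get_policy` helper are illustrative names, not the paper's implementation:

```python
from collections import namedtuple

# z = (picType, QP, level, blockSize) identifies one decision domain
Domain = namedtuple("Domain", ["picType", "QP", "level", "blockSize"])

# one separate expert policy pi_z per domain; observations from one domain are
# only ever used for decisions inside that same domain
policies = {}

def get_policy(picType, QP, level, blockSize):
    z = Domain(picType, QP, level, blockSize)
    # lazily create an empty parameter set (node -> theta_s) for an unseen domain
    return policies.setdefault(z, {})

pi_z = get_policy(picType=0, QP=32, level=2, blockSize=32)  # inter picture, 32x32 CU
```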

④ policy structure

The policy structure we use is best described by a binary tree as depicted in Figure. Starting from the root node $s_0$, each level of the tree represents a point in time at which the policy chooses between two actions $a_0$ and $a_1$.

[Figure: binary-tree policy structure]

The paper models the policy as a binary tree. Each level of the tree represents one point in time, i.e., one operation, and only two actions $a_0$ and $a_1$ can be taken: $a_0$ means skipping the operation associated with the current node, and $a_1$ means executing it.

Assuming the decision domain has $H$ encoder operations, the tree has height $H$ and each tree level, or epoch $t \in \{0 \dots H-1\}$, corresponds to one operation $v_t$.

If a decision domain has $H$ operations, the corresponding tree has height $H$ and each level represents one operation $v_t$.

A policy now consists of a set of binary classifiers $g_s(x)$, one for each node $s$ in the tree. The input $x$ to a classifier at $t$ is simply the vector of all features output by operations executed in previous epochs.

With this tree model, the policy is a collection of binary classifiers $g_s(x)$, one per tree node, and the input $x$ of each classifier is the concatenation of all feature vectors output by the operations executed in earlier epochs.

A decision function $g_s$ is defined by a hyperplane in feature space, where $\theta_s$ is the vector of learnable parameters:
$$g_s(x) = \begin{cases} a_0, & \text{if } \theta_s^T x < 0 \\ a_1, & \text{if } \theta_s^T x \geq 0 \end{cases}$$

Each node makes its decision via the rule above, where $\theta_s$ is a learnable parameter vector.
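
A small sketch of the hyperplane decision rule and a walk down one episode's tree. The bias-only starting feature, the bit-string node ids, and the `run_episode` helper are assumptions made for illustration:

```python
import numpy as np

def g(theta_s, x):
    """Hyperplane decision at node s: return 1 (a1, execute) iff theta_s^T x >= 0."""
    return 1 if float(np.dot(theta_s, x)) >= 0 else 0

def run_episode(thetas, operations):
    """Walk the policy tree: at epoch t decide whether to execute operation v_t.

    thetas:     dict mapping a node id (bit-string of past actions) -> parameter vector
    operations: list of callables; each returns a feature vector when executed
    """
    node, x = "", np.ones(1)               # root; a bias-only input is an assumption
    actions = []
    for op in operations:
        theta = thetas.get(node, np.zeros(len(x)))   # unseen nodes default to "execute"
        a = g(theta, x)
        actions.append(a)
        if a == 1:                         # a1: run v_t and append its features to x
            x = np.concatenate([x, np.atleast_1d(op())])
        node += str(a)                     # descend to the chosen child
    return actions

# toy usage: three operations, each returning one scalar feature
print(run_episode({}, [lambda: 1.0, lambda: -2.0, lambda: 0.5]))
```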

⑤ optimization objective

The goal of reinforcement learning is to maximize expected future reward, which our learning algorithm estimates by an empirical reward on a set $\mathcal{M}$ of $M$ training examples. We say that each example creates a tree $T_i$, as each example assigns different input data and reward to the binary tree. The overall quantity our algorithm seeks to maximize is the sum over these rewards:
$$\hat{V}^\pi(s_0) = \frac{1}{M}\sum_i R(\pi, T_i)$$

As the equation above shows, the reward is estimated in a Monte Carlo fashion by sampling $M$ examples.

The $i$th example sampled from the encoder has data $(x_{i,t}, j_{i,t}, c_{i,t})$ with $t \in \{0 \dots H-1\}$. Using RD-cost values $j$ and complexity values $c$, we define a reward function $R(i, s, a)$ giving the immediate reward the agent receives for executing action $a$ at node $s$ in tree $T_i$.

Values $c_{i,t}$ are execution durations measured in microseconds by the function gettimeofday() from the GNU C library (sys/time.h). Values $j_{i,t}$ are RD-cost values for operations that provide a description.

[Equations: definition of the reward $R(i, s, a)$ and its components $R_j$ and $R_c$]

As shown above, the reward for taking action $a$ at node $s$ of example $i$'s tree is $R(i,s,a) = R_j(i,s,a) + \mu R_c(i,s,a)$.

The first term $R_j(i,s,a)$ is the RD-cost component of the reward, $R_c(i,s,a)$ in the second term is the complexity component, and $\mu$ trades off the two.

With this reward function, $R(\pi, T_i) = \sum_k R(i, s_k, a_k)$, where $s_k$ and $a_k$ represent the path taken by $\pi$ through $T_i$.
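
A hedged sketch of the objective: immediate rewards $R = R_j + \mu R_c$ are summed along the path the policy takes through each example's tree and then averaged over the $M$ examples. The per-node components $R_j$ and $R_c$ are taken as given here (their exact definitions are in the paper's equations above):

```python
def path_reward(path, r_j, r_c, mu):
    """R(pi, T_i): sum of immediate rewards along the path (s_k, a_k) taken by the policy.

    path: list of (node, action) pairs visited in tree T_i
    r_j:  dict (node, action) -> RD-cost reward component R_j(i, s, a)
    r_c:  dict (node, action) -> complexity reward component R_c(i, s, a)
    """
    return sum(r_j[(s, a)] + mu * r_c[(s, a)] for s, a in path)

def empirical_value(paths, r_j_all, r_c_all, mu):
    """V_hat^pi(s0): average of the path rewards over the M training examples."""
    M = len(paths)
    return sum(path_reward(p, r_j_all[i], r_c_all[i], mu) for i, p in enumerate(paths)) / M
```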

2. Algorithm Procedure

(1) update

[Algorithm 1: the UPDATE procedure]

We use an algorithm (Algorithm 1) which updates the classifiers one by one, each time assuming all other classifiers are fixed.

The update function refines the classifiers one step at a time. Its inputs are:

  • the current policy $\pi$
  • the subset $L_s$ of examples used to update $g_s$, controlled by the calling procedure
  • the full set of training examples $\mathcal{M}$
  • $\alpha$, which controls how fast the classifiers are updated
  • $\mu$, which trades off RD-cost against complexity in the reward

The outer loop iterates over the levels of the current tree and the inner loop over all states of the current level; inside the loops the function works as follows:

  • $\mathbf{X}$ collects, over all trees in $E_s$, the features produced by the operations executed before the current level.
  • $\mathbf{r}_0$ collects, over all trees in $E_s$, the reward obtained by taking $a_0$ in the current state.
  • $\mathbf{r}_1$ collects, over all trees in $E_s$, the reward obtained by taking $a_1$ in the current state.


  • The CLR routine returns an updated hyperplane $\theta'_s$ for the current node.
  • $\theta_s$ is then moved towards $\theta'_s$ at the update rate $\alpha$ passed in by the caller (see the sketch below).
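
A sketch of the update sweep described above: level by level, node by node, gather the visiting examples' features and both actions' rewards, ask CLR for a new hyperplane, and move $\theta_s$ towards it at rate $\alpha$. The convex-combination update and the `gather` helper are assumptions for illustration:

```python
import itertools
import numpy as np

def nodes_at_level(t):
    """All binary-tree nodes at depth t, encoded as bit-strings of the actions taken so far."""
    return ["".join(bits) for bits in itertools.product("01", repeat=t)]

def update(policy, examples, H, alpha, mu, clr, gather):
    """One sweep over all node classifiers, each updated while the others are held fixed.

    policy:   dict node -> theta_s (numpy vector)
    examples: the example subset L_s used for this update
    clr:      base learner returning a new theta_s from (X, r0, r1)
    gather:   helper returning (X, r0, r1) for the examples that reach node s
    """
    for t in range(H):                            # outer loop: tree level / epoch
        for s in nodes_at_level(t):               # inner loop: states at this level
            X, r0, r1 = gather(examples, s, mu)   # features plus rewards of a0 and a1
            theta_new = np.asarray(clr(X, r0, r1), dtype=float)
            old = policy.get(s, np.zeros_like(theta_new))
            # move theta_s towards the CLR result at rate alpha (assumed update form)
            policy[s] = (1 - alpha) * old + alpha * theta_new
    return policy
```
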
(2) CLR

[Algorithm: the CLR procedure]

  • The function first samples from the examples in $E_s$, using $\omega = \mid r_1 - r_0 \mid$ as the sampling probability.
  • The sampled examples are then passed to the base learner LR, which produces the updated hyperplane $\theta'_s$.

For the base-learner LR, we use logistic regression with the cross-entropy loss function and a quadratic regularization term.

[Equation: the regularized cross-entropy objective of the base learner LR]
Here $h_{\theta}(x)$ has the form of a sigmoid function (see below), mapping the feature input into the interval (0, 1) so that the cross-entropy loss can be computed:
$$g(s) = \frac{1}{1+e^{-s}}$$
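
A minimal sketch of the CLR step under these assumptions: examples are kept with probability proportional to $\mid r_1 - r_0 \mid$, labels are the better of the two actions, and the surviving examples are fit with L2-regularized logistic regression by plain gradient descent (the normalization of the sampling weights and the optimizer are my own choices, not the paper's):

```python
import numpy as np

def clr(X, r0, r1, lam=1e-2, steps=200, lr=0.5, seed=0):
    """Sketch of CLR: reward-weighted sampling followed by regularized logistic regression.

    X:      (n, d) feature matrix of the examples reaching this node
    r0, r1: (n,) rewards of actions a0 and a1 for each example
    """
    rng = np.random.default_rng(seed)
    w = np.abs(r1 - r0) + 1e-12
    keep = rng.random(len(w)) < w / w.max()      # sampling prob ~ |r1 - r0| (assumed scaling)
    if not keep.any():
        return np.zeros(X.shape[1])
    Xs = X[keep]
    ys = (r1 > r0)[keep].astype(float)           # label 1 (a1) where a1 is the better action

    theta = np.zeros(X.shape[1])
    for _ in range(steps):                       # gradient descent on cross-entropy + L2
        p = 1.0 / (1.0 + np.exp(-Xs @ theta))
        grad = Xs.T @ (p - ys) / len(ys) + lam * theta
        theta -= lr * grad
    return theta
```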

(3) Overall algorithm

[Algorithm: the overall training procedure]
The procedure loops over different blockSize, QP, and level settings, uses Enc to sample trajectories under the current policy, and stores the sampled trajectory data in $\mathcal{M}_z$. It then iterates over the sets $\mathcal{M}_z$ and calls Learn to update the policies.

Learn starts with $L_s = \{1..M\}$ and then gradually shrinks $L_s$ so that each node's classifier is trained only on the trajectories that actually visit that node.
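
A rough sketch of the outer training loop as described above; `enc` and `learn` are placeholders for the paper's Enc and Learn procedures, and the grouping of sampled trajectories by domain is assumed:

```python
def improve(policies, sequences, qps, alpha, mu, enc, learn):
    """Sketch of the outer loop: sample with the current policies, then update per domain.

    enc(seq, qp, policies) is assumed to run the encoder under the current policies and
    return the sampled decision trees grouped by decision domain z.
    """
    data = {}                                    # z -> collected training trees M_z
    for seq in sequences:
        for qp in qps:
            for z, trees in enc(seq, qp, policies).items():
                data.setdefault(z, []).extend(trees)
    for z, M_z in data.items():                  # Learn-style per-domain policy update
        policies[z] = learn(policies.get(z), M_z, alpha, mu)
    return policies
```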

(4) RD-cost versus complexity trade-off

In order to realize operating points with different RD-cost versus complexity trade-offs, we find suitable values of $\mu$ on a grid defined by:
$$\bar{\mu}_k = 2^{(k-60)/3}$$
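
For intuition: $k = 60$ gives $\bar{\mu} = 1$, and every step of 3 in $k$ doubles $\bar{\mu}$, so the grid spans several orders of magnitude. A trivial check:

```python
# mu grid: k = 60 -> 1.0, and each +3 in k doubles mu_bar
mu_bar = [2 ** ((k - 60) / 3) for k in range(76)]     # k in {0..75}
print(mu_bar[0], mu_bar[60], mu_bar[63], mu_bar[75])  # ~9.5e-7, 1.0, 2.0, 32.0
```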

We start with an initial policy $\Pi_0$ that ignores all features and executes all operations ($g_s(x) = a_1$).

The algorithm starts from the initial policy $\Pi_0$, which ignores all features and executes every operation. Training with $\bar{\mu}_0$ yields a policy $\Pi_{\epsilon}$ whose RD-cost is $J_{z,\epsilon}$.

In a second step, we collect data with $\Pi_\epsilon$ like in procedure IMPROVE, but without updating the policy. For each obtained $\mathcal{M}_z$, we run LEARN$(\mathcal{M}_z, \mu_k)$ for $k \in \{0..75\}$. For each $z$, $k$, we store the RD-cost $J_{z,k} = D_{z,k} + \lambda_z R_{z,k}$ incurred by the final $\pi_z$ on $\mathcal{M}_z$.

Running LEARN$(\mathcal{M}_z, \mu_k)$ for $k \in \{0..75\}$ on the data collected with $\Pi_{\epsilon}$ yields the costs $J_{z,k}$, from which the final $\mu$ values are selected using the following equations.

[Equations: selection of $\mu_z$ from the costs $J_{z,k}$ and the target delta rate $\Delta R$]
Here $\Delta R$ is a given target delta rate.

三、Results

[Figure/Table: experimental results of the proposed method]
