Paper Reading: A Reinforcement Learning Algorithm for HEVC Video Encoder Control

一、Source of the Paper

This post discusses the paper "Reinforcement Learning for Video Encoder Control in HEVC". Paper link: original link (it loads slowly and is prone to errors); an alternative download is also provided: share link.

二、Main Content

The paper proposes a reinforcement-learning-based algorithm for optimizing encoder-side control in HEVC: the encoder's decision process is first analyzed and modeled, and the resulting problem is then solved with reinforcement learning.

1. Key Concepts

① episodes

We view the encoding procedure as a sequence of recurring decision episodes, each covering the search for one CU description.

The paper views the encoding procedure as a sequence of recurring decision episodes, each corresponding to the search for one CU description.

② operations and features

In each episode, a discrete number of operations $v_t$ is available, each of which calculates a vector of features $x_t$.

In each episode a set of operations $v_t$ is available, and each operation computes a feature vector $x_t$.

The operations and associated features are listed in Table.

[Table: the encoder operations and their associated features]

The features are defined as follows.

  • R, D
    R is the description length in bits and D is the sum of squared differences (SSD) between the original and reconstructed pixels.
  • H, V
    The sum of squared differences between original pixels and their neighbors in the horizontal (H) and vertical (V) directions, within the CU.
  • Cbf
    The RQT-root-flag from HEVC syntax. It codes whether the CU has residual coefficient values other than zero or not.
  • Esd
    A flag equal to the "early skip detection" condition.
  • Spl
    A vector of two values calculated from the 8 × 8 sub-block distortions $d_{k,i,j}$ of all tested merge candidates. With $i$, $j$ indexing an 8 × 8 sub-block for candidate $k$, the two values are $SPL^{(0)} = \min_k \sum_{i,j} d_{k,i,j}$ and $SPL^{(1)} = \sum_{i,j} \min_k d_{k,i,j}$. The idea is that their difference is large when there are many different motion regions, which may be indicative of a split decision (see the sketch after this list).
③ decision domain

A variable z = (picType, QP, level, blockSize), where picType ∈ {0, 1} stands for an intra- (1) or inter-predicted (0) picture, QP ∈ {0…51} is the quantization parameter, level ∈ {0…3} is the temporal level in the GOP-8 (group of 8 pictures) temporal prediction structure, and blockSize ∈ {8, 16, 32, 64} is the width of the current CU.

The agent will use the variable z to distinguish between a set of decision domains.

As the feature distribution is expected to strongly depend on z, the idea is to train a separate (expert) policy $\pi_z$ for each domain. At the same time, the role of these domains is to present borders to information flow, meaning that all observations made in a domain will only be used to make decisions within that domain.

$z = (picType, QP, level, blockSize)$: the agent uses the variable z to distinguish between the decision domains.

Since the feature distribution depends heavily on z, a separate expert policy is trained per domain.
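
To make the domain idea concrete, here is a minimal sketch of per-domain expert policies keyed by z; the `Domain` tuple and `get_policy` helper are illustrative names, not the paper's implementation:

```python
from collections import namedtuple

# z = (picType, QP, level, blockSize) identifies one decision domain
Domain = namedtuple("Domain", ["picType", "QP", "level", "blockSize"])

# one separate expert policy pi_z per domain; observations from one domain are
# only ever used for decisions inside that same domain
policies = {}

def get_policy(picType, QP, level, blockSize):
    z = Domain(picType, QP, level, blockSize)
    # lazily create an empty parameter set (node -> theta_s) for an unseen domain
    return policies.setdefault(z, {})

pi_z = get_policy(picType=0, QP=32, level=2, blockSize=32)  # inter picture, 32x32 CU
```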

④ policy structure

The policy structure we use is best described by a binary tree as depicted in Figure. Starting from the root node $s_0$, each level of the tree represents a point in time at which the policy chooses between two actions $a_0$ and $a_1$.

[Figure: binary-tree policy structure]

The paper models the policy as a binary tree. Each level of the tree represents one point in time, i.e., one operation, and only two actions $a_0$ and $a_1$ can be taken: $a_0$ means skipping the operation associated with the current node, and $a_1$ means executing it.

Assuming the decision domain has $H$ encoder operations, the tree has height $H$ and each tree level, or epoch $t \in \{0 \dots H-1\}$, corresponds to one operation $v_t$.

If a decision domain has $H$ operations, the corresponding tree has height $H$ and each level represents one operation $v_t$.

A policy now consists of a set of binary classifiers $g_s(x)$, one for each node $s$ in the tree. The input $x$ to a classifier at $t$ is simply the vector of all features output by operations executed in previous epochs.

With this tree model, the policy is a collection of binary classifiers $g_s(x)$, one per tree node, and the input $x$ of each classifier is the concatenation of all feature vectors output by the operations executed in earlier epochs.

A decision function $g_s$ is defined by a hyperplane in feature space, where $\theta_s$ is the vector of learnable parameters:
$$g_s(x) = \begin{cases} a_0, & \text{if } \theta_s^T x < 0 \\ a_1, & \text{if } \theta_s^T x \geq 0 \end{cases}$$

Each node makes its decision via the rule above, where $\theta_s$ is a learnable parameter vector.
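
A small sketch of the hyperplane decision rule and a walk down one episode's tree. The bias-only starting feature, the bit-string node ids, and the `run_episode` helper are assumptions made for illustration:

```python
import numpy as np

def g(theta_s, x):
    """Hyperplane decision at node s: return 1 (a1, execute) iff theta_s^T x >= 0."""
    return 1 if float(np.dot(theta_s, x)) >= 0 else 0

def run_episode(thetas, operations):
    """Walk the policy tree: at epoch t decide whether to execute operation v_t.

    thetas:     dict mapping a node id (bit-string of past actions) -> parameter vector
    operations: list of callables; each returns a feature vector when executed
    """
    node, x = "", np.ones(1)               # root; a bias-only input is an assumption
    actions = []
    for op in operations:
        theta = thetas.get(node, np.zeros(len(x)))   # unseen nodes default to "execute"
        a = g(theta, x)
        actions.append(a)
        if a == 1:                         # a1: run v_t and append its features to x
            x = np.concatenate([x, np.atleast_1d(op())])
        node += str(a)                     # descend to the chosen child
    return actions

# toy usage: three operations, each returning one scalar feature
print(run_episode({}, [lambda: 1.0, lambda: -2.0, lambda: 0.5]))
```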

⑤ optimization objective

The goal of reinforcement learning is to maximize expected future reward, which our learning algorithm estimates by an empirical reward on a set $\mathcal{M}$ of $M$ training examples. We say that each example creates a tree $T_i$, as each example assigns different input data and reward to the binary tree. The overall quantity our algorithm seeks to maximize is the sum over these rewards:
$$\hat{V}^\pi(s_0) = \frac{1}{M}\sum_i R(\pi, T_i)$$

As the equation above shows, the reward is estimated in a Monte Carlo fashion by sampling $M$ examples.

The $i$th example sampled from the encoder has data $(x_{i,t}, j_{i,t}, c_{i,t})$ with $t \in \{0 \dots H-1\}$. Using RD-cost values $j$ and complexity values $c$, we define a reward function $R(i, s, a)$ giving the immediate reward the agent receives for executing action $a$ at node $s$ in tree $T_i$.

Values $c_{i,t}$ are execution durations measured in microseconds by the function gettimeofday() from the GNU C library (sys/time.h). Values $j_{i,t}$ are RD-cost values for operations that provide a description.

[Equations: definition of the reward $R(i, s, a)$ and its components $R_j$ and $R_c$]

As shown above, the reward for taking action $a$ at node $s$ of example $i$'s tree is $R(i,s,a) = R_j(i,s,a) + \mu R_c(i,s,a)$.

The first term $R_j(i,s,a)$ is the RD-cost component of the reward, $R_c(i,s,a)$ in the second term is the complexity component, and $\mu$ trades off the two.

With this reward function, $R(\pi, T_i) = \sum_k R(i, s_k, a_k)$, where $s_k$ and $a_k$ represent the path taken by $\pi$ through $T_i$.
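
A hedged sketch of the objective: immediate rewards $R = R_j + \mu R_c$ are summed along the path the policy takes through each example's tree and then averaged over the $M$ examples. The per-node components $R_j$ and $R_c$ are taken as given here (their exact definitions are in the paper's equations above):

```python
def path_reward(path, r_j, r_c, mu):
    """R(pi, T_i): sum of immediate rewards along the path (s_k, a_k) taken by the policy.

    path: list of (node, action) pairs visited in tree T_i
    r_j:  dict (node, action) -> RD-cost reward component R_j(i, s, a)
    r_c:  dict (node, action) -> complexity reward component R_c(i, s, a)
    """
    return sum(r_j[(s, a)] + mu * r_c[(s, a)] for s, a in path)

def empirical_value(paths, r_j_all, r_c_all, mu):
    """V_hat^pi(s0): average of the path rewards over the M training examples."""
    M = len(paths)
    return sum(path_reward(p, r_j_all[i], r_c_all[i], mu) for i, p in enumerate(paths)) / M
```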

2. Algorithm Procedure

(1) update

[Algorithm 1: the UPDATE procedure]

We use an algorithm (Algorithm 1) which updates the classifiers one by one, each time assuming all other classifiers are fixed.

The update function refines the classifiers one step at a time. Its inputs are:

  • the current policy $\pi$
  • the subset $L_s$ of examples used to update $g_s$, controlled by the calling procedure
  • the full set of training examples $\mathcal{M}$
  • $\alpha$, which controls how fast the classifiers are updated
  • $\mu$, which trades off RD-cost against complexity in the reward

The outer loop iterates over the levels of the current tree and the inner loop over all states of the current level; inside the loops the function works as follows:

  • $\mathbf{X}$ collects, over all trees in $E_s$, the features produced by the operations executed before the current level.
  • $\mathbf{r}_0$ collects, over all trees in $E_s$, the reward obtained by taking $a_0$ in the current state.
  • $\mathbf{r}_1$ collects, over all trees in $E_s$, the reward obtained by taking $a_1$ in the current state.


  • The CLR routine returns an updated hyperplane $\theta'_s$ for the current node.
  • $\theta_s$ is then moved towards $\theta'_s$ at the update rate $\alpha$ passed in by the caller (see the sketch below).
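
A sketch of the update sweep described above: level by level, node by node, gather the visiting examples' features and both actions' rewards, ask CLR for a new hyperplane, and move $\theta_s$ towards it at rate $\alpha$. The convex-combination update and the `gather` helper are assumptions for illustration:

```python
import itertools
import numpy as np

def nodes_at_level(t):
    """All binary-tree nodes at depth t, encoded as bit-strings of the actions taken so far."""
    return ["".join(bits) for bits in itertools.product("01", repeat=t)]

def update(policy, examples, H, alpha, mu, clr, gather):
    """One sweep over all node classifiers, each updated while the others are held fixed.

    policy:   dict node -> theta_s (numpy vector)
    examples: the example subset L_s used for this update
    clr:      base learner returning a new theta_s from (X, r0, r1)
    gather:   helper returning (X, r0, r1) for the examples that reach node s
    """
    for t in range(H):                            # outer loop: tree level / epoch
        for s in nodes_at_level(t):               # inner loop: states at this level
            X, r0, r1 = gather(examples, s, mu)   # features plus rewards of a0 and a1
            theta_new = np.asarray(clr(X, r0, r1), dtype=float)
            old = policy.get(s, np.zeros_like(theta_new))
            # move theta_s towards the CLR result at rate alpha (assumed update form)
            policy[s] = (1 - alpha) * old + alpha * theta_new
    return policy
```
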
(2) CLR

[Algorithm: the CLR procedure]

  • The function first samples from the examples in $E_s$, using $\omega = \mid r_1 - r_0 \mid$ as the sampling probability.
  • The sampled examples are then passed to the base learner LR, which produces the updated hyperplane $\theta'_s$.

For the base-learner LR, we use logistic regression with the cross-entropy loss function and a quadratic regularization term.

[Equation: the regularized cross-entropy objective of the base learner LR]
Here $h_{\theta}(x)$ has the form of a sigmoid function (see below), mapping the feature input into the interval (0, 1) so that the cross-entropy loss can be computed:
$$g(s) = \frac{1}{1+e^{-s}}$$
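
A minimal sketch of the CLR step under these assumptions: examples are kept with probability proportional to $\mid r_1 - r_0 \mid$, labels are the better of the two actions, and the surviving examples are fit with L2-regularized logistic regression by plain gradient descent (the normalization of the sampling weights and the optimizer are my own choices, not the paper's):

```python
import numpy as np

def clr(X, r0, r1, lam=1e-2, steps=200, lr=0.5, seed=0):
    """Sketch of CLR: reward-weighted sampling followed by regularized logistic regression.

    X:      (n, d) feature matrix of the examples reaching this node
    r0, r1: (n,) rewards of actions a0 and a1 for each example
    """
    rng = np.random.default_rng(seed)
    w = np.abs(r1 - r0) + 1e-12
    keep = rng.random(len(w)) < w / w.max()      # sampling prob ~ |r1 - r0| (assumed scaling)
    if not keep.any():
        return np.zeros(X.shape[1])
    Xs = X[keep]
    ys = (r1 > r0)[keep].astype(float)           # label 1 (a1) where a1 is the better action

    theta = np.zeros(X.shape[1])
    for _ in range(steps):                       # gradient descent on cross-entropy + L2
        p = 1.0 / (1.0 + np.exp(-Xs @ theta))
        grad = Xs.T @ (p - ys) / len(ys) + lam * theta
        theta -= lr * grad
    return theta
```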

(3) Overall algorithm

[Algorithm: the overall training procedure]
The procedure loops over different blockSize, QP, and level settings, uses Enc to sample trajectories under the current policy, and stores the sampled trajectory data in $\mathcal{M}_z$. It then iterates over the sets $\mathcal{M}_z$ and calls Learn to update the policies.

Learn starts with $L_s = \{1..M\}$ and then gradually shrinks $L_s$ so that each node's classifier is trained only on the trajectories that actually visit that node.
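
A rough sketch of the outer training loop as described above; `enc` and `learn` are placeholders for the paper's Enc and Learn procedures, and the grouping of sampled trajectories by domain is assumed:

```python
def improve(policies, sequences, qps, alpha, mu, enc, learn):
    """Sketch of the outer loop: sample with the current policies, then update per domain.

    enc(seq, qp, policies) is assumed to run the encoder under the current policies and
    return the sampled decision trees grouped by decision domain z.
    """
    data = {}                                    # z -> collected training trees M_z
    for seq in sequences:
        for qp in qps:
            for z, trees in enc(seq, qp, policies).items():
                data.setdefault(z, []).extend(trees)
    for z, M_z in data.items():                  # Learn-style per-domain policy update
        policies[z] = learn(policies.get(z), M_z, alpha, mu)
    return policies
```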

(4) RD-cost versus complexity trade-off

In order to realize operating points with different RD-cost versus complexity trade-offs, we find suitable values of $\mu$ on a grid defined by:
$$\bar{\mu}_k = 2^{(k-60)/3}$$
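
For intuition: $k = 60$ gives $\bar{\mu} = 1$, and every step of 3 in $k$ doubles $\bar{\mu}$, so the grid spans several orders of magnitude. A trivial check:

```python
# mu grid: k = 60 -> 1.0, and each +3 in k doubles mu_bar
mu_bar = [2 ** ((k - 60) / 3) for k in range(76)]     # k in {0..75}
print(mu_bar[0], mu_bar[60], mu_bar[63], mu_bar[75])  # ~9.5e-7, 1.0, 2.0, 32.0
```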

We start with an initial policy $\Pi_0$ that ignores all features and executes all operations ($g_s(x) = a_1$).

The algorithm starts from the initial policy $\Pi_0$, which ignores all features and executes every operation. Training with $\bar{\mu}_0$ yields a policy $\Pi_{\epsilon}$ whose RD-cost is $J_{z,\epsilon}$.

In a second step, we collect data with $\Pi_\epsilon$ like in procedure IMPROVE, but without updating the policy. For each obtained $\mathcal{M}_z$, we run LEARN$(\mathcal{M}_z, \mu_k)$ for $k \in \{0..75\}$. For each $z$, $k$, we store the RD-cost $J_{z,k} = D_{z,k} + \lambda_z R_{z,k}$ incurred by the final $\pi_z$ on $\mathcal{M}_z$.

Running LEARN$(\mathcal{M}_z, \mu_k)$ for $k \in \{0..75\}$ on the data collected with $\Pi_{\epsilon}$ yields the costs $J_{z,k}$, from which the final $\mu$ values are selected using the following equations.

[Equations: selection of $\mu_z$ from the costs $J_{z,k}$ and the target delta rate $\Delta R$]
Here $\Delta R$ is a given target delta rate.

三、Results

[Figure/Table: experimental results of the proposed method]
