rANS: A Fast, Asymptotically Optimal Code

山登绝顶我为峰 3(^v^)3

Last modified: 2025-03-30 17:07:06

First published: 2025-03-26 17:22:09

Article link: https://blog.csdn.net/weixin_44885334/article/details/146535740

References:

[Huff52] D. A. Huffman, “A Method for the Construction of Minimum-Redundancy Codes,” in Proceedings of the IRE, vol. 40, no. 9, pp. 1098-1101, Sept. 1952.
[Duda09] Jarek Duda. Asymmetric numeral systems. CoRR abs/0902.0271 (2009).
[Duda13] Jarek Duda. Asymmetric numeral systems: entropy coding combining speed of huffman coding with compression rate of arithmetic coding, 2013. ArXiv preprint, available at https://arxiv.org/abs/1311.2540.
熵编码算法ANS (The ANS entropy coding algorithm) - Zhihu
Lossless Compression with Asymmetric Numeral Systems

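There are two mainstream methods for compressing a symbol sequence with a known probability distribution:

Huffman coding: approximates the probabilities by powers of two (a binary tree). It is fast, but the compression rate is poor unless the source probabilities happen to be exact powers of two;
Arithmetic/range coding: uses the exact probabilities, so the compression rate easily approaches the Shannon entropy bound, but it is slow.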
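
As a rough worked example (the distribution is chosen only for illustration and does not come from the references), take two symbols with $q_0 = 0.9$ and $q_1 = 0.1$:
$H = -0.9\log_2 0.9 - 0.1\log_2 0.1 \approx 0.469 \text{ bits per symbol}$
A Huffman code has to spend at least one whole bit on every symbol, so it uses $1$ bit per symbol here, more than twice the entropy. An arithmetic or range coder can get close to $0.469$ bits per symbol on long sequences.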

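[Duda09] proposed Asymmetric numeral systems (ANS), an asymptotically optimal code (the average code length approaches the Shannon entropy) that is also faster than Huffman coding.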

ANS

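The abstract idea of ANS is as follows. Let $[n]$ be a symbol alphabet of size $n$ and let $q_s, s \in [n]$ be its probability distribution. Take a sufficiently large integer $x \in \N$ and view it as the interval $x=\{0,1,\cdots,x-1\} \subseteq \N$. Partition it according to $q_s$ into $n$ disjoint subsets $S_s \subseteq x$ of size about $q_sx$ each, with $\cup_s S_s = x$. Then sampling uniformly from $x$ is equivalent to first sampling $s$ according to $q_s$ and then sampling uniformly from $S_s$. Hence there is a bijection $x \iff (s,S_s)$, which can be used to encode a sequence of symbols $s$.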

The function $D_1: \N \to [n],\, x \mapsto s$ is called the distributing function. It determines the partition of the set $x$,
$S_s := \{y\in x \mid D_1(y)=s\}$
We then define $x_s := |S_s|$ and $D_2(x) := x_{D_1(x)}$

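The decoding function then maps a state $x \in \N$ to the symbol $s \in [n]$ with which $D_1$ labels it, and moves the state to $x_s$ accordingly,
$D(x) := (D_1(x), D_2(x)) = (s,x_s)$
It is a bijection (enumerating $\N$ from low to high according to $D_1$, $x$ is the position labelled $s$ that is preceded by exactly $x_s$ positions labelled $s$), and the encoding function is its inverse,
$C(s,x_s) := x,\,\,\, \text{s.t.}\,\,\, (D_1(x)=s) \wedge (|S_s|=x_s)$
We need to find a suitable function $D_1$ such that $x_s \approx q_s x$ always holds. Then $\log C(s,x_s) - \log x_s \approx \log(1/q_s)$, the information content of $s$, so the expected growth of the state per symbol is the Shannon entropy $\sum_s q_s \log(1/q_s)$. This is what gives asymptotic optimality.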
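
The abstract definitions of $D$ and $C$ can be sketched by brute force. The block below is only an illustration: the labeling `D1` (symbol 1 at positions 3, 6, 9 of every block of 10, approximating $q_0=0.7, q_1=0.3$) and the helper names `count_below`, `decode_step`, `encode_step` are assumptions made for this example, not taken from [Duda09].

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical distributing function for n = 2, q_0 = 0.7, q_1 = 0.3:
// in every block of 10 integers, positions 3, 6 and 9 carry symbol 1.
int D1(uint64_t y) {
    uint64_t r = y % 10;
    return (r == 3 || r == 6 || r == 9) ? 1 : 0;
}

// x_s = |S_s|: the number of y < x with D1(y) == s (brute-force count).
uint64_t count_below(uint64_t x, int s) {
    uint64_t c = 0;
    for (uint64_t y = 0; y < x; y++)
        if (D1(y) == s) c++;
    return c;
}

// D(x) = (D1(x), x_{D1(x)})
std::pair<int, uint64_t> decode_step(uint64_t x) {
    int s = D1(x);
    return std::make_pair(s, count_below(x, s));
}

// C(s, xs): the unique x labelled s that has exactly xs earlier positions labelled s.
uint64_t encode_step(int s, uint64_t xs) {
    uint64_t seen = 0;
    for (uint64_t x = 0;; x++) {
        if (D1(x) != s) continue;
        if (seen == xs) return x;
        seen++;
    }
}
```

For instance, `decode_step(1000)` yields symbol 0 with $x_s = 700$, so $x_s / x = 0.7 = q_0$, and `encode_step(0, 700)` maps back to 1000. Real ANS variants replace the linear scans with closed formulas or tables.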

Range variants

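[Duda09] first proposed a uniform variant (uANS), in which the function $D_1$ spreads each symbol $s$ almost uniformly over $\N$. Its compression rate is very good, but it is relatively slow to compute. The original figure illustrated the case $n=2, q_0=0.7, q_1=0.3$.

[Figure: uANS symbol layout for $n=2, q_0=0.7, q_1=0.3$; image not available]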

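Later, [Duda13] proposed rANS, in which the function $D_1$ places each symbol $s$ in one contiguous range. Its approximation of the probabilities is less precise than that of uANS, but it needs less computation.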

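Suppose $(g_0,g_1,\cdots,g_{n-1}) \in \N^n$ is a good approximation of the distribution $q_s$ (scaled by a factor of $B := \sum_s g_s$). Partition the array $\{0,1,\cdots,B-1\}$ consecutively into intervals of size $g_s$. Define the cumulative distribution $\text{CDF}(s) := \sum_{i=0}^{s-1} g_i$, with initial value $\text{CDF}(0)=0$. Then the range $[\text{CDF}(s),\text{CDF}(s+1))$ is decoded as $s$.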

Define the function $\text{symbol}: \Z \to [n]$, where $[x]_B$ denotes $x \bmod B$,
$\text{symbol}(x) = s \iff \text{CDF}(s) \le [x]_B < \text{CDF}(s+1)$
The encoding function is:
$C(s,x) := \left\lfloor\frac{x}{g_s}\right\rfloor \cdot B + [x]_{g_s} + \text{CDF}(s) \in \N$
The decoding function, with $s = \text{symbol}(x)$, is:
$D(x) := \left(\text{symbol}(x),\, \left\lfloor\frac{x}{B}\right\rfloor\cdot g_s + [x]_B - \text{CDF}(s)\right) \in [n] \times \N$
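
The two formulas can be tried out directly on machine integers. The following is a minimal sketch under assumed toy parameters ($n=3$, $B=16$, $g=(8,5,3)$); the helper names `encode_all` and `decode_all` are hypothetical, and the decoder is given the symbol count explicitly so that it knows when to stop. A 64-bit state only survives short messages, since each step multiplies the state by roughly $B/g_s$.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model, assumed for illustration: n = 3, B = 16, g = (8, 5, 3).
const uint64_t B = 16;
const uint64_t g[3] = {8, 5, 3};
const uint64_t CDF[4] = {0, 8, 13, 16}; // CDF[s] = g[0] + ... + g[s-1]

// symbol(x) = s  <=>  CDF[s] <= x mod B < CDF[s+1]
int symbol(uint64_t x) {
    uint64_t r = x % B;
    int s = 0;
    while (r >= CDF[s + 1]) s++;
    return s;
}

// C(s, x) = floor(x / g_s) * B + (x mod g_s) + CDF(s)
uint64_t C(int s, uint64_t x) {
    return (x / g[s]) * B + (x % g[s]) + CDF[s];
}

// D(x): stores symbol(x) in s and returns floor(x / B) * g_s + (x mod B) - CDF(s)
uint64_t D(uint64_t x, int &s) {
    s = symbol(x);
    return (x / B) * g[s] + (x % B) - CDF[s];
}

// Encode a short message into one integer, starting from state x0.
uint64_t encode_all(const std::vector<int> &msg, uint64_t x0) {
    uint64_t x = x0;
    for (int s : msg) x = C(s, x);
    return x;
}

// Decode `count` symbols; they come out last-in first-out, so fill from the back.
std::vector<int> decode_all(uint64_t x, std::size_t count) {
    std::vector<int> out(count);
    for (std::size_t i = count; i-- > 0;) {
        int s;
        x = D(x, s);
        out[i] = s;
    }
    return out;
}
```

With these parameters, encoding the message `{0, 0, 1, 2, 0}` from state 0 passes through the states 0, 0, 8, 47, 87, and `decode_all(87, 5)` returns the original message.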
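While programming I noticed that if the initial state is $x = 0$ and the sequence starts with several symbols $s = 0$, then $C(s, x) = 0$ every time, so the decoder cannot tell how many leading zeros there were, which is a bug. Do other values of $x$ have a similar problem? With $\text{CDF}(0) = 0$, the formula gives $C(0,x) = x$ for every $x < g_0$, so any initial state below $g_0$ is affected in the same way. Setting $\sum_s g_s \le B-1$ and $\text{CDF}(0) = 1$ seems to solve the problem: then $C(s,x) \ge x+1$, the state $0$ is never produced by encoding, and the decoder can stop when it gets back to $0$ (but is this really correct?)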
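
The effect can be reproduced with a few lines. This is a sketch under assumed toy parameters ($B=16$, $g=(8,4,3)$, so that $\sum_s g_s = B-1$); the tables `CDF0` and `CDF1` and the helpers `encode_from_zero` and `decode_until_zero` are hypothetical names introduced for this example. It does not prove the workaround in general, it only shows it working on small inputs.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

const uint64_t B = 16;
const uint64_t g[3] = {8, 4, 3};           // sums to B - 1
const uint64_t CDF0[4] = {0, 8, 12, 15};   // CDF(0) = 0: C(0, x) == x for x < g[0]
const uint64_t CDF1[4] = {1, 9, 13, 16};   // CDF(0) = 1: residue 0 is left unused

uint64_t C(int s, uint64_t x, const uint64_t *cdf) {
    return (x / g[s]) * B + (x % g[s]) + cdf[s];
}

uint64_t D(uint64_t x, int &s, const uint64_t *cdf) {
    uint64_t r = x % B;
    s = 0;
    while (s < 2 && r >= cdf[s + 1]) s++;
    return (x / B) * g[s] + r - cdf[s];
}

// Encode with the shifted table, starting from state 0.
uint64_t encode_from_zero(const std::vector<int> &msg) {
    uint64_t x = 0;
    for (int s : msg) x = C(s, x, CDF1);
    return x;
}

// Decode without a symbol count: stop when the state is back at 0.
// Only meaningful for states produced by encode_from_zero.
std::vector<int> decode_until_zero(uint64_t x) {
    std::vector<int> rev;
    while (x != 0) {
        int s;
        x = D(x, s, CDF1);
        rev.push_back(s);
    }
    return std::vector<int>(rev.rbegin(), rev.rend());
}
```

With `CDF0`, `C(0, 5, CDF0)` is still 5, so the state does not record the symbol. With `CDF1`, the message `{0, 0, 0, 1, 2}` encodes to 77 and `decode_until_zero(77)` recovers all three leading zeros. The price is one unused slot out of $B$, which costs a little compression.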

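If we choose $B = 2^t$ and set $g_s \approx q_sB$, then the operations $x\cdot B,\, x/B,\, [x]_B$ above can all be implemented quickly with shifts and masks. However, operations such as $x/g_s,\, x\cdot g_s,\, [x]_{g_s}$ still have to be carried out on the big integer $x$, which is comparatively slow.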
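
A minimal sketch of the shift-and-mask form, assuming $t = 12$ and a state that fits in 64 bits; the names `C_shift` and `D_shift` are hypothetical, and the caller is assumed to pass $g_s$ and $\text{CDF}(s)$ for the symbol in question.

```cpp
#include <cassert>
#include <cstdint>

const unsigned T = 12;                  // B = 2^T (assumed value)
const uint64_t B = uint64_t(1) << T;

uint64_t mul_B(uint64_t x) { return x << T; }       // x * B
uint64_t div_B(uint64_t x) { return x >> T; }       // floor(x / B)
uint64_t mod_B(uint64_t x) { return x & (B - 1); }  // x mod B

// C(s, x) with the multiplication by B done as a shift.
// The division and remainder by g_s remain ordinary integer operations.
uint64_t C_shift(uint64_t x, uint64_t gs, uint64_t cdf_s) {
    return mul_B(x / gs) + (x % gs) + cdf_s;
}

// State update of D(x), for a symbol s that has already been looked up.
uint64_t D_shift(uint64_t x, uint64_t gs, uint64_t cdf_s) {
    return div_B(x) * gs + mod_B(x) - cdf_s;
}
```

For example, with $g_s = 1000$ and $\text{CDF}(s) = 500$, `C_shift(123456, 1000, 500)` gives 504764 and `D_shift(504764, 1000, 500)` gives back 123456.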

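rANS is also an asymptotically optimal code. It satisfies:
$\sum_{s \in [n]} q_s \log(C(s,x)) \le \log x + \sum_{s \in [n]} q_s \log\frac{B}{g_s} + \frac{B}{x}$
As long as $x$ is much larger than $B$ and $g_s/B$ is a good approximation of $q_s$, the middle term is close to $\sum_s q_s \log(1/q_s)$, so the average code length approaches the Shannon entropy.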
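
To get a feel for how close $\sum_s q_s \log_2(B/g_s)$ is to the entropy, one can simply compute both. This is a small sketch; the function names `entropy_bits` and `model_cost_bits` and the sample numbers in the note are assumptions for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Shannon entropy of q in bits per symbol.
double entropy_bits(const std::vector<double> &q) {
    double h = 0.0;
    for (double p : q)
        if (p > 0.0) h -= p * std::log2(p);
    return h;
}

// Leading term of the rANS cost per symbol: sum_s q_s * log2(B / g_s).
double model_cost_bits(const std::vector<double> &q,
                       const std::vector<unsigned> &g, unsigned B) {
    double c = 0.0;
    for (std::size_t s = 0; s < q.size(); s++)
        c += q[s] * std::log2(static_cast<double>(B) / g[s]);
    return c;
}
```

For $q = (0.7, 0.3)$ with $B = 16$ and $g = (11, 5)$, the two values differ by well under a hundredth of a bit, and the gap shrinks further as $B$ grows and $g_s/B$ tracks $q_s$ more closely.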

Stream Encoding

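For an infinite symbol sequence, the bit length of the state $x \in \N$ also grows without bound, which degrades computational efficiency. [Duda09] proposed stream encoding, which confines the state to a fixed interval. This improves computational efficiency, at the cost of a slightly worse compression rate.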

We restrict the state $x$ to the range $I := \{l,l+1,\cdots,lb-1\}$; clearly $|I| = (b-1)l$.

[Figure: the three rules of stream encoding; image not available]

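Using the three rules above,

When encoding,
1. Set the initial value $x = 0$,
2. Encode with $C(s, x)$ until $x$ is about to become $\ge lb$ (this can be tested with new_x = C(s,x) before assigning x = new_x)
3. Output $[x]_b$ and update $x \gets \lfloor x/b\rfloor$, so that now $x < l$. Then go back to Step 2, which yields $x \in I$
4. At the end, output the final state $x \in I$
When decoding,
1. Read the most recent state $x \in I$
2. Decode with $D(x)$ until $x < l$ (this corresponds to the point where the encoder crossed the upper boundary $bl$ of $I$)
3. Read a data block $d$ of size $b$ and update $x \gets xb+d$, so that now $x \in I$. Then go back to Step 2
4. When the input stream is empty, keep decoding until $x = 0$ (back to the initial value), terminate, and reverse the decoded result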
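
The steps above can be sketched on 64-bit integers. This is a minimal illustration, not the big-integer implementation given later: it assumes $B = b = 2^8$, $l = 2^{16}$, a three-symbol model with $\sum_s g_s = B-1$ and $\text{CDF}(0)=1$ so that the decoder can stop at $x=0$, and the names `Encoded`, `encode` and `decode` are hypothetical. For this choice ($B = b$ and $b \mid l$) one can check that the re-encoded state in Step 3 always lands in $I$; other parameter choices may need a different renormalization rule.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

const uint64_t B = 256, b = 256, l = uint64_t(1) << 16; // I = [l, l*b)
const uint64_t g[3] = {179, 51, 25};                     // sums to B - 1
const uint64_t CDF[4] = {1, 180, 231, 256};              // CDF(0) = 1

uint64_t C(int s, uint64_t x) { return (x / g[s]) * B + (x % g[s]) + CDF[s]; }

uint64_t D(uint64_t x, int &s) {
    uint64_t r = x % B;
    s = 0;
    while (s < 2 && r >= CDF[s + 1]) s++;
    return (x / B) * g[s] + r - CDF[s];
}

struct Encoded {
    std::vector<uint8_t> blocks; // emitted base-b digits, oldest first
    uint64_t final_state;        // state left over at the end
};

Encoded encode(const std::vector<int> &msg) {
    Encoded e;
    uint64_t x = 0;                              // Step 1
    for (int s : msg) {
        uint64_t nx = C(s, x);                   // Step 2: tentative update
        if (nx >= l * b) {                       // would leave I
            e.blocks.push_back(static_cast<uint8_t>(x % b)); // Step 3: emit [x]_b
            x /= b;                              // now x < l
            nx = C(s, x);                        // back inside I
        }
        x = nx;
    }
    e.final_state = x;                           // Step 4
    return e;
}

std::vector<int> decode(const Encoded &e) {
    std::vector<int> rev;
    uint64_t x = e.final_state;                  // Step 1
    std::size_t pos = e.blocks.size();
    while (true) {
        while (x >= l) {                         // Step 2: decode while inside I
            int s;
            x = D(x, s);
            rev.push_back(s);
        }
        if (pos == 0) break;                     // input stream is empty
        x = x * b + e.blocks[--pos];             // Step 3: refill from the newest block
    }
    while (x != 0) {                             // Step 4: unwind to the initial state
        int s;
        x = D(x, s);
        rev.push_back(s);
    }
    return std::vector<int>(rev.rbegin(), rev.rend());
}
```

Calling `decode(encode(msg))` returns `msg` for any sequence over the symbols 0, 1, 2. The blocks are consumed in the reverse of the order in which they were written, which is why the decoded symbols have to be reversed at the end.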

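To achieve asymptotic optimality, $l$ needs to be much larger than $B$. To improve computational efficiency, $l$ and $b$ can also both be set to powers of two, in particular with $w \mid \log b$ (here $w = 8$ or $32$, depending on the basic arithmetic unit of the big number $x \in \N$)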

Implementation

First we need to implement a big-integer class INT ourselves. Here I build it on an array uint32_t *data,

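#include <cstdint> // uint32_t, int32_t, int64_t
#include <cstdlib> // malloc, free
#include <cstring> // memcpy
#include <vector>
using std::vector;

const int32_t ANS_N = 32;                                      // sum_s g(s) < 2^{ANS_N}
const int ANS_STATE_MINLEN = 2;                                // l = 2^{ANS_N * ANS_STATE_MINLEN}
const int ANS_STORE_LEN = 1;                                   // b = 2^{ANS_N * ANS_STORE_LEN}
const int ANS_STATE_MAXLEN = ANS_STATE_MINLEN + ANS_STORE_LEN; // I = [l, bl-1]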

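struct INT // unsigned big integer in base 2^32
{
    uint32_t *data = nullptr;
    int32_t len = 0;

    INT(int len); // constructor

    INT(const INT &x); // deep copy

    ~INT(); // destructor

    int size(); // current size (counted in 32-bit words in the code below)

    void resize(int len); // reallocate memory

    void clear(); // set to zero

    void add(uint32_t x); // addition

    void sub(uint32_t x); // subtraction

    void mul(uint32_t x); // multiplication

    uint32_t div(uint32_t x); // floor division in place, returns the remainder

    void shift(int i); // shift left by whole words (multiply by 2^32 per word); negative values shift right
};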
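
The member functions are only declared above. As a rough idea of what the word-level arithmetic involves, here is a sketch on a `std::vector<uint32_t>` of little-endian base-$2^{32}$ digits. The type alias `Big` and the free functions `add_small`, `mul_small`, `div_small` and `shift_words` are hypothetical stand-ins, not the author's INT implementation; subtraction is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Little-endian base-2^32 digits; an empty vector represents 0.
using Big = std::vector<uint32_t>;

// Drop leading zero words so that size() is the number of significant words.
void trim(Big &a) {
    while (!a.empty() && a.back() == 0) a.pop_back();
}

void add_small(Big &a, uint32_t x) {
    uint64_t carry = x;
    for (std::size_t i = 0; i < a.size() && carry != 0; i++) {
        uint64_t t = static_cast<uint64_t>(a[i]) + carry;
        a[i] = static_cast<uint32_t>(t);
        carry = t >> 32;
    }
    if (carry != 0) a.push_back(static_cast<uint32_t>(carry));
}

void mul_small(Big &a, uint32_t x) {
    uint64_t carry = 0;
    for (std::size_t i = 0; i < a.size(); i++) {
        uint64_t t = static_cast<uint64_t>(a[i]) * x + carry;
        a[i] = static_cast<uint32_t>(t);
        carry = t >> 32;
    }
    if (carry != 0) a.push_back(static_cast<uint32_t>(carry));
    trim(a);
}

// Floor division in place by x != 0; returns the remainder (schoolbook long division).
uint32_t div_small(Big &a, uint32_t x) {
    uint64_t rem = 0;
    for (std::size_t i = a.size(); i-- > 0;) {
        uint64_t cur = (rem << 32) | a[i];
        a[i] = static_cast<uint32_t>(cur / x);
        rem = cur % x;
    }
    trim(a);
    return static_cast<uint32_t>(rem);
}

// k > 0: multiply by 2^(32k); k < 0: floor-divide by 2^(32|k|).
void shift_words(Big &a, int k) {
    if (k > 0 && !a.empty()) {
        a.insert(a.begin(), static_cast<std::size_t>(k), static_cast<uint32_t>(0));
    } else if (k < 0) {
        std::size_t d = static_cast<std::size_t>(-k);
        if (d >= a.size()) a.clear();
        else a.erase(a.begin(), a.begin() + static_cast<std::ptrdiff_t>(d));
    }
}
```

Each operation touches every word once, so the cost of one encoding or decoding step grows linearly with the length of the state. This is the overhead that the stream encoding keeps bounded.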

The main rANS implementation is as follows:

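// Note: in these functions the parameter B is the number of symbols n; the base of the text is fixed to 2^32.
void ANS_Build_G(uint32_t *G, const double *freq, const int B)
{
    for (int s = 0; s < B; s++)
        G[s] = static_cast<uint32_t>(0xffffffffUL * freq[s]); // Scale to 2^32 - 1 and round down; every G[s] must be nonzero
}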

void ANS_Build_CDF(uint32_t *CDF, const uint32_t *G, const int B)
{
    CDF[0] = 1; // Handle the case where the first symbol is s = 0. Otherwise, there will be a bug.
    for (int s = 1; s < B; s++)
        CDF[s] = CDF[s - 1] + G[s - 1]; // CDF[s] = Pr[x < s]
}

uint32_t ANS_Symbol(uint32_t x, const uint32_t *CDF, const int B)
{
    // Binary search for the largest M in [0, B - 1] with CDF[M] <= x.
    // Values below CDF[0] map to 0. Only CDF[0..B-1] is read, so the
    // table needs no extra entry CDF[B].
    int64_t L = 0, R = B - 1;

    while (L < R)
    {
        int64_t M = (L + R + 1) / 2;
        if (CDF[M] <= x)
            L = M;
        else
            R = M - 1;
    }
    return L;
}

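// Encoding step: x <- floor(x / g_s) * 2^32 + (x mod g_s) + CDF(s)
void ANS_C(INT &x, const uint32_t s, const uint32_t *G, const uint32_t *CDF, const int B)
{
    uint32_t gs = G[s];
    uint32_t rem = x.div(gs); // x becomes floor(x / g_s), rem = x mod g_s

    x.shift(1);               // multiply by 2^32
    x.add(rem);
    x.add(CDF[s]);
}

// Decoding step: returns the symbol and sets y <- floor(y / 2^32) * g_t + (y mod 2^32) - CDF(t)
uint32_t ANS_Cinv(INT &y, const uint32_t *G, const uint32_t *CDF, const int B)
{
    uint32_t lsb = y.data[0]; // y mod 2^32
    uint32_t t = ANS_Symbol(lsb, CDF, B);

    y.shift(-1);              // floor(y / 2^32)
    y.mul(G[t]);
    y.add(lsb);
    y.sub(CDF[t]);

    return t;
}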

void ANS_Encode(vector<uint32_t> &x, const uint32_t *s, const int slen, const uint32_t *G, const uint32_t *CDF, const int B)
{
    x.clear();
    INT state(ANS_STATE_MAXLEN + 1); // Set the initial value to 0

    for (int i = 0; i < slen; i++)
    {
        INT new_state(state);
        ANS_C(new_state, s[i], G, CDF, B);

        if (new_state.size() > ANS_STATE_MAXLEN) // About to exceed the range I
        {
            for (int j = 0; j < ANS_STORE_LEN; j++) // j, so that the outer index i is not shadowed
                x.push_back(state.data[j]);         // Store the lowest ANS_STORE_LEN blocks

            state.shift(-ANS_STORE_LEN);   // Make it less than l
            ANS_C(state, s[i], G, CDF, B); // Now belonging to range I
        }
        else
        {
            state.len = new_state.len; // was a self-assignment in the original listing
            memcpy(state.data, new_state.data, new_state.size() * sizeof(uint32_t));
        }
    }

    // Output the remaining state values, with the length between ANS_STATE_MINLEN + 1 and ANS_STATE_MAXLEN
    // (shorter if the message was too short for the state to reach the range I)
    int l = state.size();
    for (int i = 0; i < l; i++)
        x.push_back(state.data[i]);
}

void ANS_Decode(uint32_t **s, int &slen, const vector<uint32_t> &x, const uint32_t *G, const uint32_t *CDF, const int B)
{
    vector<uint32_t> res;
    int xlen = x.size();
    // Number of stored blocks; 0 if the input only holds a short final state
    int num = xlen > ANS_STATE_MINLEN ? (xlen - ANS_STATE_MINLEN - 1) / ANS_STORE_LEN : 0;
    int l = xlen - ANS_STORE_LEN * num; // Length of the final state in words
    xlen -= l;

    INT state(ANS_STATE_MAXLEN + 1);
    for (int i = 0; i < l; i++)
        state.data[i] = x[xlen + i]; // Retrieve the latest state

    while (xlen > 0)
    {
        while (state.size() > ANS_STATE_MINLEN) // Check if it still belongs to range I
            res.push_back(ANS_Cinv(state, G, CDF, B));

        state.shift(ANS_STORE_LEN); // Otherwise, the lower bits need to be fetched.
        for (int i = 0; i < ANS_STORE_LEN; i++)
            state.data[i] = x[xlen - ANS_STORE_LEN + i];

        xlen -= ANS_STORE_LEN;
    }

    while (state.size() != 0) // Keep decoding until the state returns to the initial value 0
        res.push_back(ANS_Cinv(state, G, CDF, B));

    if (*s != nullptr)
        free(*s);

    // Symbols were decoded last-to-first, so reverse them into the output buffer
    slen = res.size();
    *s = (uint32_t *)malloc(slen * sizeof(uint32_t));
    for (int i = 0; i < slen; i++)
        (*s)[i] = res[slen - i - 1];
}