The Theory Behind Monkey's Audio: Compressing WAV to APE

Digital Audio

Sound is simply a wave, and digital audio is a digital representation of that wave. The representation is achieved by "sampling" the magnitude of an analog signal many times per second; conceptually, this is like recording the "height" of the wave many times per second. Today's audio CDs store 44,100 samples per second. Since CDs are in stereo, they store both a left and a right value 44,100 times per second, each represented by a 16-bit integer. Basically, a WAV file is a header followed by an array of interleaved samples: L, R, L, R…. Since each sample pair takes 32 bits (16 for the left channel and 16 for the right) and there are 44,100 sample pairs per second, one second of audio takes 1,411,200 bits, or 176,400 bytes.
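
To make the arithmetic concrete, here is a minimal Python sketch (the constant names are mine, for illustration):

    SAMPLE_RATE = 44100        # sample pairs per second
    BITS_PER_SAMPLE = 16       # per channel
    CHANNELS = 2               # stereo: left and right

    bits_per_second = SAMPLE_RATE * BITS_PER_SAMPLE * CHANNELS
    print(bits_per_second)         # 1411200 bits
    print(bits_per_second // 8)    # 176400 bytes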

Lossless Compression

Lossless compression can be broken down into a few basic steps. They are detailed below.

1. Conversion to X,Y

The first step in lossless compression is to model the L and R channels more efficiently as some X and Y values. There is often a great deal of correlation between the L and R channels, and this can be exploited in several ways, one popular way being mid/side encoding. In this case, a mid (X) and a side (Y) value are encoded instead of an L and an R value. The mid (X) is the midpoint between the L and R channels and the side (Y) is the difference between the channels. This can be achieved as follows (a sketch of the transform appears after the formulas):

  • X = (L + R) / 2
  • Y = (L - R)
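
A minimal Python sketch of the transform and its inverse (the function names are mine). Note that the integer halving in X = (L + R) / 2 looks lossy, but the bit dropped from L + R has the same parity as Y = L - R, so the round trip is exact:

    def encode_ms(l, r):
        # X = midpoint (floored), Y = difference
        return (l + r) >> 1, l - r

    def decode_ms(x, y):
        # L + R and L - R share the same low bit, so the bit dropped by
        # the floored halving can be recovered and R rebuilt exactly.
        r = x - (y >> 1)
        return r + y, r              # (L, R)

    print(encode_ms(5, 2))           # (3, 3)
    print(decode_ms(3, 3))           # (5, 2)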

2. Predictor

Next, the X and Y data is passed through a predictor to attempt to remove any redundancy. Basically, the goal of this stage is to make the X and Y arrays contain the smallest possible values while still remaining decompressible. This stage is what separates one compression scheme from another. There are virtually countless ways to do this. Here is a sample using simple linear algebra:

  • PX and PY are the predicted X and Y; X[n-1] is the previous X value; X[n-2] is the X value two back
  • PX = (2 * X[n-1]) - X[n-2]
  • PY = (2 * Y[n-1]) - Y[n-2]

As an example, if X = (2, 8, 24, ?), then PX = (2 * X[n-1]) - X[n-2] = (2 * 24) - 8 = 40.

Then, these predicted values are compared with the actual value and the difference (error) is what gets sent to the next stage for encoding.
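
As a sketch, assuming the first two samples are stored verbatim so the decoder has history to start from:

    def residuals_order2(x):
        out = list(x[:2])                   # first two samples pass through
        for n in range(2, len(x)):
            px = 2 * x[n - 1] - x[n - 2]    # order-2 prediction
            out.append(x[n] - px)           # store only the error
        return out

    print(residuals_order2([2, 8, 24, 40]))   # [2, 8, 10, 0]

The decoder reverses this by recomputing each prediction from the samples it has already reconstructed and adding the stored error back.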

Most good predictors are adaptive, so they adjust to how "predictable" the data currently is. For example, let's use a factor m that ranges from 0 to 1024 (0 is no prediction and 1024 is full prediction). After each prediction, m is adjusted up or down depending on whether the prediction was helpful. So in the previous example, what leaves the predictor is this:

  • X = (2, 8, 24, ?)
  • PX = (2 * X[n-1]) - X[n-2] = (2 * 24) - 8 = 40

If ? = 45 and m = 512

  • [Final Value] = ? - (PX * m / 1024) = 45 - (40 * 512 / 1024) = 45 - 20 = 25

After this, m would be adjusted upward, because a higher m would have been more efficient: with m = 1024 the residual would have been only 45 - 40 = 5.
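
A sketch of one adaptive step; the amount m moves by (32 here) is an assumption, since the text only says m is adjusted up or down:

    def adaptive_step(x2, x1, actual, m):
        px = 2 * x1 - x2                          # order-2 prediction
        residual = actual - (px * m) // 1024      # blend the prediction in by m/1024
        # nudge m toward whichever extreme would have given a smaller residual
        if abs(actual - px) < abs(actual):        # full prediction beat no prediction
            m = min(1024, m + 32)
        else:
            m = max(0, m - 32)
        return residual, m

    print(adaptive_step(8, 24, 45, 512))          # (25, 544): m rises, as expected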

Using different prediction equations and making multiple passes through the predictor can make a fairly substantial difference in compression. Here is a quick list of prediction equations for different orders, as given in the Shorten technical documentation (a sketch follows the list):

  • P0 = 0
  • P1 = X[n-1]
  • P2 = (2 * X[n-1]) - X[n-2]
  • P3 = (3 * X[n-1]) - (3 * X[n-2]) + X[n-3]
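
In Python these could be tabulated as follows (a sketch; it assumes n is at least the predictor order):

    # Shorten-style polynomial predictors, indexed by order
    PREDICTORS = [
        lambda x, n: 0,                                       # P0
        lambda x, n: x[n - 1],                                # P1
        lambda x, n: 2 * x[n - 1] - x[n - 2],                 # P2
        lambda x, n: 3 * x[n - 1] - 3 * x[n - 2] + x[n - 3],  # P3
    ]

    x = [2, 8, 24, 45]
    print(PREDICTORS[2](x, 3))    # 40, matching the earlier example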

3. Encoding of Data

The goal behind audio compression is to make all of the numbers as small as possible by removing any correlation that may exist between them. Once this is achieved the resulting numbers must be written to disk.

Why are smaller numbers better? Because they take fewer bits to represent. For example, say we want to encode this array of numbers (32-bit longs):

Base 10: 10, 14, 15, 46

or in binary

Base 2: 1010, 1110, 1111, 101110

Now, if we want to represent these numbers in the fewest possible bits, it would obviously be quite inefficient to store each one as a separate 32-bit long. That would take 128 bits, and just from looking at the same numbers in base two, there must be a better way. The ideal thing would be to pack the four numbers together using only the bits necessary, so 1010, 1110, 1111, 101110 without the commas would be 101011101111101110, just 18 bits. The problem is that we no longer know where one number ends and the next begins.
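
The ambiguity is easy to demonstrate; this sketch packs the numbers with no lengths or separators:

    nums = [10, 14, 15, 46]
    packed = "".join(format(n, "b") for n in nums)
    print(packed)    # 101011101111101110: 18 bits, but nothing marks where
                     # one number ends and the next begins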

To store small numbers so that they take fewer bits but can still be decoded, we use "entropy coding". A common entropy coding system called Rice coding is detailed below. Monkey's Audio uses a slightly more advanced entropy coder, but the fundamentals of Rice coding still apply.

Entropy Coding

Rice Coding

Rice coding is a way of using fewer bits to represent small numbers while still maintaining the ability to tell one number from the next. Basically, it works like this (a sketch follows the list):

  1. Make your best guess as to how many bits the number will take, and call that k
  2. Take the rightmost k bits of the number and remember them
  3. Look at the value of the binary number without those rightmost k bits (this is the overflow that doesn't fit in k bits)
  4. Encode the number using these values: write one zero per unit of the overflow from step 3, then a terminating 1 to signal that the "overflow" is done, then the k bits from step 2
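
A minimal sketch of these four steps for a non-negative n (sign handling is covered later; the function name is mine):

    def rice_encode(n, k):
        overflow = n >> k                  # step 3: what's left after dropping k bits
        low_bits = n & ((1 << k) - 1)      # step 2: the rightmost k bits
        # step 4: unary overflow, terminating 1, then the k low bits
        return "0" * overflow + "1" + format(low_bits, "0%db" % k)

    print(rice_encode(46, 4))              # 0011110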

Example

Let's work through our example, and try to encode the fourth number in our series 10, 14, 15, 46.

  1. Make your best guess as to how many bits the number will take, and call that k: since the previous three numbers took 4 bits each, that seems like a reasonable guess, so we set k = 4
  2. Take the rightmost k bits of the number and remember them: the rightmost 4 bits of 46 (101110) are 1110
  3. Look at the value of the binary number without those rightmost k bits: when you take the 1110 away from the right of 101110, you are left with 10, or 2 in base 10
  4. Encode the number using these values: we put two 0's, followed by the terminating 1, followed by the k bits 1110. All together: 0011110

Now, to undo this operation, we take 0011110 and k = 4 and work backwards. We first see that the overflow is 2 (there are two zeroes before the terminating 1). We also see that the last four bits are 1110. So we take the overflow (10 in binary), shift it left by k bits, and combine it with the k bits 1110: 10 followed by 1110 gives 101110, which is 46, and voilà!
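
Decoding is the mirror image; a sketch:

    def rice_decode(bits, k):
        overflow = bits.index("1")         # count the zeroes before the first 1
        low_bits = int(bits[overflow + 1 : overflow + 1 + k], 2)
        return (overflow << k) | low_bits  # shift the overflow back up by k bits

    print(rice_decode("0011110", 4))       # 46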

More Technical Version of the Same Example

Here is a little more technical and mathematical description of the same process:

Assume some integer n is the number to encode and k is the number of bits to encode directly. The encoded form is:

  1. a sign bit (1 for positive, 0 for negative)
  2. n / 2^k zeroes (using integer division)
  3. a terminating 1
  4. the k least significant bits of n

As an example, if n = 578 and k = 8, the encoding is 100101000010 (a sketch follows the breakdown):

  1. sign bit (1 for positive, 0 for negative): [1]
  2. n / 2^k zeroes: 578 / 2^8 = 578 / 256 = 2 (integer division), so [00]
  3. terminating 1: [1]
  4. k least significant bits of n: 578 ends in [01000010]
  5. putting 1-4 together: [1][00][1][01000010] = 100101000010
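
A sketch combining the sign bit with the three Rice-code parts described above:

    def rice_encode_signed(n, k):
        sign = "1" if n >= 0 else "0"      # step 1: sign bit
        n = abs(n)
        # steps 2-4: unary n / 2^k, terminating 1, k least significant bits
        return sign + "0" * (n >> k) + "1" + format(n & ((1 << k) - 1), "0%db" % k)

    print(rice_encode_signed(578, 8))      # 100101000010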

During the encode process, the optimum k is determined by looking at the average value over some number of past values (16 to 128 works well) and choosing the most efficient k for that average (basically, it guesses what the next value will be and picks the k best suited to it). The optimum k can be calculated as [log(n) / log(2)], i.e. the base-2 logarithm of that average, rounded down.
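
As a sketch, using a window of 32 values chosen arbitrarily from the 16 - 128 range given above:

    import math

    def choose_k(history, window=32):
        recent = history[-window:]
        avg = sum(abs(v) for v in recent) / len(recent)
        # k = log2 of the running average, floored; clamp for tiny averages
        return int(math.log2(avg)) if avg >= 1 else 0

    print(choose_k([10, 14, 15, 46]))      # log2(21.25) floored -> 4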

