The Theory Behind Compressing WAV to APE in Monkey's Audio

Digital Audio

Sound is simply a wave, and digital audio is the digital representation of this wave. The digital representation is achieved by "sampling" the magnitude of an analog signal many times per second. Conceptually, this means recording the "height" of the wave many times per second. Today's audio CDs store 44,100 samples per second. Since CDs are in stereo, they store both a left and a right value 44,100 times per second. These values are represented by 16-bit integers. Basically, a WAV file is a header followed by an array of L, R, L, R…. Since each stereo sample takes 32 bits (16 for the left and 16 for the right), and there are 44,100 samples per second, one second of audio takes 1,411,200 bits, or 176,400 bytes.
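The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the de-interleaving helper assumes the usual WAV layout in which the left channel comes first.

```python
SAMPLE_RATE = 44_100    # sample frames per second on an audio CD
BITS_PER_SAMPLE = 16    # one signed integer per channel
CHANNELS = 2            # stereo: left and right


def pcm_data_rate(sample_rate=SAMPLE_RATE, bits=BITS_PER_SAMPLE, channels=CHANNELS):
    """Return (bits_per_second, bytes_per_second) for uncompressed PCM audio."""
    bits_per_second = sample_rate * bits * channels
    return bits_per_second, bits_per_second // 8


def deinterleave(samples):
    """Split an interleaved stereo list into (left, right).

    Assumes left-first interleaving (an assumption of this sketch).
    """
    return samples[0::2], samples[1::2]


if __name__ == "__main__":
    print(pcm_data_rate())
    print(deinterleave([1, 2, 3, 4]))
```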

Lossless Compression

Lossless compression can be broken down into a few basic steps. They are detailed below.

1. Conversion to X,Y

The first step in lossless compression is to model the L and R channels more efficiently as some X and Y values. There is often a great deal of correlation between the L and R channels, and this can be exploited in several ways, a popular one being mid / side encoding. In this case, a mid (X) and a side (Y) value are encoded instead of an L and an R value. The mid (X) is the midpoint between the L and R channels, and the side (Y) is the difference between the channels. They are computed as follows:

  • X = (L + R) / 2
  • Y = (L - R)
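A minimal sketch of the mid / side transform and its inverse follows. The function names are hypothetical, and the way the halved sum is rounded is an assumption of this sketch, not something taken from Monkey's Audio itself. The point it illustrates is why the integer division by 2 stays lossless: L + R and L - R always have the same parity, so the bit dropped from X can be recovered from the lowest bit of Y.

```python
def to_mid_side(left, right):
    """Convert one L/R sample pair to (X, Y) = (mid, side)."""
    x = (left + right) >> 1   # floor of (L + R) / 2; the lowest bit of the sum is dropped
    y = left - right
    return x, y


def from_mid_side(x, y):
    """Recover (L, R) exactly from (X, Y)."""
    # (L + R) and (L - R) have the same parity, so the low bit of Y
    # restores the bit that was dropped when X was computed.
    total = (x << 1) | (y & 1)
    left = (total + y) >> 1
    right = (total - y) >> 1
    return left, right


if __name__ == "__main__":
    print(to_mid_side(5, 2))
    print(from_mid_side(*to_mid_side(5, 2)))
```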

2. Predictor

Next, the X and Y data are passed through a predictor to remove as much redundancy as possible. Basically, the goal of this stage is to make the X and Y arrays contain the smallest possible values while still remaining decompressible. This stage is what separates one compression scheme from another, and there are virtually countless ways to do it. Here is a sample using a simple linear predictor:

  • PX and PY are the predicted X and Y; X[n-1] is the previous X value; X[n-2] is the X value two back
  • PX = (2 * X[n-1]) - X[n-2]
  • PY = (2 * Y[n-1]) - Y[n-2]

As an example, if X = (2, 8, 24, ?); PX = (2 * X[n-1]) - X[n-2] = (2 * 24) - 8 = 40

Then, these predicted values are compared with the actual value and the difference (error) is what gets sent to the next stage for encoding.
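The prediction and error step can be sketched as below. The helper names are illustrative, and predicting 0 while fewer than two previous samples exist is an assumption made here so that the decoder can start from the same state as the encoder.

```python
def predict(history):
    """Second-order prediction: 2 * previous - the one before that.

    Returns 0 until two earlier samples are available (assumption of this sketch).
    """
    if len(history) < 2:
        return 0
    return 2 * history[-1] - history[-2]


def encode_residuals(samples):
    """Replace each sample by its prediction error."""
    return [actual - predict(samples[:n]) for n, actual in enumerate(samples)]


def decode_residuals(residuals):
    """Rebuild the samples by re-running the same predictor on decoded output."""
    samples = []
    for error in residuals:
        samples.append(error + predict(samples))
    return samples


if __name__ == "__main__":
    errors = encode_residuals([2, 8, 24, 45])
    print(errors)
    print(decode_residuals(errors))
```

Because the decoder only ever predicts from samples it has already reconstructed, it produces exactly the same predictions as the encoder, which is what keeps the scheme lossless.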

Most good predictors are adaptive, so that they adjust to how "predictable" the data currently is. For example, let's use a factor 'm' that ranges from 0 to 1024 (0 is no prediction and 1024 is full prediction). After each prediction, m is adjusted up or down depending on whether the prediction was helpful or not. So in the previous example, what leaves the predictor is this:

  • X = (2, 8, 24, ?)
  • PX = (2 * X[n-1]) - X[n-2] = (2 * 24) - 8 = 40

If ? = 45 and m = 512

  • [Final Value] = ? - (PX * m / 1024) = 45 - (40 * m / 1024) = 45 - (40 * 512 / 1024) = 45 - 20 = 25

After this, m would be adjusted upward, because a higher m would have been more efficient: with m = 1024 the value leaving the predictor would have been 45 - 40 = 5 instead of 25.
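One way to picture the adaptation is the sketch below. The text does not say how m is actually adjusted, so the rule used here is purely an assumption: move m by a fixed, hypothetical STEP, upward when the leftover error has the same sign as the prediction (meaning more prediction would have helped) and downward otherwise.

```python
SCALE = 1024   # m ranges from 0 (no prediction) to SCALE (full prediction)
STEP = 16      # hypothetical adjustment size, not taken from the source


def adapt(m, px, residual):
    """Nudge m up if more prediction would have helped, down otherwise (assumed rule)."""
    if px == 0 or residual == 0:
        return m
    if (px > 0) == (residual > 0):
        return min(SCALE, m + STEP)
    return max(0, m - STEP)


def encode_step(actual, px, m):
    """Return (value leaving the predictor, updated m)."""
    residual = actual - (px * m) // SCALE
    return residual, adapt(m, px, residual)


def decode_step(residual, px, m):
    """Mirror of encode_step: return (actual value, updated m)."""
    actual = residual + (px * m) // SCALE
    return actual, adapt(m, px, residual)


if __name__ == "__main__":
    print(encode_step(45, 40, 512))
    print(decode_step(25, 40, 512))
```

The update depends only on values the decoder also knows (the prediction and the transmitted error), so both sides keep m in step without sending it.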

Using different prediction equations and using multiple passes through the predictor can make a fairly substantial difference in compression level. Here is a quick list of some prediction equations as shown in the Shorten technical documentation (for different orders):

  • P0 = 0
  • P1 = X[n-1]
  • P2 = (2 * X[n-1]) - X[n-2]
  • P3 = (3 * X[n-1]) - (3 * X[n-2]) + X[n-3]
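The four equations are easy to compare side by side in code. In this sketch the function names are invented, missing history is treated as 0, and the "pick the order with the smallest total absolute error" selection is shown only as one plausible way to choose between them.

```python
def predict_order(history, order):
    """Fixed polynomial predictors of order 0 to 3.

    history holds earlier samples, most recent last; missing ones count as 0
    (an assumption of this sketch).
    """
    padded = [0, 0, 0] + list(history)
    x1, x2, x3 = padded[-1], padded[-2], padded[-3]
    if order == 0:
        return 0
    if order == 1:
        return x1
    if order == 2:
        return 2 * x1 - x2
    if order == 3:
        return 3 * x1 - 3 * x2 + x3
    raise ValueError("order must be 0, 1, 2 or 3")


def residual_cost(samples, order):
    """Sum of absolute prediction errors for one predictor order."""
    return sum(abs(s - predict_order(samples[:n], order)) for n, s in enumerate(samples))


def best_order(samples):
    """Return the order whose errors are smallest overall."""
    return min(range(4), key=lambda order: residual_cost(samples, order))


if __name__ == "__main__":
    ramp = list(range(0, 50, 5))
    print([residual_cost(ramp, order) for order in range(4)])
    print(best_order(ramp))
```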

3. Encoding of Data

The goal behind audio compression is to make all of the numbers as small as possible by removing any correlation that may exist between them. Once this is achieved the resulting numbers must be written to disk.

Why are smaller numbers better? Because they take fewer bits to represent. For example, say we want to encode this array of numbers (32-bit integers):

Base 10: 10, 14, 15, 46

or in binary

Base 2: 1010, 1110, 1111, 101110

Now obviously, if we want to represent these numbers in the fewest possible bits, it would be quite inefficient to store each of them as a separate 32-bit integer. That would take 128 bits, and just from looking at the same numbers in base two, it is obvious that there must be a better way. The ideal thing would be to slap the four numbers together using the fewest bits necessary, so 1010, 1110, 1111, 101110 without the commas would be 101011101111101110. The problem here is that we don't know where one number ends and the next begins.
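A short sketch makes both the saving and the ambiguity concrete. The helper names are illustrative only.

```python
def minimal_bits(values):
    """Total bits needed if every value used only its significant bits."""
    return sum(v.bit_length() for v in values)


def concatenate(values):
    """Glue the plain binary forms together with no separators."""
    return "".join(format(v, "b") for v in values)


if __name__ == "__main__":
    values = [10, 14, 15, 46]
    print(minimal_bits(values), "bits instead of", 32 * len(values))
    print(concatenate(values))
    # A different list of numbers produces exactly the same bit string,
    # so the packed form cannot be decoded unambiguously.
    other = [2, int("1011101111101110", 2)]
    print(concatenate(other) == concatenate(values))
```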

To store small numbers so that they take fewer bits but can still be decompressed, we use "entropy coding". A common entropy coding system called Rice Coding is described below. Monkey's Audio uses a slightly more advanced entropy coder, but the fundamentals of Rice Coding are still useful.

Entropy Coding

Rice Coding

Rice coding is a way of using fewer bits to represent small numbers, while still maintaining the ability to tell one number from the next. Basically it works like this:

  1. You make your best guess as to how many bits a number will take, and call that k
  2. Take the rightmost k bits of the number and remember what they are
  3. Imagine the binary number without those rightmost k bits and look at its new value (this is the overflow that doesn't fit in k bits)
  4. Use these values to encode the number. The encoded value consists of as many zeroes as the value from step 3, then a terminating 1 to signal that you're done sending the "overflow", then the k bits from step 2.


Let's work through our example, and try to encode the fourth number in our series 10, 14, 15, 46.

  1. You make your best guess as to how many bits a number will take, and call that k: since the previous 3 numbers took 4 bits, that seems like a reasonable guess so we will set k = 4
  2. Take the rightmost k bits of the number and remember what they are: The right 4 bits of 46 (101110) are 1110
  3. Imagine the binary number without those rightmost k bits and look at its new value (this is the overflow that doesn't fit in k bits): When you take the 1110 away from the right of 101110 you are left with 10 in binary, which is 2 in base 10
  4. Use these values to encode the number. So, we put two 0's, followed by the terminating 1, followed by the k bits 1110. All together we have: 0011110

Now to undo this operation, we just take 0011110 and k = 4 and work our way backwards. We first see that the overflow is 2 (there are two zeroes before the terminating 1). We also see that the last four bits are 1110. So we take the value 10 (the overflow) and the value 1110 (the k bits), shift the overflow left by k bits, and add the two: (10 << 4) + 1110 = 100000 + 1110 = 101110, which is 46. Voilà!
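The four steps and their reversal can be written out as a small sketch. It handles non-negative numbers only, represents the bit stream as a string of '0' and '1' characters for readability, and uses function names chosen just for this illustration.

```python
def rice_encode(n, k):
    """Encode a non-negative integer: overflow zeroes, a terminating 1, then k low bits."""
    if n < 0:
        raise ValueError("this sketch handles non-negative values only")
    overflow = n >> k                      # what does not fit in k bits
    low = n & ((1 << k) - 1)               # the rightmost k bits
    low_bits = format(low, "b").zfill(k) if k > 0 else ""
    return "0" * overflow + "1" + low_bits


def rice_decode(bits, k, pos=0):
    """Decode one value starting at pos; return (value, position after it)."""
    overflow = 0
    while bits[pos] == "0":                # count zeroes up to the terminating 1
        overflow += 1
        pos += 1
    pos += 1                               # skip the terminating 1
    low = int(bits[pos:pos + k], 2) if k > 0 else 0
    return (overflow << k) | low, pos + k


if __name__ == "__main__":
    stream = "".join(rice_encode(v, 4) for v in [10, 14, 15, 46])
    print(stream)
    pos, decoded = 0, []
    while pos < len(stream):
        value, pos = rice_decode(stream, 4, pos)
        decoded.append(value)
    print(decoded)
```

Unlike the plain concatenation discussed earlier, every code ends at a known position, so the values can be read back one after another.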

More Technical Version of the Same Example

Here is a little more technical and mathematical description of the same process:

Assume some integer n is the number to encode, and k is the number of bits to encode directly. The encoded value then consists of:

  1. sign (1 for positive, 0 for negative)
  2. n / 2^k 0's (integer division)
  3. terminating 1
  4. k least significant bits of n

As an example, if n = 578 and k = 8: 100101000010

  1. sign (1 for positive, 0 for negative) = [1]
  2. n / 2^k 0's: n / 2^k = 578 / 2^8 = 578 / 256 = 2 = [00]
  3. terminating 1: [1]
  4. k least significant bits of n: 578 = [01000010]
  5. put the 1-4 together: [1][00][1][01000010] = 100101000010
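The same layout, including the sign bit, can be sketched as follows. Treating zero as positive and coding the magnitude of a negative n are assumptions of this sketch, since the text does not spell them out; the function names are likewise illustrative.

```python
def rice_encode_signed(n, k):
    """Sign bit, then |n| >> k zeroes, a terminating 1, and the k low bits of |n|."""
    sign = "1" if n >= 0 else "0"          # zero is treated as positive (assumption)
    magnitude = abs(n)
    overflow = magnitude >> k              # integer part of |n| / 2^k
    low = magnitude & ((1 << k) - 1)       # k least significant bits
    low_bits = format(low, "b").zfill(k) if k > 0 else ""
    return sign + "0" * overflow + "1" + low_bits


def rice_decode_signed(bits, k):
    """Inverse of rice_encode_signed for a string holding exactly one code."""
    sign = 1 if bits[0] == "1" else -1
    pos = 1
    overflow = 0
    while bits[pos] == "0":
        overflow += 1
        pos += 1
    pos += 1                               # skip the terminating 1
    low = int(bits[pos:pos + k], 2) if k > 0 else 0
    return sign * ((overflow << k) | low)


if __name__ == "__main__":
    code = rice_encode_signed(578, 8)
    print(code)
    print(rice_decode_signed(code, 8))
```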

During encoding, the optimum k is determined by looking at the average of the most recent values (a window of 16 to 128 values works well) and choosing the best k for that average. In effect, the encoder guesses how large the next value will be and picks the most efficient k for that guess. The optimum k can be calculated as [log(n) / log(2)], that is, the base-2 logarithm of the average n, rounded to an integer.
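A running estimate of k might look like the sketch below. The class name, the default window of 32, and the starting k are all hypothetical, and the brackets in the formula are read here as rounding down, which is an assumption. Integer bit lengths are used instead of floating-point logarithms to avoid rounding surprises.

```python
from collections import deque


class RiceParameterEstimator:
    """Track recent magnitudes and suggest k = floor(log2(average))."""

    def __init__(self, window=32, initial_k=4):
        self.recent = deque(maxlen=window)   # keeps only the last `window` values
        self.initial_k = initial_k           # used before any value has been seen

    def current_k(self):
        if not self.recent:
            return self.initial_k
        average = sum(self.recent) // len(self.recent)
        # For average >= 1, bit_length() - 1 equals floor(log2(average)).
        return max(average.bit_length() - 1, 0)

    def update(self, value):
        self.recent.append(abs(value))


if __name__ == "__main__":
    estimator = RiceParameterEstimator()
    for value in [10, 14, 15]:
        estimator.update(value)
    print(estimator.current_k())
```

Since the estimate uses only values that have already been coded, a decoder can run the same estimator and arrive at the same k without it being stored in the stream.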








