The Theory Behind Monkey's Audio: Compressing WAV to APE

Digital Audio

Sound is simply a wave, and digital audio is a digital representation of that wave. The representation is achieved by "sampling" the magnitude of an analog signal many times per second; conceptually, this is like recording the "height" of the wave many times per second. Today's audio CDs store 44,100 samples per second. Since CDs are in stereo, they store both a left and a right value 44,100 times per second, each represented by a 16-bit integer. Basically, a WAV file is a header followed by an array of interleaved samples: L, R, L, R…. Since each sample pair takes 32 bits (16 for the left channel and 16 for the right) and there are 44,100 sample pairs per second, one second of audio takes 1,411,200 bits, or 176,400 bytes.
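
To make the arithmetic concrete, here is a minimal Python sketch (the constant names are mine, for illustration):

    SAMPLE_RATE = 44100        # sample pairs per second
    BITS_PER_SAMPLE = 16       # per channel
    CHANNELS = 2               # stereo: left and right

    bits_per_second = SAMPLE_RATE * BITS_PER_SAMPLE * CHANNELS
    print(bits_per_second)         # 1411200 bits
    print(bits_per_second // 8)    # 176400 bytes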

Lossless Compression

Lossless compression can be broken down into a few basic steps. They are detailed below.

1. Conversion to X,Y

The first step in lossless compression is to model the L and R channels more efficiently as some X and Y values. There is often a great deal of correlation between the L and R channels, and this can be exploited in several ways, one popular way being mid/side encoding. In this case, a mid (X) and a side (Y) value are encoded instead of an L and an R value. The mid (X) is the midpoint between the L and R channels and the side (Y) is the difference between the channels. This can be achieved as follows (a sketch of the transform appears after the formulas):

  • X = (L + R) / 2
  • Y = (L - R)
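
A minimal Python sketch of the transform and its inverse (the function names are mine). Note that the integer halving in X = (L + R) / 2 looks lossy, but the bit dropped from L + R has the same parity as Y = L - R, so the round trip is exact:

    def encode_ms(l, r):
        # X = midpoint (floored), Y = difference
        return (l + r) >> 1, l - r

    def decode_ms(x, y):
        # L + R and L - R share the same low bit, so the bit dropped by
        # the floored halving can be recovered and R rebuilt exactly.
        r = x - (y >> 1)
        return r + y, r              # (L, R)

    print(encode_ms(5, 2))           # (3, 3)
    print(decode_ms(3, 3))           # (5, 2)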

2. Predictor

Next, the X and Y data is passed through a predictor to attempt to remove any redundancy. Basically, the goal of this stage is to make the X and Y arrays contain the smallest possible values while still remaining decompressible. This stage is what separates one compression scheme from another. There are virtually countless ways to do this. Here is a sample using simple linear algebra:

  • PX and PY are the predicted X and Y; X[n-1] is the previous X value; X[n-2] is the X value two back
  • PX = (2 * X[n-1]) - X[n-2]
  • PY = (2 * Y[n-1]) - Y[n-2]

As an example, if X = (2, 8, 24, ?), then PX = (2 * X[n-1]) - X[n-2] = (2 * 24) - 8 = 40.

Then, these predicted values are compared with the actual value and the difference (error) is what gets sent to the next stage for encoding.
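
As a sketch, assuming the first two samples are stored verbatim so the decoder has history to start from:

    def residuals_order2(x):
        out = list(x[:2])                   # first two samples pass through
        for n in range(2, len(x)):
            px = 2 * x[n - 1] - x[n - 2]    # order-2 prediction
            out.append(x[n] - px)           # store only the error
        return out

    print(residuals_order2([2, 8, 24, 40]))   # [2, 8, 10, 0]

The decoder reverses this by recomputing each prediction from the samples it has already reconstructed and adding the stored error back.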

Most good predictors are adaptive, so they adjust to how "predictable" the data currently is. For example, let's use a factor m that ranges from 0 to 1024 (0 is no prediction and 1024 is full prediction). After each prediction, m is adjusted up or down depending on whether the prediction was helpful. So in the previous example, what leaves the predictor is this:

  • X = (2, 8, 24, ?)
  • PX = (2 * X[n-1]) - X[n-2] = (2 * 24) - 8 = 40

If ? = 45 and m = 512

  • [Final Value] = ? - (PX * m / 1024) = 45 - (40 * 512 / 1024) = 45 - 20 = 25

After this, m would be adjusted upward, because a higher m would have been more efficient: with m = 1024 the residual would have been only 45 - 40 = 5.
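
A sketch of one adaptive step; the amount m moves by (32 here) is an assumption, since the text only says m is adjusted up or down:

    def adaptive_step(x2, x1, actual, m):
        px = 2 * x1 - x2                          # order-2 prediction
        residual = actual - (px * m) // 1024      # blend the prediction in by m/1024
        # nudge m toward whichever extreme would have given a smaller residual
        if abs(actual - px) < abs(actual):        # full prediction beat no prediction
            m = min(1024, m + 32)
        else:
            m = max(0, m - 32)
        return residual, m

    print(adaptive_step(8, 24, 45, 512))          # (25, 544): m rises, as expected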

Using different prediction equations and making multiple passes through the predictor can make a fairly substantial difference in compression. Here is a quick list of prediction equations for different orders, as given in the Shorten technical documentation (a sketch follows the list):

  • P0 = 0
  • P1 = X[n-1]
  • P2 = (2 * X[n-1]) - X[n-2]
  • P3 = (3 * X[n-1]) - (3 * X[n-2]) + X[n-3]
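
In Python these could be tabulated as follows (a sketch; it assumes n is at least the predictor order):

    # Shorten-style polynomial predictors, indexed by order
    PREDICTORS = [
        lambda x, n: 0,                                       # P0
        lambda x, n: x[n - 1],                                # P1
        lambda x, n: 2 * x[n - 1] - x[n - 2],                 # P2
        lambda x, n: 3 * x[n - 1] - 3 * x[n - 2] + x[n - 3],  # P3
    ]

    x = [2, 8, 24, 45]
    print(PREDICTORS[2](x, 3))    # 40, matching the earlier example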

3. Encoding of Data

The goal behind audio compression is to make all of the numbers as small as possible by removing any correlation that may exist between them. Once this is achieved the resulting numbers must be written to disk.

Why are smaller numbers better? Because they take fewer bits to represent. For example, say we want to encode this array of numbers (32-bit longs):

Base 10: 10, 14, 15, 46

or in binary

Base 2: 1010, 1110, 1111, 101110

Now, if we want to represent these numbers in the fewest possible bits, it would obviously be quite inefficient to store each one as a separate 32-bit long. That would take 128 bits, and just from looking at the same numbers in base two, there must be a better way. The ideal thing would be to pack the four numbers together using only the bits necessary, so 1010, 1110, 1111, 101110 without the commas would be 101011101111101110, just 18 bits. The problem is that we no longer know where one number ends and the next begins.
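
The ambiguity is easy to demonstrate; this sketch packs the numbers with no lengths or separators:

    nums = [10, 14, 15, 46]
    packed = "".join(format(n, "b") for n in nums)
    print(packed)    # 101011101111101110: 18 bits, but nothing marks where
                     # one number ends and the next begins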

To store small numbers so that they take fewer bits but can still be decoded, we use "entropy coding". A common entropy coding system called Rice coding is detailed below. Monkey's Audio uses a slightly more advanced entropy coder, but the fundamentals of Rice coding still apply.

Entropy Coding

Rice Coding

Rice coding is a way of using fewer bits to represent small numbers while still maintaining the ability to tell one number from the next. Basically, it works like this (a sketch follows the list):

  1. Make your best guess as to how many bits the number will take, and call that k
  2. Take the rightmost k bits of the number and remember them
  3. Look at the value of the binary number without those rightmost k bits (this is the overflow that doesn't fit in k bits)
  4. Encode the number using these values: write one zero per unit of the overflow from step 3, then a terminating 1 to signal that the "overflow" is done, then the k bits from step 2
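
A minimal sketch of these four steps for a non-negative n (sign handling is covered later; the function name is mine):

    def rice_encode(n, k):
        overflow = n >> k                  # step 3: what's left after dropping k bits
        low_bits = n & ((1 << k) - 1)      # step 2: the rightmost k bits
        # step 4: unary overflow, terminating 1, then the k low bits
        return "0" * overflow + "1" + format(low_bits, "0%db" % k)

    print(rice_encode(46, 4))              # 0011110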

Example

Let's work through our example, and try to encode the fourth number in our series 10, 14, 15, 46.

  1. Make your best guess as to how many bits the number will take, and call that k: since the previous three numbers took 4 bits each, that seems like a reasonable guess, so we set k = 4
  2. Take the rightmost k bits of the number and remember them: the rightmost 4 bits of 46 (101110) are 1110
  3. Look at the value of the binary number without those rightmost k bits: when you take the 1110 away from the right of 101110, you are left with 10, or 2 in base 10
  4. Encode the number using these values: we put two 0's, followed by the terminating 1, followed by the k bits 1110. All together: 0011110

Now, to undo this operation, we take 0011110 and k = 4 and work backwards. We first see that the overflow is 2 (there are two zeroes before the terminating 1). We also see that the last four bits are 1110. So we take the overflow (10 in binary), shift it left by k bits, and combine it with the k bits 1110: 10 followed by 1110 gives 101110, which is 46, and voilà!
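
Decoding is the mirror image; a sketch:

    def rice_decode(bits, k):
        overflow = bits.index("1")         # count the zeroes before the first 1
        low_bits = int(bits[overflow + 1 : overflow + 1 + k], 2)
        return (overflow << k) | low_bits  # shift the overflow back up by k bits

    print(rice_decode("0011110", 4))       # 46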

More Technical Version of the Same Example

Here is a little more technical and mathematical description of the same process:

Assume some integer n is the number to encode and k is the number of bits to encode directly. The encoded form is:

  1. a sign bit (1 for positive, 0 for negative)
  2. n / 2^k zeroes (using integer division)
  3. a terminating 1
  4. the k least significant bits of n

As an example, if n = 578 and k = 8, the encoding is 100101000010 (a sketch follows the breakdown):

  1. sign bit (1 for positive, 0 for negative): [1]
  2. n / 2^k zeroes: 578 / 2^8 = 578 / 256 = 2 (integer division), so [00]
  3. terminating 1: [1]
  4. k least significant bits of n: 578 ends in [01000010]
  5. putting 1-4 together: [1][00][1][01000010] = 100101000010
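
A sketch combining the sign bit with the three Rice-code parts described above:

    def rice_encode_signed(n, k):
        sign = "1" if n >= 0 else "0"      # step 1: sign bit
        n = abs(n)
        # steps 2-4: unary n / 2^k, terminating 1, k least significant bits
        return sign + "0" * (n >> k) + "1" + format(n & ((1 << k) - 1), "0%db" % k)

    print(rice_encode_signed(578, 8))      # 100101000010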

During the encode process, the optimum k is determined by looking at the average value over some number of past values (16 to 128 works well) and choosing the most efficient k for that average (basically, it guesses what the next value will be and picks the k best suited to it). The optimum k can be calculated as [log(n) / log(2)], i.e. the base-2 logarithm of that average, rounded down.
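
As a sketch, using a window of 32 values chosen arbitrarily from the 16 - 128 range given above:

    import math

    def choose_k(history, window=32):
        recent = history[-window:]
        avg = sum(abs(v) for v in recent) / len(recent)
        # k = log2 of the running average, floored; clamp for tiny averages
        return int(math.log2(avg)) if avg >= 1 else 0

    print(choose_k([10, 14, 15, 46]))      # log2(21.25) floored -> 4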

