Could KGB Archiver Be the Best Compression Tool Available? Or Just the Slowest?

kgb_archiver_top

File compression is so ubiquitous that it is now built into many operating systems as a standard feature. Zip files are generally the default archival format – occasionally replaced by RARs – but KGB Archiver is a tool that offers unparalleled levels of compression, although it does come at quite a price.

There is no financial cost associated with the program – which is in no way related to the former Soviet security agency – but if you want to get the most from the app you are going to have to invest a good deal of time.

This is an app that claims to offer an ‘unbelievably high compression rate’, and its developer publishes figures claiming compression levels around twice those of the zip format.

Download KGB Archiver

The official website for the program is no longer online as the software has not been updated for a couple of years (although you can still find the site in the Internet Archive). However, the project is still available on SourceForge so you can download the utility and exchange thoughts and ideas with other users.

kgb_archiver_1

Head over to the project page and ignore the download button. For some reason a language pack for the program is highlighted as the main download rather than the program itself.

kgb_archiver_2

To download the software, head to the Files section of the page, look in the KGB Archiver 2 folder and then download the .msi installer file from the 2.0 beta 2 folder. After running through the installation, the compression tool can be accessed through the context menu in Explorer – just right-click a file, folder, or selection of items and select the ‘Compress to xxx.kgb’ option.

This is where things start to get interesting. There are only two compression formats to choose from, KGB and Zip, but there are lots of other options. You can choose from one of seven different compression algorithms and, assuming you opt to use the sixth or seventh method, you then specify the level of compression you would like to apply.

kgb_archiver_3

There are no fewer than ten compression levels available ranging from Minimal through Above Medium to Maximum. You can also password-protect archives and create self-extracting archives, but it is the compression itself that is of real interest.

Testing Compression

As a test I worked with a folder filled with 100 JPEGs totalling 222MB in size. Using Windows’ built-in compressed folders feature, this was reduced to a zip file that was 221MB – virtually no change, but it was accomplished in a matter of seconds.

Running the same folder through KGB Archiver gave very different results. First, the compression process took around half an hour, but the resulting archive was significantly smaller at just 174MB.

As a second test, I collected a random selection of files: a few MP3s, text files, Word documents, images and executables totalling 93.5MB. Again, Windows’ Compressed Folder made little difference in the size, reducing it to just 90.5MB, but it did so very quickly.

KGB Archiver fared somewhat better, producing an archive of 81.6MB. Again, this took over thirty minutes, and this is on a quad-core machine. Is a saving like this worth it? That’s entirely for you to decide.
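For a sense of scale, the space savings are easy to work out from the figures above. Here is a minimal sketch in C++ (using only the numbers reported in these two tests):

#include <cstdio>

// Space saving = 1 - (compressed size / original size), as a percentage.
static double saving(double originalMB, double compressedMB) {
    return 100.0 * (1.0 - compressedMB / originalMB);
}

int main() {
    // Figures from the two tests above (sizes in MB).
    std::printf("JPEGs: zip %.1f%%, KGB %.1f%%\n",
                saving(222.0, 221.0), saving(222.0, 174.0));
    std::printf("Mixed: zip %.1f%%, KGB %.1f%%\n",
                saving(93.5, 90.5), saving(93.5, 81.6));
    return 0;
}

So KGB Archiver saved roughly 21.6% and 12.7% where zip managed 0.5% and 3.2% – at a cost of half an hour versus a few seconds.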

kgb_archiver_4

There are obviously some types of file that are easier to compress than others. Simple text files can be seriously crushed in size while many videos and music files are already compressed to some extent. What you can expect from KGB Archiver really depends on the files you are working with.

Compression in the Real World

The need for file compression has diminished over the years as hard drive capacities have spiraled upward and internet connection speeds have increased.

kgb_archiver_5

I first learned of KGB Archiver four or five years ago. I stumbled across a website that claimed to have used the tool to reduce the contents of an Office 2007 installation CD to a mere 1.5MB – down from over 400MB.

This level of compression seemed unbelievable, so I had to check it out – purely for research purposes, you understand; I already had a fully functional copy of Office and no need for a pirated version.

Having downloaded the archive, I set about the task of extracting its contents. I seem to recall the process taking a full day, but when it was complete there was indeed a fully functioning Office installation ready to be used.

I have not been able to replicate such impressive compression levels, but I have certainly found that KGB Archiver squashes files more than any other archiver I have used.

kgb_archiver_6

In reality, there are few practical uses for KGB Archiver – at least when the highest compression level is selected. Used on small files there is little difference in file size compared to other tools. When conditions are right, however – which means compressing either a large number of files, very large files, or certain particularly compliant file types – the levels of compression that can be achieved are staggering.

This is great in some respects, but the time requirement is something of a double whammy. However long it takes to compress the files you are working with, you should factor in roughly the same amount of time for decompression.

What do you think of KGB Archiver? Is it a useful tool or nothing more than a gimmick? Is it something you could see yourself using?

Translated from: https://www.howtogeek.com/135326/could-kgb-archiver-be-the-best-compression-tool-available-or-just-the-slowest/

KGB Archiver console version (C) 2005-2006 Tomasz Pawlak, tomekp17@gmail.com, mod by Slawek (poczta-sn@gazeta.pl)
based on PAQ6 by Matt Mahoney

PAQ6v2 - File archiver and compressor.
(C) 2004, Matt Mahoney, mmahoney@cs.fit.edu

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License version 2 as published by the Free Software Foundation at http://www.gnu.org/licenses/gpl.txt or (at your option) any later version. This program is distributed without any warranty.

USAGE

To compress:      PAQ6 -3 archive file file...  (1 or more file names), or
  (MSDOS):        dir/b | PAQ6 -3 archive  (read file names from input)
  (UNIX):         ls | PAQ6 -3 archive
To decompress:    PAQ6 archive  (no option)
To list contents: more < archive

Compression: The files listed are compressed and stored in the archive, which is created. The archive must not already exist. File names may specify a path, which is stored. If there are no file names on the command line, then PAQ6 prompts for them, reading until the first blank line or end of file.

The -3 is optional, and is used to trade off compression vs. speed and memory. Valid options are -0 to -9. Higher numbers compress better but run slower and use more memory. -3 is the default, and gives a reasonable tradeoff. Recommended options are:

  -0 to -2  fast (2X over -3) but poor compression, uses 2-6 MB memory
  -3        reasonably fast and good compression, uses 18 MB (default)
  -4        better compression but 3.5X slower, uses 64 MB
  -5        slightly better compression, 6X slower than -3, uses 154 MB
  -6        about like -5, uses 202 MB memory
  -7 to -9  use 404 MB, 808 MB, 1616 MB, about the same speed as -5

Decompression: No file names are specified. The archive must exist. If a path is stored, the file is extracted to the appropriate directory, which must exist. PAQ6 does not create directories. If the file to be extracted already exists, it is not replaced; rather it is compared with the archived file, and the offset of the first difference is reported. The decompressor requires as much memory as was used to compress. There is no option.

It is not possible to add, remove, or update files in an existing archive. If you want to do this, extract the files, delete the archive, and create a new archive with just the files you want.

TO COMPILE

  gxx -O PAQ6.cpp       DJGPP 2.95.2
  bcc32 -O2 PAQ6.cpp    Borland 5.5.1
  sc -o PAQ6.cpp        Digital Mars 8.35n

g++ -O produces the fastest executable among free compilers, followed by Borland and Mars. However Intel 8 will produce the fastest and smallest Windows executable overall, followed by Microsoft VC++ .net 7.1 /O2 /G7.

PAQ6 DESCRIPTION

1. OVERVIEW

A PAQ6 archive has a header, listing the names and lengths of the files it contains in human-readable format, followed by the compressed data. The first line of the header is "PAQ6 -m" where -m is the memory option. The data is compressed as if all the files were concatenated into one long string.

PAQ6 uses predictive arithmetic coding. The string, y, is compressed by representing it as a base 256 number, x, such that:

  P(s < y) <= x < P(s <= y)    (1)

where s is chosen randomly from the probability distribution P, and x has the minimum number of digits (bytes) needed to satisfy (1). Such coding is within 1 byte of the Shannon limit, log 1/P(y), so compression depends almost entirely on the goodness of the model, P, i.e. how well it estimates the probability distribution of strings that might be input to the compressor. Coding and decoding are illustrated in Fig. 1.
An encoder, given P and y, outputs x. A decoder, given P and x, outputs y. Note that given P in equation (1), you can find either x from y or y from x. Note also that both computations can be done incrementally. As the leading characters of y are known, the range of possible x narrows, so the leading digits can be output as they become known. For decompression, as the digits of x are read, the set of possible y satisfying (1) is restricted to an increasingly narrow lexicographical range containing y. All of the strings in this range will share a growing prefix. Each time the prefix grows, we can output a character.

  [Fig. 1. Predictive arithmetic compression and decompression: the uncompressed data feeds a Model whose prediction p drives an Encoder producing the compressed data; an identical Model and a Decoder reverse the process.]

Note that the model, which estimates P, is identical for compression and decompression. Modeling can be expressed incrementally by the chain rule:

  P(y) = P(y_1) P(y_2|y_1) P(y_3|y_1 y_2) ... P(y_n|y_1 y_2 ... y_n-1)    (2)

where y_i means the i'th character of the string y. The output of the model is a distribution over the next character, y_i, given the context of characters seen so far, y_1 ... y_i-1.

To simplify coding, PAQ6 uses a binary string alphabet. Thus, the output of a model is an estimate of P(y_i = 1 | context) (henceforth p), where y_i is the i'th bit, and the context is the previous i - 1 bits of uncompressed data.
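To make the coding step concrete, here is a minimal sketch of the kind of binary arithmetic encoder this description implies. It is an illustration written for this article, not code from PAQ6 itself; the 32-bit range, the 12-bit probability scale, and the assumption that the decoder pads with zero bytes past end of input are all choices made for the sketch.

#include <cstdint>
#include <cstdio>

// Minimal binary arithmetic encoder sketch (illustrative, not PAQ6's code).
// The model supplies p12 = P(next bit = 1) scaled to 12 bits (0..4095).
struct Encoder {
    uint32_t x1 = 0, x2 = 0xFFFFFFFF;  // current range [x1, x2]
    FILE* out;
    explicit Encoder(FILE* f) : out(f) {}

    void encode(int bit, uint32_t p12) {
        // Split the range in proportion to p; the "1" half is the lower part.
        uint32_t xmid = x1 + (uint32_t)(((uint64_t)(x2 - x1) * p12) >> 12);
        if (bit) x2 = xmid; else x1 = xmid + 1;
        // As the range narrows, leading bytes of x1 and x2 agree and can be
        // output immediately - the incremental coding described above.
        while ((x1 ^ x2) < (1u << 24)) {
            putc(x2 >> 24, out);
            x1 <<= 8;
            x2 = (x2 << 8) | 255;
        }
    }
    // One more byte pins x inside [x1, x2] if the decoder pads with zeros.
    void flush() { putc(x2 >> 24, out); }
};

The decoder mirrors this: it maintains the same range, compares the incoming x against xmid to recover each bit, and feeds the identical model, which is why equation (1) can be run in either direction.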
2. PAQ6 MODEL

The PAQ6 model consists of a weighted mix of independent submodels which make predictions based on different contexts. The submodels are weighted adaptively to favor those making the best predictions. The outputs of two independent mixers (which use sets of weights selected by different contexts) are averaged. This estimate is then adjusted by secondary symbol estimation (SSE), which maps the probability to a new probability based on previous experience and the current context. This final estimate is then fed to the encoder as illustrated in Fig. 2.

  [Fig. 2. PAQ6 model details for compression: N models each output counts (n0, n1) to Mixer 1 and Mixer 2; the two mixer outputs are averaged, adjusted by SSE, and passed to the arithmetic encoder. The model is identical for decompression, but the encoder is replaced with a decoder.]

In Sections 2-6, the description applies to the default memory option (-5, or MEM = 5). For smaller values of MEM, some components are omitted and the number of contexts is less.

3. MIXER

The mixers compute a probability by a weighted summation of the N models. Each model outputs two numbers, n0 and n1, representing the relative probability of a 0 or 1, respectively. These are combined using weighted summations to estimate the probability p that the next bit will be a 1:

  p = SUM_i=1..N w_i n1_i / SUM_i=1..N w_i n_i,    n_i = n0_i + n1_i    (3)

The weights w_i are adjusted after each bit of uncompressed data becomes known in order to reduce the cost (code length) of that bit. The cost of a 1 bit is -log(p), and the cost of a 0 is -log(1-p). We find the gradient of the weight space by taking the partial derivatives of the cost with respect to w_i, then adjusting w_i in the direction of the gradient to reduce the cost. This adjustment is:

  w_i := w_i + e[ny_i/(SUM_j (w_j+wo) ny_j) - n_i/(SUM_j (w_j+wo) n_j)]

where e and wo are small constants, and ny_i means n0_i if the actual bit is a 0, or n1_i if the bit is a 1. The weight offset wo prevents the gradient from going to infinity as the weights go to 0. e is set to around .004, trading off between faster adaptation (larger e) and less noise for better compression of stationary data (smaller e).

There are two mixers, whose outputs are averaged together before being input to the SSE stage. Each mixer maintains a set of weights which is selected by a context. Mixer 1 maintains 16 weight vectors, selected by the 3 high order bits of the previous byte and on whether the data is text or binary. Mixer 2 maintains 16 weight vectors, selected by the 2 high order bits of each of the previous 2 bytes.

To distinguish text from binary data, we use the heuristic that space characters are more common in text than NUL bytes, while NULs are more common in binary data. We compare the position of the 4th from last space with the position of the 4th from last 0 byte.

4. CONTEXT MODELS

Individual submodels output a prediction in the form of two numbers, n0 and n1, representing relative probabilities of 0 and 1. Generally this is done by storing a pair of counters (c0,c1) in a hash table indexed by context. When a 0 or 1 is encountered in a context, the appropriate count is increased by 1. Also, in order to favor newer data over old, the opposite count is decreased by the following heuristic:

  If the count > 25 then replace with sqrt(count) + 6 (rounding down)
  Else if the count > 1 then replace with count / 2 (rounding down)

The outputs are derived from the counts in a way that favors highly predictive contexts, i.e. those where one count is large and the other is small. For the case of c1 >= c0 the following heuristic is used:

  If c0 = 0 then n0 = 0, n1 = 4 c1
  Else n0 = 1, n1 = c1 / c0

For the case of c1 < c0 we use the same heuristic, swapping 0 and 1. In the following example, we encounter a long string of zeros followed by a string of ones and show the model output. Note how n0 and n1 predict the relative outcome of 0 and 1 respectively, favoring the most recent data, with weight n = n0 + n1.

  Input           c0  c1  n0  n1
  -----           --  --  --  --
  0000000000      10   0  40   0
  00000000001      5   1   5   1
  000000000011     2   2   1   1
  0000000000111    1   3   1   3
  00000000001111   1   4   1   4

  Table 1. Example of counter state (c0,c1) and outputs (n0,n1)

In order to represent (c0,c1) as an 8-bit state, counts are restricted to the values 0-40, 44, 48, 56, 64, 96, 128, 160, 192, 224, or 255. Large counts are incremented probabilistically. For example, if c0 = 40 and a 0 is encountered, then c0 is set to 44 with probability 1/4. Decreases in counter values are deterministic, and are rounded down to the nearest representable state.
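As a concrete illustration of equation (3), the weight update, and the counter heuristic above, here is a short sketch written for this article (the names, the floating-point arithmetic, and the default wo = 0.5 are assumptions, not PAQ6's actual code):

#include <cmath>
#include <cstddef>
#include <vector>

// Weighted mix of N model outputs per equation (3):
// p = sum(w_i * n1_i) / sum(w_i * (n0_i + n1_i)).
double mix(const std::vector<double>& w, const std::vector<double>& n0,
           const std::vector<double>& n1) {
    double num = 0, den = 0;
    for (size_t i = 0; i < w.size(); ++i) {
        num += w[i] * n1[i];
        den += w[i] * (n0[i] + n1[i]);
    }
    return den > 0 ? num / den : 0.5;  // no evidence: fall back to 0.5
}

// Gradient step on the weights after the actual bit is known. e is the
// learning rate (~0.004); wo keeps the denominators finite near zero weights.
void update(std::vector<double>& w, const std::vector<double>& n0,
            const std::vector<double>& n1, int bit,
            double e = 0.004, double wo = 0.5) {
    double sy = 0, sn = 0;
    for (size_t j = 0; j < w.size(); ++j) {
        double nyj = bit ? n1[j] : n0[j];   // ny_j: count matching the bit
        sy += (w[j] + wo) * nyj;
        sn += (w[j] + wo) * (n0[j] + n1[j]);
    }
    for (size_t i = 0; i < w.size(); ++i) {
        double nyi = bit ? n1[i] : n0[i];
        if (sy > 0 && sn > 0)
            w[i] += e * (nyi / sy - (n0[i] + n1[i]) / sn);
    }
}

// Nonstationary counter update from Section 4: bump the count for the
// observed bit, decay the opposite count to favor recent data.
void updateCounter(int& cSame, int& cOpposite) {
    ++cSame;
    if (cOpposite > 25)     cOpposite = (int)std::sqrt((double)cOpposite) + 6;
    else if (cOpposite > 1) cOpposite /= 2;
}

PAQ6 itself stores the counts as restricted 8-bit states and selects the weight vectors by context as described above; the update rule is the one given in the formula.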
Counters are stored in a hash table indexed by contexts starting on byte boundaries and ending on nibble (4-bit) boundaries. Each hash element contains 15 counter states, representing the 15 possible values for the 0-3 remaining bits of the context after the last nibble boundary. Hash collisions are detected by storing an 8-bit checksum of the context. Each bucket contains 4 elements in a move-to-front queue. When a new element is to be inserted, the priorities of the two least recently accessed elements are compared by using n (n0+n1) of the initial counter as the priority, and the lower priority element is discarded. Hash buckets are aligned on 64 byte addresses to minimize cache misses.

5. RUN LENGTH MODELS

A second type of model is used to efficiently represent runs of up to 255 identical bytes within a context. For example, given the sequence "abc...abc...abc..." then a run length model would map "ab" -> ("c", 3) using a hash table indexed by "ab". If a new value is seen, e.g. "abd", then the state is updated to the new character and a count of 1, i.e. "ab" -> ("d", 1).

A run length context is accessed 8 times, once for each bit. If the bits seen so far are consistent with the modeled character, then the output of a run length model is (n0,n1) = (0,n) if the next bit is a 1, or (n,0) if the next bit is a 0, where n is the count (1 to 255). If the bits seen so far are not consistent with the predicted byte, then the output is (0,0). These counts are added to the counter state counts to produce the model output.

Run lengths are stored in a hash table without collision detection, so an element occupies 2 bytes. Generally, most models store one run length for every 8 counter pairs, so 20% of the memory is allocated to them. Run lengths are used only for memory option (-MEM) of 5 or higher.
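A run length model of this kind can be sketched as follows. This is an illustration for this article: PAQ6's real version lives in a fixed hash table without collision detection, whereas the sketch uses std::unordered_map for clarity.

#include <cstdint>
#include <unordered_map>
#include <utility>

// Maps a context (e.g. a hash of the last two bytes) to the byte last seen
// in that context and how many consecutive times it appeared (capped at 255).
struct RunLengthModel {
    struct Run { uint8_t byte = 0; uint8_t count = 0; };
    std::unordered_map<uint32_t, Run> table;

    // Predict (n0, n1) for the next bit of the current byte. bitsSeen holds
    // the numBits high-order bits of the partial byte read so far.
    std::pair<int, int> predict(uint32_t ctx, int bitsSeen, int numBits) const {
        auto it = table.find(ctx);
        if (it == table.end()) return {0, 0};
        const Run& r = it->second;
        // If the bits seen so far disagree with the modeled byte, output (0,0).
        if (numBits > 0 && (r.byte >> (8 - numBits)) != bitsSeen) return {0, 0};
        int nextBit = (r.byte >> (7 - numBits)) & 1;
        return nextBit ? std::make_pair(0, (int)r.count)
                       : std::make_pair((int)r.count, 0);
    }

    // After a whole byte is known: extend the run or start a new one.
    void update(uint32_t ctx, uint8_t byte) {
        Run& r = table[ctx];
        if (r.count > 0 && r.byte == byte) {
            if (r.count < 255) ++r.count;  // runs cap at 255
        } else {
            r = {byte, 1};                 // "ab" -> ("d", 1) on a new value
        }
    }
};

The predicted (n0, n1) pair is then added to the counter-state output for the same context, exactly as the text describes.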
6. SUBMODEL DETAILS

Submodels differ mainly in their contexts. These are as follows:

a. DefaultModel. (n0,n1) = (1,1) regardless of context.

b. CharModel (N-gram model). A context consists of the last 0 to N whole bytes, plus the 0 to 7 bits of the partially read current byte. The maximum N depends on the -MEM option as shown in the table below. The order 0 and 1 contexts use a counter state lookup table rather than a hash table.

  Order  Counters                Run lengths
  -----  --------                -----------
  0      2^8
  1      2^16
  2      2^(MEM+15)              2^(MEM+12), MEM >= 5
  3      2^(MEM+17)              2^(MEM+14), MEM >= 5
  4      2^(MEM+18)              2^(MEM+15), MEM >= 5
  5      2^(MEM+18), MEM >= 1    2^(MEM+15), MEM >= 5
  6      2^(MEM+18), MEM >= 3    2^(MEM+15), MEM >= 5
  7      2^(MEM+18), MEM >= 3    2^(MEM+15), MEM >= 5
  8      2^20, MEM = 5           2^17, MEM = 5
         2^(MEM+14), MEM >= 6    2^(MEM+14), MEM >= 6
  9      2^20, MEM = 5           2^17, MEM = 5
         2^(MEM+14), MEM >= 6    2^(MEM+14), MEM >= 6

  Table 2. Number of modeled contexts of length 0-9

c. MatchModel (long context). A context is the last n whole bytes (plus extra bits) where n >= 8. Up to 4 matching contexts are found by indexing into a rotating input buffer whose size depends on MEM. The index is a hash table of 32-bit pointers with 1/4 as many elements as the buffer (and therefore occupying an equal amount of memory). The table is indexed by hashes of 8 byte contexts. No collision detection is used. In order to detect very long matches at a long distance (for example, versions of a file compressed together), 1/16 of the pointers (chosen randomly) are indexed by a hash of a 32 byte context. For each match found, the output is (n0,n1) = (w,0) or (0,w) (depending on the next bit) with a weight of w = length^2 / 4 (maximum 511), depending on the length of the context in bytes. The four outputs are added together.

d. RecordModel. This models data with fixed length records, such as tables. The model attempts to find the record length by searching for characters that repeat in the pattern x..x..x..x where the interval between 4 successive occurrences is identical and at least 2. Because of uncertainty in this method, the two most recent values (which must be different) are used. The following 5 contexts are modeled:

  1. The two bytes above the current bit for each repeat length.
  2. The byte above and the previous byte (to the left) for each repeat length.
  3. The byte above and the current position modulo the repeat length, for the longer of the two lengths only.

e. SparseModel. This models contexts with gaps. It considers the following contexts, where x denotes the bytes considered and ? denotes the bit being predicted (plus preceding bits, which are included in the context).

  x.x?       (first and third byte back)
  x..x?
  x...x?
  x....x?
  xx.?
  x.x.?
  xx..?
  c ... c?   (gap length)
  c ... xc?  (gap length)

  Table 3. Sparse model contexts

The last two examples model variable gap lengths between the last byte and its previous occurrence. The length of the gap (up to 255) is part of the context.

f. AnalogModel. This is intended to model 16-bit audio (mono or stereo), 24-bit color images, and 8-bit data (such as grayscale images). Contexts drop the low order bits, and include the position within the file modulo 2, 3, or 4. There are 8 models, combined into 4 by addition before mixing. An x represents those bits which are part of the context.

  16 bit audio:
    xxxxxx.. ........ xxxxxx.. ........ ?  (position mod 2)
    xxxx.... ........ xxxxxx.. ........ ?  (position mod 2)
    xxxxxx.. ........ ........ ........ xxxxxx.. ........ xxxxxx.. ........ ?  (position mod 4 for stereo audio)

  24 bit color:
    xxxx.... ........ ........ xxxxxxxx ........ ........ ?  (position mod 3)
    xxxxxx.. xxxx.... xxxx.... ?  (position mod 3)

  8 bit data:
    xxx..... xxxxx... xxxxxxx. ?

  CCITT images (1 bit per pixel, 216 bytes wide, e.g. calgary/pic):
    xxxxxxxx (skip 215 bytes...) xxxxxxxx (skip 215 bytes...) ?

  Table 4. Analog models

g. WordModel. This is intended to model text files. There are 3 contexts:

  1. The current word
  2. The previous and current words
  3. The second to last and current words (skipping a word)

A word is defined in two different ways, resulting in a total of 6 different contexts:

  1. Any sequence of characters with ASCII code > 32 (not white space). Upper case characters are converted to lower case.
  2. Any sequence of A-Z and a-z (case sensitive).

h. ExeModel. This models 32-bit Intel .exe and .dll files by changing relative 32-bit CALL addresses to absolute. These instructions have the form (in hex) "E8 xx yy zz 00" or "E8 xx yy zz FF" where the 32-bit operand is stored least significant byte first. These are converted to absolute addresses by adding the position of the E8 byte, and then stored in a 256 element table indexed by the low order byte (xx) along with an 8-bit count. If another E8 xx ... 00/FF with the same value of xx is encountered, then the old value is replaced and the count set back to 1. During modeling, when "E8 xx" is encountered, the bytes yy, zz, and 00/FF are predicted by adjusting xx to an absolute address, then looking up the address in the table indexed by xx. If the context matches the table entry up to the current bit, then the next bit from the table is predicted with weight n for yy, 4n for zz, and 16n for 00/FF, where n is the count.
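The E8 bookkeeping at the heart of the ExeModel is easy to show in isolation. The sketch below is an illustration written for this article, not PAQ6's code; it performs the relative-to-absolute conversion and table update described above (the caller is assumed to guarantee pos + 4 is within the buffer).

#include <cstdint>
#include <cstddef>

// When "E8 xx yy zz 00/FF" is seen, convert the little-endian operand to an
// absolute address by adding the position of the E8 byte, and remember it in
// a 256-entry table indexed by the low operand byte xx, with an 8-bit count.
struct ExeTable {
    struct Entry { uint32_t addr = 0; uint8_t count = 0; };
    Entry table[256];

    void record(const uint8_t* buf, size_t pos) {  // pos = index of the E8 byte
        if (buf[pos] != 0xE8) return;
        uint8_t tail = buf[pos + 4];
        if (tail != 0x00 && tail != 0xFF) return;
        uint32_t rel = (uint32_t)buf[pos + 1]
                     | ((uint32_t)buf[pos + 2] << 8)
                     | ((uint32_t)buf[pos + 3] << 16)
                     | ((uint32_t)tail << 24);
        uint32_t abs = rel + (uint32_t)pos;        // relative -> absolute
        Entry& e = table[buf[pos + 1]];            // indexed by low byte xx
        if (e.addr == abs && e.count < 255) ++e.count;
        else e = {abs, 1};  // new value for this xx: replace, reset count to 1
    }
};

During prediction, the model would look up the entry for xx and, while the stored address agrees with the bits seen so far, predict the next bit with the weights n, 4n, and 16n given above.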
7. SSE

The purpose of the SSE stage is to further adjust the probability output from the mixers to agree with actual experience. Ideally this should not be necessary, but in reality this can improve compression. For example, when "compressing" random data, the output probability should be 0.5 regardless of what the models say. SSE will learn this by mapping all input probabilities to 0.5.

  [Fig. 3. Example of an SSE mapping: a piecewise linear curve mapping the input probability p to an adjusted output probability.]

SSE maps the probability p back to p using a piecewise linear function with 32 segments. Each vertex is represented by a pair of 8-bit counters (n0, n1) except that now the counters use a stationary model. When the input is p and a 0 or 1 is observed, then the corresponding count (n0 or n1) of the two vertices on either side of p are incremented. When a count exceeds the maximum of 255, both counts are halved. The output probability is a linear interpolation of n1/n between the vertices on either side. The vertices are scaled to be longer in the middle of the graph and short near the ends. The initial counts are set so that p maps to itself.

SSE is context sensitive. There are 2048 separately maintained SSE functions, selected by the 0-7 bits of the current (partial) byte and the 2 high order bits of the previous byte, and on whether the data is text or binary, using the same heuristic as for selecting the mixer context.

The final output to the encoder is a weighted average of the SSE input and output, with the output receiving 3/4 of the weight:

  p := (3 SSE(p) + p) / 4.    (4)
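The SSE stage can be sketched as a small piecewise linear table. This is a simplified illustration for this article: it uses uniform vertex spacing and floating-point probabilities, whereas PAQ6 packs its vertices more densely near 0 and 1 and keeps 2048 context-selected copies.

#include <initializer_list>

// Simplified SSE sketch: a piecewise linear map from p to an adjusted p,
// learned from (probability, actual bit) pairs.
struct SSE {
    static const int K = 32;      // number of segments (K + 1 vertices)
    int n0[K + 1], n1[K + 1];

    SSE() {
        // Initialize so that p maps to itself: at vertex i, n1/n = i/K.
        for (int i = 0; i <= K; ++i) { n1[i] = i; n0[i] = K - i; }
    }

    double map(double p) const {  // p in [0, 1]
        double v = p * K;
        int i = (int)v; if (i >= K) i = K - 1;
        double f = v - i;         // position between vertex i and i + 1
        double a = (double)n1[i] / (n0[i] + n1[i]);
        double b = (double)n1[i + 1] / (n0[i + 1] + n1[i + 1]);
        return a + f * (b - a);   // linear interpolation of n1/n
    }

    void update(double p, int bit) {  // adjust the two flanking vertices
        double v = p * K;
        int i = (int)v; if (i >= K) i = K - 1;
        for (int j : {i, i + 1}) {
            if (bit) ++n1[j]; else ++n0[j];
            if (n0[j] > 255 || n1[j] > 255) { n0[j] /= 2; n1[j] /= 2; }
        }
    }
};

Per equation (4), the probability actually handed to the encoder would then be (3 * sse.map(p) + p) / 4.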
8. MEMORY USAGE

The -m option (MEM = 0 through 9) controls model and memory usage. Smaller numbers compress faster and use less memory, while higher numbers compress better.

For MEM < 5, only one mixer is used. For MEM < 4, bit counts are stored in nonstationary counters, but no run length is stored (decreasing memory by 20%). For MEM < 1, SSE is not used. For MEM < 5, the record, sparse, and analog models are not used. For MEM < 4, the word model is not used. The order of the char model ranges from 4 to 9 depending on MEM, as shown in Table 5.

                Run                  Memory used by (MB)
  MEM  Mixers   Len  Order  Char  Match  Record  Sparse  Analog  Word  SSE  Total
  ---  ------   ---  -----  ----  -----  ------  ------  ------  ----  ---  -----
   0      1     no     4      .5      1                                       1.5
   1      1     no     5      1       2                                 .12     3
   2      1     no     5      2       4                                 .12     6
   3      1     no     7     10       8                                 .12    18
   4      1     no     7     20      16      6       6       1     15   .12    64
   5      2     yes    9     66      32     13      11       2     30   .12   154
   6      2     yes    9    112      32     13      11       4     30   .12   202
   7      2     yes    9    224      64     25      22       9     60   .12   404
   8      2     yes    9    448     128     50      45      18    120   .12   808
   9      2     yes    9    992     256    100      90      36    240   .12  1616

  Table 5. Memory usage depending on MEM (-0 to -9 option).

9. EXPERIMENTAL RESULTS

Results on the Calgary corpus are shown below for some top data compressors as of Dec. 30, 2003. Options are set for maximum compression. When possible, the files are all compressed into a single archive. Run times are on a 705 MHz Duron with 256 MB memory, and include 3 seconds to run WRT when applicable. PAQ6 was compiled with DJGPP (g++) 2.95.2 -O.

  Compressor / Options            Size       Time  Author
  ------------------------------  ---------  ----  ------
  (original size)                 3,141,622
  gzip 1.2.4 -9                   1,017,624     2  Jean Loup Gailly
  epm r9 c                          668,115    49  Serge Osnach
  rkc a -M80m -td+                  661,102    91  Malcolm Taylor
  slim 20 a                         659,213   159  Serge Voskoboynikov
  compressia 1.0 beta               650,398    66  Yaakov Gringeler
  durilca v.03a (as in README)      647,028    30  Dmitry Shkarin
  PAQ5                              661,811   361  Matt Mahoney
  WRT11 + PAQ5                      638,635   258  Przemyslaw Skibinski +
  PAQ6 -0                           858,954    52
       -1                           750,031    66
       -2                           725,798    76
       -3                           709,806    97
       -4                           655,694   354
       -5                           648,951   625
       -6                           648,892   636
  WRT11 + PAQ6 -6                   626,395   446
  WRT20 + PAQ6 -6                   617,734   439

  Table 6. Compressed size of the Calgary corpus.

WRT11 is a word reducing transform written by Przemyslaw Skibinski. It uses an external English dictionary to replace words with 1-3 byte symbols to improve compression. rkc, compressia, and durilca use a similar approach. WRT20 is a newer version of WRT11.

10. ACKNOWLEDGMENTS

Thanks to Serge Osnach for introducing me to SSE (in PAQ1SSE/PAQ2) and the sparse models (PAQ3N). Also, credit to Eugene Shelwein and Dmitry Shkarin for suggestions on using multiple character SSE contexts. Credit to Eugene, Serge, and Jason Schmidt for developing faster and smaller executables of previous versions. Credit to Werner Bergmans and Berto Destasio for testing and evaluating them, including modifications that improve compression at the cost of more memory. Credit to Alexander Ratushnyak who found a bug in PAQ4 decompression, and also in PAQ6 decompression for very small files (both fixed).

Thanks to Berto for writing PAQ5-EMILCONT-DEUTERIUM from which this program is derived (which he derived from PAQ5). His improvements to PAQ5 include a new Counter state table and additional contexts for CharModel and SparseModel. I refined the state table by adding more representable states and modified the return counts to give greater weight when there is a large difference between the two counts.

I expect there will be better versions in the future. If you make any changes, please change the name of the program (e.g. PAQ7), including the string in the archive header by redefining PROGNAME below. This will prevent any confusion about versions or archive compatibility. Also, give yourself credit in the help message.