Encoding and decoding: data compression

Latest recommended article published on 2021-05-23 09:07:14

numenshane1


Category: algorithm. Tags: compression, string, character, algorithm, encoding, output


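Types of encoding

Encoding, in cognition, is a basic perceptual process of interpreting incoming stimuli. Technically, it is a complex, multi-stage conversion from relatively objective sensory input (such as light or sound) to subjectively meaningful experience.

Character encoding is a set of rules that pairs a collection of characters from a natural language (such as an alphabet or a syllabary) with a collection of something else (such as numbers or electrical pulses).

Text encoding uses a markup language to tag the structure and other features of a text so that a computer can process it more easily.

Semantics encoding of formal language A in formal language B is a method of expressing all the terms of language A (such as programs or descriptions) using language B.

Electronic encoding converts a signal into a code that is optimized for transmission or storage. The conversion is usually done by a codec.

Neural encoding refers to the way information is represented in neurons.

Memory encoding is the process of converting sensations into memories.

Encryption is the process of transforming information in order to keep it secret.

Transcoding is the process of converting an encoding from one format to another.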

Reduce redundant information in the data stream: count high-frequency words and replace them with shorter codes.
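As a rough illustration of the idea above, the following minimal sketch counts word frequencies and replaces the most frequent words with small integer codes. The function names (`build_word_table`, `encode`, `decode`) and the `top_k` parameter are assumptions made for this example, not part of any particular compressor.

```python
from collections import Counter


def build_word_table(text, top_k=2):
    """Map the top_k most frequent words to small integer codes (illustrative)."""
    counts = Counter(text.split())
    # most_common() sorts by count; equal counts keep first-seen order.
    return {word: code for code, (word, _) in enumerate(counts.most_common(top_k))}


def encode(text, table):
    """Replace each word found in the table with its code; keep other words."""
    return [table.get(word, word) for word in text.split()]


def decode(tokens, table):
    """Invert encode(): integer tokens are looked up, strings pass through."""
    reverse = {code: word for word, code in table.items()}
    return " ".join(reverse[t] if isinstance(t, int) else t for t in tokens)


text = "the cat and the dog and the bird"
table = build_word_table(text)
tokens = encode(text, table)
print(tokens)
print(decode(tokens, table) == text)
```

Here the two most frequent words, "the" and "and", become the codes 0 and 1, and decoding restores the original sentence.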

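A mathematical game

Designing a concrete compression algorithm is usually more like playing a mathematical game. The developer first has to find a way to count or estimate, as precisely as possible, the probability of each symbol appearing in the message, and then design a set of coding rules that describes each symbol with the shortest possible code. Statistics serves the first task well: to date, people have implemented static models, semi-static models, adaptive models, Markov models, prediction by partial matching models, and other probabilistic models. By comparison, the development of coding methods has followed a more tortuous path.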
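To make the two steps concrete, here is a minimal sketch that estimates symbol probabilities by counting over the whole message (closest in spirit to a semi-static model) and then computes the ideal code length of each symbol, -log2(p), together with the entropy in bits per symbol. The function names are illustrative assumptions.

```python
import math
from collections import Counter


def symbol_probabilities(data):
    """Estimate each symbol's probability from its relative frequency."""
    counts = Counter(data)
    total = len(data)
    return {sym: n / total for sym, n in counts.items()}


def ideal_code_lengths(probs):
    """Ideal code length in bits for each symbol: -log2(p)."""
    return {sym: -math.log2(p) for sym, p in probs.items()}


def entropy_bits_per_symbol(probs):
    """Average ideal code length, i.e. the Shannon entropy of the model."""
    return -sum(p * math.log2(p) for p in probs.values())


probs = symbol_probabilities("aaaabbcd")
lengths = ideal_code_lengths(probs)
entropy = entropy_bits_per_symbol(probs)
print(lengths)
print(entropy)
```

For the message "aaaabbcd" the probabilities are 1/2, 1/4, 1/8 and 1/8, so the ideal lengths are 1, 2, 3 and 3 bits and the entropy is 1.75 bits per symbol, against 8 bits per symbol for plain bytes.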

Huffman coding: A Method for the Construction of Minimum-Redundancy Codes
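A minimal sketch of Huffman code construction follows, assuming the symbol weights are simply counted from the message. It repeatedly merges the two lightest subtrees, prefixing "0" to the codes in one and "1" to the codes in the other. The name `huffman_code` and the tie-breaking order are choices made for this example; other valid Huffman codes for the same input may assign different bit patterns with the same lengths.

```python
import heapq
from collections import Counter


def huffman_code(data):
    """Return {symbol: bit string} for a Huffman code built from symbol counts."""
    counts = Counter(data)
    if not counts:
        return {}
    if len(counts) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(counts)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: partial code}).
    # The tie-breaker keeps the dicts from ever being compared.
    heap = [(n, i, {sym: ""}) for i, (sym, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # lightest subtree
        w2, _, right = heapq.heappop(heap)   # second lightest subtree
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


message = "aaaabbcd"
code = huffman_code(message)
encoded = "".join(code[s] for s in message)
print(code)
print(len(encoded), "bits")
```

For "aaaabbcd" the code lengths come out as 1, 2, 3 and 3 bits, so the eight symbols take 14 bits in total.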

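A tale from a different tribe

Reverse thinking has always been a winning trick in science and technology. While most people were racking their brains to improve Huffman or arithmetic coding in search of a "perfect" code that balances running speed and compression ratio, a series of compression algorithms emerged that are more effective than Huffman coding and faster than arithmetic coding: the LZ family of algorithms.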

In chronological order, the development of the LZ family of algorithms runs roughly as follows:

A Universal Algorithm for Sequential Data Compression (Ziv and Lempel, 1977)

Compression of Individual Sequences via Variable-Rate Coding (Ziv and Lempel, 1978)

A Technique for High-Performance Data Compression (Welch, 1984), the LZW algorithm

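New technical features

Oracle9i introduced table compression, but with major restrictions: only data involved in bulk-load operations (such as direct-path loads and CTAS) could be compressed, while data written by ordinary DML operations could not. Presumably the problem of compressing on the write path had not been solved. It lingered until Oracle11g, which finally resolved the write-performance problem of compressing relational data. Oracle table compression works at the block level, and the main technique is much the same as in Oracle9i: a symbol table is introduced in the block, and duplicated data in the block is represented by a single entry in the symbol table. Oracle compresses a block in batches rather than every time data is written into it, which keeps the impact of compression on DML performance as low as possible. This suggests that a new block-level parameter would be introduced to trigger compression once the amount of uncompressed data in the block reaches some threshold.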
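The idea of a per-block symbol table with batched compression can be sketched as follows. This is a toy model only and does not reflect Oracle's actual block format or parameters: the `Block` class, the `("ref", i)` / `("raw", v)` cell encoding, and the `compress_threshold` parameter are all hypothetical names invented for the illustration.

```python
from collections import Counter


class Block:
    """Toy data block whose rows are tuples of column values (hypothetical)."""

    def __init__(self, compress_threshold=3):
        self.symbols = []      # symbol table: one entry per repeated value
        self.compressed = []   # rows stored as ("ref", index) or ("raw", value) cells
        self.pending = []      # uncompressed rows written since the last batch
        self.compress_threshold = compress_threshold  # hypothetical parameter

    def insert(self, row):
        """Append a row uncompressed; compress only once enough rows pile up."""
        self.pending.append(tuple(row))
        if len(self.pending) >= self.compress_threshold:
            self._compress_batch()

    def _compress_batch(self):
        """Rebuild the symbol table over all rows and re-encode them."""
        rows = self.rows()
        counts = Counter(value for row in rows for value in row)
        self.symbols = [value for value, n in counts.items() if n > 1]
        index = {value: i for i, value in enumerate(self.symbols)}
        self.compressed = [
            tuple(("ref", index[v]) if v in index else ("raw", v) for v in row)
            for row in rows
        ]
        self.pending = []

    def rows(self):
        """Return all rows in their original, decoded form."""
        decoded = [
            tuple(self.symbols[x] if kind == "ref" else x for kind, x in row)
            for row in self.compressed
        ]
        return decoded + self.pending


block = Block(compress_threshold=3)
data = [("ACME", "NY"), ("ACME", "LA"), ("ACME", "NY"), ("BETA", "NY")]
for row in data:
    block.insert(row)
print(block.symbols)
print(block.compressed)
print(block.pending)
```

After the third insert the block is compressed in one batch, so "ACME" and "NY" are each stored once in the symbol table. The fourth row stays uncompressed until the threshold is reached again, which is the trade-off the paragraph describes: ordinary writes stay cheap, and the compression cost is paid occasionally.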

LZW

http://marknelson.us/1989/10/01/lzw-data-compression/

The routines shown here belong in any programmer's toolbox. For example, a program that has a few dozen help screens could easily chop 50K bytes off by compressing the screens. Or 500K bytes of software could be distributed to end users on a single 360K byte floppy disk. Highly redundant database files can be compressed down to 10% of their original size. Once the tools are available, the applications for compression will show up on a regular basis.

LZW Fundamentals

The algorithm is surprisingly simple. In a nutshell, LZW compression replaces strings of characters with single codes. It does not do any analysis of the incoming text. Instead, it just adds every new string of characters it sees to a table of strings. Compression occurs when a single code is output instead of a string of characters.

The code that the LZW algorithm outputs can be of any arbitrary length, but it must have more bits in it than a single character. The first 256 codes (when using eight bit characters) are by default assigned to the standard character set. The remaining codes are assigned to strings as the algorithm proceeds. The sample program runs as shown with 12 bit codes. This means codes 0-255 refer to individual bytes, while codes 256-4095 refer to substrings.

Compression

The LZW compression algorithm in its simplest form is shown in Figure 1. A quick examination of the algorithm shows that LZW is always trying to output codes for strings that are already known. And each time a new code is output, a new string is added to the string table.

Routine LZW_COMPRESS

STRING = get input character
WHILE there are still input characters DO
    CHARACTER = get input character
    IF STRING+CHARACTER is in the string table THEN
        STRING = STRING+CHARACTER
    ELSE
        output the code for STRING
        add STRING+CHARACTER to the string table
        STRING = CHARACTER
    END of IF
END of WHILE
output the code for STRING

The Compression Algorithm
Figure 1
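The routine in Figure 1 can be sketched in Python as below. This is a minimal illustration rather than the article's own C program: the function name `lzw_compress`, the `code_bits` parameter, and the choice to simply stop adding strings once the table is full are assumptions made here, since the pseudocode does not say what happens at that point.

```python
def lzw_compress(data, code_bits=12):
    """Return the list of LZW codes for a bytes object, following Figure 1."""
    if not data:
        return []
    max_codes = 1 << code_bits                      # 4096 codes for 12 bits
    # Codes 0-255 are preassigned to the single-byte strings.
    table = {bytes([i]): i for i in range(256)}
    next_code = 256
    codes = []
    string = data[:1]                               # STRING = get input character
    for i in range(1, len(data)):
        character = data[i:i + 1]                   # CHARACTER = get input character
        candidate = string + character
        if candidate in table:
            string = candidate                      # keep extending the match
        else:
            codes.append(table[string])             # output the code for STRING
            if next_code < max_codes:               # assumption: freeze a full table
                table[candidate] = next_code
                next_code += 1
            string = character
    codes.append(table[string])                     # output the code for STRING
    return codes


codes = lzw_compress(b"/WED/WE/WEE/WEB/WET")
print(codes)
```

Codes below 256 are plain byte values (47 is '/', 87 is 'W', and so on), while codes from 256 upward refer to strings added to the table during compression.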

A sample string used to demonstrate the algorithm is shown in Figure 2. The input string is a short list of English words separated by the '/' character. Stepping through the start of the algorithm for this string, you can see that on the first pass through the loop, a check is performed to see if the string "/W" is in the table. Since it isn't, the code for '/' is output, and the string "/W" is added to the table. Since we have 256 characters already defined for codes 0-255, the first string definition can be assigned to code 256. After the third letter, 'E', has been read in, the second string code, "WE", is added to the table, and the code for letter 'W' is output. This continues until, in the second word, the characters '/' and 'W' are read in, matching string number 256. In this case, the code 256 is output, and a three character string is added to the string table. The process continues until the string is exhausted and all of the codes have been output.

Input String = /WED/WE/WEE/WEB/WET

Character Input    Code Output    New Code Value    New String
/W                 /              256               /W
E                  W              257               WE
D                  E              258               ED
/                  D              259               D/
WE                 256            260               /WE
/                  E              261               E/
WEE                260            262               /WEE
/W                 261            263               E/W
EB                 257            264               WEB
/                  B              265               B/
WET                260            266               /WET
EOF                T

The Compression Process
Figure 2

The sample output for the string is shown in Figure 2 along with the resulting string table. As can be seen, the string table fills up rapidly, since a new string is added to the table each time a code is output. In this highly redundant input, 5 code substitutions were output, along with 7 characters. If we were using 9 bit codes for output, the 19 character input string would be reduced to a 13.5 byte output string. Of course, this example was carefully chosen to demonstrate code substitution. In real world examples, compression usually doesn't begin until a sizable table has been built, usually after at least one hundred or so bytes have been read in.
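The 13.5 byte figure follows from 12 output codes at 9 bits each: 12 x 9 = 108 bits = 13.5 bytes. The sketch below packs the twelve codes from Figure 2 (characters written as their ASCII values) into bytes. The name `pack_codes`, the most-significant-bit-first ordering, and the zero padding of the final byte are assumptions for this example. A real file has to round up to whole bytes, so the packed result is 14 bytes long.

```python
def pack_codes(codes, code_bits=9):
    """Pack fixed-width codes MSB-first into bytes, zero-padding the last byte."""
    accumulator = 0     # bits waiting to be written out
    bit_count = 0       # number of valid bits in the accumulator
    out = bytearray()
    for code in codes:
        if not 0 <= code < (1 << code_bits):
            raise ValueError(f"code {code} does not fit in {code_bits} bits")
        accumulator = (accumulator << code_bits) | code
        bit_count += code_bits
        while bit_count >= 8:
            bit_count -= 8
            out.append((accumulator >> bit_count) & 0xFF)
        accumulator &= (1 << bit_count) - 1     # drop the bits already written
    if bit_count:
        # Left-align the remaining bits in a final, zero-padded byte.
        out.append((accumulator << (8 - bit_count)) & 0xFF)
    return bytes(out)


figure2_codes = [47, 87, 69, 68, 256, 69, 260, 261, 257, 66, 260, 84]
packed = pack_codes(figure2_codes)
print(len(figure2_codes) * 9 / 8)   # size in bytes before rounding up
print(len(packed))                  # size in whole bytes
```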

Dictionary

http://www.codeproject.com/KB/recipes/Patterns.aspx


http://www.alexmayers.com/projects/DIPRE/

The algorithm keeps expanding its search scope.