Green hand
A text-compression model predicts (or counts) the probability with which each character occurs: the model supplies this probability distribution function (PDF) over characters, and the decoder applies the same distribution function to decode. Below we implement a first character-level model.
Equation (1): $H = -\sum_i P[i] \log_2 P[i]$
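To make Equation (1) concrete, here is a small sketch (my own, not from the article) that computes the entropy of a string in bits per character:

import math
from collections import Counter

def entropy(text):
    # Equation (1): average bits per character of an ideal code
    counts = Counter(text)
    total = len(text)
    return sum(-(c / total) * math.log2(c / total)
               for c in counts.values())

print(entropy('abracadabra'))   # ~2.04 bits per character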
Semi-static modeling
In a first pass over the text, we count the probability of each character (i.e. P[i] for character i), then use Equation (1) to set the length of each character's code, as sketched below.
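A minimal sketch of that first pass (my own illustration), taking ceil(-log2 P[i]) as a rough code length in the spirit of Equation (1); a real coder would feed the counts to a Huffman code builder instead:

import math
from collections import Counter

def semi_static_lengths(text):
    # First pass: estimate P[i] for every character in the text,
    # then derive a code length from Equation (1) as ceil(-log2 P[i]).
    counts = Counter(text)
    total = len(text)
    return {ch: math.ceil(-math.log2(c / total))
            for ch, c in counts.items()}

Adaptive modeling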
We start with a smooth (uniform) PDF over the characters, then recompute each character's probability from only the text received so far. For example, in a 1000-character passage, when the encoder or decoder reaches the 400th character and the character 'u' has appeared 20 times in the 400 characters read so far, we set P['u'] = 20.0/400. This way encoding and decoding share the same PDF model. To avoid the zero-frequency problem, we initialize every character with a count of 1, as in the sketch below.
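A sketch of such a shared adaptive model (illustrative names, not from the article). Note that with the initial count of 1 the estimate is slightly smoothed: after seeing 'u' 20 times in 400 characters of a 26-letter alphabet it returns 21/426 rather than exactly 20/400.

class AdaptiveModel:
    # Shared by encoder and decoder: both update it with the same
    # characters in the same order, so their PDFs stay identical.
    def __init__(self, alphabet):
        # every character starts with count 1 (zero-frequency fix)
        self.counts = {ch: 1 for ch in alphabet}
        self.total = len(alphabet)

    def prob(self, ch):
        return self.counts[ch] / self.total

    def update(self, ch):
        # called after each character is encoded or decoded
        self.counts[ch] += 1
        self.total += 1

Canonical Huffman modeling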
Take this case for instance: with an ordinary Huffman model, decoding an alphabet of n characters requires a tree of n-1 internal nodes and n leaves, each node carrying pointers. In total the tree takes about 4n machine words to decode n symbols; in practice, decoding with a 1M-symbol alphabet can cost up to 16MB of memory.
By contrast, a canonical Huffman code uses only about n + 100 words of memory. A canonical Huffman code is a special case of a Huffman code.
First, we provide the principles and some parameters:
Principles:
(*1). codewords of the same length are consecutive integers, e.g. 3, 4, 5 (decimal)
(*2). the first code of length i can be calculated from the last code of length i-1 using Equation (2)
(*3). the first code of the minimal length is 0 (decimal)
Parameters:
firstcode[i]: the first code of length i, calculated with Equation (2); it is stored as an actual binary codeword;
numl[i]: the number of codes of length i;
index[i]: the index of the first length-i code in the dictionary.
Equation (2): $\text{firstcode}[i] = 2 \times (\text{lastcode}[i-1] + 1)$, with $\text{firstcode}[\text{min\_len}] = 0$, where $\text{lastcode}[i-1] = \text{firstcode}[i-1] + \text{numl}[i-1] - 1$ is the last code of length $i-1$.
Second, construct the code words:
e.g. symbols 'a'~'u' with code lengths: 'a' is 3, 'b'..'i' are 4, 'j'..'u' are 5. By Principle (*3), 'a' gets the code 000b. By Equation (2), firstcode[4] = 2 x (0 + 1) = 2, so with Principle (*1) we easily get 'b' = 0010b, 'c' = 0011b, and so on; likewise firstcode[5] = 2 x (1001b + 1) = 10100b, so 'j' = 10100b. A construction sketch follows below.
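A construction sketch in Python (my own illustration; firstcode and numl follow the parameters defined above, and symbols of equal length receive consecutive codes per Principle (*1)):

def build_canonical(lengths):
    # lengths: dict mapping each symbol to its code length.
    # Returns a dict mapping each symbol to its codeword string.
    max_len = max(lengths.values())
    numl = [0] * (max_len + 1)      # numl[i]: number of codes of length i
    for l in lengths.values():
        numl[l] += 1
    # Equation (2): the first code of length i follows the last code of
    # length i-1; lengths with no codes just double the running value.
    firstcode = [0] * (max_len + 1)
    code = 0
    for i in range(1, max_len + 1):
        code = (code + numl[i - 1]) << 1
        firstcode[i] = code
    # Principle (*1): consecutive integer codes within each length
    codes, nextcode = {}, list(firstcode)
    for sym in sorted(lengths, key=lambda s: (lengths[s], s)):
        l = lengths[sym]
        codes[sym] = format(nextcode[l], '0{}b'.format(l))
        nextcode[l] += 1
    return codes

lengths = {'a': 3}
lengths.update({ch: 4 for ch in 'bcdefghi'})
lengths.update({ch: 5 for ch in 'jklmnopqrstu'})
codes = build_canonical(lengths)
print(codes['a'], codes['b'], codes['j'])   # 000 0010 10100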
Finally, decoding algorithm:
The key property: the value of the first j bits of a codeword of length i (i > j) is greater than the value of any codeword of length j.
We first determine the actual length i of the next pending codeword; then the offset of its value from firstcode[i], added to index[i], locates the symbol in the dictionary. A decoding sketch follows below.
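A decoding sketch under the same assumptions (my illustration: dictionary holds the symbols sorted by code length, index[i] and firstcode[i] are as defined above, and bitstream yields bits as 0/1 integers):

def decode_symbol(bitstream, firstcode, numl, index, dictionary):
    # Grow the candidate value v one bit at a time; v is a complete
    # length-l codeword exactly when it falls inside the length-l range,
    # since prefixes of longer codewords are greater than lastcode[l].
    v, l = 0, 0
    while True:
        v = (v << 1) | next(bitstream)
        l += 1
        if numl[l] and firstcode[l] <= v < firstcode[l] + numl[l]:
            return dictionary[index[l] + (v - firstcode[l])]

# '0010' decodes to 'b', then '10100' decodes to 'j'
bits = iter([0, 0, 1, 0, 1, 0, 1, 0, 0])
dictionary = 'abcdefghijklmnopqrstu'          # sorted by code length
numl = [0, 0, 0, 1, 8, 12]                    # from the example above
firstcode = [0, 0, 0, 0, 2, 20]
index = [0, 0, 0, 0, 1, 9]                    # index[i] = index[i-1] + numl[i-1]
print(decode_symbol(bits, firstcode, numl, index, dictionary))  # b
print(decode_symbol(bits, firstcode, numl, index, dictionary))  # j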
Python code:
”’
Only character-level with 26 English characters to be as an example, without complimenting encoding Canonical Huffman Model
#!/usr/bin/env python
import re
def lines(file):
'''
to seperate single characters into a list and add '\n' at the