Green hand
A text-compression model predicts (or counts) the probability with which each character occurs: the model supplies this probability distribution function (PDF) over characters, and the decoder applies the same distribution function to decode. Below we implement a first character-level model.
Equation (1): $H = -\sum_i P[i] \log_2 P[i]$
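To make Equation (1) concrete, here is a small sketch (my own, not from the article) that computes the entropy of a string in bits per character:

import math
from collections import Counter

def entropy(text):
    # Equation (1): average bits per character of an ideal code
    counts = Counter(text)
    total = len(text)
    return sum(-(c / total) * math.log2(c / total)
               for c in counts.values())

print(entropy('abracadabra'))   # ~2.04 bits per character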
Semi-static modeling
In a first pass over the text, we count the probability of each character (i.e. P[i] for character i), then use Equation (1) to set the length of each character's code, as sketched below.
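A minimal sketch of that first pass (my own illustration), taking ceil(-log2 P[i]) as a rough code length in the spirit of Equation (1); a real coder would feed the counts to a Huffman code builder instead:

import math
from collections import Counter

def semi_static_lengths(text):
    # First pass: estimate P[i] for every character in the text,
    # then derive a code length from Equation (1) as ceil(-log2 P[i]).
    counts = Counter(text)
    total = len(text)
    return {ch: math.ceil(-math.log2(c / total))
            for ch, c in counts.items()}

Adaptive modeling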
We start with a smooth (uniform) PDF over the characters, then recompute each character's probability from only the text received so far. For example, in a 1000-character passage, when the encoder or decoder reaches the 400th character and the character 'u' has appeared 20 times in the 400 characters read so far, we set P['u'] = 20.0/400. This way encoding and decoding share the same PDF model. To avoid the zero-frequency problem, we initialize every character with a count of 1, as in the sketch below.
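A sketch of such a shared adaptive model (illustrative names, not from the article). Note that with the initial count of 1 the estimate is slightly smoothed: after seeing 'u' 20 times in 400 characters of a 26-letter alphabet it returns 21/426 rather than exactly 20/400.

class AdaptiveModel:
    # Shared by encoder and decoder: both update it with the same
    # characters in the same order, so their PDFs stay identical.
    def __init__(self, alphabet):
        # every character starts with count 1 (zero-frequency fix)
        self.counts = {ch: 1 for ch in alphabet}
        self.total = len(alphabet)

    def prob(self, ch):
        return self.counts[ch] / self.total

    def update(self, ch):
        # called after each character is encoded or decoded
        self.counts[ch] += 1
        self.total += 1

Canonical Huffman modeling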
Take this case for instance: with an ordinary Huffman model, decoding an alphabet of n characters requires a tree of n-1 internal nodes and n leaves, each node carrying pointers. In total the tree takes about 4n machine words to decode n symbols; in practice, decoding with a 1M-symbol alphabet can cost up to 16MB of memory.
By contrast, a canonical Huffman code uses only about n + 100 words of memory. A canonical Huffman code is a special case of a Huffman code.
First, we provide the principles and some parameters:
Principles:
(*1). codewords of the same length are consecutive integers, e.g. 3, 4, 5 (decimal)
(*2). the first code of length i can be calculated from the last code of length i-1 using Equation (2)
(*3). the first code of the minimal length is 0 (decimal)
Parameters:
firstcode[i]: the first code of length i, calculated with Equation (2); it is stored as an actual binary codeword;
numl[i]: the number of codes of length i;
index[i]: the index of the first length-i code in the dictionary.
Equation (2): $\text{firstcode}[i] = 2 \times (\text{lastcode}[i-1] + 1)$, with $\text{firstcode}[\text{min\_len}] = 0$, where $\text{lastcode}[i-1] = \text{firstcode}[i-1] + \text{numl}[i-1] - 1$ is the last code of length $i-1$.
Second, construct the code words:
e.g. symbols 'a'~'u' with code lengths: 'a' is 3, 'b'..'i' are 4, 'j'..'u' are 5. By Principle (*3), 'a' gets the code 000b. By Equation (2), firstcode[4] = 2 x (0 + 1) = 2, so with Principle (*1) we easily get 'b' = 0010b, 'c' = 0011b, and so on; likewise firstcode[5] = 2 x (1001b + 1) = 10100b, so 'j' = 10100b. A construction sketch follows below.
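A construction sketch in Python (my own illustration; firstcode and numl follow the parameters defined above, and symbols of equal length receive consecutive codes per Principle (*1)):

def build_canonical(lengths):
    # lengths: dict mapping each symbol to its code length.
    # Returns a dict mapping each symbol to its codeword string.
    max_len = max(lengths.values())
    numl = [0] * (max_len + 1)      # numl[i]: number of codes of length i
    for l in lengths.values():
        numl[l] += 1
    # Equation (2): the first code of length i follows the last code of
    # length i-1; lengths with no codes just double the running value.
    firstcode = [0] * (max_len + 1)
    code = 0
    for i in range(1, max_len + 1):
        code = (code + numl[i - 1]) << 1
        firstcode[i] = code
    # Principle (*1): consecutive integer codes within each length
    codes, nextcode = {}, list(firstcode)
    for sym in sorted(lengths, key=lambda s: (lengths[s], s)):
        l = lengths[sym]
        codes[sym] = format(nextcode[l], '0{}b'.format(l))
        nextcode[l] += 1
    return codes

lengths = {'a': 3}
lengths.update({ch: 4 for ch in 'bcdefghi'})
lengths.update({ch: 5 for ch in 'jklmnopqrstu'})
codes = build_canonical(lengths)
print(codes['a'], codes['b'], codes['j'])   # 000 0010 10100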
Finally, decoding algorithm:
The key property: the value of the first j bits of a codeword of length i (i > j) is greater than the value of any codeword of length j.
We first determine the actual length i of the next pending codeword; then the offset of its value from firstcode[i], added to index[i], locates the symbol in the dictionary. A decoding sketch follows below.
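A decoding sketch under the same assumptions (my illustration: dictionary holds the symbols sorted by code length, index[i] and firstcode[i] are as defined above, and bitstream yields bits as 0/1 integers):

def decode_symbol(bitstream, firstcode, numl, index, dictionary):
    # Grow the candidate value v one bit at a time; v is a complete
    # length-l codeword exactly when it falls inside the length-l range,
    # since prefixes of longer codewords are greater than lastcode[l].
    v, l = 0, 0
    while True:
        v = (v << 1) | next(bitstream)
        l += 1
        if numl[l] and firstcode[l] <= v < firstcode[l] + numl[l]:
            return dictionary[index[l] + (v - firstcode[l])]

# '0010' decodes to 'b', then '10100' decodes to 'j'
bits = iter([0, 0, 1, 0, 1, 0, 1, 0, 0])
dictionary = 'abcdefghijklmnopqrstu'          # sorted by code length
numl = [0, 0, 0, 1, 8, 12]                    # from the example above
firstcode = [0, 0, 0, 0, 2, 20]
index = [0, 0, 0, 0, 1, 9]                    # index[i] = index[i-1] + numl[i-1]
print(decode_symbol(bits, firstcode, numl, index, dictionary))  # b
print(decode_symbol(bits, firstcode, numl, index, dictionary))  # j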
Python code:
”’
Only character-level with 26 English characters to be as an example, without complimenting encoding Canonical Huffman Model
#!/usr/bin/env python
import re
def lines(file):
'''
to seperate single characters into a list and add '\n' at the