A nice algorithm for Huffman coding (priority queue)

You’ve probably heard about David Huffman and his popular compression algorithm. If you haven’t, you’ll find that information on the Internet. I will not bore you with history or math lessons in this article. I’m going to show you a practical example of this algorithm applied to a character string.

The idea behind Huffman coding is based on the frequency of a symbol in a sequence. The most frequent symbol in that sequence gets a very short code, and the least frequent symbol gets a long one. When we encode the input, the most frequent symbols take less space than they used to; the least frequent symbols may take more space, but because they are rare it won’t matter much. For this application I chose the symbol to be 8 bits long, so that a symbol is a character (char).

We could just as easily have chosen the symbol to be 16 bits long, grouping 2 characters together as one symbol, or 10 bits, or 20, and so on. We choose the size of the symbol and the way we use it depending on the input we expect. For example, if I expect to encode raw video files, I’ll choose the symbol to be the size of a pixel. Keep in mind that changing the size of the symbol affects the size of the code for each symbol, because the bigger the symbol, the more distinct symbols of that size there can be. There are fewer ways to arrange ones and zeroes in 8 bits (256) than in 16 bits (65,536). You’ll want to adjust the size of the symbol depending on how the ones and zeroes are likely to repeat themselves in a sequence.
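To make the effect of the symbol size concrete, here is a minimal Python sketch (Python is used only for brevity; the article’s own sources are C). The helper names `symbol_count` and `split_into_symbols` are made up for this illustration.

```python
def symbol_count(bits_per_symbol):
    """Number of distinct symbols that fit in the given number of bits."""
    return 2 ** bits_per_symbol


def split_into_symbols(data, bytes_per_symbol):
    """Cut a byte string into fixed-size symbols (the last one may be shorter)."""
    return [data[i:i + bytes_per_symbol]
            for i in range(0, len(data), bytes_per_symbol)]


text = b"beep boop beer!"
one_byte_symbols = split_into_symbols(text, 1)  # 8-bit symbols: single chars
two_byte_symbols = split_into_symbols(text, 2)  # 16-bit symbols: char pairs

print(symbol_count(8), "possible 8-bit symbols,",
      len(set(one_byte_symbols)), "distinct ones in the input")
print(symbol_count(16), "possible 16-bit symbols,",
      len(set(two_byte_symbols)), "distinct ones in the input")
```

With 16-bit symbols the input has fewer symbols in total, but the alphabet it is drawn from is much larger, which is the trade-off described above.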

For this algorithm you need a basic understanding of the binary tree and priority queue data structures. In the source code we’ll actually use the priority queue code from a previous article.

Let’s say we have the string “beep boop beer!”, which in its raw form occupies 1 byte of memory for each character. In total that is 15*8 = 120 bits of memory. After encoding, the string will occupy 40 bits. (Strictly speaking, this application outputs to the console a string of 40 char elements, each 0 or 1, representing the encoded version of the string in bits. For the result to really occupy 40 bits we would need to pack that string into actual bits using bitwise operations, which we won’t discuss now.)

To better understand the algorithm, we’re going to apply it to an example. The string “beep boop beer!” is a very good one for this. To obtain the code for each element depending on its frequency, we need to build a binary tree such that each leaf of the tree contains a symbol (a character from the string). The tree is built from the leaves to the root, meaning that the least frequent elements end up farther from the root than the more frequent ones. You’ll soon see why we chose to do this.

To build the tree this way we’ll use a priority queue with a slight modification: the element with the lowest priority is the most important. This means the least frequent elements are the first ones we get from the queue. We need this so we can build the tree from the leaves to the root.
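As an illustration of such a “lowest priority first” queue, here is a small Python sketch built on the standard `heapq` module. It is not the priority queue from the previous article; the class name `MinPriorityQueue` and its methods are assumptions made for this example.

```python
import heapq
import itertools


class MinPriorityQueue:
    """Priority queue where the lowest priority value is removed first."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker: insertion order

    def push(self, priority, item):
        heapq.heappush(self._heap, (priority, next(self._order), item))

    def pop(self):
        priority, _, item = heapq.heappop(self._heap)
        return priority, item

    def __len__(self):
        return len(self._heap)


queue = MinPriorityQueue()
for symbol, frequency in [("e", 4), ("b", 3), ("p", 2), ("r", 1)]:
    queue.push(frequency, symbol)

# The least frequent symbol comes out first.
popped = [queue.pop() for _ in range(len(queue))]
print(popped)
```

The insertion counter keeps the order stable when two entries have the same priority, so the queue never has to compare the stored items themselves.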

First we calculate the frequency of each character:

Character Frequency
‘b’ 3
‘e’ 4
‘p’ 2
‘ ’ 2
‘o’ 2
‘r’ 1
‘!’ 1
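The frequency table above can be reproduced with a few lines of Python. This is only a sketch using the standard `collections.Counter`; the variable names are illustrative.

```python
from collections import Counter

text = "beep boop beer!"
frequencies = Counter(text)  # maps each character to its number of occurrences

for symbol, count in frequencies.most_common():
    print(repr(symbol), count)
```

The counts add up to 15, the length of the input string.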

After calculating the frequencies, we create a binary tree node for each character and insert the nodes into the priority queue, using the frequency as the priority.

We now take the first two elements from the queue and link them by creating a new binary tree node that has both of them as children, so that the two characters are siblings. The new node’s priority is the sum of its children’s priorities, and we add the new node back into the queue with that priority. (The numbers represent the priority, i.e. the frequency.)

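We repeat the same steps, each time removing the two lowest-priority nodes and inserting their parent, until only two elements remain in the queue.

After we link the last two elements, we get the final tree. Its root has priority 15, the length of the input string.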
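The construction described above can be sketched in Python as follows. This is a minimal illustration, not the article’s C code: the function names are invented, a tree is represented as either a single character (leaf) or a `(left, right)` tuple, and the input is assumed to contain at least two distinct symbols. Ties between equal priorities may be broken differently than in the original figures, so the individual codes can differ, but the total encoded length is the same.

```python
import heapq
import itertools
from collections import Counter


def build_huffman_tree(text):
    """Return (root_weight, tree) for the given text."""
    order = itertools.count()  # tie-breaker so the heap never compares trees
    heap = [(freq, next(order), symbol)
            for symbol, freq in Counter(text).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Take the two least frequent nodes and link them under a new parent.
        weight_a, _, tree_a = heapq.heappop(heap)
        weight_b, _, tree_b = heapq.heappop(heap)
        heapq.heappush(heap, (weight_a + weight_b, next(order), (tree_a, tree_b)))
    weight, _, tree = heap[0]
    return weight, tree


def code_lengths(tree, depth=0):
    """Map each symbol to its depth in the tree, i.e. its code length."""
    if not isinstance(tree, tuple):  # leaf
        return {tree: depth}
    left, right = tree
    lengths = code_lengths(left, depth + 1)
    lengths.update(code_lengths(right, depth + 1))
    return lengths


text = "beep boop beer!"
root_weight, tree = build_huffman_tree(text)
lengths = code_lengths(tree)
frequencies = Counter(text)
total_bits = sum(frequencies[s] * lengths[s] for s in frequencies)
print("root weight:", root_weight, "encoded size in bits:", total_bits)
```

The root weight equals the number of characters in the input, and the encoded size comes out at the 40 bits mentioned earlier.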

Now, to obtain the code for each symbol we just need to traverse the tree from the root until we reach that symbol: for each step to the left we add a 0 to the code, and for each step to the right we add a 1.

If we do this, we get the following codes:

Character Code
‘b’ 00
‘e’ 11
‘p’ 101
‘ ’ 011
‘o’ 010
‘r’ 1000
‘!’ 1001
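The left = 0, right = 1 traversal can be sketched in Python like this. The `TREE` constant below is not taken from the original figures; it is the tree shape implied by the code table above, written as nested `(left, right)` tuples with characters as leaves. The function name `assign_codes` is an assumption for this example.

```python
# Tree implied by the code table: (left, right) pairs, leaves are characters.
TREE = (
    ("b", ("o", " ")),
    ((("r", "!"), "p"), "e"),
)


def assign_codes(tree, prefix=""):
    """Walk the tree, appending 0 for a left step and 1 for a right step."""
    if not isinstance(tree, tuple):  # reached a leaf: prefix is its code
        return {tree: prefix}
    left, right = tree
    codes = assign_codes(left, prefix + "0")
    codes.update(assign_codes(right, prefix + "1"))
    return codes


codes = assign_codes(TREE)
for symbol, code in sorted(codes.items(), key=lambda item: (len(item[1]), item[1])):
    print(repr(symbol), code)
```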

To decode a string of bits we just need to traverse the tree bit by bit: if the bit is 0 we take a step to the left, and if the bit is 1 we take a step to the right, until we hit a leaf (which is the symbol we are looking for). Then we start again from the root for the next bit. For example, if we have the string “101 11 101 11” and our tree, decoding it gives the string “pepe”.
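Here is a short Python sketch of that decoding loop. As an assumption for the example, the tree is hard-coded as nested `(left, right)` tuples matching the code table above, and spaces in the bit string are ignored because they only serve readability.

```python
# Tree implied by the code table: (left, right) pairs, leaves are characters.
TREE = (
    ("b", ("o", " ")),
    ((("r", "!"), "p"), "e"),
)


def decode(bits, tree):
    """Follow the bits down the tree, emitting a symbol at every leaf."""
    symbols = []
    node = tree
    for bit in bits:
        if bit not in "01":
            continue  # skip separators such as spaces
        node = node[0] if bit == "0" else node[1]
        if not isinstance(node, tuple):  # leaf reached
            symbols.append(node)
            node = tree  # restart from the root for the next symbol
    if node is not tree:
        raise ValueError("bit string ends in the middle of a code")
    return "".join(symbols)


print(decode("101 11 101 11", TREE))
print(decode("0011 1110 1011 0001 0010 1010 1100 1111 1000 1001", TREE))
```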

It’s very important to observe that no code is a prefix of the code for another symbol. In our example, if 00 is the code for ‘b’, then 000 cannot be the code for any other symbol, because there would be a conflict: after taking steps for the first two bits we reach ‘b’, so we would never find the symbol for 000.
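The prefix property is easy to check mechanically. The following Python sketch uses a deliberately simple pairwise comparison; the function name `is_prefix_free` and the extra symbol ‘x’ are hypothetical and exist only for this illustration.

```python
def is_prefix_free(codes):
    """Return True if no code is a prefix of a different symbol's code."""
    words = list(codes.values())
    for i, first in enumerate(words):
        for j, second in enumerate(words):
            if i != j and second.startswith(first):
                return False
    return True


codes = {"b": "00", "e": "11", "p": "101", " ": "011",
         "o": "010", "r": "1000", "!": "1001"}
print(is_prefix_free(codes))

# Adding 000 for a made-up symbol 'x' clashes with 00 for 'b'.
broken = dict(codes, x="000")
print(is_prefix_free(broken))
```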

A practical aspect of implementing this algorithm is to build a Huffman table as soon as we have the tree. The table is basically a linked list or an array that contains each symbol with its code, and it makes encoding more efficient. Finding a symbol by traversing the tree while computing its code at the same time is expensive, because we don’t know where in the tree that symbol is located. As a rule, we use a Huffman table for encoding and a Huffman tree for decoding.
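A table lookup makes encoding a one-liner. In this Python sketch the table is a dictionary rather than the linked list or array mentioned above, which is a substitution made for brevity; the names `HUFFMAN_TABLE` and `encode` are assumptions.

```python
# Symbol -> code, copied from the code table above.
HUFFMAN_TABLE = {
    "b": "00", "e": "11", "p": "101", " ": "011",
    "o": "010", "r": "1000", "!": "1001",
}


def encode(text, table):
    """Replace every symbol by its code; no tree traversal is needed."""
    return "".join(table[symbol] for symbol in text)


encoded = encode("beep boop beer!", HUFFMAN_TABLE)
print(encoded, len(encoded))
```

Each symbol costs one lookup, whereas searching the tree for the symbol would mean visiting nodes until it is found.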

The input string: beep boop beer!

The input string in binary: 0110 0010 0110 0101 0110 0101 0111 0000 0010 0000 0110 0010 0110 1111 0110 1111 0111 0000 0010 0000 0110 0010 0110 0101 0110 0101 0111 0010 0010 0001

The encoded string: 0011 1110 1011 0001 0010 1010 1100 1111 1000 1001

As you can see, there is a major difference between the ASCII version of the string (120 bits) and the Huffman-coded version (40 bits).
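To actually store the result in 40 bits rather than 40 characters, the string of 0s and 1s has to be packed into bytes. The article leaves this step out, so the following Python sketch is only one possible way to do it; the function name `pack_bits` and the zero-padding convention are assumptions.

```python
def pack_bits(bit_string):
    """Pack a string of '0'/'1' characters into bytes.

    Returns (packed_bytes, padding), where padding is the number of zero
    bits appended to fill the last byte.
    """
    bits = bit_string.replace(" ", "")
    padding = (-len(bits)) % 8
    bits += "0" * padding
    packed = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return packed, padding


encoded = "0011 1110 1011 0001 0010 1010 1100 1111 1000 1001"
packed, padding = pack_bits(encoded)
print(len("beep boop beer!"), "bytes raw,", len(packed), "bytes packed,",
      padding, "padding bits")
```

A decoder would also need the tree (or the table) and the number of valid bits, so the real saving on such a short input is smaller than the raw bit counts suggest.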

The source code behaves as described above. You’ll find more details in the comments in the code.

All the sources have been compiled and verified using the C99 standard. Happy programming.
