HDU 1053 Entropy【哈夫曼编码入门题】

Entropy

Time Limit: 2000/1000 MS (Java/Others)    Memory Limit: 65536/32768 K (Java/Others)
Total Submission(s): 5992    Accepted Submission(s): 2517


Problem Description
An entropy encoder is a data encoding method that achieves lossless data compression by encoding a message with “wasted” or “extra” information removed. In other words, entropy encoding removes information that was not necessary in the first place to accurately encode the message. A high degree of entropy implies a message with a great deal of wasted information; english text encoded in ASCII is an example of a message type that has very high entropy. Already compressed messages, such as JPEG graphics or ZIP archives, have very little entropy and do not benefit from further attempts at entropy encoding.

English text encoded in ASCII has a high degree of entropy because all characters are encoded using the same number of bits, eight. It is a known fact that the letters E, L, N, R, S and T occur at a considerably higher frequency than do most other letters in english text. If a way could be found to encode just these letters with four bits, then the new encoding would be smaller, would contain all the original information, and would have less entropy. ASCII uses a fixed number of bits for a reason, however: it’s easy, since one is always dealing with a fixed number of bits to represent each possible glyph or character. How would an encoding scheme that used four bits for the above letters be able to distinguish between the four-bit codes and eight-bit codes? This seemingly difficult problem is solved using what is known as a “prefix-free variable-length” encoding.

In such an encoding, any number of bits can be used to represent any glyph, and glyphs not present in the message are simply not encoded. However, in order to be able to recover the information, no bit pattern that encodes a glyph is allowed to be the prefix of any other encoding bit pattern. This allows the encoded bitstream to be read bit by bit, and whenever a set of bits is encountered that represents a glyph, that glyph can be decoded. If the prefix-free constraint was not enforced, then such a decoding would be impossible.

Consider the text “AAAAABCD”. Using ASCII, encoding this would require 64 bits. If, instead, we encode “A” with the bit pattern “00”, “B” with “01”, “C” with “10”, and “D” with “11” then we can encode this text in only 16 bits; the resulting bit pattern would be “0000000000011011”. This is still a fixed-length encoding, however; we’re using two bits per glyph instead of eight. Since the glyph “A” occurs with greater frequency, could we do better by encoding it with fewer bits? In fact we can, but in order to maintain a prefix-free encoding, some of the other bit patterns will become longer than two bits. An optimal encoding is to encode “A” with “0”, “B” with “10”, “C” with “110”, and “D” with “111”. (This is clearly not the only optimal encoding, as it is obvious that the encodings for B, C and D could be interchanged freely for any given encoding without increasing the size of the final encoded message.) Using this encoding, the message encodes in only 13 bits to “0000010110111”, a compression ratio of 4.9 to 1 (that is, each bit in the final encoded message represents as much information as did 4.9 bits in the original encoding). Read through this bit pattern from left to right and you’ll see that the prefix-free encoding makes it simple to decode this into the original text even though the codes have varying bit lengths.

As a second example, consider the text “THE CAT IN THE HAT”. In this text, the letter “T” and the space character both occur with the highest frequency, so they will clearly have the shortest encoding bit patterns in an optimal encoding. The letters “C”, “I’ and “N” only occur once, however, so they will have the longest codes.

There are many possible sets of prefix-free variable-length bit patterns that would yield the optimal encoding, that is, that would allow the text to be encoded in the fewest number of bits. One such optimal encoding is to encode spaces with “00”, “A” with “100”, “C” with “1110”, “E” with “1111”, “H” with “110”, “I” with “1010”, “N” with “1011” and “T” with “01”. The optimal encoding therefore requires only 51 bits compared to the 144 that would be necessary to encode the message with 8-bit ASCII encoding, a compression ratio of 2.8 to 1.
 

Input
The input file will contain a list of text strings, one per line. The text strings will consist only of uppercase alphanumeric characters and underscores (which are used in place of spaces). The end of the input will be signalled by a line containing only the word “END” as the text string. This line should not be processed.
 

Output
For each text string in the input, output the length in bits of the 8-bit ASCII encoding, the length in bits of an optimal prefix-free variable-length encoding, and the compression ratio accurate to one decimal point.
 

Sample Input
  
  
AAAAABCD THE_CAT_IN_THE_HAT END
 

Sample Output
  
  
64 13 4.9 144 51 2.8
 

Source

Greater New York 2000


原题链接:http://acm.hdu.edu.cn/showproblem.php?pid=1053


题意:哈夫曼编码。
题意是,给出一排字符串,要求求出字符的8位编码的长度,哈夫曼编码值,以及之间的比值
因为仅仅只要求求出哈夫曼编码值,所以不用建立哈夫曼树,可以建立优先队列,只要将每次最小的
出队的两个元素合成一个新的大数,然后放进优先队列中,直到只剩下一个元素为止,那个元素就是哈夫曼编码值。

此题关于哈夫曼编码。
哈夫曼树(霍夫曼树)又称为最优树.
基本术语:
1、路径和路径长度
在一棵树中,从一个结点往下可以达到的孩子或孙子结点之间的通路,称为路径。通路中分支的数目称为路径长度。若规定根结点的层数为1,则从根结点到第L层结点的路径长度为L-1。
2、结点的权及带权路径长度
若将树中结点赋给一个有着某种含义的数值,则这个数值称为该结点的权。结点的带权路径长度为:从根结点到该结点之间的路径长度与该结点的权的乘积。
3、树的带权路径长度
树的带权路径长度规定为所有叶子结点的带权路径长度之和,记为WPL。
构造:
假设有n个权值,则构造出的哈夫曼树有n个叶子结点。 n个权值分别设为 w1、w2、…、wn,则哈夫曼树的构造规则为:
(1) 将w1、w2、…,wn看成是有n 棵树的森林(每棵树仅有一个结点);
(2) 在森林中选出两个根结点的权值最小的树合并,作为一棵新树的左、右子树,且新树的根结点权值为其左、右子树根结点权值之和;
(3)从森林中删除选取的两棵树,并将新树加入森林;
(4)重复(2)、(3)步,直到森林中只剩一棵树为止,该树即为所求得的哈夫曼树。
哈夫曼 编码
在数据通信中,需要将传送的文字转换成二进制的字符串,用0,1码的不同排列来表示字符。例如,需传送的报文为“AFTER DATA EAR ARE ART AREA”,这里用到的字符集为“A,E,R,T,F,D”,各字母出现的次数为{8,4,5,3,1,1}。现要求为这些字母设计编码。要区别6个字母,最简单的二进制编码方式是等长编码,固定采用3位二进制,可分别用000、001、010、011、100、101对“A,E,R,T,F,D”进行编码发送,当对方接收报文时再按照三位一分进行译码。显然编码的长度取决报文中不同字符的个数。若报文中可能出现26个不同字符,则固定编码长度为5。然而,传送报文时总是希望总长度尽可能短。在实际应用中,各个字符的出现频度或使用次数是不相同的,如A、B、C的使用频率远远高于X、Y、Z,自然会想到设计编码时,让使用频率高的用短码,使用频率低的用长码,以优化整个报文编码。
为使不等长编码为前缀 编码 (即要求一个字符的编码不能是另一个字符编码的前缀),可用字符集中的每个字符作为叶子结点生成一棵编码二叉树,为了获得传送报文的最短长度,可将每个字符的出现频率作为字符结点的权值赋予该结点上,显然字使用频率越小权值越小,权值越小叶子就越靠下,于是频率小编码长,频率高编码短,这样就保证了此树的最小带权路径长度效果上就是传送报文的最短长度。因此,求传送报文的最短长度问题转化为求由字符集中的所有字符作为叶子结点,由字符出现频率作为其权值所产生的哈夫曼树的问题。利用哈夫曼树来设计二进制的前缀编码,既满足前缀编码的条件,又保证报文编码总长最短。


AC代码1:

/**
  * 行有余力,则来刷题!
  * 博客链接:http://blog.csdn.net/hurmishine
  *
*/
#include <iostream>
#include <cstdio>
#include <cstring>
#include <queue>
#include <vector>
using namespace std;
char str[1000];
int cnt[30];
int main()
{
    //freopen("data.txt","r",stdin);
    while(cin>>str)
    {
        if(!strcmp(str,"END")) break;
        memset(cnt,0,sizeof(cnt));
        int len=strlen(str);
        for(int i=0;i<len;i++)
        {
            if(str[i]=='_')
                cnt[26]++;
            else
                cnt[str[i]-'A']++;
        }
        //C++ CE,G++ AC
        priority_queue<int,vector<int>,greater<int> >q;
        for(int i=0;i<=26;i++)
        {
            if(cnt[i])
            {
                q.push(cnt[i]);
            }
        }
        int sum=0;
        while(q.size()>1)
        {
            int x=q.top();q.pop();
            int y=q.top();q.pop();
            q.push(x+y);
            sum+=(x+y);
        }
        if(sum==0)//只有一个元素
            sum=len;
        printf("%d %d %.1lf\n",len*8,sum,(len*8.0)/(sum*1.0));
    }
    return 0;
}

AC代码2:

/**
  * 行有余力,则来刷题!
  * 博客链接:http://blog.csdn.net/hurmishine
  *
*/
#include <iostream>
#include <cstdio>
#include <cstring>
#include <queue>
#include <vector>
using namespace std;
char str[1000];
int cnt[30];
struct Node
{
    int x;
    friend bool operator <(Node a,Node b)
    {
        return a.x > b.x;
    }
};
int main()
{
    //freopen("data.txt","r",stdin);
    while(cin>>str)
    {
        if(!strcmp(str,"END")) break;
        memset(cnt,0,sizeof(cnt));
        int len=strlen(str);
        for(int i=0;i<len;i++)
        {
            if(str[i]=='_')
                cnt[26]++;
            else
                cnt[str[i]-'A']++;
        }
        priority_queue<Node>q;
        Node node1,node2;
        for(int i=0;i<=26;i++)
        {
            if(cnt[i])
            {
                node1.x=cnt[i];
                q.push(node1);
            }

        }
        int sum=0;
        while(q.size()>1)
        {
            node1=q.top();q.pop();
            node2=q.top();q.pop();
            int t=node1.x+node2.x;
            sum+=t;
            node1.x+=node2.x;
            q.push(node1);

        }
        if(sum==0)
            sum=len;
        printf("%d %d %.1lf\n",len*8,sum,(len*8.0)/(sum*1.0));
    }
    return 0;
}

参考博客: http://blog.csdn.net/techmonster/article/details/49889261

http://blog.csdn.net/acmman/article/details/38554187


尊重原创,转载请注明出处:http://blog.csdn.net/hurmishine

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值