Greedy Algorithms: Entropy

 

Entropy

Description

An entropy encoder is a data encoding method that achieves lossless data compression by encoding a message with "wasted" or "extra" information removed. In other words, entropy encoding removes information that was not necessary in the first place to accurately encode the message. A high degree of entropy implies a message with a great deal of wasted information; English text encoded in ASCII is an example of a message type that has very high entropy. Already compressed messages, such as JPEG graphics or ZIP archives, have very little entropy and do not benefit from further attempts at entropy encoding.

English text encoded in ASCII has a high degree of entropy because all characters are encoded using the same number of bits, eight. It is a known fact that the letters E, L, N, R, S and T occur at a considerably higher frequency than do most other letters in English text. If a way could be found to encode just these letters with four bits, then the new encoding would be smaller, would contain all the original information, and would have less entropy. ASCII uses a fixed number of bits for a reason, however: it's easy, since one is always dealing with a fixed number of bits to represent each possible glyph or character. How would an encoding scheme that used four bits for the above letters be able to distinguish between the four-bit codes and eight-bit codes? This seemingly difficult problem is solved using what is known as a "prefix-free variable-length" encoding.

In such an encoding, any number of bits can be used to represent any glyph, and glyphs not present in the message are simply not encoded. However, in order to be able to recover the information, no bit pattern that encodes a glyph is allowed to be the prefix of any other encoding bit pattern. This allows the encoded bitstream to be read bit by bit, and whenever a set of bits is encountered that represents a glyph, that glyph can be decoded. If the prefix-free constraint were not enforced, then such a decoding would be impossible.
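The prefix-free constraint can be checked mechanically. The following is a minimal sketch, assuming the code words are given as strings of '0' and '1' characters; the function name `isPrefixFree` is made up for illustration and is not part of the problem statement.

```cpp
#include <string>
#include <vector>

// Returns true when no code word is a prefix of a different code word.
// `codes` is a hypothetical list of bit patterns written as '0'/'1' strings.
bool isPrefixFree(const std::vector<std::string>& codes) {
    for (std::size_t i = 0; i < codes.size(); ++i) {
        for (std::size_t j = 0; j < codes.size(); ++j) {
            if (i == j) continue;
            // Does codes[j] start with codes[i]?
            if (codes[j].size() >= codes[i].size() &&
                codes[j].compare(0, codes[i].size(), codes[i]) == 0) {
                return false;
            }
        }
    }
    return true;
}
```

With this sketch, the set {"0", "10", "110", "111"} passes the check, while {"0", "01", "11"} fails because "0" is a prefix of "01".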

Consider the text "AAAAABCD". Using ASCII, encoding this would require 64 bits. If, instead, we encode "A" with the bit pattern "00", "B" with "01", "C" with "10", and "D" with "11" then we can encode this text in only 16 bits; the resulting bit pattern would be "0000000000011011". This is still a fixed-length encoding, however; we're using two bits per glyph instead of eight. Since the glyph "A" occurs with greater frequency, could we do better by encoding it with fewer bits? In fact we can, but in order to maintain a prefix-free encoding, some of the other bit patterns will become longer than two bits. An optimal encoding is to encode "A" with "0", "B" with "10", "C" with "110", and "D" with "111". (This is clearly not the only optimal encoding, as it is obvious that the encodings for B, C and D could be interchanged freely for any given encoding without increasing the size of the final encoded message.) Using this encoding, the message encodes in only 13 bits to "0000010110111", a compression ratio of 4.9 to 1 (that is, each bit in the final encoded message represents as much information as did 4.9 bits in the original encoding). Read through this bit pattern from left to right and you'll see that the prefix-free encoding makes it simple to decode this into the original text even though the codes have varying bit lengths.
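To make the left-to-right decoding concrete, here is a small sketch that uses the code table from this example (A=0, B=10, C=110, D=111). The names `kCodes`, `encode` and `decode` are chosen for illustration only, and bits are kept as '0'/'1' characters for readability rather than packed into bytes.

```cpp
#include <map>
#include <string>

// Code table taken from the example: A=0, B=10, C=110, D=111.
const std::map<char, std::string> kCodes = {
    {'A', "0"}, {'B', "10"}, {'C', "110"}, {'D', "111"}};

// Concatenates the code word of every glyph in `text`.
std::string encode(const std::string& text) {
    std::string bits;
    for (char ch : text) bits += kCodes.at(ch);  // at() throws for glyphs outside the table
    return bits;
}

// Reads the bitstream one bit at a time; as soon as the pending bits
// match a code word, emits that glyph and starts over. This only works
// because the table is prefix-free.
std::string decode(const std::string& bits) {
    std::string text, pending;
    for (char bit : bits) {
        pending += bit;
        for (const auto& entry : kCodes) {
            if (entry.second == pending) {
                text += entry.first;
                pending.clear();
                break;
            }
        }
    }
    return text;
}
```

Encoding "AAAAABCD" with this table gives the 13-bit string "0000010110111", and decoding that string returns the original text.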

As a second example, consider the text "THE CAT IN THE HAT". In this text, the letter "T" and the space character both occur with the highest frequency, so they will clearly have the shortest encoding bit patterns in an optimal encoding. The letters "C", "I" and "N" only occur once, however, so they will have the longest codes.

There are many possible sets of prefix-free variable-length bit patterns that would yield the optimal encoding, that is, that would allow the text to be encoded in the fewest number of bits. One such optimal encoding is to encode spaces with "00", "A" with "100", "C" with "1110", "E" with "1111", "H" with "110", "I" with "1010", "N" with "1011" and "T" with "01". The optimal encoding therefore requires only 51 bits compared to the 144 that would be necessary to encode the message with 8-bit ASCII encoding, a compression ratio of 2.8 to 1.
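The 51-bit figure can be checked by summing the code length of every glyph in the message. The sketch below does exactly that; `encodedBits` and `catInTheHatTable` are illustrative names, and the table is copied from the paragraph above.

```cpp
#include <map>
#include <string>

// Total number of bits needed to encode `text` with the given code table.
int encodedBits(const std::string& text,
                const std::map<char, std::string>& table) {
    int bits = 0;
    for (char ch : text) bits += static_cast<int>(table.at(ch).size());
    return bits;
}

// The optimal table quoted for "THE CAT IN THE HAT".
std::map<char, std::string> catInTheHatTable() {
    return {{' ', "00"},  {'A', "100"},  {'C', "1110"}, {'E', "1111"},
            {'H', "110"}, {'I', "1010"}, {'N', "1011"}, {'T', "01"}};
}
```

The message has four spaces and four T's at 2 bits each, two A's and three H's at 3 bits each, and two E's plus one each of C, I and N at 4 bits each: 16 + 15 + 20 = 51 bits, against 18 × 8 = 144 bits in ASCII.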

Input

The input file will contain a list of text strings, one per line. The text strings will consist only of uppercase alphanumeric characters and underscores (which are used in place of spaces). The end of the input will be signalled by a line containing only the word "END" as the text string. This line should not be processed.

Output

For each text string in the input, output the length in bits of the 8-bit ASCII encoding, the length in bits of an optimal prefix-free variable-length encoding, and the compression ratio accurate to one decimal point.
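One way to produce an output line in this shape is sketched below. The helper name `formatLine` and its two parameters are assumptions made for illustration; the optimal bit count is taken as an input here and must be positive.

```cpp
#include <cstdio>
#include <string>

// Builds one output line: ASCII bits, optimal bits, ratio with one decimal.
// `glyphs` is the number of characters in the text string and
// `optimalBits` is the length of an optimal encoding (must be > 0).
std::string formatLine(int glyphs, int optimalBits) {
    int asciiBits = 8 * glyphs;
    char buf[64];
    std::snprintf(buf, sizeof buf, "%d %d %.1f", asciiBits, optimalBits,
                  static_cast<double>(asciiBits) / optimalBits);
    return buf;
}
```

For the two sample strings, `formatLine(8, 13)` and `formatLine(18, 51)` yield the two lines of the sample output.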

Sample Input

AAAAABCD
THE_CAT_IN_THE_HAT
END

Sample Output

64 13 4.9
144 51 2.8

 

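Problem analysis: the statement is long, but it boils down to Huffman coding.

Approach: a straightforward application of Huffman coding solves the problem, but one edge case needs care: an input string made up of a single repeated character.

The code is as follows: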
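Only the total encoded length is needed, not the codes themselves, so the greedy step can be sketched compactly. The version below swaps in a `std::priority_queue` min-heap instead of the array scan used in the full listing in this post; the function name `optimalBits` is hypothetical. It relies on the fact that each merge of the two smallest weights adds one bit to every glyph beneath the new node, so the answer is the sum of all merged weights.

```cpp
#include <functional>
#include <queue>
#include <string>
#include <vector>

// Length in bits of an optimal prefix-free encoding of `text`.
int optimalBits(const std::string& text) {
    int freq[256] = {0};
    for (unsigned char ch : text) ++freq[ch];

    // Min-heap of the weights of the subtrees built so far.
    std::priority_queue<int, std::vector<int>, std::greater<int>> heap;
    for (int f : freq) {
        if (f > 0) heap.push(f);
    }

    // Edge case: a single distinct glyph still needs one bit per occurrence.
    if (heap.size() == 1) return heap.top();

    int bits = 0;
    while (heap.size() > 1) {
        int a = heap.top(); heap.pop();
        int b = heap.top(); heap.pop();
        bits += a + b;     // every glyph under the merged node gets one bit deeper
        heap.push(a + b);
    }
    return bits;
}
```

For "AAAAABCD" the merges are 1+1, 1+2 and 3+5, giving 2 + 3 + 8 = 13 bits, which matches the sample output. For a string such as "AAAA" no merge happens and the sketch returns 4, one bit per character.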

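// 1521.cpp : entry point of the console application.
//

#include "stdafx.h"  // MSVC precompiled header; drop it when not building with Visual Studio

#include <cstdio>    // printf
#include <cstring>   // strcmp
#include <iostream>
#include <iomanip>
#include <string>
using namespace std;

#define maxchar 1000000
char value[maxchar]; // stores the input string

struct Node
{
    char c;                // the character
    int count;             // number of occurrences
    int longs;             // code length in bits
    int leftson, rightson; // left and right children
    int parent;            // parent node
    bool visit;            // visited flag
};
Node node[55]; // 27 leaf symbols, so fewer than 27*2 nodes in total

void result();
int min(int i); // picks the unvisited node with the smallest count
void ceng(int root, int i);

int main()
{
    result();
    return 0;
}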

void result()
{
    // Clear the input buffer
    for(int i=0;i<maxchar;i++)
        value[i]='\0';
    for(int i=0;i<26;i++)
    {
        node[i].c='A'+i;
        node[i].count=0; // no occurrences yet
        node[i].leftson=node[i].rightson=node[i].parent=-1; // -1 means "no child" / "no parent"
        node[i].visit=false;
        node[i].longs=0; // code length
    }
    node[26].c='_';
    node[26].count=node[26].longs=0;
    node[26].leftson=node[26].rightson=node[26].parent=-1;
    node[26].visit=false;
    cin>>value;
    while(strcmp(value,"END")!=0)
    {
        int length=0;     // number of distinct characters
        int zifulength=0; // index of the last character, i.e. string length minus 1
        for(int i=0;value[i]!='\0';i++)
        {
            // Only 'A'-'Z' and '_' are counted; any other character is ignored
            for(int j=0;j<27;j++)
            {
                if(value[i]==node[j].c)
                {
                    node[j].count+=1; // count occurrences
                }
            }
            zifulength=i;
        }
        for(int i=0;i<27;i++)
        {
            if(node[i].count>0)
            {
                // count how many distinct characters occur
                length+=1;
            }
        }
        int start=27; // slots 0-26 hold A-Z and '_'; internal nodes start at 27
        for(int i=0;i<length-1;i++)
        {
            // n distinct characters need exactly n-1 merges
            int min1=min(i); // each merge creates one new node
            int min2=min(i);
            // link the new parent and its two children
            node[start].count=node[min1].count+node[min2].count;
            node[start].leftson=min1;
            node[start].rightson=min2;
            node[min1].parent=start;
            node[min2].parent=start;
            node[start].parent=-1;
            node[start].visit=false;
            start+=1;
        }
        if(length==1)
        {
            // With a single distinct character the loop above never runs.
            // Each character takes one bit, so the ratio is always 8.0.
            double resul=8.0;
            printf("%d %d %.1f\n", 8*(zifulength+1), zifulength+1, resul);
        }
        if(length>1)
        {
            int root=27+length-2; // the root is always the last node created
            ceng(root,0); // compute the code length of every node
            int sum=0;
            for(int i=0;i<27;i++)
            {
                if(node[i].count>0)
                {
                    sum+=node[i].count*node[i].longs;
                }
            }
            double ratio=double(8*(zifulength+1))/sum;
            printf("%d %d %.1f\n", 8*(zifulength+1), sum, ratio);
        }
        // Reset the buffer and the leaf nodes before reading the next string
        for(int i=0;i<maxchar;i++)
            value[i]='\0';
        for(int i=0;i<26;i++)
        {
            node[i].c='A'+i;
            node[i].count=0;
            node[i].leftson=node[i].rightson=node[i].parent=-1;
            node[i].visit=false;
            node[i].longs=0;
        }
        node[26].c='_';
        node[26].count=node[26].longs=0;
        node[26].leftson=node[26].rightson=node[26].parent=-1;
        node[26].visit=false;
        cin>>value;
    }
}

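// Returns the index of the unvisited node with the smallest positive count
// among the 27 leaves and the i internal nodes created so far, and marks it visited.
int min(int i)
{
    int min=maxchar;
    int mini=-1;
    for(int start=0;start<27+i;start++)
    {
        if(node[start].visit==false)
        {
            if((node[start].count>0)&&(node[start].count<min))
            {
                min=node[start].count;
                mini=start;
            }
        }
    }
    if(mini!=-1)
        node[mini].visit=true;
    return mini;
}

// Walks the tree from root and stores each node's depth (its code length) in longs.
void ceng(int root,int i)
{
    node[root].longs=i;
    if(node[root].leftson!=-1)
        ceng(node[root].leftson,i+1);
    if(node[root].rightson!=-1)
        ceng(node[root].rightson,i+1);
}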

When submitting to the ZJU ACM online judge, use the following headers instead:

#include <stdio.h>
#include <string.h>
#include <iostream>
#include <iomanip>
#include <string>

 
