Implementing Huffman Coding in Java

  This post explains how to implement Huffman coding in Java.

Concept

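  First, a brief introduction to Huffman coding. It is a data-encoding scheme that also serves as a data-compression method: it converts each character into a string of binary digits and, in doing so, reduces the storage needed for the original string. The procedure is as follows. Sort the symbols by probability, add the two smallest probabilities to form a new probability, and re-sort it together with the remaining ones. Add the two smallest again and re-sort, repeating until the final probability is 1. At each addition, assign "0" to one of the two probabilities being added and "1" to the other. To read off a symbol's code, start at that symbol and follow the additions up to the final probability of 1, then arrange the "0"s and "1"s met along the way from the lowest bit to the highest (equivalently, read the bits from the root down to the symbol). The result is that symbol's Huffman code.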
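The procedure above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the code used later in this post: the names `HuffmanSketch`, `buildTree`, `collectCodes` and `codeTable` are hypothetical, a `PriorityQueue` stands in for the repeated re-sorting, and each leaf stores its own symbol so that symbols with equal weights stay distinct. As in the listing further down, a left branch contributes "1" and a right branch "0".

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.PriorityQueue;

class HuffmanSketch {

    /** Tree node: a leaf carries a symbol, an internal node only the summed weight. */
    static final class Node implements Comparable<Node> {
        final char symbol;   // meaningful only for leaves
        final long weight;
        final Node left;
        final Node right;

        Node(char symbol, long weight) {
            this.symbol = symbol;
            this.weight = weight;
            this.left = null;
            this.right = null;
        }

        Node(Node left, Node right) {
            this.symbol = '\0';
            this.weight = left.weight + right.weight;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null && right == null;
        }

        @Override
        public int compareTo(Node other) {
            return Long.compare(weight, other.weight);
        }
    }

    /** Repeatedly merges the two lightest nodes until a single root remains. */
    static Node buildTree(Map<Character, Long> counts) {
        if (counts.isEmpty()) {
            throw new IllegalArgumentException("at least one symbol is required");
        }
        PriorityQueue<Node> queue = new PriorityQueue<>();
        for (Map.Entry<Character, Long> e : counts.entrySet()) {
            queue.add(new Node(e.getKey(), e.getValue()));
        }
        while (queue.size() > 1) {
            Node first = queue.poll();
            Node second = queue.poll();
            queue.add(new Node(first, second));
        }
        return queue.poll();
    }

    /** Walks the tree, appending "1" for a left branch and "0" for a right branch. */
    static void collectCodes(Node node, String prefix, Map<Character, String> table) {
        if (node.isLeaf()) {
            // A single-symbol alphabet would otherwise get an empty code
            table.put(node.symbol, prefix.isEmpty() ? "0" : prefix);
            return;
        }
        collectCodes(node.left, prefix + "1", table);
        collectCodes(node.right, prefix + "0", table);
    }

    /** Builds the tree and returns the symbol-to-code table. */
    static Map<Character, String> codeTable(Map<Character, Long> counts) {
        Map<Character, String> table = new LinkedHashMap<>();
        collectCodes(buildTree(counts), "", table);
        return table;
    }

    public static void main(String[] args) {
        Map<Character, Long> counts = new LinkedHashMap<>();
        counts.put('a', 5L);
        counts.put('b', 2L);
        counts.put('c', 1L);
        counts.put('d', 1L);
        // The code lengths are 1, 2, 3 and 3; the exact bits depend on tie-breaking
        System.out.println(codeTable(counts));
    }
}
```

Working with integer counts rather than probabilities avoids floating-point rounding; the resulting tree is the same because only the relative order of the weights matters.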

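Exercise requirements

1. This experiment uses Huffman coding to source-encode an existing English text.
2. Taking the 26 letters as the source symbols, count their distribution in the given material and build a probability model of the source.
3. Encode each source symbol with the chosen coding technique.
4. Compress the English text and compute the compression efficiency.
5. Extension: list the symbols that actually occur in the source material, build the source probability model, and encode them. (Not implemented in this post.)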
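Requirements 4 and 5 can be illustrated with a small sketch. It is not part of the implementation below; the names `SourceModelSketch`, `countSymbols`, `encodedBits` and `compressionRatio` are hypothetical, and the baseline of 8 bits per symbol is an assumption that matches the one-byte-per-character accounting used later in this post.

```java
import java.util.Map;
import java.util.TreeMap;

class SourceModelSketch {

    /** Counts every character that actually occurs in the text, not only a-z. */
    static Map<Character, Long> countSymbols(String text) {
        Map<Character, Long> counts = new TreeMap<>();
        for (char c : text.toCharArray()) {
            counts.merge(c, 1L, Long::sum);
        }
        return counts;
    }

    /** Total encoded size in bits: the sum of count(symbol) * codeLength(symbol). */
    static long encodedBits(Map<Character, Long> counts, Map<Character, Integer> codeLengths) {
        long bits = 0;
        for (Map.Entry<Character, Long> e : counts.entrySet()) {
            Integer length = codeLengths.get(e.getKey());
            if (length == null) {
                throw new IllegalArgumentException("no code for symbol '" + e.getKey() + "'");
            }
            bits += e.getValue() * length;
        }
        return bits;
    }

    /** Encoded size divided by the size at a fixed 8 bits per symbol. */
    static double compressionRatio(Map<Character, Long> counts, Map<Character, Integer> codeLengths) {
        long symbols = 0;
        for (long n : counts.values()) {
            symbols += n;
        }
        if (symbols == 0) {
            throw new IllegalArgumentException("the text is empty");
        }
        return (double) encodedBits(counts, codeLengths) / (8.0 * symbols);
    }

    public static void main(String[] args) {
        Map<Character, Long> counts = countSymbols("aaaaabbcd");
        // Code lengths of a Huffman code for the counts a=5, b=2, c=1, d=1
        Map<Character, Integer> lengths = new TreeMap<>();
        lengths.put('a', 1);
        lengths.put('b', 2);
        lengths.put('c', 3);
        lengths.put('d', 3);
        System.out.println(encodedBits(counts, lengths));      // 5*1 + 2*2 + 1*3 + 1*3 = 15
        System.out.println(compressionRatio(counts, lengths)); // 15 bits out of 9*8 = 72
    }
}
```

Only the code lengths are needed to compute the ratio, so the sketch takes a length table rather than the bit strings themselves.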

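Java implementation

Method 1:
  A binary tree stores the character codes: each character's code corresponds to one path through the tree, and traversing the paths from the root to every leaf yields the code of each character.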
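Because every code is a root-to-leaf path, no code in a correct table can equal another code or be a prefix of one; that property is what makes decoding unambiguous. A minimal sketch of such a check follows (the class and method names are hypothetical and not part of the implementation below).

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PrefixCheckSketch {

    /** Returns true when no code equals, or is a prefix of, another symbol's code. */
    static boolean isPrefixFree(Map<Character, String> table) {
        for (Map.Entry<Character, String> a : table.entrySet()) {
            for (Map.Entry<Character, String> b : table.entrySet()) {
                if (a.getKey().equals(b.getKey())) {
                    continue; // do not compare a symbol with itself
                }
                if (b.getValue().startsWith(a.getValue())) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<Character, String> good = new LinkedHashMap<>();
        good.put('a', "0");
        good.put('b', "10");
        good.put('c', "11");
        System.out.println(isPrefixFree(good)); // true

        Map<Character, String> bad = new LinkedHashMap<>();
        bad.put('j', "01111100");
        bad.put('k', "011111000");
        System.out.println(isPrefixFree(bad));  // false: j's code is a prefix of k's
    }
}
```

The second example uses two codes taken from the table printed in the results section, where b and y also share the code 001011. A table like that fails this check, which is consistent with the garbled words in the decoded text.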

Code

The OtherPersonMethodHuffman class

import java.io.IOException;
import java.util.*;
import static java.lang.System.out;

/**
 * Builds the Huffman tree and derives the code table from it.
 * @author Canlong
 */
public class OtherPersonMethodHuffman {

    public static void main(String[] args) {
        double[] tempProb;
        try {
            tempProb = Test2.countNum();
        } catch (IOException e) {
            // Without the letter frequencies there is nothing to encode
            throw new RuntimeException("Could not read the input file", e);
        }
        long[] probLong = new long[tempProb.length];
        // Scale each probability by 10^18 and round it to a long so the tree works with integers
        for (int i = 0; i < tempProb.length; i++) {
            probLong[i] = Math.round(tempProb[i] * 1000000000000000000L);
        }
        String[] strCodeAll = countCodeStr(probLong);
        // Map each letter to its binary code
        Map<String, String> codeTable = new HashMap<>();
        String fileStr = Test2.fileStr;
        // For a quick test of the Huffman coding:
        // fileStr = "afdjljaf-safsdf";
        for (int i = 0; i < strCodeAll.length; i++) {
            codeTable.put(String.valueOf((char) ('a' + i)), strCodeAll[i]);
        }
        // Encode only the 26 letters (case-insensitive); every other character is skipped
        StringBuilder encoded = new StringBuilder();
        for (int i = 0; i < fileStr.length(); i++) {
            String code = codeTable.get(String.valueOf(fileStr.charAt(i)).toLowerCase());
            if (code != null) {
                encoded.append(code);
            }
        }
        String codeFileStr = encoded.toString();
        out.println("Huffman-encoded text: " + codeFileStr);
        out.println("Length after Huffman encoding: " + codeFileStr.length());
        // Each '0' or '1' stands for one bit, so the bit count is the measured length, not a constant
        out.println("Number of bits after Huffman encoding: " + codeFileStr.length());
        String decodeStr = decodeStr(codeTable, codeFileStr);
        out.println("Text after Huffman decoding: " + decodeStr);
        out.println("Length after Huffman decoding: " + decodeStr.length());
        long originalBits = decodeStr.length() * 8L;
        out.println("Each character is stored as 1 byte (8 bits), so the text occupies " + decodeStr.length() + "*8=" + originalBits + " bits.");
        out.println("The Huffman compression ratio is therefore " + codeFileStr.length() + "/" + originalBits + "=" + codeFileStr.length() * 100.0 / originalBits + "%");
    }

    /**
     * Decoder for the Huffman code (currently handles only the 26 letters).
     * @param codeTable map from each letter to its code
     * @param codeFileStr the binary string to decode
     * @return the decoded string, i.e. the text before encoding
     */
    public static String decodeStr(Map<String,String> codeTable,String codeFileStr){
        StringBuilder decodeFileStr = new StringBuilder();
        // Index of the first bit that has not been decoded yet
        int pos = 0;
        while (pos < codeFileStr.length()) {
            boolean matched = false;
            // If the remaining bits start with a code from the table, append that code's letter
            for (char c = 'a'; c <= 'z'; c++) {
                String codeStr = codeTable.get(String.valueOf(c));
                if (codeStr == null || codeStr.isEmpty()) {
                    continue;
                }
                // Compare as many bits as the code is long
                if (codeFileStr.startsWith(codeStr, pos)) {
                    decodeFileStr.append(c);
                    // Skip the bits that were just decoded (this is the key step)
                    pos += codeStr.length();
                    matched = true;
                }
            }
            // Stop if no code matches, otherwise the loop would never end
            if (!matched) {
                break;
            }
        }
        return decodeFileStr.toString();
    }

    /**
     * Computes the code for each probability.
     * Limitation: letters are matched to tree paths by probability value, so letters
     * with equal probabilities receive the same code.
     * @param probInts the (scaled) probabilities
     * @return the code string of each letter
     */
    public static String[] countCodeStr(long[] probInts){
        String[] strCodes = new String[probInts.length];
        Node[] tempsnodes = new Node[probInts.length];
        for(int i=0;i<probInts.length;i++){
            tempsnodes[i] = new Node(probInts[i]);
        }
        List<Node> nodes = Arrays.asList(tempsnodes);
        Node node = OtherPersonMethodHuffman.build(nodes);
        PrintTree(node);
        // Find every root-to-leaf path
        FindShortestBTPath getPathTool = new FindShortestBTPath();
        ArrayList<ArrayList<Long>> pathArr = getPathTool.FindAllPath(node,1);
        HashMap<Long,Long> mapNode = getPathTool.mapNode;
        // Uncomment to print the paths and the branch bits
        // out.println(pathArr);
        // out.println(mapNode);
        for(ArrayList<Long> arr1 : pathArr){
            // Concatenate the branch bit of every node on the path
            StringBuilder tempStr = new StringBuilder();
            for(Long value : arr1){
                tempStr.append(mapNode.get(value));
            }
            for(int j=0;j<probInts.length;j++){
                // If the leaf that ends this path equals a letter's probability, that letter gets this path's bits
                if(probInts[j] == arr1.get(arr1.size()-1)){
                    strCodes[j]=tempStr.toString();
                }
            }
        }
        String[] realStrCodes = new String[strCodes.length];
        // Print the codes; the first bit belongs to the root and is dropped
        out.println();
        for(int i=0;i<strCodes.length;i++){
            realStrCodes[i]=strCodes[i].substring(1);
            out.println("Code for letter "+(char)(i+'a')+": "+realStrCodes[i]);
        }
        return realStrCodes;
    }

    /**
     * Builds the Huffman tree.
     * @param nodes the leaf nodes
     * @return the root of the resulting tree
     */
    private static Node build(List<Node> nodes) {
        nodes = new ArrayList<Node>(nodes);
        sortList(nodes);
        while (nodes.size() > 1) {
            createAndReplace(nodes);
        }
        return nodes.get(0);
    }

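    /**
     * Combines the two nodes with the smallest weights and replaces them in the list with their parent.
     * @param nodes the node list, sorted in ascending order
     */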
    private static void createAndReplace(List<Node> nodes) {
        Node left = nodes.get(0);
        Node right = nodes.get(1);
        Node parent = new Node(left.getValue() + right.getValue());
        parent.setLeftChild(left);
        parent.setRightChild(right);
        nodes.remove(0);
        nodes.remove(0);
        nodes.add(parent);
        sortList(nodes);
    }

    /**
     * Sorts the node list in ascending order of value (smallest first).
     */
    private static void sortList(List<Node> nodes) {
        Collections.sort(nodes);
    }

    /**
     * Prints the node values in pre-order (node, left, right) with no separator between them.
     * @param node the root of the subtree to print
     */
    public static void PrintTree(Node node) {
        Node left = null;
        Node right = null;
        if(node!=null) {
            out.print(node.getValue());
            left = node.getLeftChild();
            right = node.getRightChild();
            //out.println("("+(left!=null?left.getValue():" ") +","+ (right!= null?right.getValue():" ")+")");

        }
        if(left!=null){ PrintTree(left); }
        if(right!=null){ PrintTree(right); }
    }
}

/**
 * A binary tree node.
 * @author Canlong
 */
class Node implements Comparable<Node> {
    private long value;
    private Node leftChild;
    private Node rightChild;
    public Node(long value) {
        this.value = value;
    }
    public long getValue() {
        return value;
    }
    public void setValue(long value) {
        this.value = value;
    }
    public Node getLeftChild() {
        return leftChild;
    }
    public void setLeftChild(Node leftChild) {
        this.leftChild = leftChild;
    }
    public Node getRightChild() {
        return rightChild;
    }
    public void setRightChild(Node rightChild) {
        this.rightChild = rightChild;
    }
    @Override
    public int compareTo(Node that) {
        // Order nodes by value, smallest first
        return Long.compare(this.value, that.value);
    }
}

/**
 * Collects every path from the root to a leaf of the binary tree
 * (this class is adapted from someone else's code).
 * @author Canlong
 */
class FindShortestBTPath {

    // Maps a node's value to its branch bit: 1 for a left child, 0 for a right child.
    // The key is the value, so nodes with equal values overwrite each other's bit.
    HashMap<Long,Long> mapNode = new HashMap<>();
    // All root-to-leaf paths found so far
    private ArrayList<ArrayList<Long>> allPaths = new ArrayList<ArrayList<Long>>();
    // The path currently being walked
    private ArrayList<Long> onePath = new ArrayList<Long>();

    // Returns all paths; tempFlag is the branch bit of root (1 = left child, 0 = right child)
    public ArrayList<ArrayList<Long>> FindAllPath(Node root,long tempFlag) {
        if(root == null){
            return allPaths;
        }
        // Add the current node to the path
        onePath.add(root.getValue());
        // Record the branch bit under this node's value
        mapNode.put(root.getValue(),tempFlag);

        // At a leaf, store a copy of onePath in allPaths
        if(root.getLeftChild() == null && root.getRightChild() == null){
            allPaths.add(new ArrayList<Long>(onePath));
        }
        FindAllPath(root.getLeftChild(),1);
        FindAllPath(root.getRightChild(),0);
        // Backtrack: every node, left or right child, is removed from the path once its subtree is done.
        // Drawing the recursion tree helps to see why this step is essential.
        onePath.remove(onePath.size() - 1);
        return allPaths;
    }
}

The Test2 class (implemented in an earlier post; for details see the earlier post "An Information Security Experiment: Building a Source Model")

import java.io.*;
import static java.lang.System.out;

/**
 * Builds the source model.
 * @author Canlong
 */
public class Test2 {

    static String fileStr = "";
    public static void main(String[] args){
        // Part 1
        array5Col();
        // Part 2
        try {
            countNum();
        } catch (IOException e){
            throw new RuntimeException("Could not read the file", e);
        }
    }

    /**
     * 1. Randomly generates a 1x5 array whose entries are drawn according to the source probabilities.
     */
    public static void array5Col(){
        // 1 has probability 0.2, 2 has probability 0.3, 3 has probability 0.5
        int[] array = {1,1,2,2,2,3,3,3,3,3};
        // The 1x5 array
        int[][] array5 = new int[1][5];
        out.println("Source probabilities: P(1)=0.2, P(2)=0.3, P(3)=0.5. Generated 1x5 array:");
        for(int i=0;i<array5[0].length;i++) {
            // Pick one of the ten slots uniformly at random
            int randomNum = (int) Math.floor(Math.random() * 10);
            array5[0][i]=array[randomNum];
            out.print(array5[0][i]+",");
        }
        // End the line
        out.println();
    }

    /**
     * 2. Counts the frequency of the 26 letters in the file and computes the entropy.
     * The raw text is also stored in fileStr.
     * @return the relative frequency of each letter, index 0 = 'a' ... index 25 = 'z'
     * @throws IOException if the file cannot be found or read
     */
    public static double[] countNum() throws IOException {
        // File path
        String strPath  = "C:/Users/hasee/Desktop/Types of Speech.txt";
        // Total number of occurrences of the 26 letters
        double sumAllNum = 0;
        // Relative frequencies
        double[] frequency = new double[26];
        // Entropy of the model
        double infoEntr = 0.0;

        // Read the file; readLine() drops the line terminators
        StringBuilder textStrBuilder = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(strPath),"UTF-8"))) {
            String line;
            while((line=reader.readLine())!= null){
                textStrBuilder.append(line);
            }
        }
        String textStr = textStrBuilder.toString();
        out.println("String to analyze:\r\n"+textStr);
        fileStr=textStr;
        textStr = textStr.toLowerCase();
        char[] textChar = textStr.toCharArray();
        // counts[i] is the number of occurrences of the letter (char) ('a' + i)
        int[] counts = new int[26];
        for(int i=0;i<textChar.length;i++){
            // Use (letter - 'a') as the array index and increment that element directly
            if(textChar[i] >= 'a' && textChar[i]<='z'){
                counts[textChar[i]-'a']++;
            }
        }
        // Sum the counts of all 26 letters
        for(int i=0;i<counts.length;i++){
            sumAllNum += counts[i];
        }
        out.println("Total count: "+sumAllNum);
        // Compute the frequencies
        for(int i=0;i<counts.length;i++) {
            frequency[i] = counts[i] / sumAllNum;
            out.println("Letter: " + (char) ('a' + i) + ", count: " + counts[i] + ", frequency: " + frequency[i]);

            if (frequency[i] != 0) {
                // Entropy = sum over all letters of p * log2(1/p)
                infoEntr -= frequency[i] * (Math.log(frequency[i]) / Math.log(2));
            }
        }
        out.println("Entropy: "+infoEntr);
        return frequency;
    }
}

The Types of Speech.txt file used

Standard usage includes those words and expressions understood, used, and accepted by a majority of the speakers of a language in any situation regardless of the level of formality. As such, these words and expressions are well defined and listed in standard dictionaries. Colloquialisms, on the other hand, are familiar words and idioms that are understood by almost all speakers of a language and used in informal speech or writing, but not considered appropriate for more formal situations. Almost all idiomatic expressions are colloquial language. Slang, however, refers to words and expressions understood by a large number of speakers but not accepted as good, formal usage by the majority. Colloquial expressions and even slang may be found in standard dictionaries but will be so identified. Both colloquial usage and slang are more common in speech than in writing.Colloquial speech often passes into standard speech. Some slang also passes into standard speech, but other slang expressions enjoy momentary popularity followed by obscurity. In some cases, the majority never accepts certain slang phrases but nevertheless retains them in their collective memories. Every generation seems to require its own set of words to describe familiar objects and events. It has been pointed out by a number of linguists that three cultural conditions are necessary for the creation of a large body of slang expressions. First, the introduction and acceptance of new objects and situations in the society; second, a diverse population with a large number of subgroups; third, association among the subgroups and the majority population.Finally, it is worth noting that the terms 'standard' 'colloquial' and 'slang' exist only as abstract labels for scholars who study language. Only a tiny number of the speakers of any language will be aware that they are using colloquial or slang expressions. 
Most speakers of English will, during appropriate situations, select and use all three types of expressions. 

Results
String to analyze:
(the program echoes the full text of Types of Speech.txt, identical to the file contents shown above)
Total count: 1651.0
Letter: a, count: 152, frequency: 0.09206541490006057
Letter: b, count: 29, frequency: 0.01756511205330103
Letter: c, count: 47, frequency: 0.028467595396729255
Letter: d, count: 69, frequency: 0.041792852816474865
Letter: e, count: 181, frequency: 0.1096305269533616
Letter: f, count: 33, frequency: 0.019987886129618413
Letter: g, count: 38, frequency: 0.02301635372501514
Letter: h, count: 44, frequency: 0.026650514839491216
Letter: i, count: 111, frequency: 0.06723198061780739
Letter: j, count: 7, frequency: 0.004239854633555421
Letter: k, count: 5, frequency: 0.0030284675953967293
Letter: l, count: 86, frequency: 0.05208964264082374
Letter: m, count: 35, frequency: 0.021199273167777106
Letter: n, count: 121, frequency: 0.07328891580860085
Letter: o, count: 139, frequency: 0.08419139915202907
Letter: p, count: 42, frequency: 0.025439127801332527
Letter: q, count: 8, frequency: 0.004845548152634767
Letter: r, count: 107, frequency: 0.06480920654149
Letter: s, count: 154, frequency: 0.09327680193821926
Letter: t, count: 122, frequency: 0.07389460932768019
Letter: u, count: 54, frequency: 0.03270745003028468
Letter: v, count: 9, frequency: 0.005451241671714113
Letter: w, count: 19, frequency: 0.01150817686250757
Letter: x, count: 10, frequency: 0.0060569351907934586
Letter: y, count: 29, frequency: 0.01756511205330103
Letter: z, count: 0, frequency: 0.0
Entropy: 4.1696890030146925
99999999999999997640218049666868563218534221683827982492065414900060560932768019382192642168382798304058081072077528770442085208964264082373655118110236220472266505148394912162846759539672925610963052695336160059781950333131434426892792247122955012840702604482131863597819503331318308903694730466421332525741974561060569351907934597268322228952151302846759539673003028467595396730423985463355542117565112053301032327074500302846766480920654149000014052089642640823267231980617807384732889158086008483288915808600847941532404603270745007389460932768019279345850999394308375529981829194441756511205330103219987886129618412417928528164748641756511205330102948419139915202907291459721380981222430042398546335542119927316777710421804966686856450102967898243488804845548152634767545124167171411311508176862507570484554815263476682301635372501514025439127801332528
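Code for letter a: 111
Code for letter b: 001011
Code for letter c: 10100
Code for letter d: 00100
Code for letter e: 100
Code for letter f: 001010
Code for letter g: 000001
Code for letter h: 10101
Code for letter i: 0101
Code for letter j: 01111100
Code for letter k: 011111000
Code for letter l: 1011
Code for letter m: 000011
Code for letter n: 0100
Code for letter o: 0001
Code for letter p: 000000
Code for letter q: 00001011
Code for letter r: 0110
Code for letter s: 110
Code for letter t: 0011
Code for letter u: 01110
Code for letter v: 00001010
Code for letter w: 0000100
Code for letter x: 0111111
Code for letter y: 001011
Code for letter z: 011111001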
huffman 编码后为:1100011111010000100111011000100011101101110000011000101010010100101101110001001001100011101010001110100000010000010110001001101110100001001000111111000000011010011011001010001010011001110010000100100011011000110001000100100011101101000010011101000010011110100101001000000000011100001000010110010111110000111110111110000010110010100110010110001001010001110101100110000000100111011111000100011011000010010101111011111010000000101110111000001100010101001110100001011110010100110111011100110101000101000110100000001111011000100101110011011000010010100011101011001011100000010101001011000100101000101000010110000011111101101010011001011111110110011101010010101001110101100110100000010000010110001001101110100001001000111111000000011010011011001010001010011011101101000000100100101110110010010000101001010100100001001110100001001011010111000111000010001010100110001111101000010011101100010000100010110100001101010001010011101100101100110101000001101110110001000010110111001011111011010111000001111000010100001110101100000100111010110001101010111101000010011101101000010101110000110101101101011110110000010000010110001001101110100001000101001000101000100001111000111010111100111110110100011100100001001000110110001100010001001000010110010111111011000011000111000111111011101111000000010011101111100010001101100001001010111101111101000000010111011100000110011101000010001110110100001000101010001010100001010000101100000111111011110000000100100101001010100010110000010001100101001101010100000001001011011100011010000010011101000001010011001010010010001101000010011100000000000001100001000000011001011110011100001010000101100000110001011010000101000010110000011111101111001010011011101110011010100010100110111101100001100011100011111101110110101001000101000100001111100110101101001000111111000000011010011011001010001010011011101101001010000011011101100010000101101110010111110111011111010000000101110111000001100110101111101000000011010100010000100100000010101000110011010000101010001101100011000
10000100000101100010011011101000010010001111110000000110100110110010100010100110011100100001001000110110001100010001001000010110010111111011111011000000110001000111000001100101110001100001001010110000000100111011111000100011011000101101110001101000001001111110100101001000000000011100001001111100000010001000100100001010000101100000111111011011101101110000011000010110010110011101011000000111110111110000010110010100110010111010000011011101100010000101101110010111110111000111111000000011010011011001010001010011011101000010010000001010100010011010111110100000001000011111001011001011100001010000101110010000100010101001100011111010000100111011000100001000101101000011010100010100111011001011001100010110111000110000100010110111011001011100110000101010010010001000011010100101001011000010000101100010011101011010000011011101100010000101101110010111110110111011011100000110011101000010011010111110100000001111011010000001100010110100101000001000011000011000101000101010011000000010010010100101010011101011110100010101000000100011001010011010101000000011010000011011101100010000101101110010111110111100000001001001010010101000100101000111000100000000111110110100110010101000011000111000111110100001001110110001001100000001001001010010101110000100001110011010111110100000001111101111000010000001111101101001100101010000110001110001111101000010011101100010011000000010010010100101010010110111000110001001110101100011011010111110100000001100011111100000001101001101100101000101001101000100011111000001001011000011000100001110001000011111011000101100000000010000000111010111110110010100110010110010100001101110110001000010010000100001011001011000100101111010100011100110010100110010110101010011000010000111001010011111010011000111010110000001111101111100000101100101001100101101001000000101010001101111010010100100000000001111010100100011000111110101010011010111110100000001000000101010110111110100110001011011100011010010000001010100011000111010110010111001101100110100001111101010100110001110101100000011
01010100001110101100010101101010000011011101110010100001101010000101010000001110000001100010110010110011010000001010100011000101100000110001001000110111001101010001010011010010000001111000110001011010000001011011100101011010001010011110000100001000100110100001100010010100000100000101100010011000110001001001001101010001100101001011100001010111000011010110110101111011000010010110111110010010100001111011101000010010000001010100010000111100101001110101111110001011100100010000000000010101010000111000010000010111000110010110010111110100011100000110010111000110000100101010110101010000000101110010111000111100011101011110011001110101011010010010100011101011001101110011011110111010000010100001000101001101010001010011011101101000100100101001001101101110110001011001010000101100011101011001010001101001110011010100010100000100101011110111110110000001100001011000100100001011000100101011010111110100000001100011111100000001101001101100101000101001100010100101011011000110011101011000101010000110110000100100011101010000110101000101001110100001001111010010100100000000001111101001010010000010010100100100000010000010010110111110010010100001111011101000010011001010011011101110011010100010100110010101000011101011001100001101000101100001100101111010010100000101000010011100100010100001010100011011010000000000010000000111010111110011010100010100000010001010011101011111011111011000000110001000111000001100101110001100001001010110011100010110000010110000101110000000110001110101010101100010011111011000011010001011110011010100010100111000011000101000000010011101011001100111000101100000101100001011100000001101110100001000011101011000000111110111110000010110010100110010110000000001000000011101011111001101010001010000101001010100111101110110010110101001101011100000100000101100011101010100000100110101010000000100111010111100110011101011000011100011000001111011000111110100001001110110001001010000011011101100010000101101110010111110111110100001001101011111010000000110001111110101110001100010100101100101
1111110111001011110001101101111010000111011111001011100101111000101000010110110101001010100011011111011011000001001010100011100011011100010000101110111110100000001011101110000011000001010010110010111110011010101000010110100011100000110010111000110000100101000111010110011000000010011101111100010001101100001001010111010000101110111110100000001011101110000011000000100010110111011001011100111000010011101101000011101011110011001110101100001011111011010001110110010101000000011010000011011101100010000101101110010111110110001011011010111110100000001100011111100000001101001101100101000101001100000110001110001111000000010011101111100010001101100001001010100010000000110110101110101010000100010110111011001000111001100101010000000111100000000000001100001000000011001011110011100110010100110111011100110101000101001101101001011100101000011111010000100011101101001111011101100111010101101001000011001011000000100110000100101010001111110000000110100110110010100010100110
Length after Huffman encoding: 6951
Number of bits after Huffman encoding: 6851
huffman译码后的原文:standardusageincludesthosewordsandexpressionsunderstoodusedandacceptedybamajorityofthespeajnrsofalanguageinanysituationregardlessofthelevelofformalityassuchthesewordsandexpressionsarewelldefinedandlistedinstandarddictionariescolloquialismsontheotherhandarefamiliarwordsandidiomsthatareunderstoodybalmostallspeajnrsofalanguageandusedininformalspeechorwritingyutnotconsideredappropriateformoreformalsituationsalmostallidiomaticexpressionsarecolloquiallanguageslanghoweverreferstowordsandexpressionsunderstoodybalargenumyerofspeajnrsyutnotacceptedasgoodformalusageybthemajoritycolloquialexpressionsandevenslangmabyefoundinstandarddictionariesyutwillyesoidentifiedyothcolloquialusageandslangaremorecommoninspeechthaninwritingcolloquialspeechoftenpassesintostandardspeechsomeslangalsopassesintostandardspeechyutotherslangexpressionsenjoymomentarypopularityfollowedyboyscurityinsomecasesthemajorityneveracceptscertainslangphrasesyutneverthelessretainsthemintheircollectivememorieseverygenerationseemstorequireitsownsetofwordstodescriyefamiliaroyzfoauedeventsithasyeenpointedoutybanumyeroflinguiststhatthreeculturalconditionsarenecessaryforthecreationofalargeyodyofslangexpressionsfirsttheintroductionandacceptanceofnewoyzfoauedsituationsinthesocietysecondadiversepopulationwithalargenumyerofsuygroupsthirdassociationamongthesuygroupsandthemajoritypopulationfinallyitisworthnotingthatthetermsstandardcolloquialandslangexistonlyasabstractlabelsforscholarswhostudylanguageonlyatinynumyerofthespeajnrsofanylanguagewillyeawarethattheyareusingcolloquialorslangexpressionsmostspeajnrsofenglishwillduringappropriatesituationsselectanduseallthreetypesofexpressions
Length after Huffman decoding: 1649
Each character is stored as 1 byte (8 bits), so the text occupies 1649*8=13192 bits.
The Huffman compression ratio is therefore 6851/13192=51.932989690721655%

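Method 2:
  I wanted to try an approach that does not use a binary tree, but I ran out of time and have not written it yet. I will come back and finish it when I have time.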

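Summary

  I found this exercise fairly difficult: it requires a good grasp of binary trees and of Huffman coding, especially of how each character is turned into a string of binary digits during encoding. Although I got it working, many shortcomings remain. For example, there may be problems when two characters have the same frequency, and only the 26 letters are encoded; other characters are not yet converted to binary strings. I have also not considered time or space complexity yet, so there is room for optimization. When decoding, at each position in the encoded string I loop over the code table: I take a letter's code, get its length, cut a substring of that length from the string being decoded, and compare the two. This is probably inefficient, but I have not yet thought of a better way.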
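One possible alternative to the table scan described above, given only as a sketch: if the code table is prefix-free, the decoder can read one bit at a time and look the accumulated bits up in an inverted map (code to letter). The names `PrefixDecoderSketch` and `decode` are hypothetical, and the sketch assumes a table in which no two letters share a code.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

class PrefixDecoderSketch {

    /** Decodes a string of '0'/'1' characters with a prefix-free code table. */
    static String decode(Map<Character, String> codeTable, String bits) {
        // Invert the table: code -> symbol
        Map<String, Character> inverse = new HashMap<>();
        for (Map.Entry<Character, String> e : codeTable.entrySet()) {
            if (inverse.put(e.getValue(), e.getKey()) != null) {
                throw new IllegalArgumentException("duplicate code " + e.getValue());
            }
        }
        StringBuilder decoded = new StringBuilder();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < bits.length(); i++) {
            current.append(bits.charAt(i));
            Character symbol = inverse.get(current.toString());
            if (symbol != null) {
                // In a prefix-free table the first complete match is the only possible one
                decoded.append(symbol.charValue());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            throw new IllegalArgumentException("trailing bits do not form a complete code");
        }
        return decoded.toString();
    }

    public static void main(String[] args) {
        Map<Character, String> table = new LinkedHashMap<>();
        table.put('a', "1");
        table.put('b', "01");
        table.put('c', "001");
        table.put('d', "000");
        System.out.println(decode(table, "1010010001")); // prints "abcda"
    }
}
```

Each bit is visited once and every lookup handles at most as many characters as the longest code, so the work grows roughly with the number of bits times the maximum code length instead of requiring a pass over all 26 letters at every position.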
