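Text Compression and Decompression with a Huffman Tree

Goals

This experiment requires the following:

  1. Read a txt file and count the occurrences of each character in it (letters, punctuation, spaces, line breaks, etc.);
  2. Build a Huffman tree from these counts, Huffman-encode the text, output the code table and the encoded dat file so that the text is compressed, and compute the compression ratio;
  3. Using the Huffman tree and the compressed file, decode the file back into the original txt text, restoring all symbols and line breaks as well.

Code link: Huffman Tree

Approach

Character counting

The input text used in this experiment is as follows: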

There's something down there. It's Gollum.
Gollum?
He's been following us for three days.
He escaped the dungeons of Baraddur?
Escaped,or was set loose.Now the Ring has brought him here.He will never be rid of his need for it.He hates and loves the Ring, as he hates and loves himself.Smeagol's life is a sad story.Yes, Smeagol he was once called. Before the Ring found him.Before it drove him mad.
It's a pity Bilbo didn't kill him when he had the chance.
Pity?It is pity that stayed Bilbo's hand.Many that live deserve death. Some that die deserve life.Can you give it to them, Frodo?Do not be too eager to deal out death and judgment.Even the very wise can not see all ends.My heart tells me that Gollum has some part to play yet, for good or ill,before this is over.The pity of Bilbo may rule the fate of many.
I wish the Ring had never come to me.I wish none of this had happened.
So do all who live to see such times. But that is not for them to decide.All we have to decide is what to do with the time that is given to us.There are other forces at work in this world, Frodo, besides the will of evil.Bilbo was meant to find the Ring.In which case, you also were meant to have it.

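This text contains uppercase and lowercase English letters, line breaks, spaces, and punctuation. Reading the text and counting characters is relatively simple. An array of length 58 records the frequency of each character: indices 0 to 25 for lowercase letters, 26 for space, 27 for newline, 28 to 31 for comma, period, question mark and apostrophe, and 32 to 57 for uppercase letters. This is done in the main function: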

	ifstream infile("inputfile1.txt");
	infile >> noskipws;  // do not skip whitespace when reading
	char a;
	double leng = 0;     // total number of characters read
	// get(a) fails at end of file, so no bogus character is counted
	while (infile.get(a))
	{
		leng++;
		if (a >= 'a' && a <= 'z')       // lowercase letters -> 0..25
			tim[a - 'a']++;
		else if (a == ' ')
			tim[26]++;
		else if (a == '\n')             // newline (ASCII 10)
			tim[27]++;
		else if (a == ',')
			tim[28]++;
		else if (a == '.')
			tim[29]++;
		else if (a == '?')
			tim[30]++;
		else if (a == '\'')             // apostrophe (ASCII 39)
			tim[31]++;
		else if (a >= 'A' && a <= 'Z')  // uppercase letters -> 32..57
			tim[a - 33]++;
	}
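The same character-to-index mapping is repeated later when the code table is filled and when the text is compressed. As a minimal sketch (not part of the original code), the mapping could be factored into one helper. The names `charIndex` and `countChars` are hypothetical, and ASCII encoding is assumed:

```cpp
#include <array>
#include <cassert>
#include <string>

// Hypothetical helper: maps a character to its slot in the 58-entry table,
// or returns -1 if the character is not tracked. Assumes ASCII.
int charIndex(char c)
{
    if (c >= 'a' && c <= 'z') return c - 'a';        // 0..25
    if (c == ' ')  return 26;
    if (c == '\n') return 27;
    if (c == ',')  return 28;
    if (c == '.')  return 29;
    if (c == '?')  return 30;
    if (c == '\'') return 31;
    if (c >= 'A' && c <= 'Z') return c - 'A' + 32;   // 32..57
    return -1;
}

// Counts tracked characters in text; untracked ones (e.g. '\r') are skipped.
std::array<int, 58> countChars(const std::string& text)
{
    std::array<int, 58> tim{};   // value-initialized to all zeros
    for (char c : text)
    {
        int i = charIndex(c);
        assert(i < 58);
        if (i >= 0) ++tim[i];
    }
    return tim;
}
```

With such a helper, `countChars("It's Gollum.\n")` would report two occurrences of `l` and one each of `I`, the apostrophe and the newline.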

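Huffman tree

A Huffman tree is a binary tree used for data compression; it yields an optimal prefix code. In a Huffman tree, code length decreases as frequency increases: the more often a character occurs, the shorter its code, and the less often it occurs, the longer its code. Encoding this way reduces the total number of bits needed to store the data, which is what achieves compression.

The tree is built with the Huffman algorithm. The basic idea is to start with a set of single-node binary trees, one per symbol, each weighted by the symbol's frequency. The two trees with the lowest weights are merged into a new tree whose weight is the sum of the two, and the new tree is put back into the set. This is repeated until only one tree remains, and that tree is the Huffman tree.

In a Huffman tree, a character's code is the path from the root to its leaf: each step to a left child appends a 0, and each step to a right child appends a 1. Because every character sits at a leaf, no character's code is a prefix of another character's code, which guarantees that the original data can be recovered unambiguously during decompression.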
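The prefix property can be checked directly on a code table. The following is a small sketch, with `isPrefixFree` being a hypothetical helper name that does not appear in the original code:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns true if no code in codes is a prefix of a different code.
// Duplicate codes and empty codes are treated as violations.
bool isPrefixFree(const std::vector<std::string>& codes)
{
    for (std::size_t i = 0; i < codes.size(); ++i)
    {
        for (std::size_t j = 0; j < codes.size(); ++j)
        {
            if (i == j) continue;
            // codes[i] is a prefix of codes[j] if codes[j] starts with it
            if (codes[j].compare(0, codes[i].size(), codes[i]) == 0)
                return false;
        }
    }
    return true;
}
```

For example, the table {0, 10, 110, 111} passes the check, while {0, 01, 11} fails because 0 is a prefix of 01, so the bit string 011 could not be decoded unambiguously.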

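After the previous step has produced the count of each character, a HuffmanNode is created for each character and the Huffman tree is built. Here I used a vector container, vector<HuffmanNode*>& letter, to build the tree. The STL priority_queue could be used instead and would perform better, since it avoids re-sorting the whole vector after every merge; I plan to make this improvement later.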

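// Assumes letter is non-empty and already sorted so that the node with the
// smallest val is at the back (i.e. myCmp orders nodes by descending val).
HuffmanTree::HuffmanTree(vector<HuffmanNode*>& letter, int n)
{
	HuffmanNode* tmp1, * tmp2, * tmp3;
	while (letter.size() > 1)
	{
		// take the two nodes with the lowest weights
		tmp1 = letter.back();
		letter.pop_back();
		tmp2 = letter.back();
		letter.pop_back();
		// merge them under a new internal node ('0' is a placeholder character)
		tmp3 = new HuffmanNode(tmp1->val + tmp2->val, '0', NULL, tmp1, tmp2);
		tmp1->parent = tmp3;
		tmp2->parent = tmp3;
		letter.push_back(tmp3);
		sort(letter.begin(), letter.end(), myCmp);  // a priority_queue would avoid
		                                            // re-sorting on every iteration
	}
	root = letter.empty() ? NULL : letter.back();
}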
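The priority_queue variant mentioned above might look like the following sketch. It is not the author's code: `Node`, `ByWeight`, `buildTree` and `weightedPathLength` are hypothetical names, and `Node` is a simplified stand-in for HuffmanNode without the parent pointer.

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Simplified stand-in for the article's HuffmanNode (an assumption).
struct Node
{
    int val;       // frequency (weight)
    char a;        // character; meaningful only for leaves
    Node* left;
    Node* right;
};

// Comparator that keeps the node with the smallest weight on top.
struct ByWeight
{
    bool operator()(const Node* x, const Node* y) const { return x->val > y->val; }
};

// Builds a Huffman tree from leaf nodes and returns its root (nullptr if empty).
// Nodes are never freed here, to keep the sketch short.
Node* buildTree(const std::vector<Node*>& leaves)
{
    std::priority_queue<Node*, std::vector<Node*>, ByWeight> pq(leaves.begin(), leaves.end());
    if (pq.empty()) return nullptr;
    while (pq.size() > 1)
    {
        Node* x = pq.top(); pq.pop();   // lowest weight
        Node* y = pq.top(); pq.pop();   // second lowest weight
        pq.push(new Node{x->val + y->val, '\0', x, y});
    }
    return pq.top();
}

// Weighted path length: sum of frequency * depth over all leaves,
// which equals the total number of bits in the encoded text.
int weightedPathLength(const Node* n, int depth = 0)
{
    if (!n) return 0;
    if (!n->left && !n->right) return n->val * depth;
    return weightedPathLength(n->left, depth + 1)
         + weightedPathLength(n->right, depth + 1);
}
```

Each merge then costs O(log n) instead of a full sort. With only 58 symbols the difference is negligible in practice, but the code also no longer depends on the input vector being pre-sorted.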

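Text encoding

Once the tree is built, this part is not complicated either: each character's code is the path from the root to its leaf. The code below walks from each leaf up to the root, collects the bits, and then reverses them: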

	string ch[58];  // code of each character, indexed the same way as tim
	out << "char" << "   " << "count" << "   " << "code" << endl;
	while (letter.size() != 0)   // here letter holds the leaf nodes
	{
		HuffmanNode* tmp, * p, * tmp1;
		string code;
		tmp = letter.back();
		tmp1 = tmp;   // remember the leaf; tmp moves up the tree below
		letter.pop_back();
		out << tmp->a << "         " << tmp->val << "       ";
		p = tmp->parent;
		while (p != NULL)   // walk from the leaf up to the root
		{
			if (tmp == p->leftChild)
				code += "0";
			else
				code += "1";
			tmp = p;
			p = tmp->parent;
		}
		// bits were collected leaf-to-root, so reverse them to get the code
		string code1(code.rbegin(), code.rend());
		out << code1 << endl;
		if ((int)tmp1->a > 96 && (int)tmp1->a < 123)      // lowercase letters
			ch[(int)tmp1->a - 97] = code1;
		else if (tmp1->a == ' ')
			ch[26] = code1;
		else if (tmp1->a == char(10))                     // newline
			ch[27] = code1;
		else if (tmp1->a == ',')
			ch[28] = code1;
		else if (tmp1->a == '.')
			ch[29] = code1;
		else if (tmp1->a == '?')
			ch[30] = code1;
		else if ((int)tmp1->a == 39)                      // apostrophe
			ch[31] = code1;
		else if ((int)tmp1->a > 64 && (int)tmp1->a < 91)  // uppercase letters
			ch[(int)tmp1->a - 33] = code1;
	}

The result:
[Figure: the generated code table (character, count, code)]

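Compressing the text

With the work above done, this part is also simple: read one character at a time, look up its code in the ch array, and write that code to the compressed file.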

	// get(a) does not skip whitespace and fails cleanly at end of file
	while (in.get(a))
	{
		if (a >= 'a' && a <= 'z')
			out01 << ch[a - 'a'];
		else if (a == ' ')
			out01 << ch[26];
		else if (a == '\n')
			out01 << ch[27];
		else if (a == ',')
			out01 << ch[28];
		else if (a == '.')
			out01 << ch[29];
		else if (a == '?')
			out01 << ch[30];
		else if (a == '\'')
			out01 << ch[31];
		else if (a >= 'A' && a <= 'Z')   // upper bound keeps the index inside ch[58]
			out01 << ch[a - 33];
		// any other character (e.g. '\r', ASCII 13) has no code and is skipped
	}

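Decompressing the file to recover the original txt text

This part reads the 0/1 sequence in order and walks down the Huffman tree; whenever a leaf is reached, its character is output and the walk restarts from the root.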

	char a;
	HuffmanNode* tmp = tree->root;
	while (dat.get(a))
	{
		if (a != '0' && a != '1')
			continue;   // ignore anything that is not a code bit
		tmp = (a == '1') ? tmp->rightChild : tmp->leftChild;
		if (tmp->leftChild == NULL && tmp->rightChild == NULL)
		{
			out << tmp->a;    // member a of HuffmanNode is the character it represents
			cout << tmp->a;
			tmp = tree->root; // restart from the root for the next character
		}
	}
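If only the code table is available at decode time, rather than the in-memory tree, an equivalent decoding tree can be rebuilt from the table. The sketch below illustrates this with a round trip; `TrieNode`, `buildTrie`, `encode` and `decode` are hypothetical names, and the table is assumed to be prefix-free and to contain only '0' and '1'.

```cpp
#include <cassert>
#include <map>
#include <string>

// Minimal decoding tree node (hypothetical; the article uses HuffmanNode).
struct TrieNode
{
    TrieNode* child[2] = {nullptr, nullptr};
    bool isLeaf = false;
    char symbol = '\0';
};

// Rebuilds a decoding tree from a character-to-code table.
// Nodes are never freed in this sketch.
TrieNode* buildTrie(const std::map<char, std::string>& table)
{
    TrieNode* root = new TrieNode;
    for (const auto& entry : table)
    {
        TrieNode* cur = root;
        for (char bit : entry.second)
        {
            int b = bit - '0';
            assert(b == 0 || b == 1);
            if (!cur->child[b]) cur->child[b] = new TrieNode;
            cur = cur->child[b];
        }
        cur->isLeaf = true;
        cur->symbol = entry.first;
    }
    return root;
}

// Concatenates the code of every character; throws if a character has no code.
std::string encode(const std::string& text, const std::map<char, std::string>& table)
{
    std::string bits;
    for (char c : text) bits += table.at(c);
    return bits;
}

// Walks the tree bit by bit, emitting a character at every leaf.
std::string decode(const std::string& bits, const TrieNode* root)
{
    std::string text;
    const TrieNode* cur = root;
    for (char bit : bits)
    {
        if (bit != '0' && bit != '1') continue;   // ignore stray characters
        cur = cur->child[bit - '0'];
        if (!cur) break;                          // bit sequence not valid for this table
        if (cur->isLeaf)
        {
            text += cur->symbol;
            cur = root;
        }
    }
    return text;
}
```

With the toy table a = 0, b = 10, newline = 11, the text "ab\na" encodes to 010110 and decodes back to the same four characters, including the line break.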

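Computing the compression ratio

When writing the compressed file, I did not pack the 0/1 sequence into bits; each '0' or '1' is written to the dat file as a whole character, that is, one byte per bit. A real compressor would combine every eight bits into one byte before writing. The compression ratio is therefore computed from the length of the dat file divided by 8. The result:
[Figure: program output showing the computed compression ratio]
The compression ratio is about 55%.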
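The bit packing described above, which the original program does not perform, could be sketched as follows. `packBits` and `compressionRatio` are hypothetical names, and the ratio is taken here as compressed size divided by original size, which is an assumption about how the figure above was computed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Packs a string of '0'/'1' characters into bytes, most significant bit first.
// The last byte is padded with zero bits, so the bit count must be stored
// separately for the decoder to know where to stop.
std::vector<unsigned char> packBits(const std::string& bits)
{
    std::vector<unsigned char> bytes((bits.size() + 7) / 8, 0);
    for (std::size_t i = 0; i < bits.size(); ++i)
    {
        assert(bits[i] == '0' || bits[i] == '1');
        if (bits[i] == '1')
            bytes[i / 8] |= static_cast<unsigned char>(0x80u >> (i % 8));
    }
    return bytes;
}

// Compressed size in bytes (rounded up) divided by the original size in bytes.
double compressionRatio(std::size_t originalBytes, std::size_t bitCount)
{
    assert(originalBytes > 0);
    return static_cast<double>((bitCount + 7) / 8) / static_cast<double>(originalBytes);
}
```

For instance, the ten bits 1000000111 pack into the two bytes 0x81 and 0xC0. This ratio covers only the encoded text; a self-contained compressed file would also need to store the code table or the character counts, which adds some overhead.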
