File Compression and Decompression Based on Huffman Trees

WA的一声哭出来 pnq

Article link: https://blog.csdn.net/qq_44103902/article/details/115038338

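Contents

Huffman coding
Compact storage of the tree
Marking the end of the file
Compressed file structure
Overview of compression and decompression:
Selected implementation details
Full source code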

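This program implements lossless file compression and decompression based on a Huffman tree.

Compression works byte by byte: count how many times each value in 0~255 occurs, build a Huffman tree from those counts, derive the code table, and re-encode the file with it. Because decompression needs the structure of the Huffman tree, the tree is also serialized as a 0/1 sequence and stored at the head of the compressed file.

Decompression reads the information at the head of the file, rebuilds the Huffman tree from it, and uses the tree to decode the data back into the original file.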
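The counting step that everything else builds on can be sketched as follows. This is only an illustration: the name `count_bytes` and the use of `std::string` as the input buffer are assumptions made for this example and do not appear in the program itself.

```cpp
#include <array>
#include <string>

// Count how often each byte value (0..255) occurs in the input.
std::array<int, 256> count_bytes(const std::string &data) {
    std::array<int, 256> byte_cnt{};   // value-initialized, so every count starts at 0
    for (unsigned char b : data) {     // read each char as unsigned so the index stays in 0..255
        ++byte_cnt[b];
    }
    return byte_cnt;
}
```

For the input "goood", the count for 'o' is 3 and the counts for 'g' and 'd' are 1 each; all other entries stay 0.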

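The program is cleanly layered and its functions are largely independent of one another, so they are easy to split up, recombine, or port.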

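Huffman coding

Raw data usually uses a fixed-length code. When the symbols occur with different frequencies, however, giving frequent symbols short codes and rare symbols long codes can shrink the data considerably.
For example, take the string 'ABCDAABCABA'.
A, B, C, and D occur 5, 3, 2, and 1 times respectively. With a fixed-length code:

A --- 00
B --- 01
C --- 10
D --- 11

the string takes 22 bits:

00 01 10 11 00 00 01 10 00 01 00

With a different code:

A --- 0
B --- 10
C --- 110
D --- 111

it takes only 20 bits:

0 10 110 111 0 0 10 110 0 10 0

No code word is a prefix of another, so the bit stream can still be decoded unambiguously.
The more the symbol frequencies differ, the greater the saving.
This kind of code is a Huffman code.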
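The two bit counts above are easy to check mechanically. The following is a minimal sketch; `encoded_bits`, `kFixed`, and `kVariable` are hypothetical names introduced only for this check.

```cpp
#include <map>
#include <string>

// Total number of bits needed to encode `text` with the given code table.
int encoded_bits(const std::string &text, const std::map<char, std::string> &table) {
    int bits = 0;
    for (char ch : text) {
        bits += static_cast<int>(table.at(ch).size());  // at() throws if a symbol has no code
    }
    return bits;
}

// The two code tables from the example above.
const std::map<char, std::string> kFixed    = {{'A', "00"}, {'B', "01"}, {'C', "10"},  {'D', "11"}};
const std::map<char, std::string> kVariable = {{'A', "0"},  {'B', "10"}, {'C', "110"}, {'D', "111"}};
```

Calling `encoded_bits("ABCDAABCABA", kFixed)` gives 22 and `encoded_bits("ABCDAABCABA", kVariable)` gives 20, matching the figures in the text.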

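Compact storage of the tree

As noted above, the decompressor needs the coding information. There are three options:
1. Store the code table and decode directly from it;
2. Store the Huffman tree;
3. Store the per-byte occurrence counts and rebuild the Huffman tree from them.

With option 1, the entries of the code table have varying lengths, which makes them awkward to store.
With option 3, if each count takes 32 bits, the counts need 32*256 = 8192 bits, which is too much space.

This program uses option 2 and encodes the Huffman tree as a 0/1 sequence in the following way:
when a tree in which every internal node has two children (as in a Huffman tree) is traversed depth-first with a stack, the order of pushes and pops uniquely determines its structure.

For example, a depth-first traversal of the tree below produces this push/pop sequence:

0001011011

0 means push (green in the figure below) and 1 means pop (blue in the figure below).

[Figure: the example tree, with pushes in green and pops in blue]
To parse this 0/1 sequence, replay the pushes and pops of the depth-first traversal to rebuild the tree.

In general, the structure of a binary tree with n nodes can be represented in just 2n bits.

This only restores the shape of the Huffman tree, though. Each leaf must also record which byte it stands for, so after the 2n structure bits we store the leaf bytes in the order in which the traversal visits them.
For the tree in the figure, that adds 3*8 bits after the structure information.

The following 0/1 sequence therefore describes the Huffman tree in the figure completely:

0001011011 01100111 01100100 01101111

[Figure: the example tree with its leaf bytes 'g', 'd', 'o']

In the extreme case where all byte values 0~255 occur in the file, the Huffman tree has 256+255 nodes, so the structure takes 2*(256+255)=1022 bits and the leaf information takes 8*256=2048 bits, 1022+2048=3070 bits in total.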
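The arithmetic generalizes to any number of distinct bytes. A small sketch, where `tree_storage_bits` is a hypothetical helper written for this article and not part of the program:

```cpp
// Bits needed to store a Huffman tree with `leaves` distinct byte values
// under the scheme above: 2 bits per node plus 8 bits per leaf.
int tree_storage_bits(int leaves) {
    int nodes = 2 * leaves - 1;     // a tree whose internal nodes all have two children
                                    // has one internal node fewer than it has leaves
    return 2 * nodes + 8 * leaves;
}
```

`tree_storage_bits(3)` is 34 (the 10 + 24 bits of the small example) and `tree_storage_bits(256)` is 3070, well below the 8192 bits that option 3 would need.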

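Marking the end of the file

Files are stored in whole bytes, so their bit count is a multiple of 8. The compressed bit stream usually is not, so 0~7 padding bits end up at the tail of the file when it is written. If they are ignored, the decompressor mistakes them for data and the decompressed file gains a few extra bytes at the end.
For example, suppose a file holds these five bytes:

{'g', 'o', 'o' , 'o' , 'd'}

Tree structure:

0001011011

Leaf sequence:

01100111 01100100 01101111

Data:

00 1 1 1 01

Concatenated:

00010110110110011101100100011011110011101

[Figure: the concatenated bit stream]
That is 41 bits in total, so the last byte has 7 padding bits. Left unhandled, whether they are filled with 0s or with 1s, the decompressor decodes extra data from them:

Decompressed result:

{'g', 'o', 'o' , 'o' , 'd', 'g', 'g', 'g'}

So the decompressor needs some way to be told "this is the end!"
The simplest way would be to store the end position, 41, in the file header (that was my first idea too). For large files that number gets very large, though: a 32-bit field only covers files up to roughly 512 MB.

Instead, I record the number of padding bits, 7, in the first 3 bits of the file. The value lies in [0,7], so 3 bits are exactly enough, which saves a good deal of space compared with 32 bits.

Of course, the 3 extra bits in front shift the data back, so the padding shrinks to 4 bits.
The byte array {'g', 'o', 'o' , 'o' , 'd'} therefore compresses to this:
[Figure: the final compressed bit stream]
One more thing: you may ask how the length of the tree-structure code is known. Starting from the 4th bit and reading forward, the structure ends as soon as the number of 0s equals the number of 1s, and its length is exactly twice the number of nodes in the tree.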
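To see the padding problem concretely, the sketch below decodes a '0'/'1' string with the code table of the example (g = 00, d = 01, o = 1) and computes the padding for a given bit count. Both `decode_with_table` and `padding_bits` are hypothetical helpers written for this illustration; the real program walks the tree instead of looking codes up in a map.

```cpp
#include <map>
#include <string>

// Decode a string of '0'/'1' characters with a prefix-free code table.
// Bits left over at the end that do not form a complete code are dropped.
std::string decode_with_table(const std::string &bits,
                              const std::map<std::string, char> &table) {
    std::string out, cur;
    for (char b : bits) {
        cur += b;
        auto it = table.find(cur);
        if (it != table.end()) {   // a complete code word has been read
            out += it->second;
            cur.clear();
        }
    }
    return out;
}

// Number of padding bits needed to round `payload_bits` up to a whole byte.
int padding_bits(long long payload_bits) {
    return static_cast<int>((8 - payload_bits % 8) % 8);
}
```

Decoding the 7 data bits "0011101" yields "goood", while decoding them followed by 7 zero padding bits yields "gooodggg". `padding_bits(41)` is 7, and once the 3-bit header is added, `padding_bits(44)` is 4.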

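Compressed file structure

Based on the description above, the compressed file is laid out as follows:

Name	Meaning	Length (bits)
useless_bit_cnt	Number of padding bits at the end of the file	3
tree_struct_code	0/1 sequence describing the structure of the Huffman tree	Variable; ends when the numbers of 0s and 1s read are equal
byte_sequence	0/1 sequence holding the leaf bytes of the Huffman tree	Variable; 8*(tree_struct_code.size()/4+1) with integer division, i.e. 8 bits per leaf
data	0/1 sequence, the data area	Variable
useless_bit	Padding bits	useless_bit_cnt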
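The table can be read as a parsing recipe. The sketch below splits a compressed stream, given as a string of '0'/'1' characters, into its fields. The `Layout` struct and `split_layout` function are hypothetical names used only here; the real program works on packed bytes rather than character strings.

```cpp
#include <cstddef>
#include <string>

// Hypothetical holder for the fields of the layout table, kept as '0'/'1' strings.
struct Layout {
    int useless_bit_cnt;
    std::string tree_struct_code;
    std::string byte_sequence;
    std::string data;
};

Layout split_layout(const std::string &bits) {
    Layout out;
    // Field 1: 3 bits, most significant bit first.
    out.useless_bit_cnt = (bits.at(0) - '0') * 4 + (bits.at(1) - '0') * 2 + (bits.at(2) - '0');
    // Field 2: read until pushes (0) and pops (1) balance.
    std::size_t pos = 3, zeros = 0, ones = 0;
    do {
        char b = bits.at(pos++);
        if (b == '0') ++zeros; else ++ones;
        out.tree_struct_code += b;
    } while (zeros != ones);
    // Field 3: a tree with n nodes has (n + 1) / 2 leaves, 8 bits each.
    std::size_t nodes = out.tree_struct_code.size() / 2;
    std::size_t leaf_bits = (nodes + 1) / 2 * 8;
    out.byte_sequence = bits.substr(pos, leaf_bits);
    pos += leaf_bits;
    // Field 4: everything up to the trailing padding bits.
    out.data = bits.substr(pos, bits.size() - pos - static_cast<std::size_t>(out.useless_bit_cnt));
    return out;
}
```

Fed the 48-bit stream for "goood" (header 100, the 10 structure bits, the 24 leaf bits, the 7 data bits, and 4 padding bits), it returns a padding count of 4, the structure 0001011011, and the data 0011101.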

Overview of compression and decompression:

[Figure: flowchart of the compression and decompression process]

Selected implementation details

Node definition of the Huffman tree

struct Node{
	unsigned char c; // the byte this node stands for; only meaningful for leaves
	int weight; // weight of this node
	Node *lchild;
	Node *rchild;

	Node(unsigned char c_, int weight_, Node *lchild_ = NULL, Node *rchild_ = NULL){
		c = c_;
		weight = weight_;
		lchild = lchild_;
		rchild = rchild_;
	}
};

Building the Huffman tree

The classic priority-queue implementation.

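class Compare_Node_Pointer{
  public:
    // Orders nodes so that the priority queue acts as a min-heap on weight.
    bool operator () (const Node *a, const Node *b) const{
        return a->weight > b->weight;
    }
};

Node *create_hfmTree_by_byte_cnt(int const byte_cnt[]){
	/**
	 * Build a Huffman tree from byte_cnt and return its root
	 * @byte_cnt: array of length 256 holding how often each byte occurs in the file,
	 * 			  e.g. byte_cnt[65] = 101 means byte 65 occurs 101 times.
	 * 			  At least one count must be non-zero.
	 */
	priority_queue<Node*, vector<Node*>, Compare_Node_Pointer> q;
	for ( int i=0; i<256; i++ ){
		if ( byte_cnt[i] ) {
			q.push( new Node(i, byte_cnt[i]) );
		}
	}
	// cout << "q.size() = " << q.size() << endl; //log
	if ( q.size() == 1 ) {
		// Only one distinct byte: add a zero-weight dummy leaf so that the real
		// byte gets a 1-bit code instead of an empty one.
		q.push( new Node(q.top()->c ^ 1, 0) );
	}
	while ( q.size() > 1 ) {
		Node *a = q.top(); q.pop();
		Node *b = q.top(); q.pop();
		q.push( new Node('x', a->weight + b->weight, a, b) );
	}
	return q.top();
}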
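One way to sanity-check a tree built like this: the size of the encoded data equals the sum of the weights of all merged (internal) nodes, because every merge adds one bit to the code of each symbol beneath the new node. The sketch below computes that total without building any nodes; `huffman_total_bits` is a hypothetical helper, not a function of the program.

```cpp
#include <functional>
#include <queue>
#include <vector>

// Size in bits of the Huffman-encoded data for the given symbol counts.
long long huffman_total_bits(const std::vector<long long> &counts) {
    // Min-heap of weights, mirroring the priority queue used to build the tree.
    std::priority_queue<long long, std::vector<long long>, std::greater<long long>> q;
    for (long long c : counts) {
        if (c > 0) q.push(c);
    }
    long long total = 0;
    while (q.size() > 1) {
        long long a = q.top(); q.pop();
        long long b = q.top(); q.pop();
        total += a + b;   // the merged node's weight: one extra bit for every symbol below it
        q.push(a + b);
    }
    return total;
}
```

For the counts 5, 3, 2, 1 of the 'ABCDAABCABA' example this gives 20, and for the counts 1, 3, 1 of "goood" it gives 7. With a single distinct symbol it returns 0, because no merge happens at all.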

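Printing the tree

Used for debugging and for inspecting intermediate results.

I worked out a scheme that decides, by whether adjacent bits of the path code are equal, whether to print a run of spaces or a vertical bar followed by spaces. It prints a nice-looking binary tree.

For details, see The Binary Tree's Road to Beauty (How to Print a Binary Tree a Bit More Nicely).

void print_hfmTree(Node *root, int deep = 1, string code=".") {
	/**
	 * Print this binary tree
	 * @root root of the tree
	 * @deep depth of this node
	 * @code path code from the root to this node: 0 for left, 1 for right
	 */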
	if (!root) {
		return;
	}
	print_hfmTree(root->rchild, deep+1, code+"1");
	for (int i = 0; i < deep; ++i){
		printf(i==deep-1?"+---": (code[i]==code[i+1]?"     ":"|    "));
	}
	if (root->lchild){
		printf("(_)\n");
	}else{
		printf("(%d)\n", root->c);
	}
	print_hfmTree(root->lchild, deep+1, code+"0");
}

This produces a nicer-looking result like this:
[Figure: a pretty-printed binary tree]

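Encoding and decoding the tree

During compression, the Huffman tree is encoded as a 0/1 sequence and stored in the compressed file. During decompression, the 0/1 sequence stored in the file is parsed back into a Huffman tree.

Encoding:

Traverse the Huffman tree depth-first.

On every push, append a 0 to tree_struct_code; on every pop, append a 1 to tree_struct_code.

Whenever a leaf is visited, append the byte of that leaf to byte_sequence.

void encode_hfmTree(Node const *root, vector<bool> &tree_struct_code, vector<unsigned char> &byte_sequence){
	/**
	 * Encode the Huffman tree: the structure as a 0/1 sequence, the leaves as a byte sequence
	 * Traverses the tree in preorder with a stack; 0 means push, 1 means pop
	 * @root: root of the Huffman tree
	 * @tree_struct_code: output, the structure of the tree as a 0/1 sequence
	 * @byte_sequence: output, the leaf bytes in the order the traversal visits them
	 */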
	stack<const Node *> s;
	s.push(root);
	tree_struct_code.push_back(0);
	map<const Node *, bool> vis;
	while(s.size()){
		const Node *curr = s.top();
		vis[curr] = true;
		if (curr->lchild){
			if (!vis[curr->lchild]) {
				s.push(curr->lchild);
				tree_struct_code.push_back(0);
			} else if (!vis[curr->rchild]) {
				s.push(curr->rchild);
				tree_struct_code.push_back(0);
			} else {
				tree_struct_code.push_back(1); s.pop();
			}
		} else {
			tree_struct_code.push_back(1); s.pop();
			byte_sequence.push_back(curr->c);
		}

	}
}
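The same encoding can also be written recursively, which avoids the explicit stack and the visited map. The following is a sketch of that alternative, not the program's implementation: `TNode` is a minimal stand-in for `Node`, and the outputs are plain strings instead of `vector<bool>` and `vector<unsigned char>`.

```cpp
#include <string>

// Minimal stand-in for the article's Node (weights omitted); the name is illustrative.
struct TNode {
    unsigned char c;
    TNode *lchild;
    TNode *rchild;
};

// Append the push/pop code of the subtree at `n` to `structure` and the bytes
// of its leaves, in visiting order, to `leaves`.
void encode_recursive(const TNode *n, std::string &structure, std::string &leaves) {
    structure += '0';                        // entering a node corresponds to a push
    if (n->lchild == nullptr) {
        leaves += static_cast<char>(n->c);   // leaf: record its byte
    } else {
        encode_recursive(n->lchild, structure, leaves);
        encode_recursive(n->rchild, structure, leaves);
    }
    structure += '1';                        // leaving a node corresponds to a pop
}
```

For the example tree (an internal left child with leaves 'g' and 'd', and a right leaf 'o') this produces the structure 0001011011 and the leaf order g, d, o. The recursion depth equals the tree height, which is at most 255 for byte-level Huffman trees.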

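Decoding

Replay the depth-first traversal.
Walk through tree_struct_code: on a 0, create a new node and push it; on a 1, pop. If the node on top of the stack is a leaf just before the pop, take the next byte from byte_sequence and store it in that leaf.

Node* decode_hfmTree(const vector<bool> &tree_struct_code, const vector<unsigned char> &byte_sequence){
	/**
	 * Decode the 0/1 sequence and the byte sequence back into a Huffman tree
	 * @tree_struct_code: 0/1 sequence describing a Huffman tree
	 * @byte_sequence: sequence of leaf bytes
	 * The tree is built in preorder: 0 creates a new node below the current one
	 * (as its left child if it has none yet, otherwise as its right child);
	 * 1 backtracks (moves up)
	 */
	stack<Node *> s;
	Node *root = NULL; // stays NULL if tree_struct_code is empty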
	int p = 0;
	for ( auto i : tree_struct_code ) {
		if (i==0) {
			if (s.size()) {
				Node *curr = new Node(0,0);
				if (s.top()->lchild) {
					s.top()->rchild = curr;
				} else {
					s.top()->lchild = curr;
				}
				s.push(curr);
			} else {
				s.push( new Node(0, 0) );
			}
		} else {
			if (s.top()->lchild==NULL && s.top()->rchild==NULL){
				s.top()->c = byte_sequence[p++];
			}
			root = s.top(); s.pop();
		}
	}
	return root;
}
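Decoding has a recursive counterpart as well. The sketch below is an alternative written for illustration, with the hypothetical names `UNode` and `decode_recursive`. It assumes every internal node has exactly two children, which holds for Huffman trees, and it uses `std::unique_ptr` so that the rebuilt tree frees itself.

```cpp
#include <cstddef>
#include <memory>
#include <string>

// Illustrative node type that owns its children.
struct UNode {
    unsigned char c = 0;
    std::unique_ptr<UNode> lchild, rchild;
};

// Parse one subtree whose '0' (push) is at structure[pos].
// Advances `pos` past the subtree and `leaf` past the leaf bytes it consumed.
std::unique_ptr<UNode> decode_recursive(const std::string &structure, std::size_t &pos,
                                        const std::string &leaves, std::size_t &leaf) {
    std::unique_ptr<UNode> node(new UNode());
    ++pos;                                   // consume this node's '0'
    if (structure.at(pos) == '0') {          // another push follows: internal node
        node->lchild = decode_recursive(structure, pos, leaves, leaf);
        node->rchild = decode_recursive(structure, pos, leaves, leaf);
    } else {                                 // immediate pop: leaf
        node->c = static_cast<unsigned char>(leaves.at(leaf++));
    }
    ++pos;                                   // consume this node's '1'
    return node;
}
```

Starting with `pos = 0` and `leaf = 0` on the structure 0001011011 and the leaves g, d, o, the call consumes all 10 structure bits and all 3 leaf bytes and rebuilds the example tree.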

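Reading and writing bits

Compression needs bit-level writes and decompression needs bit-level reads.
Reading is simple: a left shift to build the mask plus a bitwise AND is enough.

Writing uses an AND or an OR depending on whether a 0 or a 1 is written:
write a 0 with AND
write a 1 with OR

bool get_by_bit(unsigned char const arr[], int idx) {
	/**
	 * Get the bit at index idx of a byte array
	 * @arr: byte array
	 * @idx: bit index (bit 0 is the most significant bit of arr[0])
	 */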
	return arr[idx/8] & (1 << (7 - idx%8));
}

void set_by_bit(unsigned char arr[], int idx, bool value){
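	/**
	 * Set the bit at index idx of a byte array
	 * @arr: byte array
	 * @idx: bit index (bit 0 is the most significant bit of arr[0])
	 * @value: the bit value to write
	 */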
	value ?
	arr[idx/8] |= (1 << (7 - idx%8)) :
	arr[idx/8] &= (~(1 << (7 - idx%8)));
}
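The two helpers need a buffer whose size is known in advance. When that is inconvenient, an append-only writer with the same most-significant-bit-first layout is an option. The `BitWriter` class below is a hypothetical sketch and is not used by the program.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: appends bits MSB-first and grows the buffer as needed.
class BitWriter {
public:
    void put(bool bit) {
        if (bit_count_ % 8 == 0) {
            bytes_.push_back(0);   // start a new byte, pre-filled with zero bits
        }
        if (bit) {
            bytes_.back() |= static_cast<unsigned char>(1u << (7 - bit_count_ % 8));
        }
        ++bit_count_;
    }
    std::size_t bit_count() const { return bit_count_; }
    const std::vector<unsigned char> &bytes() const { return bytes_; }

private:
    std::vector<unsigned char> bytes_;
    std::size_t bit_count_ = 0;
};
```

Writing the nine bits 1 0 1 1 0 0 0 0 1 yields the two bytes 0xB0 and 0x80; the seven unused bits of the last byte stay 0, so only a 0 bit never needs an explicit write.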

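Getting the code table

Thanks to C++ references, the recursive implementation is very concise.

void get_encode_table(vector<string> &encode_table, Node * root, string curr_code = ""){
	/**
	 * Derive the code table from the Huffman tree
	 * @encode_table output, the code table, i.e. the mapping ( byte --> 0/1 sequence )
	 * @root root of the Huffman tree
	 * @curr_code path code of the current node
	 */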
	if (root->lchild==NULL){
		encode_table[root->c] = curr_code;
		return;
	}else{
		get_encode_table(encode_table, root->lchild, curr_code+"0");
		get_encode_table(encode_table, root->rchild, curr_code+"1");
	}
}
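As a cross-check, the code table can also be derived directly from the stored push/pop sequence and leaf bytes, without rebuilding any nodes. The sketch below is an alternative written for illustration (the name `table_from_structure` is hypothetical); it tracks the path code of the node on top of the stack while replaying the sequence.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Rebuild the code table from the push/pop string and the leaf bytes.
std::map<char, std::string> table_from_structure(const std::string &structure,
                                                 const std::string &leaves) {
    std::map<char, std::string> table;
    std::string path;             // path code of the node currently on top of the stack
    std::vector<int> children;    // number of children seen so far, per node on the stack
    std::size_t next_leaf = 0;
    for (char bit : structure) {
        if (bit == '0') {                         // push
            if (!children.empty()) {
                path += (children.back() == 0 ? '0' : '1');   // first child is the left one
                ++children.back();
            }
            children.push_back(0);
        } else {                                  // pop
            if (children.back() == 0) {           // popped without children: a leaf
                table[leaves.at(next_leaf++)] = path;
            }
            children.pop_back();
            if (!path.empty()) path.pop_back();
        }
    }
    return table;
}
```

For the structure 0001011011 with the leaves g, d, o this yields g = 00, d = 01, o = 1, the same codes used in the data section of the example.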

Full source code

Below is the version committed on 2021-03-24; for other versions, see the GitHub link.

#include <iostream>
#include <string>
#include <vector>
#include <queue>
#include <stack>
#include <map>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
using namespace std;


struct Node{
	unsigned char c; // the byte this node stands for; only meaningful for leaves
	int weight; // weight of this node
	Node *lchild;
	Node *rchild;

	Node(unsigned char c_, int weight_, Node *lchild_ = NULL, Node *rchild_ = NULL){
		c = c_;
		weight = weight_;
		lchild = lchild_;
		rchild = rchild_;
	}
};


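class Compare_Node_Pointer{
  public:
    // Orders nodes so that the priority queue acts as a min-heap on weight.
    bool operator () (const Node *a, const Node *b) const{
        return a->weight > b->weight;
    }
};

Node *create_hfmTree_by_byte_cnt(int const byte_cnt[]){
	/**
	 * Build a Huffman tree from byte_cnt and return its root
	 * @byte_cnt: array of length 256 holding how often each byte occurs in the file,
	 * 			  e.g. byte_cnt[65] = 101 means byte 65 occurs 101 times.
	 * 			  At least one count must be non-zero.
	 */
	priority_queue<Node*, vector<Node*>, Compare_Node_Pointer> q;
	for ( int i=0; i<256; i++ ){
		if ( byte_cnt[i] ) {
			q.push( new Node(i, byte_cnt[i]) );
		}
	}
	// cout << "q.size() = " << q.size() << endl; //log
	if ( q.size() == 1 ) {
		// Only one distinct byte: add a zero-weight dummy leaf so that the real
		// byte gets a 1-bit code instead of an empty one.
		q.push( new Node(q.top()->c ^ 1, 0) );
	}
	while ( q.size() > 1 ) {
		Node *a = q.top(); q.pop();
		Node *b = q.top(); q.pop();
		q.push( new Node('x', a->weight + b->weight, a, b) );
	}
	return q.top();
}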

void print_hfmTree(Node *root, int deep = 1, string code=".") {
	/**
	 * Print this binary tree
	 * @root root of the tree
	 * @deep depth of this node
	 * @code path code from the root to this node: 0 for left, 1 for right
	 */
	if (!root) {
		return;
	}
	// cout << "__LINE__ = " << __LINE__ << endl; //log
	// cout << "deep = " << deep << endl; //log
	print_hfmTree(root->rchild, deep+1, code+"1");
	for (int i = 0; i < deep; ++i){
		printf(i==deep-1?"+---": (code[i]==code[i+1]?"     ":"|    "));
	}
	if (root->lchild){
		printf("(_)\n");
	}else{
		printf("(%d)\n", root->c);
	}
	print_hfmTree(root->lchild, deep+1, code+"0");
}

bool get_by_bit(unsigned char const arr[], int idx) {
	/**
	 * Get the bit at index idx of a byte array
	 * @arr: byte array
	 * @idx: bit index (bit 0 is the most significant bit of arr[0])
	 */
	return arr[idx/8] & (1 << (7 - idx%8));
}

void set_by_bit(unsigned char arr[], int idx, bool value){
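	/**
	 * Set the bit at index idx of a byte array
	 * @arr: byte array
	 * @idx: bit index (bit 0 is the most significant bit of arr[0])
	 * @value: the bit value to write
	 */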
	value ?
	arr[idx/8] |= (1 << (7 - idx%8)) :
	arr[idx/8] &= (~(1 << (7 - idx%8)));
}

void encode_hfmTree(Node const *root, vector<bool> &tree_struct_code, vector<unsigned char> &byte_sequence){
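	/**
	 * Encode the Huffman tree: the structure as a 0/1 sequence, the leaves as a byte sequence
	 * Traverses the tree in preorder with a stack; 0 means push, 1 means pop
	 * @root: root of the Huffman tree
	 * @tree_struct_code: output, the structure of the tree as a 0/1 sequence
	 * @byte_sequence: output, the leaf bytes in the order the traversal visits them
	 */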
	stack<const Node *> s;
	s.push(root);
	tree_struct_code.push_back(0);
	map<const Node *, bool> vis;
	while(s.size()){
		const Node *curr = s.top();
		vis[curr] = true;
		if (curr->lchild){
			if (!vis[curr->lchild]) {
				s.push(curr->lchild);
				tree_struct_code.push_back(0);
			} else if (!vis[curr->rchild]) {
				s.push(curr->rchild);
				tree_struct_code.push_back(0);
			} else {
				tree_struct_code.push_back(1); s.pop();
			}
		} else {
			tree_struct_code.push_back(1); s.pop();
			byte_sequence.push_back(curr->c);
		}

	}
}

Node* decode_hfmTree(const vector<bool> &tree_struct_code, const vector<unsigned char> &byte_sequence){
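	/**
	 * Decode the 0/1 sequence and the byte sequence back into a Huffman tree
	 * @tree_struct_code: 0/1 sequence describing a Huffman tree
	 * @byte_sequence: sequence of leaf bytes
	 * The tree is built in preorder: 0 creates a new node below the current one
	 * (as its left child if it has none yet, otherwise as its right child);
	 * 1 backtracks (moves up)
	 */
	stack<Node *> s;
	Node *root = NULL; // stays NULL if tree_struct_code is empty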
	int p = 0;
	for ( auto i : tree_struct_code ) {
		if (i==0) {
			if (s.size()) {
				Node *curr = new Node(0,0);
				if (s.top()->lchild) {
					s.top()->rchild = curr;
				} else {
					s.top()->lchild = curr;
				}
				s.push(curr);
			} else {
				s.push( new Node(0, 0) );
			}
		} else {
			if (s.top()->lchild==NULL && s.top()->rchild==NULL){
				s.top()->c = byte_sequence[p++];
			}
			root = s.top(); s.pop();
		}
	}
	return root;
}

void get_encode_table(vector<string> &encode_table, Node * root, string curr_code = ""){
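	/**
	 * Derive the code table from the Huffman tree
	 * @encode_table output, the code table, i.e. the mapping ( byte --> 0/1 sequence )
	 * @root root of the Huffman tree
	 * @curr_code path code of the current node, built up by the recursion
	 */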
	if (root->lchild==NULL){
		encode_table[root->c] = curr_code;
		return;
	}else{
		get_encode_table(encode_table, root->lchild, curr_code+"0");
		get_encode_table(encode_table, root->rchild, curr_code+"1");
	}
}

vector<unsigned char> zip_process(unsigned char *buf, int file_len){
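	/**
	 * Compress file_len bytes starting at buf and return the result
	 * Compressed format: padding-bit count (3 bits) + 0/1 sequence of the tree structure + leaf byte sequence + encoded data + padding bits
	 * @buf: start address
	 * @file_len: number of bytes to compress
	 */

	/* 1. Count how often each byte value occurs */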
	int byte_cnt[256] = {0};
	for (int i = 0; i < file_len; ++i){
		byte_cnt[buf[i]]++;
	}

	/* 2. Build the Huffman tree */
	Node *root = create_hfmTree_by_byte_cnt(byte_cnt);
	print_hfmTree(root); //log
	
	/* 3. Encode the tree: get the 0/1 sequence for its structure and the sequence of leaf bytes */
	vector<bool> tree_struct_code;
	vector<unsigned char> byte_sequence;
	encode_hfmTree(root, tree_struct_code, byte_sequence);
	//log
	cout << "__LINE__ = " << __LINE__ << endl; //log
	cout << "tree_struct_code.size() = " << tree_struct_code.size() << endl; //log
	for ( auto v : tree_struct_code ) {
		cout << v;
	}
	cout << endl;
	cout << "byte_sequence.size() = " << byte_sequence.size() << endl; //log
	for ( auto v : byte_sequence ) {
		printf("%d ", v);
	}
	cout << endl;

	/* 4. Get the 0/1 sequence for each byte (the code table) */
	vector<string > encode_table(256);
	get_encode_table(encode_table, root);
	//log
	for (int i = 0; i < 256; ++i){
		printf("%d -> %s\n", i, encode_table[i].c_str());
	}

	/* 5. Compute the size of the output in bits and create the output buffer */
	int tree_struct_code_len = tree_struct_code.size();
	int byte_sequence_len = byte_sequence.size()*8;
	int out_file_len = 3 + tree_struct_code_len + byte_sequence_len;
	for (int i = 0; i < 256; ++i){
		out_file_len += encode_table[i].size()*byte_cnt[i];
	}
	vector<unsigned char> out_buf_vector(out_file_len/8 + (out_file_len%8!=0));
	unsigned char* out_buf_char_star = &out_buf_vector[0];
	// log
	cout << "__LINE__ = " << __LINE__ << endl; //log
	cout << "tree_struct_code_len = " << tree_struct_code_len << endl; //log
	cout << "byte_sequence.size() = " << byte_sequence.size() << endl; //log
	cout << "out_file_len = " << out_file_len << endl; //log

	/* 6.1 Store the number of padding bits at the end of the file in the first 3 bits of the buffer (it can only be 0~7) */
	/*     Big-endian style: most significant bit first                                                                  */
	int useless_bit_cnt = (8 - out_file_len % 8) % 8;
	set_by_bit(out_buf_char_star, 0, (useless_bit_cnt>>2) & 1);
	set_by_bit(out_buf_char_star, 1, (useless_bit_cnt>>1) & 1);
	set_by_bit(out_buf_char_star, 2, (useless_bit_cnt>>0) & 1);
	cout << "in line " << __LINE__ << " useless_bit_cnt = " << useless_bit_cnt << endl; //log

	/* 6.2 Write the 0/1 sequence for the tree structure into the buffer */
	int pointer = 3;
	for (int i = 0; i < tree_struct_code.size(); ++i){
		set_by_bit(out_buf_char_star, pointer++, tree_struct_code[i]);
	}

	/* 6.3 Write the sequence of leaf bytes into the buffer */
	for (int i = 0; i < byte_sequence.size(); ++i){
		for (int j = 0; j < 8; ++j){
			set_by_bit(out_buf_char_star, pointer++, get_by_bit(&byte_sequence[i], j));
		}
	}
	cout << "pointer = " << pointer << endl; //log

	/* 6.4 Write the variable-length 0/1 code of every input byte into the buffer */
	for (int i = 0; i < file_len; ++i){
		string const &code = encode_table[buf[i]];
		for ( auto v : code ){
			set_by_bit(out_buf_char_star, pointer++, v=='1');
		}
	}
	cout << "pointer = " << pointer << endl; //log
	printf("Compression ratio: %.3lf%%\n", (double)out_buf_vector.size()/file_len*100);
	return out_buf_vector;
}

vector<unsigned char> unzip_process(unsigned char *buf, int file_len){
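	/**
	 * Decompress file_len bytes starting at buf and return the result
	 * @buf: start address
	 * @file_len: length of buf in bytes
	 */

	int pointer = 3; // skip the 3 bits that record the number of trailing padding bits

	/* 1. Extract the number of trailing padding bits */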
	int useless_bit_cnt = 0;
	useless_bit_cnt += get_by_bit(buf, 0) << 2;
	useless_bit_cnt += get_by_bit(buf, 1) << 1;
	useless_bit_cnt += get_by_bit(buf, 2) << 0;
	cout << "in line " << __LINE__ << " useless_bit_cnt = " << useless_bit_cnt << endl; //log
	cout << "__LINE__ = " << __LINE__ << endl; //log

	/* 2. Read the 0/1 sequence that describes the tree structure */
	int cnt_0 = 0;
	int cnt_1 = 0;
	vector<bool> tree_struct_code;
	do{
		bool v = get_by_bit(buf, pointer++);
		tree_struct_code.push_back(v);
		cnt_0 += v==0;
		cnt_1 += v==1;
	}while ( cnt_0 != cnt_1 );

	int byte_sequence_len = tree_struct_code.size()/2/2+1;

	// log
	cout << "__LINE__ = " << __LINE__ << endl; //log
	cout << "tree_struct_code.size() = " << tree_struct_code.size() << endl; //log
	for ( auto v : tree_struct_code ) {
		cout << v;
	}
	cout << endl;

	/* 3. Read the byte sequence that holds the leaf information */
	vector<unsigned char> byte_sequence;
	for (int i = 0; i < byte_sequence_len; ++i){
		unsigned char ch = 0;
		for (int j = 0; j < 8; ++j){
			ch <<= 1;
			ch += get_by_bit(buf, pointer++);
		}
		byte_sequence.push_back(ch);
	}

	//log 
	cout << "__LINE__ = " << __LINE__ << endl; //log
	cout << "byte_sequence.size() = " << byte_sequence.size() << endl; //log
	for ( auto v : byte_sequence ) {
		printf("%d ", v);
	}
	cout << endl;

	/* 4. Rebuild the Huffman tree */
	Node *root = decode_hfmTree(tree_struct_code, byte_sequence);
	print_hfmTree(root);
	cout << "__LINE__ = " << __LINE__ << endl; //log

	/* 5. Decode the data by walking the Huffman tree */
	vector<unsigned char> ret;
	Node *curr = root;
	while ( pointer < file_len*8 - useless_bit_cnt ) {
		// cout << "pointer = " << pointer << endl; //log
		// cout << "get_by_bit(buf, pointer) = " << get_by_bit(buf, pointer) << endl; //log
		curr = get_by_bit(buf, pointer++) ? curr->rchild : curr->lchild;
		if (curr->lchild == NULL || curr->rchild==NULL) {
			// printf("curr->c = %d\n", curr->c);
			ret.push_back(curr->c);
			curr = root;
		}
	}
	return ret;
}

void linux_zip_file(const char *file_in, const char *file_out){
	/**
	 * On Linux: compress file_in and store the result in file_out
	 */

	/* 1. Open the file */
	int fd = open(file_in, O_RDONLY);
	if (fd==-1) {
		throw string(file_in) + " in line " + to_string(__LINE__) + " error code = " + to_string(errno);
	}

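	/* 2. Read the file into a buffer */
	int file_len = lseek(fd, 0, SEEK_END); lseek(fd, 0, SEEK_SET);
	cout << "file_len = " << file_len << endl; //log
	unsigned char *buf = (unsigned char*)malloc(file_len);
	if (read(fd, buf, file_len) != file_len) {
		throw string(file_in) + " in line " + to_string(__LINE__) + " short read";
	}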
	//log
	cout << "__LINE__ = " << __LINE__ << endl; //log
	// cout << "file_len = " << file_len << endl; //log
	// for (int i = 0; i < file_len; ++i){
	// 	putchar(buf[i]);
	// }
	

	/* 3. Compress */
	vector<unsigned char> out_buf = zip_process(buf, file_len);


	/* 4. Write to the output file */
	// O_CREAT requires a mode argument; O_TRUNC discards any previous, longer content
	int out_fd = open(file_out, O_CREAT|O_WRONLY|O_TRUNC, 0644);
	if (out_fd==-1) {
		throw string(file_out) + " in line " + to_string(__LINE__) + " error code = " + to_string(errno);
	}

	int n = write(out_fd, &out_buf[0], out_buf.size());
	if ( n<0 ) {
		throw string(file_out) + " in line " + to_string(__LINE__) + " error code = " + to_string(errno);
	}

	/* 5. Close the files */
	close(fd);
	close(out_fd);
	free(buf);
}

void linux_unzip_file(const char *file_in, const char *file_out){
	/**
	 * On Linux: decompress file_in and store the result in file_out
	 */

	/* 1. Open the file */
	int fd = open(file_in, O_RDONLY);
	if (fd==-1) {
		throw string(file_in) + " in line " + to_string(__LINE__) + " error code = " + to_string(errno);
	}

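	/* 2. Read the file into buf */
	int file_len = lseek(fd, 0, SEEK_END); lseek(fd, 0, SEEK_SET);
	unsigned char *buf = (unsigned char*)malloc(file_len);
	if (read(fd, buf, file_len) != file_len) {
		throw string(file_in) + " in line " + to_string(__LINE__) + " short read";
	}

	/* 3. Decompress into the out_buf buffer */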
	vector<unsigned char> out_buf = unzip_process(buf, file_len);
	// cout << "__LINE__ = " << __LINE__ << endl; //log
	// cout << "out_buf.size() = " << out_buf.size() << endl; //log
	// for (auto i : out_buf ) {
	// 	printf("%c", i);
	// }
	cout << "-------------------------------------------------" << endl;
	/* 4. Write to the output file */
	// O_CREAT requires a mode argument; O_TRUNC discards any previous, longer content
	int out_fd = open(file_out, O_CREAT|O_WRONLY|O_TRUNC, 0644);
	cout << "out_fd = " << out_fd << endl; //log
	if (out_fd==-1) {
		throw string(file_out) + " in line " + to_string(__LINE__) + " error code = " + to_string(errno);
	}
	int n = write(out_fd, &out_buf[0], out_buf.size());
	if ( n<0 ) {
		throw string(file_out) + " in line " + to_string(__LINE__) + " error code = " + to_string(errno);
	}

	/* 5. Close the files */
	close(fd);
	close(out_fd);
	free(buf);
}

void windows_zip_file(const char *file_in, const char *file_out) {
	/**
	 * On Windows: compress file_in and store the result in file_out
	 */
	FILE *fp_in = fopen(file_in, "rb");
	if ( fp_in==NULL ) {
		throw string(file_in) + " in line " + to_string(__LINE__);
	}
	
	vector<unsigned char> buf;
	unsigned char ch;
	while(fread(&ch, 1, 1, fp_in)){
		buf.push_back(ch);
	}
	cout << "buf.size() = " << buf.size() << endl; //log
	vector<unsigned char> out_buf = zip_process(&buf[0], buf.size());

	FILE *fp_out = fopen(file_out, "wb");
	if ( fp_out==NULL ) {
		throw string(file_out) + " in line " + to_string(__LINE__);
	}
	fwrite(&out_buf[0], 1, out_buf.size(), fp_out);


	fclose(fp_in);
	fclose(fp_out);
}

void windows_unzip_file(const char *file_in, const char *file_out){
	/**
	 * On Windows: decompress file_in and store the result in file_out
	 */

	/* 1. Open the file */
	FILE *fp_in = fopen(file_in, "rb");
	if (fp_in==NULL) {
		throw "open file " + string(file_in) + " failed in line " + to_string(__LINE__);
	}
	cout << "fp_in = " << fp_in << endl; //log

	/* 2. Read the file into buf */
	vector<unsigned char> buf;
	char ch;
	while ( fread(&ch, 1, 1, fp_in) ) {
		buf.push_back(ch);
	}
	cout << "buf.size() = " << buf.size() << endl; //log

	/* 3. Decompress into the out_buf buffer */
	vector<unsigned char> out_buf = unzip_process(&buf[0], buf.size());


	/* 4. Write to the output file */
	FILE *fp_out = fopen(file_out, "wb");
	if ( fp_out==NULL ) {
		throw "open file " + string(file_out) + " failed in line " + to_string(__LINE__);
	}
	int n = fwrite(&out_buf[0], 1, out_buf.size(), fp_out);
	cout << "n = " << n << endl; //log
	
	/* 5. Close the files */
	fclose(fp_in);
	fclose(fp_out);
}

int main(int argc, char const *argv[]){

	#if 0
	freopen("out.txt", "w", stdout);
	unsigned char buf[1000];
	for (int i = 0; i < 1000; ++i){
		buf[i] = i*i;
	}
	vector<unsigned char> v = zip_process(buf, sizeof(buf)/sizeof(buf[0]));
	vector<unsigned char> res = unzip_process(&v[0], v.size());
	cout << "------------------------------------------------------------------------------------------" << endl;
	cout << "res.size() = " << res.size() << endl; //log
	bool right = true;
	for (int i = 0; i < 1000; ++i){
		right = right && res[i]==buf[i];
	}
	cout << "right = " << right << endl; //log
	fclose(stdout);
	system("out.txt");
	return 0;
	#endif

	// const char *a = "snake.mp4";
	// const char *b = "snake.mp4.zip";
	// const char *c = "snake0.mp4";

	const char *a = "a.txt";
	const char *b = "a.txt.zip";
	const char *c = "aa.txt";
	try{
		windows_zip_file(a, b);
		windows_unzip_file(b, c);
	}catch(string s) {
		cout << s << endl;
	}



	return 0;
}
