Experiment 3 | Reading a C Implementation of the LZW Codec and Analyzing Its Compression Efficiency

Latest recommended article published on 2022-04-24 18:50:28

Endless Ferry

Category column: Those Data Compression Lab Reports. Article tags: encoder, information compression

Article link: https://blog.csdn.net/weixin_44874766/article/details/115531743

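Contents

1 Compression efficiency analysis
- 1.1 Results
- 1.2 Compression ratio analysis
2 Program walkthrough
3 What if a codeword is missing during decoding?
Appendix: annotated code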

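1 Compression efficiency analysis

1.1 Results

I chose six files of different formats and contents and compressed each of them with LZW and with zip. The resulting file sizes are shown in the table below, with the best result for each file highlighted:
(image: table of file sizes after LZW and zip compression)
zip is not the focus of this analysis, and I do not yet understand how it works internally, so it serves only as a point of comparison. The results show that zip is very capable and performs well on every file type tested.

As the table shows, most files actually grew after LZW compression, and only a few were compressed noticeably. The next section analyzes the results by file format and content.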

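1.2 Compression ratio analysis

For plain-text txt files, the encoder maps variable-length strings to fixed-length codewords. Each codeword is 16 bits, which is 8 bits more than a single-byte ASCII character: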

	#define output(f, x) BitsOutput( f, (unsigned long)(x), 16)

So if the data contains few repeated phrases, the output can end up larger than the input.

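Our small text file contains only the few characters "HelloWorld！", and its encoded output takes up more space than the original. Printing the dictionary as it evolves shows why: every step added a new entry, and no entry was ever reused. Each input byte therefore became its own 16-bit codeword, producing almost twice as much data. No wonder space was wasted!
(image: printout of the dictionary as it evolves)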
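To make the HelloWorld case concrete, here is a minimal sketch that counts the codewords an LZW encoder emits for a string. It is not the program from the appendix: `lzw_count_codes`, `lzw_output_bytes`, and the linear-search dictionary are hypothetical helpers written for illustration, and the input is assumed to be the 11 ASCII bytes of "HelloWorld!" rather than the exact bytes of my test file.

```c
#include <stddef.h>

#define SKETCH_MAX_ENTRIES 4096  /* hypothetical cap, enough for short demos */

/* Each entry is (prefix code, last byte); codes 0..255 are implicit single bytes. */
static int sk_prefix[SKETCH_MAX_ENTRIES];
static int sk_suffix[SKETCH_MAX_ENTRIES];

/* Return the code of prefix+byte, or -1 if that string is not in the dictionary. */
static int sk_find(int prefix, int byte, int next_code) {
    if (prefix < 0) return byte;               /* a single byte is always present */
    for (int code = 256; code < next_code; code++)
        if (sk_prefix[code] == prefix && sk_suffix[code] == byte) return code;
    return -1;
}

/* Count how many codewords the encoder would emit for `text`. */
size_t lzw_count_codes(const char *text) {
    size_t emitted = 0;
    int next_code = 256;
    int current = -1;                          /* code of the string matched so far */
    for (size_t i = 0; text[i] != '\0'; i++) {
        int byte = (unsigned char)text[i];
        int found = sk_find(current, byte, next_code);
        if (found >= 0) {
            current = found;                   /* keep extending the match */
        } else {
            emitted++;                         /* the matched string is written out */
            if (next_code < SKETCH_MAX_ENTRIES) {
                sk_prefix[next_code] = current;
                sk_suffix[next_code] = byte;
                next_code++;
            }
            current = byte;                    /* restart from the current byte */
        }
    }
    if (current >= 0) emitted++;               /* flush the final string */
    return emitted;
}

/* Output size in bytes: 32-bit length header plus 16 bits per codeword. */
size_t lzw_output_bytes(size_t codes) {
    return 4 + 2 * codes;
}
```

With this sketch, `lzw_count_codes("HelloWorld!")` gives 11 codewords for 11 input bytes, so `lzw_output_bytes(11)` is 26 bytes, more than double the input. A repetitive input such as "abababab" needs only 5 codewords for its 8 bytes.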

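The large txt file fares better. It is an 11,492-character article generated with BullshitGenerator. It contains a lot of redundancy, so the compression is very effective and the file shrinks to almost half its size.
If the fixed codeword length were reduced from 16 bits to 12 or 9 bits, the first file would compress somewhat better. The second would not necessarily benefit, because its dictionary may need more entries than a shorter codeword can index.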
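The trade-off between codeword width and dictionary size can be worked out with a little arithmetic. The sketch below assumes the same 4-byte length header as the program and assumes the bit stream is padded up to a whole byte at the end; `packed_size_bytes` and `dictionary_capacity` are hypothetical helper names, not functions from bitio.c.

```c
#include <stddef.h>

/* Bytes needed for `codes` codewords of `width` bits each, plus the
 * 4-byte length header, rounding the bit stream up to a whole byte. */
size_t packed_size_bytes(size_t codes, unsigned width) {
    return 4 + (codes * width + 7) / 8;
}

/* Number of dictionary entries that `width`-bit codewords can address. */
unsigned long dictionary_capacity(unsigned width) {
    return 1UL << width;
}
```

For the 11 codewords of a "HelloWorld!"-sized input this gives 26 bytes at 16 bits, 21 bytes at 12 bits, and 17 bytes at 9 bits, so narrower codewords help small files. The price is capacity: 9 bits address only 512 entries and 12 bits only 4096, against 65536 for 16 bits, which is why a large file with many distinct phrases may not benefit.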

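For the video, audio, and PDF files, the result is as expected: their contents are encoded very differently from ASCII text, and many such formats are already compressed internally, so there is little byte-level repetition left for a dictionary coder to exploit. The compression results are accordingly poor.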

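2 Program walkthrough

The program is fairly complex, so this section walks through it from start to finish.

2.1 Terminology

In the rest of this write-up, the parts of the dictionary are named as follows:
(image: diagram naming the parts of a dictionary entry)

2.2 Flowchart

(image: flowchart of the program)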

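2.3 What the functions do

2.3.1 Program structure

The program consists of two C files. bitio.c handles bit-stream I/O, and lzw_E.c contains the main function and the implementation of the LZW algorithm.

How bitio.c and lzw_E.c relate: LZW encoding turns a file of characters into a file of bits, and one step of that is mapping variable-length strings to fixed-length codewords. Because the codewords are fixed-length, we have to decide what that length is.

This program sets it to 16 bits. Bit operations are then needed to pack the input/output stream into 16-bit units. bitio does this with a sliding-window style method, which I did not examine in detail. Fortunately, 16 bits is exactly two bytes, so at this width the codewords could also be written as whole bytes without bitio.

But what if the codewords were 9 bits long? That is not a whole number of bytes, so the codewords cannot be written directly, and a tool like bitio is needed to pack them.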
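Since I did not study bitio.c closely, the following is only a sketch of how codewords that are not a whole number of bytes can be packed. It writes into a memory buffer rather than a file, and it assumes most-significant-bit-first ordering; `BitWriter`, `bits_put`, and `bits_get` are hypothetical names and may not match what bitio.c actually does.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory bit writer; bitio.c works on a file instead. */
typedef struct {
    uint8_t *buf;      /* destination buffer, must start zero-filled */
    size_t   capacity; /* size of buf in bytes */
    size_t   bit_pos;  /* number of bits written so far */
} BitWriter;

/* Append the low `count` bits of `code`, most significant bit first.
 * Returns 0 on success, -1 if the buffer is too small. */
int bits_put(BitWriter *w, unsigned long code, int count) {
    if (w->bit_pos + (size_t)count > w->capacity * 8) return -1;
    for (int i = count - 1; i >= 0; i--) {
        if ((code >> i) & 1UL)
            w->buf[w->bit_pos / 8] |= (uint8_t)(0x80u >> (w->bit_pos % 8));
        w->bit_pos++;
    }
    return 0;
}

/* Read `count` bits starting at bit offset *pos and advance *pos. */
unsigned long bits_get(const uint8_t *buf, size_t *pos, int count) {
    unsigned long value = 0;
    for (int i = 0; i < count; i++) {
        unsigned bit = (buf[*pos / 8] >> (7 - *pos % 8)) & 1u;
        value = (value << 1) | bit;
        (*pos)++;
    }
    return value;
}
```

Writing the two 9-bit codewords 0x101 and 0x0FF this way fills 18 bits across three bytes (0x80, 0xBF, 0xC0), which shows how the second codeword straddles a byte boundary. Reading them back with `bits_get` and a width of 9 recovers the original values.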

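2.3.2 What each function in `lzw_E.c` does

int DecodeString( int start, int code);
/* Decode a code into its string on the d_stack stack; the characters are stored in reverse order
* @parameters: start: index in d_stack at which to begin writing; code: the code to decode
* @return: number of characters on the stack (start plus the length of the phrase)
*/

void InitDictionary( void);
/* Initialise the dictionary: the entries are structs that form a tree logically and are stored in an array
* @parameters: none (operates on the global dictionary)
* @return: None
*/

void PrintDictionary( void);
/* Print the entries added to the dictionary (codes 256 and above) */

int InDictionary( int character, int string_code);
/* Check whether the current string extended by the new character is in the dictionary
* @parameters: character: the new character; string_code: code of the current string (negative if empty)
* @return: if string_code is negative, so only a single character was passed in, the code of that character
*		   if the extended string is in the dictionary, its code
*		   if it is not, -1
*/

void AddToDictionary( int character, int string_code);
/* Add a newly formed string to the dictionary
* @parameters: character: the last character of the new string; string_code: code of its prefix
* @return: None
*/

void LZWEncode( FILE *fp, BITFILE *bf);
/* Run the complete LZW encoding pass
* @parameters: fp: input file; bf: output bit-stream file
* @return: None
*/

void LZWDecode( BITFILE *bf, FILE *fp);
/* Run the complete LZW decoding pass
* @parameters: bf: input bit-stream file; fp: output file
* @return: None
*/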

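2.3.3 Brief description of the functions in `bitio.c`

void CloseBitFileInput( BITFILE *bf); 
/* Close a bit-stream file opened for input */
void CloseBitFileOutput( BITFILE *bf);
/* Close a bit-stream file opened for output */

int BitInput( BITFILE *bf);
/* Called by the function below */
unsigned long BitsInput( BITFILE *bf, int count);
/* Read count bits from the bit stream */

void BitOutput( BITFILE *bf, int bit);


void BitsOutput( BITFILE *bf, unsigned long code, int count);
/* Write code to the bit stream as a fixed-length codeword of count bits */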

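2.4 Data structures

2.4.1 Physical storage of the dictionary

The dictionary is stored as a linear array. The array index is the codeword, and each element holds a single character of a phrase (string), namely its last one. The complete string has to be reconstructed by following the logical structure of the data.

2.4.2 Logical structure of the dictionary

Logically, the dictionary is a trie. The tree grows dynamically: nodes are added downward from the roots, and each new node corresponds to a new string.

To recover a string, we walk from its node up through its ancestors to the root, collecting one character per node.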
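The array-backed trie described above can be sketched in a few lines. This is a simplified illustration, not the listing from the appendix: `Node`, `trie_init`, `trie_child`, and `trie_add` are hypothetical names, the capacity is deliberately tiny, the roots are not linked to each other as siblings, and a new child is pushed at the head of the sibling list instead of being appended at the tail.

```c
#define NODE_CAP 512   /* hypothetical small capacity for the demo */

typedef struct {
    int suffix;       /* last byte of the string this node stands for */
    int parent;       /* code of the string without its last byte, -1 for roots */
    int firstchild;   /* first extension of this string, -1 if none */
    int nextsibling;  /* next string sharing the same parent, -1 if none */
} Node;

static Node nodes[NODE_CAP];
static int  node_count;

/* Create the 256 single-byte roots. */
void trie_init(void) {
    for (int i = 0; i < 256; i++) {
        nodes[i].suffix = i;
        nodes[i].parent = -1;
        nodes[i].firstchild = -1;
        nodes[i].nextsibling = -1;
    }
    node_count = 256;
}

/* Return the child of `parent` whose suffix is `byte`, or -1 if none. */
int trie_child(int parent, int byte) {
    for (int c = nodes[parent].firstchild; c >= 0; c = nodes[c].nextsibling)
        if (nodes[c].suffix == byte) return c;
    return -1;
}

/* Append `byte` to the string `parent`; returns the new code, or -1 if full. */
int trie_add(int parent, int byte) {
    if (node_count >= NODE_CAP) return -1;
    int code = node_count++;
    nodes[code].suffix = byte;
    nodes[code].parent = parent;
    nodes[code].firstchild = -1;
    nodes[code].nextsibling = nodes[parent].firstchild;  /* push at the head */
    nodes[parent].firstchild = code;
    return code;
}
```

After `trie_init()`, adding "ab", "ac", and "abc" with `trie_add('a', 'b')`, `trie_add('a', 'c')`, and `trie_add(256, 'c')` yields the codes 256, 257, and 258. Pushing at the head makes insertion constant-time, at the cost of a sibling order that differs from the appendix listing; lookups behave the same either way.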

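2.4.3 The `d_stack` stack used when decoding strings

d_stack is a stack, implemented as a plain array.

Because decoding walks from a node up to its root, the characters of the string are found in reverse order. For example, for the word Hello the characters are found in the order olleH.

The data in the stack is therefore reversed, which means the array has to be read from the end back to the start when writing to the file.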
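The push-then-pop pattern can be shown in isolation. The sketch below uses a stripped-down entry with only the two fields that decoding needs; `Entry`, `push_phrase`, and `pop_phrase` are hypothetical names chosen for this illustration.

```c
#include <stddef.h>

/* Hypothetical minimal entry: only the two fields decoding needs. */
typedef struct { int parent; int suffix; } Entry;

/* Walk from `code` up to its root, pushing each suffix; returns the depth. */
size_t push_phrase(const Entry *dict, int code, int *stack) {
    size_t n = 0;
    while (code >= 0) {
        stack[n++] = dict[code].suffix;
        code = dict[code].parent;
    }
    return n;
}

/* Pop the stack into `out` so the phrase comes out in reading order.
 * `out` must have room for n characters plus the terminating NUL. */
void pop_phrase(const int *stack, size_t n, char *out) {
    for (size_t i = 0; i < n; i++)
        out[i] = (char)stack[n - 1 - i];
    out[n] = '\0';
}
```

If entries 256 to 259 spell out "He", "Hel", "Hell", and "Hello", then `push_phrase` on code 259 leaves o, l, l, e, H on the stack, and `pop_phrase` turns that back into "Hello".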

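3 What if a codeword is missing during decoding?

A close look at the flowchart reveals one check:
(image: the flowchart branch that tests whether the codeword is in the dictionary)
During decoding, a codeword can arrive that has not been stored in the dictionary yet. How can that happen?

It happens when the encoder creates a new entry and then uses that very entry for the next phrase.

The root cause is that the decoder always runs one step behind the encoder: it is asked to use the newest entry before it has built it. Data cannot appear from nowhere, so the decoder simply has to repeat what the encoder did. The missing entry is the previous phrase plus one more character, and because the new phrase starts with the previous phrase, that last character must be the first character of the previous phrase. The decoder therefore puts that character on the stack first, and then decodes the previous code on top of it.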
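A small self-contained decoder makes the special case easy to try out. This is a sketch under stated assumptions rather than the appendix listing: it takes codewords from an int array instead of a bit stream, uses a hypothetical capacity `DEC_CAP`, and the name `lzw_decode_codes` is made up for the example. Unlike the listing, it rejects any code larger than the next free code.

```c
#include <stddef.h>

#define DEC_CAP 4096  /* hypothetical dictionary capacity for this sketch */

static int dec_parent[DEC_CAP];
static int dec_suffix[DEC_CAP];

/* Decode `n` codes into `out` (capacity `out_cap`, including the NUL).
 * Returns the decoded length, or -1 on malformed input or overflow. */
long lzw_decode_codes(const int *codes, size_t n, char *out, size_t out_cap) {
    int stack[DEC_CAP];
    int next_code = 256, last = -1, first = 0;
    size_t len = 0;
    if (out_cap == 0) return -1;
    for (int i = 0; i < 256; i++) { dec_parent[i] = -1; dec_suffix[i] = i; }

    for (size_t k = 0; k < n; k++) {
        int code = codes[k], depth = 0, walk;
        if (code < 0 || code > next_code || code >= DEC_CAP) return -1;
        if (code == next_code) {            /* entry not built yet */
            if (last < 0) return -1;        /* impossible for the first code */
            stack[depth++] = first;         /* ends with first byte of previous phrase */
            walk = last;                    /* and begins with the previous phrase */
        } else {
            walk = code;
        }
        while (walk >= 0) {                 /* push suffixes up to the root */
            stack[depth++] = dec_suffix[walk];
            walk = dec_parent[walk];
        }
        first = stack[depth - 1];           /* first byte of this phrase */
        if (len + (size_t)depth + 1 > out_cap) return -1;
        while (depth > 0) out[len++] = (char)stack[--depth];
        if (last >= 0 && next_code < DEC_CAP) {
            dec_parent[next_code] = last;   /* new entry: previous phrase + first */
            dec_suffix[next_code] = first;
            next_code++;
        }
        last = code;
    }
    out[len] = '\0';
    return (long)len;
}
```

The shortest input that triggers the case is "aaa": the encoder emits 97 and then 256, but when the decoder reads 256 its next free code is still 256. Decoding `{97, 256}` with this sketch yields "aaa", and `{97, 98, 256, 258, 98}` yields "abababab", where 258 is again a code the decoder has not built yet.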

Appendix: annotated code

lzw_h.c

/*
 * Definition for LZW coding 
 *
 * vim: ts=4 sw=4 cindent nowrap
 */
#include <stdlib.h>
#include <stdio.h>
#include "bitio.h"
#define MAX_CODE 65535
#pragma warning(disable:4996) 

struct {
	int suffix;
	int parent, firstchild, nextsibling;
} dictionary[MAX_CODE+1];

int next_code;
int d_stack[MAX_CODE]; // stack for decoding a phrase

#define input(f) ((int)BitsInput( f, 16))
#define output(f, x) BitsOutput( f, (unsigned long)(x), 16)

int DecodeString( int start, int code);
void InitDictionary( void);


void PrintDictionary( void){
	int n;
	int count;
	for( n=256; n<next_code; n++){
		count = DecodeString( 0, n);
		printf( "%4d->", n);
		while( 0<count--) printf("%c", (char)(d_stack[count]));
		printf( "\n");
	}
	printf("\n");
}
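/* Print the entries added to the dictionary (codes 256 and above) */

int DecodeString( int start, int code){
	int count; // number of characters on the stack so far
	count = start; // begin writing at index start
	while( 0<=code){
		d_stack[ count] = dictionary[code].suffix; // push the last character of the current string
		code = dictionary[code].parent; // move to the parent; the loop ends once the root has been processed
		count ++; // one more character on the stack
	}
	return count;
}
/* Decode a code into its string on the d_stack stack; the characters are stored in reverse order
* @parameters: start: index in d_stack at which to begin writing; code: the code to decode
* @return: number of characters on the stack (start plus the length of the phrase)
*/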


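void InitDictionary( void){
	int i;
	for( i=0; i<256; i++){
		dictionary[i].suffix = i; // suffix character
		dictionary[i].parent = -1; // parent node (none for a root)
		dictionary[i].firstchild = -1; // first child node
		dictionary[i].nextsibling = i+1; // next sibling node
	}
	dictionary[255].nextsibling = -1; // entry 255 is the last single-byte entry, so it has no next sibling
	next_code = 256; // new entries are numbered from 256
}
/* Initialise the dictionary: the entries are structs that form a tree logically and are stored in an array
* @parameters: none (operates on the global dictionary)
* @return: None
*/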


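/*
 * Input: string represented by string_code in dictionary,
 * Output: the index of string+character in the dictionary
 * 		index = -1 if not found
 */


int InDictionary( int character, int string_code){
	int sibling; // child of string_code currently being examined
	if( 0>string_code) return character; // empty string: a single character is always in the dictionary
	sibling = dictionary[string_code].firstchild;
	while( -1<sibling){
		if( character == dictionary[sibling].suffix) return sibling;
		sibling = dictionary[sibling].nextsibling;
	}
	return -1;
}
/* Check whether the current string extended by the new character is in the dictionary
* @parameters: character: the new character; string_code: code of the current string (negative if empty)
* @return: if string_code is negative, so only a single character was passed in, the code of that character
*		   if the extended string is in the dictionary, its code
*		   if it is not, -1
*/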


void AddToDictionary( int character, int string_code){
	int firstsibling, nextsibling;
	if( 0>string_code) return;
	dictionary[next_code].suffix = character;
	dictionary[next_code].parent = string_code;
	dictionary[next_code].nextsibling = -1;
	dictionary[next_code].firstchild = -1;
	firstsibling = dictionary[string_code].firstchild;
	if( -1<firstsibling){	// the parent has child
		nextsibling = firstsibling; // start from the first child and walk to the end of the sibling list
		while( -1<dictionary[nextsibling].nextsibling ) 
			nextsibling = dictionary[nextsibling].nextsibling;
		dictionary[nextsibling].nextsibling = next_code; // append the new entry as the last sibling
	}else{// no child before, modify it to be the first
		dictionary[string_code].firstchild = next_code;
	}
	next_code ++;
}
/* Add a newly formed string to the dictionary
* @parameters: character: the last character of the new string; string_code: code of its prefix
* @return: None
*/



void LZWEncode( FILE *fp, BITFILE *bf){
	int character; // the byte just read
	int string_code; // dictionary code of the string matched so far
	int index; // result of the dictionary lookup
	unsigned long file_length; // length of the input file in bytes

	fseek( fp, 0, SEEK_END); 
	file_length = ftell(fp); // seek to the end and read the position to get the file length
	fseek( fp, 0, SEEK_SET); // rewind to the start
	BitsOutput( bf, file_length, 4*8); // write the length as a 32-bit header
	InitDictionary(); // set up entries 0-255
	string_code = -1; // -1 means no string has been matched yet
	while( EOF!=(character=fgetc(fp))){ // read bytes until EOF
		index = InDictionary( character, string_code); // look up string+character
		if( 0<=index){	// string+character in dictionary
			string_code = index; // keep extending the match
		}
		else{	// string+character not in dictionary
			output( bf, string_code); // emit the matched string as a 16-bit codeword
			if( MAX_CODE > next_code){	// free space in dictionary
				// add string+character to dictionary
				AddToDictionary( character, string_code);
			}
			string_code = character; // start the next match from the current character
		}
		//PrintDictionary();
	}
	output( bf, string_code); // emit the code of the final string
}
/* Run the complete LZW encoding pass
* @parameters: fp: input file; bf: output bit-stream file
* @return: None
*/

void LZWDecode( BITFILE *bf, FILE *fp){
	int character = 0; // first character of the most recently decoded phrase
	int new_code, last_code=-1;
	int phrase_length; // number of characters currently on d_stack
	unsigned long file_length; // number of bytes still to be written

	file_length = BitsInput( bf, 4*8); // read the 32-bit header holding the original file length
	if( -1 == file_length) file_length = 0; 
	InitDictionary(); // entries 0-255 stand for the single bytes
	while (file_length > 0) {
		new_code = input(bf); // read one codeword
		if (new_code >= next_code) { // not in the dictionary yet
			d_stack[0] = character; // the missing phrase ends with the first character of the previous phrase
			phrase_length = DecodeString(1, last_code); // then push the previous phrase above it
		}
		else { // already in the dictionary
			phrase_length = DecodeString(0, new_code); // push the phrase onto d_stack
		}
		character = d_stack[phrase_length-1]; // first character of this phrase, needed for the new entry
		while (0 < phrase_length) { // pop the stack so the phrase is written in reading order
			phrase_length--;
			fputc(d_stack[phrase_length],fp); 
			file_length--;
		}
		if (MAX_CODE > next_code) { // free space in dictionary
			AddToDictionary(character, last_code); // new entry: previous phrase + first character of this phrase
		}
		last_code = new_code; // remember this code as the previous code for the next round
	}
}
/* Run the complete LZW decoding pass
* @parameters: bf: input bit-stream file; fp: output file
* @return: None
*/



int main( int argc, char **argv){
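	FILE *fp; // ordinary file: the input when encoding, the output when decoding
	BITFILE *bf; // bit-stream file: the output when encoding, the input when decoding

	if( 4>argc){ // too few arguments: print usage and exit
		fprintf( stdout, "usage: \n%s <o> <ifile> <ofile>\n", argv[0]);
		fprintf( stdout, "\t<o>: E or D refers to encode or decode\n");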
		fprintf( stdout, "\t<ifile>: input file name\n");
		fprintf( stdout, "\t<ofile>: output file name\n");
		return -1;
	}
	if( 'E' == argv[1][0]){ // do encoding 
		fp = fopen( argv[2], "rb");
		bf = OpenBitFileOutput( argv[3]);
		if( NULL!=fp && NULL!=bf){
			LZWEncode( fp, bf);
			fclose( fp);
			CloseBitFileOutput( bf);
			fprintf( stdout, "encoding done\n");
		}
	}else if( 'D' == argv[1][0]){	// do decoding
		bf = OpenBitFileInput( argv[2]);
		fp = fopen( argv[3], "wb");
		if( NULL!=fp && NULL!=bf){
			LZWDecode( bf, fp);
			fclose( fp);
			CloseBitFileInput( bf);
			fprintf( stdout, "decoding done\n");
		}
	}else{	// otherwise
		fprintf( stderr, "not supported operation\n");
	}
	return 0;
}