LZW Encoding and Decoding: Principles and Analysis

Originally published 2022-07-08 18:15:57

License: CC 4.0 BY-SA

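This article covers the LZW encoding and decoding algorithm in detail: its basic idea, the principles and procedures of encoding and decoding, and the experiment together with an analysis of its results. LZW compresses data by building a dictionary on the fly and replacing strings with code words, but its efficiency depends on how repetitive the file is, so a file with little repetition may not be compressed effectively.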

Contents

Data Compression Lab (3)
I. LZW Overview
II. LZW Encoding and Decoding Principles
Summary and Analysis

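Data Compression Lab (3)

I. LZW Overview

The second class of dictionary coding: LZW

LZW belongs to the second class of dictionary coding. Its basic idea is to try to build a "phrase dictionary" from the input data, where a phrase can be any combination of characters. During encoding, whenever the encoder meets a "phrase" that already appears in the dictionary, it outputs the "index" of that phrase in the dictionary rather than the phrase itself.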

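J. Ziv and A. Lempel first published a paper describing the second class of dictionary coding in 1978. Building on their work, Terry A. Welch published a paper improving the algorithm in 1984, which is why the method is called LZW (Lempel-Ziv-Welch) compression coding.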

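II. LZW Encoding and Decoding Principles

1. LZW encoding

(1) Principle

Codes replace phrases: LZW keeps extracting new strings from the character stream (think of them as new dictionary "entries") and represents each entry with a "code", that is, a code word. Encoding the character stream therefore amounts to replacing strings with code words. The encoder generates and outputs only the code word stream, which is how the data gets compressed.

The dictionary is built dynamically, and a new entry equals an old entry plus one new character: LZW has to build the phrase dictionary from the input data, and the encoder converts between input (phrases) and output (phrase indices) by managing this dictionary.

The dictionary must not be empty at the start. It has to contain every single character that can occur in the character stream, so that a match of length 1 can always be found.

Characters in, code words out: the input of the LZW encoder is a character stream, for example a string of 8-bit ASCII characters, and the output is a code word stream in which each code word takes n bits (for example, 12 bits).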
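To make the "n-bit code word stream" concrete, here is a minimal sketch of how 12-bit code words could be packed into bytes. The function name pack_codes12 and the most-significant-bit-first layout are assumptions made for illustration; the article's own output() routine is not shown, so this is not necessarily how it writes its codes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: pack n code words of 12 bits each into out,
 * most significant bit first. Returns the number of bytes written.
 * out must hold at least (n * 12 + 7) / 8 bytes. */
size_t pack_codes12(const uint16_t *codes, size_t n, uint8_t *out)
{
    uint32_t acc = 0;   /* bit accumulator */
    int bits = 0;       /* number of valid bits currently held in acc */
    size_t len = 0;

    for (size_t i = 0; i < n; i++) {
        acc = (acc << 12) | (codes[i] & 0x0FFFu);
        bits += 12;
        while (bits >= 8) {             /* flush every complete byte */
            bits -= 8;
            out[len++] = (uint8_t)(acc >> bits);
        }
        acc &= (1u << bits) - 1u;       /* keep only the bits not yet flushed */
    }
    if (bits > 0)                       /* pad the last partial byte with zeros */
        out[len++] = (uint8_t)(acc << (8 - bits));
    return len;
}
```

With this layout the two code words 0x041 and 0x100 occupy three bytes, 0x04 0x11 0x00, so a code word that stands for a single 8-bit character costs 12 bits. Only code words that stand for longer strings save space.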

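(2) Procedure

Step 1: Initialize the dictionary with all possible single characters, and initialize the current prefix P to empty.

Step 2: Current character C = the next character in the character stream. If the stream is exhausted, output the code word for P (if P is not empty) and stop.

Step 3: Check whether P+C is in the dictionary:

1. If yes, extend P with C, that is, set P = P+C, and return to step 2.

2. If no, then
(1) output the code word W that corresponds to the current prefix P;
(2) add P+C to the dictionary;
(3) set P = C and return to step 2.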

The flow is illustrated in the following figure:

[Figure: LZW encoding flowchart; image not included]
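The encoding steps above can be sketched in a few lines of C. This is a simplified illustration, not the lab code: it keeps the dictionary as two flat arrays searched linearly, works on an in-memory buffer, and emits code words as plain integers. The names lzw_encode, lzw_find and LZW_MAX_CODES are assumptions introduced here.

```c
#include <stddef.h>

#define LZW_MAX_CODES 4096   /* assumed dictionary capacity (12-bit code words) */

static int lzw_prefix[LZW_MAX_CODES];          /* code word of P for each entry */
static unsigned char lzw_last[LZW_MAX_CODES];  /* character C for each entry */

/* Return the code word of P+C, or -1 if P+C is not in the dictionary.
 * Code words 0..255 are the single characters; p == -1 means P is empty. */
static int lzw_find(int p, unsigned char c, int next_code)
{
    if (p < 0)
        return c;
    for (int i = 256; i < next_code; i++)
        if (lzw_prefix[i] == p && lzw_last[i] == c)
            return i;
    return -1;
}

/* Encode n bytes from in into codes and return the number of code words.
 * codes must have room for n entries. */
size_t lzw_encode(const unsigned char *in, size_t n, int *codes)
{
    int next_code = 256;   /* step 1: only single characters are present */
    int p = -1;            /* step 1: P starts out empty */
    size_t out = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned char c = in[i];             /* step 2 */
        int pc = lzw_find(p, c, next_code);  /* step 3 */
        if (pc >= 0) {
            p = pc;                          /* P = P+C */
        } else {
            codes[out++] = p;                /* output the code word for P */
            if (next_code < LZW_MAX_CODES) { /* add P+C while there is room */
                lzw_prefix[next_code] = p;
                lzw_last[next_code] = c;
                next_code++;
            }
            p = c;                           /* P = C */
        }
    }
    if (p >= 0)
        codes[out++] = p;                    /* flush the last prefix */
    return out;
}
```

For the 7-byte input "abababa" this sketch produces the four code words 97, 98, 256, 258: the entries "ab" (256), "ba" (257) and "aba" (258) are added along the way, and the final flush emits 258.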

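2. LZW decoding

(1) Principle

When decoding starts, the decoder's dictionary is identical to the encoder's initial dictionary and contains all possible roots (single characters).

New entries are generated while decoding, and again a new entry equals an old entry plus one new character.

(2) Procedure

Below, string(X) denotes the dictionary string that code word X stands for.

Step 1: At the start of decoding, the dictionary contains all possible roots.

Step 2: Set CW = the first code word in the code word stream.

Step 3: Output string(CW) to the character stream.

Step 4: Previous code word PW = current code word CW.

Step 5: Current code word CW = the next code word in the code word stream.

Step 6: Check whether CW is in the dictionary:

1. If yes, then

(1) output string(CW) to the character stream;
(2) current prefix P = string(PW);
(3) current character C = the first character of string(CW);
(4) add the string P+C to the dictionary;
(5) PW = CW.

2. If no, then

(1) current prefix P = string(PW);
(2) current character C = the first character of string(PW), because CW has no dictionary entry yet;
(3) output the string P+C to the character stream, then add it to the dictionary;
(4) PW = CW.

Step 7: Check whether any code words remain to be decoded:

1. If yes, return to step 4.
2. If no, stop.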

The procedure is given as pseudocode in the following figure:

[Figure: LZW decoding pseudocode; image not included]
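The decoding steps can be sketched the same way. The following is an illustrative in-memory version under stated assumptions: code words arrive as plain integers, the stream is well formed (no validation is done), and the names lzw_decode, dec_expand and LZW_DEC_MAX_CODES are made up for this example. Note how the "not in dictionary" branch builds P+C from string(PW) alone.

```c
#include <stddef.h>

#define LZW_DEC_MAX_CODES 4096   /* assumed capacity; must match the encoder's */

static int dec_prefix[LZW_DEC_MAX_CODES];          /* prefix code word per entry */
static unsigned char dec_last[LZW_DEC_MAX_CODES];  /* last character per entry */

/* Write string(code) to dst and return its length. */
static size_t dec_expand(int code, unsigned char *dst)
{
    size_t len = 1;
    for (int c = code; c >= 256; c = dec_prefix[c])
        len++;                          /* count characters along the chain */
    size_t i = len;
    while (code >= 256) {               /* fill from the last character backwards */
        dst[--i] = dec_last[code];
        code = dec_prefix[code];
    }
    dst[0] = (unsigned char)code;       /* the root is a single character */
    return len;
}

/* Decode n code words into out and return the number of bytes written.
 * out must be large enough for the decoded data. */
size_t lzw_decode(const int *codes, size_t n, unsigned char *out)
{
    int next_code = 256;
    size_t pos = 0;

    if (n == 0)
        return 0;
    int pw = codes[0];                              /* steps 2 to 4 */
    pos += dec_expand(pw, out + pos);
    for (size_t k = 1; k < n; k++) {
        int cw = codes[k];                          /* step 5 */
        size_t start = pos;                         /* where this output begins */
        if (cw < next_code) {                       /* step 6, case "yes" */
            pos += dec_expand(cw, out + pos);
        } else {                                    /* step 6, case "no" */
            size_t len = dec_expand(pw, out + pos); /* P = string(PW) */
            out[pos + len] = out[pos];              /* C = first character of P */
            pos += len + 1;
        }
        if (next_code < LZW_DEC_MAX_CODES) {        /* add P+C to the dictionary */
            dec_prefix[next_code] = pw;
            dec_last[next_code] = out[start];       /* C is the first output character */
            next_code++;
        }
        pw = cw;
    }
    return pos;
}
```

Feeding it the code words 97, 98, 256, 258 yields "abababa". The last code word, 258, is not yet in the dictionary when it arrives, so the decoder takes the second branch and outputs string(256) + 'a' = "aba".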

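3. Experiment

(1) Data structure

The dictionary is stored as a tree, and each node holds four fields:

Suffix character (suffix)
Parent node (parent)
First child node (firstchild)
Next sibling node (nextsibling)

The tree is held in the array dict[ ], indexed by pointer, so dict[pointer] is one node (the code listings below name this array dictionary[ ]):

dict[pointer].suffix
dict[pointer].parent
dict[pointer].firstchild
dict[pointer].nextsibling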
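The article does not show the node declaration itself, so the following is only a plausible sketch of it, together with a helper that reads a string back out of the tree. The names trie_node, trie, trie_string and the capacity TRIE_SIZE are assumptions for illustration.

```c
#include <stddef.h>

#define TRIE_SIZE 4096   /* assumed capacity; the original constant is not shown */

struct trie_node {
    int suffix;        /* last character of the string this node represents */
    int parent;        /* node for the string without its last character, -1 for a root */
    int firstchild;    /* first child node, -1 if there is none */
    int nextsibling;   /* next node with the same parent, -1 if there is none */
};

static struct trie_node trie[TRIE_SIZE];

/* Copy the string stored at node pointer into dst and return its length.
 * Following parent links yields the characters last-to-first, so they are
 * collected in a temporary buffer and then reversed. */
size_t trie_string(int pointer, char *dst)
{
    char tmp[TRIE_SIZE];
    size_t len = 0;

    for (int p = pointer; p >= 0; p = trie[p].parent)
        tmp[len++] = (char)trie[p].suffix;
    for (size_t i = 0; i < len; i++)
        dst[i] = tmp[len - 1 - i];
    return len;
}
```

For example, if node 97 ('a') has the children 256 (suffix 'b') and 257 (suffix 'c') linked through firstchild and nextsibling, then trie_string(257, buf) writes "ac". Storing only one character per node is what keeps the dictionary compact: every entry reuses its prefix through the parent link.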

(2) Main function


int main( int argc, char **argv){
	FILE *fp;
	BITFILE *bf;
	if( 4 > argc){
		fprintf( stdout, "usage: \n%s <o> <ifile> <ofile>\n", argv[0]);
		fprintf( stdout, "\t<o>: E or D refers to encode or decode\n");
		fprintf( stdout, "\t<ifile>: input file name\n");
		fprintf( stdout, "\t<ofile>: output file name\n");
		return -1;
	}
	if( 'E' == argv[1][0]){ // do encoding
		fp = fopen( argv[2], "rb");
		bf = OpenBitFileOutput( argv[3]);
		if( NULL!=fp && NULL!=bf){
			LZWEncode( fp, bf);
			fclose( fp);
			CloseBitFileOutput( bf);
			fprintf( stdout, "encoding done\n");
		}
		printf("Encode dictionary:\n");
		PrintDictionary();
	}else if( 'D' == argv[1][0]){	// do decoding
		bf = OpenBitFileInput( argv[2]);
		fp = fopen( argv[3], "wb");
		if( NULL!=fp && NULL!=bf){
			LZWDecode( bf, fp);
			fclose( fp);
			CloseBitFileInput( bf);
			fprintf( stdout, "decoding done\n");
		}
		printf("Decode dictionary:\n");
		PrintDictionary();
	}else{	// otherwise
		fprintf( stderr, "not supported operation\n");
	}
	return 0;
}

(3) Main functional modules

Initializing the dictionary

void InitDictionary( void){
	int i;
	for( i=0; i<256; i++)
	{ 
		dictionary[i].suffix = i;
		dictionary[i].parent = -1;
		dictionary[i].firstchild = -1;
		dictionary[i].nextsibling = i+1;
	}
	dictionary[255].nextsibling = -1;
	next_code = 256;
	string_code = -1;
}

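Looking up a string in the dictionary

int InDictionary( int character, int string_code)
{
	int sibling;
	if( 0>string_code) return character;	// P is empty: a single character is always in the dictionary, and its code word is its own value
	sibling = dictionary[string_code].firstchild;	// start at the first child of P
	while( -1<sibling)
	{
		if( character == dictionary[sibling].suffix)
			return sibling;	// found P+C
		sibling = dictionary[sibling].nextsibling;	// no match at this node, try the next sibling
	}
	return -1;	// P+C is not in the dictionary
}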

Adding a new string to the dictionary

void AddToDictionary( int character, int string_code)
{
	int firstsibling, nextsibling;
	if( 0>string_code) return;	// single characters are already in the dictionary
	dictionary[next_code].suffix = character;
	dictionary[next_code].parent = string_code;
	dictionary[next_code].nextsibling = -1;
	dictionary[next_code].firstchild = -1;
	firstsibling = dictionary[string_code].firstchild;
	if( -1<firstsibling)
	{	// the parent already has children: append to the end of the sibling list
		nextsibling = firstsibling;
		while( -1<dictionary[nextsibling].nextsibling )
			nextsibling = dictionary[nextsibling].nextsibling;
		dictionary[nextsibling].nextsibling = next_code;
	}else{	// no child before, modify it to be the first
		dictionary[string_code].firstchild = next_code;
	}
	next_code ++;
}

LZW encoding

void LZWEncode( FILE *fp, BITFILE *bf){
	int character;
	int string_code;
	int index;
	unsigned long file_length;

	fseek( fp, 0, SEEK_END);	// go to the end of the file
	file_length = ftell( fp);	// the position at the end is the file length
	fseek( fp, 0, SEEK_SET);	// back to the start of the file
	BitsOutput( bf, file_length, 4*8);	// write the source file length into the first 4 bytes of the output
	InitDictionary();	// initialize the dictionary
	string_code = -1;	// string_code is P, which starts out empty
	while( EOF!=(character=fgetc( fp))){	// read one character C from the file
		index = InDictionary( character, string_code);	// is P+C in the dictionary?
		if( 0<=index){	// P+C is in the dictionary at position index
			string_code = index;	// P = P+C
		}else{	// P+C is not in the dictionary
			output( bf, string_code);	// write the index of P to the output
			if( MAX_CODE > next_code){	// the dictionary still has free space
				AddToDictionary( character, string_code);	// add P+C to the dictionary
			}
			string_code = character;	// P = C
		}
	}
	if( 0<=string_code) output( bf, string_code);	// flush the last P (nothing to flush for an empty file)
}

LZW decoding

int DecodeString( int start, int code){	// expand a code word into d_stack; start: first d_stack index to write, code: the code word
	int i = start;
	int string_code = code;
	while( string_code>=0){	// walk from the node up to its root
		d_stack[i] = dictionary[string_code].suffix;	// characters arrive last-to-first
		string_code = dictionary[string_code].parent;
		i++;
	}
	return i;	// index one past the last character written (the string length when start is 0)
}


void WriteTo( char *dst, int *src, int size)	// copies src to dst in reverse order, because DecodeString fills d_stack last character first
{
	int t;
	for( t=size; t>=1; t--){
		dst[size - t] = (char) src[t-1];
	}
}

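void LZWDecode( BITFILE *bf, FILE *fp){
	unsigned long file_length;
	char *text, *start, *end;
	int cW, pW;		// current and previous code words
	int count;		// number of characters produced by one decoding step

	file_length = BitsInput( bf, 8 * 4);	// read the original file size (first 4 bytes)
	printf( "File length: %lu\r\n", file_length);
	if( 0 == file_length) return;	// empty file: nothing to decode
	text = (char *) malloc( file_length + 1);	// output buffer, sized to match the memset below
	if( NULL == text) return;
	memset( text, 0x00, file_length + 1);	// clear the output buffer
	start = text;	// start of the output buffer
	end = text + file_length;	// end of the output buffer
	InitDictionary();	// initialize the dictionary

	cW = input( bf);	// read the first code word
	*text = dictionary[cW].suffix;	// the first code word always stands for a single character
	text++;
	pW = cW;
	while( end-text>0){
		cW = input( bf);	// read the next code word
		if( cW<next_code){	// code word is in the dictionary
			count = DecodeString( 0, cW);	// d_stack = string(cW), reversed
		}else{	// code word is not in the dictionary yet
			count = DecodeString( 1, pW);	// d_stack[1..] = P = string(pW), reversed
			d_stack[0] = d_stack[count-1];	// C = first character of P, so d_stack now holds P+C
		}
		if( MAX_CODE > next_code){	// same capacity check as in the encoder
			AddToDictionary( d_stack[count-1], pW);	// add string(pW) + first character of this output
		}
		pW = cW;
		WriteTo( text, d_stack, count);	// copy the decoded string to the output buffer
		text += count;
		//while( 0<count--) printf("%c", (char)(d_stack[count]
	}
	fwrite( start, 1, file_length, fp);
	free( start);
}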

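4. Results

Encoding test

(1) Type some arbitrary text in Notepad

(2) Set the command-line arguments

(3) After a successful run, open the encoded file

Decoding test

(1) Set the command-line arguments

(2) Decoding succeeds and reproduces the original file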

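Testing 10 files of different formats and analyzing compression efficiency

The encoding tests above give the following table. The last column is the compressed size divided by the original size, so a value above 1 means the file grew:

Type	Before compression	After compression	Ratio (after/before)
tga	22 KB	39 KB	1.77272727
gif	17.2 MB	21 MB	1.22093023
txt	87 Bytes	158 Bytes	1.81609195
pdf	895 KB	1.1 MB	1.25260322
m4a	834 KB	1 MB	1.22340717
docx	14 KB	21 KB	1.50547045
mp4	515 KB	355 KB	0.68932038
png	3 MB	3.7 MB	1.22617322
xlsx	10 KB	17 KB	1.67404027
bmp	1.4 MB	1.4 MB	0.98131458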
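As a small aid for reading the table, the sketch below shows how the last column is computed and how large an output made of fixed-width code words would be. The helper names size_ratio and encoded_size are hypothetical, and the code word width is left as a parameter because the width used by output() is not shown in the article. The 4-byte header corresponds to the file length that LZWEncode writes first.

```c
/* Ratio as used in the table: compressed size divided by original size.
 * A value above 1 means the output is larger than the input. */
double size_ratio(double original, double compressed)
{
    return compressed / original;
}

/* Output size in bytes for n_codes code words of code_bits bits each,
 * plus the 4-byte length header, rounded up to whole bytes. */
unsigned long encoded_size(unsigned long n_codes, unsigned code_bits)
{
    return 4 + (n_codes * code_bits + 7) / 8;
}
```

For the tga row, size_ratio(22, 39) is about 1.7727, which matches the table. As a purely hypothetical illustration, 100 code words of 16 bits would take encoded_size(100, 16) = 204 bytes. Whenever the code words are wider than 8 bits, a file in which most code words stand for a single character comes out larger than it went in, which is consistent with the ratios above 1 for short or already compressed inputs.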