Data Compression: Implementation and Analysis of the LZW Encoding and Decoding Algorithms

倩倩_ICE_王王

Last modified: 2022-04-18 15:29:34

First published: 2022-04-18 15:18:59

Article link: https://blog.csdn.net/Bingyeshinvwang/article/details/124242848

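Overview of LZW Encoding

LZW encoding repeatedly extracts new strings from the character stream (informally, new dictionary "entries") and represents each entry with a "code name", that is, a codeword.
Encoding the character stream therefore amounts to replacing its strings with codewords, producing a codeword stream, which is how the data gets compressed.
LZW encoding is built around a translation table called the dictionary.
The LZW encoder converts input to output by maintaining this dictionary. Its input is a character stream, for example a string of 8-bit ASCII characters, and its output is a stream of n-bit codewords (for example, n = 12). The implementation in this article uses 16-bit codewords.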

Steps of the LZW Encoding Algorithm

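Step 1: Initialize the dictionary with every possible single character, and set the current prefix P to empty.
Step 2: Set the current character C to the next character in the character stream. If the stream is exhausted, output the codeword for P (if P is not empty) and stop.
Step 3: Check whether P+C is in the dictionary.
(1) If yes, extend P with C, that is, set P = P+C, and return to Step 2.
(2) If no, output the codeword W that corresponds to the current prefix P,
add P+C to the dictionary,
set P = C, and return to Step 2.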
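
The steps above can be sketched in a few lines of C++. This is only an illustration under stated assumptions, not the lzw.c implementation: the function name `lzwEncodeSketch` is hypothetical, the dictionary is a plain `std::map` keyed by string rather than a tree, and the dictionary is allowed to grow without limit.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper, not part of lzw.c: LZW-encodes `input` and returns the codewords.
// Assumption: the dictionary may grow without limit (lzw.c caps it at MAX_CODE).
std::vector<int> lzwEncodeSketch(const std::string& input) {
    // Step 1: the dictionary starts with all 256 single characters; P is empty.
    std::map<std::string, int> dict;
    for (int i = 0; i < 256; ++i) {
        dict[std::string(1, static_cast<char>(i))] = i;
    }
    int nextCode = 256;
    std::string p;
    std::vector<int> out;

    for (char c : input) {                 // Step 2: C = next character
        const std::string pc = p + c;
        if (dict.count(pc) != 0) {         // Step 3 (1): P+C is known, extend P
            p = pc;
        } else {                           // Step 3 (2): emit P, learn P+C, restart at C
            assert(dict.count(p) != 0);    // P is always a dictionary entry here
            out.push_back(dict[p]);
            dict[pc] = nextCode++;
            p = std::string(1, c);
        }
    }
    if (!p.empty()) {
        out.push_back(dict[p]);            // end of input: flush the remaining prefix
    }
    return out;
}
```

For the input `abbababac`, `lzwEncodeSketch` returns the codewords 97 98 98 256 259 99. A string-keyed map keeps the sketch short, at the cost of copying strings on every lookup.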

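A worked example

Suppose the character stream is:
a b b a b a b a c
The initial dictionary contains the 256 single-byte characters, each with a codeword equal to its byte value, so a = 97, b = 98 and c = 99.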

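Step	P	C	P+C	P+C in dictionary?	Action	Output
1	NULL	a	a	Y	Set P=a	-
2	a	b	ab	N	Add ab to the dictionary as codeword 256; set P=b	97
3	b	b	bb	N	Add bb to the dictionary as codeword 257; set P=b	98
4	b	a	ba	N	Add ba to the dictionary as codeword 258; set P=a	98
5	a	b	ab	Y	Set P=ab	-
6	ab	a	aba	N	Add aba to the dictionary as codeword 259; set P=a	256
7	a	b	ab	Y	Set P=ab	-
8	ab	a	aba	Y	Set P=aba	-
9	aba	c	abac	N	Add abac to the dictionary as codeword 260; set P=c	259
10	c	end of stream	-	-	-	99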

The resulting codeword stream is therefore:

97 98 98 256 259 99

[Figure: the corresponding encoding run]

Key LZW encoding code, with comments

void LZWEncode( FILE *fp, BITFILE *bf){
	int character;	// the newly read single character C
	int string_code;	// dictionary code of the current prefix P (-1 while P is empty)
	int index;	// dictionary index of P+C, or -1 if P+C is not in the dictionary
	unsigned long file_length;

	fseek( fp, 0, SEEK_END);   // move the file pointer to the end of fp
	file_length = ftell( fp); // the position at the end of the file is the file size in bytes
	fseek( fp, 0, SEEK_SET); // move the file pointer back to the start of fp
	BitsOutput( bf, file_length, 4*8); // write the original file length as a 32-bit header
	InitDictionary();
	string_code = -1;
	while( EOF!=(character=fgetc( fp))) // until the whole file has been read
	{ 
		index = InDictionary( character, string_code); // index of P+C
		if( 0<=index)
		{	// string+character in dictionary
			string_code = index; // in the dictionary: next P = P+C
		}
		else
		{	// string+character not in dictionary
			output( bf, string_code); // not in the dictionary: output the code for P
			if( MAX_CODE > next_code)
			{	// free space in dictionary
				// add string+character to dictionary
				AddToDictionary( character, string_code); // add P+C to the dictionary
			}
			string_code = character; // next P = C
		}
	}
	output( bf, string_code); // end of input: only P remains, with no C, so output the code for P
}

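where the dictionary is declared as follows:

struct {
	int suffix; // last character of the phrase at this index; e.g. if index 256 stands for ab, then suffix = b
	int parent, firstchild, nextsibling; // indices of this node's parent, first child, and next sibling
} dictionary[MAX_CODE+1];
int next_code; // index that the next new phrase will receive
int d_stack[MAX_CODE]; // stack for decoding a phrase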
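
To see how the `parent`, `firstchild` and `nextsibling` links represent phrases, here is a small self-contained C++ sketch of the same first-child/next-sibling layout. The names `Node`, `PhraseTrie`, `findChild`, `addChild` and `phrase` are hypothetical and chosen for illustration. As a simplification, the 256 roots are not chained together as siblings here.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// One dictionary entry: a phrase is its parent's phrase followed by `suffix`.
struct Node {
    int suffix;       // last character of the phrase
    int parent;       // index of the phrase without its last character, -1 for roots
    int firstChild;   // index of the first phrase that extends this one, -1 if none
    int nextSibling;  // index of the next phrase sharing the same parent, -1 if none
};

// Hypothetical illustration of the tree used by the dictionary.
struct PhraseTrie {
    std::vector<Node> nodes;

    PhraseTrie() {
        for (int i = 0; i < 256; ++i) {
            nodes.push_back({i, -1, -1, -1});  // roots: the 256 single characters
        }
    }

    // Returns the index of (phrase of `parent`) + ch, or -1 if it is not stored.
    int findChild(int parent, int ch) const {
        for (int s = nodes[parent].firstChild; s != -1; s = nodes[s].nextSibling) {
            if (nodes[s].suffix == ch) return s;
        }
        return -1;
    }

    // Stores (phrase of `parent`) + ch as the last child of `parent`; returns its index.
    int addChild(int parent, int ch) {
        const int code = static_cast<int>(nodes.size());
        nodes.push_back({ch, parent, -1, -1});
        int s = nodes[parent].firstChild;
        if (s == -1) {
            nodes[parent].firstChild = code;
        } else {
            while (nodes[s].nextSibling != -1) s = nodes[s].nextSibling;
            nodes[s].nextSibling = code;
        }
        return code;
    }

    // Rebuilds the phrase by following parent links, then reversing.
    std::string phrase(int code) const {
        assert(code >= 0 && code < static_cast<int>(nodes.size()));
        std::string s;
        for (int c = code; c != -1; c = nodes[c].parent) {
            s.push_back(static_cast<char>(nodes[c].suffix));
        }
        std::reverse(s.begin(), s.end());
        return s;
    }
};
```

For example, after `addChild('a', 'b')` and `addChild('a', 'c')` on a fresh `PhraseTrie`, the root for a has the entry for ab as its first child, and that entry's next sibling is the entry for ac. Each entry costs only four integers, regardless of how long its phrase is.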

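LZW Decoding Steps

Step 1: At the start of decoding, the dictionary contains all possible roots (the single characters).
Step 2: Set CW to the first codeword in the codeword stream.
Step 3: Output string.CW, the string that CW stands for, to the character stream.
Step 4: Set the previous codeword PW to the current codeword CW.
Step 5: Set the current codeword CW to the next codeword in the codeword stream.
Step 6: Check whether CW is in the dictionary.
(1) If yes, output string.CW to the character stream.
Set the current prefix P to string.PW.
Set the current character C to the first character of string.CW.
Add the string P+C to the dictionary.
(2) If no, set the current prefix P to string.PW.
Set the current character C to the first character of string.PW.
Output the string P+C to the character stream and add it to the dictionary, where it becomes the entry for CW.
Step 7: Check whether any codewords remain to be decoded.
(1) If yes, return to Step 4.
(2) If no, stop.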
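
As a rough illustration of these decoding steps, the following C++ sketch decodes a vector of codewords. It is not the lzw.c implementation: `lzwDecodeSketch` is a hypothetical name, the dictionary is a `std::vector<std::string>` indexed by codeword, the stream is assumed to be well formed, and the dictionary is allowed to grow without limit.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper, not part of lzw.c: decodes a vector of LZW codewords.
// Assumptions: the stream is well formed and the dictionary may grow without limit.
std::string lzwDecodeSketch(const std::vector<int>& codes) {
    // Step 1: the dictionary starts with the 256 single-character roots.
    std::vector<std::string> dict;
    for (int i = 0; i < 256; ++i) {
        dict.push_back(std::string(1, static_cast<char>(i)));
    }
    std::string out;
    std::string prev;                        // string.PW, empty before the first codeword
    for (int cw : codes) {
        assert(cw >= 0);
        std::string cur;                     // string.CW
        if (cw < static_cast<int>(dict.size())) {
            cur = dict[cw];                  // Step 6 (1): CW is in the dictionary
        } else {
            // Step 6 (2): CW is the entry about to be created,
            // so its string must be P followed by the first character of P.
            assert(!prev.empty() && cw == static_cast<int>(dict.size()));
            cur = prev + prev[0];
        }
        out += cur;
        if (!prev.empty()) {
            dict.push_back(prev + cur[0]);   // add P+C, C = first character of string.CW
        }
        prev = cur;                          // Step 4: PW = CW
    }
    return out;
}
```

Calling `lzwDecodeSketch({97, 98, 98, 256, 259, 99})` returns `abbababac`. Both branches of Step 6 add the same string P+C to the dictionary, which is why the sketch can share a single `push_back`.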

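A worked example

The received codeword stream is 97 98 98 256 259 99, and the initial dictionary again contains the 256 single-byte characters (a = 97, b = 98, c = 99):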

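Step	PW	CW	CW in dictionary?	Action	Output
1	NULL	97	Y	pw=cw=97	a
2	97	98	Y	P=a; C=b; add P+C=ab to the dictionary as codeword 256; pw=cw=98	b
3	98	98	Y	P=b; C=b; add P+C=bb to the dictionary as codeword 257; pw=cw=98	b
4	98	256	Y	P=b; C=a; add P+C=ba to the dictionary as codeword 258; pw=cw=256	ab
5	256	259	N	P=ab; C=a (the first character of pw's string); add P+C=aba to the dictionary as codeword 259; pw=cw=259	aba
6	259	99	Y	P=aba; C=c; add P+C=abac to the dictionary as codeword 260; pw=cw=99	c
7	99	-	-	No codewords remain; decoding ends	-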

The decoded character stream is therefore:
a b b a b a b a c

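When cw has no entry in the dictionary

Note that the encoder can use a codeword immediately after creating it (in the example, aba = 259 is emitted right after it is added), so the decoder may receive a cw that is not yet in its dictionary. In that case the phrase for cw must be pw's string followed by the first character of pw's string: set C to the first character of pw's string, output P+C, and add P+C to the dictionary, where it takes the codeword value cw.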
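
One way to see when this case arises is to track how many entries the decoder has at each point. The sketch below is a hypothetical helper (`undefinedCodePositions` is not part of lzw.c) and assumes the dictionary never fills up: before reading the codeword at position i (counting from 0, with i at least 1) the decoder has added i - 1 phrases, so its next free codeword is 256 + (i - 1). A received codeword equal to that value is exactly the "not yet in the dictionary" case.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the positions in `codes` at which the decoder
// receives a codeword it has not defined yet.
// Assumption: the dictionary never reaches its size limit, so the decoder adds
// exactly one phrase per codeword after the first.
std::vector<std::size_t> undefinedCodePositions(const std::vector<int>& codes) {
    std::vector<std::size_t> hits;
    for (std::size_t i = 1; i < codes.size(); ++i) {
        // Phrases added so far: i - 1, so the next free codeword is 256 + (i - 1).
        const int nextFree = 256 + static_cast<int>(i) - 1;
        assert(codes[i] <= nextFree);  // a well-formed stream never skips ahead
        if (codes[i] == nextFree) {
            hits.push_back(i);
        }
    }
    return hits;
}
```

For the stream 97 98 98 256 259 99 the only hit is position 4 (the codeword 259). For the stream 97 256, which encodes `aaa`, the hit is at position 1.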

Key LZW decoding code, with comments

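void LZWDecode( BITFILE *bf, FILE *fp){
	int character = 0; // first character of the most recently decoded phrase
	int new_code, last_code; // new_code is the newly read cw; last_code is pw
	int phrase_length; // number of characters in the phrase being decoded
	unsigned long file_length;
	file_length = BitsInput(bf, 4 * 8); // length of the original character stream, read from the 32-bit header
	if (-1 == file_length) file_length = 0;
	InitDictionary();
	last_code = -1; // there is no pw before the first codeword, so use -1
	while (0 < file_length) // until every character has been decoded
	{
		new_code = input(bf);
		if (new_code >= next_code) // is cw missing from the dictionary?
		{  
			// cw is not in the dictionary yet, so its phrase must be pw followed by the first character of pw
			// (the encoder used the codeword immediately after creating it).
			// The stack holds phrases in reverse, so that final character goes in slot 0.
			d_stack[0] = character;
			phrase_length = DecodeString(1, last_code); // push pw after it; returns the length of pw plus one
		}
		else  
		{ // cw is in the dictionary: decode it directly and get its length
			phrase_length = DecodeString(0, new_code);
		}
		character = d_stack[phrase_length - 1]; // the stack holds the phrase in reverse, so the top is the first character of cw
		while (0 < phrase_length)  // pop the decoded characters in forward order
		{
			phrase_length--;
			fputc(d_stack[phrase_length], fp); // write the decoded character to fp
			file_length--;
		}
		if (MAX_CODE > next_code) 
		{ // add the new phrase to dictionary 
			AddToDictionary(character, last_code); // adds pw + first character of cw (does nothing while pw is -1)
		}
		last_code = new_code; // next pw = cw
	}
}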

Creating a file to test LZW encoding

#include <iostream>
#include <cstdio>
using namespace std;
#define MAX_CODE 65535

int main()
{
	FILE* doc = NULL;
	if (fopen_s(&doc, "F:\\大三下资料\\数据压缩\\shiyan3\\test\\CREAT.doc", "wb") != 0)
	{
		cout << "Failed to open the doc file!" << endl;
		return 1; // nothing to write to, so stop here
	}
	cout << "Successfully opened the doc file!" << endl;

	int n;
	cout << "Enter the value of n:" << endl;
	cin >> n;
	if (!cin || n <= 0 || n >= MAX_CODE) // reject invalid or oversized counts before allocating
	{
		fclose(doc);
		return 0;
	}
	unsigned char* a_buffer = new unsigned char[n];
	for (int i = 0; i < n; i++)
	{
		cin >> a_buffer[i]; // read n characters (operator>> skips whitespace)
	}
	fwrite(a_buffer, sizeof(unsigned char), n, doc);
	delete[] a_buffer;
	fclose(doc); // flush and close so the encoder sees the complete file
	return 0;
}

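The first 18 characters were written to the file.
The file produced by the encoder was then used as the input file for decoding.

After decoding, the file is identical to the original.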

Testing and analyzing compression efficiency on different files

Ten file formats were selected and compressed.
The results are as follows:
File type	Size before compression a	Size after compression b	Size reduction (a-b)/a
txt	1 KB	1 KB	0%
doc	41 KB	13 KB	68.3%
jpg	51 KB	77 KB	-51.0%
mp3	413 KB	109 KB	73.6%
xlsx	10 KB	17 KB	-70.0%
pdf	51 KB	79 KB	-54.9%
rgb	192 KB	179 KB	6.8%
jfif	295 KB	370 KB	-25.4%
wav	3929 KB	4615 KB	-17.5%
png	164 KB	217 KB	-32.3%
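
The last column can be reproduced with a one-line calculation. The helper below is a hypothetical illustration (`sizeReductionPercent` is not part of the original code). It computes (a-b)/a as a percentage and rounds to one decimal place, so a negative value means the output is larger than the input.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: size reduction (a - b) / a in percent, rounded to one decimal.
// A negative result means the "compressed" file is larger than the original.
double sizeReductionPercent(double before, double after) {
    assert(before > 0.0);
    const double percent = (before - after) / before * 100.0;
    return std::round(percent * 10.0) / 10.0;
}
```

For the doc row, `sizeReductionPercent(41, 13)` gives 68.3, and for the jpg row `sizeReductionPercent(51, 77)` gives -51.0.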

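Summary

LZW compression does not always make a file smaller. It only gains when the input contains strings that repeat, so that one codeword can stand for several bytes. This implementation writes a fixed 16-bit codeword for every phrase, so whenever the phrases average fewer than two bytes the output grows.
Most of the files tested became larger after compression, and only a few became smaller. Formats such as jpg, jfif, png and xlsx are already compressed, and pdf files usually are too, which leaves few repeated strings for LZW to exploit. The MP3 file showed the largest size reduction, which is attributed here to the selected clip having a similar melody that keeps repeating.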
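
To make the size argument concrete, here is a rough model of the output size. It is a sketch under stated assumptions rather than a measurement of lzw.c: `lzwCodeCount` and `encodedSizeBytes` are hypothetical names, the dictionary is a `std::map` keyed by (prefix code, next byte), and the output is assumed to be exactly a 4-byte length header plus 2 bytes per codeword, as suggested by the `BitsOutput( bf, file_length, 4*8)` call and the 16-bit `output` macro.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical helper: counts the codewords LZW would emit for `data`.
// The dictionary stops growing at 65535 entries, mirroring MAX_CODE in lzw.c.
std::size_t lzwCodeCount(const std::vector<unsigned char>& data) {
    const int kMaxCode = 65535;
    std::map<std::pair<int, int>, int> dict;  // (prefix code, next byte) -> code
    int nextCode = 256;
    int prefix = -1;                          // -1 means the prefix P is empty
    std::size_t count = 0;
    for (unsigned char c : data) {
        if (prefix < 0) {                     // a single byte is always in the dictionary
            prefix = c;
            continue;
        }
        const std::pair<int, int> key(prefix, c);
        const auto it = dict.find(key);
        if (it != dict.end()) {
            prefix = it->second;              // P+C is known: extend P
        } else {
            ++count;                          // emit the code for P
            if (nextCode < kMaxCode) {
                dict[key] = nextCode++;       // learn P+C while there is room
            }
            prefix = c;                       // restart at C
        }
    }
    if (prefix >= 0) {
        ++count;                              // flush the last prefix
    }
    return count;
}

// Assumed output layout: 4-byte length header plus one 16-bit codeword per phrase.
std::size_t encodedSizeBytes(std::size_t codeCount) {
    return 4 + 2 * codeCount;
}
```

Under this model, an input of 256 distinct byte values yields 256 codewords and 516 output bytes, roughly twice the input, while a long run of a single repeated byte shrinks to a small fraction of its size. The nine-character example `abbababac` produces six codewords, so it would occupy 16 bytes.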

Appendix

lzw.c source code

#include <stdlib.h>
#include <stdio.h>
#include "bitio.h"
#define MAX_CODE 65535

struct {
	int suffix; // last character of the phrase at this index; e.g. if index 256 stands for ab, then suffix = b
	int parent, firstchild, nextsibling; // indices of this node's parent, first child, and next sibling
} dictionary[MAX_CODE+1];
int next_code; // index that the next new phrase will receive
int d_stack[MAX_CODE]; // stack for decoding a phrase 

#define input(f) ((int)BitsInput( f, 16))
#define output(f, x) BitsOutput( f, (unsigned long)(x), 16)

int DecodeString( int start, int code); 
void InitDictionary( void);  // initialize the dictionary
void PrintDictionary( void){  // print the dictionary
	int n;
	int count;
	for( n=256; n<next_code; n++){ // visit every phrase added after the initial entries 0~255, from index 256 up to next_code
		count = DecodeString( 0, n); // number of characters in the phrase at this index; e.g. count = 2 if n = 256 stands for ab
		printf( "%4d->", n); // current index
		while( 0<count--) printf("%c", (char)(d_stack[count])); // print the phrase for index n, e.g. ab when n = 256
		printf( "\n");											// note: count runs from high to low, so d_stack holds the phrase in reverse (last character first)
	}
}

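int DecodeString( int start, int code){
	// Walk from node `code` up to its root, pushing each suffix onto d_stack starting at slot `start`.
	// Returns the stack height, that is, start plus the phrase length.
	int count=start;
	for (; code >= 0;) // until we pass the root
	{
		d_stack[count] = dictionary[code].suffix; // push the last character of the phrase at this index
		code = dictionary[code].parent; // move up to the parent's index
		count++; // one more character on the stack
	}
	return count;
	
}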
void InitDictionary( void){
	int i;

	for( i=0; i<256; i++){  // indices 0~255
		dictionary[i].suffix = i; // each root stands for the single character i
		dictionary[i].parent = -1; // roots have no parent, so use -1
		dictionary[i].firstchild = -1; // no children yet, so use -1
		dictionary[i].nextsibling = i+1; // the next sibling of a root is the next index
	}
	dictionary[255].nextsibling = -1; // index 255 is the last root and has no next sibling
	next_code = 256; // the first new phrase will get index 256
	
}
/*
 * Input: string represented by string_code in dictionary,
 * Output: the index of string+character in the dictionary
 * 		index = -1 if not found
 */
int InDictionary( int character, int string_code){ // look up string+character in the dictionary
	int sibling;
	if( 0>string_code) return character; // empty prefix: a single character is its own code
	sibling = dictionary[string_code].firstchild; // start with the prefix's first child
	while( -1<sibling){ // while there is a child left to check
		if( character == dictionary[sibling].suffix) return sibling; // this child's suffix is character, so the string is in the dictionary; return its index
		sibling = dictionary[sibling].nextsibling; // otherwise move on to the next sibling
	}
	return -1; // not found
}

void AddToDictionary( int character, int string_code){ // add string+character to the dictionary
	int firstsibling, nextsibling;  // used to walk the sibling chain
	if( 0>string_code) return; // empty prefix: single characters are already in the dictionary
	dictionary[next_code].suffix = character; // the new node's character
	dictionary[next_code].parent = string_code; // the new node's parent is string_code
	dictionary[next_code].nextsibling = -1; // a new node has no next sibling yet
	dictionary[next_code].firstchild = -1; // and no children
	firstsibling = dictionary[string_code].firstchild; // the parent's first child
	if( -1<firstsibling){	// the parent has child, so the new node is not its first child
		nextsibling = firstsibling; // start from the parent's first child
		while( -1<dictionary[nextsibling].nextsibling ) // walk the sibling chain to the parent's last child
			nextsibling = dictionary[nextsibling].nextsibling; // advance to the next sibling
		dictionary[nextsibling].nextsibling = next_code; // append the new node after the last child
	}else{// no child before, modify it to be the first
		dictionary[string_code].firstchild = next_code; // the parent had no children, so the new node
														// becomes its first child
	}
	next_code ++;  // advance next_code
}

void LZWEncode( FILE *fp, BITFILE *bf){
	int character;	// the newly read single character C
	int string_code;	// dictionary code of the current prefix P (-1 while P is empty)
	int index;	// dictionary index of P+C, or -1 if P+C is not in the dictionary
	unsigned long file_length;

	fseek( fp, 0, SEEK_END);   // move the file pointer to the end of fp
	file_length = ftell( fp); // the position at the end of the file is the file size in bytes
	fseek( fp, 0, SEEK_SET); // move the file pointer back to the start of fp
	BitsOutput( bf, file_length, 4*8); // write the original file length as a 32-bit header
	InitDictionary();
	string_code = -1;
	while( EOF!=(character=fgetc( fp))) // until the whole file has been read
	{ 
		index = InDictionary( character, string_code); // index of P+C
		if( 0<=index)
		{	// string+character in dictionary
			string_code = index; // in the dictionary: next P = P+C
		}
		else
		{	// string+character not in dictionary
			output( bf, string_code); // not in the dictionary: output the code for P
			if( MAX_CODE > next_code)
			{	// free space in dictionary
				// add string+character to dictionary
				AddToDictionary( character, string_code); // add P+C to the dictionary
			}
			string_code = character; // next P = C
		}
	}
	output( bf, string_code); // end of input: only P remains, with no C, so output the code for P
}

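void LZWDecode( BITFILE *bf, FILE *fp){
	int character = 0; // first character of the most recently decoded phrase
	int new_code, last_code; // new_code is the newly read cw; last_code is pw
	int phrase_length; // number of characters in the phrase being decoded
	unsigned long file_length;
	file_length = BitsInput(bf, 4 * 8); // length of the original character stream, read from the 32-bit header
	if (-1 == file_length) file_length = 0;
	InitDictionary();
	last_code = -1; // there is no pw before the first codeword, so use -1
	while (0 < file_length) // until every character has been decoded
	{
		new_code = input(bf);
		if (new_code >= next_code) // is cw missing from the dictionary?
		{  
			// cw is not in the dictionary yet, so its phrase must be pw followed by the first character of pw
			// (the encoder used the codeword immediately after creating it).
			// The stack holds phrases in reverse, so that final character goes in slot 0.
			d_stack[0] = character;
			phrase_length = DecodeString(1, last_code); // push pw after it; returns the length of pw plus one
		}
		else  
		{ // cw is in the dictionary: decode it directly and get its length
			phrase_length = DecodeString(0, new_code);
		}
		character = d_stack[phrase_length - 1]; // the stack holds the phrase in reverse, so the top is the first character of cw
		while (0 < phrase_length)  // pop the decoded characters in forward order
		{
			phrase_length--;
			fputc(d_stack[phrase_length], fp); // write the decoded character to fp
			file_length--;
		}
		if (MAX_CODE > next_code) 
		{ // add the new phrase to dictionary 
			AddToDictionary(character, last_code); // adds pw + first character of cw (does nothing while pw is -1)
		}
		last_code = new_code; // next pw = cw
	}
}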



int main( int argc, char **argv){
	FILE *fp;
	BITFILE *bf;

	if( 4>argc){
		fprintf( stdout, "usage: \n%s <o> <ifile> <ofile>\n", argv[0]);
		fprintf( stdout, "\t<o>: E or D selects encode or decode\n");
		fprintf( stdout, "\t<ifile>: input file name\n");
		fprintf( stdout, "\t<ofile>: output file name\n");
		return -1;
	}
	
	if( 'E' == argv[1][0]){ // do encoding
		fp = fopen( argv[2], "rb");
		bf = OpenBitFileOutput( argv[3]);
		if( NULL!=fp && NULL!=bf){
			LZWEncode( fp, bf);
			fclose( fp);
			CloseBitFileOutput( bf);
			fprintf( stdout, "encoding done\n");
		}
	}else if( 'D' == argv[1][0]){	// do decoding
		bf = OpenBitFileInput( argv[2]);
		fp = fopen( argv[3], "wb");
		if( NULL!=fp && NULL!=bf){
			LZWDecode( bf, fp);
			fclose( fp);
			CloseBitFileInput( bf);
			fprintf( stdout, "decoding done\n");
		}
	}else{	// otherwise
		fprintf( stderr, "not supported operation\n");
	}
	return 0;
}