LZW Data Compression

Reposted from:

http://www.cs.duke.edu/csed/curious/compression/lzw.html#tips

Overview

If you were to look at almost any data file on a computer, character by character, you would notice many recurring patterns. LZW is a data compression method that takes advantage of this repetition. The original version of the method was created by Lempel and Ziv in 1978 (LZ78) and was further refined by Welch in 1984, hence the LZW acronym. Like any adaptive/dynamic compression method, the idea is to (1) start with an initial model, (2) read data piece by piece, and (3) update the model and encode the data as you go along.

LZW is a "dictionary"-based compression algorithm. This means that instead of tabulating character counts and building trees (as in Huffman encoding), LZW encodes data by referencing a dictionary. Thus, to encode a substring, only a single code number, corresponding to that substring's index in the dictionary, needs to be written to the output file. Although LZW is often explained in the context of compressing text files, it can be used on any type of file. However, it generally performs best on files with repeated substrings, such as text files.


Compression

LZW starts out with a dictionary of 256 entries, one for each possible 8-bit character, and uses those as the "standard" character set. It then reads data 8 bits at a time (e.g., 't', 'r', etc.) and encodes the data as the number that represents its index in the dictionary. Every time it comes across a new substring (say, "tr"), it adds it to the dictionary; every time it comes across a substring it has already seen, it just reads in a new character and concatenates it with the current string to get a new substring. The next time LZW revisits a substring, it will be encoded using a single number. Usually a maximum number of entries (say, 4096) is defined for the dictionary, so that the process doesn't run away with memory. Thus, the codes that take the place of the substrings in this example are 12 bits long (2^12 = 4096). It is necessary for the codes to be longer in bits than the characters (12 vs. 8 bits), but since many frequently occurring substrings will be replaced by a single code, compression is achieved in the long run.
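The 12-bit versus 8-bit trade-off can be checked with a little arithmetic. The sketch below is only an illustration: the helper names `code_width` and `bits_saved` are made up for this example, and it assumes fixed-width codes and 8-bit input characters, as in the paragraph above.

```python
def code_width(max_entries: int) -> int:
    """Number of bits needed to address max_entries dictionary slots."""
    return max(1, (max_entries - 1).bit_length())


def bits_saved(substring_len: int, char_bits: int = 8, max_entries: int = 4096) -> int:
    """Bits saved (negative means lost) when one code replaces a substring."""
    return substring_len * char_bits - code_width(max_entries)


print(code_width(4096))  # 12
print(bits_saved(1))     # -4: a lone character costs 4 extra bits
print(bits_saved(2))     # 4:  a two-character substring already saves bits
print(bits_saved(3))     # 12
```

Under these assumptions, every code that stands for two or more characters is a net saving, which is why the method pays off once the dictionary has collected some repeated substrings.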

Here's what it might look like in pseudocode:

string s;
char ch;
...

s = empty string;
while (there is still data to be read)
{
    ch = read a character;
    if (dictionary contains s+ch)
    {
        s = s+ch;
    }
    else
    {
        encode s to output file;
        add s+ch to dictionary;
        s = ch;
    }
}
encode s to output file;

Now, suppose the input stream we wish to compress is "banana_bandana", and that, to keep the example small, the initial dictionary contains only the five characters that occur in it:

Index   Entry
  0       a
  1       b
  2       d
  3       n
  4       _ (space)

The encoding steps would proceed like this:

Input           Current String  Seen this Before?  Encoded Output       New Dictionary Entry/Index
b               b               yes                nothing              none
ba              ba              no                 1                    ba / 5
ban             an              no                 1,0                  an / 6
bana            na              no                 1,0,3                na / 7
banan           an              yes                no change            none
banana          ana             no                 1,0,3,6              ana / 8
banana_         a_              no                 1,0,3,6,0            a_ / 9
banana_b        _b              no                 1,0,3,6,0,4          _b / 10
banana_ba       ba              yes                no change            none
banana_ban      ban             no                 1,0,3,6,0,4,5        ban / 11
banana_band     nd              no                 1,0,3,6,0,4,5,3      nd / 12
banana_banda    da              no                 1,0,3,6,0,4,5,3,2    da / 13
banana_bandan   an              yes                no change            none
banana_bandana  ana             yes                1,0,3,6,0,4,5,3,2,8  none

Notice that after the last character, "a", is read, the final substring, "ana", must still be output. This is the last "encode s" step that follows the loop in the pseudocode.
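The pseudocode and the trace above can be reproduced with a short Python sketch. It is a minimal illustration rather than a production encoder. The function name `lzw_encode` and the `alphabet` parameter are choices made for this example. The sketch assumes every input character appears in `alphabet`, and it does not cap the dictionary size or pack the codes into bits.

```python
def lzw_encode(data: str, alphabet: str) -> list[int]:
    """Encode data with LZW, starting from one dictionary entry per alphabet character."""
    # Initial dictionary: each single character maps to its index in the alphabet.
    dictionary = {ch: i for i, ch in enumerate(alphabet)}
    codes = []
    s = ""
    for ch in data:
        if s + ch in dictionary:
            # Seen before: keep extending the current string.
            s += ch
        else:
            # New substring: emit the code for s, remember s+ch, restart from ch.
            codes.append(dictionary[s])
            dictionary[s + ch] = len(dictionary)
            s = ch
    if s:
        # Flush whatever is left after the last character has been read.
        codes.append(dictionary[s])
    return codes


print(lzw_encode("banana_bandana", "abdn_"))
# [1, 0, 3, 6, 0, 4, 5, 3, 2, 8]
```

The printed list matches the last row of the Encoded Output column in the table.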


Uncompression

The uncompression process for LZW is also straightforward. In addition, it has an advantage over static compression methods: no dictionary or other overhead information has to be stored with the compressed data, because a dictionary identical to the one created during compression is reconstructed during decoding. Both the encoding and decoding programs must start with the same initial dictionary, which in the general case is all 256 possible 8-bit character values.

Here's how it works. The LZW decoder first reads in an index (an integer), looks up the index in the dictionary, and outputs the substring associated with that index. The first character of this substring is concatenated to the current working string (the substring decoded on the previous pass). This new concatenation is added to the dictionary, mirroring how the substrings were added during compression. The decoded string then becomes the current working string (i.e., the current index, and thus its substring, is remembered), and the process repeats.

Again, here's what it might look like:

string entry;
char ch;
int prevcode, currcode;
...

prevcode = read in a code;
decode/output prevcode;
while (there is still data to read)
{
    currcode = read in a code;
    entry = translation of currcode from dictionary;
    output entry;
    ch = first char of entry;
    add ((translation of prevcode)+ch) to dictionary;
    prevcode = currcode;
}
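As a rough Python counterpart to this pseudocode, the sketch below decodes the code sequence produced in the "banana_bandana" example. The name `lzw_decode_basic` and the `alphabet` parameter are assumptions made for illustration. Like the pseudocode, it assumes every incoming code is already in the dictionary, and it raises a `KeyError` if that is not the case.

```python
def lzw_decode_basic(codes: list[int], alphabet: str) -> str:
    """Decode LZW codes, assuming each code is already present in the dictionary."""
    # Same initial dictionary as the encoder, but mapping index -> string.
    dictionary = {i: ch for i, ch in enumerate(alphabet)}
    if not codes:
        return ""
    prevcode = codes[0]
    output = [dictionary[prevcode]]
    for currcode in codes[1:]:
        entry = dictionary[currcode]  # KeyError if the code is not defined yet
        output.append(entry)
        # Rebuild the entry the encoder added: previous string + first char of this one.
        dictionary[len(dictionary)] = dictionary[prevcode] + entry[0]
        prevcode = currcode
    return "".join(output)


print(lzw_decode_basic([1, 0, 3, 6, 0, 4, 5, 3, 2, 8], "abdn_"))
# banana_bandana
```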

There is one exception where the algorithm as written fails: when the decoder receives a code whose index has not yet been entered into its dictionary (e.g., it is asked for index 31 while entry 31 is still being constructed and is therefore not in the dictionary yet). An example from Sayood will help illustrate this point. Suppose you had the string abababab... and an initial dictionary of just a and b, with indexes 0 and 1, respectively. The encoding process begins:

Input     Current String  Seen this Before?  Encoded Output  New Dictionary Entry/Index
a         a               yes                nothing         none
ab        ab              no                 0               ab / 2
aba       ba              no                 0,1             ba / 3
abab      ab              yes                no change       none
ababa     aba             no                 0,1,2           aba / 4
ababab    ab              yes                no change       none
abababa   aba             yes                no change       none
abababab  abab            no                 0,1,2,4         abab / 5
...       ...             ...                ...             ...

So, the encoded output starts out 0,1,2,4,... When we try to decode it, a problem arises. In the table below, keep in mind that the Current String is just the substring that was decoded/translated in the previous pass of the loop, and that the New Dictionary Entry is created by concatenating the Current String with the first character of the new Dictionary Translation:

(In other words: New Dictionary Entry = Current String + first character of the Dictionary Translation.)

Encoded Input  Dictionary Translation  Decoded Output  Current String  New Dictionary Entry / Index
0              0 = a                   a               none            none
0,1            1 = b                   ab              a               ab / 2
0,1,2          2 = ab                  abab            b               ba / 3
0,1,2,4        4 = ???                 abab???         ab              ???

As you can see, the decoder comes across an index of 4 while the entry that belongs there is still being constructed. To understand why this happens, take a look at the encoding table (two tables above). Immediately after "aba" (with an index of 4) is entered into the dictionary, the next substring that is encoded is an "aba" (i.e., the very next code written to the encoded output file is a 4). Thus, this special case can only occur if the substring begins and ends with the same character ("aba" is of the form <char><string><char>). So, to deal with this exception, you simply take the substring you have so far, "ab", and concatenate its first character to itself, "ab"+"a" = "aba", instead of following the normal procedure. Therefore the pseudocode provided above must be altered a bit in order to handle all cases.
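One way to fold this exception into the decoder is sketched below in Python. This is an illustration under stated assumptions, not the original article's code. The name `lzw_decode` and the `alphabet` parameter are made up for the example. A missing code is accepted only when it equals the next free dictionary index, and any other unknown code is treated as corrupt input.

```python
def lzw_decode(codes: list[int], alphabet: str) -> str:
    """Decode LZW codes, handling a code that is not yet in the dictionary."""
    dictionary = {i: ch for i, ch in enumerate(alphabet)}
    if not codes:
        return ""
    prev_entry = dictionary[codes[0]]
    output = [prev_entry]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == len(dictionary):
            # Special case: the encoder used the entry it had only just created,
            # so it must be the previous string plus its own first character.
            entry = prev_entry + prev_entry[0]
        else:
            raise ValueError(f"invalid LZW code: {code}")
        output.append(entry)
        # Same dictionary update as before: previous string + first char of this one.
        dictionary[len(dictionary)] = prev_entry + entry[0]
        prev_entry = entry
    return "".join(output)


print(lzw_decode([0, 1, 2, 4], "ab"))
# abababa
```

With the codes 0,1,2,4 from the example, code 4 is resolved as "ab"+"a" = "aba", so the decoder recovers the first seven characters of the abab... input.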


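(Personal understanding: what the passage above is saying is that there is an exceptional case. Take a substring of the form "aba", i.e., one whose first and last characters are the same. If a substring of this form has just been entered into the encoder's dictionary and the very next substring to be encoded happens to be "aba" again, the situation above arises: the decoder knows the code is 4 but cannot find a substring for 4 in its table. In that case, you simply take the substring you have so far, "ab", and concatenate its first character to itself, "ab"+"a" = "aba". Concretely: whenever the substring for a code cannot be found, you can assume it is this "aba" situation and take the substring to be the previous substring plus the first character of the previous substring.)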


Reference code:

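// Note: ReadCode, RaiseEvent and RaiseFinishEvent are defined elsewhere
// in the class and are not shown in this post.
public override void Decompress(BinaryReader reader, BinaryWriter writer)
{
    // Initial dictionary: one single-character entry per byte value (0-255).
    List<string> list = new List<string>();
    for (int i = 0; i < 256; i++)
    {
        list.Add(((char)i).ToString());
    }
    // The first code always refers to a single character.
    byte firstByte = (byte)ReadCode(reader);
    string match = ((char)firstByte).ToString();
    writer.Write(firstByte);
    int lastPercent = 0;
    while (reader.BaseStream.Position < reader.BaseStream.Length)
    {
        lastPercent = RaiseEvent(reader, lastPercent);
        int nextCode = ReadCode(reader);
        string nextMatch = null;
        // Note: list.Count keeps changing as entries are added on each pass.
        if (nextCode < list.Count)
        {
            nextMatch = list[nextCode];
        }
        else
        {
            // Special case: the code is not in the dictionary yet, so the entry
            // must be the previous match plus its own first character.
            nextMatch = match + match[0];
        }

        foreach (char c in nextMatch)
        {
            writer.Write((byte)c);
        }
        // Rebuild the entry the encoder added at this step.
        list.Add(match + nextMatch[0]);
        match = nextMatch;
    }
    RaiseFinishEvent();
}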