Data Compression
Coding and Decoding
Coding is a rule that assigns exactly one codeword to each source symbol.
binary coding
if every codeword is a string over a two-symbol alphabet (usually ‘0’ and ‘1’).
unique decoding
is possible only when any two distinct source messages are encoded as distinct code strings.
block coding
uses pairwise distinct codewords that all have the same length n.
e.g., hexadecimal code, even parity code, ASCII code, etc.
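instantaneous code
no codeword is a prefix of another codeword, so each codeword can be decoded as soon as its last symbol arrives
every instantaneous code is uniquely decodable, but not all uniquely decodable codes are instantaneous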
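The prefix condition can be checked mechanically. The following is a minimal sketch; the function names and the example codes are illustrative assumptions, not part of the notes. The code {0, 01, 011} is uniquely decodable (every codeword contains exactly one ‘0’, at its start, so each ‘0’ marks a new codeword) but it is not instantaneous, because the decoder must look ahead.

```python
def is_prefix_free(codewords):
    """Return True if no codeword is a prefix of another (instantaneous code)."""
    words = sorted(codewords)
    # After sorting, a codeword that is a prefix of any other codeword
    # is also a prefix of its immediate successor.
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))


def decode_prefix_code(bits, code):
    """Decode a bit string symbol by symbol; `code` maps symbol -> codeword."""
    inverse = {word: sym for sym, word in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:  # a complete codeword has been read
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("input ends in the middle of a codeword")
    return "".join(out)


# Hypothetical example codes.
prefix_code = {"A": "0", "B": "10", "C": "110", "D": "111"}
print(is_prefix_free(prefix_code.values()))           # True
print(is_prefix_free(["0", "01", "011"]))             # False
print(decode_prefix_code("0101100111", prefix_code))  # ABCAD
```

The decoder never backtracks, which is exactly what "instantaneous" means here.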
Block Code
Huffman Code
- instantaneous (prefix) code
- optimal symbol code
– it encodes individual source symbols into codewords of variable length
– no other symbol code achieves a shorter average codeword length
- derived from the estimated probability of occurrence of the individual source symbols
Construction of Huffman code (sketch):
- list all possible symbols with their probabilities, and locate two symbols with the smallest probabilities.
- replace them with a single member containing both of them, whose probability is the sum of their probabilities.
- repeat this procedure until the list contains only one member. (The result can be viewed as a binary tree with the original symbols at the leaves.)
- to form the codeword of a symbol, follow the tree from the root down to that symbol's leaf, labelling one branch ‘0’ and the other ‘1’ at every node.
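The construction steps above can be sketched in a few lines. This is only an illustration under assumptions: the function name `huffman_code` and the probabilities are hypothetical, and a heap is used simply as a convenient way to find the two smallest probabilities.

```python
import heapq
from itertools import count


def huffman_code(probabilities):
    """Build a Huffman code; `probabilities` maps symbol -> probability."""
    tie = count()  # tie-breaker so the heap never has to compare tree nodes
    heap = [(p, next(tie), sym) for sym, p in probabilities.items()]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate source with a single symbol
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        # Take the two members with the smallest probabilities ...
        p0, _, node0 = heapq.heappop(heap)
        p1, _, node1 = heapq.heappop(heap)
        # ... and replace them with one member whose probability is the sum.
        heapq.heappush(heap, (p0 + p1, next(tie), (node0, node1)))
    tree = heap[0][2]

    code = {}

    def walk(node, prefix):
        if isinstance(node, tuple):  # internal node: label the two branches
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:  # leaf: an original source symbol
            code[node] = prefix

    walk(tree, "")
    return code


# Hypothetical probabilities, chosen only for illustration.
probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
code = huffman_code(probs)
average = sum(probs[s] * len(code[s]) for s in probs)
print({s: len(w) for s, w in code.items()})
print(average)  # 1.75
```

Which branch gets ‘0’ and which gets ‘1’ is arbitrary, so different implementations may output different codewords with the same lengths. The sketch assumes symbols are not themselves tuples, since tuples are used to represent internal nodes.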
Arithmetic Code
- codeword is not assigned to individual symbols (i.e., not symbol code)
- represent symbols by intervals
- encode a stream of source symbols into a single fraction between 0 and 1
- typically slightly more efficient than Huffman code, because it does not have to spend a whole number of bits on each symbol
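The interval idea can be sketched as follows. This is a simplified illustration with exact fractions, not a practical encoder: the model `MODEL`, the message "BAC", and the function names are all hypothetical, and the decoder is assumed to know the message length.

```python
from fractions import Fraction

# Hypothetical source model (not from the notes): symbol -> probability.
MODEL = {"A": Fraction(1, 2), "B": Fraction(1, 4), "C": Fraction(1, 4)}


def cumulative_intervals(model):
    """Give each symbol a sub-interval of [0, 1) as wide as its probability."""
    intervals, start = {}, Fraction(0)
    for sym, p in model.items():
        intervals[sym] = (start, start + p)
        start += p
    return intervals


def encode_interval(message, model):
    """Narrow [0, 1) once per symbol; any number in the result encodes the message."""
    intervals = cumulative_intervals(model)
    low, width = Fraction(0), Fraction(1)
    for sym in message:
        s_low, s_high = intervals[sym]
        low = low + width * s_low
        width = width * (s_high - s_low)
    return low, low + width


def decode_value(value, length, model):
    """Recover `length` symbols from a number inside the final interval."""
    intervals = cumulative_intervals(model)
    out = []
    for _ in range(length):
        for sym, (s_low, s_high) in intervals.items():
            if s_low <= value < s_high:
                out.append(sym)
                # Rescale so the chosen sub-interval becomes [0, 1) again.
                value = (value - s_low) / (s_high - s_low)
                break
    return "".join(out)


low, high = encode_interval("BAC", MODEL)
print(low, high)                     # 19/32 5/8
print(decode_value(low, 3, MODEL))   # BAC
```

The width of the final interval equals the product of the symbol probabilities, so more probable messages get wider intervals and need fewer bits to pin down.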
Suppose we encode the message FADDE:
- block code of length 3: 15 bits (5 symbols × 3 bits)
- Huffman code: 12 bits
- arithmetic code: 12 bits
– encode with any number between 0.54256 and 0.54288, e.g., 0.542724609375, whose binary expression is 0.100010101111.
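The numbers in this example can be checked directly. The sketch below only uses the interval bounds and the binary string quoted above; the variable names are illustrative, and the symbol probabilities behind the interval are not reconstructed here.

```python
import math
from fractions import Fraction

low, high = Fraction("0.54256"), Fraction("0.54288")
bits = "100010101111"  # digits after the binary point

# Interpret the bit string as the binary fraction 0.100010101111.
value = Fraction(int(bits, 2), 2 ** len(bits))

print(float(value))          # 0.542724609375
print(low <= value < high)   # True
print(len(bits))             # 12

# The interval width bounds the code length from below:
# roughly -log2(width) bits are needed to single out the interval.
print(math.ceil(-math.log2(float(high - low))))  # 12

# Fixed-length block code for comparison: 3 bits for each of the 5 symbols.
print(len("FADDE") * 3)      # 15
```

The 12-bit fraction lies inside the interval, which is consistent with the 12-bit figure given for the arithmetic code.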