目录
字符串压缩算法用于减少字符串的存储空间,尤其是在需要传输或保存大量文本数据时。以下是三种常见的字符串压缩算法:RLE、Huffman编码和LZW压缩。
RLE(游程长度编码)
算法原理
游程长度编码(Run-Length Encoding,RLE)是一种简单的压缩算法,主要针对字符串中连续重复的字符。该算法通过记录每个字符的重复次数来实现压缩。
步骤说明
- 遍历字符串,记录每个字符及其连续出现的次数。
- 生成一个新的字符串,其中每个字符后面跟着其出现的次数。
示例说明
考虑字符串 "AAAABBBCCDAA"
:
- 第1步:找到字符
A
连续出现了4次,记为"4A"
。 - 第2步:找到字符
B
连续出现了3次,记为"3B"
。 - 第3步:字符
C
连续出现2次,记为"2C"
。 - 第4步:字符
D
出现1次,记为"1D"
。 - 第5步:字符
A
连续出现2次,记为"2A"
。
最终压缩结果为 "4A3B2C1D2A"
。
代码示例
python语言:
def rle_encode(data):
encoding = ''
i = 0
while i < len(data):
count = 1
while i + 1 < len(data) and data[i] == data[i + 1]:
i += 1
count += 1
encoding += str(count) + data[i]
i += 1
return encoding
# 示例使用
input_string = "AAAABBBCCDAA"
encoded_string = rle_encode(input_string)
print(encoded_string) # 输出: "4A3B2C1D2A"
C语言:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
char *rleEncode(const char *data) {
int dataLength = strlen(data);
char *encoding = (char *)malloc(2 * dataLength * sizeof(char));
int encodingIndex = 0;
int i = 0;
while (i < dataLength) {
int count = 1;
while (i + 1 < dataLength && data[i] == data[i + 1]) {
i++;
count++;
}
int countDigits = 0;
int tempCount = count;
while (tempCount > 0) {
tempCount /= 10;
countDigits++;
}
int digitIndex = countDigits;
tempCount = count;
while (tempCount > 0) {
encoding[encodingIndex + digitIndex--] = '0' + tempCount % 10;
tempCount /= 10;
}
encoding[encodingIndex + countDigits] = data[i];
encodingIndex += countDigits + 1;
i++;
}
encoding[encodingIndex] = '\0';
return encoding;
}
int main() {
const char *inputString = "AAAABBBCCDAA";
char *encodedString = rleEncode(inputString);
printf("%s\n", encodedString);
free(encodedString);
return 0;
}
优缺点
- 优点:RLE算法实现简单,适用于字符重复较多的场景。
- 缺点:对于字符重复较少的字符串,RLE可能会增加字符串的长度而非压缩。
Huffman编码
基本原理
Huffman编码是一种基于字符出现频率的无损压缩算法。它通过构建一棵Huffman树,为出现频率较高的字符分配较短的二进制编码,频率较低的字符分配较长的二进制编码,从而达到压缩的目的。
构造Huffman树
- 计算每个字符的出现频率。
- 创建一个优先队列,将每个字符及其频率作为一个叶节点插入队列。
- 取出队列中频率最低的两个节点,创建一个新的父节点,其频率为两个节点频率之和,并将该父节点插回队列。
- 重复步骤3,直到队列中只剩下一个节点,该节点即为Huffman树的根节点。
编码与解码过程
- 编码:从根节点出发,沿着树向下遍历,每向左走一步,记为
0
,向右走一步,记为1
,直到达到叶节点。这样,每个字符都有一个唯一的二进制编码。 - 解码:从压缩后的二进制字符串出发,沿着Huffman树进行解码,直到恢复出原始字符串。
代码示例
python语言:
import heapq
from collections import defaultdict, Counter
class HuffmanNode:
def __init__(self, char, freq):
self.char = char
self.freq = freq
self.left = None
self.right = None
def __lt__(self, other):
return self.freq < other.freq
def build_huffman_tree(text):
frequency = Counter(text)
heap = [HuffmanNode(char, freq) for char, freq in frequency.items()]
heapq.heapify(heap)
while len(heap) > 1:
node1 = heapq.heappop(heap)
node2 = heapq.heappop(heap)
merged = HuffmanNode(None, node1.freq + node2.freq)
merged.left = node1
merged.right = node2
heapq.heappush(heap, merged)
return heap[0]
def build_codes(node, prefix='', codebook={}):
if node is not None:
if node.char is not None:
codebook[node.char] = prefix
build_codes(node.left, prefix + '0', codebook)
build_codes(node.right, prefix + '1', codebook)
return codebook
def huffman_encode(text):
root = build_huffman_tree(text)
codebook = build_codes(root)
return ''.join(codebook[char] for char in text), codebook
# 示例使用
text = "AAAABBBCCDAA"
encoded_text, huffman_codebook = huffman_encode(text)
print(f"Encoded: {encoded_text}") # 输出压缩后的二进制字符串
print(f"Codebook: {huffman_codebook}") # 输出字符到二进制的映射
C语言:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
// 哈夫曼树节点结构体
typedef struct HuffmanNode {
char character;
int frequency;
struct HuffmanNode *left;
struct HuffmanNode *right;
} HuffmanNode;
// 创建新的哈夫曼节点
HuffmanNode *createHuffmanNode(char character, int frequency) {
HuffmanNode *newNode = (HuffmanNode *)malloc(sizeof(HuffmanNode));
newNode->character = character;
newNode->frequency = frequency;
newNode->left = NULL;
newNode->right = NULL;
return newNode;
}
// 交换两个哈夫曼节点
void swapHuffmanNodes(HuffmanNode **a, HuffmanNode **b) {
HuffmanNode *temp = *a;
*a = *b;
*b = temp;
}
// 下调整个最小堆,保持堆的性质
void minHeapify(HuffmanNode **heap, int size, int index) {
int smallest = index;
int left = 2 * index + 1;
int right = 2 * index + 2;
if (left < size && heap[left]->frequency < heap[smallest]->frequency)
smallest = left;
if (right < size && heap[right]->frequency < heap[smallest]->frequency)
smallest = right;
if (smallest!= index) {
swapHuffmanNodes(&heap[index], &heap[smallest]);
minHeapify(heap, size, smallest);
}
}
// 构建最小堆
void buildMinHeap(HuffmanNode **heap, int size) {
for (int i = (size / 2) - 1; i >= 0; i--)
minHeapify(heap, size, i);
}
// 提取最小频率的节点
HuffmanNode *extractMin(HuffmanNode **heap, int *size) {
if (*size <= 0)
return NULL;
HuffmanNode *minNode = heap[0];
heap[0] = heap[(*size) - 1];
(*size)--;
minHeapify(heap, *size, 0);
return minNode;
}
// 插入节点到最小堆
void insertNode(HuffmanNode **heap, int *size, HuffmanNode *node) {
(*size)++;
int i = (*size) - 1;
while (i && node->frequency < heap[(i - 1) / 2]->frequency) {
heap[i] = heap[(i - 1) / 2];
i = (i - 1) / 2;
}
heap[i] = node;
}
// 构建哈夫曼树
HuffmanNode *buildHuffmanTree(char *text) {
int frequency[256] = {0};
int length = strlen(text);
for (int i = 0; i < length; i++)
frequency[(int)text[i]]++;
HuffmanNode **heap = (HuffmanNode **)malloc(length * sizeof(HuffmanNode *));
int size = 0;
for (int i = 0; i < 256; i++) {
if (frequency[i] > 0) {
heap[size++] = createHuffmanNode((char)i, frequency[i]);
}
}
buildMinHeap(heap, size);
while (size > 1) {
HuffmanNode *left = extractMin(heap, &size);
HuffmanNode *right = extractMin(heap, &size);
HuffmanNode *merged = createHuffmanNode('\0', left->frequency + right->frequency);
merged->left = left;
merged->right = right;
insertNode(heap, &size, merged);
}
HuffmanNode *root = extractMin(heap, &size);
free(heap);
return root;
}
// 深度优先遍历构建编码表
void buildCodes(HuffmanNode *root, char *prefix, int prefixLength, char **codebook) {
if (root->left) {
prefix[prefixLength] = '0';
buildCodes(root->left, prefix, prefixLength + 1, codebook);
}
if (root->right) {
prefix[prefixLength] = '1';
buildCodes(root->right, prefix, prefixLength + 1, codebook);
}
if (root->character!= '\0') {
prefix[prefixLength] = '\0';
codebook[(int)root->character] = strdup(prefix);
}
}
// 哈夫曼编码函数
void huffmanEncode(char *text) {
HuffmanNode *root = buildHuffmanTree(text);
char prefix[256] = {0};
char **codebook = (char **)malloc(256 * sizeof(char *));
buildCodes(root, prefix, 0, codebook);
printf("Encoded: ");
int length = strlen(text);
for (int i = 0; i < length; i++) {
printf("%s", codebook[(int)text[i]]);
}
printf("\n");
printf("Codebook:\n");
for (int i = 0; i < 256; i++) {
if (codebook[i]!= NULL) {
printf("%c: %s\n", (char)i, codebook[i]);
free(codebook[i]);
}
}
free(codebook);
}
// 测试示例
int main() {
char text[] = "AAAABBBCCDAA";
huffmanEncode(text);
return 0;
}
优缺点
- 优点:Huffman编码能够显著减少高频字符的编码长度,实现高效压缩。
- 缺点:构造Huffman树的过程相对复杂,对于频率较为均匀的字符,压缩效果有限。
LZW压缩
字典构建与压缩过程
LZW(Lempel-Ziv-Welch)是一种基于字典的无损压缩算法。它通过动态构建字典,将字符串中的重复模式编码为较短的二进制串,从而实现压缩。
步骤说明
- 初始化字典,包含所有可能的单字符模式。
- 从输入字符串中读取字符,寻找最长的已存在于字典中的模式。
- 将该模式的索引输出,并将新模式(即该模式加下一个字符)加入字典。
- 重复步骤2和3,直到字符串处理完毕。
示例说明
假设有字符串 "ABABABABABAB"
:
- 初始字典包含所有单字符模式,如
'A': 1, 'B': 2
。 - 读取字符
'A'
,最长匹配为'A'
,输出其索引1
,并将'AB'
加入字典。 - 读取字符
'B'
,最长匹配为'B'
,输出其索引2
,并将'BA'
加入字典。 - 继续匹配,最终压缩输出一系列索引代表原始字符串。
代码示例
python语言:
def lzw_compress(uncompressed):
dict_size = 256
dictionary = {chr(i): i for i in range(dict_size)}
w = ""
compressed_data = []
for c in uncompressed:
wc = w + c
if wc in dictionary:
w = wc
else:
compressed_data.append(dictionary[w])
dictionary[wc] = dict_size
dict_size += 1
w = c
if w:
compressed_data.append(dictionary[w])
return compressed_data
# 示例使用
input_string = "ABABABABABAB"
compressed = lzw_compress(input_string)
print(compressed) # 输出: [65, 66, 256, 258, 260, 262]
C语言:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define DICT_SIZE 256
typedef struct {
char *key;
int value;
} DictionaryEntry;
DictionaryEntry *createDictionaryEntry(char *key, int value) {
DictionaryEntry *entry = (DictionaryEntry *)malloc(sizeof(DictionaryEntry));
entry->key = strdup(key);
entry->value = value;
return entry;
}
void freeDictionaryEntry(DictionaryEntry *entry) {
free(entry->key);
free(entry);
}
int lzwCompress(char *uncompressed) {
DictionaryEntry *dictionary[DICT_SIZE];
for (int i = 0; i < DICT_SIZE; i++) {
dictionary[i] = createDictionaryEntry((char *)&i, i);
}
char w[1000] = "";
int compressedData[1000];
int compressedDataIndex = 0;
for (int i = 0; uncompressed[i]!= '\0'; i++) {
char wc[1000];
strcpy(wc, w);
strncat(wc, &uncompressed[i], 1);
int found = 0;
for (int j = 0; j < DICT_SIZE; j++) {
if (strcmp(dictionary[j]->key, wc) == 0) {
found = 1;
strcpy(w, wc);
break;
}
}
if (!found) {
for (int j = 0; j < DICT_SIZE; j++) {
if (strcmp(dictionary[j]->key, w) == 0) {
compressedData[compressedDataIndex++] = dictionary[j]->value;
break;
}
}
dictionary[DICT_SIZE] = createDictionaryEntry(wc, DICT_SIZE);
DICT_SIZE++;
strcpy(w, &uncompressed[i]);
}
}
if (strlen(w) > 0) {
for (int j = 0; j < DICT_SIZE; j++) {
if (strcmp(dictionary[j]->key, w) == 0) {
compressedData[compressedDataIndex++] = dictionary[j]->value;
break;
}
}
}
for (int i = 0; i < DICT_SIZE; i++) {
freeDictionaryEntry(dictionary[i]);
}
for (int i = 0; i < compressedDataIndex; i++) {
printf("%d ", compressedData[i]);
}
printf("\n");
return compressedDataIndex;
}
int main() {
char inputString[] = "ABABABABABAB";
lzwCompress(inputString);
return 0;
}
优缺点
- 优点:LZW压缩在重复模式丰富的场景下能实现很好的压缩效果,且字典动态构建,使其适应性强。
- 缺点:初始字典大小限制了压缩的灵活性,且当模式变化频繁时,压缩效果不佳。