字符串压缩算法

字符串压缩算法字符串压缩算法用于减少字符串的存储空间 尤其是在需要传输或保存大量文本数据时

大家好,欢迎来到IT知识分享网。

目录

RLE(游程长度编码)

算法原理

步骤说明

示例说明

代码示例

python语言:

C语言:

优缺点

Huffman编码

基本原理

构造Huffman树

编码与解码过程

代码示例

python语言:

C语言:

优缺点

LZW压缩

字典构建与压缩过程

步骤说明

代码示例

python语言:

C语言:

优缺点


字符串压缩算法用于减少字符串的存储空间,尤其是在需要传输或保存大量文本数据时。以下是三种常见的字符串压缩算法:RLE、Huffman编码和LZW压缩。

RLE(游程长度编码)

算法原理

游程长度编码(Run-Length Encoding,RLE)是一种简单的压缩算法,主要针对字符串中连续重复的字符。该算法通过记录每个字符的重复次数来实现压缩。

步骤说明

  1. 遍历字符串,记录每个字符及其连续出现的次数。
  2. 生成一个新的字符串,其中每个字符后面跟着其出现的次数。

示例说明

考虑字符串 "AAAABBBCCDAA"

  • 第1步:找到字符 A 连续出现了4次,记为 "4A"
  • 第2步:找到字符 B 连续出现了3次,记为 "3B"
  • 第3步:字符 C 连续出现2次,记为 "2C"
  • 第4步:字符 D 出现1次,记为 "1D"
  • 第5步:字符 A 连续出现2次,记为 "2A"

最终压缩结果为 "4A3B2C1D2A"

代码示例

python语言:

def rle_encode(data): encoding = '' i = 0 while i < len(data): count = 1 while i + 1 < len(data) and data[i] == data[i + 1]: i += 1 count += 1 encoding += str(count) + data[i] i += 1 return encoding # 示例使用 input_string = "AAAABBBCCDAA" encoded_string = rle_encode(input_string) print(encoded_string) # 输出: "4A3B2C1D2A"

C语言:

#include <stdio.h> #include <stdlib.h> #include <string.h> char *rleEncode(const char *data) { int dataLength = strlen(data); char *encoding = (char *)malloc(2 * dataLength * sizeof(char)); int encodingIndex = 0; int i = 0; while (i < dataLength) { int count = 1; while (i + 1 < dataLength && data[i] == data[i + 1]) { i++; count++; } int countDigits = 0; int tempCount = count; while (tempCount > 0) { tempCount /= 10; countDigits++; } int digitIndex = countDigits; tempCount = count; while (tempCount > 0) { encoding[encodingIndex + digitIndex--] = '0' + tempCount % 10; tempCount /= 10; } encoding[encodingIndex + countDigits] = data[i]; encodingIndex += countDigits + 1; i++; } encoding[encodingIndex] = '\0'; return encoding; } int main() { const char *inputString = "AAAABBBCCDAA"; char *encodedString = rleEncode(inputString); printf("%s\n", encodedString); free(encodedString); return 0; }

优缺点

  • 优点:RLE算法实现简单,适用于字符重复较多的场景。
  • 缺点:对于字符重复较少的字符串,RLE可能会增加字符串的长度而非压缩。

Huffman编码

基本原理

Huffman编码是一种基于字符出现频率的无损压缩算法。它通过构建一棵Huffman树,为出现频率较高的字符分配较短的二进制编码,频率较低的字符分配较长的二进制编码,从而达到压缩的目的。

构造Huffman树

  1. 计算每个字符的出现频率。
  2. 创建一个优先队列,将每个字符及其频率作为一个叶节点插入队列。
  3. 取出队列中频率最低的两个节点,创建一个新的父节点,其频率为两个节点频率之和,并将该父节点插回队列。
  4. 重复步骤3,直到队列中只剩下一个节点,该节点即为Huffman树的根节点。

编码与解码过程

  • 编码:从根节点出发,沿着树向下遍历,每向左走一步,记为 0,向右走一步,记为 1,直到达到叶节点。这样,每个字符都有一个唯一的二进制编码。
  • 解码:从压缩后的二进制字符串出发,沿着Huffman树进行解码,直到恢复出原始字符串。

代码示例

python语言:

import heapq from collections import defaultdict, Counter class HuffmanNode: def __init__(self, char, freq): self.char = char self.freq = freq self.left = None self.right = None def __lt__(self, other): return self.freq < other.freq def build_huffman_tree(text): frequency = Counter(text) heap = [HuffmanNode(char, freq) for char, freq in frequency.items()] heapq.heapify(heap) while len(heap) > 1: node1 = heapq.heappop(heap) node2 = heapq.heappop(heap) merged = HuffmanNode(None, node1.freq + node2.freq) merged.left = node1 merged.right = node2 heapq.heappush(heap, merged) return heap[0] def build_codes(node, prefix='', codebook={}): if node is not None: if node.char is not None: codebook[node.char] = prefix build_codes(node.left, prefix + '0', codebook) build_codes(node.right, prefix + '1', codebook) return codebook def huffman_encode(text): root = build_huffman_tree(text) codebook = build_codes(root) return ''.join(codebook[char] for char in text), codebook # 示例使用 text = "AAAABBBCCDAA" encoded_text, huffman_codebook = huffman_encode(text) print(f"Encoded: {encoded_text}") # 输出压缩后的二进制字符串 print(f"Codebook: {huffman_codebook}") # 输出字符到二进制的映射

C语言:

#include <stdio.h> #include <stdlib.h> #include <string.h> // 哈夫曼树节点结构体 typedef struct HuffmanNode { char character; int frequency; struct HuffmanNode *left; struct HuffmanNode *right; } HuffmanNode; // 创建新的哈夫曼节点 HuffmanNode *createHuffmanNode(char character, int frequency) { HuffmanNode *newNode = (HuffmanNode *)malloc(sizeof(HuffmanNode)); newNode->character = character; newNode->frequency = frequency; newNode->left = NULL; newNode->right = NULL; return newNode; } // 交换两个哈夫曼节点 void swapHuffmanNodes(HuffmanNode a, HuffmanNode b) { HuffmanNode *temp = *a; *a = *b; *b = temp; } // 下调整个最小堆,保持堆的性质 void minHeapify(HuffmanNode heap, int size, int index) { int smallest = index; int left = 2 * index + 1; int right = 2 * index + 2; if (left < size && heap[left]->frequency < heap[smallest]->frequency) smallest = left; if (right < size && heap[right]->frequency < heap[smallest]->frequency) smallest = right; if (smallest!= index) { swapHuffmanNodes(&heap[index], &heap[smallest]); minHeapify(heap, size, smallest); } } // 构建最小堆 void buildMinHeap(HuffmanNode heap, int size) { for (int i = (size / 2) - 1; i >= 0; i--) minHeapify(heap, size, i); } // 提取最小频率的节点 HuffmanNode *extractMin(HuffmanNode heap, int *size) { if (*size <= 0) return NULL; HuffmanNode *minNode = heap[0]; heap[0] = heap[(*size) - 1]; (*size)--; minHeapify(heap, *size, 0); return minNode; } // 插入节点到最小堆 void insertNode(HuffmanNode heap, int *size, HuffmanNode *node) { (*size)++; int i = (*size) - 1; while (i && node->frequency < heap[(i - 1) / 2]->frequency) { heap[i] = heap[(i - 1) / 2]; i = (i - 1) / 2; } heap[i] = node; } // 构建哈夫曼树 HuffmanNode *buildHuffmanTree(char *text) { int frequency[256] = {0}; int length = strlen(text); for (int i = 0; i < length; i++) frequency[(int)text[i]]++; HuffmanNode heap = (HuffmanNode )malloc(length * sizeof(HuffmanNode *)); int size = 0; for (int i = 0; i < 256; i++) { if (frequency[i] > 0) { heap[size++] = createHuffmanNode((char)i, frequency[i]); } } buildMinHeap(heap, size); while (size > 1) { HuffmanNode *left = extractMin(heap, &size); HuffmanNode *right = extractMin(heap, &size); HuffmanNode *merged = createHuffmanNode('\0', left->frequency + right->frequency); merged->left = left; merged->right = right; insertNode(heap, &size, merged); } HuffmanNode *root = extractMin(heap, &size); free(heap); return root; } // 深度优先遍历构建编码表 void buildCodes(HuffmanNode *root, char *prefix, int prefixLength, char codebook) { if (root->left) { prefix[prefixLength] = '0'; buildCodes(root->left, prefix, prefixLength + 1, codebook); } if (root->right) { prefix[prefixLength] = '1'; buildCodes(root->right, prefix, prefixLength + 1, codebook); } if (root->character!= '\0') { prefix[prefixLength] = '\0'; codebook[(int)root->character] = strdup(prefix); } } // 哈夫曼编码函数 void huffmanEncode(char *text) { HuffmanNode *root = buildHuffmanTree(text); char prefix[256] = {0}; char codebook = (char )malloc(256 * sizeof(char *)); buildCodes(root, prefix, 0, codebook); printf("Encoded: "); int length = strlen(text); for (int i = 0; i < length; i++) { printf("%s", codebook[(int)text[i]]); } printf("\n"); printf("Codebook:\n"); for (int i = 0; i < 256; i++) { if (codebook[i]!= NULL) { printf("%c: %s\n", (char)i, codebook[i]); free(codebook[i]); } } free(codebook); } // 测试示例 int main() { char text[] = "AAAABBBCCDAA"; huffmanEncode(text); return 0; }

优缺点

  • 优点:Huffman编码能够显著减少高频字符的编码长度,实现高效压缩。
  • 缺点:构造Huffman树的过程相对复杂,对于频率较为均匀的字符,压缩效果有限。

LZW压缩

字典构建与压缩过程

LZW(Lempel-Ziv-Welch)是一种基于字典的无损压缩算法。它通过动态构建字典,将字符串中的重复模式编码为较短的二进制串,从而实现压缩。

步骤说明

  1. 初始化字典,包含所有可能的单字符模式。
  2. 从输入字符串中读取字符,寻找最长的已存在于字典中的模式。
  3. 将该模式的索引输出,并将新模式(即该模式加下一个字符)加入字典。
  4. 重复步骤2和3,直到字符串处理完毕。

示例说明

假设有字符串 "ABABABABABAB"

  • 初始字典包含所有单字符模式,如 'A': 1, 'B': 2
  • 读取字符 'A',最长匹配为 'A',输出其索引 1,并将 'AB' 加入字典。
  • 读取字符 'B',最长匹配为 'B',输出其索引 2,并将 'BA' 加入字典。
  • 继续匹配,最终压缩输出一系列索引代表原始字符串。

代码示例

python语言:

def lzw_compress(uncompressed): dict_size = 256 dictionary = {chr(i): i for i in range(dict_size)} w = "" compressed_data = [] for c in uncompressed: wc = w + c if wc in dictionary: w = wc else: compressed_data.append(dictionary[w]) dictionary[wc] = dict_size dict_size += 1 w = c if w: compressed_data.append(dictionary[w]) return compressed_data # 示例使用 input_string = "ABABABABABAB" compressed = lzw_compress(input_string) print(compressed) # 输出: [65, 66, 256, 258, 260, 262]

C语言:

#include <stdio.h> #include <stdlib.h> #include <string.h> #define DICT_SIZE 256 typedef struct { char *key; int value; } DictionaryEntry; DictionaryEntry *createDictionaryEntry(char *key, int value) { DictionaryEntry *entry = (DictionaryEntry *)malloc(sizeof(DictionaryEntry)); entry->key = strdup(key); entry->value = value; return entry; } void freeDictionaryEntry(DictionaryEntry *entry) { free(entry->key); free(entry); } int lzwCompress(char *uncompressed) { DictionaryEntry *dictionary[DICT_SIZE]; for (int i = 0; i < DICT_SIZE; i++) { dictionary[i] = createDictionaryEntry((char *)&i, i); } char w[1000] = ""; int compressedData[1000]; int compressedDataIndex = 0; for (int i = 0; uncompressed[i]!= '\0'; i++) { char wc[1000]; strcpy(wc, w); strncat(wc, &uncompressed[i], 1); int found = 0; for (int j = 0; j < DICT_SIZE; j++) { if (strcmp(dictionary[j]->key, wc) == 0) { found = 1; strcpy(w, wc); break; } } if (!found) { for (int j = 0; j < DICT_SIZE; j++) { if (strcmp(dictionary[j]->key, w) == 0) { compressedData[compressedDataIndex++] = dictionary[j]->value; break; } } dictionary[DICT_SIZE] = createDictionaryEntry(wc, DICT_SIZE); DICT_SIZE++; strcpy(w, &uncompressed[i]); } } if (strlen(w) > 0) { for (int j = 0; j < DICT_SIZE; j++) { if (strcmp(dictionary[j]->key, w) == 0) { compressedData[compressedDataIndex++] = dictionary[j]->value; break; } } } for (int i = 0; i < DICT_SIZE; i++) { freeDictionaryEntry(dictionary[i]); } for (int i = 0; i < compressedDataIndex; i++) { printf("%d ", compressedData[i]); } printf("\n"); return compressedDataIndex; } int main() { char inputString[] = "ABABABABABAB"; lzwCompress(inputString); return 0; }

优缺点

  • 优点:LZW压缩在重复模式丰富的场景下能实现很好的压缩效果,且字典动态构建,使其适应性强。
  • 缺点:初始字典大小限制了压缩的灵活性,且当模式变化频繁时,压缩效果不佳。

免责声明:本站所有文章内容,图片,视频等均是来源于用户投稿和互联网及文摘转载整编而成,不代表本站观点,不承担相关法律责任。其著作权各归其原作者或其出版社所有。如发现本站有涉嫌抄袭侵权/违法违规的内容,侵犯到您的权益,请在线联系站长,一经查实,本站将立刻删除。 本文来自网络,若有侵权,请联系删除,如若转载,请注明出处:https://haidsoft.com/112588.html

(0)
上一篇 2026-01-17 12:45
下一篇 2026-01-17 13:10

相关推荐

发表回复

您的邮箱地址不会被公开。 必填项已用 * 标注

关注微信