[Data Structures] Trees (Part 1): Lexicographic Search Trees, Disjoint Sets, and Huffman Coding (C++ Implementation)


热爱改名阿呆呆

Article link: https://blog.csdn.net/Jingle_cjy/article/details/70178727



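As a third-year undergrad I've realized I forget what I've learned far too quickly, so I've decided to start reviewing properly from today. I'm basically using this blog as my review notes, since reading English textbooks is really painful =^= and I need some motivation. Everyone is welcome to study along~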

Trees

general tree

I. Basic Concepts

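Tree: a structure made up of a set of nodes (vertices) and a set of edges (branches) that satisfies two conditions: (1) for any two nodes there is a sequence of edges (a path) connecting them, so every node is connected to every other node; (2) the structure contains no circuits, i.e., there is no path along which a node can get back to itself.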
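- Root: the node that has no parent, e.g., node A.
- Child node & parent node: the root of one of a node's subtrees is a child of that node, e.g., the children of node A are B, C, and D. If a node has children, it is called the parent of those children, e.g., node A is the parent of nodes B, C, and D.
- Branch node: a node whose degree is not 0.
- Leaf node: a node with no children, e.g., G, H, F, and D.
- Sibling node: nodes that share the same parent are siblings of each other, e.g., nodes G and H.
- Ancestor: every node on the path from the root to the given node.
- Descendant: every node in the subtree rooted at the given node.
- Subtree: let T be a rooted tree and a a vertex of T. The subgraph induced by a together with all of its descendants is called a subtree of T, and a is the root of that subtree.
- Forest: an unordered set of disjoint trees; the trees in a forest are usually assumed to be rooted.
- Orchard: also called an ordered forest; either the empty set or an ordered set of ordered trees.
- Path: a path from node N1 to node Nk is a sequence of nodes N1, N2, …, Nk in which Ni is the parent of Ni+1 for every 1 ≤ i < k. The length of the path is its number of edges, k-1.
- Depth: there is a unique path from the root to any node, and the length of that path is the node's depth. The root has depth 0.
- Height: the length of the longest path from the node down to a leaf. A leaf has height 0, and the height of the tree is the height of its root.
- Degree: the number of children of a node; the degree of a tree is the largest degree among its nodes.
- Level: counted from the root, which is on level 0; the root's children are on level 1, and so on.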
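- Full binary tree: every node other than the leaves has exactly two children.
- Complete binary tree: let h be the deepest level of the binary tree (with the root on level 0). Every level above it (levels 0 to h-1) holds the maximum possible number of nodes, and the nodes on level h are packed contiguously from the left.
- Perfect binary tree: every node has exactly two children except the nodes on the last level, which have none. Equivalently, a binary tree of height h (counting from 0) that contains 2^(h+1)-1 nodes.
- Optimal binary tree (Huffman tree): given n weights as n leaf nodes, construct a binary tree; if its weighted path length is minimal, the tree is called an optimal binary tree, also known as a Huffman tree.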
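To make the definitions of degree, height, and depth a bit more concrete, here is a minimal sketch of how they could be computed. The `GNode` type and the function names are made up for illustration (they are not from the course material), and the tree is assumed to be stored as nodes holding a list of child pointers.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical node type for illustration: a label plus a list of children.
struct GNode {
    char label;
    std::vector<GNode*> children;
};

// Degree of a node: its number of children.
int degree(const GNode* node) {
    return static_cast<int>(node->children.size());
}

// Height of a node: number of edges on the longest downward path to a leaf.
// A leaf has height 0.
int height(const GNode* node) {
    int best = -1;
    for (const GNode* child : node->children)
        best = std::max(best, height(child));
    return best + 1;
}

// Degree of the tree: the largest degree among all nodes.
int treeDegree(const GNode* node) {
    int best = degree(node);
    for (const GNode* child : node->children)
        best = std::max(best, treeDegree(child));
    return best;
}

// Depth of `target` measured from `root` (the root has depth 0).
// Returns -1 if `target` is not in the tree rooted at `root`.
int depth(const GNode* root, const GNode* target, int current = 0) {
    if (root == target) return current;
    for (const GNode* child : root->children) {
        int d = depth(child, target, current + 1);
        if (d >= 0) return d;
    }
    return -1;
}
```

For a tree whose root A has children B, C, and D, where B has children G and H and C has the single child F (an assumed shape that matches the leaves listed above), `height` of A is 2, `degree` of A is 3, and `depth` of H is 2.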

II. Tree Representations

1. Parent representation (parent method)

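/* Node and tree definitions for the parent representation */
#define MAX_TREE_SIZE 100
typedef char TElemType; /* data type stored in a tree node */

struct PTNode /* node structure */
{
    TElemType data; /* node data */
    int parent;     /* index of the parent node */
};

struct PTree        /* tree structure */
{
    PTNode nodes[MAX_TREE_SIZE]; /* array of nodes */
    int r, n;       /* index of the root and number of nodes */
};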

With this representation a node's parent is very easy to obtain, but its children are hard to find (the whole table has to be scanned).
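The asymmetry can be sketched as follows. The struct definitions are repeated so the snippet stands alone; `parentOf` and `childrenOf` are helper names invented here, and storing -1 as the root's parent is an assumption of this sketch.

```cpp
#include <vector>

#define MAX_TREE_SIZE 100
typedef char TElemType;

struct PTNode { TElemType data; int parent; };
struct PTree  { PTNode nodes[MAX_TREE_SIZE]; int r, n; };

// Parent lookup is a single array access: O(1).
int parentOf(const PTree& t, int i) {
    return t.nodes[i].parent;
}

// Child lookup has to scan every stored node: O(n).
std::vector<int> childrenOf(const PTree& t, int i) {
    std::vector<int> result;
    for (int k = 0; k < t.n; k++)
        if (t.nodes[k].parent == i) result.push_back(k);
    return result;
}
```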
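
2. Multiple-links representation (multiple links)
(1) Every node contains d pointers, where d is the maximum degree of any node in the tree.

(2) Alternative: each node stores a number d giving its own degree, and its pointer field holds d pointers.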
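The notes give no code for this representation, so the following is only a rough sketch of what the two variants might look like. `FixedNode`, `VarNode`, `MAX_DEGREE`, and `countNullSlots` are all names assumed for illustration. The helper shows the main cost of variant (1): a tree with n nodes has n-1 edges, so n*d - (n-1) of its pointer slots stay empty.

```cpp
#include <vector>

typedef char TElemType;
const int MAX_DEGREE = 3;   // hypothetical value of d, the tree's maximum degree

// Variant (1): every node reserves MAX_DEGREE child pointers.
struct FixedNode {
    TElemType data;
    FixedNode* child[MAX_DEGREE];   // unused slots are null
};

// Variant (2): every node records its own degree and holds only that many pointers.
struct VarNode {
    TElemType data;
    int degree;                     // number of children of this node
    std::vector<VarNode*> child;    // exactly `degree` pointers
};

// Count the null (wasted) pointer slots in a variant (1) tree.
int countNullSlots(const FixedNode* node) {
    int nulls = 0;
    for (int i = 0; i < MAX_DEGREE; i++) {
        if (node->child[i] == nullptr) nulls++;
        else nulls += countNullSlots(node->child[i]);
    }
    return nulls;
}
```

For a root with three leaf children and d = 3, there are 4 nodes and 12 slots, of which only 3 are used, so `countNullSlots` reports 9.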

3. Child-list representation (child-link)

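/* Structure definitions for the child-list representation */
#define MAX_TREE_SIZE 100
typedef char TElemType; /* data type stored in a tree node */
typedef struct CTNode   /* entry in a child list */
{
    int child;          /* index of the child in the node array */
    CTNode *next;       /* next child of the same parent */
} *ChildPtr;
struct CTBox        /* list head for one node */
{
    TElemType data;
    ChildPtr firstchild;    /* head of this node's child list */
};
struct CTree        /* tree structure */
{
    CTBox nodes[MAX_TREE_SIZE]; /* array of nodes */
    int r, n;   /* index of the root and number of nodes */
};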

4. First-child/next-sibling representation (First child next sibling)

/* Tree node definition */
struct TreeNode 
{
    TElemType data;
    TreeNode *firstChild;
    TreeNode *nextSibling;
};

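5. Forest traversal (Forest Traverse)
To traverse a forest in preorder, first convert it to its corresponding binary tree (first child as the left pointer, next sibling as the right pointer, with the roots of the trees linked as siblings); a preorder traversal of that binary tree then visits the nodes in forest preorder.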
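A small sketch of this idea, reusing the first-child/next-sibling node from above (repeated here so the snippet stands alone). The `preorder` function and the choice to collect labels into a string are assumptions made for illustration.

```cpp
#include <string>

typedef char TElemType;

struct TreeNode {
    TElemType data;
    TreeNode* firstChild;    // plays the role of the binary tree's left pointer
    TreeNode* nextSibling;   // plays the role of the binary tree's right pointer
};

// Preorder on the binary-tree view: visit the node, then its "left" subtree
// (its children), then its "right" subtree (its later siblings).
void preorder(const TreeNode* node, std::string& out) {
    if (node == nullptr) return;
    out.push_back(node->data);
    preorder(node->firstChild, out);
    preorder(node->nextSibling, out);
}
```

Calling `preorder` on the root of the first tree covers the whole forest, because the other roots are reachable through `nextSibling`. For a forest with the two trees A(B, C) and D(E), the visiting order is A, B, C, D, E.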

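III. Some Simple Applications

1. Lexicographic search trees: trie (the name is cut out of "retrieval"; also called a dictionary tree)[1]

Every node except the root stores one character, and the characters along the path from the root to a node A spell out the string that A represents. Nodes with the same parent store different characters. Tries are mainly used for problems involving common prefixes of strings.
The advantage is that shared prefixes cut down query time and avoid as many unnecessary string comparisons as possible: a lookup takes time proportional to the length of the query string rather than to the number of stored strings, and query efficiency is higher than that of a hash tree. The drawback is that if the system holds a large number of strings with essentially no common prefixes, the trie will consume a great deal of memory.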

#include <iostream>
#include <string>
using namespace std;
#define num_chars 26
struct TrieNode{
    int count;
    TrieNode *branches[num_chars];
};

class Trie{
    public:
        // constructor
        Trie(){
            root = new TrieNode;
            root->count = 0;
            for(int i=0; i<num_chars; i++)
                root->branches[i] = NULL;
        }
        // destructor
        ~Trie(){
            deleteNode(root);
            root = NULL;
        }

        // create or insert new string
        // (assumes str contains only lowercase letters 'a'-'z')
        void insert(const string &str){
            int len = str.length();
            if(len<=0) return;

            TrieNode *recNode = root;
            for(int i=0; i<len; i++){
                if(recNode->branches[str[i]-'a']==NULL){
                    TrieNode *tmp = new TrieNode;
                    for(int j=0; j<num_chars; j++)
                        tmp->branches[j] = NULL;
                    tmp->count = 0;
                    recNode->branches[str[i]-'a'] = tmp;
                    recNode = tmp;
                }
                else{
                    recNode = recNode->branches[str[i]-'a'];
                }
            }
            recNode->count++;
        } 

        // Check whether a string exists in the trie
        bool search(const string &str){
            int len = str.length();
            if(len<=0) return false; // insert() never stores the empty string
            TrieNode *recNode = root;
            for(int i=0; i<len; i++){
                if(recNode->branches[str[i]-'a']==NULL)
                    return false;
                recNode = recNode->branches[str[i]-'a'];
            }
            if(recNode->count>0) return true;

            return false;
        }

    protected:
        void deleteNode(TrieNode *node){
            for(int i=0; i<num_chars; i++){
                if(node->branches[i]!=NULL){
                    deleteNode(node->branches[i]);
                }
            }
            delete node;
        }

    private:
        TrieNode *root;
};

void Test(){
    Trie trie;
    trie.insert("hello");
    trie.insert("he");
    trie.insert("her");
    trie.insert("world");
    trie.insert("word");
    if(trie.search("hello")) cout << "YES" << endl;
    else cout << "NO" << endl; 
    if(trie.search("hel")) cout << "YES" << endl;
    else cout << "NO" << endl;
    if(trie.search("helloooo")) cout << "YES" << endl;
    else cout << "NO" << endl;
}

int main(){
    Test(); 
    return 0;
}
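Since tries are mainly about common prefixes, a prefix query is a natural complement to the exact-match `search` above. The following is a sketch of one way to do it, not part of the original code: `PrefixNode`, `PrefixTrie`, `passCount`, and `countPrefix` are names invented here, `std::unique_ptr` is swapped in for the manual `deleteNode`, and input is assumed to contain only lowercase 'a'-'z'.

```cpp
#include <memory>
#include <string>

// Hypothetical node: counts how many inserted words pass through it.
struct PrefixNode {
    int passCount = 0;
    std::unique_ptr<PrefixNode> next[26];   // children are freed automatically
};

class PrefixTrie {
public:
    // Insert a word; assumes only lowercase 'a'..'z' characters.
    void insert(const std::string& word) {
        PrefixNode* cur = &root_;
        cur->passCount++;                   // every word passes through the root
        for (char ch : word) {
            std::unique_ptr<PrefixNode>& slot = cur->next[ch - 'a'];
            if (!slot) slot.reset(new PrefixNode());
            cur = slot.get();
            cur->passCount++;
        }
    }

    // Number of inserted words that start with `prefix`.
    int countPrefix(const std::string& prefix) const {
        const PrefixNode* cur = &root_;
        for (char ch : prefix) {
            cur = cur->next[ch - 'a'].get();
            if (cur == nullptr) return 0;   // no stored word has this prefix
        }
        return cur->passCount;
    }

private:
    PrefixNode root_;
};
```

With the same five words as in `Test()` (hello, he, her, world, word), `countPrefix("he")` is 3 and `countPrefix("wor")` is 2. The query cost depends only on the prefix length, which is the property the section describes.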

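2. Disjoint-set forest (union-find)

A disjoint-set forest uses trees to represent disjoint sets. Every node in a set stores a reference to its parent (the parent representation). It is mainly used for merging disjoint sets and for querying which set an element belongs to.
Optimization (path compression): during a find, if the path is long, update the nodes on the search path so that they all point directly to the root; the next find is then faster.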

#include <iostream>
using namespace std;
#define MAX 50005
int father[MAX];

// Find the root (representative) of the set that contains x
int findFather(int x){
    if(father[x] != x) 
        return father[x] = findFather(father[x]); // path compression
    else return x;
}

// a and b are in the same set. The trees 
// that a and b belong to should be combined
// (declared void: the original int version returned nothing, which is undefined behavior)
void combineTree(int a, int b){
    father[findFather(a)] = findFather(b);
}

int main(){
    int numNode, numEdge, numQuery, tmp1, tmp2;
    cin >> numNode >> numEdge >> numQuery;

    // Initialize the father of each node as itself
    for(int i=1; i<=numNode; i++){
        father[i]=i;
    } 
    // if two trees are connected, combine them
    for(int i=1; i<=numEdge; i++){
        cin >> tmp1 >> tmp2;
        combineTree(tmp1, tmp2);
    }
    // check if two nodes are in one set
    for(int i=1; i<=numQuery; i++){
        cin >> tmp1 >> tmp2;
        if(findFather(tmp1)==findFather(tmp2)) cout << "YES" << endl;
        else cout << "NO" << endl;
    }
}
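The version above always hangs the root of a under the root of b. A common refinement, sketched below, is to combine path compression with union by size so that the smaller tree goes under the larger one. This is an alternative written for illustration, not the author's code: the class name `DisjointSet` and its methods are assumptions, elements are numbered from 0 (the program above numbers them from 1), and `find` is written iteratively to avoid deep recursion on long chains.

```cpp
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical union-find with union by size and path compression.
class DisjointSet {
public:
    explicit DisjointSet(int n) : parent_(n), size_(n, 1) {
        std::iota(parent_.begin(), parent_.end(), 0);   // each element is its own root
    }

    // Return the root of x's set and point every node on the path at it.
    int find(int x) {
        int root = x;
        while (parent_[root] != root) root = parent_[root];
        while (parent_[x] != root) {        // second pass: compress the path
            int next = parent_[x];
            parent_[x] = root;
            x = next;
        }
        return root;
    }

    // Merge the sets of a and b; returns false if they were already merged.
    bool unite(int a, int b) {
        int ra = find(a), rb = find(b);
        if (ra == rb) return false;
        if (size_[ra] < size_[rb]) std::swap(ra, rb);
        parent_[rb] = ra;                   // smaller tree goes under the larger
        size_[ra] += size_[rb];
        return true;
    }

    bool same(int a, int b) { return find(a) == find(b); }

private:
    std::vector<int> parent_;
    std::vector<int> size_;
};
```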

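3. Huffman coding

Huffman coding uses the probability with which each character occurs to build prefix-free codewords (written with 0/1) that have the shortest average length.
Prefix code: no character's code is a prefix of another character's code.
A full binary tree is normally used for Huffman coding, and the resulting binary tree is called a Huffman tree. A simple example is shown in the figure:

Each leaf represents the code of one value. Let $f(c)$ be the frequency of character c and $d_T(c)$ the depth, in the Huffman tree, of the leaf that encodes c. The number of bits needed to encode a file, which is the cost of the Huffman tree, can then be computed with the following function.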

$B(T)=\sum_{c\in C}{f(c)d_T(c)}$
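As a quick worked example with made-up numbers (not taken from the figure): suppose a file contains only the characters a, b, and c with frequencies $f(a)=5$, $f(b)=2$, $f(c)=1$, and the tree T places a at depth 1 and b and c at depth 2. The cost then evaluates to:

```latex
\begin{aligned}
B(T) &= f(a)\,d_T(a) + f(b)\,d_T(b) + f(c)\,d_T(c) \\
     &= 5 \cdot 1 + 2 \cdot 2 + 1 \cdot 2 \\
     &= 11 \text{ bits}
\end{aligned}
```

A fixed-length code for three characters needs 2 bits per character, i.e., $2 \cdot 8 = 16$ bits for the same 8 characters.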
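Huffman algorithm: a greedy algorithm with the following steps.
1. Create a forest of s nodes, one per character. The nodes are independent of each other, and each carries a value equal to the frequency of its character. The nodes are placed in a priority queue ordered by these frequencies.
2. Then repeat the following s-1 times:
(1) Remove the two nodes with the smallest values, L and R, from the priority queue and create a new node as the parent of L and R.
(2) Set the value of the new node to the sum of the frequencies of L and R, and insert the new node into the priority queue.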

#include <iostream>
#include <vector>
#include <queue>
using namespace std;

struct Node{
    int freq;
    Node* left;
    Node* right;
};

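struct cmpNode{
    // Must be a strict comparison: '>=' is not a strict weak ordering,
    // which priority_queue requires of its comparator
    bool operator()(const Node* a, const Node* b) const{
        return a->freq > b->freq;
    }
}; 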

Node* mergeTree(Node* &small1, Node* &small2){
    Node* newNode = new Node();
    newNode->freq = small1->freq + small2->freq;
    newNode->left = small1;
    newNode->right = small2;
    return newNode;
} 

void level_traversal(Node* node){
    Node* curNode = node;
    queue<Node*> q;
    if(curNode != NULL) q.push(curNode);
    while(!q.empty()){
        curNode = q.front();
        q.pop();
        cout << curNode->freq << " ";
        if(curNode->left != NULL) q.push(curNode->left);
        if(curNode->right != NULL) q.push(curNode->right);
    }
}

int main(){
    int n, freq; 
    Node *less1, *less2, *root;
    cin >> n;
    // Construct MinHeap
    priority_queue<Node*, vector<Node*>, cmpNode> Q;
    for(int i=0; i<n; i++){
        cin >> freq;
        Node* newNode = new Node();
        newNode->freq = freq;
        newNode->left = NULL;
        newNode->right = NULL;
        Q.push(newNode); // Put the value into heap
    }
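    while(Q.size() > 1){
        less1 = Q.top();
        Q.pop();
        less2 = Q.top();
        Q.pop();
        root = mergeTree(less1, less2);
        Q.push(root);
    }
    // The last node left in the queue is the root; this also covers n == 1,
    // where the loop never runs. root is NULL if no frequencies were read.
    root = Q.empty() ? NULL : Q.top();
    level_traversal(root);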
    cout << "END" << endl;

    return 0;
}
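The program above builds the tree and prints its frequencies level by level, but it stops short of producing the actual codewords. Here is a sketch of how the codes and the cost B(T) could be derived from the same construction. The names `HNode`, `ByFreq`, `buildHuffman`, `assignCodes`, `cost`, and `destroy` are all invented for this example, as is the convention that a left edge means 0 and a right edge means 1.

```cpp
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

struct HNode {
    char symbol;     // meaningful only for leaves
    long freq;
    HNode* left;
    HNode* right;
};

struct ByFreq {
    bool operator()(const HNode* a, const HNode* b) const {
        return a->freq > b->freq;   // strict '>' turns priority_queue into a min-heap
    }
};

// Build the tree from (symbol, frequency) pairs; release it with destroy().
HNode* buildHuffman(const std::vector<std::pair<char, long>>& freqs) {
    std::priority_queue<HNode*, std::vector<HNode*>, ByFreq> pq;
    for (const auto& p : freqs)
        pq.push(new HNode{p.first, p.second, nullptr, nullptr});
    while (pq.size() > 1) {
        HNode* l = pq.top(); pq.pop();   // smallest
        HNode* r = pq.top(); pq.pop();   // second smallest
        pq.push(new HNode{'\0', l->freq + r->freq, l, r});
    }
    return pq.empty() ? nullptr : pq.top();
}

// Walk the tree, appending '0' for a left edge and '1' for a right edge.
void assignCodes(const HNode* node, const std::string& prefix,
                 std::map<char, std::string>& codes) {
    if (node == nullptr) return;
    if (node->left == nullptr && node->right == nullptr) {
        // A tree with a single symbol still needs a one-bit code.
        codes[node->symbol] = prefix.empty() ? "0" : prefix;
        return;
    }
    assignCodes(node->left, prefix + "0", codes);
    assignCodes(node->right, prefix + "1", codes);
}

// B(T): sum over all symbols of frequency times code length (leaf depth).
long cost(const std::vector<std::pair<char, long>>& freqs,
          const std::map<char, std::string>& codes) {
    long total = 0;
    for (const auto& p : freqs)
        total += p.second * static_cast<long>(codes.at(p.first).size());
    return total;
}

// Free every node of the tree.
void destroy(HNode* node) {
    if (node == nullptr) return;
    destroy(node->left);
    destroy(node->right);
    delete node;
}
```

With the frequencies a:5, b:2, c:1, the two smallest nodes (c and b) are merged first, so a ends up with a 1-bit code and b and c with 2-bit codes, giving a cost of 11 bits.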