Trie (Dictionary Tree): Principles and Implementation (with Support for Inserting Chinese Words)


Ragty_


Category: NLP. Tags: algorithms, trie, trie principles

Article link: https://blog.csdn.net/huoji555/article/details/104488665


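1. Background

One bottleneck of dictionary-based matching algorithms is deciding whether the dictionary contains a given string. With a sorted collection (TreeMap), a lookup costs O(log n). With a hash table (HashMap), the nominal time complexity drops, but memory consumption goes up. What we want is a data structure that is both fast and memory-efficient.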

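2. What a trie is:

A dictionary tree, also called a word-lookup tree or trie, is a tree-shaped data structure that is sometimes described as a variant of a hash tree. It is typically used to count, sort, and store large numbers of strings (though it is not limited to strings), which is why search engines often use it for word-frequency statistics. (The idea is easiest to grasp from a diagram.)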

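3. Properties of a trie:

The root node holds no character; every other node holds exactly one character.
Concatenating the characters along the path from the root to a node gives the string that node represents.
All children of a given node hold distinct characters.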

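4. How a trie works:

From the viewpoint of a deterministic finite automaton (DFA), each node is a state that represents the prefix matched so far, and moving from a parent node to a child node is a state transition. A lookup proceeds as follows:

Feed in the word to look up, one character at a time. If the current state has an edge for the character, take the transition; if it does not, the lookup fails immediately.
Once every transition has been made, take the state reached by the last character and ask whether it is a final state. If it is, the word has been found; otherwise the word is not in the dictionary.

Insertion, deletion, update, and lookup all walk the tree in the same way, so the walk is not described again for each of them below.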
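The two lookup steps above can be sketched as a small automaton. This is only an illustration under assumptions: the names DfaTrieSketch, State, edges, accepting, add, and accepts are made up for this sketch and are not the article's own classes, which follow in sections 5 and 6.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical names throughout; this is a sketch, not the article's TrieTree1.
class DfaTrieSketch {
    // A state stands for the prefix consumed so far.
    static final class State {
        final Map<Character, State> edges = new HashMap<>();
        boolean accepting; // true if the prefix is itself a dictionary word
    }

    final State start = new State();

    // Build the automaton: one transition per character of the word.
    void add(String word) {
        State s = start;
        for (char c : word.toCharArray()) {
            s = s.edges.computeIfAbsent(c, k -> new State());
        }
        s.accepting = true;
    }

    // Step 1: follow one edge per character and fail as soon as an edge is missing.
    // Step 2: after the last transition, ask whether the state reached is accepting.
    boolean accepts(String word) {
        State s = start;
        for (char c : word.toCharArray()) {
            s = s.edges.get(c);
            if (s == null) {
                return false; // no matching edge
            }
        }
        return s.accepting;
    }

    public static void main(String[] args) {
        DfaTrieSketch dfa = new DfaTrieSketch();
        dfa.add("字典");
        dfa.add("字典树");
        System.out.println(dfa.accepts("字典")); // true: ends in an accepting state
        System.out.println(dfa.accepts("字"));   // false: state exists but is not accepting
        System.out.println(dfa.accepts("字画")); // false: no edge for the second character
    }
}
```

The second case is the one worth noticing: reaching a state is not enough, the state also has to be marked as the end of a word.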

5. Trie node structure:

Here each node keeps its children in a HashMap.

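import java.util.HashMap;

/**
* @Author Ragty
* @Description  Trie node
* @Date   2020/2/27 0:00
*/
class TrieNode {
    public int path;        // number of inserted words that share the prefix ending at this node
    public boolean status;  // true if a word ends at this node
    public HashMap<Character, TrieNode> map;  // child nodes, keyed by character

    public TrieNode() {
        path = 0;
        status = false;
        map = new HashMap<>();
    }
}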

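6. Trie implementation:

The members below belong to the class TrieTree1, whose declaration is not shown; preWalk additionally needs java.util.Map to be imported.

private TrieNode root;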


/**
 *  @author: Ragty
 *  @Date: 2020/2/27 0:01
 *  @Description: Initialize the trie with an empty root
 */
public TrieTree1() {
    root = new TrieNode();
}



/**
 *  @author: Ragty
 *  @Date: 2020/2/27 0:02
 *  @Description: Insert a word
 */
public void insert(String word) {
    if (word == null || word.isEmpty()) {
        return;
    }
    TrieNode node = root;
    node.path++;    // the root counts every insert call
    char[] words = word.toCharArray();  // one char per node, so Chinese characters work like letters
    for (int i = 0; i < words.length; i++) {
        // create the child on first use, then move down to it
        if (node.map.get(words[i]) == null) {
            node.map.put(words[i], new TrieNode());
        }
        node = node.map.get(words[i]);
        node.path++;
    }
    node.status = true;  // mark the end of a word
}


/**
 *  @author: Ragty
 *  @Date: 2020/2/27 0:02
 *  @Description: Look up a word; true only if the whole word was inserted
 */
public boolean search(String word) {
    if (word == null)
        return false;
    TrieNode node = root;
    char[] words = word.toCharArray();
    for (int i = 0; i < words.length; i++) {
        if (node.map.get(words[i]) == null)
            return false;
        node = node.map.get(words[i]);
    }
    return node.status;
}


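/**
 *  @author: Ragty
 *  @Date: 2020/2/27 0:06
 *  @Description: Delete a word
 */
public void delete(String word) {
    if (search(word)) {
        char[] words = word.toCharArray();
        TrieNode node = root;
        node.path--;
        for (int i = 0; i < words.length; i++) {
            // no other word passes through this child: drop the whole subtree
            if (--node.map.get(words[i]).path == 0) {
                node.map.remove(words[i]);
                return;
            }
            node = node.map.get(words[i]);
        }
        // the last node is still shared with longer words, so only clear its end-of-word flag
        // (assumes the word was inserted once; insert() does not deduplicate)
        node.status = false;
    }
}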




/**
 *  @author: Ragty
 *  @Date: 2020/2/27 0:07
 *  @Description: Prefix count; if the prefix exists, return the path value of its last node, otherwise 0
 */
public int prefixNumber(String pre) {
    if (pre == null)
        return 0;
    TrieNode node = root;
    char[] pres = pre.toCharArray();
    for (int i = 0; i < pres.length; i++) {
        if (node.map.get(pres[i]) == null)
            return 0;
        node = node.map.get(pres[i]);
    }
    return node.path;
}


/**
 *  @author: Ragty
 *  @Date: 2020/2/27 0:50
 *  @Description: Pre-order traversal; prints each node's character on its own line
 */
public void preWalk(TrieNode root) {
    for (Map.Entry<Character, TrieNode> entry : root.map.entrySet()) {
        TrieNode node = entry.getValue();
        if (node != null) {
            System.out.println(entry.getKey());
            preWalk(node);
        }
    }
}


public TrieNode getRoot() {
    return root;
}
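preWalk prints single characters, so the stored words are not directly visible in its output. A minimal sketch of how whole words could be recovered is shown below: it accumulates the characters along each path and emits the prefix whenever status is true. It reuses the field names path, status, and map from the listings above, but the class name WordListingSketch, the nested Node class, and the methods words and collect are hypothetical. Two deliberate deviations from the article's code are assumptions of this sketch: children are kept in a TreeMap so the output order is deterministic, and insert skips words that are already present so that path counts distinct words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, self-contained restatement; not the article's TrieTree1.
class WordListingSketch {
    static final class Node {
        int path;        // number of distinct words passing through this node
        boolean status;  // true if a word ends here
        final Map<Character, Node> map = new TreeMap<>(); // sorted for stable output
    }

    final Node root = new Node();

    void insert(String word) {
        if (word == null || word.isEmpty() || search(word)) {
            return; // assumption of this sketch: duplicates are ignored
        }
        Node node = root;
        node.path++;
        for (char c : word.toCharArray()) {
            node = node.map.computeIfAbsent(c, k -> new Node());
            node.path++;
        }
        node.status = true;
    }

    boolean search(String word) {
        if (word == null) {
            return false;
        }
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.map.get(c);
            if (node == null) {
                return false;
            }
        }
        return node.status;
    }

    void delete(String word) {
        if (!search(word)) {
            return;
        }
        Node node = root;
        node.path--;
        for (char c : word.toCharArray()) {
            Node child = node.map.get(c);
            if (--child.path == 0) {
                node.map.remove(c); // nothing else uses this subtree
                return;
            }
            node = child;
        }
        node.status = false; // node survives for longer words; clear the flag
    }

    // Collect every stored word by concatenating characters along each path.
    List<String> words() {
        List<String> out = new ArrayList<>();
        collect(root, new StringBuilder(), out);
        return out;
    }

    private void collect(Node node, StringBuilder prefix, List<String> out) {
        if (node.status) {
            out.add(prefix.toString());
        }
        for (Map.Entry<Character, Node> e : node.map.entrySet()) {
            prefix.append(e.getKey());
            collect(e.getValue(), prefix, out);
            prefix.setLength(prefix.length() - 1); // backtrack
        }
    }

    public static void main(String[] args) {
        WordListingSketch trie = new WordListingSketch();
        trie.insert("字典树");
        trie.insert("字典书");
        trie.insert("字典");
        trie.insert("天气");
        System.out.println(trie.words()); // [天气, 字典, 字典书, 字典树]
        trie.delete("字典");
        System.out.println(trie.words()); // [天气, 字典书, 字典树]
    }
}
```

The second line of output shows why delete has to clear status on a surviving node: "字典" disappears while the longer words that share its path remain.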

7. Test:

public static void main(String[] args) {
    TrieTree1 trieTree = new TrieTree1();

    trieTree.insert("字典树");
    trieTree.insert("字典书");
    trieTree.insert("字典");
    trieTree.insert("天气");
    trieTree.insert("气人");

    System.out.println(trieTree.search("字典"));
    System.out.println(trieTree.search("字"));
    System.out.println(trieTree.prefixNumber("字典树"));

    TrieNode root = trieTree.getRoot();

    trieTree.preWalk(root);
}

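8. Test output:

preWalk prints one character per line; the traversal is shown here joined with -- to save space. Its order follows HashMap iteration order.

true
false
1
气--人--字--典--树--书--天--气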

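9. Algorithm analysis:

For a dictionary of size n, the worst-case complexity of a trie whose children are located by binary search can still be described as O(log n). In practice, however, it is faster than binary search over whole strings: matching advances along the path one character at a time, so the prefix that has already been matched is never compared again, which saves a great deal of comparison time. With the HashMap-based nodes used in this article, each step is an expected O(1) map lookup, so a query takes time proportional to the length of the word.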

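10. Possible improvement:

In the implementation above, every step of a query still has to locate the character among the children of the current node. If each character is instead converted to a hash value, with the hash function producing integers in the range [0, 65535], that value can be used directly as an array index to reach the node for the character. This approach only suits the first level, though; applying it at every level would make memory grow exponentially with depth. Deeper levels can keep their children in sorted arrays and be searched with binary search.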
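One way the improvement above might look is sketched here, under stated assumptions. A Java char is a 16-bit value, so it can be used directly as an index in [0, 65535] without a separate hash function. The first level is a 65536-slot array indexed by the first character; every deeper node keeps its children in a sorted char array searched with Arrays.binarySearch. The names HybridTrieSketch, Node, keys, children, child, and childOrCreate are hypothetical and do not come from the article.

```java
import java.util.Arrays;

// Hypothetical sketch of the "direct index first level, binary search below" idea.
class HybridTrieSketch {
    static final class Node {
        char[] keys = new char[0];     // child characters, kept sorted
        Node[] children = new Node[0]; // children[i] belongs to keys[i]
        boolean status;                // true if a word ends here

        // Binary search among the sorted child characters.
        Node child(char c) {
            int i = Arrays.binarySearch(keys, c);
            return i >= 0 ? children[i] : null;
        }

        // Find the child for c, inserting it at its sorted position if absent.
        Node childOrCreate(char c) {
            int i = Arrays.binarySearch(keys, c);
            if (i >= 0) {
                return children[i];
            }
            int pos = -(i + 1); // insertion point reported by binarySearch
            char[] newKeys = new char[keys.length + 1];
            Node[] newChildren = new Node[children.length + 1];
            System.arraycopy(keys, 0, newKeys, 0, pos);
            System.arraycopy(children, 0, newChildren, 0, pos);
            newKeys[pos] = c;
            newChildren[pos] = new Node();
            System.arraycopy(keys, pos, newKeys, pos + 1, keys.length - pos);
            System.arraycopy(children, pos, newChildren, pos + 1, children.length - pos);
            keys = newKeys;
            children = newChildren;
            return newChildren[pos];
        }
    }

    // First level only: one slot per possible char value, accessed by index.
    private final Node[] first = new Node[65536];

    void insert(String word) {
        if (word == null || word.isEmpty()) {
            return;
        }
        char c0 = word.charAt(0);
        if (first[c0] == null) {
            first[c0] = new Node();
        }
        Node node = first[c0];
        for (int i = 1; i < word.length(); i++) {
            node = node.childOrCreate(word.charAt(i));
        }
        node.status = true;
    }

    boolean search(String word) {
        if (word == null || word.isEmpty()) {
            return false;
        }
        Node node = first[word.charAt(0)]; // direct index, no comparison
        for (int i = 1; node != null && i < word.length(); i++) {
            node = node.child(word.charAt(i));
        }
        return node != null && node.status;
    }

    public static void main(String[] args) {
        HybridTrieSketch trie = new HybridTrieSketch();
        trie.insert("字典树");
        trie.insert("字典书");
        trie.insert("天气");
        System.out.println(trie.search("字典树")); // true
        System.out.println(trie.search("字典"));   // false: prefix only
        System.out.println(trie.search("气人"));   // false: never inserted
    }
}
```

The first-level array costs a fixed 65536 references up front, which is the trade-off the paragraph above describes: affordable once at the root, but not at every node.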
