[Data Structures] A First Look at the Trie (Trie Tree) and Its Implementation

SnailMann

Published on 2021-04-14 13:07:17


Category: Data Structures. Tags: data structures, trie, prefix tree, sensitive-word filtering

Article link: https://blog.csdn.net/snailmann/article/details/115493906


A First Look at the Prefix Tree / Trie (Trie Tree) and Its Implementation


Background concepts
- What is a trie?
- Pros and cons of a trie
Implementation
- Constraints and requirements
- Code implementation

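Background concepts

What is a trie?

A trie (Trie Tree), also called a prefix tree or word-lookup tree, is typically used to count, sort, and store large numbers of strings, which is why search engine systems often use it for word-frequency statistics on text. Its core advantage is that it exploits the common prefixes of strings to cut query time and avoid as many needless string comparisons as possible.
Trie - @Baidu Baike

Features

A trie is a multi-way tree
The root node holds no character; every node other than the root holds exactly one character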
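To make the two features concrete, here is a minimal sketch (a hypothetical `PrefixSharingDemo` class, not the article's `TrieTree`) that counts how many nodes are created when words share a prefix. It assumes lowercase [a-z] input only.

```java
// Hypothetical demo class; not part of the article's TrieTree.
class PrefixSharingDemo {

    // Minimal node: one child slot per letter in [a-z].
    static final class Node {
        final Node[] next = new Node[26];
        boolean end;
    }

    final Node root = new Node(); // the root holds no character
    int nodeCount = 0;            // nodes created so far, root excluded

    void insert(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            int index = word.charAt(i) - 'a';
            if (node.next[index] == null) {
                node.next[index] = new Node();
                nodeCount++; // a new node is needed only where the prefix is not shared yet
            }
            node = node.next[index];
        }
        node.end = true;
    }

    public static void main(String[] args) {
        PrefixSharingDemo trie = new PrefixSharingDemo();
        String[] words = {"hello", "help", "held"};
        int totalChars = 0;
        for (String w : words) {
            trie.insert(w);
            totalChars += w.length();
        }
        // 13 characters are inserted, but "hel" is stored once: 7 nodes.
        System.out.println(totalChars + " characters, " + trie.nodeCount + " nodes");
    }
}
```

The three words contain 13 characters in total, yet only 7 non-root nodes exist because the prefix "hel" is stored once.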

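Use cases

K/V storage
Sensitive-word filtering
- Filtering sensitive words with a trie can save space compared with a hash table, because words that share a prefix share nodes
Word-frequency counting
- Likewise, it can be more space-efficient than a hash table when many words share prefixes
Lexicographic sorting
- Pre-order traversal of the trie
Prefix matching
- Search engines, search suggestions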
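As an illustration of the lexicographic-sorting use case, the following sketch (hypothetical `TrieSortDemo` class, assuming [a-z] input) collects words with a pre-order traversal that visits children from 'a' to 'z'. Note one assumption of this sketch: duplicate words collapse into a single entry.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of the article's TrieTree.
class TrieSortDemo {

    static final class Node {
        final Node[] next = new Node[26];
        boolean end;
    }

    final Node root = new Node();

    void insert(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            int index = word.charAt(i) - 'a';
            if (node.next[index] == null) {
                node.next[index] = new Node();
            }
            node = node.next[index];
        }
        node.end = true;
    }

    // Returns the stored words in lexicographic order.
    List<String> sorted() {
        List<String> out = new ArrayList<>();
        collect(root, new StringBuilder(), out);
        return out;
    }

    // Pre-order: handle the node itself first, then its children from 'a' to 'z'.
    private void collect(Node node, StringBuilder path, List<String> out) {
        if (node.end) {
            out.add(path.toString());
        }
        for (int i = 0; i < 26; i++) {
            if (node.next[i] != null) {
                path.append((char) ('a' + i));
                collect(node.next[i], path, out);
                path.deleteCharAt(path.length() - 1); // backtrack
            }
        }
    }

    public static void main(String[] args) {
        TrieSortDemo trie = new TrieSortDemo();
        for (String w : new String[] {"sdf", "abd", "ab", "hello", "abc"}) {
            trie.insert(w);
        }
        System.out.println(trie.sorted()); // prints [ab, abc, abd, hello, sdf]
    }
}
```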

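Pros and cons of a trie

The introduction above explained what a trie is. So what does a trie offer compared with other data structures?

Put bluntly, a trie is a special-purpose data structure that trades space for time; in the right scenarios it has a clear lookup advantage.

Sensitive-word filtering (n sensitive words)

Linear list: lookup time complexity O(n)
Binary tree: lookup time O(log n), assuming a balanced binary search tree
Hash table: lookup time O(1)
Trie: lookup time complexity O(h), where h is the word length
Usually h < n, so a trie lookup is far faster than a linear list and usually faster than a binary tree as well, though slightly slower than a hash table. Even that is not guaranteed: hash tables suffer from collisions, and under heavy collisions a hash table lookup is not necessarily faster than a trie (when the collision chain length or tree height exceeds the word length).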
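The sensitive-word comparison can be sketched as follows. This is a minimal, hypothetical `SensitiveWordFilterDemo` (the method names `addWord` and `containsSensitive` are assumptions, not from the article): for every start position in the text it walks the trie, so each attempt costs at most h steps regardless of how many words n are stored. Characters outside [a-z] simply end the current match attempt.

```java
// Hypothetical demo class; not part of the article's TrieTree.
class SensitiveWordFilterDemo {

    static final class Node {
        final Node[] next = new Node[26];
        boolean end;
    }

    final Node root = new Node();

    void addWord(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            int index = word.charAt(i) - 'a';
            if (node.next[index] == null) {
                node.next[index] = new Node();
            }
            node = node.next[index];
        }
        node.end = true;
    }

    // Returns true if any stored word occurs as a substring of text.
    boolean containsSensitive(String text) {
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            for (int i = start; i < text.length(); i++) {
                char c = text.charAt(i);
                if (c < 'a' || c > 'z') {
                    break; // outside the assumed alphabet: no match from this start
                }
                node = node.next[c - 'a'];
                if (node == null) {
                    break; // no stored word continues with this character
                }
                if (node.end) {
                    return true; // a complete sensitive word was matched
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        SensitiveWordFilterDemo filter = new SensitiveWordFilterDemo();
        filter.addWord("bad");
        filter.addWord("evil");
        System.out.println(filter.containsSensitive("thisisbadtext")); // true
        System.out.println(filter.containsSensitive("all good here")); // false
    }
}
```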

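Suggestions (autocomplete)

Linear list: must scan the whole list, O(n)
Binary tree: must scan the whole tree, O(n)
Hash table: hard to implement
Trie: walking the prefix costs O(h), and each further level along one candidate path checks up to y child slots, so following a path x characters deeper costs roughly O(h + x * y), where h is the word length, x is how many characters the suggestion extends, and y is the number of distinct characters. Enumerating every completion additionally scales with the number of nodes in the subtree under the prefix.
Suppose there are 1,000,000 words with an average length of h = 10, only 26 possible characters, and suggestions extended by just 3 characters. Following one such path then takes at most 10 + 3 * 26 = 88 checks, compared with on the order of 1,000,000 comparisons for a full scan of a linear list or a binary tree. That is a huge improvement, which is why suggestions are one of the trie's strongest use cases

Unsuitable scenarios

If the input strings are very long, say thousands or tens of thousands of characters each, a trie is not a good fit, because the tree becomes too deep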

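Implementation

Constraints and requirements

Constraints
We want a simple trie, so here are the constraints first

The trie stores only data made of the 26 characters [a-z]
Length of a single word: 1 < len <= 100

Requirements

Word-frequency count: count()
Existence check: contains()
Whether any word starts with xxx: startsWith()
Related-word suggestions: search()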

Code implementation

Trie node definition

    public static class TrieNode {
    
        private boolean end;
        private char data;
        private int count;
        private TrieNode[] next = new TrieNode[26];

        public TrieNode(char data) {
            this.data = data;
        }
    }

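end: whether the path from the root to this node forms a word
data: the character stored in this node, [a-z]
count: meaningful only when end = true; the number of times the word has been inserted into the trie
next: pointers to the 26 child nodes
- Why 26 child nodes? Because the character following a node has 26 possibilities, and because we use an array (static, fixed length), we have to allocate the 26-slot child pointer array up front
- Without an array, we could instead use a dynamic array, a List, or even a HashMap or TreeMap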
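The map-based alternative mentioned in the last bullet might look like the sketch below. `MapTrieDemo` and its members are hypothetical names; it uses a `TreeMap` so children stay sorted by character, and a `HashMap` would work the same way without the ordering. A side effect of this design is that the [a-z] restriction disappears.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo class; not part of the article's TrieTree.
class MapTrieDemo {

    static final class Node {
        // Children are allocated on demand instead of 26 slots up front.
        final Map<Character, Node> next = new TreeMap<>();
        boolean end;
        int count;
    }

    final Node root = new Node();

    void insert(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            // Create the child only if it is missing, then step into it.
            node = node.next.computeIfAbsent(word.charAt(i), k -> new Node());
        }
        node.end = true;
        node.count++;
    }

    boolean contains(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.next.get(word.charAt(i));
            if (node == null) {
                return false;
            }
        }
        return node.end;
    }

    public static void main(String[] args) {
        MapTrieDemo trie = new MapTrieDemo();
        trie.insert("hello-world"); // '-' is fine here: no fixed alphabet
        trie.insert("Trie");
        System.out.println(trie.contains("hello-world")); // true
        System.out.println(trie.contains("hello"));       // false
    }
}
```

The trade-off is per-node map overhead and slower child lookup in exchange for not reserving 26 pointers in every node.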

Trie definition


public class TrieTree {

    /**
     * root node, a meaningless dummy node
     */
    private TrieNode root;

    public TrieTree() {
        root = new TrieNode('*');
    }
    
}

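The trie definition is very simple

A root node that stores no data
- Think of it as a dummy node with no data
- Its only job is to give us the entry point of the tree
A constructor that creates an empty trie, after which we can do whatever we like

Inserting a word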

    public void insert(String word) {
        TrieNode node = root;
        for (int i = 0; i < word.length(); i++) {
            int index = word.charAt(i) - 'a';
            if (node.next[index] == null) {
                node.next[index] = new TrieNode(word.charAt(i));
            }
            node = node.next[index];
        }
        node.end = true;
        ++node.count;
    }

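The insertion logic is also simple: walk the trie's node path one character at a time

Loop word.length() times, because if the trie contains the word, its path consists of word.length() + 1 nodes (the extra 1 is the root)
Use word.charAt(i) - 'a' to get the index of the character word.charAt(i) in the current node's array of 26 children
- The 26 characters [a-z] are in lexicographic order: 'a' - 'a' = 0, 'b' - 'a' = 1, so the character a is the next[0] node and b is the next[1] node
- With a dynamic container (for example a HashMap) we would not need to derive the index this way
During the iteration, check whether node.next[index] == null
- If true, the trie holds only part of word's characters and not the whole word, so we create the node and go on building nodes for the remaining characters
- If false, the trie already has the current character of word, so we keep iterating
When the loop ends, word either was already in the trie or has just been built. Then set end = true and count++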
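The index arithmetic can be isolated into a small helper. The sketch below is hypothetical (`IndexMappingDemo.toIndex` is not in the article); it adds a range check, since with the article's constraint any character outside [a-z] would otherwise produce an out-of-range array index.

```java
// Hypothetical demo class; not part of the article's TrieTree.
class IndexMappingDemo {

    // Maps 'a'..'z' to 0..25 and rejects every other character.
    static int toIndex(char c) {
        if (c < 'a' || c > 'z') {
            throw new IllegalArgumentException("only [a-z] is supported: " + c);
        }
        return c - 'a'; // char arithmetic: 'a' - 'a' = 0, 'b' - 'a' = 1, ...
    }

    public static void main(String[] args) {
        System.out.println(toIndex('a') + " " + toIndex('b') + " " + toIndex('z')); // 0 1 25
    }
}
```

Whether to throw, skip, or normalize (for example lowercasing) on invalid input is a design choice the article leaves open.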

Contains check

    public boolean contains(String word) {
        TrieNode node = root;
        for (int i = 0; i < word.length(); i++) {
            int index = word.charAt(i) - 'a';
            if (node.next[index] == null) {
                return false;
            }
            node = node.next[index];
        }
        return node.end;
    }

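The purpose of contains is to check whether word already exists in the trie. We only need to check that the trie has a node path spelling word and that the end flag of the final character node is true

Iterate word.length() times
Check whether the node for the current character is missing
- If so, the trie holds only part of the prefix and the full word is not in the trie, so return false immediately
- If not, keep iterating
After the iteration, the trie is known to contain the node path for word's characters, but not necessarily the word itself; so finally check whether the end flag of the final character node is true
- If true, the word exists
- If false, it does not. For example, the trie contains helloworld but not hello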
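The helloworld / hello case can be checked with a stripped-down sketch. `ContainsDemo` is a hypothetical stand-in that mirrors the article's insert and contains logic, assuming [a-z] input.

```java
// Hypothetical demo class mirroring the article's insert/contains logic.
class ContainsDemo {

    static final class Node {
        final Node[] next = new Node[26];
        boolean end;
    }

    final Node root = new Node();

    void insert(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            int index = word.charAt(i) - 'a';
            if (node.next[index] == null) {
                node.next[index] = new Node();
            }
            node = node.next[index];
        }
        node.end = true;
    }

    boolean contains(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            int index = word.charAt(i) - 'a';
            if (node.next[index] == null) {
                return false; // the path breaks off: only a partial prefix exists
            }
            node = node.next[index];
        }
        return node.end; // the path exists, but is it marked as a word?
    }

    public static void main(String[] args) {
        ContainsDemo trie = new ContainsDemo();
        trie.insert("helloworld");
        System.out.println(trie.contains("helloworld")); // true
        System.out.println(trie.contains("hello"));      // false: path exists, end is not set
        System.out.println(trie.contains("help"));       // false: path breaks at 'p'
    }
}
```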

Is there any word with prefix xxx

    public boolean startsWith(String prefix) {
        TrieNode node = root;
        for (int i = 0; i < prefix.length(); i++) {
            int index = prefix.charAt(i) - 'a';
            if (node.next[index] == null) {
                return false;
            }
            node = node.next[index];

        }
        return true;
    }

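startsWith follows the same logic as contains; the only difference is that after the traversal it does not need to check whether end is true

Word-frequency count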
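Because the two methods differ only in the final check, the shared traversal can be factored into one helper. The sketch below uses a hypothetical `walk` method in a hypothetical `SharedWalkDemo` class; it is one possible refactoring, not the article's code.

```java
// Hypothetical demo class; one way to share the traversal.
class SharedWalkDemo {

    static final class Node {
        final Node[] next = new Node[26];
        boolean end;
    }

    final Node root = new Node();

    void insert(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            int index = word.charAt(i) - 'a';
            if (node.next[index] == null) {
                node.next[index] = new Node();
            }
            node = node.next[index];
        }
        node.end = true;
    }

    // Follows s from the root; returns the last node, or null if the path breaks.
    private Node walk(String s) {
        Node node = root;
        for (int i = 0; i < s.length(); i++) {
            node = node.next[s.charAt(i) - 'a'];
            if (node == null) {
                return null;
            }
        }
        return node;
    }

    boolean contains(String word) {
        Node node = walk(word);
        return node != null && node.end; // path must exist AND be marked as a word
    }

    boolean startsWith(String prefix) {
        return walk(prefix) != null;     // path existence alone is enough
    }

    public static void main(String[] args) {
        SharedWalkDemo trie = new SharedWalkDemo();
        trie.insert("helloworld");
        System.out.println(trie.startsWith("hello")); // true
        System.out.println(trie.contains("hello"));   // false
    }
}
```

One edge case carried over from the loop-based version: an empty prefix makes `startsWith` return true even for an empty trie, because the walk never leaves the root.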

    public int count(String word) {
        TrieNode node = tailNode(word);
        if (node == null) {
            return 0;
        }

        return node.count;
    }

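Every word's final character node has an end field marking whether it forms a word. When end is true, count tells how many times the word has occurred. So we only need to walk to the word's final character node and read count (tailNode is the private helper defined in the complete code at the end of the article)

Of course, while walking we also have to check whether the trie contains the word at all, so count() effectively combines contains() with reading count. If the word does not exist we return 0, meaning it never occurred; otherwise we return the node's count value

Word frequency can in fact also be used to test existence: a frequency of 0 means the word never occurred, that is, it does not exist

Automatic suggestions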
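A small sketch of word-frequency counting over a sample text, using a hypothetical `WordCountDemo` class with the same count-on-the-tail-node idea. The input text is made up for illustration and is assumed to be lowercase words separated by single spaces.

```java
// Hypothetical demo class; not part of the article's TrieTree.
class WordCountDemo {

    static final class Node {
        final Node[] next = new Node[26];
        int count; // > 0 only on nodes that end a word
    }

    final Node root = new Node();

    void insert(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            int index = word.charAt(i) - 'a';
            if (node.next[index] == null) {
                node.next[index] = new Node();
            }
            node = node.next[index];
        }
        node.count++;
    }

    // Returns how many times word was inserted; 0 if it never was.
    int count(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.next[word.charAt(i) - 'a'];
            if (node == null) {
                return 0; // the path breaks off: the word never occurred
            }
        }
        return node.count; // 0 for pure prefixes such as "a"
    }

    public static void main(String[] args) {
        WordCountDemo trie = new WordCountDemo();
        for (String w : "ab ab abc ab abd".split(" ")) {
            trie.insert(w);
        }
        System.out.println(trie.count("ab"));  // 3
        System.out.println(trie.count("abc")); // 1
        System.out.println(trie.count("a"));   // 0
        System.out.println(trie.count("xyz")); // 0
    }
}
```

In this sketch `count > 0` doubles as the end-of-word marker, which is the observation made in the last paragraph above.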

    public List<String> search(String prefix) {
        TrieNode node = tailNode(prefix);
        if (node == null) {
            return Collections.emptyList();
        }

        List<String> words = new ArrayList<>();
        for (TrieNode next : node.next) {
            if (next != null) {
                String word = prefix + next.data;
                if (next.end) {
                    words.add(word);
                }
                List<String> relates = search(word);
                if (relates != null && relates.size() != 0) {
                    words.addAll(relates);
                }
            }
        }
        return words;
    }

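The suggestion method is relatively complex, but still not hard. Since this is a simplified trie we do not want to overcomplicate it: we simply collect every word that continues the prefix. In many real applications we would only extend the suggestion by 0~3 characters, or return only the first x suggestions, rather than find every continuation

As before, we first find the final character node of prefix
- If it does not exist, the trie has no node path for prefix at all, let alone suggestions
- If it exists, continue
Treat the final character node of prefix as the relative root of a trie subtree, then recurse over its 26 branches to find every word that can be formed in the subtree
- Collect them into a single words list (the prefix itself is not added, even if it is a word)

That gives the result we want. If you really only need suggestions that extend the prefix by 3 characters, that is not hard either: limit the depth of the subtree recursion. Pass a depth limit into the recursion, decrement it by 1 per level, and return immediately when it reaches 0.

Complete code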
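The depth limit described above could be sketched like this. `DepthLimitedSearchDemo`, the `maxDepth` parameter, and the `collect` helper are assumptions for illustration; unlike the article's `search`, this version walks the subtree with a `StringBuilder` instead of re-resolving the prefix from the root on every recursive call.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; a depth-limited variant of the suggestion search.
class DepthLimitedSearchDemo {

    static final class Node {
        final Node[] next = new Node[26];
        boolean end;
    }

    final Node root = new Node();

    void insert(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            int index = word.charAt(i) - 'a';
            if (node.next[index] == null) {
                node.next[index] = new Node();
            }
            node = node.next[index];
        }
        node.end = true;
    }

    // Words that extend prefix by 1..maxDepth characters (prefix itself excluded).
    List<String> search(String prefix, int maxDepth) {
        Node node = root;
        for (int i = 0; i < prefix.length(); i++) {
            node = node.next[prefix.charAt(i) - 'a'];
            if (node == null) {
                return new ArrayList<>(); // no path for the prefix, so no suggestions
            }
        }
        List<String> words = new ArrayList<>();
        collect(node, new StringBuilder(prefix), maxDepth, words);
        return words;
    }

    private void collect(Node node, StringBuilder path, int depthLeft, List<String> words) {
        if (depthLeft == 0) {
            return; // limit reached: do not descend further
        }
        for (int i = 0; i < 26; i++) {
            Node child = node.next[i];
            if (child == null) {
                continue;
            }
            path.append((char) ('a' + i));
            if (child.end) {
                words.add(path.toString());
            }
            collect(child, path, depthLeft - 1, words); // one level deeper, one less allowed
            path.setLength(path.length() - 1);          // backtrack
        }
    }

    public static void main(String[] args) {
        DepthLimitedSearchDemo trie = new DepthLimitedSearchDemo();
        for (String w : new String[] {"helloworld", "helloworlde", "helloworldas", "hellocool"}) {
            trie.insert(w);
        }
        System.out.println(trie.search("helloworld", 1)); // [helloworlde]
        System.out.println(trie.search("helloworld", 2)); // [helloworldas, helloworlde]
        System.out.println(trie.search("hello", 3));      // []
    }
}
```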

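import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/**
 * Prefix tree / trie
 *
 * @author snailmann
 */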
public class TrieTree {

    /**
     * root node, a meaningless dummy node
     */
    private TrieNode root;

    public TrieTree() {
        root = new TrieNode('*');
    }

    /**
     * Insert a word [a-z]
     *
     * @param word the word
     */
    public void insert(String word) {
        TrieNode node = root;
        for (int i = 0; i < word.length(); i++) {
            int index = word.charAt(i) - 'a';
            if (node.next[index] == null) {
                node.next[index] = new TrieNode(word.charAt(i));
            }
            node = node.next[index];
        }
        node.end = true;
        ++node.count;
    }

    /**
     * Whether the trie contains the word
     *
     * @param word the word
     * @return true if the word is present, false otherwise
     */
    public boolean contains(String word) {
        TrieNode node = root;
        for (int i = 0; i < word.length(); i++) {
            int index = word.charAt(i) - 'a';
            if (node.next[index] == null) {
                return false;
            }
            node = node.next[index];
        }
        return node.end;
    }

    /**
     * Whether any word has the given prefix
     *
     * @param prefix the prefix
     * @return true if such a word exists, false otherwise
     */
    public boolean startsWith(String prefix) {
        TrieNode node = root;
        for (int i = 0; i < prefix.length(); i++) {
            int index = prefix.charAt(i) - 'a';
            if (node.next[index] == null) {
                return false;
            }
            node = node.next[index];

        }
        return true;
    }

    /**
     * Word-frequency count
     *
     * @param word the word
     * @return number of times the word has been inserted
     */
    public int count(String word) {
        TrieNode node = tailNode(word);
        if (node == null) {
            return 0;
        }

        return node.count;
    }

    /**
     * Word suggestions, implemented with multi-way recursion
     *
     * @param prefix the prefix
     * @return list of suggested words
     */
    public List<String> search(String prefix) {
        TrieNode node = tailNode(prefix);
        if (node == null) {
            return Collections.emptyList();
        }

        List<String> words = new ArrayList<>();
        for (TrieNode next : node.next) {
            if (next != null) {
                String word = prefix + next.data;
                if (next.end) {
                    words.add(word);
                }
                List<String> relates = search(word);
                if (relates != null && relates.size() != 0) {
                    words.addAll(relates);
                }
            }
        }

        return words;
    }

    /**
     * Get the tail node of a word
     *
     * @param word the word
     * @return null if the node does not exist
     */
    private TrieNode tailNode(String word) {
        TrieNode node = root;
        for (int i = 0; i < word.length(); i++) {
            int index = word.charAt(i) - 'a';
            if (node.next[index] == null) {
                return null;
            }
            node = node.next[index];
        }
        return node;
    }

    @Override
    public String toString() {
        String prefix = "";
        TrieNode node = root;
        if (node == null) {
            return "";
        }
        return search(prefix).toString();
    }

    public static class TrieNode {

        /**
         * Whether the path to this node forms a word
         */
        private boolean end;

        /**
         * Character data
         */
        private char data;

        /**
         * Number of times the word occurs
         */
        private int count;

        /**
         * 26 subtrees [a-z]
         */
        private TrieNode[] next = new TrieNode[26];

        public TrieNode(char data) {
            this.data = data;
        }

        @Override
        public String toString() {
            return "TrieNode{" +
                    "end=" + end +
                    ", data=" + data +
                    ", count=" + count +
                    ", next=" + Arrays.toString(next) +
                    '}';
        }
    }


    public static void main(String[] args) {
        TrieTree trieTree = new TrieTree();
        trieTree.insert("helloworld");
        trieTree.insert("helloworlde");
        trieTree.insert("helloworldas");
        trieTree.insert("hellocool");
        trieTree.insert("sdf");
        trieTree.insert("ab");
        trieTree.insert("abc");
        trieTree.insert("abd");

        System.out.println(trieTree.contains("helloworld"));
        System.out.println(trieTree.startsWith("hello"));

        System.out.println(trieTree.count("helloworld"));
        System.out.println(trieTree.count("hello"));
        System.out.println(trieTree);
    }
}