Trie树（前缀树）的实现和应用

码农随笔

已于 2022-03-14 12:12:52 修改

阅读量115

点赞数

分类专栏：数据结构算法文章标签：算法数据结构

于 2020-08-25 18:46:42 首次发布

本文链接：https://blog.csdn.net/ycb1914845451/article/details/108227066

版权

算法同时被 2 个专栏收录

34 篇文章 1 订阅

订阅专栏

数据结构

29 篇文章 1 订阅

订阅专栏

给定500万个单词，如何实现如下两个问题？

1、如何快速判断某个单词是否在给定的单词中？
2、如何快速的判断给定前缀有多少个单词？

一、树的构建

先思考一个问题，现在有两个单词，interest 和 interesting，会怎么存储？

其实如果数据量比较小时，用什么方式存储都可以，但是当数据量比较大或者涉及到快速查找，统计分析，网络传输等需求时，数据的存储方式，数据结构的实现等等就会变得很重要，这直接关乎到后续业务处理获取数据的效率。正如题目开始时说的，假如有500万个单词，该怎么存储的问题。那假如其中的两个单词是 interest 和 interesting，其实我们可以发现单词interest 和 interesting大部分是相同的，那对于相同的部分，我们可否共用同样的存储空间？其实是可以的，我们可以像下面这样进行表示

计算机领域，对于大数据量的存储，数据之间有高度的相似性时，可以把相同的部分抽取出来进行单独存储，这样可以节省一些空间，感觉设计模式中的 享元模式 就差不多是这个道理。

前面对单词interest 和 interesting的存储方案，可以把单词中重复的部分单独抽取出来进行共用（如interest），剩下的部分再进行单独的存储（如ing）

那假如现在又来一个单词 interested，我们可以继续去树中进行查找，查找公共的最长前缀，做相应的处理。处理后如下图所示。我们可以看到三个单词 interest 、 interesting interested 复用了大量的存储空间。

那假如现在又来了个单词 inside，我们可以按照相同的逻辑进行处理。找出最长前缀，进行相应的拆分处理。我们假设 in 不在500万单词中，那找到一个节点时如何确定是已经找到了单词几点，还是前缀，其实我们可以加标记进行区分，图中用不同的形状（圆形）表示非单词几点，矩形表示单词节点。具体实现时可以在节点中增加属性来进行区分。

那假如再来一个单词 apple ,现在apple没有与已存单词共用的前缀，则从根节点开始，另开一个分支，如下图所示。

二、查找（判断单词是否在其中）

假如树现在已经构造好了，假如目前就 5 个单词组成的树，如下所示

（1）查找单词 interested 是否在字典中

从跟开始，逐步遍历即可，遍历路劲如下图

（2）查找单词 internal ，根据如下路劲进行查找，查找找到 in，还剩余 ternal ，发现没有以 t / te / ter /tern /terna / ternal 为前缀的，则查询结束。表示查不到。

三、判断以某个字符串为前缀的单词有多少个

其实这个比较好搞，如果我们简单的只是需要知道有多少个，我们可以在单词进入字典树时就计算好，判断时直接获取就好，但是如果我们需要知道详细的明细，则需要进行整颗树的搜索。如下的需求，想获取以inter开头的并进行展示。

结合上面分析，代码如下：

package com.Ycb.tree;

import java.util.HashMap;

/**
 * 字典树
 */
public class DictionaryTree {
    /**
     * 字典树节点
     */
    private class Node {
        /**
         * 节点所代表的字符
         */
        private String str;

        //节点下单词数量
        private int count;

        //父节点
        private Node parent;

        //节点是否表示单词
        private boolean word;

        //孩子节点
        private HashMap<String, Node> childs;

        public Node() {
            this.childs = new HashMap<>();
        }

        public Node(String str, int count, boolean word) {
            this();
            this.count = count;
            this.word = word;
            this.str = str;
        }

        /**
         * 添加子节点，同时设置子节点的parent
         *
         * @param key
         * @param node
         */
        public void addChild(String key, Node node) {
            this.childs.put(key, node);
            node.parent = this;
        }

        /**
         * 移除子节点
         *
         * @param key
         */
        public void removeChild(String key) {
            //移除子节点
            this.childs.remove(key);
        }

        @Override
        public String toString() {
            return "Node{" + "str='" + str + '\'' + ", count=" + count + ", word=" + word + '}';
        }
    }

    //用一个哨兵节点表示根节点
    private Node root;

    public DictionaryTree() {
        //初始化哨兵跟节点
        root = new Node();
    }

    /**
     * 把word添加到字典树中
     *
     * @param word 待添加的单词
     */
    public void add(String word) {

        //最开始时从根节点开始进行查找插入
        addStr(word, root);
    }

    /**
     * 往node节点中插入word单词
     *
     * @param word 待插入的字串
     * @param node 待插入到的节点
     */
    private void addStr(String word, Node node) {
        node.count++;

        String str = node.str;
        if (str != null) {
            //寻找公共前缀
            String commonPrefix = "";
            for (int i = 0; i < word.length(); i++) {
                if (str.length() > i && word.charAt(i) == str.charAt(i)) {
                    commonPrefix += word.charAt(i);
                } else {
                    break;
                }
            }

            //找到公共前缀
            if (commonPrefix.length() > 0) {
                if (commonPrefix.length() == str.length() && commonPrefix.length() == word.length()) {
                    //与之前的元素重复
                    //需要递归向上使单词数减一
                    decreaseWordCount(node);
                } else if (commonPrefix.length() == str.length() && commonPrefix.length() < word.length()) {
                    //剩余的串
                    String wordLeft = word.substring(commonPrefix.length());

                    //继续去子节点中搜索
                    searchChild(wordLeft, node);
                } else if (commonPrefix.length() < str.length()) {
                    // 节点裂变
                    Node splitNode = new Node(commonPrefix, node.count, true);
                    // 处理裂变节点的父关系
                    splitNode.parent = node.parent;
                    splitNode.parent.addChild(commonPrefix, splitNode);
                    node.parent.removeChild(node.str);
                    node.count--;

                    // 节点裂变后的剩余字串
                    String strLeft = str.substring(commonPrefix.length());
                    node.str = strLeft;
                    splitNode.addChild(strLeft, node);
                    // 单词裂变后的剩余字串
                    if (commonPrefix.length() < word.length()) {
                        splitNode.word = false;
                        String wordLeft = word.substring(commonPrefix.length());
                        Node leftNode = new Node(wordLeft, 1, true);
                        splitNode.addChild(wordLeft, leftNode);
                    }
                }
            } else {
                //没有公共前缀，直接添加节点
                Node newNode = new Node(word, 1, true);
                node.addChild(word, newNode);
            }
        } else {
            //node代表跟节点
            if (node.childs.size() > 0) {
                //如果node已经有孩子节点，需要去孩子节点中继续查找
                searchChild(word, node);
            } else {
                //如果根节点还没有任何孩子，则直接添加
                Node newNode = new Node(word, 1, true);
                node.addChild(word, newNode);
            }
        }
    }

    /**
     * 在子节点中进行查找
     *
     * @param wordLeft
     * @param node
     */
    private void searchChild(String wordLeft, Node node) {
        boolean isFind = false;

        if (node.childs.size() > 0) {
            for (String childKey : node.childs.keySet()) {
                Node childNode = node.childs.get(childKey);
                if (wordLeft.charAt(0) == childNode.str.charAt(0)) {
                    isFind = true;
                    addStr(wordLeft, childNode);
                    break;
                }
            }
        }
        //如果没有搜索到
        if (isFind == false) {
            Node newNode = new Node(wordLeft, 1, true);
            node.addChild(wordLeft, newNode);
        }
    }

    /**
     * 查找单词是否在字典树中
     *
     * @param word 待查找的单词
     * @return
     */
    public boolean find(String word) {
        return findStr(word, root);
    }

    private boolean findStr(String word, Node node) {
        boolean isMatch = true;
        String wordLeft = word;
        String str = node.str;
        if (null != str) {
            // 字串与单词不匹配
            if (word.indexOf(str) != 0) {
                isMatch = false;
            } else {
                // 匹配，则计算剩余单词
                wordLeft = word.substring(str.length());
            }
        }
        // 如果匹配
        if (isMatch) {
            // 如果还有剩余单词长度
            if (wordLeft.length() > 0) {
                // 遍历孩子继续找
                for (String key : node.childs.keySet()) {
                    Node childNode = node.childs.get(key);
                    boolean isChildFind = findStr(wordLeft, childNode);
                    if (isChildFind) {
                        return true;
                    }
                }
                return false;
            } else {
                // 没有剩余单词长度，说明已经匹配完毕，直接返回节点是否为单词
                return node.word;
            }
        }
        return false;
    }

    // 统计子节点字串单词数
    private int countChildStr(String prefix, Node node) {
        // 遍历孩子
        for (String key : node.childs.keySet()) {
            Node childNode = node.childs.get(key);
            // 匹配子节点
            int childCount = countStr(prefix, childNode);
            if (childCount != 0) {
                return childCount;
            }
        }
        return 0;
    }

    // 统计字串单词数
    private int countStr(String prefix, Node node) {
        String str = node.str;
        // 非根结点
        if (null != str) {
            // 前缀与字串不匹配
            if (prefix.indexOf(str) != 0 && str.indexOf(prefix) != 0) {
                return 0;
                // 前缀匹配字串，且前缀较短
            } else if (str.indexOf(prefix) == 0) {
                // 找到目标节点，返回单词数
                return node.count;
                // 前缀匹配字串，且字串较短
            } else if (prefix.indexOf(str) == 0) {
                // 剩余字串继续匹配子节点
                String prefixLeft = prefix.substring(str.length());
                if (prefixLeft.length() > 0) {
                    return countChildStr(prefixLeft, node);
                }
            }
        } else {
            // 根结点，直接找其子孙
            return countChildStr(prefix, node);
        }
        return 0;
    }

    // 统计前缀单词数
    public int count(String prefix) {
        // 处理特殊情况
        if (null == prefix || prefix.trim().isEmpty()) {
            return root.count;
        }
        // 从根结点往下匹配
        return countStr(prefix, root);
    }

    // 打印节点
    private void printNode(Node node, int layer) {
        // 层级递进
        for (int i = 0; i < layer; i++) {
            System.out.print("    ");
        }
        // 打印
        System.out.println(node);
        // 递归打印子节点
        for (String str : node.childs.keySet()) {
            Node child = node.childs.get(str);
            printNode(child, layer + 1);
        }
    }

    // 打印字典树
    public void print() {
        printNode(root, 0);
    }

    /**
     * 使单词数减1
     *
     * @param node
     */
    private void decreaseWordCount(Node node) {
        while (node != null) {
            node.count--;
            node = node.parent;
        }
    }
}

码农随笔

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
Trie树（前缀树）的实现和应用

有500万单词，有如下需求1、如何快速判断某个单词是否在给定的单词中2、如何快速的判断给定前缀有多少个单词package com.Ycb.tree;import java.util.HashMap;/** * 字典树 */public class DictionaryTree { /** * 字典树节点 */ private class Node { /** * 节点所代表的字符 */
复制链接

扫一扫

专栏目录