几种Trie树性能比较

最新推荐文章于 2024-07-06 11:24:09 发布

Aiphis7

最新推荐文章于 2024-07-06 11:24:09 发布

阅读量1.9k

点赞数

分类专栏：数据结构与算法

数据结构与算法专栏收录该内容

90 篇文章 0 订阅

订阅专栏

最近正在做一个自己的NLP库，刚起步的第一个问题就是字典的储存与查询。毫无疑问，最佳的数据结构是Trie树，同时为了平衡效率和空间，决定使用双数组Trie树。现在的问题是，双数组Trie树是一个压缩的Trie树，在插入的时候需要递归调整base与check中的冲突，无异于重新压缩一次。所以在动态的词典场景下，需要一个支持快速插入、同时查询速度还行的结构。我调查了很多开源实现，发现“首字直接分配内存，之后用动态数组二分”与“Radix and Crit Bit Tree”两种比较有实用价值。在此对两种方法做个比较，同时与标准的DoubleArrayTrie树一起比较一下速度。

二分法采用的实现是ansj的tree_split，Radix and Crit Bit Tree采用的是patricia-trie，标准的DoubleArrayTrie树采用了Jada和darts（两个库都是日本人写的），另外digitalstain的Datrie简直太糟糕了在测试中指针越界，懒得看那坨shit一样的代码。

我用一个真实的词典做为输入，连续查找100000000次一个词来比较查询速度

测试代码如下：


  
  package com.hankcs;
 
import darts.DoubleArrayTrie;
import love.cq.domain.Forest;
import love.cq.library.Library;
import love.cq.splitWord.GetWord;
import net.reduls.jada.Trie;
import net.reduls.jada.TrieBuilder;
import org.ardverk.collection.PatriciaTrie;
import org.ardverk.collection.StringKeyAnalyzer;
import org.digitalstain.datrie.DoubleArrayTrieImpl;
import org.digitalstain.datrie.SearchResult;
import org.digitalstain.datrie.store.IntegerArrayList;
 
import java.io.*;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
 
public class Main
{
    static int testTimes = 100000000;
    public static void main(String[] args) throws Exception
    {
        List<String> wordList = new ArrayList<String>();
        BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("data/WordList.txt")));
        String line = null;
        while ((line = br.readLine()) != null)
        {
            wordList.add(line);
        }
        br.close();
//        Collections.sort(wordList);
//        BufferedWriter writer = new BufferedWriter(new FileWriter("data/WordList.txt", false));
//        for (String w : wordList)
//        {
//            writer.write(w);
//            writer.newLine();
//        }
//        writer.close();
        System.out.println("单词表大小:" + wordList.size());
        long start = System.currentTimeMillis();
        // 测试ansj的tire
        Forest forest = new Forest();
        for (String w : wordList)
        {
            Library.insertWord(forest, w);
        }
        System.out.println("ansj插入耗时:" + (System.currentTimeMillis() - start));
 
        start = System.currentTimeMillis();
        for (int i = 0; i < testTimes; ++i)
        {
            if (forest.getWord("好好").getAllWords().equals("好好") == false)
            {
                throw new RuntimeException("ansj没找到该有的词");
            }
        }
        System.out.println("ansj查询耗时:" + (System.currentTimeMillis() - start));
 
        start = System.currentTimeMillis();
        PatriciaTrie<String, String> trie = new PatriciaTrie<String, String>(StringKeyAnalyzer.CHAR);
        for (String w : wordList)
        {
            trie.put(w, w);
        }
        System.out.println("PatriciaTrie插入耗时:" + (System.currentTimeMillis() - start));
 
        start = System.currentTimeMillis();
        for (int i = 0; i < testTimes; ++i)
        {
            if (trie.containsKey("好好") == false)
            {
                throw new RuntimeException("PatriciaTrie没找到该有的词");
            }
        }
        System.out.println("PatriciaTrie查询耗时:" + (System.currentTimeMillis() - start));
 
        start = System.currentTimeMillis();
        TrieBuilder builder = new TrieBuilder(wordList);
        Trie trieJada = builder.build();
        System.out.println("Jada插入耗时:" + (System.currentTimeMillis() - start));
 
        start = System.currentTimeMillis();
        for (int i = 0; i < testTimes; ++i)
        {
            if (trieJada.search("好好") == -1)
            {
                throw new RuntimeException("Jada没找到该有的词");
            }
        }
        System.out.println("Jada查询耗时:" + (System.currentTimeMillis() - start));
 
        start = System.currentTimeMillis();
        DoubleArrayTrie dat = new DoubleArrayTrie();
        dat.build(wordList);
        System.out.println("darts插入耗时:" + (System.currentTimeMillis() - start));
 
        start = System.currentTimeMillis();
        for (int i = 0; i < testTimes; ++i)
        {
            if (dat.exactMatchSearch("加油") == -1)
            {
                throw new RuntimeException("darts没找到该有的词");
            }
        }
        System.out.println("darts查询耗时:" + (System.currentTimeMillis() - start));
 
        // digitalstainDatrie is a piece of shit!
//        start = System.currentTimeMillis();
//        DoubleArrayTrieImpl digitalstainDatrie = new DoubleArrayTrieImpl(wordList.size());
//        digitalstainDatrie.add(wordList);
//        System.out.println("digitalstainDatrie插入耗时:" + (System.currentTimeMillis() - start));
//
//        start = System.currentTimeMillis();
//        for (int i = 0; i < testTimes; ++i)
//        {
//            if (digitalstainDatrie.containsPrefix(new IntegerArrayList("加油")) != SearchResult.PERFECT_MATCH)
//            {
//                throw new RuntimeException("digitalstainDatrie没找到该有的词");
//            }
//        }
//        System.out.println("darts查询耗时:" + (System.currentTimeMillis() - start));
    }
}

输出：


  
  单词表大小:275909
ansj插入耗时:189
ansj查询耗时:7756
PatriciaTrie插入耗时:156
PatriciaTrie查询耗时:13960
Jada插入耗时:1711
Jada查询耗时:3463
darts插入耗时:1444
darts查询耗时:1479

插入耗时

可见在20万词的词典插入中，二分法和基数树表现得非常突出，双数组Trie树因为要压缩（甚至解决冲突，如Jada）就多花了一个数量级的时间。不过这也没什么大不了的，因为分词词典就加载一次。

查询耗时

这才是重头戏，在查询中双数组Trie树表现得非常突出，二分法次之，而基数树则是二分法的两倍。

结论

二分法小巧实用。

Aiphis7

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
复制链接

分享到 QQ

分享到新浪微博

扫一扫

专栏目录