最近正在做一个自己的NLP库,刚起步的第一个问题就是字典的储存与查询。毫无疑问,最佳的数据结构是Trie树,同时为了平衡效率和空间,决定使用双数组Trie树。现在的问题是,双数组Trie树是一个压缩的Trie树,在插入的时候需要递归调整base与check中的冲突,无异于重新压缩一次。所以在动态的词典场景下,需要一个支持快速插入、同时查询速度还行的结构。我调查了很多开源实现,发现“首字直接分配内存,之后用动态数组二分”与“Radix and Crit Bit Tree”两种比较有实用价值。在此对两种方法做个比较,同时与标准的DoubleArrayTrie树一起比较一下速度。
二分法采用的实现是ansj的tree_split,Radix and Crit Bit Tree采用的是patricia-trie,标准的DoubleArrayTrie树采用了Jada和darts(两个库都是日本人写的),另外digitalstain的Datrie简直太糟糕了在测试中指针越界,懒得看那坨shit一样的代码。
我用一个真实的词典做为输入,连续查找100000000次一个词来比较查询速度
测试代码如下:
- package com.hankcs;
-
- import darts.DoubleArrayTrie;
- import love.cq.domain.Forest;
- import love.cq.library.Library;
- import love.cq.splitWord.GetWord;
- import net.reduls.jada.Trie;
- import net.reduls.jada.TrieBuilder;
- import org.ardverk.collection.PatriciaTrie;
- import org.ardverk.collection.StringKeyAnalyzer;
- import org.digitalstain.datrie.DoubleArrayTrieImpl;
- import org.digitalstain.datrie.SearchResult;
- import org.digitalstain.datrie.store.IntegerArrayList;
-
- import java.io.*;
- import java.util.ArrayList;
- import java.util.Collections;
- import java.util.List;
-
- public class Main
- {
- static int testTimes = 100000000;
- public static void main(String[] args) throws Exception
- {
- List<String> wordList = new ArrayList<String>();
- BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("data/WordList.txt")));
- String line = null;
- while ((line = br.readLine()) != null)
- {
- wordList.add(line);
- }
- br.close();
- // Collections.sort(wordList);
- // BufferedWriter writer = new BufferedWriter(new FileWriter("data/WordList.txt", false));
- // for (String w : wordList)
- // {
- // writer.write(w);
- // writer.newLine();
- // }
- // writer.close();
- System.out.println("单词表大小:" + wordList.size());
- long start = System.currentTimeMillis();
- // 测试ansj的tire
- Forest forest = new Forest();
- for (String w : wordList)
- {
- Library.insertWord(forest, w);
- }
- System.out.println("ansj插入耗时:" + (System.currentTimeMillis() - start));
-
- start = System.currentTimeMillis();
- for (int i = 0; i < testTimes; ++i)
- {
- if (forest.getWord("好好").getAllWords().equals("好好") == false)
- {
- throw new RuntimeException("ansj没找到该有的词");
- }
- }
- System.out.println("ansj查询耗时:" + (System.currentTimeMillis() - start));
-
- start = System.currentTimeMillis();
- PatriciaTrie<String, String> trie = new PatriciaTrie<String, String>(StringKeyAnalyzer.CHAR);
- for (String w : wordList)
- {
- trie.put(w, w);
- }
- System.out.println("PatriciaTrie插入耗时:" + (System.currentTimeMillis() - start));
-
- start = System.currentTimeMillis();
- for (int i = 0; i < testTimes; ++i)
- {
- if (trie.containsKey("好好") == false)
- {
- throw new RuntimeException("PatriciaTrie没找到该有的词");
- }
- }
- System.out.println("PatriciaTrie查询耗时:" + (System.currentTimeMillis() - start));
-
- start = System.currentTimeMillis();
- TrieBuilder builder = new TrieBuilder(wordList);
- Trie trieJada = builder.build();
- System.out.println("Jada插入耗时:" + (System.currentTimeMillis() - start));
-
- start = System.currentTimeMillis();
- for (int i = 0; i < testTimes; ++i)
- {
- if (trieJada.search("好好") == -1)
- {
- throw new RuntimeException("Jada没找到该有的词");
- }
- }
- System.out.println("Jada查询耗时:" + (System.currentTimeMillis() - start));
-
- start = System.currentTimeMillis();
- DoubleArrayTrie dat = new DoubleArrayTrie();
- dat.build(wordList);
- System.out.println("darts插入耗时:" + (System.currentTimeMillis() - start));
-
- start = System.currentTimeMillis();
- for (int i = 0; i < testTimes; ++i)
- {
- if (dat.exactMatchSearch("加油") == -1)
- {
- throw new RuntimeException("darts没找到该有的词");
- }
- }
- System.out.println("darts查询耗时:" + (System.currentTimeMillis() - start));
-
- // digitalstainDatrie is a piece of shit!
- // start = System.currentTimeMillis();
- // DoubleArrayTrieImpl digitalstainDatrie = new DoubleArrayTrieImpl(wordList.size());
- // digitalstainDatrie.add(wordList);
- // System.out.println("digitalstainDatrie插入耗时:" + (System.currentTimeMillis() - start));
- //
- // start = System.currentTimeMillis();
- // for (int i = 0; i < testTimes; ++i)
- // {
- // if (digitalstainDatrie.containsPrefix(new IntegerArrayList("加油")) != SearchResult.PERFECT_MATCH)
- // {
- // throw new RuntimeException("digitalstainDatrie没找到该有的词");
- // }
- // }
- // System.out.println("darts查询耗时:" + (System.currentTimeMillis() - start));
- }
- }
输出:
- 单词表大小:275909
- ansj插入耗时:189
- ansj查询耗时:7756
- PatriciaTrie插入耗时:156
- PatriciaTrie查询耗时:13960
- Jada插入耗时:1711
- Jada查询耗时:3463
- darts插入耗时:1444
- darts查询耗时:1479
插入耗时
可见在20万词的词典插入中,二分法和基数树表现得非常突出,双数组Trie树因为要压缩(甚至解决冲突,如Jada)就多花了一个数量级的时间。不过这也没什么大不了的,因为分词词典就加载一次。
查询耗时
这才是重头戏,在查询中双数组Trie树表现得非常突出,二分法次之,而基数树则是二分法的两倍。
结论
二分法小巧实用。