Analysis of the Dictionary Build Process in Google's Native Input Method LatinIME: Related Data Structures

License: CC 4.0 BY-SA

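This article takes a close look at the internal structure of the input method's dictionary: the core data structure LemmaEntry and the meaning of its fields, the design of SpellingNode (the node type of the pinyin spelling trie), and the role of the dictionary trie nodes LmaNodeLE0 and LmaNodeGE1.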

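Almost all of the data structures related to the input method's dictionary are defined in the header file dictdef.h, so start by going into the cpp directory of the source tree.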

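To initialize the dictionary, the builder first reads the contents of the txt file into the data structures lemma_arr_ and valid_hzs. lemma_arr_ is an array of LemmaEntry, so let's look at the definition of LemmaEntry (cpp/include/dictdef.h):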

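// Each line of rawdict_utf16_65105_freq.txt is one LemmaEntry.
// When pinyin is recorded, its letters are converted to uppercase by default;
// only the h in a two-letter initial (Zh, Ch, Sh) stays lowercase.
// This refers to pinyin_str.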
struct LemmaEntry {
  LemmaIdType idx_by_py;
  LemmaIdType idx_by_hz;
  char16 hanzi_str[kMaxLemmaSize + 1];

  // The SingleCharItem id for each Hanzi.
  uint16 hanzi_scis_ids[kMaxLemmaSize];

  uint16 spl_idx_arr[kMaxLemmaSize + 1];
  char pinyin_str[kMaxLemmaSize][kMaxPinyinSize + 1];
  unsigned char hz_str_len;
  float freq;
};

First, here is what rawdict_utf16_65105_freq.txt looks like:

鼥 0.750684002197 1 ba
釛 0.781224156844 1 ba
軷 0.9691786136 1 ba
釟 0.9691786136 1 ba
蚆 1.15534975655 1 ba
......

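The file has 65105 lines, and every line has the same format: Hanzi, frequency, a third field (written as "?" here, since its meaning is not covered), and pinyin. In the struct, freq is the frequency and hz_str_len is the number of Hanzi in the lemma. The two-dimensional array pinyin_str[8][7] holds the pinyin: a lemma is limited to 8 Hanzi, and each Hanzi's pinyin gets 7 bytes (kMaxPinyinSize + 1, which leaves room for the terminating NUL). hanzi_str[8 + 1] stores the Unicode (UTF-16) encoding of the Hanzi; for example, the first character "鼥" is encoded as 40741 (U+9F25). Viewed in gdb, the first element of lemma_arr_ looks like this: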

{idx_by_py = 0, idx_by_hz = 0, hanzi_str =     {40741,
    0,
    0,
    0,
    0,
    0,
    0,
    0,
    0}, hanzi_scis_ids =     {0,
    0,
    0,
    0,
    0,
    0,
    0,
    0}, spl_idx_arr =     {0,
    0,
    0,
    0,
    0,
    0,
    0,
    0,
    0}, pinyin_str =     {      "BA\000\000\000\000",
          "\000\000\000\000\000\000",
          "\000\000\000\000\000\000",
          "\000\000\000\000\000\000",
          "\000\000\000\000\000\000",
          "\000\000\000\000\000\000",
          "\000\000\000\000\000\000",
          "\000\000\000\000\000\000"}, hz_str_len = 1 '\001', freq = 0.750684023}

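As a rough illustration of how the first line of the raw dictionary maps onto these fields, the sketch below fills a simplified mirror of LemmaEntry by hand. The typedefs and the constants kMaxLemmaSize = 8 and kMaxPinyinSize = 6 are assumptions inferred from the array sizes seen in the gdb dump, and make_single_char_entry is a hypothetical helper, not a function from the LatinIME sources.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Assumed stand-ins for the project's own typedefs and constants.
typedef char16_t char16;
typedef std::uint16_t uint16;
typedef std::size_t LemmaIdType;
const std::size_t kMaxLemmaSize = 8;   // longest lemma, counted in Hanzi
const std::size_t kMaxPinyinSize = 6;  // longest pinyin, without the NUL

struct LemmaEntry {
  LemmaIdType idx_by_py;
  LemmaIdType idx_by_hz;
  char16 hanzi_str[kMaxLemmaSize + 1];
  uint16 hanzi_scis_ids[kMaxLemmaSize];
  uint16 spl_idx_arr[kMaxLemmaSize + 1];
  char pinyin_str[kMaxLemmaSize][kMaxPinyinSize + 1];
  unsigned char hz_str_len;
  float freq;
};

// Hypothetical helper: builds the entry for a one-character lemma whose
// pinyin has already been normalized (for example "BA").
LemmaEntry make_single_char_entry(char16 hanzi, const char *pinyin,
                                  float freq) {
  LemmaEntry entry;
  std::memset(&entry, 0, sizeof(entry));  // every field starts out as 0
  entry.hanzi_str[0] = hanzi;
  // Copy at most kMaxPinyinSize characters; the zeroed buffer keeps the NUL.
  std::strncpy(entry.pinyin_str[0], pinyin, kMaxPinyinSize);
  entry.hz_str_len = 1;
  entry.freq = freq;
  return entry;
}
```

Calling make_single_char_entry(u'\u9F25', "BA", 0.750684f) yields hanzi_str[0] == 40741, the value gdb prints for "鼥", with the id fields still 0 as in the dump.
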
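All pinyin has been converted to uppercase, except for the h in the two-letter initials. The hanzi_scis_ids field holds, for each Hanzi in the lemma, its id in the single-character table scis. For example, the id of "鼥" from the first line is assigned to the first element, hanzi_scis_ids[0]. The last line of the file is "欧洲市场", so for that lemma the array holds the ids of "欧", "洲", "市" and "场" in order: if their ids were 34, 46, 29 and 200, then hanzi_scis_ids[0] = 34, hanzi_scis_ids[1] = 46, hanzi_scis_ids[2] = 29 and hanzi_scis_ids[3] = 200, while the remaining elements keep their initial value 0.

The spl_idx_arr field holds the spelling (syllable) id of each Hanzi in the LemmaEntry. After skipping 50000 iterations in gdb, execution stops right at the word "叫声", and printing lemma_arr_ shows: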

(gdb) p lemma_arr_[8117]
$31 = {idx_by_py = 0, idx_by_hz = 0, hanzi_str = {21483, 0, 0, 0, 0, 0, 0, 0, 0}, hanzi_scis_ids = {0, 0, 0, 0, 0, 0, 0, 0}, spl_idx_arr = {166, 0, 0, 0, 0, 0, 0, 0, 0}, pinyin_str = {"JIAO\000\000", 
    "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000"}, 
  hz_str_len = 1 '\001', freq = 55685.8672}
(gdb) p lemma_arr_[12781]
$32 = {idx_by_py = 0, idx_by_hz = 0, hanzi_str = {22768, 0, 0, 0, 0, 0, 0, 0, 0}, hanzi_scis_ids = {0, 0, 0, 0, 0, 0, 0, 0}, spl_idx_arr = {337, 0, 0, 0, 0, 0, 0, 0, 0}, pinyin_str = {"ShENG\000", 
    "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000"}, 
  hz_str_len = 1 '\001', freq = 7169.11719}
(gdb) p lemma_arr_[i-1]
$33 = {idx_by_py = 0, idx_by_hz = 0, hanzi_str = {21483, 22768, 0, 0, 0, 0, 0, 0, 0}, hanzi_scis_ids = {0, 0, 0, 0, 0, 0, 0, 0}, spl_idx_arr = {166, 337, 0, 0, 0, 0, 0, 0, 0}, pinyin_str = {
    "JIAO\000\000", "ShENG\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000"}, 
  hz_str_len = 2 '\002', freq = 564.171448}
(gdb) p i
$34 = 33619
(gdb)

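Index 8117 is where the single character "叫" is stored in lemma_arr_, and 12781 is where "声" is stored; i - 1 = 33618. At lemma_arr_[33618], the hanzi_str and spl_idx_arr of "叫声" are simply the values of its individual characters put together (21483 and 22768; 166 and 337), and the h in the two-letter initial of "ShENG" is indeed lowercase. The fields idx_by_py and idx_by_hz are the lemma's ids computed from its spelling id array and its Hanzi id array respectively; both are still 0 in the dumps above.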
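
The casing rule visible in "JIAO" and "ShENG" can be sketched in a few lines. normalize_pinyin below is a hypothetical helper written only to illustrate the rule; it is not the routine LatinIME itself uses, and it assumes the two-letter initials are Zh, Ch and Sh.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: upper-cases a pinyin syllable but keeps the 'h' of a
// two-letter initial (Zh, Ch, Sh) in lower case, e.g. "sheng" -> "ShENG".
std::string normalize_pinyin(const std::string &raw) {
  std::string out;
  for (std::size_t i = 0; i < raw.size(); ++i) {
    // Cast through unsigned char so std::toupper never sees a negative value.
    out += static_cast<char>(std::toupper(static_cast<unsigned char>(raw[i])));
  }
  // Only an 'H' directly after a leading Z, C or S belongs to the initial.
  if (out.size() >= 2 && out[1] == 'H' &&
      (out[0] == 'Z' || out[0] == 'C' || out[0] == 'S')) {
    out[1] = 'h';
  }
  return out;
}
```

With this sketch, "jiao" becomes "JIAO" and "sheng" becomes "ShENG", while a syllable such as "heng", where h is the whole initial, stays fully uppercase.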

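As mentioned earlier, the hanzi_scis_ids field of LemmaEntry gives the id in the single-character table for every Hanzi in a lemma (think of a lemma as a string of Hanzi). The single-character table scis is itself an array, whose elements are of the struct type SingleCharItem (cpp/include/dictdef.h):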

#ifdef ___BUILD_MODEL___
struct SingleCharItem {
  float freq;
  char16 hz;
  SpellingId splid;
};
// ...
#endif  // ___BUILD_MODEL___

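The splid field is the spelling id of the single Hanzi, hz is the Hanzi itself, and freq is the frequency of that single character. The code that fills it in is in dictbuilder.cpp: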

size_t hz_num = lemma_arr_[pos].hz_str_len;
...
if (1 == hz_num)
  scis_[scis_num_].freq = lemma_arr_[pos].freq;
else
  scis_[scis_num_].freq = 0.000001;

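If hz_num is 1, meaning the lemma consists of a single Hanzi, the character takes the lemma's frequency; otherwise freq is set to 0.000001. The type of splid is SpellingId, which is also a struct: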

typedef struct {
  uint16 half_splid:5;
  uint16 full_splid:11;
} SpellingId, *PSpellingId;

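Both half_splid and full_splid are declared as bit-fields. half_splid is 5 bits wide, so it can hold unsigned values from 0 to 31 (binary 11111). full_splid is 11 bits wide, so it can hold values from 0 to 2047, that is 2^0 + 2^1 + ... + 2^10 = 2^11 - 1.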
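
The ranges of the two bit-fields can be checked with a short sketch. The uint16 typedef is an assumed stand-in for the project's own, and max_bitfield_value, store_half_splid and store_full_splid are hypothetical helpers added purely for illustration.

```cpp
#include <cstdint>

typedef std::uint16_t uint16;  // assumed stand-in for the project's typedef

typedef struct {
  uint16 half_splid:5;
  uint16 full_splid:11;
} SpellingId, *PSpellingId;

// Largest value an unsigned bit-field of `bits` bits can hold: 2^bits - 1.
unsigned max_bitfield_value(unsigned bits) {
  return (1u << bits) - 1u;
}

// Store a value in the 5-bit field and read it back.
unsigned store_half_splid(unsigned value) {
  SpellingId id = {0, 0};
  id.half_splid = value;
  return id.half_splid;
}

// Store a value in the 11-bit field and read it back.
unsigned store_full_splid(unsigned value) {
  SpellingId id = {0, 0};
  id.full_splid = value;
  return id.full_splid;
}
```

Here max_bitfield_value(5) is 31 and max_bitfield_value(11) is 2047, and both of those maxima survive a round trip through the corresponding field.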

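Next come the data structures used to build the dictionary trie: LmaNodeLE0 and LmaNodeGE1. They represent nodes on levels less than or equal to 0 (the root and level 0) and nodes on levels greater than or equal to 1, respectively. Both are defined in cpp/include/dictdef.h, starting with LmaNodeLE0: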

/**
 * We use different node types for different layers
 * Statistical data of the building result for a testing dictionary:
 *                              root,   level 0,   level 1,   level 2,   level 3
 * max son num of one node:     406        280         41          2          -
 * max homo num of one node:      0         90         23          2          2
 * total node num of a layer:     1        406      31766      13516        993
 * total homo num of a layer:     9       5674      44609      12667        995
 *
 * The node number for root and level 0 won't be larger than 500
 * According to the information above, two kinds of nodes can be used; one for
 * root and level 0, the other for these layers deeper than 0.
 *
 * LE = less and equal,
 * A node occupies 16 bytes. so, totallly less than 16 * 500 = 8K
 */
struct LmaNodeLE0 {
  uint32 son_1st_off;
  uint32 homo_idx_buf_off;
  uint16 spl_idx;
  uint16 num_of_son;
  uint16 num_of_homo;
};

/**
 * GE = great and equal
 * A node occupies 8 bytes.
 */
struct LmaNodeGE1 {
  uint16 son_1st_off_l;        // Low bits of the son_1st_off
  uint16 homo_idx_buf_off_l;   // Low bits of the homo_idx_buf_off_1
  uint16 spl_idx;
  unsigned char num_of_son;            // number of son nodes
  unsigned char num_of_homo;           // number of homo words
  unsigned char son_1st_off_h;         // high bits of the son_1st_off
  unsigned char homo_idx_buf_off_h;    // high bits of the homo_idx_buf_off
};
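
The comments on the _l and _h fields of LmaNodeGE1 suggest that each offset is stored as 16 low bits plus 8 high bits. The sketch below shows one way such a 24-bit offset could be split and reassembled; get_son_offset and set_son_offset are illustrative helpers written under that assumption, and the uint16 typedef is a stand-in for the project's own.

```cpp
#include <cstddef>
#include <cstdint>

typedef std::uint16_t uint16;  // assumed stand-in for the project's typedef

struct LmaNodeGE1 {
  uint16 son_1st_off_l;              // low 16 bits of the son offset
  uint16 homo_idx_buf_off_l;         // low 16 bits of the homo offset
  uint16 spl_idx;
  unsigned char num_of_son;
  unsigned char num_of_homo;
  unsigned char son_1st_off_h;       // high 8 bits of the son offset
  unsigned char homo_idx_buf_off_h;  // high 8 bits of the homo offset
};

// Reassemble the 24-bit son offset from its two halves.
std::size_t get_son_offset(const LmaNodeGE1 &node) {
  return (static_cast<std::size_t>(node.son_1st_off_h) << 16) |
         node.son_1st_off_l;
}

// Split an offset below 2^24 into the low and high fields.
void set_son_offset(LmaNodeGE1 *node, std::size_t offset) {
  node->son_1st_off_l = static_cast<uint16>(offset & 0xFFFF);
  node->son_1st_off_h = static_cast<unsigned char>((offset >> 16) & 0xFF);
}
```

Storing 0x012345 this way leaves 0x2345 in son_1st_off_l and 0x01 in son_1st_off_h. The extra byte raises the addressable range from 2^16 to 2^24 entries without widening the offset to a full 32-bit field as LmaNodeLE0 does.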

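The structs LmaNodeLE0 and LmaNodeGE1 are mainly used in DictBuilder::construct_subset(...). When build_dict calls it, it passes item_start = 0 and item_end = 65101, which covers rawdict_utf16_65105_freq.txt from its first line to its last; in other words, it walks the lemma_arr_ array to generate the layered trie. construct_subset calls itself recursively to add nodes on level 0 and level 1. I have not fully worked out the exact layout here yet, so I will describe these two structs in more detail once I understand it.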

// Node used for the trie of spellings
struct SpellingNode {
  SpellingNode *first_son;
  // The spelling id for each node. If you need more bits to store
  // spelling id, please adjust this structure.
  uint16 spelling_idx:11;
  uint16  num_of_son:5;
  char char_this_node;
  unsigned char score;
};

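The struct SpellingNode describes each node of the pinyin spelling trie; it is defined in cpp/include/spellingtrie.h. first_son points to the first element of an array of child nodes, each of type SpellingNode. Every node of the syllable tree built by the construct method in spellingtrie.cpp has this type, and the first_son pointer of the root_ node points to the first element of level1_sons. num_of_son is declared as a 5-bit bit-field, so it can hold integers from 0 to 31 (2^5 - 1). It records how many child nodes follow the character stored in char_this_node: for example, when char_this_node is 'A' there are 3 child nodes, leading to ai, an and ao. The score field is the score of this character; the smaller the score, the higher its priority during search, and the score of the root_ node is 0. The bit-field spelling_idx gives the id of each syllable-forming letter in the spelling list: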

{first_son = 0x617420, spelling_idx = 1, num_of_son = 3, char_this_node = 65 'A', score = 86 'V'},
  {first_son = 0x617480, spelling_idx = 2, num_of_son = 5, char_this_node = 66 'B', score = 57 '9'},
  {first_son = 0x617620, spelling_idx = 3, num_of_son = 6, char_this_node = 67 'C', score = 72 'H'},
  {first_son = 0x6179e0, spelling_idx = 5, num_of_son = 5, char_this_node = 68 'D', score = 46 '.'},
  {first_son = 0x617c50, spelling_idx = 6, num_of_son = 3, char_this_node = 69 'E', score = 79 'O'},
  {first_son = 0x617cb0, spelling_idx = 7, num_of_son = 5, char_this_node = 70 'F', score = 72 'H'},
  {first_son = 0x617e00, spelling_idx = 8, num_of_son = 4, char_this_node = 71 'G', score = 62 '>'},
  {first_son = 0x617ff0, spelling_idx = 9, num_of_son = 4, char_this_node = 72 'H', score = 64 '@'},
  {first_son = 0x6181e0, spelling_idx = 11, num_of_son = 2, char_this_node = 74 'J', score = 59 ';'},
  {first_son = 0x618380, spelling_idx = 12, num_of_son = 4, char_this_node = 75 'K', score = 70 'F'},
  {first_son = 0x618570, spelling_idx = 13, num_of_son = 6, char_this_node = 76 'L', score = 62 '>'},
  {first_son = 0x618810, spelling_idx = 14, num_of_son = 5, char_this_node = 77 'M', score = 68 'D'},
  {first_son = 0x6189e0, spelling_idx = 15, num_of_son = 6, char_this_node = 78 'N', score = 66 'B'},
  {first_son = 0x618c70, spelling_idx = 16, num_of_son = 1, char_this_node = 79 'O', score = 109 'm'},
  {first_son = 0x618c90, spelling_idx = 17, num_of_son = 5, char_this_node = 80 'P', score = 90 'Z'},
  {first_son = 0x618e50, spelling_idx = 18, num_of_son = 2, char_this_node = 81 'Q', score = 66 'B'},
  {first_son = 0x626ff0, spelling_idx = 19, num_of_son = 5, char_this_node = 82 'R', score = 65 'A'},
  {first_son = 0x6271a0, spelling_idx = 20, num_of_son = 6, char_this_node = 83 'S', score = 46 '.'},
  {first_son = 0x627540, spelling_idx = 22, num_of_son = 5, char_this_node = 84 'T', score = 70 'F'},
  {first_son = 0x6277a0, spelling_idx = 25, num_of_son = 4, char_this_node = 87 'W', score = 61 '='},
  {first_son = 0x627890, spelling_idx = 26, num_of_son = 2, char_this_node = 88 'X', score = 68 'D'},
  {first_son = 0x627a30, spelling_idx = 27, num_of_son = 5, char_this_node = 89 'Y', score = 51 '3'},
  {first_son = 0x627bd0, spelling_idx = 28, num_of_son = 6, char_this_node = 90 'Z', score = 61 '='}

The values above are small, but the field is 11 bits wide, which means it can hold integers from 0 to 2047 (2^11 - 1).
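
To show how first_son and num_of_son work together, here is a minimal sketch of looking up a child node by its character. find_son is a hypothetical helper and the uint16 typedef is an assumed stand-in; the real lookup code in spellingtrie.cpp may be organized differently.

```cpp
#include <cstdint>

typedef std::uint16_t uint16;  // assumed stand-in for the project's typedef

struct SpellingNode {
  SpellingNode *first_son;  // first element of the array of child nodes
  uint16 spelling_idx:11;
  uint16 num_of_son:5;
  char char_this_node;
  unsigned char score;
};

// Hypothetical helper: linear search through the num_of_son children that
// start at first_son; returns nullptr when no child carries `ch`.
const SpellingNode *find_son(const SpellingNode *parent, char ch) {
  for (int i = 0; i < parent->num_of_son; ++i) {
    if (parent->first_son[i].char_this_node == ch) {
      return &parent->first_son[i];
    }
  }
  return nullptr;
}
```

For the 'A' node from the dump, whose three children correspond to ai, an and ao, find_son(&a, 'N') would return the second child, while a character with no matching child yields nullptr.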