Download kcws: git clone https://github.com/koth/kcws
Change into the project directory and run the configure script:
cd kcws
./configure
Word vector training:
Preprocess the text:
python kcws/train/process_anno_file.py <corpus directory> pre_chars_for_w2v.txt
Build the word2vec tool:
bazel build third_party/word2vec:word2vec
First, generate a preliminary vocabulary:
./bazel-bin/third_party/word2vec/word2vec -train pre_chars_for_w2v.txt -save-vocab pre_vocab.txt -min-count 3
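The -min-count 3 option makes word2vec discard tokens that occur fewer than three times before the vocabulary is saved. The filtering idea can be sketched in Python as follows. This is an illustration only, not the word2vec implementation; the function name build_vocab and the sample lines are hypothetical.

```python
from collections import Counter


def build_vocab(lines, min_count=3):
    """Count whitespace-separated tokens and keep those seen at least min_count times."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    # Keep only tokens that reach the frequency threshold.
    return {tok: n for tok, n in counts.items() if n >= min_count}


# Hypothetical sample: each line holds space-separated characters.
sample = ["我 们 我", "我 们 他", "们 好"]
vocab = build_vocab(sample)
```

With this sample, only the two tokens that appear three times survive; the tokens seen once are dropped.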
Handle low-frequency words:
python kcws/train/replace_unk.py pre_vocab
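A common way to handle low-frequency tokens is to map every token outside the saved vocabulary to a single placeholder. The sketch below illustrates that idea under stated assumptions: the helper name replace_rare, the `<UNK>` token, and the sample data are hypothetical, and replace_unk.py may behave differently.

```python
def replace_rare(line, vocab, unk="<UNK>"):
    """Replace every whitespace-separated token that is not in vocab with unk."""
    return " ".join(tok if tok in vocab else unk for tok in line.split())


# Hypothetical vocabulary and input line (assumed, not taken from kcws).
vocab = {"我", "们"}
result = replace_rare("我 们 他 好", vocab)
```

Collapsing rare tokens into one placeholder keeps the vocabulary small and gives the model a single embedding for anything it has seen too rarely to learn well.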