Chinese Word Segmentation
Task
Chinese word segmentation is the task of splitting Chinese text (a sequence of Chinese characters) into words.
Example:
'上海浦东开发与建设同步' → ['上海', '浦东', '开发', '与', '建设', '同步']
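Segmentation is often cast as per-character sequence labeling. The sketch below converts a segmented sentence into one tag per character and back, assuming the common BMES scheme (Begin, Middle, End, Single). The scheme and the function names `words_to_bmes` and `bmes_to_words` are illustrative assumptions and are not taken from any of the systems listed on this page.

```python
def words_to_bmes(words):
    """Map a list of words to one BMES tag per character (assumed tagging scheme)."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first character B, inner characters M, last character E
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def bmes_to_words(chars, tags):
    """Rebuild words from characters and their BMES tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at E or S
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


words = ["上海", "浦东", "开发", "与", "建设", "同步"]
tags = words_to_bmes(words)
print("".join(tags))
print(bmes_to_words("".join(words), tags))
```

Under this formulation, a tagger only has to predict one of four labels per character, and the word list is recovered deterministically from the tags.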
Systems
♠ marks systems that use character unigrams as input.
♣ marks systems that use character bigrams as input.
- Huang et al. (2019): BERT + model compression + multi-criteria learning ♠
- Yang et al. (2018): Lattice LSTM-CRF + BPE subword embeddings ♠♣
- Ma et al. (2018): BiLSTM-CRF + hyperparameter search ♠♣
- Yang et al. (2017): Transition-based + beam search + rich pretraining ♠♣
- Zhou et al. (2017): Greedy search + word context ♠
- Chen et al. (2017): BiLSTM-CRF + adversarial loss ♠♣
- Cai et al. (2017): Greedy search + span representation ♠
- Kurita et al. (2017): Transition-based + joint model ♠
- Liu et al. (2016): Neural semi-CRF ♠
- Cai and Zhao (2016): Greedy search ♠
- Chen et al. (2015a): Gated recursive NN ♠♣
- Chen et al. (2015b): BiLSTM-CRF ♠♣
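To make the two input types concrete, here is a minimal sketch of extracting character unigrams and character bigrams from a sentence. The end-of-sentence padding token `</s>` and the function names are assumptions for illustration; the individual systems may define and pad bigrams differently.

```python
def char_unigrams(text):
    """One feature per character."""
    return list(text)


def char_bigrams(text, pad="</s>"):
    """One feature per position: the character plus its right neighbour.

    The final position is paired with a padding token (an assumed convention).
    """
    chars = list(text) + [pad]
    return [chars[i] + chars[i + 1] for i in range(len(text))]


print(char_unigrams("上海浦东"))
print(char_bigrams("上海浦东"))
```

Both functions return one feature per character position, so a model can look up a unigram embedding and a bigram embedding for each position and combine them.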
Evaluation
Metrics
F1-score
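A minimal sketch of word-level F1 for a single sentence, assuming the usual convention that a predicted word counts as correct only when both of its character boundaries match a gold word. The helper names `to_spans` and `segmentation_f1` are hypothetical, and this is not the official scoring script of any benchmark.

```python
def to_spans(words):
    """Convert a word list to a set of (start, end) character offsets."""
    spans, start = set(), 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_f1(gold_words, pred_words):
    """Word-level F1 between a gold and a predicted segmentation of the same text."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)  # words whose boundaries match exactly
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = ["上海", "浦东", "开发", "与", "建设", "同步"]
pred = ["上海", "浦东", "开发与", "建设", "同步"]
print(round(segmentation_f1(gold, pred), 4))  # 0.7273
```

In this example 4 of the 5 predicted words match the 6 gold words, giving a precision of 0.8, a recall of about 0.667, and an F1 of 8/11. Corpus-level scores are normally computed by summing the counts over all sentences before taking the ratios.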
Dataset
Chinese Treebank 6
Model | F1 | Paper / Source | Code |
---|---|---|---|
Huang et al. (2019) | 97.6 | Toward Fast and Accurate Neural Chinese Word Segmentation with Multi-Criteria Learning | |
Ma et al. (2018) | 96.7 | State-of-the-art Chinese Word Segmentation with Bi-LSTMs | |
Yang et al. (2018) | 96.3 | Subword Encoding in Lattice LSTM for Chinese Word Segmentation | Github |
Yang et |