Neural Word Segmentation Learning for Chinese
paper
code
Conference: ACL 2016
Authors: Deng Cai and Hai Zhao
Affiliation:
Department of Computer Science and Engineering
Key Lab of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering
Shanghai Jiao Tong University, Shanghai, China
Main contribution (what it solves): eliminates fixed-size context windows and exploits complete segmentation history
Main method (what it uses): a gated combination neural network over characters plus an LSTM scoring model
Datasets: PKU and MSR
Main results: ① word representations composed from the candidate's characters; ② a sentence-level likelihood scoring system; ③ a search algorithm that finds the best segmentation
Limitations: performance falls short of state-of-the-art models, and training takes a long time
Future work: improve model performance
Abstract
Previous work: could only capture context within a fixed-size local window and simple interactions between adjacent tags.
This paper: thoroughly eliminates context windows and can utilize complete segmentation history.
Our model employs a gated combination neural network over characters to produce distributed representations of word candidates, which are then given to a long short-term memory (LSTM) language scoring model.
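As a rough illustration of this idea, the sketch below folds a word candidate's character embeddings into a single word vector with a learned gate. This is a minimal sketch, assuming a GRU-like gating scheme and made-up dimensions; it is not necessarily the paper's exact GCNN formulation.

```python
import torch
import torch.nn as nn

class GatedCombination(nn.Module):
    """Folds the character vectors of a word candidate into one word
    vector, gating how much each new character updates the running
    representation (gating details are illustrative assumptions)."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)     # how much to update
        self.combine = nn.Linear(2 * dim, dim)  # proposed new content

    def forward(self, char_vecs):
        # char_vecs: (word_len, dim) embeddings of the candidate's characters
        v = char_vecs[0]
        for c in char_vecs[1:]:                 # fold in characters left to right
            pair = torch.cat([v, c], dim=-1)
            g = torch.sigmoid(self.gate(pair))
            h = torch.tanh(self.combine(pair))
            v = g * h + (1 - g) * v             # gated update of the word vector
        return v                                # distributed word-candidate representation

gc = GatedCombination(dim=50)
word_vec = gc(torch.randn(3, 50))               # e.g. a three-character candidate
```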
Introduction
Key points:
- Most East Asian languages, including Chinese, are written without explicit word delimiters; therefore, word segmentation is a preliminary step for processing those languages.
- Since Xue (2003), most methods formalize Chinese word segmentation (CWS) as a sequence labeling problem with character position tags (see the tagging sketch after this list), which can be handled with supervised learning methods such as Maximum Entropy and Conditional Random Fields. However, those methods heavily depend on the choice of handcrafted features.
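To make the character-position-tag formulation concrete, here is a small sketch using the common BMES convention (the helper name is made up for illustration):

```python
def to_bmes(words):
    """Map a segmented sentence to character position tags:
    B(egin) / M(iddle) / E(nd) for multi-character words,
    S(ingle) for single-character words."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags

# "自然语言处理" segmented as 自然 / 语言 / 处理:
print(to_bmes(["自然", "语言", "处理"]))   # ['B', 'E', 'B', 'E', 'B', 'E']
```

A labeling model (MaxEnt, CRF, or a neural tagger) then predicts one such tag per character, and the tag sequence is converted back into a segmentation.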
Drawbacks of previous methods:
Nevertheless, the tag-tag transition is insufficient to model the complicated influence of previous segmentation decisions, even though they can be a crucial clue for later segmentation decisions. The fixed context window size, broadly adopted by these methods for feature engineering, also restricts the flexibility of modeling diverse distances. Moreover, word-level information, the greater-granularity unit suggested in (Huang and Zhao, 2006), remains unemployed.
Method of this paper:
This paper makes a fresh attempt to re-formalize CWS as a direct segmentation learning task.
Our method does not make tagging decisions on individual characters, but directly evaluates the relative likelihood of different segmented sentences and then searches for a segmentation with the highest score.
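The decoding idea can be sketched as a beam search over partial segmentations. This is a minimal sketch: score_fn(history, word) is a hypothetical stand-in for the paper's sentence-level scorer, and max_word_len / beam_size are illustrative values, not the paper's settings.

```python
def segment(sentence, score_fn, max_word_len=4, beam_size=8):
    # agenda maps an end position to the best partial segmentations reaching it
    agenda = {0: [([], 0.0)]}
    for pos in range(len(sentence)):
        for words, score in agenda.get(pos, []):
            for k in range(1, max_word_len + 1):  # try every next-word length
                end = pos + k
                if end > len(sentence):
                    break
                word = sentence[pos:end]
                cand = (words + [word], score + score_fn(words, word))
                bucket = agenda.setdefault(end, [])
                bucket.append(cand)
                bucket.sort(key=lambda x: x[1], reverse=True)
                del bucket[beam_size:]            # prune to the beam width
    return max(agenda[len(sentence)], key=lambda x: x[1])[0]

# toy scorer that simply prefers two-character words:
toy = lambda history, word: 0.0 if len(word) == 2 else -1.0
print(segment("自然语言处理", toy))               # ['自然', '语言', '处理']
```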
To represent a segmented sentence, a series of distributed vector representations (Bengio et al., 2003) is generated to characterize the corresponding word candidates.
Such a representation setting makes the decoding quite different from previous methods and indeed much more challenging; however, more discriminative features can be captured.
Though the vector building is word centered, the proposed scoring model covers all three processing levels, from character and word up to sentence.
First, the distributed representation starts from character embeddings, since in the word segmentation setting the n-gram data sparsity issue makes it impractical to use word vectors directly. Second, as a word candidate's representation is derived from its characters, the internal character structure is also encoded, so it can be used to judge the word likelihood of the candidate itself. Third, to evaluate how well a segmented sentence makes sense through word interaction, an LSTM chains word candidates together incrementally and constructs the representation of the partially segmented sentence at each decoding step, so that the coherence between the next word candidate and the previous segmentation history can be captured.
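To make the third point concrete, the sketch below chains word-candidate vectors through an LSTM and scores each next candidate against the history state. It is a minimal sketch: the dot-product coherence score, layer sizes, and class name are assumptions, not the paper's exact link scoring.

```python
import torch
import torch.nn as nn

class SentenceScorer(nn.Module):
    """Scores a segmentation by chaining word-candidate vectors through
    an LSTM; each step measures how well the next candidate coheres
    with the segmentation history encoded in the hidden state."""

    def __init__(self, dim):
        super().__init__()
        self.cell = nn.LSTMCell(dim, dim)
        self.out = nn.Linear(dim, dim)  # maps history state to an expected next-word vector

    def forward(self, word_vecs):
        # word_vecs: list of (dim,) vectors, one per word candidate in the segmentation
        h = torch.zeros(1, word_vecs[0].size(-1))
        c = torch.zeros_like(h)
        total = torch.zeros(())
        for w in word_vecs:
            pred = self.out(h)                           # what the history expects next
            total = total + (pred.squeeze(0) * w).sum()  # coherence with the candidate
            h, c = self.cell(w.unsqueeze(0), (h, c))     # extend the history
        return total                                     # sentence-level score
```

Comparing such scores across alternative segmentations of the same sentence is what the beam search above maximizes.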