tagging problem
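That is, the sequence labeling problem.
Given a sequence of words as input: the dog saw a cat.
The task is to output the corresponding sequence of part-of-speech tags: D N V D N (D for a determiner, N for noun, and V for verb).
Sometimes the output is written in this form instead: the/D dog/N saw/V a/D cat/N.
Two important specific instances of this task are part-of-speech (POS) tagging and named-entity recognition.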
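The two output forms above (a separate tag sequence, or word/TAG pairs) carry the same information. A minimal Python sketch of converting between them is given here; the function names parse_tagged and format_tagged are made up for illustration and do not come from any library.

```python
def parse_tagged(text):
    """Split 'word/TAG word/TAG ...' into parallel word and tag lists."""
    words, tags = [], []
    for token in text.split():
        # rpartition splits on the LAST '/', so tokens such as ',/,'
        # (a comma tagged as ',') are still split correctly.
        word, _, tag = token.rpartition("/")
        words.append(word)
        tags.append(tag)
    return words, tags


def format_tagged(words, tags):
    """Join parallel word and tag lists back into 'word/TAG' form."""
    return " ".join(f"{w}/{t}" for w, t in zip(words, tags))


words, tags = parse_tagged("the/D dog/N saw/V a/D cat/N")
print(words)  # ['the', 'dog', 'saw', 'a', 'cat']
print(tags)   # ['D', 'N', 'V', 'D', 'N']
print(format_tagged(words, tags))  # the/D dog/N saw/V a/D cat/N
```

Splitting on the last slash is a deliberate choice in this sketch, so that punctuation tokens tagged with themselves do not break the parser.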
POS tagging
INPUT:
Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results.
OUTPUT:
Profits/N soared/V at/P Boeing/N Co./N ,/, easily/ADV topping/V forecasts/N on/P Wall/N Street/N ,/, as/P their/POSS CEO/N Alan/N Mulally/N announced/V first/ADJ quarter/N results/N ./.
KEY:
N = Noun
V = Verb
P = Preposition
ADV = Adverb
ADJ = Adjective
…
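POS tagging is one of the fundamental problems in NLP and plays an important role in many areas.
One difficulty in POS tagging is ambiguity: many words can take different parts of speech. In the example above, "profits" is a noun, but elsewhere it can be a verb. This brings to mind a line from high-school politics class, "a person is a person within society"; likewise, "a word is a word within a sentence". Solving the problem from the word alone is hard, whereas taking the context into account makes it easier and reduces the effect of ambiguity.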
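To make the ambiguity concrete, here is a small Python sketch that counts which tags each word receives in a tagged corpus and lists the words seen with more than one tag. The four sentences are made up for illustration (they are not real corpus data), and the helper names tag_counts and ambiguous_words are hypothetical.

```python
from collections import Counter, defaultdict

# Made-up toy sentences in word/TAG form, for illustration only.
TOY_CORPUS = [
    "the/D company/N profits/V from/P sales/N",
    "profits/N soared/V",
    "the/D dog/N saw/V a/D cat/N",
    "the/D carpenter/N bought/V a/D saw/N",
]


def tag_counts(corpus):
    """Map each lowercased word to a Counter of the tags it appears with."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        for token in sentence.split():
            word, _, tag = token.rpartition("/")
            counts[word.lower()][tag] += 1
    return counts


def ambiguous_words(counts):
    """Keep only the words that were seen with more than one tag."""
    return {w: dict(c) for w, c in counts.items() if len(c) > 1}


counts = tag_counts(TOY_CORPUS)
print(sorted(ambiguous_words(counts)))  # ['profits', 'saw']
```

A tagger that looks only at the word itself has to pick one tag for "profits" and one for "saw", so it is wrong in at least one sentence for each; the neighboring words are what tell the two uses apart.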
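Another difficulty is that many words occur very rarely, which makes them hard to train on. This became much less of a problem once word vectors were introduced: even if a word is rare, its vector tends to be close to the vectors of words with similar meaning.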
named entity recognition
Example:
INPUT: Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results.
OUTPUT: Profits soared at [Company Boeing Co.], easily topping forecasts on [Location Wall Street], as their CEO [Person Alan Mulally] announced first quarter results.
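The task is to identify the named entities in a sentence, such as the names of people, locations, and companies.
In practice, this kind of task is usually handled by predicting a label for each word: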
INPUT: Profits soared at