Lesson 13
Representation for a word
In the early days, supervised neural networks performed worse than feature-based classifiers (SVMs and the like)
Later, unsupervised neural networks caught up with the feature classifiers, but training took a very long time (7 weeks)
Adding a few hand-crafted features on top raised accuracy further
Later still, we could train on a small supervised corpus and obtain a dependency parser
We can initialize it with random word vectors, but using pre-trained vectors improves results by about 1% (mostly on rare words)
- Solutions for unknown words with word vectors
- At training time, the vocabulary covers the supervised data (e.g. words appearing > 5 times); everything else is treated as UNK
- Map all words with count < 5 to UNK and train one word vector specifically for them
But this still does not distinguish between different UNKs
So we need a char-level model
However, in Question Answering there are newer ways to deal with new words at test time
- The pre-trained word embeddings may contain a larger vocabulary than your application actually uses, so look the word up in the pre-trained embeddings and take its vector directly
- If it still cannot be found in the pre-trained embeddings, assign a random word vector, so every word has a unique identity; then if the same word appears in the answer, they can match perfectly (see the sketch below)
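A minimal sketch of the two strategies above, assuming a simple dict-based embedding table and numpy (the function and variable names here are illustrative, not from the lecture): rare training words share one UNK vector, and at test time we fall back to the pre-trained table or a cached per-word random vector.

```python
import numpy as np
from collections import Counter

EMB_DIM = 50
rng = np.random.default_rng(0)

def build_vocab(tokens, min_count=5):
    """Words appearing > min_count times get their own entry; the rest share <UNK>."""
    counts = Counter(tokens)
    vocab = {w for w, c in counts.items() if c > min_count}
    vocab.add("<UNK>")
    return vocab

def lookup(word, task_emb, pretrained_emb, unk_cache):
    """Test-time lookup: task vocab -> pre-trained vectors -> per-word random vector."""
    if word in task_emb:
        return task_emb[word]
    if word in pretrained_emb:
        return pretrained_emb[word]
    # unseen everywhere: give the word a stable random vector so repeated
    # occurrences of the same word still match each other
    if word not in unk_cache:
        unk_cache[word] = rng.normal(scale=0.1, size=EMB_DIM)
    return unk_cache[word]
```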
But new problems remain
- A word with one spelling can have several meanings, so we need multiple word vectors to distinguish the meanings; a single word vector is a mixture, and we want the model to separate it
- Some words have the same semantic meaning but are used differently, e.g. one as a noun and one as a verb; words also differ in connotation
Idea
Traditionally, we feed the input words into LSTM models and use the hidden state to predict the next word.
Each hidden state is then a context-specific word representation; can we use this representation to do more?
NER
Assign a tag to each word, e.g. person/location/date; this is another kind of feature
TagLM - Pre-ELMo
The CoNLL 2003 dataset is commonly used
- IDEA: we want to learn word meaning in context, but only learn the task RNN on a small amount of task-labeled data
Use a semi-supervised approach (it gives the model better features)
- With lots of unlabeled data we can train a conventional word embedding model (Word2Vec) to predict the next word; simultaneously, we can train a recurrent language model (a bi-LSTM)
- Feed words through the trained recurrent language model and take the hidden states of the bi-LSTM language model as another word representation
- Use both word representations together in the sequence tagging model
- Details: we train the bi-LSTM model on unlabeled data, then we want to do named entity recognition for "'New York' is located ..."
- Right side: an LM that predicts the next word.
Run "'New York' is located" through the pre-trained language model (all of whose parameters are frozen after pre-training), once forward and once backward, then concatenate to get a word representation.
At each position this gives a contextual word representation (the concatenated vectors, used as features in the named entity recognizer)
No gradient flows back, so the LM is never updated
- Left side: the named entity recognizer is trained on supervised data (the same sentence). Get its word embedding from word2vec and its character-level embedding from a character-level CNN/RNN, then concatenate the two
- Feed the result of step 2 into a bi-LSTM to get an output
- Concatenate the outputs of step 1 and step 3, giving a paired state
- Feed the result of step 4 into a second-layer bi-LSTM; on the output, do a softmax classification to produce the tags (e.g. beginning/end of a location)
- Details about the language model
- Bidirectionality matters more than unidirectionality
- An LM trained only on the supervised data doesn't help
- It has to be trained on much more data, otherwise results are poor
- Predicting with the LM embedding alone does not work well
Improved version: ELMo
The breakout of word token vectors, or contextual word vectors
- Differences
- The bidirectional language model is built a little differently; one goal is a compact LM that is easy for people to use (hence character representations may be used)
- In TagLM, only the top level of the pre-trained LM is fed into the main model
In ELMo, all layers of the bi-LSTM are used: a weight is learned for each hidden layer (see below), plus a global scaling factor $\gamma$
Layer weights
The model is not very deep here, only two layers of bi-LSTM
But the lower layer is better for lower-level syntax (the simpler things), e.g. part-of-speech tagging
The higher layer is better for higher-level semantics (the harder things), e.g. question answering
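A minimal numpy sketch of the ELMo-style combination described above: a softmax-normalized learned weight per layer plus a global scaling factor $\gamma$ (the layer activations here are made-up placeholders).

```python
import numpy as np

def elmo_combine(layer_states, s_logits, gamma):
    """Weighted sum of bi-LSTM layer activations, ELMo-style.

    layer_states: list of L arrays, each (seq_len, dim)
    s_logits:     L learnable scalars, softmax-normalized into layer weights
    gamma:        global scaling factor
    """
    s = np.exp(s_logits - np.max(s_logits))
    s = s / s.sum()                              # softmax over the layers
    combined = sum(w * h for w, h in zip(s, layer_states))
    return gamma * combined

# toy usage: 3 layers (token layer + 2 bi-LSTM layers), 4 tokens, dim 8
states = [np.random.randn(4, 8) for _ in range(3)]
rep = elmo_combine(states, s_logits=np.zeros(3), gamma=1.0)
```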
ULMfit
Universal Language Model Fine-tuning for Text Classification (transfer learning)
Adapt the language model toward the target task, e.g. named entity recognition
- Train a neural language model (3 hidden layers) on a big unsupervised corpus
- Fine-tune the model on the domain we actually need to work in
- Finally plug in the text classification objective
They did not feed features into a different network; instead they kept using the same network and only set a different objective at the end.
For example, we can use this network to predict the next word, but now we freeze the parameters of the network, i.e. also freeze the softmax parameters
Yet the network can still be used to predict other things, e.g. positive/negative sentiment in text classification
- Property
If you can train it on a large amount of data, then even after transfer learning to another task with only a little training data, the final performance can still be very good
Transformers (see also the notes at the end of NLP4)
- Motivation
We want to build models faster, and we want parallelization; traditional models are RNNs, which must finish one state before moving to the next, so they are very slow.
Although GRUs and LSTMs work well, they still have trouble with long sentences, which is why we need the attention mechanism.
Perhaps we can use only the attention mechanism and drop the RNN part entirely?
Attention is all you need!
Dot-Product Attention
新的attention layer
- input: a query q and a set of key-value (k-v) pairs
- output: a weighted sum of the values
$$A(q,K,V)=\sum_i \frac{e^{q\cdot k_i}}{\sum_j e^{q\cdot k_j}}\,v_i$$
The similarity between the key and the query determines how much of the corresponding value is used
So a softmax over the query-key scores gives the attention weights on the corresponding values
Queries and keys have the same dimensionality $d_k$; values have dimensionality $d_v$
scaled Dot-product attention
In the encoder,everything is word vectors(queries,keys,values)
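A minimal numpy sketch of scaled dot-product attention as defined above, with the usual $1/\sqrt{d_k}$ scaling (toy shapes, not from the lecture).

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_kv, d_k), V: (n_kv, d_v) -> (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # similarity of each query to each key
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # weighted sum of the values

# toy usage: 3 query positions, 5 key/value positions
Q, K, V = np.random.randn(3, 64), np.random.randn(5, 64), np.random.randn(5, 32)
out = scaled_dot_product_attention(Q, K, V)         # shape (3, 32)
```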
Multi-head attention
The second new idea: attention is a good idea, but a single attention distribution may not be enough, since one position may need to attend to several places at once
Complete transformer block
- Sum the results of the attention distributions
- Then normalize them (not with batch normalization, but with layer normalization)
Sometimes, to pass information along the chain, we need to go one layer deeper at each step, which means many layers, but a GPU can run them in parallel. In short, no free lunch.
Encoder input
Byte pair encoding is used, although feeding in word vectors directly also works
- Complete Encoder
In each block, we use the same Q, K, V, taken from the previous layer
The block is repeated 6 times
Transformer Decoder
BERT(Bidirectional Encoder Representations from Transformers)
Pre-training of Deep Bidirectional Transformers for Language Understanding.
Core idea
Use the encoder from the transformer network to compute a representation of a sentence. The resulting representation can be used for named entity recognition, sentiment analysis, and many other purposes.
Why
An LM (next-word prediction) normally uses only left or right context, but language understanding is bidirectional
- Why are LMs unidirectional?
- Directionality is needed to generate a well-formed probability distribution
- Words can see themselves in a bidirectional encoder (a bi-LSTM combines the left and right directions, so when passed to the next layer a word can effectively see future words, and prediction becomes meaningless)
Solution
Mask out some words in the sentence, then predict the masked words
Usually k% of the words are masked, with k = 15%. Masking too few is too expensive to train; masking too many leaves not enough context
- E.g.: The man went to the [MASK] to buy a [MASK] of milk
Our LM is then no longer a language model; it becomes a model that assigns a probability to a sentence. We use a cross-entropy loss to guess the two MASKs, hoping to recover the words store and gallon
This uses the entire context (both sides) to predict the blanks
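A minimal sketch of the masking step, simplified from the description above (real BERT also sometimes keeps or randomly replaces the chosen tokens instead of always writing [MASK]; the helper name is illustrative).

```python
import random

MASK, MASK_RATE = "[MASK]", 0.15

def mask_tokens(tokens, rng=random.Random(0)):
    """Randomly replace ~15% of tokens with [MASK]; return masked input and targets."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < MASK_RATE:
            targets[i] = tok          # the model is trained to recover these
            masked.append(MASK)
        else:
            masked.append(tok)
    return masked, targets

tokens = "the man went to the store to buy a gallon of milk".split()
masked, targets = mask_tokens(tokens)
```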
- Another objective
Besides predicting the masked words, there is a second task: learn the relationships between sentences, i.e. predict whether Sentence B is the sentence that actually follows Sentence A, or just a random sentence
That is, judge the relationship between sentence A and sentence B
Three kinds of relation: follows, contradicts, has no relation to the premise
The BERT algorithm in detail
- Input sentences are represented as word pieces (token embeddings + position embeddings + segment embeddings)
- Train to predict masked words + predict the sentence relationship
- The resulting pre-trained model is used for various tasks; keep using this model (no extra layers added) and fine-tune it for the particular task
Concretely: remove the very top prediction level (the layer that predicts the blanks / the sentence relationship) and substitute a final prediction layer for the particular task
- Note: BERT is fine-tuned and evaluated on the GLUE tasks
Lesson 14
Learning representations of variable length data
- Q: How do we turn variable-length data into a word representation?
- A: RNNs are naturally suited to this; GRU/LSTM currently dominate
But RNNs cannot be processed in parallel and are slow, and we also want hierarchical models
CNNs can be processed in parallel, but we have to trim things to reconcile the lengths; a convolution can fully use all the information inside its local receptive field
self-attention
In the attention mechanism, the encoder decides what information to absorb based on the content similarity at each position
So why not use it directly for the representation itself?
- self-attention
We want to re-represent the word representation (reconstruct a new representation for it). Concretely:
Compare a word with itself and with all its neighboring words, produce a weighted combination over the entire neighborhood, and then sum all the information according to those weights (this reconstructs the word)
We can also add feed-forward layers to compute new features
Transformer
For example, English-German translation
- Encoder part
Attention is permutation invariant (insensitive to the order of positions), so we could change the order of the positions without affecting the result
But in order to record position, we need to add a position representation
The self-attention layer can re-compute the representation for every position simultaneously
Then there is a feed-forward layer and residual connections between layers.
For the input there is also a skip connection that is simply added to the activation
- Decoder part
Self-attention is used to simulate a language model (so that future words can be masked)
Analysis of the individual layers
- Encoder self-attention
Say we want to re-represent $e_2$
- Linearly transform it into a query,
then linearly transform every position in the neighborhood into a key
These linear transformations can be viewed as features
Finally a softmax turns the scores into the representation
- We then apply one more linear transformation to mix the information and pass it through a feed-forward layer
Decoder part
The attention mechanism works best when the dimensionality is much greater than the length
Another way to understand attention
When we input a sentence: I kicked the ball
1. CNN
Kick can learn its subject from I and its object from ball (as long as they are inside the receptive field)
2. Self-attention
All inputs are treated alike (an average), so it cannot obtain different information from different places
3. One attention layer: for "who"
Now it can attend to the subject
4. Multi-head attention
We can pay attention to several things, and, importantly, the heads run in parallel
Of course each attention head comes with a softmax, so almost all the other entries are 0 and only the attended entry is close to 1
Applying self-attention to images
- Model the joint distribution of pixels
- Turn it into a sequence modeling problem
Assign the probabilities accordingly, which allows measuring generalization
To capture the dependencies between different parts of an image you need a large receptive field; with attention you can get the result at a low computational cost
This avoids the way a traditional CNN has to pull far-away pixels into the computation
Texture synthesis with self-similarity
This is really doing image generation
The classic work called non-local means was used for denoising
Based on some other patches of the image, compute content-based similarity, then pull in information
Compute content-based similarity between elements, then use that similarity to construct a convex combination that brings these things together
Image Transformer
Music generation with relative self-attention
Music also has a raw representation; it is sequential and has start/stop tokens
- Several different transformations (trained on half the length)
Self-similarity in music
Every point is a weighted average of the past, and the advantage is that no matter how far away it is, we can still access it
Different approaches
- CNN: looks at a few neighboring points
- Multi-head attention + Conv
Can access the history directly, and also knows how the current point relates to it
Music generation can go beyond the length that the model was trained on
Relative attention
- We consider not only how far apart things are (e.g. the distance from the top-left to the bottom-right; (2,2) means 2 horizontally and 2 vertically)
- but also how much we care about them (the weight)
Distance: for example in language translation
Previously a 3D tensor was generated: compute the relative distance, then find an embedding that corresponds to that distance
Done this way, a lot of memory can be saved
Relative attention behaves much like full convolution
Full convolution sums up all the features, so no matter where a feature is in the image, it can be picked out
Relative attention is the same: it does not care where a feature sits in the whole image, but the positions of the points inside the feature stay consistent
Lesson 15
The effect of k in beam search
- k too small
The generated sentences are ungrammatical and incorrect
- k too large
- computationally expensive
- alleviates some of the ungrammaticality
- In NMT, increasing k lowers the BLEU score (somewhat counter-intuitive; the most plausible explanation is that being optimal in sequence probability and getting a high BLEU score do not always go together)
- In dialogue systems, responses become more generic
That is, with small k the replies stay more on topic but make no sense; with larger k the model picks 'more correct' answers, but what it says is off topic (i.e. generic)
Sampling-based decoding
1. Greedy decoding
2. Beam search
3-1. Pure sampling
At each step t, randomly sample the next word from the probability distribution p (beam search only keeps the top k)
3-2. Top-n sampling
At each step t, randomly sample the next word from only the n most probable words of the distribution p
Similar to pure sampling, except that the probability distribution is truncated
n = 1 is greedy search; n = V (the vocabulary size) is pure sampling
- Increasing n gives more diverse/risky output
- Decreasing n gives safer/more generic output (see the sketch below)
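A minimal numpy sketch of top-n sampling as described above; `probs` is assumed to be the model's next-word distribution.

```python
import numpy as np

def top_n_sample(probs, n, rng=np.random.default_rng(0)):
    """Sample the next token id from the n most probable entries of probs."""
    top = np.argsort(probs)[-n:]           # indices of the n largest probabilities
    p = probs[top] / probs[top].sum()      # renormalize the truncated distribution
    return rng.choice(top, p=p)

# n=1 reduces to greedy decoding; n=len(probs) is pure sampling
probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
next_id = top_n_sample(probs, n=3)
```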
A trick: softmax temperature
The usual softmax turns a vector into probabilities; here we add a temperature hyperparameter
$$P_t(w)=\frac{\exp(s_w/\tau)}{\sum_{w'\in W}\exp(s_{w'}/\tau)}$$
- When $\tau$ is raised, $P_t$ becomes more uniform (raise the temperature)
This gives more diverse output (the probability mass is spread out)
- When $\tau$ is lowered, $P_t$ becomes more spiky (lower the temperature)
This gives less diverse output (the probability mass concentrates on the peaks)
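A minimal numpy sketch of the temperature-scaled softmax above, showing how raising or lowering $\tau$ flattens or sharpens the distribution.

```python
import numpy as np

def softmax_with_temperature(scores, tau):
    """P_t(w) = exp(s_w / tau) / sum_w' exp(s_w' / tau)."""
    z = scores / tau
    z -= z.max()                 # numerical stability
    p = np.exp(z)
    return p / p.sum()

scores = np.array([2.0, 1.0, 0.5, -1.0])
print(softmax_with_temperature(scores, tau=2.0))   # flatter: more diverse
print(softmax_with_temperature(scores, tau=0.5))   # spikier: less diverse
```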
NLP tasks and neural approaches
Summarization
Definition: input a text x, output a summary y, where y is shorter than x but still covers x's main information
- Single-document
Write a summary of one single document x
- Multi-document
Write a summary of several documents $x_1,x_2,\dots$; these documents overlap, e.g. several news reports on the same event
Datasets
- Gigaword
The first sentence or two of an article is used to generate the headline (sentence compression)
- LCSTS
Chinese microblog (Weibo) data
- NYT, CNN/Daily Mail
Full news articles, paired with multi-sentence summaries
- Wikihow
Full how-to articles paired with summary sentences
Sentence simplification is somewhat similar to summarization, but not the same
Rewrite complex sentences as simple ones
Datasets
- Simple Wikipedia
Standard Wikipedia rewritten in a simple version
- Newsela
News articles rewritten into a version children can understand
Two strategies
- Extractive summarization
Select parts: pick some representative sentences from the original text to form the summary
- easier
- restrictive (very limited)
- Abstractive summarization
Use natural language generation to produce new text
- more difficult
- more flexible
Extractive summarization
- Pre-neural summarization (a pipeline)
- Content selection
Choose which sentences from the source article to use
- Information ordering
Put those sentences in order
- Sentence realization
Edit the sequence of sentences (simplify, remove parts, fix continuity issues)
In content selection
- Sentence scoring functions are needed to make the choice, e.g.
- the presence of topic keywords (e.g. tf-idf)
- the position of the sentence (beginning/end)
- Graph-based algorithms instead view the article as a set of sentences with edges (connections) between them
The key is finding the edge weights, which can be understood as sentence similarity
Such algorithms can identify which sentences are central to the article
Finally, of course, a function is needed to evaluate and optimize the result
ROUGE
Recall-Oriented Understudy for Gisting Evaluation
$$\text{ROUGE}=\frac{\sum_{S \in \text{Reference Summaries}}\;\sum_{gram_n \in S}\text{Count}_{match}(gram_n)}{\sum_{S \in \text{Reference Summaries}}\;\sum_{gram_n \in S}\text{Count}(gram_n)}$$
It is a lot like BLEU: based on n-gram overlap, but with differences
- ROUGE has no brevity penalty
- ROUGE is based on recall, BLEU is based on precision
Precision is more important for machine translation, because you need a penalty to prevent outputs that are too short
Recall is more important for summarization, because we want to contain all the important information; recall measures how much of it you covered
In practice, though, F1 (which combines precision and recall) appears all the time in the summarization literature
- BLEU is reported as a single final number, whereas ROUGE reports scores for n = 1, 2, 3, 4 n-grams separately (one score per n-gram order)
The most common ROUGE scores
- ROUGE-1: unigram overlap
- ROUGE-2: bigram overlap
- ROUGE-L: longest common subsequence overlap
Here we no longer care only about how many words overlap, but about the longest common subsequence
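A minimal sketch of ROUGE-1 as a recall, following the formula above (single reference, unigrams only; real ROUGE toolkits also report precision and F1).

```python
from collections import Counter

def rouge_1_recall(candidate, reference):
    """Unigram-overlap recall: matched reference unigrams / total reference unigrams."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    matched = sum(min(cand[w], ref[w]) for w in ref)
    return matched / max(sum(ref.values()), 1)

print(rouge_1_recall("the cat sat on the mat", "the cat was on the red mat"))
```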
Summarization history
- Early 2015:
abstractive summarization
seq2seq + attention NMT
- Late 2015 onwards
make it easier to copy (while still preventing too much copying)
hierarchical/multi-level attention
more global/high-level content selection
use Reinforcement Learning to maximize ROUGE or other discrete goals (e.g. length)
Copy mechanisms
seq2seq + attention is good at writing fluent output, but bad at copying over details (e.g. rare words)
Letting the model copy/generate makes a hybrid extractive/abstractive approach
Naturally there are many variants
- Compute a quantity: the probability of generating a new word rather than copying a word
This is based on the current context and the current hidden state
$$P(w)=p_{gen}P_{vocab}(w)+(1-p_{gen})\sum_{i:\,w_i=w}a_i^t$$
We then keep using this $p_{gen}$: weighting the next-word probability distribution with it gives the final probability of each specific word appearing
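A minimal numpy sketch of the mixture above, assuming the attention weights and the vocabulary distribution are given (the function name and toy values are illustrative).

```python
import numpy as np

def final_word_dist(p_gen, p_vocab, attention, src_ids, vocab_size):
    """Mix the generation distribution with the copy (attention) distribution.

    p_vocab:   (V,) probability of generating each vocabulary word
    attention: (src_len,) attention weights a_i^t over source positions
    src_ids:   (src_len,) vocabulary id of each source token
    """
    p_copy = np.zeros(vocab_size)
    np.add.at(p_copy, src_ids, attention)      # sum attention over positions i with w_i = w
    return p_gen * p_vocab + (1 - p_gen) * p_copy

p_vocab = np.array([0.1, 0.6, 0.2, 0.1])       # toy vocabulary of 4 words
attention = np.array([0.7, 0.3])               # two source tokens
dist = final_word_dist(p_gen=0.8, p_vocab=p_vocab, attention=attention,
                       src_ids=np.array([2, 3]), vocab_size=4)
```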
- Problems:
- Too much copying
Sometimes long phrases, or even whole sentences, get copied; we do not want abstractive to turn into extractive
- Poor performance on long texts
If the input document is very long, overall content selection works badly
- And, naturally, there is no overall strategy for selecting content
Better content selection
Separate the stages of content selection and surface realization (text generation)
In seq2seq + attention, the two are mixed together
At each step of the decoder (surface realization) we do word-level content selection (attention)
As a result there is no global content selection strategy
Solution 1: Bottom-up summarization
Two very simple stages
- Content selection stage
Use a neural sequence-tagging model to tag each word (include it in the summary or not)
- Bottom-up attention stage
seq2seq + attention applies a mask so that the 'don't include' words cannot appear in the final output
This separates selection from generation, and because the selected words are fragmented, whole passages of sentences cannot simply be copied
Solution 2: use Reinforcement Learning to directly optimize ROUGE-L
With RL alone, you get a higher ROUGE score but a lower human judgement score
But combining the two gives fairly good results
Dialogue
Dialogue is very different from the rest
- Task-oriented dialogue
- Assistive operation, e.g. customer service systems
- Co-operative (solving a task together)
- Adversarial (e.g. debate)
- Social dialogue
- Chit-chat (conversation purely for fun)
- Therapy (mental wellbeing)
Because it is hard, people generally do not assemble a free-form neural network; they use pre-defined templates, or retrieve an appropriate response from a large pool of responses
- seq2seq + attention drew a lot of interest, but it still has many serious, pervasive deficiencies for chit-chat
- genericness / boring responses
- irrelevant responses
- repetition
- lack of context (it forgets the earlier conversation)
- lack of a consistent persona (inconsistent answers, as if two different people were speaking)
Some remedies
Irrelevant responses
Optimize for Maximum Mutual Information (MMI) between the input S and the response T
$$\log\frac{p(S,T)}{p(S)\,p(T)}$$
$$\hat{T}=\operatorname*{argmax}_{T}\{\log p(T\mid S)-\log p(T)\}$$
The response T must be likely given the input S, but T is also constrained by its own probability distribution: if p(T) is too high, it gets penalized
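A minimal sketch of the MMI objective used as a reranker over candidate responses; the scores and candidate strings are made-up placeholders, and the log-probabilities are assumed to come from a seq2seq model and a separate language model.

```python
def mmi_rerank(candidates):
    """Rerank candidate responses by log p(T|S) - log p(T).

    candidates: list of (response, log_p_t_given_s, log_p_t) triples.
    """
    return max(candidates, key=lambda c: c[1] - c[2])

cands = [("i don't know", -2.0, -1.0),              # generic: high p(T) hurts it
         ("the concert starts at eight", -3.0, -6.0)]
best = mmi_rerank(cands)                             # picks the more specific reply
```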
Genericness / boring responses
- Intervene fairly early to adjust things
- directly upweight rare words during beam search
- use a sampling decoding algorithm instead of beam search
- Conditioning fixes
- Condition the decoder on some additional content, e.g. sample from related words
- Train a retrieve-and-refine model rather than a generate-from-scratch model
i.e. take something from the training set and edit it to fit the current scenario
Repetitive responses
- Simple approach
Directly block repeating n-grams during beam search (see the sketch below)
- Complex solutions
- Train a coverage mechanism in seq2seq, which keeps the attention mechanism from attending to the same word multiple times
- Define a training objective that discourages repetition
This may require RL to train
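A minimal sketch of blocking repeated n-grams during beam search: a hypothesis is pruned if extending it would reproduce an n-gram it already contains (trigram blocking shown here; the helper name is illustrative).

```python
def violates_ngram_block(prefix, candidate, n=3):
    """True if appending `candidate` to `prefix` repeats an n-gram already in the prefix."""
    seq = prefix + [candidate]
    if len(seq) < n:
        return False
    new_ngram = tuple(seq[-n:])
    seen = {tuple(seq[i:i + n]) for i in range(len(seq) - n)}
    return new_ngram in seen

# during beam search, hypotheses whose next word violates the check are pruned
prefix = "the cat sat on the".split()
print(violates_ngram_block(prefix, "cat"))                                    # False
print(violates_ngram_block("the cat sat on the cat sat on".split(), "the"))   # True
```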
Storytelling
Write a story from an image or a brief prompt, or simply continue an existing story
Trained on the COCO dataset: an image is mapped to a sentence encoding; then a separate language model is trained to turn the sentence encoding into output in a certain style
- Common sentence-encoding space
Implemented using skip-thought vectors
Principle: learn a representation by predicting what surrounds it
There is still a problem, though: the stories contain lots of scene description but no actual plot progression
NLG evaluation
BLEU, ROUGE, METEOR, F1 and the like are used to judge translation quality
But they are not ideal even for machine translation, perform worse for summarization, and worse still for dialogue
Perplexity
It only tells you how powerful your LM is, not whether the generation is any good.
For instance, if your decoding is terrible, perplexity will not help you detect it
Word-vector based metrics
Compute word-vector similarity, or average the sentence's word vectors
We no longer require exactly the same words; as long as the similarity is high enough, the generation is considered acceptable
Define more focused automatic metrics, so we can look at specific aspects of the generated text, such as
- Fluency (run a well-trained LM over the result to get a probability)
- Correct style (run an LM trained on the target corpus and see how the result scores)
- Diversity (rare word usage, uniqueness of n-grams)
- Relevance to input (semantic similarity measures)
- Length/Repetition
- Task-specific metrics (whether the goal was met, e.g. compression rate for summarization)
- Is human evaluation really that good?
Actually, no
- inconsistent: the standards shift, e.g. raters get bored with the task and lower their requirements
- illogical
- they misinterpret
- they cannot articulate what they actually want (very subjective)
Some advice
- Sometimes pursuing a specific improvement is more manageable than improving overall generation quality
- Having a constraint gives you more of a handle than having none
- Improving the LM's perplexity will in most cases improve generation quality (though it is not the only way)
- You need an automatic metric, even an imperfect one
Lesson 16
Coreference Resolution
Which entity do words like he and she actually refer to? Find the correspondence between them
But note the case where, say, A and B appear and they is later used to refer to both of them
Sometimes: He is the smartest kid in his class.
Some systems decide that smartest kid refers to he and link them; others do not
- Applications
- Full text understanding
- Machine Translation
Languages mark gender and number differently, and some languages drop pronouns, which makes some translations hard
Alicia likes Juan because he's smart.
When this sentence is translated from Spanish into English, the subject's gender is unknown, so the male he is used throughout
Some languages, e.g. Turkish/Indonesian, do not distinguish gender
- Dialogue systems
For example, someone says 'watch 007', and what is meant is watching that movie, but there is no literal link between the two, so we need to establish the connection
- Solution
- Detect the mentions
Find the mentions/pronouns (easy)
- Cluster the mentions
Group the coreferent ones together to establish the links (hard)
Mention Detection
An NLP system can be used directly to preprocess the text and find these mentions
- Pronouns: he/she/I
Use a part-of-speech tagger to find them, filtering by noun/verb/adjective tags
- Named entities: people, places
Filter with an NER system
- Nouns: a dog, the fat cat stuck in the tree
Use a parser to find the structure of the sentence and locate the noun phrases
But there are still some problems
- It is sunny.
Here it does not refer to anything
- No student
This does not refer to anyone; it just says there is nobody
- The best sth in the world
This one is debatable: if such a thing really exists, the phrase refers to it, but most of the time it is a subjective judgement and does not actually refer to that thing
Solution: train classifiers to pick out what is a mention and what is not
This is done inside the clustering, so even if some cases are handled badly or wrongly, it does not matter much
Before 2016, the standard approach was a pipeline that filtered the data in five steps down to the final nouns.
Later, end-to-end coreference systems were trained
Another issue: anaphora
Some words have no independent reference; looking only at the current sentence you cannot tell what they refer to, you have to combine it with the preceding text
- E.g.: Donald Trump said he would sign the bill.
Here Donald Trump is the antecedent and he is the anaphor; without the preceding context it is hard to tell directly whether he refers to Donald Trump or someone else
But some cases are different
- E.g.: No dancer twisted her knee.
Here her does refer to dancer, even though no dancer refers to nothing
Bridging anaphora
- E.g.: We went to see a concert last night. The tickets are expensive
Here the tickets and the concert are not literally the same thing, yet the tickets are implicitly the tickets of that concert
Cataphora
Anaphora looks backward for the entity a pronoun refers to; cataphora looks forward for it
- E.g.:
From the corner on which he was lying, Lord Henry Wotton could just catch ... (Oscar Wilde)
Here he refers to Lord Henry Wotton, who is mentioned afterwards
This style is rarely used in modern language, and the systems people train always look backwards, never forwards
4 Coreference Models
- Rule-based (pronominal anaphora resolution)
Once you have found a pronoun, run this algorithm to find what it is coreferent with
E.g. step 3: traverse from left to right
Step 5: find the candidates
But there are problems
- He poured the water from the pitcher into the cup until it was full
- He poured the water from the pitcher into the cup until it was empty
These two sentences have the same syntactic structure, but their its refer to different things, so Hobbs' algorithm has its limits: following the algorithm, both would return the pitcher
This kind of example is called a Winograd Schema and can be used to test a system's ability
- Mention Pair
We collect mention pairs and train a binary classifier to decide whether a pair is coreferent
- Concretely: go through the document left to right; each time we reach a new mention (pronoun), compare it with every earlier mention and decide whether they are coreferent
- For positive examples we want $p(m_i,m_j)$ close to 1; for negative examples we want $p(m_i,m_j)$ close to 0
- $y_{ij}=1$ if $m_i$ and $m_j$ are coreferent, otherwise $-1$; N is the number of mentions in the document
- Optimize with a cross-entropy loss; the two sums correspond to going left to right:
$$J=-\sum_{i=2}^N\sum_{j=1}^{i} y_{ij}\log p(m_j,m_i)$$
- We set a threshold, and only pairs above the threshold get a coreference link
If A is coreferent with B and B with C, then A, B, C fall into one cluster
- Drawback: for a very long document we do not want to score every pair, only particular ones; scoring them all can be very slow
- Mention Ranking
For each mention (pronoun), we try to find an antecedent that comes before it and is coreferent with it.
We then pick one of the N decisions (we always pick one, even when the true antecedent is actually something else)
This is a problem for the first mention, which has nothing before it, so we add one additional dummy mention, NA, at the front
Then there are two options: 1. simply say there is no preceding antecedent; 2. say the first mention (e.g. I) refers to that NA, so that by the second mention there are already two choices
Finally a softmax turns the scores into probabilities, and we want the correct antecedent to get high probability
$$\sum_{j=1}^{i-1}\mathbb{1}(y_{ij}=1)\,p(m_j,m_i)$$
The corresponding loss function, which we minimize:
$$J=\sum_{i=2}^{N}-\log\sum_{j=1}^{i-1}\mathbb{1}(y_{ij}=1)\,p(m_j,m_i)$$
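A minimal numpy sketch of the mention-ranking loss above, with hypothetical score arrays; indices are shifted so that index 0 is the dummy NA mention.

```python
import numpy as np

def mention_ranking_loss(scores, gold):
    """J = sum_i -log( sum over gold antecedents j of p(m_j, m_i) ).

    scores: list over mentions i; scores[i] is an array of raw scores for
            candidate antecedents j = 0..i-1 (index 0 is the dummy NA mention)
    gold:   gold[i] is the set of indices j that are true antecedents of i
    """
    loss = 0.0
    for s, g in zip(scores, gold):
        p = np.exp(s - s.max())
        p /= p.sum()                        # softmax over candidate antecedents
        loss += -np.log(p[list(g)].sum())   # probability mass on correct antecedents
    return loss

# toy: mention 1 has only NA as antecedent, mention 2 corefers with mention 1
scores = [np.array([0.0]), np.array([0.2, 1.5])]
print(mention_ranking_loss(scores, gold=[{0}, {1}]))
```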
- But we left one question open: how do we decide whether two mentions are coreferent?
- Non-neural way (the classic way)
Many features, plus a feature-based classifier that scores them
e.g. features such as names and places
But this also has syntactic constraints and, like Hobbs' algorithm before, runs into problems
There is also the parallelism problem: John went with Jack to a movie. Joe went with him to a bar.
Here it is unclear who him refers to
- Neural Coref Model
Here you can see that features are still added on top of the word vectors, so the feature-based problems remain
- End-to-end model
- Start from the word vectors; run them through a character-level CNN and concatenate the result with the original vectors
- Run a bi-LSTM across the sentence
- Build a representation for each span (sub-sequence); we want one representation per span
$$g_i=[x^*_{\mathrm{START}(i)},\,x^*_{\mathrm{END}(i)},\,\hat{x}_i,\,\phi(i)]$$
Here START and END index the bi-LSTM hidden states at the start and end of the span
The middle term is an attention-based representation of the words in the span: every sub-sequence has a headword, and attention is used to capture it
The last term holds additional features
What we learn are the attention weights (how much attention to pay to each word)
- attention score
$$\alpha_t=\omega_{\alpha}\cdot \mathrm{FFNN}_{\alpha}(x^*_t)$$
- attention distribution (computed from the attention scores)
$$a_{i,t}=\frac{\exp(\alpha_t)}{\sum_{k=\mathrm{START}(i)}^{\mathrm{END}(i)}\exp(\alpha_k)}$$
- final representation
$$\hat{x}_i=\sum_{t=\mathrm{START}(i)}^{\mathrm{END}(i)}a_{i,t}\,x_t$$
Finally, additional features decide whether two spans are coreferent
$$s(i,j)=s_m(i)+s_m(j)+s_a(i,j)$$
The first two terms ask whether i and j are mentions at all; the last asks whether they look coreferent
One remaining problem: the number of spans is quadratic in the number of words in the text, which makes computation very expensive
So pruning is needed to cut the cost, scoring only the few spans that could plausibly be mentions
- Clustering
Each mention starts out as its own cluster; as the algorithm proceeds, we merge them
The algorithm finally stops when there is nothing left to merge
Coreference Evaluation
Many different metrics: MUC, CEAF, LEA, B-CUBED, BLANC
- B-cubed
In the first figure, a predicted cluster has four red items and one white item, so compared with a perfect cluster its precision is $\frac{4}{5}$
But there are actually six red items in total and only 4 were found, so recall is $\frac{4}{6}$
And so on for the rest
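A minimal sketch of B-cubed, assuming mentions are given ids and clusters are sets of ids; in this toy setup, the $\frac{4}{5}$ and $\frac{4}{6}$ above are the per-mention precision and recall of the red mentions in that predicted cluster.

```python
def b_cubed(predicted, gold):
    """B-cubed precision and recall, averaged over mentions.

    predicted, gold: lists of clusters, each cluster a set of mention ids.
    """
    pred_of = {m: c for c in predicted for m in c}
    gold_of = {m: c for c in gold for m in c}
    mentions = list(gold_of)
    prec = sum(len(pred_of[m] & gold_of[m]) / len(pred_of[m]) for m in mentions)
    rec = sum(len(pred_of[m] & gold_of[m]) / len(gold_of[m]) for m in mentions)
    return prec / len(mentions), rec / len(mentions)

# toy version of the example: one predicted cluster holds 4 of the 6 "red" mentions
# plus 1 "white" mention; gold keeps the 6 red mentions together and the white one apart
predicted = [{1, 2, 3, 4, 7}, {5, 6}]
gold = [{1, 2, 3, 4, 5, 6}, {7}]
print(b_cubed(predicted, gold))
```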
Lesson 17 Multitask Learning(decaNLP)
Start at random (we don't know where to start), although in practice there are pre-trained word vectors
When training reaches one goal, we have to restart from the beginning to tackle the next task
NLP needs many types of reasoning: logical, linguistic, emotional, visual
Supervision is still needed: without human language judgements, a computer can reach some goals purely algorithmically, but others it never will
How can many NLP tasks be expressed within one framework?
- Sequence tagging
named entity recognition, aspect-specific sentiment
- Text classification
dialogue state tracking, sentiment classification
- Seq2seq
machine translation, summarization, question answering
3 equivalent supertasks of NLP
- Language Modeling (predict the next word)
- Question Answering
- Dialogue
If your question answering predicts the next word of the answer, that is also an LM
As for dialogue, there are no good datasets at the moment, so dialogue work mostly does one-step dialogue, i.e. named entity tagging
So in the end, question answering is taken as the main problem to solve
The question answering here is the kind where the answer appears in the question
- No task-specific models or parameters, because we assume the task ID is not available
- The model must adapt internally to different tasks (not an if-statement that switches models depending on the task)
- It must leave room for zero-shot inference on tasks it has never seen
- Starting from the far left:
- Take the needed word vectors from a pre-trained model (GloVe) and keep them fixed;
append character n-gram embeddings;
then pass through a linear layer and a shared bi-LSTM with skip connections
In practice the initial word vectors differ somewhat depending on the problem
If they were not fixed, training would only make them fit the training data, ==but the test data always contains some unseen questions; changing the vectors to fit the training data would reduce generality,== so the word vectors are kept fixed
- Co-attention layer: take the outer product of the hidden states of the two sequences (the two columns)
This gives context/question dependence, i.e. contextual representations
- Transformer layers on both sides (performance is not especially good yet)
- Then a decoder produces the output, which is always a word (either from the question, from the softmax over the vocabulary, or from the context)
The data used
It is usually assumed that in multi-task training the model 'forgets' the first task when you move on to the second, but in fact it does not, just as learning a new language does not replace all your old languages
Training Strategies
Fully Joint
Take a mini-batch from the different tasks and train on that mini-batch
Much like ordinary mini-batch training, except each batch trains on several tasks at once
Anti-curriculum learning
Curriculum learning starts with the easiest tasks and gets progressively harder; here it is reversed: train the hardest first, then progressively simpler ones
Intuition: training only the hardest task can get stuck in a local optimum; training the simple tasks then exposes many unseen words and makes the model more general
Closing the gap:
Train 10 separate models and 1 multi-task model, apply the same adjustments to all of them, and you find the performance improves by different amounts, creating a gap between the models' results
Sometimes everything gets better, just some models more than others; more often most models get worse
Advantages of pre-training
Take a dataset the models have never seen and train two models on it: one pre-trained, and one with exactly the same architecture but never trained (all parameters randomly initialized)
The pre-trained one ends up performing better
Domain Adaptation
It is a simple form of transfer learning with a different distribution of words
Here two different datasets, Amazon and Yelp, were used for training
Zero-shot Classification
The question pointer makes it possible to handle alternations of the question without any additional fine-tuning
E.g. change the label positive to happy (supportive), or negative to sad (unsupportive), i.e. ask with different words but the same polarity
The idea works when actually tested, but there is no suitable dataset to train it on