Lesson 13
Representation for a word
In the early days, supervised neural networks performed worse than feature-based classifiers (SVMs and the like)
Later, unsupervised neural networks caught up with the feature classifiers, but training took a very long time (7 weeks)
Adding a few hand-crafted features on top raised accuracy further
Later still, we could train on a small supervised corpus and obtain a dependency parser
We can initialize it with random word vectors, but using pre-trained vectors improves results by about 1% (mostly on rare words)
- Solutions for unknown words with word vectors
- At training time, the vocabulary covers the supervised data (e.g. words appearing > 5 times); everything else is treated as UNK
- Map all words with count < 5 to UNK and train one word vector specifically for them
But this still does not distinguish between different UNKs
So we need a char-level model
However, in Question Answering there are newer ways to deal with new words at test time
- The pre-trained word embeddings may contain a larger vocabulary than your application actually uses, so look the word up in the pre-trained embeddings and take its vector directly
- If it still cannot be found in the pre-trained embeddings, assign a random word vector, so every word has a unique identity; then if the same word appears in the answer, they can match perfectly (see the sketch below)
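A minimal sketch of the two strategies above, assuming a simple dict-based embedding table and numpy (the function and variable names here are illustrative, not from the lecture): rare training words share one UNK vector, and at test time we fall back to the pre-trained table or a cached per-word random vector.

```python
import numpy as np
from collections import Counter

EMB_DIM = 50
rng = np.random.default_rng(0)

def build_vocab(tokens, min_count=5):
    """Words appearing > min_count times get their own entry; the rest share <UNK>."""
    counts = Counter(tokens)
    vocab = {w for w, c in counts.items() if c > min_count}
    vocab.add("<UNK>")
    return vocab

def lookup(word, task_emb, pretrained_emb, unk_cache):
    """Test-time lookup: task vocab -> pre-trained vectors -> per-word random vector."""
    if word in task_emb:
        return task_emb[word]
    if word in pretrained_emb:
        return pretrained_emb[word]
    # unseen everywhere: give the word a stable random vector so repeated
    # occurrences of the same word still match each other
    if word not in unk_cache:
        unk_cache[word] = rng.normal(scale=0.1, size=EMB_DIM)
    return unk_cache[word]
```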
But new problems remain
- A word with one spelling can have several meanings, so we need multiple word vectors to distinguish the meanings; a single word vector is a mixture, and we want the model to separate it
- Some words have the same semantic meaning but are used differently, e.g. one as a noun and one as a verb; words also differ in connotation
Idea
Traditionally, we feed the input words into LSTM models and use the hidden state to predict the next word.
Each hidden state is then a context-specific word representation; can we use this representation to do more?
NER
Assign a tag to each word, e.g. person/location/date; this is another kind of feature
TagLM - Pre-ELMo
The CoNLL 2003 dataset is commonly used
- IDEA: we want to learn word meaning in context, but only learn the task RNN on a small amount of task-labeled data
Use a semi-supervised approach (it gives the model better features)
- With lots of unlabeled data we can train a conventional word embedding model (Word2Vec) to predict the next word; simultaneously, we can train a recurrent language model (a bi-LSTM)
- Feed words through the trained recurrent language model and take the hidden states of the bi-LSTM language model as another word representation
- Use both word representations together in the sequence tagging model
- Details: we train the bi-LSTM model on unlabeled data, then we want to do named entity recognition for "'New York' is located ..."
- Right side: an LM that predicts the next word.
Run "'New York' is located" through the pre-trained language model (all of whose parameters are frozen after pre-training), once forward and once backward, then concatenate to get a word representation.
At each position this gives a contextual word representation (the concatenated vectors, used as features in the named entity recognizer)
No gradient flows back, so the LM is never updated
- Left side: the named entity recognizer is trained on supervised data (the same sentence). Get its word embedding from word2vec and its character-level embedding from a character-level CNN/RNN, then concatenate the two
- Feed the result of step 2 into a bi-LSTM to get an output
- Concatenate the outputs of step 1 and step 3, giving a paired state
- Feed the result of step 4 into a second-layer bi-LSTM; on the output, do a softmax classification to produce the tags (e.g. beginning/end of a location)
- Details about the language model
- Bidirectionality matters more than unidirectionality
- An LM trained only on the supervised data doesn't help
- It has to be trained on much more data, otherwise results are poor
- Predicting with the LM embedding alone does not work well
Improved version: ELMo
The breakout of word token vectors, or contextual word vectors
- Differences
- The bidirectional language model is built a little differently; one goal is a compact LM that is easy for people to use (hence character representations may be used)
- In TagLM, only the top level of the pre-trained LM is fed into the main model
In ELMo, all layers of the bi-LSTM are used: a weight is learned for each hidden layer (see below), plus a global scaling factor $\gamma$
Layer weights
The model is not very deep here, only two layers of bi-LSTM
But the lower layer is better for lower-level syntax (the simpler things), e.g. part-of-speech tagging
The higher layer is better for higher-level semantics (the harder things), e.g. question answering
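A minimal numpy sketch of the ELMo-style combination described above: a softmax-normalized learned weight per layer plus a global scaling factor $\gamma$ (the layer activations here are made-up placeholders).

```python
import numpy as np

def elmo_combine(layer_states, s_logits, gamma):
    """Weighted sum of bi-LSTM layer activations, ELMo-style.

    layer_states: list of L arrays, each (seq_len, dim)
    s_logits:     L learnable scalars, softmax-normalized into layer weights
    gamma:        global scaling factor
    """
    s = np.exp(s_logits - np.max(s_logits))
    s = s / s.sum()                              # softmax over the layers
    combined = sum(w * h for w, h in zip(s, layer_states))
    return gamma * combined

# toy usage: 3 layers (token layer + 2 bi-LSTM layers), 4 tokens, dim 8
states = [np.random.randn(4, 8) for _ in range(3)]
rep = elmo_combine(states, s_logits=np.zeros(3), gamma=1.0)
```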
ULMfit
Universal Language Model Fine-tuning for Text Classification (transfer learning)
Adapt the language model toward the target task, e.g. named entity recognition
- Train a neural language model (3 hidden layers) on a big unsupervised corpus
- Fine-tune the model on the domain we actually need to work in
- Finally plug in the text classification objective
They did not feed features into a different network; instead they kept using the same network and only set a different objective at the end.
For example, we can use this network to predict the next word, but now we freeze the parameters of the network, i.e. also freeze the softmax parameters
Yet the network can still be used to predict other things, e.g. positive/negative sentiment in text classification
- Property
If you can train it on a large amount of data, then even after transfer learning to another task with only a little training data, the final performance can still be very good
Transformers (see also the notes at the end of NLP4)
- Motivation
We want to build models faster, and we want parallelization; traditional models are RNNs, which must finish one state before moving to the next, so they are very slow.
Although GRUs and LSTMs work well, they still have trouble with long sentences, which is why we need the attention mechanism.
Perhaps we can use only the attention mechanism and drop the RNN part entirely?
Attention is all you need!
Dot-Product Attention
新的attention layer
- input: a query q and a set of key-value (k-v) pairs
- output: a weighted sum of the values
$$A(q,K,V)=\sum_i \frac{e^{q\cdot k_i}}{\sum_j e^{q\cdot k_j}}\,v_i$$
The similarity between the key and the query determines how much of the corresponding value is used
So a softmax over the query-key scores gives the attention weights on the corresponding values
Queries and keys have the same dimensionality $d_k$; values have dimensionality $d_v$
scaled Dot-product attention
In the encoder,everything is word vectors(queries,keys,values)
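A minimal numpy sketch of scaled dot-product attention as defined above, with the usual $1/\sqrt{d_k}$ scaling (toy shapes, not from the lecture).

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_kv, d_k), V: (n_kv, d_v) -> (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # similarity of each query to each key
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # weighted sum of the values

# toy usage: 3 query positions, 5 key/value positions
Q, K, V = np.random.randn(3, 64), np.random.randn(5, 64), np.random.randn(5, 32)
out = scaled_dot_product_attention(Q, K, V)         # shape (3, 32)
```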
Multi-head attention
The second new idea: attention is a good idea, but a single attention distribution may not be enough, since one position may need to attend to several places at once
Complete transformer block
- Sum the results of the attention distributions
- Then normalize them (not with batch normalization, but with layer normalization)
Sometimes, to pass information along the chain, we need to go one layer deeper at each step, which means many layers, but a GPU can run them in parallel. In short, no free lunch.
Encoder input
Byte pair encoding is used, although feeding in word vectors directly also works
- Complete Encoder
In each block, we use the same Q, K, V, taken from the previous layer
The block is repeated 6 times
Transformer Decoder
BERT(Bidirectional Encoder Representations from Transformers)
Pre-training of Deep Bidirectional Transformers for Language Understanding.
Core idea
Use the encoder from the transformer network to compute a representation of a sentence. The resulting representation can be used for named entity recognition, sentiment analysis, and many other purposes.
Why
An LM (next-word prediction) normally uses only left or right context, but language understanding is bidirectional
- Why are LMs unidirectional?
- Directionality is needed to generate a well-formed probability distribution
- Words can see themselves in a bidirectional encoder (a bi-LSTM combines the left and right directions, so when passed to the next layer a word can effectively see future words, and prediction becomes meaningless)
Solution
Mask out some words in the sentence, then predict the masked words
Usually k% of the words are masked, with k = 15%. Masking too few is too expensive to train; masking too many leaves not enough context
- E.g.: The man went to the [MASK] to buy a [MASK] of milk
Our LM is then no longer a language model; it becomes a model that assigns a probability to a sentence. We use a cross-entropy loss to guess the two MASKs, hoping to recover the words store and gallon
This uses the entire context (both sides) to predict the blanks
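A minimal sketch of the masking step, simplified from the description above (real BERT also sometimes keeps or randomly replaces the chosen tokens instead of always writing [MASK]; the helper name is illustrative).

```python
import random

MASK, MASK_RATE = "[MASK]", 0.15

def mask_tokens(tokens, rng=random.Random(0)):
    """Randomly replace ~15% of tokens with [MASK]; return masked input and targets."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < MASK_RATE:
            targets[i] = tok          # the model is trained to recover these
            masked.append(MASK)
        else:
            masked.append(tok)
    return masked, targets

tokens = "the man went to the store to buy a gallon of milk".split()
masked, targets = mask_tokens(tokens)
```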
- Another objective
Besides predicting the masked words, there is a second task: learn the relationships between sentences, i.e. predict whether Sentence B is the sentence that actually follows Sentence A, or just a random sentence
That is, judge the relationship between sentence A and sentence B
Three kinds of relation: follows, contradicts, has no relation to the premise
The BERT algorithm in detail
- Input sentences are represented as word pieces (token embeddings + position embeddings + segment embeddings)
- Train to predict masked words + predict the sentence relationship
- The resulting pre-trained model is used for various tasks; keep using this model (no extra layers added) and fine-tune it for the particular task
Concretely: remove the very top prediction level (the layer that predicts the blanks / the sentence relationship) and substitute a final prediction layer for the particular task
- Note: BERT is fine-tuned and evaluated on the GLUE tasks
Lesson 14
Learning representations of variable length data
- Q: How do we turn variable-length data into a word representation?
- A: RNNs are naturally suited to this; GRU/LSTM currently dominate
But RNNs cannot be processed in parallel and are slow, and we also want hierarchical models
CNNs can be processed in parallel, but we have to trim things to reconcile the lengths; a convolution can fully use all the information inside its local receptive field
self-attention
In the attention mechanism, the encoder decides what information to absorb based on the content similarity at each position
So why not use it directly for the representation itself?
- self-attention
We want to re-represent the word representation (reconstruct a new representation for it). Concretely:
Compare a word with itself and with all its neighboring words, produce a weighted combination over the entire neighborhood, and then sum all the information according to those weights (this reconstructs the word)
We can also add feed-forward layers to compute new features
Transformer
For example, English-German translation
- Encoder part
Attention is permutation invariant (insensitive to the order of positions), so we could change the order of the positions without affecting the result
But in order to record position, we need to add a position representation
The self-attention layer can re-compute the representation for every position simultaneously
Then there is a feed-forward layer and residual connections between layers.
For the input there is also a skip connection that is simply added to the activation
- Decoder part
Self-attention is used to simulate a language model (so that future words can be masked)
Analysis of the individual layers
- Encoder self-attention
Say we want to re-represent $e_2$
- Linearly transform it into a query,
then linearly transform every position in the neighborhood into a key
These linear transformations can be viewed as features
Finally a softmax turns the scores into the representation
- We then apply one more linear transformation to mix the information and pass it through a feed-forward layer
Decoder part
The attention mechanism works best when the dimensionality is much greater than the length
Another way to understand attention
When we input a sentence: I kicked the ball
1. CNN
Kick can learn its subject from I and its object from ball (as long as they are inside the receptive field)
2. Self-attention
All inputs are treated alike (an average), so it cannot obtain different information from different places
3. One attention layer: for "who"
Now it can attend to the subject
4. Multi-head attention
We can pay attention to several things, and, importantly, the heads run in parallel
Of course each attention head comes with a softmax, so almost all the other entries are 0 and only the attended entry is close to 1
Applying self-attention to images
- Model the joint distribution of pixels
- Turn it into a sequence modeling problem
Assign the probabilities accordingly, which allows measuring generalization
To capture the dependencies between different parts of an image you need a large receptive field; with attention you can get the result at a low computational cost
This avoids the way a traditional CNN has to pull far-away pixels into the computation
Texture synthesis with self-similarity
This is really doing image generation
The classic work called non-local means was used for denoising
Based on some other patches of the image, compute content-based similarity, then pull in information
Compute content-based similarity between elements, then use that similarity to construct a convex combination that brings these things together
Image Transformer
Music generation with relative self-attention
Music also has a raw representation; it is sequential and has start/stop tokens
- Several different transformations (trained on half the length)
Self-similarity in music
Every point is a weighted average of the past, and the advantage is that no matter how far away it is, we can still access it
Different approaches
- CNN: looks at a few neighboring points
- Multi-head attention + Conv
Can access the history directly, and also knows how the current point relates to it
Music generation can go beyond the length that the model was trained on
Relative attention
- We consider not only how far apart things are (e.g. the distance from the top-left to the bottom-right; (2,2) means 2 horizontally and 2 vertically)
- but also how much we care about them (the weight)
Distance: for example in language translation
Previously a 3D tensor was generated: compute the relative distance, then find an embedding that corresponds to that distance
Done this way, a lot of memory can be saved
Relative attention behaves much like full convolution
Full convolution sums up all the features, so no matter where a feature is in the image, it can be picked out
Relative attention is the same: it does not care where a feature sits in the whole image, but the positions of the points inside the feature stay consistent
Lesson 15
The effect of k in beam search
- k too small
The generated sentences are ungrammatical and incorrect
- k too large
- computationally expensive
- alleviates some of the ungrammaticality
- In NMT, increasing k lowers the BLEU score (somewhat counter-intuitive; the most plausible explanation is that being optimal in sequence probability and getting a high BLEU score do not always go together)
- In dialogue systems, responses become more generic
That is, with small k the replies stay more on topic but make no sense; with larger k the model picks 'more correct' answers, but what it says is off topic (i.e. generic)
Sampling-based decoding
1. Greedy decoding
2. Beam search
3-1. Pure sampling
At each step t, randomly sample the next word from the probability distribution p (beam search only keeps the top k)
3-2. Top-n sampling
At each step t, randomly sample the next word from only the n most probable words of the distribution p
Similar to pure sampling, except that the probability distribution is truncated
n = 1 is greedy search; n = V (the vocabulary size) is pure sampling
- Increasing n gives more diverse/risky output
- Decreasing n gives safer/more generic output (see the sketch below)
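A minimal numpy sketch of top-n sampling as described above; `probs` is assumed to be the model's next-word distribution.

```python
import numpy as np

def top_n_sample(probs, n, rng=np.random.default_rng(0)):
    """Sample the next token id from the n most probable entries of probs."""
    top = np.argsort(probs)[-n:]           # indices of the n largest probabilities
    p = probs[top] / probs[top].sum()      # renormalize the truncated distribution
    return rng.choice(top, p=p)

# n=1 reduces to greedy decoding; n=len(probs) is pure sampling
probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
next_id = top_n_sample(probs, n=3)
```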
A trick: softmax temperature
The usual softmax turns a vector into probabilities; here we add a temperature hyperparameter
$$P_t(w)=\frac{\exp(s_w/\tau)}{\sum_{w'\in W}\exp(s_{w'}/\tau)}$$
- When $\tau$ is raised, $P_t$ becomes more uniform (raise the temperature)
This gives more diverse output (the probability mass is spread out)
- When $\tau$ is lowered, $P_t$ becomes more spiky (lower the temperature)
This gives less diverse output (the probability mass concentrates on the peaks)
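A minimal numpy sketch of the temperature-scaled softmax above, showing how raising or lowering $\tau$ flattens or sharpens the distribution.

```python
import numpy as np

def softmax_with_temperature(scores, tau):
    """P_t(w) = exp(s_w / tau) / sum_w' exp(s_w' / tau)."""
    z = scores / tau
    z -= z.max()                 # numerical stability
    p = np.exp(z)
    return p / p.sum()

scores = np.array([2.0, 1.0, 0.5, -1.0])
print(softmax_with_temperature(scores, tau=2.0))   # flatter: more diverse
print(softmax_with_temperature(scores, tau=0.5))   # spikier: less diverse
```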
NLP tasks and neural approaches
Summarization
Definition: input a text x, output a summary y, where y is shorter than x but still covers x's main information
- Single-document
Write a summary of one single document x
- Multi-document
Write a summary of several documents $x_1,x_2,\dots$; these documents overlap, e.g. several news reports on the same event
Datasets
- Gigaword
The first sentence or two of an article is used to generate the headline (sentence compression)
- LCSTS
Chinese microblog (Weibo) data
- NYT, CNN/Daily Mail
Full news articles, paired with multi-sentence summaries
- Wikihow
Full how-to articles paired with summary sentences
Sentence simplification is somewhat similar to summarization, but not the same
Rewrite complex sentences as simple ones
Datasets
- Simple Wikipedia
Standard Wikipedia rewritten in a simple version
- Newsela
News articles rewritten into a version children can understand
Two strategies
- Extractive summarization
Select parts: pick some representative sentences from the original text to form the summary
- easier
- restrictive (very limited)
- Abstractive summarization
Use natural language generation to produce new text
- more difficult
- more flexible
Extractive summarization
- Pre-neural summarization (a pipeline)
- Content selection
Choose which sentences from the source article to use
- Information ordering
Put those sentences in order
- Sentence realization
Edit the sequence of sentences (simplify, remove parts, fix continuity issues)
In content selection
- Sentence scoring functions are needed to make the choice, e.g.
- the presence of topic keywords (e.g. tf-idf)
- the position of the sentence (beginning/end)
- Graph-based algorithms instead view the article as a set of sentences with edges (connections) between them
The key is finding the edge weights, which can be understood as sentence similarity
Such algorithms can identify which sentences are central to the article
Finally, of course, a function is needed to evaluate and optimize the result
ROUGE
Recall-Oriented Understudy for Gisting Evaluation
$$\text{ROUGE}=\frac{\sum_{S \in \text{Reference Summaries}}\;\sum_{gram_n \in S}\text{Count}_{match}(gram_n)}{\sum_{S \in \text{Reference Summaries}}\;\sum_{gram_n \in S}\text{Count}(gram_n)}$$
It is a lot like BLEU: based on n-gram overlap, but with differences
- ROUGE has no brevity penalty
- ROUGE is based on recall, BLEU is based on precision
Precision is more important for machine translation, because you need a penalty to prevent outputs that are too short
Recall is more important for summarization, because we want to contain all the important information; recall measures how much of it you covered
In practice, though, F1 (which combines precision and recall) appears all the time in the summarization literature
- BLEU is reported as a single final number, whereas ROUGE reports scores for n = 1, 2, 3, 4 n-grams separately (one score per n-gram order)
The most common ROUGE scores
- ROUGE-1: unigram overlap
- ROUGE-2: bigram overlap
- ROUGE-L: longest common subsequence overlap
Here we no longer care only about how many words overlap, but about the longest common subsequence
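A minimal sketch of ROUGE-1 as a recall, following the formula above (single reference, unigrams only; real ROUGE toolkits also report precision and F1).

```python
from collections import Counter

def rouge_1_recall(candidate, reference):
    """Unigram-overlap recall: matched reference unigrams / total reference unigrams."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    matched = sum(min(cand[w], ref[w]) for w in ref)
    return matched / max(sum(ref.values()), 1)

print(rouge_1_recall("the cat sat on the mat", "the cat was on the red mat"))
```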
Summarization history
- Early 2015:
abstractive summarization
seq2seq + attention NMT
- Late 2015 onwards
make it easier to copy (while still preventing too much copying)
hierarchical/multi-level attention
more global/high-level content selection
use Reinforcement Learning to maximize ROUGE or other discrete goals (e.g. length)
Copy mechanisms
seq2seq + attention is good at writing fluent output, but bad at copying over details (e.g. rare words)
Letting the model copy/generate makes a hybrid extractive/abstractive approach
Naturally there are many variants
- Compute a quantity: the probability of generating a new word rather than copying a word
This is based on the current context and the current hidden state
$$P(w)=p_{gen}P_{vocab}(w)+(1-p_{gen})\sum_{i:\,w_i=w}a_i^t$$
We then keep using this $p_{gen}$: weighting the next-word probability distribution with it gives the final probability of each specific word appearing
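A minimal numpy sketch of the mixture above, assuming the attention weights and the vocabulary distribution are given (the function name and toy values are illustrative).

```python
import numpy as np

def final_word_dist(p_gen, p_vocab, attention, src_ids, vocab_size):
    """Mix the generation distribution with the copy (attention) distribution.

    p_vocab:   (V,) probability of generating each vocabulary word
    attention: (src_len,) attention weights a_i^t over source positions
    src_ids:   (src_len,) vocabulary id of each source token
    """
    p_copy = np.zeros(vocab_size)
    np.add.at(p_copy, src_ids, attention)      # sum attention over positions i with w_i = w
    return p_gen * p_vocab + (1 - p_gen) * p_copy

p_vocab = np.array([0.1, 0.6, 0.2, 0.1])       # toy vocabulary of 4 words
attention = np.array([0.7, 0.3])               # two source tokens
dist = final_word_dist(p_gen=0.8, p_vocab=p_vocab, attention=attention,
                       src_ids=np.array([2, 3]), vocab_size=4)
```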
- Problems:
- Too much copying
Sometimes long phrases, or even whole sentences, get copied; we do not want abstractive to turn into extractive
- Poor performance on long texts
If the input document is very long, overall content selection works badly
- And, naturally, there is no overall strategy for selecting content
Better content selection
Separate the stages of content selection and surface realization (text generation)
In seq2seq + attention, the two are mixed together
At each step of the decoder (surface realization) we do word-level content selection (attention)
As a result there is no global content selection strategy
Solution 1: Bottom-up summarization
Two very simple stages
- Content selection stage
Use a neural sequence-tagging model to tag each word (include it in the summary or not)
- Bottom-up attention stage
seq2seq + attention applies a mask so that the 'don't include' words cannot appear in the final output
This separates selection from generation, and because the selected words are fragmented, whole passages of sentences cannot simply be copied
Solution 2: use Reinforcement Learning to directly optimize ROUGE-L
With RL alone, you get a higher ROUGE score but a lower human judgement score
But combining the two gives fairly good results
Dialogue
Dialogue is very different from the rest
- Task-oriented dialogue
- Assistive operation, e.g. customer service systems
- Co-operative (solving a task together)
- Adversarial (e.g. debate)
- Social dialogue
- Chit-chat (conversation purely for fun)
- Therapy (mental wellbeing)
Because it is hard, people generally do not assemble a free-form neural network; they use pre-defined templates, or retrieve an appropriate response from a large pool of responses
- seq2seq + attention drew a lot of interest, but it still has many serious, pervasive deficiencies for chit-chat
- genericness / boring responses
- irrelevant responses
- repetition
- lack of context (it forgets the earlier conversation)
- lack of a consistent persona (inconsistent answers, as if two different people were speaking)
Some remedies
Irrelevant responses
Optimize for Maximum Mutual Information (MMI) between the input S and the response T
$$\log\frac{p(S,T)}{p(S)\,p(T)}$$
$$\hat{T}=\operatorname*{argmax}_{T}\{\log p(T\mid S)-\log p(T)\}$$
The response T must be likely given the input S, but T is also constrained by its own probability distribution: if p(T) is too high, it gets penalized
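A minimal sketch of the MMI objective used as a reranker over candidate responses; the scores and candidate strings are made-up placeholders, and the log-probabilities are assumed to come from a seq2seq model and a separate language model.

```python
def mmi_rerank(candidates):
    """Rerank candidate responses by log p(T|S) - log p(T).

    candidates: list of (response, log_p_t_given_s, log_p_t) triples.
    """
    return max(candidates, key=lambda c: c[1] - c[2])

cands = [("i don't know", -2.0, -1.0),              # generic: high p(T) hurts it
         ("the concert starts at eight", -3.0, -6.0)]
best = mmi_rerank(cands)                             # picks the more specific reply
```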
Genericness / boring responses
- Intervene fairly early to adjust things
- directly upweight rare words during beam search
- use a sampling decoding algorithm instead of beam search
- Conditioning fixes
- Condition the decoder on some additional content, e.g. sample from related words
- Train a retrieve-and-refine model rather than a generate-from-scratch model
i.e. take something from the training set and edit it to fit the current scenario
Repetitive responses
- Simple approach
Directly block repeating n-grams during beam search (see the sketch below)
- Complex solutions
- Train a coverage mechanism in seq2seq, which keeps the attention mechanism from attending to the same word multiple times
- Define a training objective that discourages repetition
This may require RL to train
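A minimal sketch of blocking repeated n-grams during beam search: a hypothesis is pruned if extending it would reproduce an n-gram it already contains (trigram blocking shown here; the helper name is illustrative).

```python
def violates_ngram_block(prefix, candidate, n=3):
    """True if appending `candidate` to `prefix` repeats an n-gram already in the prefix."""
    seq = prefix + [candidate]
    if len(seq) < n:
        return False
    new_ngram = tuple(seq[-n:])
    seen = {tuple(seq[i:i + n]) for i in range(len(seq) - n)}
    return new_ngram in seen

# during beam search, hypotheses whose next word violates the check are pruned
prefix = "the cat sat on the".split()
print(violates_ngram_block(prefix, "cat"))                                    # False
print(violates_ngram_block("the cat sat on the cat sat on".split(), "the"))   # True
```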
Storytelling
Write a story from an image or a brief prompt, or simply continue an existing story
Trained on the COCO dataset: an image is mapped to a sentence encoding; then a separate language model is trained to turn the sentence encoding into output in a certain style
- Common sentence-encoding space
Implemented using skip-thought vectors
Principle: learn a representation by predicting what surrounds it
There is still a problem, though: the stories contain lots of scene description but no actual plot progression
NLG evaluation
BLEU, ROUGE, METEOR, F1 and the like are used to judge translation quality
But they are not ideal even for machine translation, perform worse for summarization, and worse still for dialogue
Perplexity
It only tells you how powerful your LM is, not whether the generation is any good.
For instance, if your decoding is terrible, perplexity will not help you detect it
Word-vector based metrics
Compute word-vector similarity, or average the sentence's word vectors
We no longer require exactly the same words; as long as the similarity is high enough, the generation is considered acceptable
Define more focused automatic metrics, so we can look at specific aspects of the generated text, such as
- Fluency (run a well-trained LM over the result to get a probability)
- Correct style (run an LM trained on the target corpus and see how the result scores)
- Diversity (rare word usage, uniqueness of n-grams)
- Relevance to input (semantic similarity measures)
- Length/Repetition
- Task-specific metrics (whether the goal was met, e.g. compression rate for summarization)
- Is human evaluation really that good?
Actually, no
- inconsistent: the standards shift, e.g. raters get bored with the task and lower their requirements
- illogical
- they misinterpret
- they cannot articulate what they actually want (very subjective)
Some advice
- Sometimes pursuing a specific improvement is more manageable than improving overall generation quality
- Having a constraint gives you more of a handle than having none
- Improving the LM's perplexity will in most cases improve generation quality (though it is not the only way)
- You need an automatic metric, even an imperfect one
Lesson 16
Coreference Resolution
Which entity do words like he and she actually refer to? Find the correspondence between them
But note the case where, say, A and B appear and they is later used to refer to both of them
Sometimes: He is the smartest kid in his class.
Some systems decide that smartest kid refers to he and link them; others do not
- Applications
- Full text understanding
- Machine Translation
Languages mark gender and number differently, and some languages drop pronouns, which makes some translations hard
Alicia likes Juan because he's smart.
When this sentence is translated from Spanish into English, the subject's gender is unknown, so the male he is used throughout
Some languages, e.g. Turkish/Indonesian, do not distinguish gender
- Dialogue systems
For example, someone says 'watch 007', and what is meant is watching that movie, but there is no literal link between the two, so we need to establish the connection
- Solution
- Detect the mentions
Find the mentions/pronouns (easy)
- Cluster the mentions
Group the coreferent ones together to establish the links (hard)
Mention Detection
An NLP system can be used directly to preprocess the text and find these mentions
- Pronouns: he/she/I
Use a part-of-speech tagger to find them, filtering by noun/verb/adjective tags
- Named entities: people, places
Filter with an NER system
- Nouns: a dog, the fat cat stuck in the tree
Use a parser to find the structure of the sentence and locate the noun phrases
But there are still some problems
- It is sunny.
Here it does not refer to anything
- No student
This does not refer to anyone; it just says there is nobody
- The best sth in the world
This one is debatable: if such a thing really exists, the phrase refers to it, but most of the time it is a subjective judgement and does not actually refer to that thing
Solution: train classifiers to pick out what is a mention and what is not
This is done inside the clustering, so even if some cases are handled badly or wrongly, it does not matter much
Before 2016, the standard approach was a pipeline that filtered the data in five steps down to the final nouns.
Later, end-to-end coreference systems were trained
Another issue: anaphora
Some words have no independent reference; looking only at the current sentence you cannot tell what they refer to, you have to combine it with the preceding text
- E.g.: Donald Trump said he would sign the bill.
Here Donald Trump is the antecedent and he is the anaphor; without the preceding context it is hard to tell directly whether he refers to Donald Trump or someone else
But some cases are different
- E.g.: No dancer twisted her knee.
Here her does refer to dancer, even though no dancer refers to nothing
Bridging anaphora
- E.g.: We went to see a concert last night. The tickets are expensive
Here the tickets and the concert are not literally the same thing, yet the tickets are implicitly the tickets of that concert
Cataphora
Anaphora looks backward for the entity a pronoun refers to; cataphora looks forward for it
- E.g.:
From the corner on which he was lying, Lord Henry Wotton could just catch ... (Oscar Wilde)
Here he refers to Lord Henry Wotton, who is mentioned afterwards
This style is rarely used in modern language, and the systems people train always look backwards, never forwards
4 Coreference Models
- Rule-based (pronominal anaphora resolution)
Once you have found a pronoun, run this algorithm to find what it is coreferent with
E.g. step 3: traverse from left to right
Step 5: find the candidates
But there are problems
- He poured the water from the pitcher into the cup until it was full
- He poured the water from the pitcher into the cup until it was empty
These two sentences have the same syntactic structure, but their its refer to different things, so Hobbs' algorithm has its limits: following the algorithm, both would return the pitcher
This kind of example is called a Winograd Schema and can be used to test a system's ability
- Mention Pair
We collect mention pairs and train a binary classifier to decide whether a pair is coreferent
- Concretely: go through the document left to right; each time we reach a new mention (pronoun), compare it with every earlier mention and decide whether they are coreferent
- For positive examples we want $p(m_i,m_j)$ close to 1; for negative examples we want $p(m_i,m_j)$ close to 0
- $y_{ij}=1$ if $m_i$ and $m_j$ are coreferent, otherwise $-1$; N is the number of mentions in the document
- Optimize with a cross-entropy loss; the two sums correspond to going left to right:
$$J=-\sum_{i=2}^N\sum_{j=1}^{i} y_{ij}\log p(m_j,m_i)$$
- We set a threshold, and only pairs above the threshold get a coreference link
If A is coreferent with B and B with C, then A, B, C fall into one cluster
- Drawback: for a very long document we do not want to score every pair, only particular ones; scoring them all can be very slow
- Mention Ranking
For each mention (pronoun), we try to find an antecedent that comes before it and is coreferent with it.
We then pick one of the N decisions (we always pick one, even when the true antecedent is actually something else)
This is a problem for the first mention, which has nothing before it, so we add one additional dummy mention, NA, at the front
Then there are two options: 1. simply say there is no preceding antecedent; 2. say the first mention (e.g. I) refers to that NA, so that by the second mention there are already two choices
Finally a softmax turns the scores into probabilities, and we want the correct antecedent to get high probability
$$\sum_{j=1}^{i-1}\mathbb{1}(y_{ij}=1)\,p(m_j,m_i)$$
The corresponding loss function, which we minimize:
$$J=\sum_{i=2}^{N}-\log\sum_{j=1}^{i-1}\mathbb{1}(y_{ij}=1)\,p(m_j,m_i)$$
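A minimal numpy sketch of the mention-ranking loss above, with hypothetical score arrays; indices are shifted so that index 0 is the dummy NA mention.

```python
import numpy as np

def mention_ranking_loss(scores, gold):
    """J = sum_i -log( sum over gold antecedents j of p(m_j, m_i) ).

    scores: list over mentions i; scores[i] is an array of raw scores for
            candidate antecedents j = 0..i-1 (index 0 is the dummy NA mention)
    gold:   gold[i] is the set of indices j that are true antecedents of i
    """
    loss = 0.0
    for s, g in zip(scores, gold):
        p = np.exp(s - s.max())
        p /= p.sum()                        # softmax over candidate antecedents
        loss += -np.log(p[list(g)].sum())   # probability mass on correct antecedents
    return loss

# toy: mention 1 has only NA as antecedent, mention 2 corefers with mention 1
scores = [np.array([0.0]), np.array([0.2, 1.5])]
print(mention_ranking_loss(scores, gold=[{0}, {1}]))
```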
- But we left one question open: how do we decide whether two mentions are coreferent?
- Non-neural way (the classic way)
Many features, plus a feature-based classifier that scores them
e.g. features such as names and places
But this also has syntactic constraints and, like Hobbs' algorithm before, runs into problems
There is also the parallelism problem: John went with Jack to a movie. Joe went with him to a bar.
Here it is unclear who him refers to
- Neural Coref Model
Here you can see that features are still added on top of the word vectors, so the feature-based problems remain
- End-to-end model
- Start from the word vectors; run them through a character-level CNN and concatenate the result with the original vectors
- Run a bi-LSTM across the sentence
- Build a representation for each span (sub-sequence); we want one representation per span
$$g_i=[x^*_{\mathrm{START}(i)},\,x^*_{\mathrm{END}(i)},\,\hat{x}_i,\,\phi(i)]$$
Here START and END index the bi-LSTM hidden states at the start and end of the span
The middle term is an attention-based representation of the words in the span: every sub-sequence has a headword, and attention is used to capture it
The last term holds additional features
What we learn are the attention weights (how much attention to pay to each word)
- attention score
$$\alpha_t=\omega_{\alpha}\cdot \mathrm{FFNN}_{\alpha}(x^*_t)$$
- attention distribution (computed from the attention scores)
$$a_{i,t}=\frac{\exp(\alpha_t)}{\sum_{k=\mathrm{START}(i)}^{\mathrm{END}(i)}\exp(\alpha_k)}$$
- final representation
$$\hat{x}_i=\sum_{t=\mathrm{START}(i)}^{\mathrm{END}(i)}a_{i,t}\,x_t$$
Finally, additional features decide whether two spans are coreferent
$$s(i,j)=s_m(i)+s_m(j)+s_a(i,j)$$
The first two terms ask whether i and j are mentions at all; the last asks whether they look coreferent
One remaining problem: the number of spans is quadratic in the number of words in the text, which makes computation very expensive
So pruning is needed to cut the cost, scoring only the few spans that could plausibly be mentions
- Clustering
Each mention starts out as its own cluster; as the algorithm proceeds, we merge them
The algorithm finally stops when there is nothing left to merge
Coreference Evaluation
Many different metrics: MUC, CEAF, LEA, B-CUBED, BLANC
- B-cubed
In the first figure, a predicted cluster has four red items and one white item, so compared with a perfect cluster its precision is $\frac{4}{5}$
But there are actually six red items in total and only 4 were found, so recall is $\frac{4}{6}$
And so on for the rest
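A minimal sketch of B-cubed, assuming mentions are given ids and clusters are sets of ids; in this toy setup, the $\frac{4}{5}$ and $\frac{4}{6}$ above are the per-mention precision and recall of the red mentions in that predicted cluster.

```python
def b_cubed(predicted, gold):
    """B-cubed precision and recall, averaged over mentions.

    predicted, gold: lists of clusters, each cluster a set of mention ids.
    """
    pred_of = {m: c for c in predicted for m in c}
    gold_of = {m: c for c in gold for m in c}
    mentions = list(gold_of)
    prec = sum(len(pred_of[m] & gold_of[m]) / len(pred_of[m]) for m in mentions)
    rec = sum(len(pred_of[m] & gold_of[m]) / len(gold_of[m]) for m in mentions)
    return prec / len(mentions), rec / len(mentions)

# toy version of the example: one predicted cluster holds 4 of the 6 "red" mentions
# plus 1 "white" mention; gold keeps the 6 red mentions together and the white one apart
predicted = [{1, 2, 3, 4, 7}, {5, 6}]
gold = [{1, 2, 3, 4, 5, 6}, {7}]
print(b_cubed(predicted, gold))
```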
Lesson 17 Multitask Learning(decaNLP)
Start at random (we don't know where to start), although in practice there are pre-trained word vectors
When training reaches one goal, we have to restart from the beginning to tackle the next task
NLP needs many types of reasoning: logical, linguistic, emotional, visual
Supervision is still needed: without human language judgements, a computer can reach some goals purely algorithmically, but others it never will
How can many NLP tasks be expressed within one framework?
- Sequence tagging
named entity recognition, aspect-specific sentiment
- Text classification
dialogue state tracking, sentiment classification
- Seq2seq
machine translation, summarization, question answering
3 equivalent supertasks of NLP
- Language Modeling (predict the next word)
- Question Answering
- Dialogue
If your question answering predicts the next word of the answer, that is also an LM
As for dialogue, there are no good datasets at the moment, so dialogue work mostly does one-step dialogue, i.e. named entity tagging
So in the end, question answering is taken as the main problem to solve
The question answering here is the kind where the answer appears in the question
- No task-specific models or parameters, because we assume the task ID is not available
- The model must adapt internally to different tasks (not an if-statement that switches models depending on the task)
- It must leave room for zero-shot inference on tasks it has never seen
- Starting from the far left:
- Take the needed word vectors from a pre-trained model (GloVe) and keep them fixed;
append character n-gram embeddings;
then pass through a linear layer and a shared bi-LSTM with skip connections
In practice the initial word vectors differ somewhat depending on the problem
If they were not fixed, training would only make them fit the training data, ==but the test data always contains some unseen questions; changing the vectors to fit the training data would reduce generality,== so the word vectors are kept fixed
- Co-attention layer: take the outer product of the hidden states of the two sequences (the two columns)
This gives context/question dependence, i.e. contextual representations
- Transformer layers on both sides (performance is not especially good yet)
- Then a decoder produces the output, which is always a word (either from the question, from the softmax over the vocabulary, or from the context)
The data used
It is usually assumed that in multi-task training the model 'forgets' the first task when you move on to the second, but in fact it does not, just as learning a new language does not replace all your old languages
Training Strategies
Fully Joint
Take a mini-batch from the different tasks and train on that mini-batch
Much like ordinary mini-batch training, except each batch trains on several tasks at once
Anti-curriculum learning
Curriculum learning starts with the easiest tasks and gets progressively harder; here it is reversed: train the hardest first, then progressively simpler ones
Intuition: training only the hardest task can get stuck in a local optimum; training the simple tasks then exposes many unseen words and makes the model more general
Closing the gap:
Train 10 separate models and 1 multi-task model, apply the same adjustments to all of them, and you find the performance improves by different amounts, creating a gap between the models' results
Sometimes everything gets better, just some models more than others; more often most models get worse
Advantages of pre-training
Take a dataset the models have never seen and train two models on it: one pre-trained, and one with exactly the same architecture but never trained (all parameters randomly initialized)
The pre-trained one ends up performing better
Domain Adaptation
It is a simple form of transfer learning with a different distribution of words
Here two different datasets, Amazon and Yelp, were used for training
Zero-shot Classification
The question pointer makes it possible to handle alternations of the question without any additional fine-tuning
E.g. change the label positive to happy (supportive), or negative to sad (unsupportive), i.e. ask with different words but the same polarity
The idea works when actually tested, but there is no suitable dataset to train it on