CS224N NLP

A link to another author's excellent notes:
github.com/LooperXX/LooperXX.github.io.git


Abbreviation

--
[ToL] To learn
[ToLM] To learn more
[ToLO] To learn optionally
(0501) 05 min 01 s
(h0501) 1 hour 05 min 01 s
(hh0501) 2 hours 05 min 01 s

Lecture 1 - Introduction and Word Vectors

image-20220214151948950

NLP

Convert one-hot encodings to distributed representations

One-hot vectors can't represent relations between words, and they are too high-dimensional.

Word2vec

The positions of words within the context window are ignored.

image-20220214135823259 image-20220214135951707 image-20220214140036077

Use two vectors per word: a center-word vector and a context-word vector.

softmax function

image-20220214140209594

image-20220214141232602
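A rough numpy sketch (not from the lecture code) of the skip-gram softmax probability P(o|c) = exp(u_o·v_c) / Σ_w exp(u_w·v_c); the matrix names V (center vectors), U (context vectors) and the toy sizes are my own assumptions.

```python
import numpy as np

def skipgram_prob(center_id, context_id, V, U):
    """P(o | c) = exp(u_o . v_c) / sum_w exp(u_w . v_c)
    V: center-word vectors (|vocab| x d), U: context-word vectors (|vocab| x d)."""
    scores = U @ V[center_id]            # dot product of v_c with every u_w
    scores -= scores.max()               # subtract max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[context_id] / exp_scores.sum()

# toy example: 5-word vocabulary, 4-dimensional vectors
rng = np.random.default_rng(0)
V = rng.normal(size=(5, 4))
U = rng.normal(size=(5, 4))
print(skipgram_prob(center_id=2, context_id=0, V=V, U=U))
```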

Train the model: gradient descent

image-20220214141455212

There is a derivation here for calculating the gradient. (39:50-56:40)

The result is: image-20220214143920015

ToL

Review the derivation, and especially the following.

image-20220214142712551

Demonstration of some results with code (5640-h0516)

  • We can do vector addition, subtraction, multiplication and division, etc. (see the sketch below)
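A small illustrative sketch of the analogy arithmetic (king − man + woman ≈ queen) via cosine similarity; the 2-d toy vectors below are hand-made just so the analogy works, real demos use pretrained vectors.

```python
import numpy as np

def most_similar(query_vec, vectors, exclude=()):
    """Return the word whose vector has the highest cosine similarity to query_vec."""
    best_word, best_sim = None, -1.0
    for word, vec in vectors.items():
        if word in exclude:
            continue
        sim = np.dot(query_vec, vec) / (np.linalg.norm(query_vec) * np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# hand-made 2-d toy vectors arranged so the analogy works
vectors = {"king": np.array([3.0, 0.0]), "queen": np.array([3.0, 1.0]),
           "man": np.array([1.0, 0.0]), "woman": np.array([1.0, 1.0]),
           "apple": np.array([0.0, 5.0])}
target = vectors["king"] - vectors["man"] + vectors["woman"]
print(most_similar(target, vectors, exclude={"king", "man", "woman"}))   # "queen"
```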

QA

Why are there separate center-word and context-word vectors? (h0650)

To avoid a vector having to take a dot product with itself in some situations???

Even synonyms can be merged into one vector (h1215)

This differs from Lee's course, where he says synonyms use different vectors.

Lecture 2 - Word Vectors, Word Senses, and Neural Classifiers

image-20220214152314870

image-20220214152611205

Bag-of-words models (0245)

The model makes the same predictions at each position.

Gradient descent (0600)

Not usually used, because the full-batch computation is too expensive.

step size: not too big nor too small

image-20220214153736035

Stochastic gradient descent (SGD) [ToLM] (0920)

Take only a small sample (a mini-batch of windows) of the corpus at each step.

Orders of magnitude faster.

May even give better results.

But it is stochastic: either you need sparse matrix update operations to only update certain rows of the full embedding matrices U and V, or you need to keep a hash of word vectors. (1344) [ToL]
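A minimal sketch of why SGD is cheap: each update uses only a small random mini-batch. This is a toy least-squares problem, not word2vec itself; the sizes and learning rate are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                   # toy "corpus" of examples
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
lr = 0.1
for step in range(500):
    i = rng.integers(0, len(X), size=32)              # sample a small mini-batch
    grad = 2 * X[i].T @ (X[i] @ w - y[i]) / len(i)    # gradient on the batch only
    w -= lr * grad                                    # noisy but cheap update
print(w)   # close to true_w after a few hundred cheap steps
```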

more details of word2vec(1400)

image-20220214160400315

Skip-gram (SG) uses the center word to predict the context words.

SGNS: skip-gram with negative sampling [ToLO]

Uses the logistic (sigmoid) function instead of the softmax, and samples a few negative words instead of normalizing over the whole vocabulary.

CBOW does the opposite: it predicts the center word from the context words.

image-20220214162201460

Why use two vectors? (1500)

Sometimes a word would otherwise have to take a dot product with itself.

image-20220214165957190

[ToL]

The first term is for the positive (observed) context word and the remaining terms are for the sampled negative words (2800)

A word that gets sampled as a negative isn't hurt permanently: the same word will turn up as a center or context word on other occasions, with other samples, so its vector still gets learned step by step.
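A minimal numpy sketch of the skip-gram negative-sampling loss for one (center, context) pair, assuming the K negative vectors have already been sampled; the vector names follow the usual u/v notation but the code itself is my own illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(v_c, u_o, U_neg):
    """J = -log sigmoid(u_o . v_c) - sum_k log sigmoid(-u_k . v_c)
    v_c: center vector, u_o: true context vector, U_neg: (K x d) sampled negative vectors."""
    pos = np.log(sigmoid(u_o @ v_c))              # push the real pair together
    neg = np.log(sigmoid(-U_neg @ v_c)).sum()     # push sampled negatives apart
    return -(pos + neg)

rng = np.random.default_rng(0)
d = 8
print(sgns_loss(rng.normal(size=d), rng.normal(size=d), rng.normal(size=(5, d))))
```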

Why not capture co-occurrence counts directly?(2337)

image-20220214171624671

SVD(3230) [ToL]

https://zhuanlan.zhihu.com/p/29846048

Use SVD to get lower-dimensional representations for words.

image-20220214172338354(3451)

Count-based vs. direct prediction

image-20220214173136681(3900)

Encoding meaning components in vector differences (3948)

This is what makes addition and subtraction of word vectors meaningful.

image-20220214173907221

GloVe (4313)

image-20220214174416350

Make the dot product of the two word vectors approximate the log of their co-occurrence count (minimize their squared difference).
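A small numpy sketch of the GloVe objective described above (dot product plus biases should match the log co-occurrence count, weighted by f(X_ij)); the weighting function follows the standard form with x_max = 100 and α = 0.75, and the toy counts are made up.

```python
import numpy as np

def glove_loss(W, W_tilde, b, b_tilde, X, x_max=100, alpha=0.75):
    """J = sum_ij f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2 over nonzero X_ij."""
    i_idx, j_idx = np.nonzero(X)
    x = X[i_idx, j_idx]
    f = np.minimum((x / x_max) ** alpha, 1.0)            # down-weight very frequent pairs
    dots = np.sum(W[i_idx] * W_tilde[j_idx], axis=1)
    residual = dots + b[i_idx] + b_tilde[j_idx] - np.log(x)
    return np.sum(f * residual ** 2)

rng = np.random.default_rng(0)
V, d = 6, 4
X = rng.integers(0, 5, size=(V, V)).astype(float)        # toy co-occurrence counts
print(glove_loss(rng.normal(size=(V, d)), rng.normal(size=(V, d)),
                 np.zeros(V), np.zeros(V), X))
```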

How to evaluate word vectors: intrinsic vs. extrinsic (4756)

image-20220214175746085

Analogy evaluation and hyperparameters (intrinsic)(5515)

Word vector distances and their correlation with human judgements(5640)

Data shows that around 300-dimensional word vectors work well (5536)

The objective function for the GloVe model and What log-bilinear means(5739)

Word senses and word sense ambiguity(h0353)

Different senses of one word get different vectors.

Then the word's overall vector can be a (weighted) sum of its sense vectors.

image-20220214184234513

It works surprisingly well (h1200)

Because the vectors live in a sparse, high-dimensional space, you can separate out the different senses again (h1402)

Lecture 3 - Gradients by hand (matrix calculus) and algorithmically (the backpropagation algorithm): all the math details of doing neural net learning

image-20220214191638029

This needs to be studied again; it is not fully understood yet.

Named Entity Recognition(0530)

image-20220214185926393

Simple NER (0636)

image-20220214190032048

How the simple model runs (0836)

image-20220214190306082

update equation(1220)

image-20220214191531863

jacobian(1811)

image-20220214192319871

Chain Rule(2015)

image-20220214192526698

image-20220214193151609

do one example step (2650)

image-20220214193417520

Hadamard product [ToL]

Reusing Computation(3402)

image-20220215112833279

ds/dw

image-20220215113433454 image-20220215113255573

Forward and backward propagation(5000)

image-20220215115109857 image-20220215115507912

An example(5507)

a = x+y

b = max(y,z)

f = ab

image-20220215120119537
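Working the example above through by hand, with arbitrary values x = 1, y = 2, z = 0 (the chosen numbers are mine, the graph is the one from the slide):

```python
# Forward pass
x, y, z = 1.0, 2.0, 0.0
a = x + y                  # a = 3
b = max(y, z)              # b = 2
f = a * b                  # f = 6

# Backward pass (chain rule with local gradients)
df_da = b                  # d(a*b)/da = b
df_db = a                  # d(a*b)/db = a
da_dx, da_dy = 1.0, 1.0    # d(x+y)/dx, d(x+y)/dy
db_dy = 1.0 if y > z else 0.0   # gradient flows only through the max branch
db_dz = 1.0 if z > y else 0.0

df_dx = df_da * da_dx                     # = b
df_dy = df_da * da_dy + df_db * db_dy     # y feeds two paths: sum their gradients
df_dz = df_db * db_dz                     # = 0 here, since z lost the max

print(df_dx, df_dy, df_dz)   # 2.0, 5.0, 0.0
```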

Compute all gradients at once (h0005)

image-20220215145351805

Back-prop in general computation graph(h0800)[ToL]

image-20220215145612746

Automatic Differentiation(h1346)

Many tools can compute gradients automatically. image-20220215151328471

Manual Gradient checking : Numeric Gradient(h1900)

image-20220215152039987
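A tiny sketch of the numeric gradient check idea: perturb each coordinate by ±ε and compare the central-difference estimate with the analytic gradient. The test function is just an example.

```python
import numpy as np

def numeric_gradient(f, x, eps=1e-4):
    """Central-difference estimate: (f(x+eps) - f(x-eps)) / (2*eps), one coordinate at a time."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        x_plus, x_minus = x.copy(), x.copy()
        x_plus.flat[i] += eps
        x_minus.flat[i] -= eps
        grad.flat[i] = (f(x_plus) - f(x_minus)) / (2 * eps)
    return grad

# check an analytic gradient: f(x) = sum(x^2) has gradient 2x
f = lambda x: np.sum(x ** 2)
x = np.array([1.0, -2.0, 0.5])
print(numeric_gradient(f, x))   # ~ [2.0, -4.0, 1.0]
```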

Lecture 4 Dependency Parsing

image-20220215152912089

Two views of linguistic structure

Constituency = phrase structure grammar = context-free grammars(CFGs)(0331)

Phrase structure organizes words into nested constituents

image-20220215155446438

Dependency structure(1449)

Dependency structure shows which words depend on (modify, attach to, or are arguments of) which other words.

image-20220215155924838

Why do we need sentence structure?(2205)

A single word cannot express a complex meaning; we need sentence structure.

image-20220215160252254

Prepositional phrase attachment ambiguity.(2422)

There is some sentence to show it:

San Jose cops kill man with knife

Scientists count whales from space

The board approved [its acquisition] [by Royal Trustco Ltd.] [of Toronto] [for $27 a share] [at its monthly meeting].

Coordination scope ambiguity(3614)

**Shuttle veteran and longtime NASA executive Fred Gregory appointed to board**

Doctor: No heart, cognitive issues

Adjectival/Adverbial Modifier Ambiguity(3755)

Students get [first hand job] experience

Students get first [hand job] experience

Verb Phrase(VP) attachment ambiguity(4404)

Mutilated body washes up on Rio beach to be used for Olympics beach volleyball.

image-20220215163226892

Dependency Grammar and Dependency structure(4355)

image-20220215163439157

A fake ROOT is added for convenience.

Dependency Grammar history(4742)

image-20220215163821573

The rise of annotated data: Universal Dependencies treebanks (5100)

image-20220215164213166

Treebanks (5400)

It's slow to write a grammar or treebank by hand, but it's still worthwhile, because a treebank can be reused for many purposes, not just one NLP system.

How to build a dependency parser (5738)

image-20220215165030760

Dependency Parsing

image-20220215165444250

Projectivity(h0416)

image-20220215165801145

Methods of Dependency Parsing(h0521)

image-20220215170003800

Greedy transition-based parsing(h0621)

Basic transition-based dependency parser (h0808)

image-20220215170303720

[root] I ate fish        (start)

[root I] ate fish        (shift)

[root I ate] fish        (shift)

[root ate] fish          (left-arc: I ← ate)

[root ate fish]          (shift)

[root ate]               (right-arc: ate → fish)

[root]                   (right-arc: root → ate)
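A toy sketch of the shift / left-arc / right-arc mechanics behind the trace above, with the action sequence written by hand (a trained classifier or oracle would normally pick the actions); this is illustrative, not MaltParser.

```python
def parse(words, actions):
    """Greedy transition-based parsing: a stack, a buffer, and a set of arcs."""
    stack, buffer, arcs = ["ROOT"], list(words), []
    for action in actions:
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":            # second-from-top depends on top
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))
        elif action == "RIGHT-ARC":           # top depends on second-from-top
            dep = stack.pop()
            arcs.append((stack[-1], dep))
    return arcs

# "I ate fish": I <- ate, fish <- ate, ate <- ROOT
actions = ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "RIGHT-ARC", "RIGHT-ARC"]
print(parse(["I", "ate", "fish"], actions))
# [('ate', 'I'), ('ate', 'fish'), ('ROOT', 'ate')]
```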

MaltParser(h1351)[ToL]

image-20220215171511327

Evaluation of Dependency Parsing (h1845)[ToL]

image-20220215172606079

Lecture 5 - Language Models and Recurrent Neural Networks (RNNs)

image-20220215173841609

A neural dependency parser(0624)

image-20220215175916431

Distributed Representations(0945)

image-20220215180234046

Deep learning classifiers are non-linear classifiers (1210)

image-20220215180544369

The non-linear decision boundaries of deep learning classifiers:

image-20220215180703045

Simple feed-forward neural network multi-class classifier (1621)

image-20220215181359982

Neural Dependency Parser Model Architecture(1730)

image-20220215182714531

Graph-based dependency parsers (2044)

image-20220215182932684

Regularization && Overfitting (2529)

image-20220215183327050

Dropout (3100)[ToL]

image-20220215184016985

Vectorization(3333)

image-20220215184453079

Non-linearities (4000)

image-20220215185618924

Parameter Initialization (4357)

image-20220215185707615

Optimizers(4617)

image-20220215185920518

Learning Rates(4810)

The learning rate can be decreased (decayed) as training goes on.

image-20220215190108626

Language Modeling (5036)

image-20220215190413343

n-gram Language Models(5356)

image-20220215190718037 image-20220215190841180

Sparsity Problems (5922)

Many n-grams never occur in the corpus, so their estimated probabilities will be zero.

image-20220215191735246
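A minimal bigram-counting sketch that makes the sparsity problem concrete: any bigram that never appears in the (toy) corpus gets probability zero.

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

bigram_counts = defaultdict(Counter)
for prev, word in zip(corpus, corpus[1:]):
    bigram_counts[prev][word] += 1

def bigram_prob(prev, word):
    """P(word | prev) = count(prev, word) / count(prev); zero if never observed."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][word] / total if total else 0.0

print(bigram_prob("the", "cat"))   # 0.25: "the cat" occurs once out of 4 "the ..." bigrams
print(bigram_prob("the", "fish"))  # 0.0: never seen, the sparsity problem
```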

Storage Problems(h0117)

How to build a neural language model(h0609)

image-20220215192255066

A fixed-window neural Language Model(h1100)

image-20220216103904942

Recurrent Neural Network (RNN)(h1250)

x1 -> y1

W·h1 together with x2 -> y2 (the same weights W are reused at every step: the new hidden state combines the previous hidden state with the new input)

image-20220216105731982

A Simple RNN Language Model(h1430)

image-20220216110248289

image-20220216110444328
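A rough numpy sketch of one step of the simple RNN language model: a new hidden state from the previous hidden state plus the current word embedding, then a softmax over the vocabulary. The matrix names follow the usual slide notation (W_h, W_e, U); the toy sizes and random weights are mine.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_emb, d_hidden = 10, 6, 8
E  = rng.normal(size=(vocab, d_emb))          # word embeddings
Wh = rng.normal(size=(d_hidden, d_hidden))    # hidden-to-hidden weights
We = rng.normal(size=(d_hidden, d_emb))       # embedding-to-hidden weights
U  = rng.normal(size=(vocab, d_hidden))       # hidden-to-output weights
b1, b2 = np.zeros(d_hidden), np.zeros(vocab)

def rnn_lm_step(h_prev, word_id):
    """h_t = tanh(Wh h_{t-1} + We e_t + b1);  y_t = softmax(U h_t + b2)."""
    h = np.tanh(Wh @ h_prev + We @ E[word_id] + b1)
    scores = U @ h + b2
    probs = np.exp(scores - scores.max())
    return h, probs / probs.sum()

h = np.zeros(d_hidden)
for word_id in [3, 7, 1]:                     # run over a toy word-id sequence
    h, next_word_probs = rnn_lm_step(h, word_id)
print(next_word_probs.round(3))
```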

Lecture 6 Simple and LSTM Recurrent Neural Networks.

image-20220216110620895

image-20220216111222942

The Simple RNN Language Model (0310)

image-20220216112005817

Training an RNN Language Model (0818)

RNNs take more time to train.

Teacher Forcing

Feed the gold previous word in at each step, and penalize the model when its prediction doesn't match the target.

image-20220216112357329

image-20220216112814935

image-20220216113456552

But how do we get the answer?

image-20220216113810612

image-20220216114843011

Evaluating Language Models (2447)[ToL]

image-20220216115442761

A Language Model is a system that predicts the next word (3130)

image-20220216120043119

Other uses of RNNs (3229)

Tagging each word (e.g., part-of-speech tagging)

image-20220216120154220

Used for classification(3420)

image-20220216120331039

Used as a language encoder module (3500)

image-20220216120515954

Used to generate text (3600)

image-20220216120602654

Problems with Vanishing and Exploding Gradients(3750)[IMPORTANT]

image-20220216120728010

[ToL]

image-20220216120836593

Why This is a problem (4400)

image-20220216121352667

image-20220216121537213

image-20220216121801767

We can clip the gradient at a threshold (gradient clipping).

image-20220216121845504
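A small sketch of gradient clipping by global norm, i.e. the "limit" mentioned above; the threshold of 5.0 is arbitrary.

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """If the global norm of all gradients exceeds max_norm, rescale them to max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

grads = [np.array([30.0, 40.0])]            # norm 50, way above the threshold
print(clip_gradients(grads))                # rescaled to norm 5: [array([3., 4.])]
```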

Long Short-Term Memory RNNs (LSTMs) (5000) [ToL]

image-20220216142509947

image-20220216143131901

image-20220216143953637

image-20220216145201781

Bidirectional RNN (h2000)

We also need information from the words that come after.

image-20220216150058982

Lecture 7 - Translation, Seq2Seq, Attention

image-20220216150827060

Machine Translation(0245)

image-20220216152638415

What do you need? (1200)

You need a parallel corpus; then you need alignments.

Decoding for SMT(1748)

Try many possible sequences.

image-20220216153938352

What is Neural Machine Translation(NMT)(2130)

Neural Machine Translation (NMT) is a way to do machine translation with a single end-to-end neural network.

The neural network architecture is called a sequence-to-sequence model (aka seq2seq) and it involves RNNs.

image-20220216154743629

Seq2seq is used for more than MT (2600)

image-20220216155851923

(2732)[ToL]

Multi-layer RNNs(3323)

image-20220216160937711

Lower layers: more basic (lexical) meaning

Higher layers: more overall (semantic) meaning

image-20220216161044182

Greedy decoding(4000)

image-20220216161822091

Exhaustive search decoding(4200)

image-20220216161859032

beam search decoding(4400)

image-20220216162108945

image-20220216162654834

image-20220216163345111

image-20220216163610037

image-20220216163703962
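A compact sketch of beam search decoding. It assumes a function that returns next-word log-probabilities for a given prefix; the toy scoring function below is invented purely to make the example runnable.

```python
import numpy as np

def beam_search(log_prob_fn, vocab, beam_size=2, max_len=4, eos="</s>"):
    """Keep the beam_size highest-scoring partial hypotheses at each step."""
    beams = [([], 0.0)]                         # (prefix, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos:    # finished hypotheses are kept as-is
                candidates.append((prefix, score))
                continue
            log_probs = log_prob_fn(prefix)
            for word, lp in zip(vocab, log_probs):
                candidates.append((prefix + [word], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams

# toy model: prefers "he hit me </s>", just to exercise the search
vocab = ["he", "hit", "me", "</s>"]
def toy_log_probs(prefix):
    preferred = {0: "he", 1: "hit", 2: "me", 3: "</s>"}.get(len(prefix), "</s>")
    return [np.log(0.7) if w == preferred else np.log(0.1) for w in vocab]

for prefix, score in beam_search(toy_log_probs, vocab):
    print(" ".join(prefix), round(score, 2))
```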

How do we evaluate Machine Translation(5550)

BLEU

image-20220216163928786

NMT: perhaps the biggest success story of NLP deep learning (h0000)

Attention(h1300)

image-20220216165707869

image-20220216165937488

Lecture 8 Final Projects; Practical Tips

image-20220216170053324

Sequence to Sequence with attention(0235)

image-20220216173442920

Attention: in equations(0800)

image-20220216174203323

image-20220216174430719
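A minimal numpy sketch of the attention equations: dot-product scores between the decoder state and each encoder state, a softmax to get the attention distribution, then a weighted sum of encoder states. The shapes are toy.

```python
import numpy as np

def dot_product_attention(decoder_h, encoder_hs):
    """e_i = s . h_i;  alpha = softmax(e);  a = sum_i alpha_i * h_i."""
    scores = encoder_hs @ decoder_h               # one score per source position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # attention distribution
    context = weights @ encoder_hs                # weighted sum of encoder states
    return context, weights

rng = np.random.default_rng(0)
encoder_hs = rng.normal(size=(5, 8))              # 5 source positions, hidden size 8
decoder_h = rng.normal(size=8)
context, weights = dot_product_attention(decoder_h, encoder_hs)
print(weights.round(3), context.shape)
```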

there are several attention variants(1500)

image-20220216174747222

Attention is a general Deep Learning technique(2240)

image-20220216175744427

Final Project(3000)

Lecture 9 - Self-Attention and Transformers

Issues with recurrent models (0434)

Linear interaction distance

Sometimes words are too far apart for the model to learn their interaction.

image-20220216184249889

Lack of parallelizability(0723)

GPUs excel at parallel computation, but RNN computation is inherently sequential, so it can't be parallelized across time steps.

image-20220216184542395

If not recurrence, then what?

Word window models aggregate local contexts (1031)

image-20220217113153381

Attention(1406)

image-20220217113459930

Self-Attention(1638)

image-20220217114733959

Self-attention as an NLP building block (2222)

image-20220217115247771

Fixing the first problem with self-attention: sequence order (2423)

image-20220217120240889

Position representation vector through sinusoids(2624)
Sinusoidal position representations(2730)
Position representation vector from scratch(2830)

image-20220217120619459
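A small sketch of sinusoidal position representations, using the standard Transformer formula (sin/cos at geometrically spaced frequencies); the toy dimensions are mine.

```python
import numpy as np

def sinusoidal_positions(num_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d));  PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    positions = np.arange(num_positions)[:, None]                  # (T, 1)
    dims = np.arange(0, d_model, 2)[None, :]                       # (1, d/2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positions(num_positions=6, d_model=8)
print(pe.round(2))    # each row is added to the word embedding at that position
```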

Adding nonlinearities in self-attention(2953)

Barriers and solutions for Self-Attention as building block(2945)

image-20220221185604333

image-20220221185720186

(3040)

image-20220221185116405

(3428)

image-20220221185521157

The transformer encoder-decoder(3638)

image-20220221185909509

[ToL]

image-20220221190102912

Key, query, value (4000)

image-20220221190217303

image-20220221190523039

Multi-headed attention (4322)

(4450)

image-20220221190908268

image-20220221190957705

Residual connections(4723)

image-20220221191310743

Layer normalization(5045)

image-20220221191749317

Scaled dot product (5415)

Lecture 10 - Transformers and Pretraining

image-20220224134741859

Word structure and subword models(0300)

transform transformerify

taaaasty

image-20220224135937734

The byte-pair encoding(0659)

Subword models learn the structure of words from data; byte-pair encoding sits in between word-level and character-level modeling and doesn't rely on hand-specified linguistic structure.
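A minimal sketch of the byte-pair-encoding merge loop on a toy word-frequency vocabulary (the classic low/lower/newest/widest example); real BPE runs on a large corpus and stores the learned merge rules, but the count-and-merge idea is the same.

```python
from collections import Counter

# toy vocabulary: each word is a sequence of symbols (characters) with a frequency
vocab = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
         ("n", "e", "w", "e", "s", "t"): 6, ("w", "i", "d", "e", "s", "t"): 3}

def most_frequent_pair(vocab):
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(vocab, pair):
    """Replace every adjacent occurrence of `pair` with a single merged symbol."""
    merged = {}
    for word, freq in vocab.items():
        new_word, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged

for _ in range(3):                       # learn three merges
    pair = most_frequent_pair(vocab)
    vocab = merge_pair(vocab, pair)
    print("merged", pair)
print(list(vocab))
```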

(0943)

image-20220224145251761

image-20220224145105071

Motivating word meaning and context(1556)

image-20220224145804570

Pretraining whole models(2000)

image-20220224145922233

Word2vec doesn't take context into account, but we can use an LSTM to get context-dependent representations.

Mask some of the data and pretrain the model to predict it.

These models haven't hit overfitting yet; you can hold out some data to test them. (2811)

transformers for encoding and decoding (3030)

Pretraining through language modeling(3400)

image-20220224151624946

image-20220224151950866

Stochastic gradient descent and pretrain/finetune(3740)

There are three ways to pretrain a model (4021)

image-20220224152730308

Decoders can only see the history; encoders can also see the future.

Encoder-decoder models may be the best of both.

Decoder(4300)

image-20220224152938046

image-20220224153716173

Generative Pretrained Transformer(GPT) (4818)

image-20220224153928012

image-20220224154243901

GPT2(5400)

image-20220224154759716

Pretraining encoders (5545)

(BERT) (5654)

image-20220224165457421

image-20220224165235710

BERT masks some words and asks: which words did I mask?

Bidirectional encoder representations from transformers(h0100)

[ToL]

image-20220224170312332

image-20220224170413603

Limitations of pretrained encoders(h0900)

image-20220224171252011

Extensions of BERT(h1000)

image-20220224171454465

Pretraining Encoder-Decoder (h1200)

T5(h1500)

The model doesn't even know how many words were masked.

image-20220224172344435

image-20220224172541657

During pretraining the model learns a lot, but what it learns is not always desirable.

GPT3(h1800)

image-20220224172754530

image-20220224172922203

Lecture 11 Question Answering

image-20220224174146459

What is question answering(0414)

image-20220224175257101

image-20220224175334367

There are lots of practical applications(0629)

Beyond textual QA problems(1100)

Reading comprehension(1223)

image-20220224180147691

They are useful for many practical applications.

Reading comprehension is an important testbed for evaluating how well computer systems understand human language.

Stanford Question Answering Dataset (SQuAD) (1815)

image-20220224180828915

Neural models for reading comprehension(2428)

image-20220224181443258

LSTM-based vs BERT models (2713)

image-20220224181551779

image-20220224181815290

BiDAF(3200)

image-20220224181853733

Encoding(3200)

image-20220224182135349

Attention(3400)

image-20220224182405343

image-20220224182904883

Modeling and output layers(4640)

image-20220224183615556

image-20220224183819872

BERT for reading comprehension (5227)

image-20220224184028029

Comparisons between BiDAF and BERT models(2734)

image-20220224185118280

Can we design better pre-training objectives(h0000)

image-20220224185550578

open domain question answering(h1000)

image-20220225104946631

image-20220224191246022

image-20220225105306708

DPR(H1400)

image-20220224192658862

image-20220225105652971

image-20220225105747670

DensePhrases: Demo (h1800)

Lecture 12 - Natural Language Generation[ToL]

image-20220301143159380

What is neural language generation?(0300)

image-20220301142422083

Machine Translation

Dialogue Systems //siri

Summarization

Visual Description

Creative Generation //story

Components of NLG Systems(0845)

Basic of natural language generation(0916)

image-20220301143317131

A look at a single step(1024)

image-20220301143429583

Then select the next token and train (1115)

Teacher forcing needs to be studied further.

image-20220301143650876

Decoding(1317)

image-20220301143923558

Greedy methods(1432)

image-20220301143958990

Greedy methods get repetitive(1545)

image-20220301144123549

Why does repetition happen? (1613)

image-20220301144237210

How can we reduce repetition (1824)[ToL]

image-20220301144518763

People do not always choose the most likely (greedy) word (1930)

image-20220301144630546

Time to get random: Sampling(2047)

image-20220301144729442

Decoding : Top-k sampling(2100)

image-20220301145000174

image-20220301145018125

Issues with Top-k sampling(2339)

image-20220301145153941

Decoding: Top-p(nucleus)sampling(2421)

image-20220301145243854
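A small numpy sketch of top-k and top-p (nucleus) sampling over a single toy next-token distribution; the particular k, p and probabilities are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def top_k_sample(probs, k):
    """Zero out everything outside the k most likely tokens, renormalize, sample."""
    top = np.argsort(probs)[-k:]
    mask = np.zeros_like(probs)
    mask[top] = probs[top]
    mask /= mask.sum()
    return rng.choice(len(probs), p=mask)

def top_p_sample(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1          # number of tokens kept
    keep = order[:cutoff]
    mask = np.zeros_like(probs)
    mask[keep] = probs[keep]
    mask /= mask.sum()
    return rng.choice(len(probs), p=mask)

probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])            # toy next-token distribution
print(top_k_sample(probs, k=2), top_p_sample(probs, p=0.8))
```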

Scaling randomness: Softmax temperature (2500)[ToL]

image-20220301145837161

improving decoding: re-balancing distributions(2710)

image-20220301150002936

Backpropagation-based distribution re-balancing(3027)

image-20220301150637319

Improving Decoding: Re-ranking(3300)[ToL]

image-20220301151136510

Decoding: Takeaways(3540)

image-20220301151258962

Training NLG models(4114)

Maximum Likelihood Training(4200)

Are greedy decoders bad because of how they’re trained?

image-20220301152118621

Unlikelihood Training(4427)[ToL]

image-20220301153527149

Exposure Bias(4513)[ToL]

image-20220301153610391

Exposure Bias Solutions(4645)

image-20220301153742775

image-20220301153907117

image-20220301153919593

Reinforce Basics(4900)

image-20220301154050890

Reward Estimation(5020)

image-20220301154205522

image-20220301154243893

REINFORCE's dark side (5300)

image-20220301154452756

image-20220301154547630

Training: Takeaways (5423)

image-20220301154732991

Evaluating NLG Systems(5613)

Types of evaluation methods for text generation(5734)

image-20220301155705613

Content Overlap metrics(5800)

image-20220301155931178

A simple failure case(5900)

image-20220301160050567

Semantic overlap metrics(h0100)

image-20220301160319080

Model-based metrics(h0120)

image-20220301160406112

word distance functions(h0234)

image-20220301160511479

Beyond word matching(h0350)

image-20220301160556251

Human evaluations(h0433)

image-20220301160658568

image-20220301160747509

Issues(h0700)

image-20220301160937146

Takeaways (h0912)

image-20220301161428035

Ethical Considerations(h1025)

image-20220301161515113

image-20220301161639415

image-20220301161723483

image-20220301161839135

image-20220301161931109

image-20220301162101280

Lecture 13 - Coreference Resolution

image-20220301162522611

What is Coreference Resolution?(0604)

Identify all mentions that refer to the same entity in the world

image-20220301165446496

Applications (1712)

image-20220301165651721

image-20220301165822337

Coreference Resolution in Two steps(1947)

image-20220301165948737

Mention Detection(2049)

image-20220301170016948

Not quite so simple(2255)

image-20220301170236541

It is the best donut.

I want to find the best donut.

Avoiding a traditional pipeline system(2811)

image-20220301170543068

End to End[ToL]

Onto Coreference! First, some linguistics (3035)

Coreference and anaphora

image-20220301171220450

image-20220301171334445

not all anaphoric relations are coreferential (3349)

image-20220301171524154

Anaphora vs Cataphora(3610)

With anaphora, the referent comes before the pronoun; with cataphora, it comes after.

image-20220301171753920

Taking stock (3801)

image-20220301171920183

Four kinds of coreference Models(4018)

image-20220301172140149

Traditional pronominal anaphora resolution:Hobbs’s naive algorithm(4130)

image-20220301172320435

image-20220301172342791

image-20220301172431380

Knowledge-based Pronominal Coreference(4820)

image-20220301172732198

Hobbs's algorithm cannot really solve these cases; the model needs to actually understand the sentence.

Coreference Models: Mention Pair(5624)

image-20220301173814531

image-20220301173826974

Mention Pair Test Time(5800)

image-20220301173911539

Disadvantage(5953)

image-20220301174101225

Coreference Models: Mention Ranking(h0050)

image-20220301174326929

image-20220301174335701

Convolutional Neural Nets(h0341)

image-20220301174555163

What is convolution anyway?(h0452)

image-20220301184216564

image-20220301184306662

image-20220301184445934

image-20220301184526687

To summarize what we have so far, we usually use pooling.

image-20220301184655490

image-20220301184706063

Max pooling is usually better.

image-20220301184805861
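A minimal numpy sketch of a 1-D convolution over a sequence of word vectors followed by max pooling over time; the filter count, window size and dimensions are toy.

```python
import numpy as np

def conv1d(X, filters):
    """X: (seq_len, d) word vectors; filters: (num_filters, window, d).
    Slide each filter over every window of the sequence and take a dot product."""
    seq_len, d = X.shape
    num_filters, window, _ = filters.shape
    out = np.zeros((seq_len - window + 1, num_filters))
    for t in range(seq_len - window + 1):
        patch = X[t:t + window].ravel()                  # flatten the window
        out[t] = filters.reshape(num_filters, -1) @ patch
    return out

rng = np.random.default_rng(0)
X = rng.normal(size=(7, 4))                              # 7 words, 4-dim embeddings
filters = rng.normal(size=(3, 2, 4))                     # 3 filters over bigram windows
features = conv1d(X, filters)
pooled = features.max(axis=0)                            # max pooling over time
print(features.shape, pooled)                            # (6, 3) -> one value per filter
```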

End-to-End Neural Coref Model(h1206)

image-20220301184935797

image-20220301185015078

image-20220301185022792

image-20220301185132395

image-20220301185213638

image-20220301185316970

image-20220301185347334

image-20220301185443640

image-20220301185551550

Conclusion (h2017)

image-20220301185734941

Lecture 14 - T5 and Large Language Models

image-20220302144735211

(0243)

image-20220302145100222

image-20220302145356635

T5 with a task prefix(0800)

image-20220302145406303

Others

image-20220302145536205

image-20220302145606261

STSB

image-20220302145658323

Summarize

image-20220302145646869

T5 changes little from the original Transformer (1300)

image-20220302145917510

what should my pre-training data set be?(1325)

Scrape data from open web sources, then clean it, which gives C4. (1500)

Then: how to train from scratch (1659)

image-20220302151128378

pretrain(1805)

image-20220302151510138

choose the model(2412)

image-20220302152005363

They use the encoder-decoder model; it turns out to work well.

They don't change hyperparameters, because of the cost.

pre-training objective(2629)

image-20220302153925164

Comparing different training objectives.

different structure of data source(2822)

image-20220302154612488

Multi task learning (3443)

image-20220302155158191

Closing the gap between multi-task training and pre-training followed by separate fine-tuning (3621)

image-20220302155756918

What if you have four times as much compute as before? (3737)

image-20220302160009440

Overview(3840)

image-20220302160104583

image-20220302160234452

image-20220302160555766

image-20220302160612550

What about all of the other languages?(mT5)(4735)

image-20220302161124160

Same model, different (multilingual) corpus.

image-20220302161211041

image-20220302161358092

XTREME (5000)

image-20220302161445454

How much knowledge does a language model pick up during pre-training?(5225)

image-20220302161913596

image-20220302161932089

image-20220302161949876

image-20220302162028438

Salient span masking (5631)

image-20220302162316816

Instead of masking randomly, it masks salient spans such as named entities and dates.

Do large language models memorize their training data(h0100)

It seems they do.

image-20220302162918979

image-20220302163050189

image-20220302163113267

image-20220302163505954

image-20220302163519627

image-20220302163719877

Larger models need to see a particular example fewer times in order to memorize it.

Can we close the gap between large and small models by improving the transformer architecture(h1010)

image-20220302164909562

In these tests they changed parts of the architecture, such as the ReLU activation.

There were actually very few, if any, modifications that improved performance meaningfully.

image-20220302165203416

image-20220302165316814(h1700)

QA(h1915)

Lecture 15 - Add Knowledge to Language Models

image-20220302172329814

Recap: LM(0232)

image-20220302172634570

image-20220302172712490

What does a language model know?(0423)

image-20220302172753547

A prediction may be well-formed and logically plausible yet factually wrong.

image-20220302172916623

The importance of knowledge-aware language models (0700)

image-20220302173300654

Query traditional knowledge bases(0750)

image-20220302173336194

Query language models as knowledge bases(0955)

image-20220302173553905

Comparison and disadvantages (1010)

image-20220302173820443

Techniques to add knowledge to LMs(130)

image-20220302173937785

Add pretrained embeddings(1403)

image-20220302174313016

Aside: What is entity linking?(1516)

image-20220302174603921

Method 1: Add pretrained entity embeddings(1815)

image-20220302174729224

How do we incorporate pretrained entity embeddings from a different embedding space? (2000)

image-20220302174927805

ERNIE: Enhanced language representation with informative entities(2143)

image-20220302175236060

image-20220302175420597

image-20220302175702140

image-20220302175713761

strengths & remaining challenges(2610)

image-20220302175826353

Jointly learn to link entities with KnowBERT(2958)

image-20220302180440491

Use an external memory(3140)

image-20220302180727662

KGLM(3355)

image-20220302180818299

Local knowledge and full knowledge

image-20220302181037473

When should the model use the external knowledge(3600)

image-20220302181146581

image-20220302181436660

image-20220302181526770

image-20220302181538323

Compare to the others(4334)

image-20220302181801664

More recent takes: Nearest Neighbor Language Models(kNN-LM)(4730)

image-20220302182325290

image-20220302182507490

Modify the training data(5230)

image-20220302182823268

image-20220302182959293

WKLM(5458)

image-20220302183028613

image-20220302183142193

image-20220302183255968

Learn inductive biases through masking(5811)

image-20220302183351631

image-20220302183427849

Salient span masking(5927)

image-20220302183458012

Recap(h0053)

image-20220302183700886

Evaluating knowledge in LMs (h0211)

LAMA(h0250)

image-20220302183849664

image-20220302183927125

The limitations (h0650)

image-20220302184139639

LAMA_UnHelpful Names(LAMA-UHN)

image-20220302184226621

**They remove examples whose answers could be guessed from surface co-occurrence cues alone.**

Developing better prompts to query knowledge in LMs

image-20220302184443068

image-20220302184528706

Knowledge-driven downstream tasks(h1253)

image-20220302184702209

Relation extraction performance on TACRED (h1400)

image-20220302184753193

Entity typing performance on Open Entity

image-20220302184828514

Recap: Evaluating knowledge in LMs(h1600)

image-20220302184929078

Other exciting progress & what’s next?(h1652)

image-20220302185006721

Lecture 17 - Model Analysis and Explanation

image-20220303104239293

image-20220303104308448

Motivation

what are our models doing(0415)

image-20220303104435113

how do we make tomorrow’s model?(0515)

image-20220303104651667

What biases are built into model?(0700)

image-20220303105015554

How do we make progress over the next 25 years? (0800)

image-20220303105141648

Model analysis at varying levels of abstraction(0904)

image-20220303105647998

Model evaluation as model analysis(1117)

image-20220303105924421

Model evaluation as model analysis in natural language inference(1344)

image-20220303110240168

What if the model is simply using heuristics to get good accuracy? (1558)

image-20220303110832177

image-20220303110953359

Language models as linguistic test subjects(2023)

image-20220303111752546

image-20220303112316410

image-20220303112622131

Careful test sets as unit test suites: CheckListing(3230)

image-20220303115000790

Fitting the dataset vs learning the task(3500)

image-20220303115116821

Knowledge evaluation as model analysis(3642)

image-20220303115222614

Input influence: does my model really use long-distance context?(3822)

image-20220303115456959

Prediction explanations: what in the input led to this output?(4054)

image-20220303115848462

Prediction explanations: simple saliency maps(4230)

image-20220303120124359

image-20220303133241797

Explanation by input reduction (4607)

image-20220303134148143

image-20220303134313746

Analyzing models by breaking them(5106)

image-20220303134604267

image-20220303134644433

They add a nonsense sentence at the end, and the prediction changes.

image-20220303134756682

Changing the question also makes the prediction change.

Are models robust to noise in their input?(5518)

image-20220303135054871

It seems not.

Analysis of “interpretable” architecture components(5719)

image-20220303135659761

image-20220303135716017

image-20220303140006202

image-20220303140154452

image-20220303140306747

image-20220303140430315

Probing: supervised analysis of neural networks(h0408)

image-20220303140720120

image-20220303140831970

image-20220303141059579

image-20220303141301877

image-20220303141354126

image-20220303141443881

The most informative layers are in the middle.

image-20220303141554363

deeper, more abstract

Emergent simple structure in neural networks(h1019)

image-20220303141709095

Probing: trees simply recoverable from BERT representations (h1136)

image-20220303141908032

Final thoughts on probing and correlation studies(h1341)

image-20220303142155844

These are correlational studies, not causal ones.

Recasting model tweaks and ablations as analysis(h1406)

image-20220303142341661

Ablation analysis: do we need all these attention heads? (h1445)

image-20220303142453543

What’s the right layer order for a transformer?(h1537)

image-20220303142557160

Parting thoughts(h1612)

image-20220303142651251

Lecture 18 - Future of NLP + Deep Learning

image-20220303145634077

image-20220303145648087

General Representation Learning Recipe(0312)

image-20220303145813909

Certain properties emerge only when we scale up the model size!

Large Language Models and GPT-3(0358)

Large Language models and GPT-3(0514)

image-20220303150148074

What’s new about GPT-3

image-20220303150225480

image-20220303150257686

image-20220303150317443

There are three lectures left; they will be finished during the review after I come back from Lee's course.
