Attached: an expert's notes:
github.com/LooperXX/LooperXX.github.io.git
文章目录
- Abbreviation
- Lecture 1 - Introduction and Word Vectors
- Lecture 2 Word Vectors, Word Senses, and Neural Classifiers
-
- Bag-of-words models (0245)
- Gradient descent (0600)
- more details of word2vec(1400)
- Why use two vectors(1500)
- Why not capture co-occurrence counts directly?(2337)
- SVD(3230) [ToL]
- Count based vs direct prediction
- Encoding meaning components in vector differences(3948)
- GloVe (4313)
- How to evaluate word vectors Intrinsic vs. extrinsic(4756)
- Data shows that 300 dimensional word vector is good(5536)
- The objective function for the GloVe model and What log-bilinear means(5739)
- Word senses and word sense ambiguity(h0353)
- Lecture 3 Gradients by hand (matrix calculus) and algorithmically (the backpropagation algorithm): all the math details of doing neural net learning
-
- Needs to be learned again; it is not fully understood.
- Named Entity Recognition(0530)
- Simple NER (0636)
- update equation(1220)
- jacobian(1811)
- Chain Rule(2015)
- do one example step (2650)
- image-20220214193417520
- Reusing Computation(3402)
- Forward and backward propagation(5000)
- An example(5507)
- Compute all gradients at once (h0005)
- Back-prop in general computation graph(h0800)[ToL]
- Automatic Differentiation(h1346)
- Manual Gradient checking : Numeric Gradient(h1900)
- Lecture 4 Dependency Parsing
-
- Two views of linguistic structure
- Why do we need sentence structure?(2205)
- Prepositional phrase attachment ambiguity.(2422)
- Coordination scope ambiguity(3614)
- Adjectival/Adverbial Modifier Ambiguity(3755)
- Verb Phrase(VP) attachment ambiguity(4404)
- Dependency Grammar and Dependency structure(4355)
- Dependency Grammar history(4742)
- The rise of annotated data Universal Dependency tree(5100)
- how to build parser with dependency(5738)
- Dependency Parsing
- Methods of Dependency Parsing(h0521)
- Greedy transition-based parsing(h0621)
- Basic transition-based dependency parser (h0808)
- MaltParser(h1351)[ToL]
- Evaluation of Dependency Parsing (h1845)[ToL]
- Lecture-5 Language models and Recurrent Neural Networks (RNNs)
-
- A neural dependency parser(0624)
- Distributed Representations(0945)
- Deep Learning Classifiers are non-linear classifiers(1210)
- Simple feed-forward neural network multi-class classifier (1621)
- Neural Dependency Parser Model Architecture(1730)
- Graph-based dependency parsers (2044)
- Regularization && Overfitting (2529)
- Dropout (3100)[ToL]
- Vectorization(3333)
- Non-linearities (4000)
- Parameter Initialization (4357)
- Optimizers(4617)
- Learning Rates(4810)
- Language Modeling (5036)
- n-gram Language Models(5356)
- Sparsity Problems (5922)
- Storage Problems(h0117)
- How to build a neural language model(h0609)
- A fixed-window neural Language Model(h1100)
- Recurrent Neural Network (RNN)(h1250)
- A Simple RNN Language Model(h1430)
- Lecture 6 Simple and LSTM Recurrent Neural Networks.
-
- The Simple RNN Language Model (0310)
- Training an RNN Language Model (0818)
- Evaluating Language Models (2447)[ToL]
- Language Model is a system that predicts the next word(3130)
- Other use of RNN(3229)
- Problems with Vanishing and Exploding Gradients(3750)[IMPORTANT]
- Long Short Term Memory RNNS(LSTMS)(5000)[ToL]
- Bidirectional RNN (h2000)
- Lecture-7 Translation, Seq2Seq, Attention
-
- Machine Translation(0245)
- Decoding for SMT(1748)
- What is Neural Machine Translation(NMT)(2130)
- Seq2seq is more than MT(2600)
- (2732)[ToL]
- Multi-layer RNNs(3323)
- Greedy decoding(4000)
- Exhaustive search decoding(4200)
- beam search decoding(4400)
- How do we evaluate Machine Translation(5550)
- NMT perhaps the biggest success story of NLP Deep Learning(h00000)
- Attention(h1300)
- Lecture 8 Final Projects; Practical Tips
- Lecture-9 Self-Attention and Transformers
-
- Issues with recurrent models (0434)
- If not recurrence
- Self-Attention(1638)
- Self-attention as an NLP building block(2222)
- Fix the first self-attention problem
- Barriers and solutions for Self-Attention as building block(2945)
- The transformer encoder-decoder(3638)
- Residual connections(4723)
- Layer normalization(5045)
- Scaled dot product(5415)
- Lecture 10 - Transformers and Pretraining
-
- Word structure and subword models(0300)
- The byte-pair encoding(0659)
- Motivating word meaning and context(1556)
- Pretraining whole models(2000)
- This model hasn't overfit yet, so you can hold out some data to test it.(2811)
- transformers for encoding and decoding (3030)
- Pretraining through language modeling(3400)
- Stochastic gradient descent and pretrain/finetune(3740)
- Model pretraining has three ways (4021)
- Generative Pretrained Transformer(GPT) (4818)
- GPT2(5400)
- Pretraining Encoding(5545)
- Bidirectional encoder representations from transformers(h0100)
- Limitations of pretrained encoders(h0900)
- Extensions of BERT(h1000)
- Pretraining Encoder-Decoder (h1200)
- GPT3(h1800)
- Lecture 11 Question Answering
- What is question answering(0414)
- Beyond textual QA problems(1100)
- Reading comprehension(1223)
- Stanford question answering dataset (1815)
- Neural models for reading comprehension(2428)
- LSTM-based vs BERT models (2713)
- BiDAF(3200)
- BERT for reading comprehension (5227)
- Comparisons between BiDAF and BERT models(2734)
- Can we design better pre-training objectives(h0000)
- open domain question answering(h1000)
- DPR(H1400)
- DensePhrase:Demo(h1800)
- Lecture 12 - Natural Language Generation[ToL]
-
- What is neural language generation?(0300)
- Components of NLG Systems(0845)
- Decoding(1317)
-
- Greedy methods(1432)
- Greedy methods get repetitive(1545)
- why do repetition happen(1613)
- How can we reduce repetition (1824)[ToL]
- People do not always choose the greedy method(1930)
- Time to get random: Sampling(2047)
- Decoding : Top-k sampling(2100)
- Issues with Top-k sampling(2339)
- Decoding: Top-p(nucleus)sampling(2421)
- Scaling randomness: Softmax temperature (2500)[ToL]
- improving decoding: re-balancing distributions(2710)
- Backpropagation-based distribution re-balancing(3027)
- Improving Decoding: Re-ranking(3300)[ToL]
- Decoding: Takeaways(3540)
- Training NLG models(4114)
- Evaluating NLG Systems(5613)
- Types of evaluation methods for text generation(5734)
- Ethical Considerations(h1025)
- Lecture 13 - Coreference Resolution
-
- What is Coreference Resolution?(0604)
- Applications (1712)
- Coreference Resolution in Two steps(1947)
- Mention Detection(2049)
- Avoiding a traditional pipeline system(2811)
- Onto Coreference! First, some linguistics (3035)
- Anaphora vs Cataphora(3610)
- Taking stock (3801)
- Four kinds of coreference Models(4018)
- Traditional pronominal anaphora resolution:Hobbs's naive algorithm(4130)
- Knowledge-based Pronominal Coreference(4820)
- Coreference Models: Mention Pair(5624)
- Coreference Models: Mention Ranking(h0050)
- Convolutional Neural Nets(h0341)
- What is convolution anyway?(h0452)
- End-to-End Neural Coref Model(h1206)
- Conclusion (h2017)
- Lecture 14 - T5 and Large Language Models
-
- T5 with a task prefix(0800)
- Others
- T5 changes little from the original transformer(1300)
- what should my pre-training data set be?(1325)
- Then, how to train from scratch(1659)
- pretrain(1805)
- choose the model(2412)
- pre-training objective(2629)
- different structure of data source(2822)
- Multi task learning (3443)
- Closing the gap between multi-task training and pre-training followed by separate fine-tuning(3621)
- What if there is four times as much compute as before?(3737)
- Overview(3840)
- What about all of the other languages?(mT5)(4735)
- XTREME (5000)
- How much knowledge does a language model pick up during pre-training?(5225)
- Salient span masking (5631)
- Do large language models memorize their training data(h0100)
- Can we close the gap between large and small models by improving the transformer architecture(h1010)
- QA(h1915)
- Lecture 15 - Add Knowledge to Language Models
-
- Recap: LM(0232)
- What does a language model know?(0423)
- The importance of knowledge-aware language models(0700)
- Query traditional knowledge bases(0750)
- Query language models as knowledge bases(0955)
- Compare and disadvantage(1010)
- Techniques to add knowledge to LMs(130)
- Add pretrained embeddings(1403)
- Aside: What is entity linking?(1516)
- Method 1: Add pretrained entity embeddings(1815)
- ERNIE: Enhanced language representation with informative entities(2143)
- Jointly learn to link entities with KnowBERT(2958)
- Use an external memory(3140)
- Compare to the others(4334)
- More recent takes: Nearest Neighbor Language Models(kNN-LM)(4730)
- Modify the training data(5230)
- WKLM(5458)
- Learn inductive biases through masking(5811)
- Salient span masking(5927)
- Recap(h0053)
- Evaluating knowledge in LMS(h0211)
- LAMA-UnHelpful Names (LAMA-UHN)
- Relation extraction performance on TACRED(h1400)
- Entity typing performance on Open Entity
- Recap: Evaluating knowledge in LMs(h1600)
- Other exciting progress & what's next?(h1652)
- Lecture 17 - Model Analysis and Explanation
-
- Motivation
- Model analysis at varying levels of abstraction(0904)
- Model evaluation as model analysis(1117)
- Model evaluation as model analysis in natural language inference(1344)
- Language models as linguistic test subjects(2023)
- Careful test sets as unit test suites: CheckListing(3230)
- Fitting the dataset vs learning the task(3500)
- Knowledge evaluation as model analysis(3642)
- Input influence: does my model really use long-distance context?(3822)
- Prediction explanations: what in the input led to this output?(4054)
- Prediction explanations: simple saliency maps(4230)
- Explanation by input reduction (4607)
- Analyzing models by breaking them(5106)
- Are models robust to noise in their input?(5518)
- Analysis of "interpretable" architecture components(5719)
- Probing: supervised analysis of neural networks(h0408)
- Emergent simple structure in neural networks(h1019)
- Probing: trees simply recoverable from BERT representations(h1136)
- Final thoughts on probing and correlation studies(h1341)
- Recasting model tweaks and ablations as analysis(h1406)
- What's the right layer order for a transformer?(h1537)
- Parting thoughts(h1612)
- Lecture 18 - Future of NLP + Deep Learning
- There are three lectures left; they will be finished in the review when I come back from Lee.
Abbreviation
| Abbreviation | Meaning |
|---|---|
| [ToL] | To learn |
| [ToLM] | To learn more |
| [ToLO] | To learn optionally |
| (0501) | 05 min 01s |
| (h0501) | 1 hour 05 min 01s |
| (hh0501) | 2 hour 05 min 01s |
Lecture 1 - Introduction and Word Vectors

NLP
Convert one-hot encodings to distributed representations.
One-hot vectors can't represent relations between words, and they are far too big (one dimension per vocabulary word).
Word2vec
Ignores the position of words within the context window.
Uses two vectors per word: a center-word vector and a context-word vector.
softmax function
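The skip-gram model scores how likely an outside word $o$ is near a center word $c$ with a softmax over dot products ($v_c$ is the center vector, $u_w$ are the outside vectors):

$$P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}$$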

Train the model: gradient descent
The gradient of the objective is derived step by step in the lecture (39:50-56:40); the result is below.
[ToL] Review the derivation and the following especially.
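The result is the standard skip-gram gradient with respect to the center vector: the observed outside vector minus the model's expected outside vector.

$$\frac{\partial}{\partial v_c} \log P(o \mid c) = u_o - \sum_{x \in V} P(x \mid c)\, u_x$$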

Show some achievement with code(5640-h0516)
- We can do vector addition, subtraction, multiplication and division, etc.
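A minimal sketch of that kind of demo (not the lecture's actual notebook): analogies via vector arithmetic and cosine similarity. The `vecs` dict here is a hypothetical stand-in for real pretrained word2vec/GloVe vectors.

```python
import numpy as np

# Hypothetical embeddings; in practice these rows come from a trained model.
vecs = {w: np.random.randn(50) for w in ["king", "queen", "man", "woman", "apple"]}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(a, b, c, vecs):
    """Word closest to vec(b) - vec(a) + vec(c), e.g. king - man + woman."""
    target = vecs[b] - vecs[a] + vecs[c]
    candidates = [w for w in vecs if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vecs[w], target))

print(analogy("man", "king", "woman", vecs))  # ~ "queen" with real embeddings
```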
QA
Why are there center-word and context-word vectors? (h0650)
To avoid a vector being dotted with itself in some situations???
Even synonyms can be merged into one vector (h1215),
which is different from Lee, who says synonyms should use different vectors.
Lecture 2 Word Vectors, Word Senses, and Neural Classifiers

Bag-of-words models (0245)
The model makes the same predictions at each position.
Gradient descent (0600)
Not usually used directly, because computing the gradient over the whole corpus is too expensive.
Step size: not too big, not too small.
Stochastic gradient descent (SGD) [ToLM] (0920)
Take only a part of the corpus (a minibatch) for each update.
Vastly faster.
Maybe even gets better results.
But it is stochastic and the gradients are sparse: either you need sparse matrix update operations to only update certain rows of the full embedding matrices U and V, or you need to keep around a hash for word vectors. (1344) [ToL] (sketched below)
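A minimal sketch of one sparse SGD step over the two embedding matrices; the helper `grad_fn` and its return format are assumptions, not the assignment's actual code.

```python
def sgd_step(center_vecs, outside_vecs, batch, grad_fn, lr=0.05):
    """One SGD step over a sampled minibatch of (center, outside) word pairs.
    grad_fn is assumed to return sparse gradients as {row_index: gradient_vector}."""
    g_center, g_outside = grad_fn(center_vecs, outside_vecs, batch)
    for i, g in g_center.items():      # update only the rows that appear in the batch
        center_vecs[i] -= lr * g
    for j, g in g_outside.items():
        outside_vecs[j] -= lr * g
```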
more details of word2vec(1400)

SG (skip-gram) uses the center word to predict the context words.
SGNS: skip-gram with negative sampling [ToLO]
It uses the logistic (sigmoid) function instead of softmax and samples negative words from the corpus (objective below).
CBOW does the opposite: it predicts the center word from the context.
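For one (center, outside) pair with $K$ sampled negative words, the negative-sampling objective is

$$J_{\text{neg}}(v_c, o, U) = -\log \sigma(u_o^{\top} v_c) - \sum_{k=1}^{K} \log \sigma(-u_k^{\top} v_c)$$

where $\sigma$ is the logistic function: maximize the score of the observed pair, minimize the score of the sampled pairs.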
Why use two vectors(1500)
Sometimes a word would otherwise be dotted with itself.
[ToL]
In the objective, the first term is for the positive (observed) word and the last term is for the negative (sampled) words. (2800)
Negative words are sampled because the center word will turn up in other contexts; when it does, other words get sampled, and the model learns step by step.
Why not capture co-occurrence counts directly?(2337)

SVD(3230) [ToL]
https://zhuanlan.zhihu.com/p/29846048
Use SVD on the co-occurrence matrix to get lower-dimensional representations for words.
(3451)
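A minimal sketch of getting word vectors from a co-occurrence matrix with truncated SVD (the 3x3 matrix is a made-up toy example):

```python
import numpy as np

# Hypothetical word-word co-occurrence counts for a 3-word vocabulary.
X = np.array([[0., 2., 1.],
              [2., 0., 3.],
              [1., 3., 0.]])

U, S, Vt = np.linalg.svd(X)
k = 2                               # keep the top-k singular dimensions
word_vectors = U[:, :k] * S[:k]     # each row is a k-dimensional word vector
print(word_vectors.shape)           # (3, 2)
```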
Count-based vs. direct prediction (3900)
Encoding meaning components in vector differences(3948)
This is what makes addition and subtraction (analogies) meaningful for word vectors.

GloVe (4313)
Make the dot product of two word vectors approximate the log of their co-occurrence count.
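The GloVe objective, with co-occurrence counts $X_{ij}$ and a weighting function $f$ that caps very frequent pairs:

$$J = \sum_{i,j=1}^{|V|} f(X_{ij})\left(w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2$$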
How to evaluate word vectors Intrinsic vs. extrinsic(4756)
Analogy evaluation and hyperparameters (intrinsic)(5515)
Word vector distances and their correlation with human judgements(5640)
Data shows that 300 dimensional word vector is good(5536)
The objective function for the GloVe model and What log-bilinear means(5739)
Word senses and word sense ambiguity(h0353)
Different senses of a word ideally get different vectors.
A word's vector can then be a frequency-weighted sum of its sense vectors (see below).
This works surprisingly well (h1200):
because the vectors live in a high-dimensional, sparse space, the different senses can often be separated back out. (h1402)
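In the lecture's example, the single vector for a word like "pike" ends up as a frequency-weighted sum of its sense vectors:

$$v_{\text{pike}} = \alpha_1 v_{\text{pike}_1} + \alpha_2 v_{\text{pike}_2} + \alpha_3 v_{\text{pike}_3}, \qquad \alpha_i = \frac{f_i}{f_1 + f_2 + f_3}$$

where $f_i$ is the frequency of sense $i$.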
Lecture 3 Gradients by hand (matrix calculus) and algorithmically (the backpropagation algorithm): all the math details of doing neural net learning

Needs to be learned again; it is not fully understood.
Named Entity Recognition(0530)

Simple NER (0636)

How the simple model runs (0836)

update equation(1220)

jacobian(1811)

Chain Rule(2015)

do one example step (2650)
Hadamard product [ToL]
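A tiny reminder of what the Hadamard (element-wise) product does, since it appears in the chain-rule step:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
print(a * b)   # [ 4. 10. 18.]  -- (a ⊙ b)_i = a_i * b_i
```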
Reusing Computation(3402)
∂s/∂W — the same local error signal δ can be reused across ∂s/∂W and ∂s/∂b.
Forward and backward propagation(5000)
An example(5507)
a = x+y
b = max(y,z)
f = ab
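Working the chain rule on this example gives (the indicator functions come from the max):

$$\frac{\partial f}{\partial a} = b, \qquad \frac{\partial f}{\partial b} = a$$

$$\frac{\partial f}{\partial x} = b, \qquad \frac{\partial f}{\partial y} = b + a\cdot\mathbf{1}[y > z], \qquad \frac{\partial f}{\partial z} = a\cdot\mathbf{1}[z > y]$$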
Compute all gradients at once (h0005)
Back-prop in general computation graph(h0800)[ToL]

Automatic Differentiation(h1346)
Many tools (e.g., PyTorch, TensorFlow) can compute gradients automatically.
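A minimal sketch of automatic differentiation with PyTorch autograd, reusing the example above (the input values 1, 2, 0 are just illustrative):

```python
import torch

x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)
z = torch.tensor(0.0, requires_grad=True)

a = x + y                      # a = 3
b = torch.maximum(y, z)        # b = 2
f = a * b                      # f = 6
f.backward()                   # backpropagate through the computation graph

print(x.grad, y.grad, z.grad)  # tensor(2.) tensor(5.) tensor(0.)
```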
Manual Gradient checking : Numeric Gradient(h1900)
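A minimal sketch of a numeric gradient check: perturb each coordinate and compare the central difference $(f(x+h)-f(x-h))/2h$ with the analytic gradient.

```python
import numpy as np

def numeric_grad(f, x, h=1e-4):
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h; f_plus = f(x)
        x.flat[i] = old - h; f_minus = f(x)
        x.flat[i] = old
        grad.flat[i] = (f_plus - f_minus) / (2 * h)
    return grad

# Example: f(x) = sum(x^2) has analytic gradient 2x.
x = np.array([1.0, -2.0, 3.0])
print(numeric_grad(lambda v: np.sum(v ** 2), x))   # ~ [ 2. -4.  6.]
```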
Lecture 4 Dependency Parsing

Two views of linguistic structure
Constituency = phrase structure grammar = context-free grammars(CFGs)(0331)
Phrase structure organizes words into nested constituents

Dependency structure(1449)
Dependency structure shows which words depend on (modify, attach to, or are arguments of) which other words.

Why do we need sentence structure?(2205)
We cannot express complex meanings with single words alone; we need structure to compose them.

Prepositional phrase attachment ambiguity.(2422)
There is some sentence to show it:
San Jose cops kill man with knife
Scientists count whales from space
The board approved [its acquisition] [by Royal Trustco Ltd.] [of Toronto] [for $27 a share] [at its monthly meeting].
Coordination scope ambiguity(3614)
**Shuttle veteran and longtime NASA executive Fred Gregory appointed to board **
Doctor: No heart, cognitive issues
Adjectival/Adverbial Modifier Ambiguity(3755)
Students get [first hand] job experience  vs.  Students get first [hand job] experience
Verb Phrase(VP) attachment ambiguity(4404)
Mutilated body washes up on Rio beach to be used for Olympics beach volleyball.

Dependency Grammar and Dependency structure(4355)

A fake ROOT node is added for convenience.
Dependency Grammar history(4742)

The rise of annotated data Universal Dependency tree(5100)

Treebank (5400)
It is slow to build by hand, but it is still worth it, because it can be reused in other places, not only in NLP.
how to build parser with dependency(5738)

Dependency Parsing

Projectivity(h0416)

Methods of Dependency Parsing(h0521)

Greedy transition-based parsing(h0621)
Basic transition-based dependency parser (h0808)

[root] | I ate fish      (start)
[root I ate] | fish      (Shift, Shift)
[root ate] | fish        (Left-Arc: I ← ate)
[root ate fish] |        (Shift)
[root ate] |             (Right-Arc: ate → fish)
[root] |                 (Right-Arc: root → ate; finish)
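A toy sketch of the shift-reduce machinery behind the trace above, driven here by a hand-written oracle action sequence (a real transition-based parser predicts each action with a classifier):

```python
def parse(words, actions):
    stack, buffer, arcs = ["ROOT"], list(words), []
    for act in actions:
        if act == "SHIFT":
            stack.append(buffer.pop(0))
        elif act == "LEFT-ARC":            # second-from-top depends on top
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))
        elif act == "RIGHT-ARC":           # top depends on second-from-top
            dep = stack.pop()
            arcs.append((stack[-1], dep))
    return arcs

print(parse(["I", "ate", "fish"],
            ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "RIGHT-ARC", "RIGHT-ARC"]))
# [('ate', 'I'), ('ate', 'fish'), ('ROOT', 'ate')]
```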
MaltParser(h1351)[ToL]

Evaluation of Dependency Parsing (h1845)[ToL]

Lecture-5 Language models and Recurrent Neural Networks (RNNs)

A neural dependency parser(0624)

Distributed Representations(0945)

Deep Learning Classifiers are non-linear classifiers(1210)


Simple feed-forward neural network multi-class classifier (1621)

Neural Dependency Parser Model Architecture(1730)

Graph-based dependency parsers (2044)

Regularization && Overfitting (2529)

Dropout (3100)[ToL]

Vectorization(3333)

Non-linearities (4000)

Parameter Initialization (4357)

Optimizers(4617)

Learning Rates(4810)
The learning rate can be decreased as training goes on (learning rate decay).

Language Modeling (5036)

n-gram Language Models(5356)
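An n-gram model estimates the next-word probability directly from counts:

$$P(w_t \mid w_{t-n+1}, \dots, w_{t-1}) \approx \frac{\mathrm{count}(w_{t-n+1}, \dots, w_{t-1}, w_t)}{\mathrm{count}(w_{t-n+1}, \dots, w_{t-1})}$$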
Sparsity Problems (5922)
Many situation didn’t occur so it will be zero

Storage Problems(h0117)
How to build a neural language model(h0609)

A fixed-window neural Language Model(h1100)

Recurrent Neural Network (RNN)(h1250)
$$h^{(t)} = \sigma\!\left(W_h\, h^{(t-1)} + W_e\, e^{(t)} + b_1\right), \qquad \hat{y}^{(t)} = \mathrm{softmax}\!\left(U\, h^{(t)} + b_2\right)$$
The same weights are applied at every time step.

A Simple RNN Language Model(h1430)
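A minimal sketch of a simple RNN language model in PyTorch (not the course assignment's model; the sizes are illustrative):

```python
import torch
import torch.nn as nn

class SimpleRNNLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # e_t
        self.rnn = nn.RNN(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)     # logits over the vocabulary

    def forward(self, token_ids):                        # (batch, seq_len)
        h, _ = self.rnn(self.embed(token_ids))           # hidden state at every step
        return self.out(h)                               # (batch, seq_len, vocab_size)

model = SimpleRNNLM(vocab_size=1000)
logits = model(torch.randint(0, 1000, (2, 5)))           # next-word logits at each position
print(logits.shape)                                      # torch.Size([2, 5, 1000])
```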

Lecture 6 Simple and LSTM Recurrent Neural Networks.


The Simple RNN Language Model (0310)

Training an RNN Language Model (0818)
RNNs take more time to train.
Teacher Forcing: feed the ground-truth prefix at each step,
and penalize the model (cross-entropy loss) when its prediction differs from the actual next word.



But how do we get the answer?


Evaluating Language Models (2447)[ToL]
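The standard metric is perplexity: the inverse probability of the corpus, normalized by its length, which equals the exponential of the average cross-entropy loss $J(\theta)$ (lower is better):

$$\text{PPL} = \prod_{t=1}^{T}\left(\frac{1}{P_{\text{LM}}\!\left(w^{(t+1)} \mid w^{(t)}, \dots, w^{(1)}\right)}\right)^{1/T} = \exp\!\big(J(\theta)\big)$$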

Language Model is a system that predicts the next word(3130)

Other use of RNN(3229)
Tagging each word (e.g., part-of-speech tagging)

Used for classification(3420)

Used as a sentence encoder module (3500)

Used to generate text (3600)

Problems with Vanishing and Exploding Gradients(3750)[IMPORTANT]

[ToL]

Why This is a problem (4400)



For exploding gradients, we can give the gradient a limit: clip it when its norm exceeds a threshold (sketch below).
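A minimal sketch of gradient clipping in PyTorch; the tiny model and data are purely illustrative, the `clip_grad_norm_` call is the point:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(4, 10), torch.randn(4, 1)
loss = nn.functional.mse_loss(model(x), y)

optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # the "limit"
optimizer.step()
```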

Long Short Term Memory RNNS(LSTMS)(5000)[ToL]



