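NLP from Scratch
I have recently decided to learn some NLP, and I am using this post to record what I come across.
1: N-gram
An n-gram model is a language model: given a sequence of words, it returns the probability of that sequence occurring. Common variants are the unigram, bigram, and trigram models.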
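By the chain rule, the probability of a sentence of n words w_1, ..., w_n is:
$$P(w_1, w_2, \dots, w_n) = P(w_1)\,P(w_2 \mid w_1) \cdots P(w_n \mid w_1, \dots, w_{n-1})$$
A unigram model assumes the words are independent of one another, so the probability becomes:
$$P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i)$$
The unigram model is not adequate, because 'i have a dream' and 'dream have a i' receive exactly the same probability. A bigram model instead conditions each word on the word before it:
$$P(w_1, w_2, \dots, w_n) = P(w_1) \prod_{i=2}^{n} P(w_i \mid w_{i-1})$$
Likewise, a trigram model conditions each word on the two words before it:
$$P(w_1, w_2, \dots, w_n) = P(w_1)\,P(w_2 \mid w_1) \prod_{i=3}^{n} P(w_i \mid w_{i-2}, w_{i-1})$$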
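What problems can n-grams solve? (See https://zhuanlan.zhihu.com/p/32829048.) One of them is using n-gram statistics from a corpus to judge whether a sentence is plausible. For example, "I like eating watermelon" should get a higher probability than "Watermelon likes eating me".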
import nltk
word_data = "The best performance can bring in sky high success."
nltk_tokens = nltk.word_tokenize(word_data)
print(list(nltk.bigrams(nltk_tokens)))
output:
[('The', 'best'), ('best', 'performance'), ('performance', 'can'), ('can', 'bring'),
('bring', 'in'), ('in', 'sky'), ('sky', 'high'), ('high', 'success'), ('success', '.')]
The same n-grams can be generated by hand with zip, or with nltk.util.ngrams:
import re
from nltk.util import ngrams

def tokenize(s):
    # Convert to lowercase
    s = s.lower()
    # Replace all non-alphanumeric characters with spaces
    s = re.sub(r'[^a-zA-Z0-9\s]', ' ', s)
    # Split on any whitespace (spaces and newlines); empty tokens are dropped
    return s.split()

def generate_ngrams(s, n):
    tokens = tokenize(s)
    # Use zip to slide a window of length n over the tokens,
    # then join each window into one space-separated n-gram string
    return [" ".join(gram) for gram in zip(*[tokens[i:] for i in range(n)])]

if __name__ == "__main__":
    s = """
    Natural-language processing (NLP) is an area of
    computer science and artificial intelligence
    concerned with the interactions between computers
    and human (natural) languages.
    """
    # Hand-written version: a list of strings such as "natural language"
    print(generate_ngrams(s, 2))
    # NLTK version: the same bigrams, returned as tuples
    print(list(ngrams(tokenize(s), 2)))
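To make the sentence-plausibility idea from this section concrete, here is a minimal sketch of a bigram model estimated from counts. The toy corpus, the `<s>`/`</s>` padding tokens, the function names, and the add-one smoothing are all my own assumptions for illustration; a real model would need a much larger corpus and probably a better smoothing scheme.

```python
from collections import Counter
import math

def train_bigram(sentences):
    """Count unigrams and bigrams over tokenized sentences padded with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def log_prob(tokens, unigrams, bigrams):
    """Log-probability of a sentence under the bigram model with add-one smoothing."""
    vocab_size = len(unigrams)
    padded = ["<s>"] + tokens + ["</s>"]
    total = 0.0
    for prev, word in zip(padded, padded[1:]):
        # P(word | prev) = (count(prev, word) + 1) / (count(prev) + |V|)
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
    return total

# Hypothetical toy corpus, only for illustration
corpus = [
    "i like eating watermelon".split(),
    "i like eating apples".split(),
    "we like watermelon".split(),
]
unigrams, bigrams = train_bigram(corpus)
natural = log_prob("i like eating watermelon".split(), unigrams, bigrams)
reversed_ = log_prob("watermelon eating like i".split(), unigrams, bigrams)
print(natural, reversed_)
```

The word order seen in the corpus scores higher than the reversed order, which is exactly the distinction a unigram model cannot make.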
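2: Why LSTM can solve the long-term dependency problem
In theory an RNN can make use of input information from arbitrarily far back. In practice, however, RNNs perform poorly when learning long sequences, because the gradient vanishes (or explodes) as it is propagated back through time.
colah gives the following example at http://colah.github.io/posts/2015-08-Understanding-LSTMs/: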
One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task, such as using previous video frames might inform the understanding of the present frame. If RNNs could do this, they’d be extremely useful. But can they? It depends.
Sometimes, we only need to look at recent information to perform the present task. For example, consider a language model trying to predict the next word based on the previous ones. If we are trying to predict the last word in “the clouds are in the sky,” we don’t need any further context – it’s pretty obvious the next word is going to be sky. In such cases, where the gap between the relevant information and the place that it’s needed is small, RNNs can learn to use the past information.
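From the point of view of the gradient derivation, an RNN cannot rely on long-term information because its gradient can vanish or explode; for details, see the RNN gradient derivation.
An LSTM contains three gates (the input gate, the forget gate, and the output gate) and also introduces a cell state that carries the history.
[Figure: structure of an LSTM cell]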
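So why can LSTM solve the long-term dependency problem? In a plain RNN, the state S_t is a nonlinear transformation of S_{t-1} (writing x_t for the input, W and U for the weights, and \phi for the activation function):
$$S_t = \phi(W S_{t-1} + U x_t)$$
As a result, the chain rule introduces a product of the activation derivative and the weights at every time step:
$$\frac{\partial S_t}{\partial S_k} = \prod_{j=k+1}^{t} \frac{\partial S_j}{\partial S_{j-1}} = \prod_{j=k+1}^{t} \operatorname{diag}\big(\phi'(W S_{j-1} + U x_j)\big)\, W$$
This repeated multiplication easily makes the gradient explode or vanish. An LSTM, by contrast, updates its cell state with a linear addition (f_t is the forget gate, i_t the input gate, and \tilde{C}_t the candidate state):
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$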
The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It’s very easy for information to just flow along it unchanged.
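Differentiating the cell state C_t with respect to C_{t-1} therefore leaves only a gate value. Along the direct cell-state path (treating the gates as given), the factor is the forget gate f_t and does not involve the weight matrix W directly, so in practice the gradient propagates better than in a plain RNN:
$$\frac{\partial C_t}{\partial C_{t-1}} = \operatorname{diag}(f_t)$$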
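The difference between the two gradients can be checked numerically. Below is a rough scalar sketch, not a full RNN or LSTM: the weight `w=0.9`, the constant input, and the fixed forget-gate value `0.99` are assumptions I picked for illustration (in a real LSTM the gates depend on the data, and only the direct cell-state path is considered here).

```python
import math

def rnn_state_gradient(w, u, inputs):
    """d s_T / d s_0 for the scalar RNN s_t = tanh(w * s_{t-1} + u * x_t)."""
    s, grad = 0.0, 1.0
    for x in inputs:
        s = math.tanh(w * s + u * x)
        grad *= w * (1.0 - s * s)  # chain-rule factor: tanh'(.) * w
    return grad

def cell_state_gradient(forget_gates):
    """d C_T / d C_0 along the direct path of C_t = f_t * C_{t-1} + i_t * g_t."""
    grad = 1.0
    for f in forget_gates:
        grad *= f  # only the forget gate multiplies the gradient
    return grad

steps = 50
rnn_grad = rnn_state_gradient(w=0.9, u=1.0, inputs=[0.5] * steps)
lstm_grad = cell_state_gradient([0.99] * steps)
print(f"RNN  gradient over {steps} steps: {rnn_grad:.3e}")
print(f"LSTM gradient over {steps} steps: {lstm_grad:.3e}")
```

With these values the RNN factor is below 0.9 at every step, so the product collapses toward zero, while the cell-state gradient stays of order one as long as the forget gate stays close to 1.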
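3: Topic models
LSA: applies SVD to the tf-idf term-document matrix (https://www.cnblogs.com/pinard/p/6805861.html). Because SVD is slow, it is rarely used anymore.
NMF: a method based on non-negative matrix factorization, which is more efficient to solve (https://www.cnblogs.com/pinard/p/6812011.html).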
LDA:
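As a rough illustration of the LSA and NMF entries above (not of LDA), the sketch below builds a small tf-idf term-document matrix and factorizes it with NMF. Everything here is an assumption for illustration: the four toy documents, the `tf * (log(N/df) + 1)` weighting (one common tf-idf variant), the multiplicative update rules of Lee and Seung as the solver, and all function names. It uses plain Python lists so that it runs without extra libraries; in practice a numerical library would be used.

```python
import math
import random

docs = [
    "cat dog pet cat",
    "dog pet animal",
    "stock market trade",
    "market trade stock price",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

def tfidf_matrix(tokenized, vocab):
    """Term-document matrix (terms x docs) with tf * (log(N / df) + 1) weights."""
    n_docs = len(tokenized)
    df = {w: sum(w in doc for doc in tokenized) for w in vocab}
    return [[doc.count(w) / len(doc) * (math.log(n_docs / df[w]) + 1.0)
             for doc in tokenized] for w in vocab]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frobenius_error(v, w, h):
    wh = matmul(w, h)
    return math.sqrt(sum((v[i][j] - wh[i][j]) ** 2
                         for i in range(len(v)) for j in range(len(v[0]))))

def nmf(v, k, iterations=200, eps=1e-9, seed=0):
    """Factorize V ~= W H (all entries non-negative) with multiplicative updates."""
    rng = random.Random(seed)
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in v]      # terms x topics
    h = [[rng.random() + 0.1 for _ in v[0]] for _ in range(k)]   # topics x docs
    errors = [frobenius_error(v, w, h)]
    for _ in range(iterations):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(len(h[0]))]
             for a in range(k)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(len(w))]
        errors.append(frobenius_error(v, w, h))
    return w, h, errors

v = tfidf_matrix(tokenized, vocab)
w, h, errors = nmf(v, k=2)
print(f"reconstruction error: {errors[0]:.3f} -> {errors[-1]:.3f}")
for j, doc in enumerate(docs):
    weights = [h[a][j] for a in range(2)]
    print(doc, "-> dominant topic", weights.index(max(weights)))
```

Each column of H gives a document's topic weights and each column of W gives a topic's term weights, which is the same interpretation LSA gives to its SVD factors, except that here all entries are non-negative.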
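4: word2vec source code walkthrough
The article at https://blog.csdn.net/jeryjeryjery/article/details/80245924 is a very good walkthrough of the word2vec source code. After also reading the C source at https://github.com/tmikolov/word2vec, my understanding of word2vec became much deeper. The way hierarchical softmax is implemented with a Huffman tree is really clever, and the final gradient update turns out to be similar to logistic regression.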
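To show why the update looks like logistic regression, here is a small Python sketch of hierarchical softmax over a Huffman tree. It is my own simplified reconstruction, not a port of the C code: the word counts, vector dimension, learning rate, and all names are hypothetical, and the convention that bit 0 is the "positive" branch (label `1 - bit`) is an assumption I chose for this sketch.

```python
import heapq
import itertools
import math
import random

def build_huffman_codes(word_counts):
    """Return ({word: [(inner_node_id, bit), ...]}, number_of_inner_nodes)."""
    counter = itertools.count()  # tie-breaker so heapq never compares tree nodes
    heap = [(c, next(counter), ("leaf", w)) for w, c in word_counts.items()]
    heapq.heapify(heap)
    next_inner_id = 0
    while len(heap) > 1:
        c1, _, left = heapq.heappop(heap)
        c2, _, right = heapq.heappop(heap)
        node = ("inner", next_inner_id, left, right)
        next_inner_id += 1
        heapq.heappush(heap, (c1 + c2, next(counter), node))
    codes = {}
    def walk(node, path):
        if node[0] == "leaf":
            codes[node[1]] = path
        else:
            _, node_id, left, right = node
            walk(left, path + [(node_id, 0)])
            walk(right, path + [(node_id, 1)])
    walk(heap[0][2], [])
    return codes, next_inner_id

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def word_probability(path, hidden, node_vectors):
    """P(word | hidden): a product of binary logistic decisions along the path."""
    prob = 1.0
    for node_id, bit in path:
        p_zero = sigmoid(sum(h * v for h, v in zip(hidden, node_vectors[node_id])))
        prob *= p_zero if bit == 0 else 1.0 - p_zero
    return prob

def sgd_step(path, hidden, node_vectors, lr=0.1):
    """One update: every inner node on the path is a logistic regression with label 1 - bit."""
    hidden_grad = [0.0] * len(hidden)
    for node_id, bit in path:
        vec = node_vectors[node_id]
        f = sigmoid(sum(h * v for h, v in zip(hidden, vec)))
        g = (1 - bit - f) * lr  # (label - prediction) * learning rate
        for i in range(len(hidden)):
            hidden_grad[i] += g * vec[i]  # accumulate the error for the input side
            vec[i] += g * hidden[i]       # update the inner-node vector in place
    return hidden_grad

word_counts = {"the": 50, "cat": 10, "dog": 8, "sat": 5, "mat": 2}
codes, n_inner = build_huffman_codes(word_counts)
rng = random.Random(0)
hidden = [rng.uniform(-1, 1) for _ in range(4)]
node_vectors = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(n_inner)]

total = sum(word_probability(codes[w], hidden, node_vectors) for w in word_counts)
before = word_probability(codes["cat"], hidden, node_vectors)
sgd_step(codes["cat"], hidden, node_vectors)
after = word_probability(codes["cat"], hidden, node_vectors)
print(f"sum of probabilities over the vocabulary: {total:.6f}")
print(f"P(cat) before and after one update: {before:.4f} -> {after:.4f}")
```

Frequent words end up with short codes, so an update touches only a few inner nodes instead of the whole vocabulary, and the probabilities over all leaves still sum to 1 without an explicit normalization step.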