Natural Language Processing (NLP)
How Siri works: 1. listen 2. understand 3. think 4. compose a response 5. answer
- Speech recognition
- Natural language processing: semantic analysis
- Logical analysis: combine context from the business scenario
- Natural language processing: generate natural-language text from the analysis results
- Speech synthesis
A typical NLP processing pipeline:
First tokenize the training text (with stemming and lemmatization), then use the term frequency-inverse document frequency (TF-IDF) algorithm to measure each word's contribution to a given semantic class. Based on each word's contribution, build a supervised learning model, then feed test samples to the model to obtain their semantic category.
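A minimal end-to-end sketch of this pipeline using the scikit-learn components covered later in this section (the toy corpus and labels are invented for illustration):
import sklearn.feature_extraction.text as ft
import sklearn.naive_bayes as nb
# Invented toy corpus: two semantic classes
train_texts = ['the dog is running in the room',
               'the cat sleeps on the sofa',
               'stocks fell sharply on Monday',
               'the market rallied after the news']
train_y = ['animals', 'animals', 'finance', 'finance']
# Bag of words -> TF-IDF -> multinomial naive Bayes
cv = ft.CountVectorizer()
tt = ft.TfidfTransformer()
model = nb.MultinomialNB()
model.fit(tt.fit_transform(cv.fit_transform(train_texts)), train_y)
# Classify a new sample with the same fitted transformers
test_x = tt.transform(cv.transform(['the dog sleeps in the room']))
print(model.predict(test_x))  # expected: ['animals']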
The Natural Language Toolkit (NLTK)
Text Tokenization
Tokenization APIs:
import nltk.tokenize as tk
# Split the sample into sentences; sent_list: list of sentences
sent_list = tk.sent_tokenize(text)
# Split the sample string into words; word_list: list of words
word_list = tk.word_tokenize(text)
# WordPunctTokenizer: tokenizer object that also splits punctuation into separate tokens
tokenizer = tk.WordPunctTokenizer()
word_list = tokenizer.tokenize(text)
Example:
import nltk.tokenize as tk
doc = 'For support please see the REST framework discussion group, try the #restframework channel on irc.' \
'freenode.net, search the IRC archives, or raise a question on Stack Overflow, ' \
'making sure to include the django-rest-framework tag.' + """
Let's see how it works! We need to analyze a couple of sentences with punctuation to see it in action. Let's go.
"""
# Split into sentences
sent_list = tk.sent_tokenize(doc)
for i, sent in enumerate(sent_list):
    print("%2d" % (i + 1), sent)
print('-' * 45)
# Split into words
word_list = tk.word_tokenize(doc)
for i, word in enumerate(word_list):
    print("%2d" % (i + 1), word)
# Split into words with WordPunctTokenizer
tokenizer = tk.WordPunctTokenizer()
word_list = tokenizer.tokenize(doc)
for i, word in enumerate(word_list):
    print("%2d" % (i + 1), word)
Stemming
After tokenization, a word's inflected form and tense have little bearing on semantic analysis, so words are reduced to their stems.
Stemming APIs:
import nltk.stem.porter as pt
import nltk.stem.lancaster as lc
import nltk.stem.snowball as sb
stemmer = pt.PorterStemmer() # Porter stemmer (lenient)
stemmer = lc.LancasterStemmer() # Lancaster stemmer (strict)
stemmer = sb.SnowballStemmer('english') # Snowball stemmer (middle ground)
r = stemmer.stem('playing') # extract and return the stem
Example:
import nltk.stem.lancaster as lc
import nltk.stem.snowball as sb
import nltk.stem.porter as pt
words = ['table', 'probably', 'wolves', 'playing', 'is',
         'dog', 'the', 'beaches', 'grounded', 'dreamt',
         'envision']
pt_stemmer = pt.PorterStemmer()
lc_stemmer = lc.LancasterStemmer()
sb_stemmer = sb.SnowballStemmer(language='english')
for word in words:
    pt_stem = pt_stemmer.stem(word)
    lc_stem = lc_stemmer.stem(word)
    sb_stem = sb_stemmer.stem(word)
    print('%8s, %8s, %8s, %8s' % (word, pt_stem, lc_stem, sb_stem))
Lemmatization
Lemmatization serves a similar purpose to stemming but is friendlier for later human review, since many stems are not real words and are harder to read. Lemmatization converts plural nouns to singular form and restores irregular verb forms to the base form.
import nltk.stem as ns
# Lemmatizer
lemmatizer = ns.WordNetLemmatizer()
# lemmatize as a noun
n_lemma = lemmatizer.lemmatize(word, pos='n')
# lemmatize as a verb
v_lemma = lemmatizer.lemmatize(word, pos='v')
Example:
import nltk.stem as ns
words = ['table', 'probably', 'wolves', 'playing', 'is',
         'dog', 'the', 'beaches', 'grounded', 'dreamt',
         'envision']
lemmatizer = ns.WordNetLemmatizer()
for word in words:
    n = lemmatizer.lemmatize(word, pos='n')
    v = lemmatizer.lemmatize(word, pos='v')
    print('%8s, %8s, %8s' % (word, n, v))
Bag-of-Words Model
The meaning of a sentence depends heavily on how often certain words appear in it, so we can take every word that may appear in a sentence as a feature name, treat each sentence as one sample, and use each word's occurrence count in that sentence as the feature value. The resulting model is called the bag-of-words model.
The brown dog is running. The black dog is in the black room. Running in the room is forbidden.
- The brown dog is running.
- The black dog is in the black room.
- Running in the room is forbidden.
the | brown | dog | is | running | black | in | room | forbidden
---|---|---|---|---|---|---|---|---
1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0
2 | 0 | 1 | 1 | 0 | 2 | 1 | 1 | 0
1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1
(CountVectorizer orders feature columns alphabetically, so the actual column order produced in code will differ from this illustration.)
Bag-of-words APIs:
import sklearn.feature_extraction.text as ft
# Bag-of-words vectorizer
cv = ft.CountVectorizer()
bow = cv.fit_transform(sentences)
# Get all feature names (the table header of the bag of words: every word)
# (on scikit-learn >= 1.0, use cv.get_feature_names_out() instead)
words = cv.get_feature_names()
Example:
import nltk.tokenize as tk
import sklearn.feature_extraction.text as ft
doc = 'The brown dog is running. ' \
      'The black dog is in the black room. ' \
      'Running in the room is forbidden.'
# Split into sentences
sentences = tk.sent_tokenize(doc)
print(sentences)
cv = ft.CountVectorizer()
bow = cv.fit_transform(sentences)
print(bow.toarray())
words = cv.get_feature_names()
print(words)
Term Frequency (TF)
A word's term frequency is its number of occurrences in a sentence divided by the total number of words in that sentence, i.e. how frequently the word occurs there. Term frequency evaluates a word's contribution to a sentence more objectively than the raw count does. Normalizing each row of the bag-of-words matrix yields the term frequencies.
Example:
import sklearn.preprocessing as sp
tf = sp.normalize(bow, norm='l1')
print(tf, '\n', tf.toarray())
Document Frequency (DF)
DF = number of documents containing a given word / total number of documents
The lower a word's document frequency, the more that word contributes to the meaning.
Inverse Document Frequency (IDF)
IDF = total number of documents / number of documents containing the word (implementations such as scikit-learn log-scale and smooth this ratio)
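A small sketch that computes DF and IDF by hand for the three example sentences above, using the plain ratios from these definitions (real TF-IDF implementations log-scale and smooth IDF):
import sklearn.feature_extraction.text as ft
sentences = ['The brown dog is running.',
             'The black dog is in the black room.',
             'Running in the room is forbidden.']
cv = ft.CountVectorizer()
bow = cv.fit_transform(sentences).toarray()
# DF: fraction of documents containing each word
df = (bow > 0).sum(axis=0) / bow.shape[0]
# IDF as defined above: total documents / documents containing the word
idf = bow.shape[0] / (bow > 0).sum(axis=0)
for word, d, i in zip(cv.get_feature_names(), df, idf):
    print('%10s  DF=%.2f  IDF=%.2f' % (word, d, i))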
Term Frequency-Inverse Document Frequency (TF-IDF)
Multiply each element of the term-frequency matrix by the corresponding word's inverse document frequency. The larger the result, the more that word contributes to the sample's meaning. A learning model is then built from these per-word contributions.
Obtaining the TF-IDF matrix:
# First build the bag-of-words matrix
cv = ft.CountVectorizer()
bow = cv.fit_transform(sentences)
# TF-IDF transformer
tt = ft.TfidfTransformer()
tfidf = tt.fit_transform(bow)
Example:
tt = ft.TfidfTransformer()
tfidf = tt.fit_transform(bow)
print(tfidf.toarray())
Text Classification (Topic Identification)
import sklearn.datasets as sd
import sklearn.feature_extraction.text as ft
import sklearn.naive_bayes as nb
train = sd.load_files('20news', encoding='latin1', shuffle=True, random_state=7)
# load_files reads every file under each subdirectory; the subdirectory names become the category labels
train_data = train.data
train_y = train.target
print(len(train_data))
categories = train.target_names
print(categories)
# Build the bag-of-words matrix
cv = ft.CountVectorizer()
train_bow = cv.fit_transform(train_data)
# Build the TF-IDF matrix
tt = ft.TfidfTransformer()
train_x = tt.fit_transform(train_bow)
# Train a naive Bayes model based on the multinomial distribution
model = nb.MultinomialNB()
model.fit(train_x, train_y)
# Test the model:
test_data = [
    'The curveballs of right handed pitchers tend to curve to the left.',
    'Caesar cipher is an ancient form of encryption.',
    'This two-wheeler is really good on slippery roads',
]
test_bow = cv.transform(test_data)
test_x = tt.transform(test_bow)
pre_test_y = model.predict(test_x)
for sent, i in zip(test_data, pre_test_y):
    print(sent, '->', categories[i])
NLTK Classifiers
NLTK provides a naive Bayes classifier that is convenient for natural-language classification problems. It works directly on feature dictionaries, so there is no need to assemble bag-of-words or TF-IDF matrices by hand: train the model, then predict the category.
相关API如下:
import nltk.classify as cf
import nltk.classify.util as cu
"""train_data的数据格式:
[({'age': 1, 'score': 2, 'student': 1}, 'good'),
({'age': 1, 'score': 2, 'student': 1}, 'bad')]
"""
# 使用nltk得到的朴素贝叶斯分类器训练模型
model = cf.NaiveBayesClassifier.train(train_data)
# 对测试数据进行预测: test_data: {'age': 1, 'score': 2, 'student': 1},
model.classify(test_data)
# 评估分类器, 返回分类器的得分
ac = cu.accuracy(model, test_data)
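A minimal runnable sketch using toy feature dicts shaped like the format above (the features and labels are invented for illustration):
import nltk.classify as cf
import nltk.classify.util as cu
# Invented toy data: each sample is a (feature dict, label) tuple
train_data = [({'fast': True, 'fun': True}, 'good'),
              ({'slow': True, 'dull': True}, 'bad'),
              ({'fun': True, 'great': True}, 'good'),
              ({'boring': True, 'dull': True}, 'bad')]
test_data = [({'fun': True, 'great': True}, 'good'),
             ({'slow': True, 'boring': True}, 'bad')]
model = cf.NaiveBayesClassifier.train(train_data)
print(model.classify({'fun': True, 'fast': True}))  # expected: good
print(cu.accuracy(model, test_data))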
Sentiment Analysis
Analyze the movie_reviews documents in the NLTK corpus: train on the positive and negative reviews to perform sentiment analysis.
Example:
import nltk.corpus as nc
import nltk.classify as cf
import nltk.classify.util as cu
pdata = []
fileids = nc.movie_reviews.fileids('pos')
# Store the words of every positive review in the pdata list
for fileid in fileids:
    sample = {}
    words = nc.movie_reviews.words(fileid)
    for word in words:
        sample[word] = True
    pdata.append((sample, 'POSITIVE'))
ndata = []
fileids = nc.movie_reviews.fileids('neg')
# Store the words of every negative review in the ndata list
for fileid in fileids:
    sample = {}
    words = nc.movie_reviews.words(fileid)
    for word in words:
        sample[word] = True
    ndata.append((sample, 'NEGATIVE'))
print(len(pdata), len(ndata))
# Split into training and test sets (80% for training)
pnumb = int(0.8 * len(pdata))
nnumb = int(0.8 * len(ndata))
train_data = pdata[:pnumb] + ndata[:nnumb]
test_data = pdata[pnumb:] + ndata[nnumb:]
# Train with the naive Bayes classifier
model = cf.NaiveBayesClassifier.train(train_data)
ac = cu.accuracy(model, test_data)
print(ac)
# Simulate a business scenario
reviews = [
    'It is an amazing movie',
    'This is a dull movie, I would never recommend it to anyone.',
    'The cinematography is pretty great in this movie.',
    'The direction was terrible and the story was all over the place.'
]
for review in reviews:
    sample = {}
    words = review.split()
    for word in words:
        sample[word] = True
    pcls = model.classify(sample)
    print(review, '->', pcls)
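To also get a confidence score for each prediction, the classifier's prob_classify method returns a probability distribution over the labels:
for review in reviews:
    sample = {word: True for word in review.split()}
    pd = model.prob_classify(sample)
    # pd.max() is the most likely label; pd.prob() gives its probability
    print(review, '->', pd.max(), '(p=%.2f)' % pd.prob(pd.max()))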
Speech Recognition
Using the Fourier transform, the time-domain audio signal is decomposed into a superposition of sine waves of different frequencies. The characteristic distribution of the spectral lines is used to build a correspondence between audio content and text, which serves as the basis for model training.
Example: freq.wav
import numpy as np
import numpy.fft as nf
import scipy.io.wavfile as wf
import matplotlib.pyplot as mp
sample_rate, sigs = wf.read('freq.wav')
print(sample_rate)
print(sigs.shape, sigs.dtype)
# x axis: the time of each sample
times = np.arange(len(sigs)) / sample_rate
# Fourier transform: get the frequency and energy of each component sine wave
freqs = nf.fftfreq(sigs.size, 1 / sample_rate)
ffts = nf.fft(sigs)
pows = np.abs(ffts)
# Draw the two plots
mp.figure('Audio', facecolor='lightgray')
mp.subplot(121)
mp.title('Time Domain')
mp.xlabel('Time', fontsize=12)
mp.ylabel('Signal', fontsize=12)
mp.grid(linestyle=":")
mp.plot(times, sigs, c='dodgerblue')
# Frequency-domain plot
mp.subplot(122)
mp.title('Frequency Domain')
mp.xlabel('Frequency', fontsize=12)
mp.ylabel('Pow', fontsize=12)
mp.grid(linestyle=":")
mp.plot(freqs[freqs>0], pows[freqs>0], c='orangered')
mp.tight_layout()
mp.show()
### The Speech Recognition Process
Mel-frequency cepstral coefficients (MFCC): after a Fourier transform of the audio, the energy distribution over 13 special frequencies closely tied to the speech content is extracted, so the MFCC matrix can serve as the feature for speech recognition. Pattern matching based on hidden Markov models then finds the acoustic model that best matches a test sample, thereby recognizing the spoken content.
MFCC APIs:
import scipy.io.wavfile as wf
import python_speech_features as sf
# Read the audio file: get the sample rate and the value at each sample point
sample_rate, sigs = wf.read('freq.wav')
# Feed the signal to the feature extractor to get the audio's mel-frequency cepstral matrix
mfcc = sf.mfcc(sigs, sample_rate)
Example: compare the MFCC matrices of different audio files
import scipy.io.wavfile as wf
import python_speech_features as sf
import matplotlib.pyplot as mp
sample_rate, sigs = wf.read('apple01.wav')
mfcc = sf.mfcc(sigs, sample_rate)
print(mfcc.shape)
mp.matshow(mfcc.T, cmap='gist_rainbow')
mp.title('MFCC', fontsize=16)
mp.ylabel('Feature', fontsize=12)
mp.xlabel('Sample', fontsize=12)
mp.tick_params(labelsize=10)
mp.show()
sample_rate, sigs = wf.read('freq.wav')
mfcc = sf.mfcc(sigs, sample_rate)
print(mfcc.shape)
mp.matshow(mfcc.T, cmap='gist_rainbow')
mp.title('MFCC', fontsize=16)
mp.ylabel('Feature', fontsize=12)
mp.xlabel('Sample', fontsize=12)
mp.tick_params(labelsize=10)
mp.show()
Hidden Markov model APIs:
import hmmlearn.hmm as hl
# Build the hidden Markov model
model = hl.GaussianHMM(
    n_components=4,          # number of Gaussian components used to fit the samples
    covariance_type='diag',  # each state uses a diagonal covariance matrix
    n_iter=1000              # maximum number of training iterations
)
# Train on the MFCC matrices of the training samples
model.fit(train_mfccs)
# Score how well a test sample's MFCC matrix matches this model (higher is better)
score = model.score(test_mfccs)
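Putting it together: a sketch of simple word recognition that trains one HMM per word on its MFCC features and picks the best-scoring model. The file layout (one folder per word, e.g. 'speeches/training/apple/*.wav') is a hypothetical setup; for brevity the per-file MFCC matrices are simply concatenated for training.
import os
import glob
import numpy as np
import scipy.io.wavfile as wf
import python_speech_features as sf
import hmmlearn.hmm as hl

def wav_mfcc(path):
    # Read a wav file and return its MFCC matrix (frames x 13)
    sample_rate, sigs = wf.read(path)
    return sf.mfcc(sigs, sample_rate)

# Train one HMM per word folder (hypothetical layout: speeches/training/<word>/*.wav)
models = {}
for folder in glob.glob('speeches/training/*'):
    label = os.path.basename(folder)
    mfccs = np.vstack([wav_mfcc(f) for f in glob.glob(folder + '/*.wav')])
    model = hl.GaussianHMM(n_components=4, covariance_type='diag', n_iter=1000)
    models[label] = model.fit(mfccs)

# Recognize a test file: the best-scoring model gives the predicted word
test_mfcc = wav_mfcc('speeches/testing/apple/apple.wav')
best = max(models, key=lambda label: models[label].score(test_mfcc))
print('predicted word:', best)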