Skr-Eric's Machine Learning Classroom (8) -- Text Recognition

Text Recognition (Natural Language Processing, NLP)

Human-machine interaction:

               speech recognition         text recognition
      Speech --------------------> Text --------------------> Semantics
        ^                                                         |
        |                                                         v
      Human                                        "machine girlfriend" (the application)
        ^                                                         |
        |                                                         |
      Speech <-------------------- Text <-------------------- Data
               speech synthesis           business logic

1. Tokenization: according to grammatical rules, split an entire document (a body of text) into smaller linguistic units such as sentences or words.

 

2. NLTK - the Natural Language Toolkit

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import nltk.tokenize as tk
doc = "Are you curious about tokenization? " \
      "Let's see how it works! " \
      "We need to analyze a couple of sentences " \
      "with punctuations to see it in action."
print(doc)
# Split into sentences
tokens = tk.sent_tokenize(doc)
for i, token in enumerate(tokens):
    print('%2d' % (i + 1), token)
print('-' * 15)
# Split into words
tokens = tk.word_tokenize(doc)
for i, token in enumerate(tokens):
    print('%2d' % (i + 1), token)
print('-' * 15)
# Split into words and punctuation marks
tokenizer = tk.WordPunctTokenizer()
tokens = tokenizer.tokenize(doc)
for i, token in enumerate(tokens):
    print('%2d' % (i + 1), token)

 

3. Stemming: for nouns and verbs, extract the core component, stripping away the parts that only express number or tense.

A stem is not necessarily a valid word.

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import nltk.stem.porter as pt
import nltk.stem.lancaster as lc
import nltk.stem.snowball as sb
words = ['table', 'probably', 'wolves', 'playing',
         'is', 'dog', 'the', 'beaches', 'grounded',
         'dreamt', 'envision']
# Porter stemmer
pt_stemmer = pt.PorterStemmer()
# Lancaster stemmer
lc_stemmer = lc.LancasterStemmer()
# Snowball stemmer
sb_stemmer = sb.SnowballStemmer('english')
for word in words:
    pt_stem = pt_stemmer.stem(word)  # Porter stem
    lc_stem = lc_stemmer.stem(word)  # Lancaster stem
    sb_stem = sb_stemmer.stem(word)  # Snowball stem
    print('%8s %8s %8s %8s' % (
        word, pt_stem, lc_stem, sb_stem))

 

4. Lemmatization: plural noun -> singular noun, participle -> base verb form.

A lemma is always a valid word.

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import nltk.stem as ns
words = ['table', 'probably', 'wolves', 'playing',
         'is', 'dog', 'the', 'beaches', 'grounded',
         'dreamt', 'envision']
# Lemmatizer
lemmatizer = ns.WordNetLemmatizer()
for word in words:
    # Reduce a noun to its singular form
    n_lemma = lemmatizer.lemmatize(word, pos='n')
    # Reduce a verb to its base form
    v_lemma = lemmatizer.lemmatize(word, pos='v')
    print('%8s %8s %8s' % (word, n_lemma, v_lemma))

 

5. Mathematical model

1) Dictionary: the ordered sequence of distinct words contained in a document.

2) Sample: the document is tokenized into sentences; each sentence is one sample.

3) Bag of words (BOW): how many times each word in the dictionary appears in each sample.

4) Term frequency (TF): the bag of words normalized per sample; it expresses how much each dictionary word contributes to a sample's semantic expressiveness.

5) Document frequency (DF): for each dictionary word, the number of samples containing the word divided by the total number of samples. The rarer the word, the smaller its document frequency; a word's rarity reflects how much it contributes to making samples distinguishable.

6) Inverse document frequency (IDF): 1 / document frequency.

Higher IDF -> lower DF -> rarer word -> larger contribution to distinguishability
Higher TF -------------------------------> larger contribution to semantic expressiveness

7) Term frequency-inverse document frequency (TF-IDF): TF x log(IDF). It combines a word's contribution to both semantic expressiveness and distinguishability, i.e. how important the word is to each sample.

-----------------------------------------------------------------

The brown dog is running. The black dog is in the black room. Running in the room is forbidden.

-----------------------------------------------------------------

1 The brown dog is running

2 The black dog is in the black room

3 Running in the room is forbidden

-----------------------------------------------------------------

      black brown dog forbidden in is room running the
1       0     1    1      0      0  1    0     1     1
2       2     0    1      0      1  1    1     0     2
3       0     0    0      1      1  1    1     1     1

Bag-of-words matrix

      | normalize each row (L1)
      v

Term frequency (TF) matrix

0.    0.2   0.2    0.    0.     0.2    0.     0.2   0.2
0.25  0.    0.125  0.    0.125  0.125  0.125  0.    0.25
0.    0.    0.     0.17  0.17   0.17   0.17   0.17  0.17

      x log(IDF), where each IDF value = total samples / document frequency:

3/1   3/1   3/2    3/1   3/2    3/3    3/2    3/2   3/3

      |
      v

TF-IDF matrix
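As a quick check of the formulas above, here is a minimal NumPy sketch that recomputes the TF and TF-IDF matrices by hand from the bag-of-words matrix in the worked example. Note that sklearn's TfidfTransformer (used in the code below) applies a smoothed IDF and L2 normalization by default, so its values will differ slightly from this plain TF x log(IDF) computation.

# -*- coding: utf-8 -*-
import numpy as np
# Bag-of-words matrix from the worked example above
bow = np.array([
    [0, 1, 1, 0, 0, 1, 0, 1, 1],   # The brown dog is running
    [2, 0, 1, 0, 1, 1, 1, 0, 2],   # The black dog is in the black room
    [0, 0, 0, 1, 1, 1, 1, 1, 1]],  # Running in the room is forbidden
    dtype=float)
# Term frequency: normalize each row so it sums to 1
tf = bow / bow.sum(axis=1, keepdims=True)
# Document frequency: how many samples contain each word
df = (bow > 0).sum(axis=0)
# Inverse document frequency and TF-IDF as defined above
idf = float(bow.shape[0]) / df
tfidf = tf * np.log(idf)
print(tfidf)

The sklearn code below builds the same matrices directly from the raw sentences.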

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import nltk.tokenize as tk
import sklearn.feature_extraction.text as ft
import sklearn.preprocessing as sp
doc = 'The brown dog is running. ' \
      'The black dog is in the black room. ' \
      'Running in the room is forbidden.'
print(doc)
sentences = tk.sent_tokenize(doc)
print(sentences)
# Count vectorizer
cv = ft.CountVectorizer()
# Bag-of-words matrix
bow = cv.fit_transform(sentences).toarray()
print(bow)
# Dictionary (feature names)
words = cv.get_feature_names()
print(words)
# Term frequency (L1-normalized bag of words)
tf = sp.normalize(bow, norm='l1')
print(tf)
# TF-IDF transformer
tt = ft.TfidfTransformer()
# TF-IDF matrix (smoothed IDF, L2-normalized by default)
tfidf = tt.fit_transform(bow).toarray()
print(tfidf)

 

6. Topic identification

Text corpora on different topics:

TF-IDF          ->   topic label
------------------------------------
xxxxxxxxx       ->   second-hand market   0
xxxxxxxxx       ->   encryption           1
xxxxxxxxx       ->   motorcycles          2
xxxxxxxxx       ->   space                3
xxxxxxxxx       ->   baseball             4
------------------------------------
xxxxxxxxx       ->   ?

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import sklearn.datasets as sd
import sklearn.feature_extraction.text as ft
import sklearn.naive_bayes as nb
train = sd.load_files(
    '../../data/20news', encoding='latin1',
    shuffle=True, random_state=7)
# Training-set texts
train_data = train.data
# Training-set labels
train_y = train.target
# Topic category names
categories = train.target_names
# Count vectorizer
cv = ft.CountVectorizer()
# Bag-of-words matrix for the training set
train_bow = cv.fit_transform(train_data)
# TF-IDF transformer
tt = ft.TfidfTransformer()
# TF-IDF matrix for the training set
train_x = tt.fit_transform(train_bow)
# Multinomial naive Bayes classifier
model = nb.MultinomialNB()
# Train it on the training-set TF-IDF matrix
model.fit(train_x, train_y)
# Test texts
test_data = [
    'The curveballs of right handed pitchers tend to curve to the left',
    'Caesar cipher is an ancient form of encryption',
    'This two-wheeler is really good on slippery roads']
# Bag-of-words matrix for the test texts
test_bow = cv.transform(test_data)
# TF-IDF matrix for the test texts
test_x = tt.transform(test_bow)
# Predict test categories
pred_test_y = model.predict(test_x)
# Print each test sentence with its predicted topic
for sentence, index in zip(
        test_data, pred_test_y):
    print(sentence, '->', categories[index])

Each sample reaches the classifier as a row of word counts (or TF-IDF values) over the dictionary, e.g.:

word      1  2  3  4  5  6
sample A  2  3  0  0  1  4
sample B  0  5  0  1  1  3
...

Multinomial naive Bayes classifier: it estimates, for every topic, how likely such a count vector is, and picks the most likely topic.
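To make the decision rule concrete, here is a minimal sketch of multinomial naive Bayes on hypothetical count rows like the ones above; the data, labels, and query vector are made up for illustration, and sklearn's MultinomialNB applies the same rule with add-one smoothing by default.

# -*- coding: utf-8 -*-
import numpy as np
# Hypothetical word-count rows (one per sample) and their topic labels
X = np.array([[2, 3, 0, 0, 1, 4],
              [0, 5, 0, 1, 1, 3]], dtype=float)
y = np.array([0, 1])
classes = np.unique(y)
# Class priors P(c) and smoothed per-class word probabilities P(word | c)
log_prior = np.log([np.mean(y == c) for c in classes])
log_prob = []
for c in classes:
    counts = X[y == c].sum(axis=0)
    log_prob.append(np.log((counts + 1) / (counts.sum() + X.shape[1])))
log_prob = np.array(log_prob)
# Decision rule for a new count vector x:
#   argmax over c of  log P(c) + sum over words of x_w * log P(w | c)
x = np.array([1, 2, 0, 0, 1, 2], dtype=float)
scores = log_prior + np.dot(log_prob, x)
print(classes[np.argmax(scores)])

With the 20 newsgroups data above, the count rows come from CountVectorizer (or the TF-IDF rows from TfidfTransformer), and MultinomialNB learns these per-topic word probabilities from the training set.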

 

7. Sentiment analysis

NLTK's classifiers take each sample as a feature dict rather than a numeric row, e.g.:

A   B   C

1   2   3 -> {'A': 1, 'C': 3, 'B': 2}

4   5   6 -> {'C': 6, 'A': 4, 'B': 5}

7   8   9 ...

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import nltk.corpus as nc
import nltk.classify as cf
import nltk.classify.util as cu
# Build positive samples: each movie review becomes a
# {word: True} feature dict labeled 'POSITIVE'
pdata = []
fileids = nc.movie_reviews.fileids('pos')
for fileid in fileids:
    words = nc.movie_reviews.words(fileid)
    sample = {}
    for word in words:
        sample[word] = True
    pdata.append((sample, 'POSITIVE'))
# Build negative samples labeled 'NEGATIVE' in the same way
ndata = []
fileids = nc.movie_reviews.fileids('neg')
for fileid in fileids:
    words = nc.movie_reviews.words(fileid)
    sample = {}
    for word in words:
        sample[word] = True
    ndata.append((sample, 'NEGATIVE'))
# Use the first 80% of each class for training, the rest for testing
pnumb, nnumb = \
    int(0.8 * len(pdata)), int(0.8 * len(ndata))
train_data = pdata[:pnumb] + ndata[:nnumb]
test_data = pdata[pnumb:] + ndata[nnumb:]
# Train an NLTK naive Bayes classifier and report its accuracy
model = cf.NaiveBayesClassifier.train(train_data)
ac = cu.accuracy(model, test_data)
print('%.2f%%' % round(ac * 100, 2))
# Show the most informative feature words
tops = model.most_informative_features()
for top in tops[:5]:
    print(top[0])
reviews = [
    'It is an amazing movie.',
    'This is a dull movie. I would never recommend it to anyone.',
    'The cinematography is pretty great in this movie.',
    'The direction was terrible and the story was all over the place.']
sents, probs = [], []
for review in reviews:
    # Convert each review into the same {word: True} feature dict
    words = review.split()
    sample = {}
    for word in words:
        sample[word] = True
    # Predict the sentiment label and its probability
    pcls = model.prob_classify(sample)
    sent = pcls.max()
    prob = pcls.prob(sent)
    sents.append(sent)
    probs.append(prob)
for review, sent, prob in zip(
        reviews, sents, probs):
    print(review, '->', sent, '%.2f%%' % round(
        prob * 100, 2))

 

8. Topic extraction

Treat each sentence of the document as an independent sample and convert it into a numerical vector via TF-IDF; this vector expresses how the value of the different words is distributed within the sample. Semantically similar samples, that is, sentences about the same topic, necessarily have similar word-value distributions, so a clustering-style method (latent Dirichlet allocation) can divide the samples into topic clusters. From the high-density samples near each cluster center, the words with the highest value are extracted and used as the topic's label.
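A minimal sketch of this idea, assuming sklearn's LatentDirichletAllocation and a handful of made-up sentences (the sentence list, the number of topics, and the stop-word setting are illustrative assumptions):

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import sklearn.feature_extraction.text as ft
import sklearn.decomposition as dc
# A few made-up sentences covering two rough topics
sentences = [
    'The brown dog is running in the park.',
    'Running in the room is forbidden.',
    'The spacecraft reached orbit around the moon.',
    'The rocket launch was delayed by bad weather.']
# Bag-of-words matrix with English stop words removed
cv = ft.CountVectorizer(stop_words='english')
bow = cv.fit_transform(sentences)
words = cv.get_feature_names()  # get_feature_names_out() in newer sklearn
# Latent Dirichlet allocation with 2 latent topics
lda = dc.LatentDirichletAllocation(n_components=2, random_state=7)
doc_topics = lda.fit_transform(bow)
# Label each topic with its highest-weighted words
for i, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:3]
    print('Topic %d:' % (i + 1), [words[j] for j in top])
# Which topic each sentence leans toward
print(doc_topics.argmax(axis=1))

Note that sklearn's LDA is normally fit on raw word counts rather than TF-IDF values; with a real corpus the pipeline is the same, just with far more documents and more topics.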

 

 

 

 

 

For more courses, follow SkrEric's programming classroom (SkrEric的编程课堂) on WeChat.
