Python Text Analysis Tutorial — Introduction to Machine Learning with Python (《Python机器学习基础教程》), Chapter 7: Working with Text Data

This chapter covers the basics of processing text, also known as natural language processing (NLP), along with an example application that classifies movie reviews. If you want to try your hand at text data, the tools discussed here are a good starting point. In particular, for text classification tasks such as spam and fraud detection or sentiment analysis, the bag-of-words model provides a simple yet powerful solution. As is often the case in machine learning, the data representation is the key to NLP applications, and inspecting the extracted tokens and n-grams gives valuable insight into the modeling process. In text processing applications, both supervised and unsupervised models can usually be introspected in meaningful ways, as we see in this chapter. You should take full advantage of this ability when using NLP-based methods in practice.

Natural language and text processing is a large field of research, and the CountVectorizer and TfidfVectorizer classes covered in this chapter implement only relatively simple text processing methods. For more advanced approaches, we recommend the Python packages spacy (a relatively new but very efficient and well-designed package), nltk (a very well-established and complete, but somewhat dated library), and gensim (an NLP package with an emphasis on topic modeling).

In recent years, there have been several very exciting new developments in text processing, all of which are beyond the scope of this book and all of which relate to neural networks. The first is the use of continuous vector representations, also known as word vectors or distributed word representations, as implemented in the word2vec library.
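
As a brief, hedged illustration (not part of the book's code), the sketch below trains a tiny word2vec model with gensim on a made-up toy corpus; all parameter values are arbitrary, and in gensim versions before 4.0 the vector_size argument was called size.

from gensim.models import Word2Vec

# Toy corpus of pre-tokenized sentences (purely illustrative)
sentences = [["the", "movie", "was", "great"],
             ["the", "film", "was", "terrible"],
             ["what", "a", "great", "film"]]

# Learn 20-dimensional word vectors (vector_size was called `size` before gensim 4.0)
model = Word2Vec(sentences, vector_size=20, window=2, min_count=1, seed=0)

print(model.wv["movie"])               # the learned vector for "movie"
print(model.wv.most_similar("movie"))  # nearest neighbors in the embedding space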

Another direction in NLP that has picked up momentum in recent years is the use of recurrent neural networks (RNNs) for text processing. In contrast to classification models, which can only assign class labels, RNNs are a particularly powerful type of neural network that can produce output that is again text. The ability to produce text as output makes RNNs well suited for automatic translation and summarization.

I. Types of Data Represented as Strings

In the context of text analysis, the dataset is often called the corpus, and each data point, represented as a single text, is called a document. These terms come from the information retrieval (IR) and natural language processing (NLP) communities, which both deal mostly with text data.

II. Example Application: Sentiment Analysis of Movie Reviews

This chapter's example:

import numpy as np
from sklearn.datasets import load_files

reviews_train = load_files("data/aclImdb/train/")
# load_files returns a bunch, containing training texts and training labels
text_train, y_train = reviews_train.data, reviews_train.target
print("type of text_train: {}".format(type(text_train)))
print("length of text_train: {}".format(len(text_train)))
print("text_train[6]:\n{}".format(text_train[6]))

type of text_train: <class 'list'>
length of text_train: 25000
text_train[6]:
b"This movie has a special way of telling the story, at first i found it rather odd as it jumped through time and I had no idea whats happening.
Anyway the story line was although simple, but still very real and touching. You met someone the first time, you fell in love completely, but broke up at last and promoted a deadly agony. Who hasn't go through this? but we will never forget this kind of pain in our life.
I would say i am rather touched as two actor has shown great performance in showing the love between the characters. I just wish that the story could be a happy ending."

Clean the data and remove the formatting that would interfere with the analysis (the reviews contain HTML line breaks, <br />, which we replace with spaces):

text_train = [doc.replace(b"
", b" ") for doc in text_train]

np.unique(y_train)

array([0, 1])

print("Samples per class (training):{}".format(np.bincount(y_train)))

Samples per class (training): [12500 12500]

The test dataset is loaded in the same way:

reviews_test = load_files("data/aclImdb/test/")
text_test, y_test = reviews_test.data, reviews_test.target
print("Number of documents in test data: {}".format(len(text_test)))
print("Samples per class (test): {}".format(np.bincount(y_test)))
text_test = [doc.replace(b"<br />", b" ") for doc in text_test]

Number of documents in test data: 25000
Samples per class (test): [12500 12500]

The task we want to solve is the following: given a review, we want to assign a "positive" or "negative" label to it based on the text content of the review. This is a standard binary classification task. However, the text data is not in a format that a machine learning model can handle. We need to convert the string representation of the text into a numeric representation that we can apply our machine learning algorithms to.

III. Representing Text Data as a Bag of Words

One of the simplest, but also most effective and commonly used, ways to represent text for machine learning is the bag-of-words representation. With this representation, we discard most of the structure of the input text, such as chapters, paragraphs, sentences, and formatting, and only count how often each word in the corpus appears in each text. Discarding the structure and counting only word occurrences leads to the mental image of representing text as a "bag".

Computing the bag-of-words representation for a corpus of documents consists of the following three steps:

1. Tokenization. Split each document into the words that appear in it (called tokens), for example by splitting on whitespace and punctuation.

2. Vocabulary building. Collect a vocabulary of all the words that appear in any of the documents, and number them (say, in alphabetical order).

3. Encoding. For each document, count how often each of the words in the vocabulary appears in that document.

Consider, for example, the processing of the string "This is how you get ants.". The output is one vector of word counts per document: for each word in the vocabulary, we record how often it appears in each document. That means each unique word in the whole dataset corresponds to one feature of this numeric representation. Note that the order of the words in the original string is completely irrelevant to the bag-of-words feature representation.
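
To make these three steps concrete, here is a minimal hand-rolled sketch (not from the book; the second document and the regular expression are illustrative stand-ins for what CountVectorizer does internally):

import re

docs = ["This is how you get ants.", "Ants, ants everywhere!"]  # second document is made up

# 1. Tokenization: split each document into lowercase word tokens (two or more characters)
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]

# 2. Vocabulary building: number all unique tokens, here in alphabetical order
vocabulary = {word: i for i, word in enumerate(sorted({w for doc in tokenized for w in doc}))}

# 3. Encoding: count how often each vocabulary word appears in each document
counts = [[doc.count(word) for word in vocabulary] for doc in tokenized]
print(vocabulary)
print(counts)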

1. Applying the Bag of Words to a Toy Dataset

bards_words = ["The fool doth think he is wise,",
               "but the wise man knows himself to be a fool"]

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
vect.fit(bards_words)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

print("Vocabulary size:{}".format(len(vect.vocabulary_)))

print("Vocabulary content:\n{}".format(vect.vocabulary_))

Vocabulary size: 13

Vocabulary content:

{'the': 9, 'fool': 3, 'doth': 2, 'think': 10, 'he': 4, 'is': 6, 'wise': 12, 'but': 1, 'man': 8, 'knows': 7, 'himself': 5, 'to': 11, 'be': 0}

bag_of_words = vect.transform(bards_words)
print("bag_of_words: {}".format(repr(bag_of_words)))

bag_of_words: <2x13 sparse matrix of type '<class 'numpy.int64'>'
	with 16 stored elements in Compressed Sparse Row format>

print("Dense representation of bag_of_words:\n{}".format(

bag_of_words.toarray()))

Dense representation of bag_of_words:

[[0 0 1 1 1 0 1 0 0 1 1 0 1]

[1 1 0 1 0 1 0 1 1 1 0 1 1]]

2. Applying the Bag of Words to Movie Reviews

Now let's apply the bag of words to the movie reviews, and then improve the feature extraction.

vect = CountVectorizer().fit(text_train)
X_train = vect.transform(text_train)
print("X_train:\n{}".format(repr(X_train)))

X_train:
<25000x74849 sparse matrix of type '<class 'numpy.int64'>'
	with 3431196 stored elements in Compressed Sparse Row format>

feature_names = vect.get_feature_names()
print("Number of features: {}".format(len(feature_names)))
print("First 20 features:\n{}".format(feature_names[:20]))
print("Features 20010 to 20030:\n{}".format(feature_names[20010:20030]))
print("Every 2000th feature:\n{}".format(feature_names[::2000]))

Number of features: 74849
First 20 features:
['00', '000', '0000000000001', '00001', '00015', '000s', '001', '003830', '006', '007', '0079', '0080', '0083', '0093638', '00am', '00pm', '00s', '01', '01pm', '02']
Features 20010 to 20030:
['dratted', 'draub', 'draught', 'draughts', 'draughtswoman', 'draw', 'drawback', 'drawbacks', 'drawer', 'drawers', 'drawing', 'drawings', 'drawl', 'drawled', 'drawling', 'drawn', 'draws', 'draza', 'dre', 'drea']
Every 2000th feature:
['00', 'aesir', 'aquarian', 'barking', 'blustering', 'bête', 'chicanery', 'condensing', 'cunning', 'detox', 'draper', 'enshrined', 'favorit', 'freezer', 'goldman', 'hasan', 'huitieme', 'intelligible', 'kantrowitz', 'lawful', 'maars', 'megalunged', 'mostey', 'norrland', 'padilla', 'pincher', 'promisingly', 'receptionist', 'rivals', 'schnaas', 'shunning', 'sparse', 'subset', 'temptations', 'treatises', 'unproven', 'walkman', 'xylophonist']

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

scores = cross_val_score(LogisticRegression(), X_train, y_train, cv=5)
print("Mean cross-validation accuracy: {:.2f}".format(np.mean(scores)))

Mean cross-validation accuracy: 0.88

from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10]}
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Best parameters: ", grid.best_params_)

Best cross-validation score: 0.89
Best parameters: {'C': 0.1}

X_test = vect.transform(text_test)
print("Test score: {:.2f}".format(grid.score(X_test, y_test)))

Test score: 0.88

vect = CountVectorizer(min_df=5).fit(text_train)
X_train = vect.transform(text_train)
print("X_train with min_df: {}".format(repr(X_train)))

X_train with min_df: <25000x27271 sparse matrix of type '<class 'numpy.int64'>'
	with 3354014 stored elements in Compressed Sparse Row format>

feature_names = vect.get_feature_names()
print("First 50 features:\n{}".format(feature_names[:50]))
print("Features 20010 to 20030:\n{}".format(feature_names[20010:20030]))
print("Every 700th feature:\n{}".format(feature_names[::700]))

First 50 features:
['00', '000', '007', '00s', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '100', '1000', '100th', '101', '102', '103', '104', '105', '107', '108', '10s', '10th', '11', '110', '112', '116', '117', '11th', '12', '120', '12th', '13', '135', '13th', '14', '140', '14th', '15', '150', '15th', '16', '160', '1600', '16mm', '16s', '16th']
Features 20010 to 20030:
['repentance', 'repercussions', 'repertoire', 'repetition', 'repetitions', 'repetitious', 'repetitive', 'rephrase', 'replace', 'replaced', 'replacement', 'replaces', 'replacing', 'replay', 'replayable', 'replayed', 'replaying', 'replays', 'replete', 'replica']
Every 700th feature:
['00', 'affections', 'appropriately', 'barbra', 'blurbs', 'butchered', 'cheese', 'commitment', 'courts', 'deconstructed', 'disgraceful', 'dvds', 'eschews', 'fell', 'freezer', 'goriest', 'hauser', 'hungary', 'insinuate', 'juggle', 'leering', 'maelstrom', 'messiah', 'music', 'occasional', 'parking', 'pleasantville', 'pronunciation', 'recipient', 'reviews', 'sas', 'shea', 'sneers', 'steiger', 'swastika', 'thrusting', 'tvs', 'vampyre', 'westerns']

grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))

Best cross-validation score: 0.89

IV. Stopwords

Another way to get rid of uninformative words is to discard words that appear so frequently that they carry no information. There are two main approaches: using a language-specific list of stopwords, or discarding words that appear too frequently.

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
print("Number of stop words: {}".format(len(ENGLISH_STOP_WORDS)))
print("Every 10th stopword:\n{}".format(list(ENGLISH_STOP_WORDS)[::10]))

Number of stop words: 318
Every 10th stopword:
['they', 'of', 'who', 'found', 'none', 'co', 'full', 'otherwise', 'never', 'have', 'she', 'neither', 'whereby', 'one', 'any', 'de', 'hence', 'wherever', 'whose', 'him', 'which', 'nine', 'still', 'from', 'here', 'what', 'everything', 'us', 'etc', 'mine', 'find', 'most']

# Specifying stop_words="english" uses the built-in list.
# We could also augment it and pass our own.
vect = CountVectorizer(min_df=5, stop_words="english").fit(text_train)
X_train = vect.transform(text_train)
print("X_train with stop words:\n{}".format(repr(X_train)))

X_train with stop words:
<25000x26966 sparse matrix of type '<class 'numpy.int64'>'
	with 2149958 stored elements in Compressed Sparse Row format>

grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))

Best cross-validation score: 0.88

You can try discarding the most frequently appearing words by setting the max_df option of CountVectorizer, and see how it affects the number of features and the performance.
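
For example, the following sketch (not from the book; the max_df value of 0.2 is an arbitrary choice) discards every word that appears in more than 20% of the training documents and re-runs the grid search:

# Discard words that occur in more than 20% of the documents (an arbitrary threshold)
vect = CountVectorizer(min_df=5, max_df=0.2).fit(text_train)
X_train_maxdf = vect.transform(text_train)
print("X_train with max_df: {}".format(repr(X_train_maxdf)))

grid = GridSearchCV(LogisticRegression(), {'C': [0.001, 0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X_train_maxdf, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))

Comparing the number of stored features and the cross-validation score against the min_df-only results above shows whether the very frequent words carried useful signal.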

V. Rescaling the Data with tf-idf

Instead of dropping features that are deemed unimportant, another approach is to rescale features by how informative we expect them to be. One of the most common ways to do this is the term frequency–inverse document frequency (tf-idf) method. The intuition behind this method is to give high weight to any term that appears often in a particular document, but not in many documents in the corpus. If a word appears often in a particular document, but not in many documents, it is likely to be very descriptive of the content of that document.

scikit-learn implements the tf-idf method in two classes: TfidfTransformer, which takes the sparse matrix produced by CountVectorizer and transforms it, and TfidfVectorizer, which takes the text data and performs both the bag-of-words feature extraction and the tf-idf transformation.

The tf-idf score for word w in document d, as implemented in both the TfidfVectorizer and TfidfTransformer classes, is computed as:

tfidf(w, d) = tf(w, d) · (log((N + 1) / (N_w + 1)) + 1)

where N is the number of documents in the training set, N_w is the number of documents in the training set that the word w appears in, and tf (the term frequency) is the number of times the word w appears in the query document d (the document you want to transform or encode). Both classes also apply L2 normalization after computing the tf-idf representation; in other words, they rescale the representation of each document to have Euclidean norm 1. Rescaling in this way means that the length of a document (the number of words) does not change the vectorized representation.
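
As a sanity check, here is a minimal sketch (not from the book) that reproduces scikit-learn's tf-idf values on the bards_words toy corpus by applying the formula above by hand; TfidfTransformer is used with norm=None so that the L2 rescaling does not obscure the comparison:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

counts = CountVectorizer().fit_transform(bards_words)         # raw term counts (tf)
tfidf = TfidfTransformer(norm=None).fit_transform(counts)     # scikit-learn's tf-idf, no L2 scaling

N = counts.shape[0]                                           # number of documents
Nw = np.asarray((counts > 0).sum(axis=0)).ravel()             # documents containing each word
manual = counts.toarray() * (np.log((N + 1) / (Nw + 1)) + 1)  # tf * smoothed idf, as in the formula

print(np.allclose(tfidf.toarray(), manual))                   # True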

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(TfidfVectorizer(min_df=5, norm=None),
                     LogisticRegression())
param_grid = {'logisticregression__C': [0.001, 0.01, 0.1, 1, 10]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(text_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))

Best cross-validation score: 0.89

vectorizer = grid.best_estimator_.named_steps["tfidfvectorizer"]
# transform the training dataset
X_train = vectorizer.transform(text_train)
# find the maximum value for each of the features over the dataset
max_value = X_train.max(axis=0).toarray().ravel()
sorted_by_tfidf = max_value.argsort()
# get feature names
feature_names = np.array(vectorizer.get_feature_names())
print("Features with lowest tfidf:\n{}".format(feature_names[sorted_by_tfidf[:20]]))
print("Features with highest tfidf:\n{}".format(feature_names[sorted_by_tfidf[-20:]]))

Features with lowest tfidf:
['poignant' 'disagree' 'instantly' 'importantly' 'lacked' 'occurred'
 'currently' 'altogether' 'nearby' 'undoubtedly' 'directs' 'fond' 'stinker'
 'avoided' 'emphasis' 'commented' 'disappoint' 'realizing' 'downhill'
 'inane']
Features with highest tfidf:
['coop' 'homer' 'dillinger' 'hackenstein' 'gadget' 'taker' 'macarthur'
 'vargas' 'jesse' 'basket' 'dominick' 'the' 'victor' 'bridget' 'victoria'
 'khouri' 'zizek' 'rob' 'timon' 'titanic']

sorted_by_idf = np.argsort(vectorizer.idf_)
print("Features with lowest idf:\n{}".format(feature_names[sorted_by_idf[:100]]))

Features with lowest idf:
['the' 'and' 'of' 'to' 'this' 'is' 'it' 'in' 'that' 'but' 'for' 'with'
 'was' 'as' 'on' 'movie' 'not' 'have' 'one' 'be' 'film' 'are' 'you' 'all'
 'at' 'an' 'by' 'so' 'from' 'like' 'who' 'they' 'there' 'if' 'his' 'out'
 'just' 'about' 'he' 'or' 'has' 'what' 'some' 'good' 'can' 'more' 'when'
 'time' 'up' 'very' 'even' 'only' 'no' 'would' 'my' 'see' 'really' 'story'
 'which' 'well' 'had' 'me' 'than' 'much' 'their' 'get' 'were' 'other'
 'been' 'do' 'most' 'don' 'her' 'also' 'into' 'first' 'made' 'how' 'great'
 'because' 'will' 'people' 'make' 'way' 'could' 'we' 'bad' 'after' 'any'
 'too' 'then' 'them' 'she' 'watch' 'think' 'acting' 'movies' 'seen' 'its'
 'him']

VI. Investigating Model Coefficients

The plot produced by the code below shows the largest and smallest coefficients of the logistic regression model (40 in each direction, as set by n_top_features=40), with the height of each bar indicating the size of the coefficient:

import mglearn

mglearn.tools.visualize_coefficients(
    grid.best_estimator_.named_steps["logisticregression"].coef_,
    feature_names, n_top_features=40)

The negative coefficients on the left belong to words that, according to the model, indicate negative reviews, while the positive coefficients on the right belong to words that the model found to indicate positive reviews.
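
If mglearn is not available, a plain NumPy sketch like the following (not from the book; showing 40 words per side simply mirrors the plot above) lists the most negative and most positive coefficients directly:

coef = grid.best_estimator_.named_steps["logisticregression"].coef_.ravel()
sorted_by_coef = np.argsort(coef)

# 40 most negative coefficients -> words the model associates with negative reviews
print("Most negative words:\n{}".format(feature_names[sorted_by_coef[:40]]))
# 40 most positive coefficients -> words the model associates with positive reviews
print("Most positive words:\n{}".format(feature_names[sorted_by_coef[-40:]]))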
