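Python Machine Learning, Chapter 8: Applying Machine Learning to Sentiment Analysis
Source code on GitHub: https://github.com/xuman-Amy/sentimental-analysis
Project description: using 50,000 movie reviews from the Internet Movie Database (IMDb), predict whether a review is positive or negative.
(1) Clean and prepare the text data
(2) Build feature vectors from the dataset
(3) Train a model to classify reviews as positive or negative
(4) Process a large dataset with out-of-core learning
(5) Infer topics from the document collection for categorization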
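【1. Preparing the data】
Data description: the dataset contains 50,000 movie reviews, each labeled positive or negative. Positive means the movie was rated more than six stars on IMDb; negative means it was rated fewer than five stars.
【Loading the data】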
import pandas as pd
# raw string, so the backslashes in the Windows path are not read as escape sequences
df = pd.read_csv(r"G:\Machine Learning\python machine learning\python machine learning code\code\ch08\movie_data.csv")
df.head()
【bag-of-words】
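Bag-of-words converts text into numerical feature vectors.
The basic idea of bag-of-words:
(1) Build a vocabulary of unique tokens, for example the words, from the entire set of documents.
(2) Construct a feature vector for each document that contains the number of times each word occurs in that document.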
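The two steps can be sketched in plain Python before reaching for a library. This is a minimal illustration, not how CountVectorizer is implemented: the `tokenize` and `vectorize` helpers are hypothetical names, and the tokenization (lowercase, drop commas, split on whitespace) is a simplifying assumption.

```python
from collections import Counter

# Three toy documents (assumed sample sentences)
docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, the weather is sweet, and one and one is two',
]

def tokenize(text):
    # Simplified tokenization: lowercase, drop commas, split on whitespace
    return text.lower().replace(',', ' ').split()

# Step 1: vocabulary of unique tokens, each mapped to a column index
unique_tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
vocab = {tok: idx for idx, tok in enumerate(unique_tokens)}

def vectorize(text):
    # Step 2: count how often each vocabulary word occurs in one document
    counts = Counter(tokenize(text))
    return [counts.get(tok, 0) for tok in unique_tokens]

vectors = [vectorize(doc) for doc in docs]
print(vocab)
print(vectors[2])  # [2, 3, 2, 1, 1, 1, 2, 1, 1]
```

Each vector has one entry per vocabulary word, so most entries are zero once the vocabulary grows; that is why such vectors are usually stored as sparse matrices.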
【Implementing bag-of-words with sklearn】
Convert the words into feature vectors using CountVectorizer.
# bag-of-words
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer()
docs = np.array([
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, the weather is sweet, and one and one is two'])
bag = count.fit_transform(docs)
print(count.vocabulary_)
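CountVectorizer stores each word in a dictionary (vocabulary_) that maps the word to an integer index.
Columns 0-8 of the feature vectors correspond to these indices. The feature vectors are:
print(bag.toarray())
Column order: (and, is, one, shining, sun, sweet, the, two, weather)
The values in the vectors are called raw term frequencies, tf(t,d): the number of times term t occurs in document d.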
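【tf-idf】
Term frequency-inverse document frequency (tf-idf).
If a word or phrase occurs frequently in one document (high tf) but rarely in the other documents, it is considered to discriminate well between classes and is useful for classification.
$$\text{tf-idf}(t,d) = \text{tf}(t,d) \times \text{idf}(t,d), \qquad \text{idf}(t,d) = \log\frac{n_d}{1 + \text{df}(d,t)}$$
Here $n_d$ is the total number of documents and $\text{df}(d,t)$ is the number of documents d that contain term t. Adding 1 keeps the denominator non-zero, and the log ensures that low document frequencies are not given too much weight.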
【Implementing tf-idf with sklearn】
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(use_idf = True, norm = 'l2', smooth_idf = True)
tfidf.fit_transform(count.fit_transform(docs))
np.set_printoptions(precision = 2)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())
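Analysis of the tables:
Comparing the tf and tf-idf tables, the word 'is' has the highest raw frequency in the third document (3) but a relatively low tf-idf value (0.45). Because 'is' also occurs in the first and second documents, it is unlikely to carry useful discriminative information.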
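Note:
The tf-idf formula in sklearn differs slightly from the one above. With smooth_idf=True, sklearn computes:
$$\text{idf}(t,d) = \log\frac{1 + n_d}{1 + \text{df}(d,t)}, \qquad \text{tf-idf}(t,d) = \text{tf}(t,d) \times \left(\text{idf}(t,d) + 1\right)$$
Finally, the tf-idf values are normalized. With norm='l2', sklearn applies the normalization directly and returns a vector of length 1:
$$v_{\text{norm}} = \frac{v}{\lVert v \rVert_2} = \frac{v}{\sqrt{v_1^2 + v_2^2 + \cdots + v_n^2}}$$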
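As a check on the 0.45 value quoted for 'is', the smoothed idf used by sklearn, ln((1 + n_d) / (1 + df)) + 1, followed by L2 normalization, can be reproduced by hand. The sketch below uses only the standard library; the count matrix is typed in from the three example sentences, and `tfidf_l2` is a hypothetical helper name.

```python
import math

# Raw term counts for the three example documents, columns ordered as
# (and, is, one, shining, sun, sweet, the, two, weather)
counts = [
    [0, 1, 0, 1, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 1, 0, 1],
    [2, 3, 2, 1, 1, 1, 2, 1, 1],
]
n_docs = len(counts)

# Document frequency: in how many documents does each term occur?
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(counts[0]))]

# Smoothed idf with the extra "+ 1", as in the sklearn formula
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def tfidf_l2(row):
    # Weight the raw counts by idf, then scale the vector to unit length
    raw = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

doc3 = tfidf_l2(counts[2])
print(round(doc3[1], 2))  # tf-idf of 'is' in the third document: 0.45
```

Because 'is' occurs in all three documents, its idf term is ln(4/4) + 1 = 1, so its weight stays at the raw count of 3 while rarer words such as 'and' and 'one' are scaled up, which pushes 'is' down after normalization.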
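【Cleaning the data】
Before applying bag-of-words and the other steps above, the text must be cleaned by stripping out unwanted characters such as HTML markup and punctuation.
Python's regular expression (regex) library, re, is used for the cleaning.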
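import re
def preprocessor(text):
    # remove HTML markup
    text = re.sub(r'<[^>]*>', '', text)
    # collect emoticons such as :) ;-) =D before punctuation is stripped
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # lowercase, replace non-word characters with spaces,
    # then append the emoticons with the '-' nose removed
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text
df['review'] = df['review'].apply(preprocessor)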
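To see what the cleaning step does, it helps to run it on one short string. The block below repeats the preprocessor so that it runs on its own; the sample sentence is made up for illustration.

```python
import re

def preprocessor(text):
    # Strip HTML tags
    text = re.sub(r'<[^>]*>', '', text)
    # Remember emoticons before non-word characters are removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, collapse non-word characters to single spaces,
    # and append the emoticons without the '-' nose
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text

result = preprocessor('</a>This :) is :( a test :-)!')
print(result)  # this is a test :) :( :)
```

The emoticons are moved to the end of the string, which is harmless for bag-of-words because word order is ignored anyway.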
【Splitting documents into tokens】
(1) Split the cleaned documents at whitespace characters.
def tokenizer(text):
    return text.split()
tokenizer('runners like running and thus they run')
(2) Use the Porter stemmer algorithm to reduce each word to its stem.
# word stemming
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]
tokenizer_porter('runners like running and thus they run')
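【Removing stop-words】
Stop-words are extremely common in all documents and carry little information that helps distinguish between classes.
NLTK provides a list of English stop-words that can be loaded directly (127 in the version used here; the count depends on the NLTK version).
# removing stop-words
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')
[w for w in tokenizer_porter('runners like running and thus they run')[-10:] if w not in stop]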
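【Training a logistic regression model for text classification】
【Use the first 25,000 reviews as the training set and the last 25,000 as the test set】
# training LR model
# split train and test dataset
# .loc slices by label and includes the end label, so stop at 24999 to get exactly 25,000 rows
X_train = df.loc[:24999, 'review'].values
y_train = df.loc[:24999, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values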
【Using grid search with 5-fold cross-validation to find the best parameter set for the LR model】
# train LR model
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)
param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf': [False],
               'vect__norm': [None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              ]
# liblinear supports both the 'l1' and 'l2' penalties searched above
lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0, solver='liblinear'))])
gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)
gs_lr_tfidf.fit(X_train, y_train)
print('Best parameter set: %s ' % gs_lr_tfidf.best_params_)
print('CV Accuracy: %.3f' % gs_lr_tfidf.best_score_)
clf = gs_lr_tfidf.best_estimator_
print('Test Accuracy: %.3f' % clf.score(X_test, y_test))
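The grid search can take a long time, and a quick count shows why. The sketch below enumerates the two parameter dictionaries with the standard library only; the string placeholders stand in for the real stop-word list and tokenizer functions, and the variable names are chosen for illustration.

```python
from itertools import product

# Same option counts as the two dictionaries in param_grid;
# strings are placeholders for the actual objects
grid_tfidf = {
    'ngram_range': [(1, 1)],
    'stop_words': ['stop', None],
    'tokenizer': ['tokenizer', 'tokenizer_porter'],
    'penalty': ['l1', 'l2'],
    'C': [1.0, 10.0, 100.0],
}
# Second dictionary: same options with idf and normalization switched off
grid_raw_tf = dict(grid_tfidf, use_idf=[False], norm=[None])

n_candidates = sum(len(list(product(*grid.values())))
                   for grid in (grid_tfidf, grid_raw_tf))
n_folds = 5
n_fits = n_candidates * n_folds

print(n_candidates)  # 48 parameter combinations
print(n_fits)        # 240 model fits with 5-fold cross-validation
```

Each of those fits vectorizes and trains on 20,000 reviews, so trimming the grid or the training set is a reasonable way to shorten a first run.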
【Processing a large dataset out-of-core】
Define a tokenizer that cleans the text and splits it into word tokens.
# construct word tokens
import numpy as np
import re
from nltk.corpus import stopwords
stop = stopwords.words('english')
def tokenizer(text):
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub(r'[\W]+', ' ', text.lower()) + \
        ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized
# read in and return one document at a time
def stream_docs(path):
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv)  # skip header
        for line in csv:
            # each line ends with ',<label>\n', so line[-2] is the label digit
            text, label = line[:-3], int(line[-2])
            yield text, label
Fetch a specified number of documents.
# take a document stream from the stream_docs function
# and return a particular number of documents specified by the size parameter
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y
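The streaming pattern can be tried without the real CSV file. The sketch below feeds an in-memory file (an assumed three-row sample) through the same line parsing, and uses `itertools.islice` for batching. Note one deliberate difference from the function above: this variant returns a final partial batch instead of discarding it.

```python
import io
from itertools import islice

def stream_docs(handle):
    # Yield (text, label) pairs one line at a time from an open file handle
    next(handle)  # skip header
    for line in handle:
        # line ends with ',<label>\n': line[-2] is the label digit
        yield line[:-3], int(line[-2])

def get_minibatch(doc_stream, size):
    # Pull up to `size` documents; (None, None) once the stream is exhausted
    batch = list(islice(doc_stream, size))
    if not batch:
        return None, None
    docs, labels = zip(*batch)
    return list(docs), list(labels)

# Assumed sample data in the same layout as movie_data.csv
fake_csv = io.StringIO(
    'review,sentiment\n'
    '"great film",1\n'
    '"dull plot",0\n'
    '"loved it",1\n'
)
stream = stream_docs(fake_csv)
first = get_minibatch(stream, 2)
second = get_minibatch(stream, 2)
third = get_minibatch(stream, 2)
print(first)   # (['"great film"', '"dull plot"'], [1, 0])
print(second)  # (['"loved it"'], [1])
print(third)   # (None, None)
```

Because the generator keeps its position between calls, only one mini-batch is ever held in memory, which is the point of out-of-core processing.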
Use HashingVectorizer to turn the text into feature vectors.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
vect = HashingVectorizer(decode_error='ignore',
                         n_features=2 ** 21,
                         preprocessor=None,
                         tokenizer=tokenizer)
# loss='log_loss' selects logistic regression (named 'log' in scikit-learn < 1.1);
# the old n_iter argument was removed, and each partial_fit call makes a single pass anyway
clf = SGDClassifier(loss='log_loss', random_state=1)
doc_stream = stream_docs(path=
    r'G:\Machine Learning\python machine learning\python machine learning code\code\ch08\movie_data.csv')
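HashingVectorizer needs no fitted vocabulary because it maps each token straight to a column index with a hash function. The sketch below shows the idea with the standard library. It is an illustration only: it uses MD5 from `hashlib`, whereas scikit-learn uses a signed 32-bit MurmurHash3, so the indices will not match, and `hashed_bow` is a hypothetical helper name.

```python
import hashlib

def hashed_bow(tokens, n_features=2 ** 21):
    # Map each token to a column index by hashing, then count occurrences.
    # No vocabulary is stored, so this works on a stream of unseen documents.
    vec = {}
    for tok in tokens:
        digest = hashlib.md5(tok.encode('utf-8')).digest()
        idx = int.from_bytes(digest[:8], 'big') % n_features
        vec[idx] = vec.get(idx, 0) + 1
    return vec  # sparse representation: {column index: count}

tokens = ['the', 'sun', 'is', 'shining', 'the']
vec = hashed_bow(tokens)
print(vec)
```

The price of skipping the vocabulary is that different tokens can collide in the same column and that indices cannot be mapped back to words; a large n_features such as 2 ** 21 keeps collisions rare.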
【Running out-of-core learning】
Load the training data in mini-batches of 1,000 documents, 45 batches in total (45 * 1000 = 45,000 documents).
import pyprind
pbar = pyprind.ProgBar(45)
classes = np.array([0, 1])
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
    pbar.update()
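What partial_fit does conceptually is update the model weights from one mini-batch and keep them for the next call. The sketch below is a hand-rolled logistic regression trained with stochastic gradient descent in plain Python, as a stand-in for that idea. It is not scikit-learn's SGDClassifier implementation; the class name, learning rate, and toy batches are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    # Numerically safe logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

class OnlineLogReg:
    """Toy logistic regression that learns one mini-batch at a time."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        return sigmoid(self.b + sum(wi * xi for wi, xi in zip(self.w, x)))

    def partial_fit(self, X, y):
        # One pass over the batch; weights persist between calls
        for x, target in zip(X, y):
            err = target - self.predict_proba(x)
            self.b += self.lr * err
            for j, xj in enumerate(x):
                self.w[j] += self.lr * err * xj

    def predict(self, x):
        return int(self.predict_proba(x) >= 0.5)

# Toy mini-batches: feature 0 signals class 1, feature 1 signals class 0
batches = [
    ([[1, 0], [0, 1]], [1, 0]),
    ([[2, 0], [0, 2]], [1, 0]),
    ([[1, 0], [0, 3]], [1, 0]),
]
model = OnlineLogReg(n_features=2)
for X_batch, y_batch in batches:
    model.partial_fit(X_batch, y_batch)  # each batch is seen once, then dropped

preds = [model.predict([3, 0]), model.predict([0, 1])]
print(preds)  # [1, 0]
```

Each batch can be discarded right after its update, so memory use depends on the batch size rather than on the size of the dataset.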
Use the last 5,000 documents as the test set.
X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print('Accuracy: %.3f' % clf.score(X_test, y_test))
【Topic modeling with Latent Dirichlet Allocation (LDA)】
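Topic modeling: the broad task of assigning topics to unlabeled text documents.
Latent Dirichlet Allocation (LDA) is used here for topic modeling.
【The basic idea of LDA】
LDA is a generative probabilistic model that tries to find groups of words that frequently appear together across different documents. These groups of words represent the topics.
The input is a bag-of-words matrix. LDA decomposes it into two new matrices: a document-to-topic matrix and a word-to-topic matrix.
The number of topics is a hyperparameter of LDA and must be set beforehand.
【Implementing LDA with sklearn】
Use LDA to decompose the movie review dataset and categorize the reviews into different topics.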
【1. Loading the data】
import pandas as pd
# raw string, so the backslashes in the Windows path are not read as escape sequences
df = pd.read_csv(r"G:\Machine Learning\python machine learning\python machine learning code\code\ch08\movie_data.csv",
                 encoding='utf-8')
df.head()
【2. Creating the bag-of-words input】
from sklearn.feature_extraction.text import CountVectorizer
# max_df=.1 drops words that occur in more than 10% of the documents;
# max_features=5000 keeps only the 5,000 most frequent words
count = CountVectorizer(stop_words='english',
                        max_df=.1,
                        max_features=5000)
X = count.fit_transform(df['review'].values)
【3. LDA】
# LDA modeling
from sklearn.decomposition import LatentDirichletAllocation
# n_components is the number of topics (called n_topics in scikit-learn < 0.19)
lda = LatentDirichletAllocation(n_components=10,
                                random_state=123,
                                learning_method='batch')
X_topics = lda.fit_transform(X)
Find the top five words for each topic.
# five most important words for each of the 10 topics
n_top_words = 5
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
feature_names = count.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    print('Topic %d :' % (topic_idx + 1))
    print(" ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]))
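The slice `[:-n_top_words - 1:-1]` is compact but easy to misread: it walks an ascending argsort from the end and stops after n_top_words items, which yields the indices of the largest weights in descending order. The sketch below reproduces the idiom on a plain list; the words and weights are made-up values, and `top_indices` is a hypothetical helper.

```python
def top_indices(weights, n_top):
    # Ascending "argsort" in plain Python: indices ordered by their weight
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    # Walk backwards from the end and stop after n_top items
    return order[:-n_top - 1:-1]

# Made-up topic weights over a five-word vocabulary
feature_names = ['plot', 'ghost', 'love', 'war', 'blood']
topic = [0.2, 3.1, 0.5, 1.0, 2.7]

top_words = [feature_names[i] for i in top_indices(topic, 3)]
print(top_words)  # ['ghost', 'blood', 'war']
```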
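Guess a genre for each topic from its top words.
Check the guess against the review text: print the three reviews with the highest weight on topic 6 (index 5), guessed here to be horror.
horror = X_topics[:, 5].argsort()[::-1]
for iter_idx, movie_idx in enumerate(horror[:3]):
    print('\nHorror movie #%d:' % (iter_idx + 1))
    print(df['review'][movie_idx][:300], '...')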