Machine Learning: Sentiment Analysis

Latest recommended article published on 2024-04-10 10:30:00

Amy_mm


Article link: https://blog.csdn.net/Amy_mm/article/details/79976053


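Based on "Python Machine Learning", Chapter 8: Applying Machine Learning to Sentiment Analysis

Source code on GitHub: https://github.com/xuman-Amy/sentimental-analysis

Project description: given 50,000 movie reviews from the Internet Movie Database (IMDb), predict whether each review is positive or negative.

(1) Clean and prepare the text data

(2) Build feature vectors from the dataset

(3) Train a model to classify reviews as positive or negative

(4) Process large datasets with out-of-core learning

(5) Infer topics from document collections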

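【1. Preparing the data】

Data description: the dataset contains 50,000 movie reviews, each labeled positive or negative. Positive means the movie was rated with more than six stars on IMDb; negative means it was rated with fewer than five stars.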

【Loading the data】

import pandas as pd
# Raw string so the backslashes in the Windows path are not treated as escapes
df = pd.read_csv(r"G:\Machine Learning\python machine learning\python machine learning code\code\ch08\movie_data.csv")
df.head()

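【bag-of-words】

The bag-of-words model converts text into numerical feature vectors.

The basic idea of bag-of-words:

(1) Create a vocabulary of unique tokens, for example words, from the entire set of documents.

(2) Construct a feature vector for each document that contains the number of times each word occurs in that document.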
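The two steps above can be sketched in plain Python before handing the work to a library. This is a minimal illustration, not how sklearn is implemented; the `tokenize` and `to_vector` helpers are hypothetical names, and the regex is an assumption chosen to keep lowercase tokens of two or more word characters.

```python
import re
from collections import Counter

docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, the weather is sweet, and one and one is two',
]

def tokenize(text):
    # Lowercase and keep runs of two or more word characters
    return re.findall(r'\b\w\w+\b', text.lower())

# Step 1: vocabulary of unique tokens, each mapped to a column index
unique_tokens = sorted({tok for d in docs for tok in tokenize(d)})
vocabulary = {tok: i for i, tok in enumerate(unique_tokens)}

# Step 2: one count vector per document, in vocabulary order
def to_vector(text):
    counts = Counter(tokenize(text))
    return [counts[tok] for tok in unique_tokens]

bag = [to_vector(d) for d in docs]
for row in bag:
    print(row)
```

Each row has one column per vocabulary entry, so documents of different lengths all end up as vectors of the same size.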

【bag-of-words with sklearn】

Use CountVectorizer to convert words into feature vectors:

# bag-of-words
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer()
# named docs because the tf-idf code further down refers to docs
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two'])
bag = count.fit_transform(docs)
print(count.vocabulary_)

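CountVectorizer stores the vocabulary in a dictionary that maps each unique word to an integer index.

Columns 0 to 8 of the feature vectors correspond to these dictionary indices. The feature vectors are:

print(bag.toarray())

Column order: (and, is, one, shining, sun, sweet, the, two, weather)

The values in the vectors are called raw term frequencies, tf(t, d): the number of times term t occurs in document d.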

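【tf-idf】

term frequency-inverse document frequency (tf-idf)

If a word or phrase occurs frequently in one document (high tf) but rarely in other documents, it is considered to discriminate well between classes and is useful for classification.

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t, d)$$

$$\text{idf}(t, d) = \log \frac{n_d}{1 + \text{df}(d, t)}$$

Here $n_d$ is the total number of documents and $\text{df}(d, t)$ is the number of documents that contain term t. Adding 1 keeps the denominator from being zero, and the log ensures that low document frequencies are not given too much weight.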
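As a rough illustration of the textbook definition idf(t, d) = log(n_d / (1 + df(d, t))), the sketch below computes idf for a few terms of the three example sentences. The document frequencies are counted by hand from those sentences, the natural logarithm is an assumption, and the helper names `idf` and `tf_idf` are hypothetical.

```python
import math

# Document frequencies for the three example sentences:
# how many of the 3 documents contain each term
n_docs = 3
df = {'and': 1, 'is': 3, 'one': 1, 'shining': 2, 'sun': 2,
      'sweet': 2, 'the': 3, 'two': 1, 'weather': 2}

def idf(term):
    # idf(t, d) = log(n_d / (1 + df(d, t)))
    return math.log(n_docs / (1 + df[term]))

def tf_idf(tf, term):
    # tf-idf(t, d) = tf(t, d) * idf(t, d)
    return tf * idf(term)

for term in ('and', 'sun', 'is'):
    print(term, round(idf(term), 3))
```

Terms that occur in fewer documents receive a larger idf. Under this unsmoothed form, a term that occurs in every document gets a negative weight, which is one reason library implementations tend to adjust the formula.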

【tf-idf with sklearn】

from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)
np.set_printoptions(precision=2)
# count and docs are the CountVectorizer and example sentences defined earlier
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

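Analysis of the output:

Comparing the tf and tf-idf output: in the tf table, the word 'is' has the highest frequency in the third document (3), but its tf-idf value is relatively low (0.45). This is because 'is' also occurs in the first and second documents, so it is unlikely to carry useful discriminative information.

Note:

The tf-idf formula in sklearn differs slightly from the one above. In sklearn (with smooth_idf=True) it is computed as:

$$\text{idf}(t, d) = \log \frac{1 + n_d}{1 + \text{df}(d, t)}$$

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times (\text{idf}(t, d) + 1)$$

Finally, the tf-idf values are normalized. With norm='l2', sklearn applies the normalization directly and returns a vector of length 1, computed as:

$$v_{\text{norm}} = \frac{v}{\lVert v \rVert_2} = \frac{v}{\sqrt{v_1^2 + v_2^2 + \cdots + v_n^2}}$$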
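To see where the 0.45 for 'is' comes from, the calculation can be reproduced by hand for the third example sentence. This is a sketch in plain Python under the assumption that sklearn uses the smoothed form idf = ln((1 + n_d) / (1 + df)) with tf-idf = tf × (idf + 1), followed by L2 normalization; the counts are taken by hand from the three example sentences.

```python
import math

n_docs = 3
vocab = ['and', 'is', 'one', 'shining', 'sun', 'sweet', 'the', 'two', 'weather']
df = [1, 3, 1, 2, 2, 2, 3, 1, 2]        # documents containing each term
tf_doc3 = [2, 3, 2, 1, 1, 1, 2, 1, 1]   # raw counts in the third document

# Smoothed idf, then tf-idf = tf * (idf + 1)
idf = [math.log((1 + n_docs) / (1 + d)) for d in df]
raw = [tf * (w + 1) for tf, w in zip(tf_doc3, idf)]

# L2 normalization: divide by the Euclidean length of the vector
length = math.sqrt(sum(v * v for v in raw))
tfidf_doc3 = [v / length for v in raw]

print(dict(zip(vocab, (round(v, 2) for v in tfidf_doc3))))
```

Before normalization 'is' has the value 3 × (0 + 1) = 3, because its idf is log(4/4) = 0. Dividing by the vector length brings it down to roughly 0.45, while the rarer 'and' and 'one' end up with larger weights despite lower raw counts.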

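【Cleaning the data】

Before applying bag-of-words and the other steps above, the text must first be cleaned by stripping out unwanted characters.

Python's regular expression (regex) library, re, does the cleaning.

import re
def preprocessor(text):
    # Remove HTML markup
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :) ;-( =D before punctuation is stripped
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, replace non-word characters with spaces,
    # then append the emoticons with the "nose" character removed
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text

df['review'] = df['review'].apply(preprocessor)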
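A quick way to check the cleaning step is to run it on a short string that contains markup, punctuation, and emoticons. The sample sentence below is made up for illustration; the function body repeats the preprocessor so the snippet runs on its own.

```python
import re

def preprocessor(text):
    text = re.sub(r'<[^>]*>', '', text)                        # drop HTML tags
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text

cleaned = preprocessor('</a>This :) is :( a test :-)!')
print(cleaned)  # this is a test :) :( :)
```

The tag and punctuation disappear, the words are lowercased, and the emoticons are moved to the end of the string so they survive as tokens.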

【Splitting documents into tokens】

(1) Split the cleaned documents at whitespace characters

def tokenizer(text):
    return text.split()

tokenizer('runners like running and thus they run')

(2) Use the Porter stemmer algorithm to reduce each word to its stem

# word stemming
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

tokenizer_porter('runners like running and thus they run')

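【Removing stop-words】

Stop-words are extremely common in all kinds of documents and carry little information that helps distinguish between classes.

NLTK ships a list of English stop-words (127 in the version used here) that can be loaded directly.

# removing stop-words
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords
stop = stopwords.words('english')
[w for w in tokenizer_porter('runners like running and thus they run')[-10:] if w not in stop]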

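【Training a logistic regression model for document classification】

【Use the first 25,000 reviews for training and the last 25,000 for testing】

# training LR model
# split train and test dataset
# .loc slicing includes the end label, so stop at 24999 to get exactly
# 25,000 training rows and keep row 25000 out of the training set
X_train = df.loc[:24999, 'review'].values
y_train = df.loc[:24999, 'sentiment'].values
X_test =  df.loc[25000:, 'review'].values
y_test =  df.loc[25000:, 'sentiment'].values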

【Use grid search with 5-fold cross-validation to find the best parameters for the LR model】

# train LR model
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

# Second dictionary: same search on raw term frequencies (no idf, no normalization)
param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              ]

# solver='liblinear' supports both the 'l1' and 'l2' penalties searched above
lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0, solver='liblinear'))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)

gs_lr_tfidf.fit(X_train, y_train)

print('Best parameter set: %s ' % gs_lr_tfidf.best_params_)
print('CV Accuracy: %.3f' % gs_lr_tfidf.best_score_)

clf = gs_lr_tfidf.best_estimator_
print('Test Accuracy: %.3f' % clf.score(X_test, y_test))
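The grid search can take a long time on 25,000 reviews, and a quick count of the model fits shows why. The sketch below only does the arithmetic for the param_grid above; the dictionaries hold the number of candidate values per parameter rather than the values themselves, and the variable names are illustrative.

```python
from math import prod

# Number of candidate values per parameter in the first dictionary of param_grid
grid_1 = {'ngram_range': 1, 'stop_words': 2, 'tokenizer': 2,
          'penalty': 2, 'C': 3}
# The second dictionary adds use_idf and norm with one value each
grid_2 = dict(grid_1, use_idf=1, norm=1)

n_combinations = prod(grid_1.values()) + prod(grid_2.values())
n_folds = 5
n_fits = n_combinations * n_folds

print(n_combinations, n_fits)  # 48 240
```

With the default refit behavior, GridSearchCV trains one more model on the full training set using the best combination, so trimming the grid is the simplest way to shorten the run.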

【Out-of-core learning for large datasets】

Define a tokenizer that cleans the text and splits it into word tokens

# construct word tokens
import numpy as np
import re
from nltk.corpus import stopwords

# stop-word list used by the tokenizer below
stop = stopwords.words('english')

def tokenizer(text):
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub(r'[\W]+', ' ', text.lower()) +\
        ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized


# read in and yield one document at a time
def stream_docs(path):
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv)  # skip header
        for line in csv:
            # each line ends with ',<label>\n': text is everything before it
            text, label = line[:-3], int(line[-2])
            yield text, label

Fetch a minibatch with a given number of documents

#  take a document stream from the stream_docs function 
#  and return a particular number of documents specified by the size parameter
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        # stream ran out before the batch was full
        return None, None
    return docs, y
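The streaming logic is easier to follow on a tiny in-memory file. The sketch below adapts stream_docs to read from any iterable of lines (an assumption made so that no file on disk is needed) and feeds it a made-up three-row CSV in the same layout as movie_data.csv.

```python
import io

def stream_docs(lines):
    # Same slicing as the file-based version, but for any iterable of lines
    it = iter(lines)
    next(it)  # skip header
    for line in it:
        # Each line ends with ',<label>\n': comma, one-digit label, newline
        yield line[:-3], int(line[-2])

def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y

# Hypothetical three-row file
fake_csv = io.StringIO(
    'review,sentiment\n"great film",1\n"dull plot",0\n"loved it",1\n')
stream = stream_docs(fake_csv)

first = get_minibatch(stream, size=2)
second = get_minibatch(stream, size=2)  # only one row is left

print(first)
print(second)
```

Two details are worth noting: the slicing relies on every line ending with a single-digit label followed by a newline, and a batch that cannot be filled completely is discarded, since get_minibatch returns (None, None) as soon as the stream is exhausted.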

Use HashingVectorizer to turn the text into feature vectors

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
vect = HashingVectorizer(decode_error='ignore',
                         n_features=2 ** 21,
                         preprocessor=None,
                         tokenizer=tokenizer)
# loss='log_loss' selects logistic regression (named 'log' before scikit-learn 1.1).
# The old n_iter argument no longer exists; each partial_fit call makes one pass.
clf = SGDClassifier(loss='log_loss', random_state=1)
doc_stream = stream_docs(path=
 r'G:\Machine Learning\python machine learning\python machine learning code\code\ch08\movie_data.csv')
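HashingVectorizer suits out-of-core learning because it needs no stored vocabulary: each token is hashed directly to a column index. The sketch below shows the idea in plain Python. It is not sklearn's implementation: MD5 from hashlib stands in for the MurmurHash3 function sklearn uses, the tiny n_features is chosen for readability, and `hashed_counts` is a hypothetical helper.

```python
import hashlib

def hashed_counts(tokens, n_features=2 ** 4):
    # Map each token straight to a column index; no vocabulary is stored
    vec = [0] * n_features
    for tok in tokens:
        digest = hashlib.md5(tok.encode('utf-8')).digest()
        index = int.from_bytes(digest[:4], 'little') % n_features
        vec[index] += 1
    return vec

a = hashed_counts(['good', 'movie', 'good'])
b = hashed_counts(['movie'])

print(a)
print(b)
```

The same token always lands in the same column, so batches can be vectorized independently. The price is that different tokens can collide in one column, which is why the post uses a large n_features (2 ** 21).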

【Running the out-of-core learning】

Load the training data in batches of 1,000 documents, 45 batches in total (45,000 documents)

import pyprind
pbar = pyprind.ProgBar(45)
classes = np.array([0,1])
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)  
    if not X_train:
        break
    X_train  = vect.transform(X_train)
    # update the model incrementally with this batch only
    clf.partial_fit(X_train, y_train, classes = classes)
    pbar.update()

Load the last 5,000 documents as the test set

X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print('Accuracy: %.3f' % clf.score(X_test, y_test))
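What partial_fit does conceptually is update the model weights from one batch and then forget the batch. The toy class below is a plain-Python logistic regression trained by stochastic gradient descent; it is a stand-in to illustrate incremental updates, not SGDClassifier itself, and the class name, learning rate, and two-feature data are all made up.

```python
import math

class OnlineLogReg:
    """Tiny logistic regression updated one minibatch at a time."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def _prob(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, X, y):
        # One gradient step per sample; earlier batches are never revisited
        for x, target in zip(X, y):
            err = self._prob(x) - target
            self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
            self.b -= self.lr * err

    def predict(self, X):
        return [int(self._prob(x) >= 0.5) for x in X]

# Hypothetical data: class 1 when the first feature is active, class 0 otherwise
batches = [
    ([[1, 0], [0, 1]], [1, 0]),
    ([[2, 0], [0, 2]], [1, 0]),
    ([[1, 0], [0, 3]], [1, 0]),
]

clf = OnlineLogReg(n_features=2)
for _ in range(20):                  # several passes, only because the toy data is tiny
    for X_batch, y_batch in batches:
        clf.partial_fit(X_batch, y_batch)

print(clf.predict([[3, 0], [0, 3]]))
```

Only the weight vector stays in memory between batches, which is what lets the real pipeline work through the reviews 1,000 at a time.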

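【Topic modeling with Latent Dirichlet Allocation (LDA)】

Topic modeling: the broad task of assigning topics to unlabeled text documents.

Latent Dirichlet Allocation (LDA) is used here for topic modeling.

【The basic idea of LDA】

LDA is a generative probabilistic model that tries to find groups of words that frequently appear together across different documents. These word groups represent the topics.

The input is a bag-of-words matrix. LDA decomposes it into two new matrices: a document-to-topic matrix and a word-to-topic matrix.

The number of topics is a hyperparameter of LDA and must be specified beforehand.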

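【LDA with sklearn】

Use LDA to decompose the movie review dataset and group the reviews into different topics.

【1. Read the data】

import pandas as pd
# Raw string so the backslashes in the Windows path are not treated as escapes
df = pd.read_csv(r"G:\Machine Learning\python machine learning\python machine learning code\code\ch08\movie_data.csv",
                encoding = 'utf-8')
df.head()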

【2. Create the bag-of-words input】

from sklearn.feature_extraction.text import CountVectorizer
# max_df = .1: ignore words that occur in more than 10% of the documents
# max_features = 5000: keep only the 5000 most frequent remaining words
count = CountVectorizer(stop_words = 'english',
                        max_df = .1,
                        max_features = 5000)
X = count.fit_transform(df['review'].values)

【3. LDA】

# LDA modeling
from sklearn.decomposition import LatentDirichletAllocation
# n_components is the number of topics (this argument was formerly called n_topics)
lda = LatentDirichletAllocation(n_components = 10,
                               random_state = 123,
                               learning_method = 'batch')
X_topics = lda.fit_transform(X)

Find the top 5 words for each topic

# five most important words for each of the 10 topics
n_top_words = 5
# get_feature_names_out() replaces get_feature_names() (scikit-learn 1.0 and later)
feature_names = count.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    print('Topic %d :' % (topic_idx + 1))
    print(" ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1 :-1]]))
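The slice `[:-n_top_words - 1:-1]` is compact but easy to misread. The sketch below reproduces it on a plain list, with `sorted` standing in for NumPy's `argsort`; the weights and words are invented for illustration.

```python
# Hypothetical word weights for one topic (one row of lda.components_)
topic = [0.2, 3.1, 0.7, 5.0, 1.4, 2.2]
feature_names = ['plot', 'horror', 'music', 'scary', 'family', 'blood']
n_top_words = 3

# Equivalent of argsort: indices ordered by ascending weight
ascending = sorted(range(len(topic)), key=lambda i: topic[i])

# Walk backwards from the end and stop after n_top_words items,
# which yields the indices of the largest weights, largest first
top = ascending[:-n_top_words - 1:-1]
top_words = [feature_names[i] for i in top]

print(top)        # [3, 1, 5]
print(top_words)  # ['scary', 'horror', 'blood']
```

Sorting ascending and then reading from the end avoids negating the weights, and the stop position `-n_top_words - 1` is exclusive, so exactly n_top_words indices come out.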

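Guess the movie genre behind each topic

Check the guess against the review text. Here topic 6 (column index 5) is taken to be horror movies, and the three reviews with the highest weight for that topic are printed:

horror = X_topics[:,5].argsort()[:: -1]
for iter_idx, movie_idx in enumerate(horror[:3]):
    print('\nHorror movie #%d:' % (iter_idx + 1))
    print(df['review'][movie_idx][:300], '...')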
