Sentiment Analysis with Machine Learning

1. Importing the Movie Review Dataset

This movie review dataset contains 50,000 reviews. Reviews rated above 6 stars are labeled positive, and reviews rated below 5 stars are labeled negative.

The original dataset can be downloaded here.

import pandas as pd

df = pd.read_csv('movie_data.csv', encoding='utf-8')
df.head(3)
                                              review  sentiment
0  In 1974, the teenager Martha Moxley (Maggie Gr...          1
1  OK... so... I really like Kris Kristofferson a...          0
2  ***SPOILER*** Do not read this, if you think a...          0
import matplotlib.pyplot as plt
import seaborn as sns
# Check whether the two classes are balanced
sns.countplot(x='sentiment', data=df)
plt.show()

[Figure: count plot of the two sentiment classes]

df.shape
(50000, 2)

2. The Bag-of-Words Model

Similar to the earlier encoding of categorical features into numerical values, the bag-of-words model lets us encode text data as numerical feature vectors.

The idea:

1. Create a vocabulary of unique tokens, for example words, from the entire set of documents;

2. Construct a feature vector for each document that contains the counts of how often each word occurs in that particular document;

Since the unique words in each document represent only a small subset of all the words in the bag-of-words vocabulary, the feature vectors consist mostly of zeros, which is why they are sparse.

2.1 Transforming Words into Feature Vectors

# Use sklearn's CountVectorizer class to build a bag-of-words model based on the word counts in each document;
# CountVectorizer takes an array of text data, which can be documents or sentences.
# Example
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()   # the default is ngram_range=(1, 1)
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two'])
bag = count.fit_transform(docs)
# Inspect the sparse representation of the bag-of-words matrix
bag
<3x9 sparse matrix of type '<class 'numpy.int64'>'
	with 17 stored elements in Compressed Sparse Row format>

The integers in the vocabulary below are not counts of each word; they represent the mapping from words to feature indices.

https://stackoverflow.com/questions/45104936/countvectorizer-giving-wrong-counts-for-words

print(count.vocabulary_)
{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}

For example, 'and' at index 0 occurs only in the third document, so its column reads 0 0 2, where 2 is its count in that document;

'is' at index 1 occurs in all three documents, with counts of 1, 1, and 3, so its column reads 1 1 3.

There are three documents in total and nine unique words, so the matrix has 3 rows and 9 columns:

print(bag.toarray())
[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]

The bag-of-words model created above is also called a 1-gram (unigram) model, in which the document is split into individual words; more generally, a contiguous sequence of n words forms an n-gram model.

1-gram: 'the', 'sun', 'is', 'shining'

2-gram: 'the sun', 'sun is', 'is shining'

We can switch to the 2-gram representation by instantiating CountVectorizer with ngram_range=(2, 2), as the short sketch below shows.
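
As a quick illustration (a minimal sketch reusing the docs array defined above; count_2gram and bag_2gram are only illustrative names), the 2-gram vocabulary can be inspected like this:

# Minimal sketch: build a 2-gram bag-of-words model on the same docs array
from sklearn.feature_extraction.text import CountVectorizer

count_2gram = CountVectorizer(ngram_range=(2, 2))   # every token is a pair of consecutive words
bag_2gram = count_2gram.fit_transform(docs)
print(count_2gram.vocabulary_)                      # contains entries such as 'the sun' and 'sun is'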

2.2 Assessing Word Relevance via Term Frequency-Inverse Document Frequency (tf-idf)

np.set_printoptions(precision=2)

tf-idf can be used to downweight words that occur frequently across the feature vectors.

It is defined as the product of the term frequency and the inverse document frequency:

$\text{tf-idf}(t,d) = \text{tf}(t,d) \times \text{idf}(t,d)$

Here $\text{tf}(t,d)$ is the term frequency, and the inverse document frequency $\text{idf}(t,d)$ is computed as:

$\text{idf}(t,d) = \log\frac{n_d}{1+\text{df}(d,t)}$

where $n_d$ is the total number of documents and $\text{df}(d,t)$ is the number of documents $d$ that contain the term $t$.

Adding 1 to the denominator ensures that it is never zero.

scikit-learn provides another transformer, TfidfTransformer, which takes the raw term frequencies produced by CountVectorizer as input and converts them into tf-idf values.

from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(use_idf=True, 
                         norm='l2', 
                         smooth_idf=True)
print(tfidf.fit_transform(count.fit_transform(docs))
      .toarray())
[[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
 [0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
 [0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]]

Looking at the output of bag.toarray(), the word "is" has the largest term frequency in the third document, i.e. it is the most frequently occurring word there.

However, after transforming the counts into tf-idf values, "is" is now associated with a relatively small tf-idf of 0.45 in the third document.

This is because "is" also occurs in the first and second documents, so it is unlikely to carry any useful discriminative information.

Note, however, that the tf-idf values computed by TfidfTransformer differ from those obtained with the formulas defined above.

The reason is that the formulas are defined differently:

scikit-learn computes the idf as follows:

$\text{idf}(t,d) = \log\frac{1+n_d}{1+\text{df}(d,t)}$

The tf-idf implemented in scikit-learn is computed as:

$\text{tf-idf}(t,d) = \text{tf}(t,d) \times (\text{idf}(t,d)+1)$

The raw term frequencies could be normalized before computing the tf-idf, but here TfidfTransformer normalizes the tf-idf values directly.

By default, norm='l2' is used; this normalization divides the un-normalized feature vector by its L2 norm, returning a vector of length 1:

$v_{\text{norm}} = \frac{v}{\|v\|_2} = \frac{v}{\sqrt{v_1^2 + v_2^2 + \dots + v_n^2}} = \frac{v}{\left(\sum_{i=1}^{n} v_i^2\right)^{1/2}}$

TF-IDF calculation example:

The word "is" occurs 3 times in the third document $d_3$, so $\text{tf}=3$; since "is" occurs in all three documents, $\text{df}=3$.

Computing the idf:

$\text{idf}(\text{"is"}, d_3) = \log\frac{1+3}{1+3} = 0$

Computing the tf-idf:

$\text{tf-idf}(\text{"is"}, d_3) = 3 \times (0+1) = 3$

tf_is = 3
n_docs = 3
idf_is = np.log((n_docs+1) / (3+1))
tfidf_is = tf_is * (idf_is + 1)
print('tf-idf of term "is" = %.2f' % tfidf_is)
tf-idf of term "is" = 3.00

Repeating this calculation for all the terms in the third document yields the tf-idf vector [3.39, 3.0, 3.39, 1.29, 1.29, 1.29, 2.0, 1.69, 1.29].

However, this differs from the output of "print(tfidf.fit_transform(count.fit_transform(docs)).toarray())" above.

The reason is the normalization step, which is computed as follows.

Dividing each element of the vector by its L2 norm:

$\text{tf-idf}_{\text{norm}} = \frac{[3.39, 3.0, 3.39, 1.29, 1.29, 1.29, 2.0, 1.69, 1.29]}{\sqrt{3.39^2 + 3.0^2 + 3.39^2 + 1.29^2 + 1.29^2 + 1.29^2 + 2.0^2 + 1.69^2 + 1.29^2}}$

$= [0.5, 0.45, 0.5, 0.19, 0.19, 0.19, 0.3, 0.25, 0.19]$

$\Rightarrow \text{tf-idf}_{\text{norm}}(\text{"is"}, d_3) = 0.45$

# tf-idf without L2 normalization
tfidf = TfidfTransformer(use_idf=True, norm=None, smooth_idf=True)
raw_tfidf = tfidf.fit_transform(count.fit_transform(docs)).toarray()[-1]
raw_tfidf 
array([3.39, 3.  , 3.39, 1.29, 1.29, 1.29, 2.  , 1.69, 1.29])
# Without the [-1] indexing, the full un-normalized tf-idf matrix for all three documents is:
# tfidf = TfidfTransformer(use_idf=True, norm=None, smooth_idf=True)
# raw_tfidf = tfidf.fit_transform(count.fit_transform(docs)).toarray()
# raw_tfidf 
array([[0.  , 1.  , 0.  , 1.29, 1.29, 0.  , 1.  , 0.  , 0.  ],
       [0.  , 1.  , 0.  , 0.  , 0.  , 1.29, 1.  , 0.  , 1.29],
       [3.39, 3.  , 3.39, 1.29, 1.29, 1.29, 2.  , 1.69, 1.29]])
l2_tfidf = raw_tfidf / np.sqrt(np.sum(raw_tfidf**2))
l2_tfidf
array([0.5 , 0.45, 0.5 , 0.19, 0.19, 0.19, 0.3 , 0.25, 0.19])

3. Cleaning the Text Data

df.loc[0, 'review'][-50:]
'is seven.<br /><br />Title (Brazil): Not Available'
# Use regular expressions to strip HTML markup and punctuation while keeping emoticons
# re.sub replaces every match of a pattern in a string
import re
def preprocessor(text):
    text = re.sub('<[^>]*>', '', text)                      # remove HTML tags
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text)
    text = (re.sub(r'[\W]+', ' ', text.lower()) +           # remove all non-word characters
            ' '.join(emoticons).replace('-', ''))
    return text
# Example: apply the preprocessor to the end of the first review
preprocessor(df.loc[0, 'review'][-50:])
'is seven title brazil not available'
# Example: apply the preprocessor to a string containing emoticons
preprocessor("</a>This :) is :( a test :-)!")
'this is a test :) :( :)'
# Apply the preprocessing function to the entire DataFrame
df['review'] = df['review'].apply(preprocessor)

4. Processing Documents into Tokens

After preparing the movie review dataset, we need to split the text corpus into individual elements. One way to tokenize documents is to split the cleaned documents at their whitespace characters into individual words.

from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer(text):
    return text.split()


def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()] # list comprehension

Another useful technique in the context of tokenization is word stemming, the process of transforming a word into its root form. It allows us to map related words to the same stem.

The Porter stemming algorithm is implemented in NLTK, the Natural Language Toolkit for Python.

tokenizer('runners like running and thus they run')
['runners', 'like', 'running', 'and', 'thus', 'they', 'run']
tokenizer_porter('runners like running and thus they run')
['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

Stop-word removal

Stop words are words that are extremely common in all kinds of texts and likely carry little or no useful information, so they offer little discriminative power. Common stop words include: is, and, has, like.

Removing stop words can be useful when working with raw or normalized term frequencies, whereas tf-idf values already downweight frequently occurring words.

# To remove stop words from the movie reviews, we use the set of 127 English stop words
# Running this cell produced an error: the download failed
import nltk

nltk.download('stopwords')
[nltk_data] Error loading stopwords: <urlopen error [WinError 10054]
[nltk_data]     An existing connection was forcibly closed by the remote host.>

False
# Since the download above failed, the data was downloaded manually and placed in the expected location, after which the code ran successfully
# Reference solution: https://blog.csdn.net/AIHUBEI/article/details/107947593?spm=1001.2014.3001.5501
from nltk.corpus import stopwords

stop = stopwords.words('english')
[w for w in tokenizer_porter('a runner likes running and runs a lot')[-10:]
if w not in stop]
['runner', 'like', 'run', 'run', 'lot']


5. Training a Logistic Regression Model for Document Classification

We build a logistic regression classifier based on the bag-of-words model to classify the movie reviews. First, split the data into training and test sets.

X_train = df.loc[:25000, 'review'].values
y_train = df.loc[:25000, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values

A Pipeline is used to build the complete workflow, with 5-fold cross-validation and a grid search for hyperparameter tuning.

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.model_selection import GridSearchCV

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

# param_grid = [{'vect__ngram_range': [(1, 1)],
#                'vect__stop_words': [stop, None],
#                'vect__tokenizer': [tokenizer, tokenizer_porter],
#                'clf__penalty': ['l1', 'l2'],
#                'clf__C': [1.0, 10.0, 100.0]},
#               {'vect__ngram_range': [(1, 1)],
#                'vect__stop_words': [stop, None],
#                'vect__tokenizer': [tokenizer, tokenizer_porter],
#                'vect__use_idf':[False],
#                'vect__norm':[None],
#                'clf__penalty': ['l1', 'l2'],
#                'clf__C': [1.0, 10.0, 100.0]},
#               ]


param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0]},
              ]


lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0, solver='liblinear'))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=2,
                           n_jobs=-1)
gs_lr_tfidf.fit(X_train, y_train)
Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:   33.1s
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed:   43.5s finished





GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('vect',
                                        TfidfVectorizer(lowercase=False)),
                                       ('clf',
                                        LogisticRegression(random_state=0,
                                                           solver='liblinear'))]),
             n_jobs=-1,
             param_grid=[{'clf__C': [1.0, 10.0], 'clf__penalty': ['l1', 'l2'],
                          'vect__ngram_range': [(1, 1)],
                          'vect__stop_words': [['i', 'me', 'my', 'myself', 'we',
                                                'our', 'ours', 'ourselves',
                                                'you', "you're", "you've",
                                                "you'll", "you'd", 'your',
                                                'yours', 'yourself',
                                                'yourselves', 'he', 'him',
                                                'his', 'himself', 'she',
                                                "she's", 'her', 'hers',
                                                'herself', 'it', "it's", 'its',
                                                'itself', ...],
                                               None],
                          'vect__tokenizer': [<function tokenizer at 0x000002010FBAA948>]}],
             scoring='accuracy', verbose=2)
print('Best parameter set: %s ' % gs_lr_tfidf.best_params_)
print('CV Accuracy: %.3f' % gs_lr_tfidf.best_score_)
Best parameter set: {'clf__C': 10.0, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__stop_words': None, 'vect__tokenizer': <function tokenizer at 0x000002010FBAA948>} 
CV Accuracy: 0.897
clf = gs_lr_tfidf.best_estimator_
print('Test Accuracy: %.3f' % clf.score(X_test, y_test))
Test Accuracy: 0.899


# Side note: gs.best_score_ is the average k-fold cross-validation score of the best model,
# which the small synthetic example below verifies against cross_val_score
from sklearn.linear_model import LogisticRegression
import numpy as np

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

np.random.seed(0)
np.set_printoptions(precision=6)

y = [np.random.randint(3) for i in range(25)]
X = (y + np.random.randn(25)).reshape(-1, 1)

cv5_idx = list(StratifiedKFold(n_splits=5, shuffle=False).split(X, y))
    
lr = LogisticRegression(random_state=123, multi_class='ovr', solver='lbfgs')
cross_val_score(lr, X, y, cv=cv5_idx)
array([0.4, 0.2, 0.6, 0.2, 0.4])
from sklearn.model_selection import GridSearchCV

lr = LogisticRegression(solver='lbfgs', multi_class='ovr', random_state=1)
gs = GridSearchCV(lr, {}, cv=cv5_idx, verbose=3).fit(X, y) 
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV]  ................................................................
[CV] .................................... , score=0.400, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.200, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.600, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.200, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.400, total=   0.0s


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.0s finished
gs.best_score_
0.36000000000000004
lr = LogisticRegression(solver='lbfgs', multi_class='ovr', random_state=1)
cross_val_score(lr, X, y, cv=cv5_idx).mean()
0.36000000000000004

6. Working with Bigger Data: Online Algorithms and Out-of-Core Learning

Out-of-core learning:

With out-of-core learning, we can work with large datasets by fitting the classifier incrementally on mini-batches of the dataset.

import numpy as np
import re
from nltk.corpus import stopwords


stop = stopwords.words('english')

"""
定义tokenizer函数,对数据进行清理,将文本划分为词的表示,同时去除停用词。
"""

def tokenizer(text):
    text = re.sub('<[^>]*>', '', text)
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub('[\W]+', ' ', text.lower()) +\
        ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized

"""
定义一个生成器函数,一次读入并返回一个文本。
"""
def stream_docs(path):
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv)  # skip header
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label
# Verify that the stream_docs function works correctly
next(stream_docs(path='movie_data.csv'))
('"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />""Murder in Greenwich"" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich family used their influence to cover the murder for more than twenty years. However, a snoopy detective and convicted perjurer in disgrace was able to disclose how the hideous crime was committed. The screenplay shows the investigation of Mark and the last days of Martha in parallel, but there is a lack of the emotion in the dramatization. My vote is seven.<br /><br />Title (Brazil): Not Available"',
 1)
# Define get_minibatch, which uses the size parameter to control how many documents are read at a time
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y

CountVectorizer cannot be used for out-of-core learning because it needs to hold the complete vocabulary in memory. Likewise, TfidfVectorizer needs to keep all the feature vectors of the training dataset in memory in order to compute the inverse document frequencies.

scikit-learn provides another vectorizer for text processing, HashingVectorizer, which is data-independent.

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier


vect = HashingVectorizer(decode_error='ignore', 
                         n_features=2**21,
                         preprocessor=None, 
                         tokenizer=tokenizer)

Choosing a large number of features reduces the chance of hash collisions, but it also increases the number of coefficients in the logistic regression model.
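
As an aside, the following small sketch (not part of the original workflow; tiny_vect and its parameter values are chosen only for illustration) shows why a very small hash space is risky: with only a few buckets, distinct words can be hashed into the same column.

# Illustrative sketch: a tiny hash space makes collisions between distinct words likely
from sklearn.feature_extraction.text import HashingVectorizer

tiny_vect = HashingVectorizer(n_features=4, alternate_sign=False, norm=None)
print(tiny_vect.transform(['the sun is shining']).toarray())
# Any column with a count of 2 or more means that two different words were hashed
# into the same bucket; with n_features=2**21 such collisions become very unlikely.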

from distutils.version import LooseVersion as Version
from sklearn import __version__ as sklearn_version

clf = SGDClassifier(loss='log', random_state=1)   # loss='log' yields a logistic regression model trained with SGD


doc_stream = stream_docs(path='movie_data.csv')
import pyprind
pbar = pyprind.ProgBar(45) # initialize the progress bar object with 45 iterations

classes = np.array([0, 1])
# Run 45 iterations, each mini-batch containing 1,000 documents; after the incremental learning, the remaining 5,000 documents are used to evaluate the model
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
    pbar.update()
0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:39
X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print('Accuracy: %.3f' % clf.score(X_test, y_test))
Accuracy: 0.868
clf = clf.partial_fit(X_test, y_test)

Compared with the grid-search approach above, the model accuracy drops slightly, but the advantages are a much smaller memory footprint and much faster training.

7. Topic Modeling

Topic modeling describes the broad task of assigning topics to unlabeled text documents, for example categorizing the documents in a large text corpus of newspaper articles.

Topic modeling can be considered a clustering task, a subcategory of unsupervised learning.

Latent Dirichlet Allocation (LDA) is a popular topic model; it should not be confused with linear discriminant analysis, which shares the same abbreviation.

7.1 Decomposing Text Documents with LDA

LDA is a generative probabilistic model that tries to find groups of words that appear frequently together across different documents. These frequently occurring words represent the topics, under the assumption that each document is a mixture of different words.

The input to LDA is the bag-of-words model. Given a bag-of-words matrix as input, LDA decomposes it into two new matrices:

1. a document-to-topic matrix

2. a word-to-topic matrix

Discriminative models model the conditional probability directly, e.g. SVMs, neural networks, linear discriminant analysis (LDA), and linear regression.

Generative models model the joint probability distribution, e.g. naive Bayes, hidden Markov models, and Markov random fields.

Multiplying these two matrices reproduces the input bag-of-words matrix with the lowest possible error. A drawback of LDA is that the number of topics must be defined in advance; it is a hyperparameter that has to be specified manually. A small sketch of this factorization is shown below.
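
Here is a toy sketch of that factorization on random count data (hypothetical numbers, not the movie reviews); it only illustrates the shapes of the two matrices produced by LDA:

# Toy sketch: LDA factorizes a (documents x words) count matrix into
# a (documents x topics) matrix and a (topics x words) matrix
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.RandomState(0)
toy_bow = rng.randint(0, 5, size=(6, 12))           # 6 toy "documents", 12 toy "words"
toy_lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topic = toy_lda.fit_transform(toy_bow)          # document-to-topic matrix
print(doc_topic.shape, toy_lda.components_.shape)   # (6, 3) (3, 12)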

7.2 LDA with scikit-learn

# Load the data
import pandas as pd

df = pd.read_csv('movie_data.csv', encoding='utf-8')
df.head(3)
                                              review  sentiment
0  In 1974, the teenager Martha Moxley (Maggie Gr...          1
1  OK... so... I really like Kris Kristofferson a...          0
2  ***SPOILER*** Do not read this, if you think a...          0
# Use CountVectorizer to create the bag-of-words matrix that serves as input to LDA
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english',
                        max_df=.1,
                        max_features=5000)
X = count.fit_transform(df['review'].values)
from sklearn.decomposition import LatentDirichletAllocation

# Assume 10 topics here; this hyperparameter can be tuned
# max_df=.1 above caps the maximum document frequency of words at 10%, excluding words that occur too frequently across documents,
# because such words are likely common to all documents and therefore unlikely to be associated with a specific topic
lda = LatentDirichletAllocation(n_components=10,
                                random_state=123,
                                learning_method='batch') # learning_method='batch' makes the LDA estimator use all training data in each update;
                                                         # this can be slower than learning_method='online' but may yield more accurate results
X_topics = lda.fit_transform(X)
# After fitting the LDA, the components_ attribute stores a matrix with the word importances for each of the 10 topics, in increasing order
lda.components_.shape
(10, 5000)
# To print the five most important words for each topic, sort the topic array in reverse order
n_top_words = 5
feature_names = count.get_feature_names()

for topic_idx, topic in enumerate(lda.components_):
    print("Topic %d:" % (topic_idx + 1))
    print(" ".join([feature_names[i]
                    for i in topic.argsort()\
                        [:-n_top_words - 1:-1]]))
Topic 1:
worst minutes awful script stupid
Topic 2:
family mother father children girl
Topic 3:
american war dvd music tv
Topic 4:
human audience cinema art sense
Topic 5:
police guy car dead murder
Topic 6:
horror house sex girl woman
Topic 7:
role performance comedy actor performances
Topic 8:
series episode war episodes tv
Topic 9:
book version original read novel
Topic 10:
action fight guy guys cool

Based on the five most important words of each topic, we can guess the categories of the 10 topics as follows:

  1. Generally bad movies (not really a topic category)
  2. Movies about families
  3. War movies
  4. Art movies
  5. Crime movies
  6. Horror movies
  7. Comedies
  8. Movies somehow related to TV shows
  9. Movies based on books
  10. Action movies
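
As a quick way to sanity-check these guesses, the short sketch below (reusing the X_topics matrix fitted above; dominant_topic and topic_counts are only illustrative names) assigns each review to its most probable topic and counts the reviews per topic:

# Sketch: assign each review to its most probable topic and count the reviews per topic
import numpy as np

dominant_topic = X_topics.argmax(axis=1)                  # index of the strongest topic for each review
topic_counts = np.bincount(dominant_topic, minlength=10)
for topic_idx, n_reviews in enumerate(topic_counts):
    print('Topic %d: %d reviews' % (topic_idx + 1, n_reviews))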

The LDA implementation in scikit-learn uses the expectation-maximization (EM) algorithm to iteratively update its parameter estimates.

To confirm that these categories make sense, we print three movies from the horror movie category (horror belongs to category 6, at index position 5):

horror = X_topics[:, 5].argsort()[::-1]

for iter_idx, movie_idx in enumerate(horror[:3]):
    print('\nHorror movie #%d:' % (iter_idx + 1))
    print(df['review'][movie_idx][:300], '...')
Horror movie #1:
House of Dracula works from the same basic premise as House of Frankenstein from the year before; namely that Universal's three most famous monsters; Dracula, Frankenstein's Monster and The Wolf Man are appearing in the movie together. Naturally, the film is rather messy therefore, but the fact that ...

Horror movie #2:
Okay, what the hell kind of TRASH have I been watching now? "The Witches' Mountain" has got to be one of the most incoherent and insane Spanish exploitation flicks ever and yet, at the same time, it's also strangely compelling. There's absolutely nothing that makes sense here and I even doubt there  ...

Horror movie #3:
<br /><br />Horror movie time, Japanese style. Uzumaki/Spiral was a total freakfest from start to finish. A fun freakfest at that, but at times it was a tad too reliant on kitsch rather than the horror. The story is difficult to summarize succinctly: a carefree, normal teenage girl starts coming fac ...