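An Academic Beginner's Hands-On Journey, 01: A Study of Machine Learning Sentiment Classification Models with Combined LDA, Word2Vec, and TF-IDF Features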

Latest recommended article published on 2021-12-13 11:02:49

驭风少年君

Column: 学术小白的实战之路 (An Academic Beginner's Hands-On Journey). Tags: word2vec, python, natural language processing

Article link: https://blog.csdn.net/qq_44951759/article/details/120682246

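Diligence is the path up the mountain of books; hard work is the boat across the boundless sea of learning.

Lamps burning at midnight, roosters crowing before dawn: that is the hour for study.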

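1. Traditional machine learning classification models

1.1 Tokenizing the text data

Data sample
(image of the data sample in the original post)

Load a custom dictionary, remove stop words, and tokenize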

#---------------- Skip this step if the text has already been tokenized ----------------

# -*- coding:utf-8 -*-
import csv
import pandas as pd
import numpy as np
import jieba
import jieba.analyse

# Load the custom dictionary and the stop-word list
jieba.load_userdict("user_dict.txt")
# One stop word per line; recent pandas versions reject "\n" as a delimiter,
# so the file is read directly. A set keeps the membership test below fast.
with open('stop_words.txt', encoding='utf-8') as f:
    stop_list = {line.strip() for line in f if line.strip()}

# Chinese tokenization: drop stop words and single-character tokens
def txt_cut(juzi):
    return [w for w in jieba.lcut(juzi) if w not in stop_list and len(w) > 1]

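Write the tokenized results to a new file

# Write the tokenized results to a new file.
# "w" avoids appending a second header on re-runs; utf-8-sig matches the
# encoding used to read this file back in section 1.2.
fw = open('hotel_fenci_data.csv', "w", newline='', encoding='utf-8-sig')
writer = csv.writer(fw)
writer.writerow(['content', 'label'])

Read the raw file

# Read the raw reviews with csv.DictReader
labels = []
contents = []
file = r"C:\Users\N\Desktop\hotel_data.csv"
with open(file, "r", encoding="UTF-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        # Tokenize the review text
        content = row['content']
        seglist = txt_cut(content)
        output = ' '.join(seglist)            # join the tokens with spaces
        contents.append(output)
        labels.append(row['label'])

        # Write the row in the same column order as the header: content, label
        writer.writerow([output, row['label']])

fw.close()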

1.2 Building the text matrix

Read the file

# -*- coding:utf-8 -*-
import csv
import pandas as pd
import numpy as np
import jieba
import jieba.analyse
from scipy.sparse import coo_matrix
from sklearn import feature_extraction
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

#---------------------------------- Step 1: read the file --------------------------------
with open('hotel_fenci_data.csv', 'r', encoding='UTF-8-sig') as f:
    reader = csv.DictReader(f)
    labels = []
    contents = []
    for row in reader:
        labels.append(row['label'])  # 0 = positive review, 1 = negative review
        contents.append(row['content'])

print(labels[:5])
print(contents[:5])

['0', '0', '0', '0', '0']
['21 提交订单 26 朋友先行到达酒店人中一位未带有效证件前台足足小时解决前台 7.25 接到公安局通知每位住店客人提供证件奥运期间公安部门提出中国公民配合无可厚非酒店几个作法实在无法忍受 25 日晚致电酒店酒店提醒提交订单近一星期未有酒店携程事前通知朋友足足小时赶到酒店小时只顾客人办理手续提供解决方法号称星级酒店事件自始至终前台人员笑脸错全客人号称国际接轨星级酒店培训员工酒店提供公安部门下发相关文件前台人员回答公安部门电话下达酒店文件一名常识公民当今信息化时代政府部门信息公开原则若真诸如此类文件下达正规操作怀疑酒店说法酒店交涉酒店人员蛮横回答接待实在半岛品牌规范信誉度感到震惊室外高达 40 多度高温驱车半小时到达酒店半岛集团酒店客人得知半岛盗用香港半岛名称香港半岛酒店管理集团得知李鬼兄弟会做何感想工商管理部门水准团队享有盛名半岛品牌期间一位身着经理服装女士接待号称请示上级请示 15 20 分钟这位经理级别更有甚者此位经理出言不逊出门证件常识斗胆一句信息提前公开告知常识常识常识客人不配入住高贵酒店客人尊重客人人格侮辱息事宁人态度提出上海证件传真酒店酒店先行办理 checkin 手续前台出尔反尔一会一会不行态度强硬情况足足小时这群饥肠辘辘客人 checkin 手续办完只能房间房间打扫时间中午 12 35 国际管理当天客人入住最晚 12 实在对此酒店服务水平怀疑预定泳池湖景房看不到湖景从何而来 10 凌晨 45 男客人泳池游泳影响根本无法入睡酒店竟无一人前去劝阻归于种种细节叙述真诚携程网提出建议此种水准酒店推荐客人遭受折磨有损网站声誉补充点评 2008 酒店不负责任缺乏诚意反馈实在有愧挂出六星标志赞同楼上朋友点评对不起这六星这家酒店管理希望下次新面貌酒店承认 25 接到公安部门通知给予携程网客服人员 26 上午收到通知信息欺骗公众酒店生存酒店调出当日录像资料公司高层好好学学微笑好好培训前台前台脸上笑容明明朋友提出传真身份证件前台一会一会证件酒店接待酒店总经理事情只好感谢酒店总经理真诚服务酒店尊重公安部门公安部门国家执法机关规范严谨承认接到指示入住酒店客人出示身份证登记酒店相关工作人员住店当事人做出特别时期严肃处罚书名号公安部门发文酒店自圆其说自始至终那位前台小姐接待酒店前台员工从何而来承认 1881 半岛半岛加个 1881 香港九龙半岛酒店一脉

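Clean the data

# Clean the data: strip digits, Latin letters, and punctuation
import re

pattern = re.compile('[0-9a-zA-Z’!"#$%&\'()*+,-./:;<=>?@，。?★、…【】《》？“”‘’！[\\]^_`{|}~]+')
clearn_data = []
for content in contents:
    temp_data = re.sub(pattern, '', content)
    print(temp_data)  # prints every cleaned review; comment this out for a large corpus
    clearn_data.append(temp_data)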

提交订单朋友先行到达酒店人中一位未带有效证件前台足足小时解决前台接到公安局通知每位住店客人提供证件奥运期间公安部门提出中国公民配合无可厚非酒店几个作法实在无法忍受日晚致电酒店酒店提醒提交订单近一星期未有酒店携程事前通知朋友足足小时赶到酒店小时只顾客人办理手续提供解决方法号称星级酒店事件自始至终前台人员笑脸错全客人号称国际接轨星级酒店培训员工酒店提供公安部门下发相关文件前台人员回答公安部门电话下达酒店文件一名常识公民当今信息化时代政府部门信息公开原则若真诸如此类文件下达正规操作怀疑酒店说法酒店交涉酒店人员蛮横回答接待实在半岛品牌规范信誉度感到震惊室外高达多度高温驱车半小时到达酒店半岛集团酒店客人得知半岛盗用香港半岛名称香港半岛酒店管理集团得知李鬼兄弟会做何感想工商管理部门水准团队享有盛名半岛品牌期间一位身着经理服装女士接待号称请示上级请示分钟这位经理级别更有甚者此位经理出言不逊出门证件常识斗胆一句信息提前公开告知常识常识常识客人不配入住高贵酒店客人尊重客人人格侮辱息事宁人态度提出上海证件传真酒店酒店先行办理手续前台出尔反尔一会一会不行态度强硬情况足足小时这群饥肠辘辘客人手续办完只能房间房间打扫时间中午国际管理当天客人入住最晚实在对此酒店服务水平怀疑预定泳池湖景房看不到湖景从何而来凌晨男客人泳池游泳影响根本无法入睡酒店竟无一人前去劝阻归于种种细节叙述真诚携程网提出建议此种水准酒店推荐客人遭受折磨有损网站声誉补充点评酒店不负责任缺乏诚意反馈实在有愧挂出六星标志赞同楼上朋友点评对不起这六星这家酒店管理希望下次新面貌酒店承认接到公安部门通知给予携程网客服人员上午收到通知信息欺骗公众酒店生存酒店调出当日录像资料公司高层好好学学微笑好好培训前台前台脸上笑容明明朋友提出传真身份证件前台一会一会证件酒店接待酒店总经理事情只好感谢酒店总经理真诚服务酒店尊重公安部门公安部门国家执法机关规范严谨承认接到指示入住酒店客人出示身份证登记酒店相关工作人员住店当事人做出特别时期严肃处罚书名号公安部门发文酒店自圆其说自始至终那位前台小姐接待酒店前台员工从何而来承认半岛半岛加个香港九龙半岛酒店一脉相传酒店尊重客人指责客人尊重酒店酒店真诚反馈只好真诚回复酒店再也不会享用酒店六星服务奉劝酒店诚信生存切记宾馆反馈入住出示身份证登记情况酒店查看当日录像全过程当事人调查奥运会开幕确保奥运会全国平安公安部门紧急指示入住酒店客人出示身份证登记酒店相关工作人员住店当事人做出特别时期严肃处罚提出意见酒店慎重做出答复公安部门出示住店客人身份证登记紧急指示公民入住酒店携带身份证这一常识公安部门发文希望懂得朋友进店手续办好分钟等待小时不符尊重客人朋友大堂吵闹前台员工礼貌地为住店客人快速办理无锡半岛酒店隶属香港半岛酒店集团公司声明半岛酒店极少客人尊重目的更好尊重服务客人客人提供更好高品质服务环境相关经理解决问题协调分钟过程解决尊重谈不上出言不逊酒店零距离泳池入住前一天客满预定房间前日客人退房入住半岛酒店客人享有住房保留退房权利提供到来提前住店客人推至门外国际惯例公开公布入住时间酒店尊重遵守惯例诽谤诬蔑不道德半岛酒店零距离泳池透过泳池清晰全湖景特色客房游泳池相连国内前卫设计理念凌晨客人游泳调查位住泳池醉酒客人酒店时间游泳不到分钟保安劝慰回房深表遗憾补充点评做出反馈酒店实行不卑不亢微笑服务标准深知微笑服务重要性微笑只会给予值得尊重客人奥运会期间严格执行公安部门指示希望文字游戏书名号跟上奥运步伐时期出示身份证重要性真诚保护一位素质品味客人提供舒适服务环境
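To make the effect of the cleaning step easier to see, here is a minimal sketch on a single made-up review. It assumes a simplified pattern (ASCII letters, digits, and a few punctuation marks) rather than the full character class used above, and the `clean` helper and the whitespace-collapsing step are additions for illustration only.

```python
import re

# Simplified stand-in for the pattern above: ASCII letters, digits, some punctuation
pattern = re.compile(r'[0-9a-zA-Z!?,.，。！？、]+')

def clean(text):
    """Remove every run of matched characters, then collapse leftover whitespace."""
    stripped = pattern.sub('', text)
    return ' '.join(stripped.split())

# Hypothetical space-separated review, in the same shape as the tokenized data
sample = '21 提交订单 26 朋友 checkin 手续 7.25 通知'
cleaned = clean(sample)
print(cleaned)
```

Collapsing the whitespace afterwards keeps exactly one space between the surviving tokens, which is what a whitespace-based vectorizer expects.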

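Build the TF-IDF text matrix

#---------------------------------- Step 2: preprocessing --------------------------------
# Convert the texts to a term-count matrix; element a[i][j] is the count of word j in document i
vectorizer = CountVectorizer()

# TfidfTransformer computes the tf-idf weight of every term
transformer = TfidfTransformer()

# The inner fit_transform builds the term-count matrix; the outer one turns it into tf-idf weights
tfidf = transformer.fit_transform(vectorizer.fit_transform(clearn_data))
# for n in tfidf[:5]:
#     print(n)
# print(type(tfidf))

# All terms in the bag-of-words vocabulary
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0)
word = vectorizer.get_feature_names_out()
for n in word[:10]:
    print(n)
print("Vocabulary size:", len(set(word)))

# Extract the tf-idf matrix; element w[i][j] is the tf-idf weight of word j in document i
#X = tfidf.toarray()
X = coo_matrix(tfidf, dtype=np.float32).toarray()  # dense copy of the sparse matrix; float32 halves the memory of float64

一一列举
一丁点
一上
一下下
一下床
一下肚子
一不小心
一不留神
一丝
一两
Vocabulary size: 23557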
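For readers who want to see what the CountVectorizer plus TfidfTransformer pipeline computes, the sketch below reproduces the default weighting in plain Python: idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each row. The three toy documents and the `tfidf_rows` helper are made up for illustration, and tokens are split on whitespace only, whereas CountVectorizer's default token pattern also drops single-character tokens.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) using the default TfidfTransformer formula:
    idf = ln((1 + n_docs) / (1 + df)) + 1, then L2-normalise each row."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word occurs
    df = Counter(w for toks in tokenized for w in set(toks))
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

# Hypothetical tokenized reviews
docs = ['酒店 服务 不错', '酒店 房间 太差', '服务 态度 很差 服务']
vocab, rows = tfidf_rows(docs)
for row in rows:
    print([round(v, 3) for v in row])
```

Within a document, a word that occurs in fewer documents receives a larger weight than an equally frequent word that occurs in many, which is the property the classifiers below rely on.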

1.3 Sentiment classification with machine learning

Split the data (an error is raised if X and y do not contain the same number of samples)

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn import neighbors
from sklearn.naive_bayes import MultinomialNB


#---------------------------------- Step 3: split the data --------------------------------
# Split X and the label list with train_test_split (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    labels,
                                                    test_size=0.3,
                                                    )
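The call above sets neither `random_state` nor `stratify`, so every run draws a different split; this is presumably why the class counts differ between the outputs shown in this post. In scikit-learn both behaviours are controlled by those two parameters of `train_test_split`. The sketch below shows in plain Python what a seeded, stratified split does; `stratified_split` is a hypothetical helper, not a library function.

```python
import random
from collections import defaultdict

def stratified_split(y, test_size=0.3, seed=42):
    """Return (train_idx, test_idx) with the class ratio preserved in both parts."""
    rng = random.Random(seed)           # fixed seed -> the same split on every run
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical balanced label list in the same string format as `labels`
toy_labels = ['0'] * 10 + ['1'] * 10
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))
```

With a fixed seed the reported accuracy becomes repeatable, and stratification keeps the share of positive and negative reviews equal in the training and test sets.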

Logistic regression model

#-------------------------------- Step 4: machine learning classification --------------------------------
# Logistic regression classifier
LR = LogisticRegression(solver='liblinear')
LR.fit(X_train, y_train)
print('Model accuracy: {}'.format(LR.score(X_test, y_test)))
pre = LR.predict(X_test)
print("Logistic regression classification")
print(len(pre), len(y_test))
print(classification_report(y_test, pre))
print("\n")

Model accuracy: 0.86
Logistic regression classification
1200 1200
              precision    recall  f1-score   support

           0       0.83      0.89      0.86       579
           1       0.89      0.83      0.86       621

    accuracy                           0.86      1200
   macro avg       0.86      0.86      0.86      1200
weighted avg       0.86      0.86      0.86      1200

Evaluating the results

#---------------------------------- Step 5: evaluate the results --------------------------------
def classification_pj(name, y_test, pre):
    print("Algorithm evaluation:", name)

    # Precision = correctly identified items / all items predicted as that class
    # Recall    = correctly identified items / all items of that class in the test set
    # F-measure = 2 * Precision * Recall / (Precision + Recall)

    YC_B, YC_G = 0, 0  # predicted: bad, good
    ZQ_B, ZQ_G = 0, 0  # correctly predicted
    CZ_B, CZ_G = 0, 0  # actually present in the test set

    # 0 = good, 1 = bad; all counters are updated in one pass so the label mapping stays consistent
    for z, y in zip(y_test, pre):
        z = int(z)  # true label
        y = int(y)  # predicted label

        if z == 0:
            CZ_G += 1
        else:
            CZ_B += 1

        if y == 0:
            YC_G += 1
        else:
            YC_B += 1

        if z == y == 0:
            ZQ_G += 1
        elif z == y == 1:
            ZQ_B += 1

    print(ZQ_B, ZQ_G, YC_B, YC_G, CZ_B, CZ_G)
    print("")

    # Print the results
    P_G = ZQ_G * 1.0 / YC_G
    P_B = ZQ_B * 1.0 / YC_B
    print("Precision Good 0:", P_G)
    print("Precision Bad 1:", P_B)

    R_G = ZQ_G * 1.0 / CZ_G
    R_B = ZQ_B * 1.0 / CZ_B
    print("Recall Good 0:", R_G)
    print("Recall Bad 1:", R_B)

    F_G = 2 * P_G * R_G / (P_G + R_G)
    F_B = 2 * P_B * R_B / (P_B + R_B)
    print("F-measure Good 0:", F_G)
    print("F-measure Bad 1:", F_B)

# Call the function
classification_pj("LogisticRegression", y_test, pre)

Algorithm evaluation: LogisticRegression
495 549 556 644 590 610

Precision Good 0: 0.8524844720496895
Precision Bad 1: 0.8902877697841727
Recall Good 0: 0.9
Recall Bad 1: 0.8389830508474576
F-measure Good 0: 0.8755980861244019
F-measure Bad 1: 0.8638743455497382
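The three formulas can be checked by hand against the six counts printed above. The sketch below does this for the "good" class; `prf` is a hypothetical helper that also returns 0.0 instead of raising when a class is never predicted or never present, a case the function above does not guard against.

```python
def prf(correct, predicted, actual):
    """Precision, recall and F1 for one class from three counts.
    Returns 0.0 instead of raising when a denominator is zero."""
    precision = correct / predicted if predicted else 0.0
    recall = correct / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Counts for the "good" class taken from the output above: ZQ_G, YC_G, CZ_G
p_good, r_good, f_good = prf(549, 644, 610)
print(p_good, r_good, f_good)
```

The printed values agree with the "Good 0" lines of the output above.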

Random forest

# Random forest classifier; n_estimators is the number of trees in the forest
clf = RandomForestClassifier(n_estimators=20)
clf.fit(X_train, y_train)
print('Model accuracy: {}'.format(clf.score(X_test, y_test)))
print("\n")
pre = clf.predict(X_test)
print('Predictions:', pre[:10])
print(len(pre), len(y_test))
print(classification_report(y_test, pre))
classification_pj("RandomForest", y_test, pre)
print("\n")

SVM

# SVM classifier
SVM = svm.LinearSVC()  # linear support vector classifier
SVM.fit(X_train, y_train)
print('Model accuracy: {}'.format(SVM.score(X_test, y_test)))
pre = SVM.predict(X_test)
print("Support vector machine classification")
print(len(pre), len(y_test))
print(classification_report(y_test, pre))
classification_pj("LinearSVC", y_test, pre)
print("\n")

Naive Bayes

# Naive Bayes classifier
nb = MultinomialNB()
nb.fit(X_train, y_train)
print('Model accuracy: {}'.format(nb.score(X_test, y_test)))
pre = nb.predict(X_test)
print("Naive Bayes classification")
print(len(pre), len(y_test))
print(classification_report(y_test, pre))
classification_pj("MultinomialNB", y_test, pre)
print("\n")

KNN

# k-nearest neighbours classifier
knn = neighbors.KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)
print('Model accuracy: {}'.format(knn.score(X_test, y_test)))
pre = knn.predict(X_test)
print("Nearest neighbours classification")
print(classification_report(y_test, pre))
classification_pj("KNeighbors", y_test, pre)
print("\n")

Decision tree

# Decision tree classifier
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
print('Model accuracy: {}'.format(dtc.score(X_test, y_test)))
pre = dtc.predict(X_test)
print("Decision tree classification")
print(len(pre), len(y_test))
print(classification_report(y_test, pre))
classification_pj("DecisionTreeClassifier", y_test, pre)
print("\n")

SGD

# SGD classifier
# (the private module sklearn.linear_model.stochastic_gradient no longer exists
# in recent scikit-learn releases; import from the public package instead)
from sklearn.linear_model import SGDClassifier
sgd = SGDClassifier()
sgd.fit(X_train, y_train)
print('Model accuracy: {}'.format(sgd.score(X_test, y_test)))
pre = sgd.predict(X_test)
print("SGD classification")
print(len(pre), len(y_test))
print(classification_report(y_test, pre))
classification_pj("SGDClassifier", y_test, pre)
print("\n")

MLP classifier

# Import from the public package; sklearn.neural_network.multilayer_perceptron is no longer importable
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier()
mlp.fit(X_train, y_train)
print('Model accuracy: {}'.format(mlp.score(X_test, y_test)))
pre = mlp.predict(X_test)
print("MLP classification")
print(len(pre), len(y_test))
print(classification_report(y_test, pre))
classification_pj("MLPClassifier", y_test, pre)
print("\n")

GradientBoosting classifier

from sklearn.ensemble import GradientBoostingClassifier
gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)
print('Model accuracy: {}'.format(gb.score(X_test, y_test)))
pre = gb.predict(X_test)
print("GradientBoosting classification")
print(len(pre), len(y_test))
print(classification_report(y_test, pre))
classification_pj("GradientBoostingClassifier", y_test, pre)
print("\n")

AdaBoost classifier

from sklearn.ensemble import AdaBoostClassifier
AdaBoost = AdaBoostClassifier()
AdaBoost.fit(X_train, y_train)
print('Model accuracy: {}'.format(AdaBoost.score(X_test, y_test)))
pre = AdaBoost.predict(X_test)
print("AdaBoost classification")
print(len(pre), len(y_test))
print(classification_report(y_test, pre))
classification_pj("AdaBoostClassifier", y_test, pre)
print("\n")

2. Text feature selection with LDA and TF-IDF

2.1 Determining the number of topics

from gensim import corpora, models

def ldamodel(num_topics):
    # Each line of the file is one tokenized document, with tokens separated by spaces
    train = []
    with open(r'hotel_fenci_data2 - content.csv', encoding='utf-8') as cop:
        for line in cop:
            train.append(line.split())  # list-of-lists format

    dictionary = corpora.Dictionary(train)
    # Each document becomes a list of (word_id, count) pairs, e.g. (0, 1), (1, 1), (2, 1)
    corpus = [dictionary.doc2bow(text) for text in train]
    corpora.MmCorpus.serialize('corpus.mm', corpus)
    # random_state fixes the random seed so that the same topics are produced on every run
    lda = models.LdaModel(corpus=corpus, id2word=dictionary, random_state=1,
                          num_topics=num_topics)

    topic_list = lda.print_topics(num_topics, 10)
    # print("Word distribution of each topic:\n")
    # for topic in topic_list:
    #     print(topic)
    return lda, dictionary

import math
def perplexity(ldamodel, testset, dictionary, size_dictionary, num_topics):
    print('the info of this ldamodel: \n')
    print('num of topics: %s' % num_topics)
    prep = 0.0
    prob_doc_sum = 0.0
    topic_word_list = []  # one {word: p(w|z)} dict per topic
    for topic_id in range(num_topics):
        topic_word = ldamodel.show_topic(topic_id, size_dictionary)
        dic = {}
        for word, probability in topic_word:
            dic[word] = probability
        topic_word_list.append(dic)
    doc_topics_ist = []  # p(z|d) for every test document
    for doc in testset:
        doc_topics_ist.append(ldamodel.get_document_topics(doc, minimum_probability=0))
    testset_word_num = 0
    for i in range(len(testset)):
        prob_doc = 0.0  # log-probability of the document
        doc = testset[i]
        doc_word_num = 0
        for word_id, num in dict(doc).items():
            prob_word = 0.0
            doc_word_num += num
            word = dictionary[word_id]
            for topic_id in range(num_topics):
                # p(w) = sum over z of p(z) * p(w|z)
                prob_topic = doc_topics_ist[i][topic_id][1]
                prob_topic_word = topic_word_list[topic_id][word]
                prob_word += prob_topic * prob_topic_word
            # Weight by the word count so the numerator covers the same tokens
            # as the denominator (doc_word_num also adds num per word)
            prob_doc += num * math.log(prob_word)  # log p(d) = sum(count * log p(w))
        prob_doc_sum += prob_doc
        testset_word_num += doc_word_num
    prep = math.exp(-prob_doc_sum / testset_word_num)  # perplexity = exp(-sum(log p(d)) / sum(N_d))
    print("Model perplexity: %s" % prep)
    return prep

from gensim import corpora, models
import matplotlib.pyplot as plt
from matplotlib.pyplot import MultipleLocator


def graph_draw(topic, perplexity):  # line chart of perplexity against the number of topics
    x = topic
    y = perplexity
    plt.plot(x, y, color='c', linestyle='-', marker='+', linewidth=2)

    x_major_locator = MultipleLocator(1)
    ax = plt.gca()
    # ax is the current Axes instance
    ax.xaxis.set_major_locator(x_major_locator)
    plt.xlabel("Number of Topic")
    plt.ylabel("Perplexity")
    plt.savefig("主题数目.png", dpi=600)
    plt.show()

if __name__ == '__main__':
    # i: take one document out of every i as the test set
    # (the loop is only for tuning; a fixed value can be used instead)
    for i in range(20, 300, 1):
        print("Perplexity with a sampling interval of " + str(i))
        a = range(1, 21, 1)  # candidate numbers of topics
        p = []
        for num_topics in a:

            lda, dictionary = ldamodel(num_topics)
            corpus = corpora.MmCorpus('corpus.mm')
            testset = []
            for c in range(int(corpus.num_docs / i)):
                testset.append(corpus[c * i])
            prep = perplexity(lda, testset, dictionary, len(dictionary.keys()), num_topics)
            p.append(prep)

        graph_draw(a, p)

Perplexity with a sampling interval of 20
the info of this ldamodel:

num of topics: 1
Model perplexity: 997.5535012674661
the info of this ldamodel:

num of topics: 2
Model perplexity: 914.3620715395527
the info of this ldamodel:
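Perplexity is the exponential of the negative average log-likelihood per token, so a lower value means the model assigns higher probability to the held-out text. The toy sketch below, with made-up word probabilities and a standalone `perplexity` function that is independent of gensim, shows two useful reference points: a uniform model over V words scores exactly V, and a model that concentrates probability on the words that actually occur scores lower.

```python
import math

def perplexity(word_probs, counts):
    """exp(-sum(count * log p(w)) / total tokens) for a bag of words."""
    log_likelihood = sum(c * math.log(word_probs[w]) for w, c in counts.items())
    n_tokens = sum(counts.values())
    return math.exp(-log_likelihood / n_tokens)

# Hypothetical model 1: four words, all equally likely
uniform = {'酒店': 0.25, '房间': 0.25, '服务': 0.25, '前台': 0.25}
# Hypothetical model 2: probability concentrated on the words that occur in the document
peaked = {'酒店': 0.6, '房间': 0.3, '服务': 0.05, '前台': 0.05}
# Bag-of-words counts of one held-out document
doc_counts = {'酒店': 3, '房间': 1}

ppl_uniform = perplexity(uniform, doc_counts)
ppl_peaked = perplexity(peaked, doc_counts)
print(ppl_uniform, ppl_peaked)
```

When reading the perplexity curve over the number of topics, the usual heuristic is to look for the point where the curve stops falling noticeably rather than for the absolute minimum.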

2.2 Building the LDA model

#-------------------  Step 3: compute the TF-IDF values  ---------------------
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# Optional cap on the number of features
# n_features = 2000

tf_vectorizer = TfidfVectorizer(strip_accents = 'unicode',
#                                 max_features=n_features,
                                stop_words=['的','或','等','是','有','之','与','可以','还是','比较','这里',
                                            '一个','和','也','被','吗','于','中','最','但是','图片','大家',
                                            '一下','几天','200','还有','一看','300','50','哈哈哈哈',
                                             '“','”','。','，','？','、','；','怎么','本来','发现',
                                             'and','in','of','the','我们','一直','真的','18','一次',
                                           '了','有些','已经','不是','这么','一一','一天','这个','这种',
                                           '一种','位于','之一','天空','没有','很多','有点','什么','五个',
                                           '特别'],
                                max_df = 0.99,
                                min_df = 0.002)  # drop terms that occur in too many or too few documents
tf = tf_vectorizer.fit_transform(clearn_data)


from sklearn.decomposition import LatentDirichletAllocation

# Number of topics (note that the printed model below reports n_components=2)
n_topics = 6

lda = LatentDirichletAllocation(n_components=n_topics,
                                max_iter=100,
                                learning_method='online',
                                learning_offset=50,
                                random_state=0)
lda.fit(tf)
print(lda)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='online', learning_offset=50,
                          max_doc_update_iter=100, max_iter=100,
                          mean_change_tol=0.001, n_components=2, n_jobs=None,
                          perp_tol=0.1, random_state=0, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)

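Topic-keyword distribution

# Topic-keyword distribution
def print_top_words(model, tf_feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):  # lda.components_ corresponds to model.topic_word_
        print('Topic #%d:' % topic_idx)
        print(' '.join([tf_feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]))
        print("")


# Number of keywords to print per topic (set to 7000 here, so for a smaller
# vocabulary every word is listed, ranked by weight)
n_top_words = 7000
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
tf_feature_names = tf_vectorizer.get_feature_names_out()
# Call the function
print_top_words(lda, tf_feature_names, n_top_words)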

Topic #0:
酒店不错服务房间宾馆很好早餐入住免费环境设施下次价格干净交通餐厅选择反馈总体位置很大热情性价比装修地理位置还可以舒服点评满意客人提供安静大堂豪华地方推荐机场喜欢品种打车补充这家值得服务员分钟还会门口饭店朋友缺点卫生间味道感谢您大床希望唯一舒适晚上人员光临商务上海香港水果对面一家携程客房便宜行政步行出门火车站确实上网适合标准建议前台算是距离改进套房吃饭出租车五星级提出宽敞特色自助温馨很快很近整体态度楼层升级旅游行李硬件主动期待电脑旁边时间非常感谢地铁市中心周边游泳池员工温泉服务态度不太度假楼下周到便利好吃工作出差司机购物阳台速度风景海景安排标间到位陈旧不到机会肯定大酒店市区房价享受网站城市优点齐全自助餐宾客地点合作不远介绍宽带面积东西评价内部用餐意见支持房间内五星广场设计新装修期间挺好价钱不算景色宝贵意见更好空间四星级好多导航还好地段索引注册中餐厅景点海景房尊敬代理没什么英才诚聘整洁广告业务号楼停车场地铁站第一次超市马路别墅感谢正好杭州遗憾设备价位太小每

Topic #1:
酒店房间前台携程入住服务员晚上不好服务客人空调实在退房电话早餐预定太差隔音第二天很差价格打电话卫生间态度声音建议这家只能设施早上点评结帐毛巾小时电梯经理地毯小姐窗户招待所失望补充大床告知告诉隔壁宾馆地方时间东西洗澡人员收费热水服务态度四星大堂最差很小标准朋友星级电视好像一间打扫分钟总台半天投诉预订信用卡半夜床单一股情况效果三星味道装修上网一张客房外面走廊客户宽带估计打开硬件办法几个睡觉取消如家希望听到房价第一次马桶行李下午被子有人事情陈旧还可以不住房费不到四星级浴室郁闷极差设施陈旧管理淋浴办理只好千万卫生霉味商务洗手间饭店服务生当天中午开门确认干净餐厅回来换房这是标准间刷卡蚊子很脏太小浴缸还好不行授权同事锦江水平凌晨推荐三星级保安不值解决登记网络评价结账原因不爽房卡提前便宜点多通知没法消费糟糕一晚楼层一家找到手续暖气旅馆简陋异味床上下次厕所两次押金单人间回答两天早晨网上豪华休息发生没想到标间到达交涉没什么房间内五星顾客解释发票反正不想半个骚扰电话旁边接待差点安排质量电视机说话飞机中央空调影响气味条件家具工作人员入睡设备半小时提供选择一句恶心三个样子会员噪音舒服离开手机肯定超级唯一帮忙衣服放在网线那天马路早饭意识水龙头明明令人桌子号称浴巾门童可怜不推荐信息一楼吵醒几次订单沟通提醒奇怪不敢恭维也许环境接受那种询问垃圾怀疑经历天气之星老旧房门
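The slice `topic.argsort()[:-n_top_words - 1:-1]` used in `print_top_words` is compact but easy to misread. The sketch below applies it to a made-up weight vector over a five-word vocabulary; the names and numbers are illustrative and do not come from the fitted model.

```python
import numpy as np

# Hypothetical topic-word weights for one topic over a five-word vocabulary
feature_names = ['酒店', '房间', '服务', '前台', '早餐']
topic = np.array([0.9, 0.1, 0.5, 0.05, 0.3])

n_top_words = 3
# argsort() returns indices in ascending order of weight;
# the slice [:-n-1:-1] walks the last n indices backwards, i.e. largest weight first
top_idx = topic.argsort()[:-n_top_words - 1:-1]
top_words = [feature_names[i] for i in top_idx]
print(top_words)
```

When `n_top_words` exceeds the vocabulary size, the slice simply returns every index, so the whole vocabulary is printed in descending order of weight.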

key_word = "酒店 不错 服务 房间 宾馆 (paste the keywords printed above here) "

Filter out the words that are not keywords

key_word_list = set(key_word.split(' '))  # a set makes the membership test below fast
new_data = []
for words in clearn_data:
    tem_data = []
    for word in words.split():
        if word.strip() in key_word_list:
#             print(word)
            tem_data.append(word.strip('\n'))
    new_data.append(tem_data)
new_data

[['订单',
  '朋友',
  '到达',
  '酒店',
  '一位',
  '前台',
  '足足',
  '小时',
  '解决',
  '前台',
  '接到',
  '通知',
  '每位',
  '住店',
  '客人',
  '提供',
  ...
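The filtering step can be written as a small function, which also makes it easy to check on toy input. This is a sketch under the assumption that each review is a space-separated string, as produced in section 1.1; `filter_by_keywords`, the keyword string, and the reviews are made up for illustration.

```python
def filter_by_keywords(docs, keywords):
    """Keep only the tokens that appear in `keywords`; docs are space-separated strings."""
    keyword_set = set(keywords)  # constant-time membership test instead of scanning a list
    return [[tok for tok in doc.split() if tok in keyword_set] for doc in docs]

# Hypothetical keyword string and reviews
key_word = '酒店 前台 服务 房间'
docs = ['朋友 到达 酒店 前台', '房间 隔音 太差', '']
filtered = filter_by_keywords(docs, key_word.split())
print(filtered)
```

A review that contains no keyword at all ends up as an empty list, which matters later: averaging the word vectors of an empty document divides by zero unless that case is handled.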

new_data2 = []
for data in new_data:
    data_list = ' '.join(data)
    new_data2.append(data_list)
print(len(new_data2))
print(new_data2)

4000
['订单朋友到达酒店一位前台足足小时解决前台接到通知每位住店客人提供证件奥运期间提出中国配合酒店几个实在无法忍受日晚致电酒店酒店提醒订单星期酒店携程通知朋友足足小时赶到酒店小时客人办理手续提供解决号称星级酒店事件前台人员笑脸客人号称国际星级酒店培训员工酒店提供相关文件前台人员回答电话酒店文件一名信息文件正规操作怀疑酒店说法酒店交涉酒店人员蛮横回答接待实在品牌规范感到室外半小时到达酒店集团酒店客人得知香港香港酒店管理集团得知部门水准团队品牌期间一位经理女士接待号称分钟这位经理级别经理出门证件一句信息提前告知客人入住酒店客人尊重客人态度提出上海证件传真酒店酒店办理手续前台一会

Save the LDA feature words

# Save the feature words, one document per row ("w" avoids duplicate rows on re-runs)
with open('LDA_data.csv', "w", newline='', encoding='gb18030') as fw:
    writer = csv.writer(fw)
    for data in new_data2:
        writer.writerow([data])

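2.3 Classification with LDA-TF-IDF features

#---------------------------------- Step 2: preprocessing --------------------------------
# Convert the texts to a term-count matrix; element a[i][j] is the count of word j in document i
vectorizer = CountVectorizer()

# TfidfTransformer computes the tf-idf weight of every term
transformer = TfidfTransformer()

# The inner fit_transform builds the term-count matrix; the outer one turns it into tf-idf weights
tfidf = transformer.fit_transform(vectorizer.fit_transform(new_data2))
# for n in tfidf[:5]:
#     print(n)
# print(type(tfidf))

# All terms in the bag-of-words vocabulary
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0)
word = vectorizer.get_feature_names_out()
for n in word[:10]:
    print(n)
print("Vocabulary size:", len(set(word)))
# Extract the tf-idf matrix; element w[i][j] is the tf-idf weight of word j in document i
#X = tfidf.toarray()
X = coo_matrix(tfidf, dtype=np.float32).toarray()  # dense copy of the sparse matrix; float32 halves the memory of float64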

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn import neighbors
from sklearn.naive_bayes import MultinomialNB


#---------------------------------- Step 3: split the data --------------------------------
# Split X and the label list with train_test_split (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    labels,
                                                    test_size=0.3,
                                                    )
#-------------------------------- Step 4: machine learning classification --------------------------------
# Logistic regression classifier
LR = LogisticRegression(solver='liblinear')
LR.fit(X_train, y_train)
print('Model accuracy: {}'.format(LR.score(X_test, y_test)))
pre = LR.predict(X_test)
print("Logistic regression classification")
print(len(pre), len(y_test))
print(classification_report(y_test, pre))
print("\n")

Model accuracy: 0.8716666666666667
Logistic regression classification
1200 1200
              precision    recall  f1-score   support

           0       0.86      0.89      0.87       600
           1       0.88      0.86      0.87       600

    accuracy                           0.87      1200
   macro avg       0.87      0.87      0.87      1200
weighted avg       0.87      0.87      0.87      1200

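3. Classification with LDA-Word2Vec features

3.1 Training the model

from gensim.models import Word2Vec

# Train on the keyword-filtered token lists.
# With the gensim defaults, vectors have 100 dimensions and words that occur
# fewer than 5 times (min_count=5) are left out of the vocabulary.
model = Word2Vec(new_data)
print(model)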

3.2 Getting the word vectors

Sum the word vectors of each document, then take the mean

import numpy as np


# Start from a placeholder zero row, then stack one row per document below it
lda_vectors = np.zeros((1, 100))
i = 0
for ls in new_data:
    i += 1
    temp_data = np.zeros((1, 100))
    n = 0
    for l in ls:
        n += 1
        vectors = model.wv[l.strip()]
        temp_data += vectors
    # max(n, 1) avoids dividing by zero when a document has no keywords left
    lda_vectors = np.vstack((lda_vectors, temp_data / max(n, 1)))
    print('#---------------------------{}------------------'.format(i))

Sum the word vectors of each document

import numpy as np


# Start from a placeholder zero row, then stack one row per document below it
lda_vectors = np.zeros((1, 100))
i = 0
for ls in new_data:
    i += 1
    temp_data = np.zeros((1, 100))
    for l in ls:
        vectors = model.wv[l.strip()]
        temp_data += vectors
    lda_vectors = np.vstack((lda_vectors, temp_data))
    print('#---------------------------{}------------------'.format(i))

Multiply the word vectors of each document element-wise

import numpy as np


# Start from a placeholder zero row, then stack one row per document below it
lda_vectors = np.zeros((1, 100))
i = 0
for ls in new_data:
    i += 1
    # Start from ones: starting from zeros would keep every product at zero
    temp_data = np.ones((1, 100))
    for l in ls:
        vectors = model.wv[l.strip()]
        temp_data *= vectors
    lda_vectors = np.vstack((lda_vectors, temp_data))
    print('#---------------------------{}------------------'.format(i))

Pick one of the three variants above
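As a compact alternative to the stacking loops, the averaging variant can be sketched as a helper that is applied to every document in one pass. The example below uses a plain dict of made-up 3-dimensional vectors in place of `model.wv`; `mean_vector` and the toy data are assumptions for illustration, not part of the original experiment.

```python
import numpy as np

def mean_vector(tokens, vectors, dim):
    """Average the vectors of the tokens found in `vectors`.
    Returns a zero vector when no token is known, which avoids dividing by zero."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Hypothetical 3-dimensional embeddings standing in for model.wv
toy_vectors = {
    '酒店': np.array([1.0, 0.0, 2.0]),
    '服务': np.array([3.0, 2.0, 0.0]),
}
docs = [['酒店', '服务'], ['酒店', '未知词'], []]

# Build the whole matrix in one step instead of calling np.vstack once per document
doc_matrix = np.array([mean_vector(doc, toy_vectors, dim=3) for doc in docs])
print(doc_matrix.shape)
```

Building the matrix from a list avoids both the placeholder first row and the repeated copying that `np.vstack` performs on every iteration.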

lda_vectors

array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [-0.19588671, -0.10217057,  0.11343548, ..., -0.34674362,
         0.14601047, -0.2223701 ],
       [-0.17279944,  0.02061116,  0.19713818, ..., -0.39348513,
        -0.01913097, -0.06791256],
       ...,
       [-0.18011493,  0.00626257,  0.22601405, ..., -0.42249438,
        -0.04171428, -0.12767173],
       [-0.18232735,  0.16477492,  0.22199902, ..., -0.4017048 ,
        -0.01998795,  0.03262783],
       [-0.19637735, -0.0591783 ,  0.06737225, ..., -0.31060817,
         0.26222143, -0.21669391]])

Delete the placeholder first row

lda_vector = np.delete(lda_vectors, 0, 0)
lda_vector.shape

(4000, 100)

3.3 Classification with LDA-Word2Vec features

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(lda_vector,
                                                    labels,
                                                    test_size=0.3,
                                                    )
#-------------------------------- Step 4: machine learning classification --------------------------------
# Logistic regression classifier
LR = LogisticRegression(solver='liblinear')
LR.fit(X_train, y_train)
print('Model accuracy: {}'.format(LR.score(X_test, y_test)))
pre = LR.predict(X_test)
print("Logistic regression classification")
print(len(pre), len(y_test))
print(classification_report(y_test, pre))
print("\n")

Model accuracy: 0.7308333333333333
Logistic regression classification
1200 1200
              precision    recall  f1-score   support

           0       0.73      0.73      0.73       602
           1       0.73      0.73      0.73       598

    accuracy                           0.73      1200
   macro avg       0.73      0.73      0.73      1200
weighted avg       0.73      0.73      0.73      1200

4. Text feature selection with LDA, TF-IDF, and Word2Vec

4.1 An example function that computes the TF-IDF value of each word

# -*- coding: utf-8 -*-
from collections import defaultdict
import math
import operator

"""
Create the sample data.
Returns:
    dataset - the tokenized sample documents
    classVec - the class label vector
"""
def loadDataSet():
    dataset = [ ['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],    # tokenized documents
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid'] ]
    classVec = [0, 1, 0, 1, 0, 1]  # class labels: 1 = abusive, 0 = not abusive
    return dataset, classVec


"""
Feature selection with TF-IDF.
Parameters:
     list_words: list of tokenized documents
Returns:
     dict_feature_select: (word, tf-idf) pairs sorted by descending value
"""
def feature_select(list_words):
    # Count how often each word occurs in the whole corpus
    doc_frequency = defaultdict(int)
    for word_list in list_words:
        for i in word_list:
            doc_frequency[i] += 1

    # TF of each word: its corpus-wide count divided by the total number of tokens
    total_words = sum(doc_frequency.values())
    word_tf = {}  # tf value of each word
    for i in doc_frequency:
        word_tf[i] = doc_frequency[i] / total_words

    # IDF of each word
    doc_num = len(list_words)
    word_idf = {}  # idf value of each word
    word_doc = defaultdict(int)  # number of documents that contain the word
    for i in doc_frequency:
        for j in list_words:
            if i in j:
                word_doc[i] += 1
    for i in doc_frequency:
        word_idf[i] = math.log(doc_num / (word_doc[i] + 1))

    # TF * IDF of each word
    word_tf_idf = {}
    for i in doc_frequency:
        word_tf_idf[i] = word_tf[i] * word_idf[i]

    # Sort the dictionary by value in descending order
    dict_feature_select = sorted(word_tf_idf.items(), key=operator.itemgetter(1), reverse=True)
    return dict_feature_select

if __name__ == '__main__':
    data_list, label_list = loadDataSet()  # load the data
    features = feature_select(data_list)   # TF-IDF value of every word
    print(features)
    print(len(features))

[('to', 0.0322394037469742), ('stop', 0.0322394037469742), ('worthless', 0.0322394037469742), ('my', 0.028288263356383563), ('dog', 0.028288263356383563), ('him', 0.028288263356383563), ('stupid', 0.028288263356383563), ('has', 0.025549122992281622), ('flea', 0.025549122992281622), ('problems', 0.025549122992281622), ('help', 0.025549122992281622), ('please', 0.025549122992281622), ('maybe', 0.025549122992281622), ('not', 0.025549122992281622), ('take', 0.025549122992281622), ('park', 0.025549122992281622), ('dalmation', 0.025549122992281622), ('is', 0.025549122992281622), ('so', 0.025549122992281622), ('cute', 0.025549122992281622), ('I', 0.025549122992281622), ('love', 0.025549122992281622), ('posting', 0.025549122992281622), ('garbage', 0.025549122992281622), ('mr', 0.025549122992281622), ('licks', 0.025549122992281622), ('ate', 0.025549122992281622), ('steak', 0.025549122992281622), ('how', 0.025549122992281622), ('quit', 0.025549122992281622), ('buying', 0.025549122992281622), ('food', 0.025549122992281622)]
32
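The values above can be reproduced by hand. The corpus holds 43 tokens in 6 documents, and 'to' occurs twice in 2 documents, so its score is (2/43) * ln(6 / (2 + 1)). The sketch below recomputes the same quantity with `collections.Counter`; `corpus_tfidf` is a hypothetical helper that follows the formula of `feature_select`, not a library function.

```python
import math
from collections import Counter

def corpus_tfidf(docs):
    """Corpus-level TF-IDF with the same formula as feature_select:
    tf  = count of the word in the whole corpus / total number of tokens
    idf = ln(number of documents / (documents containing the word + 1))"""
    term_counts = Counter(w for doc in docs for w in doc)
    doc_counts = Counter(w for doc in docs for w in set(doc))
    total = sum(term_counts.values())
    n_docs = len(docs)
    return {w: (term_counts[w] / total) * math.log(n_docs / (doc_counts[w] + 1))
            for w in term_counts}

docs = [
    ['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
    ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
    ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
    ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
    ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
    ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid'],
]
scores = corpus_tfidf(docs)
print(scores['to'])   # (2/43) * ln(6/3)
```

Two properties of this formula are worth keeping in mind. It yields one weight per word for the whole corpus rather than one per document, and because of the `+ 1` in the denominator a word that occurs in all but one document gets a weight of zero, while a word that occurs in every document gets a negative weight.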

TF_dict = dict(features)
TF_dict

{'to': 0.0322394037469742,
 'stop': 0.0322394037469742,
 'worthless': 0.0322394037469742,
 'my': 0.028288263356383563,
 'dog': 0.028288263356383563,
 'him': 0.028288263356383563,
 'stupid': 0.028288263356383563,
 'has': 0.025549122992281622,
 'flea': 0.025549122992281622,
 'problems': 0.025549122992281622,
 'help': 0.025549122992281622,
 'please': 0.025549122992281622,
 'maybe': 0.025549122992281622,
 'not': 0.025549122992281622,
 'take': 0.025549122992281622,
 'park': 0.025549122992281622,
 'dalmation': 0.025549122992281622,
 'is': 0.025549122992281622,
 'so': 0.025549122992281622,
 'cute': 0.025549122992281622,
 'I': 0.025549122992281622,
 'love': 0.025549122992281622,
 'posting': 0.025549122992281622,
 'garbage': 0.025549122992281622,
 'mr': 0.025549122992281622,
 'licks': 0.025549122992281622,
 'ate': 0.025549122992281622,
 'steak': 0.025549122992281622,
 'how': 0.025549122992281622,
 'quit': 0.025549122992281622,
 'buying': 0.025549122992281622,
 'food': 0.025549122992281622}

TF_dict['to']

0.0322394037469742

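4.2 The LDA-TF-IDF-Word2Vec experiment

# TF_dict maps each word to its corpus-level TF-IDF value
features = feature_select(new_data)
TF_dict = dict(features)

Compute the TF-IDF-weighted Word2Vec vectors

import numpy as np


# Start from a placeholder zero row, then stack one row per document below it
lda_vectors = np.zeros((1, 100))
i = 0
for ls in new_data:
    i += 1
    temp_data = np.zeros((1, 100))
    for l in ls:
        # Scale each word vector by the word's TF-IDF value before summing
        vectors = model.wv[l] * TF_dict[l]
        temp_data += vectors
    lda_vectors = np.vstack((lda_vectors, temp_data))
    print('#---------------------------{}------------------'.format(i))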

#---------------------------1------------------
#---------------------------2------------------
#---------------------------3------------------
#---------------------------4------------------
#---------------------------5------------------
#---------------------------6------------------
#---------------------------7------------------
#---------------------------8------------------
#---------------------------9------------------
#---------------------------10------------------
#---------------------------11------------------
#---------------------------12------------------
#---------------------------13------------------
#---------------------------14------------------
#---------------------------15------------------
#---------------------------16------------------
#---------------------------17------------------

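4.3 Classification with LDA-TF-IDF-Word2Vec features

from sklearn.model_selection import train_test_split
# Drop the placeholder first row of the weighted matrix built in 4.2; without
# this step lda_vector would still hold the unweighted vectors from section 3
lda_vector = np.delete(lda_vectors, 0, 0)
X_train, X_test, y_train, y_test = train_test_split(lda_vector,
                                                    labels,
                                                    test_size=0.3,
                                                    )
#-------------------------------- Step 4: machine learning classification --------------------------------
# Logistic regression classifier
LR = LogisticRegression(solver='liblinear')
LR.fit(X_train, y_train)
print('Model accuracy: {}'.format(LR.score(X_test, y_test)))
pre = LR.predict(X_test)
print("Logistic regression classification")
print(len(pre), len(y_test))
print(classification_report(y_test, pre))
print("\n")

Model accuracy: 0.73
Logistic regression classification
1200 1200
              precision    recall  f1-score   support

           0       0.73      0.74      0.73       603
           1       0.73      0.72      0.73       597

    accuracy                           0.73      1200
   macro avg       0.73      0.73      0.73      1200
weighted avg       0.73      0.73      0.73      1200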