Assignment requirements:
For each of the three reports, compute word frequencies (a word's frequency is its number of occurrences in the document divided by the total number of word occurrences in the document) and list the 10 highest-frequency
words in each report, as well as the 10 highest-frequency words across the three reports combined (i.e. merge the three reports into a single document and count over that).
Treat each word as an item and each sentence as a transaction. Find the 20 association rules with the highest lift in each report.
Based on the mining results above, evaluate the patterns and discuss what knowledge can be drawn from them.
Submit your answer as a zip archive containing: 1. the preprocessed report data; 2. the program source code; 3. a lab report with the mining results and the pattern-evaluation
analysis and discussion.
Notes before you start:
You may want to first look up the basics of word-frequency counting
(searching for "word frequency counting" is enough; also get familiar with the Chinese word-segmentation tool jieba and its basic usage).
The program below is complete and runs
(file paths differ from machine to machine, so adjust the paths for your own setup).
The word-frequency counting works, and so does building the transaction records (purchases)
(but the time complexity is high, so it takes an extremely long time to run).
One caveat: selecting the top-20 rules with label-based `.loc[0:39]` after `sort_values` does not pick rows by rank, so the highest-lift rules come out wrong; positional selection with `.head(20)` is what is needed.
This is offered only as a reference approach to the problem
(frankly, the code is bloated and could be much cleaner).
Chinese word-segmentation tool: jieba
(other segmentation tools work too, but the code below uses jieba).
The stopword file 停用词.txt was assembled by copying a stopword list found online into a txt file.
三合一.txt is simply the three report texts concatenated into a single txt file.
import jieba
from collections import Counter
import re
from mlxtend.frequent_patterns import association_rules
from mlxtend.frequent_patterns import apriori
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
with open(r"C:\Users\78042\Desktop\十八大.txt",encoding="UTF-8") as f:
text18 = f.read()
with open(r"C:\Users\78042\Desktop\十九大.txt",encoding="UTF-8") as f:
text19 = f.read()
with open(r"C:\Users\78042\Desktop\二十大.txt",encoding="UTF-8") as f:
text20 = f.read()
with open(r"C:\Users\78042\Desktop\三合一.txt",encoding="UTF-8") as f:
text3 = f.read()
with open(r"C:\Users\78042\Desktop\停用词.txt",encoding="UTF-8") as f:
stop_words = f.read().split()
stop_words.extend(['\n','\u3000'])
stop_words = set(stop_words)
# 18th Congress report: segment, drop single-character tokens and stopwords, count
cut_word = jieba.lcut(text18)
all_words = [word for word in cut_word if len(word) > 1 and word not in stop_words]
wordcount = Counter(all_words)
print("十八大:")
print(wordcount.most_common(10),"\n")
# 19th Congress report
cut_word = jieba.lcut(text19)
all_words = [word for word in cut_word if len(word) > 1 and word not in stop_words]
wordcount = Counter(all_words)
print("十九大:")
print(wordcount.most_common(10),"\n")
# 20th Congress report
cut_word = jieba.lcut(text20)
all_words = [word for word in cut_word if len(word) > 1 and word not in stop_words]
wordcount = Counter(all_words)
print("二十大:")
print(wordcount.most_common(10),"\n")
# The three reports merged into one document
cut_word = jieba.lcut(text3)
all_words = [word for word in cut_word if len(word) > 1 and word not in stop_words]
wordcount = Counter(all_words)
print("三合一:")
print(wordcount.most_common(10),"\n")
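Note that `most_common` prints raw counts, while the task defines word frequency as count divided by the total number of word occurrences. A minimal sketch of that normalisation, using a hand-made token list in place of the filtered jieba output `all_words`:

```python
from collections import Counter

# Toy token list standing in for all_words (the filtered jieba output)
tokens = ["发展", "经济", "发展", "建设", "发展", "经济"]

wordcount = Counter(tokens)
total = sum(wordcount.values())  # total number of word occurrences

# Frequency = count / total, per the task definition
freqs = [(word, count / total) for word, count in wordcount.most_common(10)]
print(freqs)  # 发展: 3/6, 经济: 2/6, 建设: 1/6
```

The same two extra lines (`total` and the division) can be applied to each report's `wordcount` above before printing.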
result_list = re.split(r'。',text18)
print('result_list:',result_list)
cut_word = jieba.lcut(text18)
all_words = [word for word in cut_word if len(word) > 1 and word not in stop_words]
wordcount = Counter(all_words)
# Collect the distinct items (segmented words); deduplicates all_words
# (for the 18th Congress report: 8744 tokens down to 2498 unique words)
test = [word for word, count in wordcount.most_common()]
print('test:',test)
length_test = len(test)
length_result_list = len(result_list)
# Number of sentences (586 for the 18th Congress report).
# Initialise purchases (transactions), one per sentence.
purchases = [[] for i in range(length_result_list)]
# For each sentence, test every distinct word against it; matched words
# join that sentence's transaction. Plain substring testing replaces the
# original re.search, which could misfire on regex metacharacters.
for i in range(length_result_list):
    for word in test:
        if word in result_list[i]:
            purchases[i].append(word)
print('purchases:',purchases)
# One-hot encode the transactions into a sparse DataFrame
te = TransactionEncoder()
oht_ary = te.fit(purchases).transform(purchases,sparse=True)
sparse_df = pd.DataFrame.sparse.from_spmatrix(oht_ary,columns=te.columns_)
frequent_itemsets = apriori(sparse_df, min_support=0.1, use_colnames=True,verbose=1)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
rules = association_rules(frequent_itemsets, metric='lift', min_threshold=1)
# .head(20) takes the top rows by position after sorting; the original
# .loc[0:39] selected by index label and returned the wrong rules
rules_h = rules.sort_values(by='lift',ascending=False).head(20)
print('rules_h:\n',rules_h)
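As a sanity check on the mlxtend output, the metrics can be computed by hand on a toy transaction set: for a rule A→B, confidence is support(A∪B)/support(A) and lift is support(A∪B)/(support(A)·support(B)); lift > 1 means A and B co-occur more often than independence would predict:

```python
# Hand-compute support, confidence and lift for the rule {A} -> {B}
transactions = [
    {"A", "B"}, {"A", "B"}, {"A"}, {"B"}, {"C"},
]
n = len(transactions)
sup_a = sum("A" in t for t in transactions) / n          # support({A}) = 3/5
sup_b = sum("B" in t for t in transactions) / n          # support({B}) = 3/5
sup_ab = sum({"A", "B"} <= t for t in transactions) / n  # support({A,B}) = 2/5
confidence = sup_ab / sup_a                              # 2/3
lift = sup_ab / (sup_a * sup_b)                          # (2/5)/(9/25) = 10/9
print(confidence, lift)
```

Here lift ≈ 1.11 > 1, so A and B are (weakly) positively associated; this is exactly the quantity the `lift` column of `rules` reports for each antecedent/consequent pair.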
result_list = re.split(r'。',text19)
print('result_list:',result_list)
cut_word = jieba.lcut(text19)
all_words = [word for word in cut_word if len(word) > 1 and word not in stop_words]
wordcount = Counter(all_words)
# Collect the distinct items (segmented words); deduplicates all_words
test = [word for word, count in wordcount.most_common()]
print('test:',test)
length_test = len(test)
length_result_list = len(result_list)
# Initialise purchases (transactions), one per sentence
purchases = [[] for i in range(length_result_list)]
# For each sentence, test every distinct word against it;
# matched words join that sentence's transaction
for i in range(length_result_list):
    for word in test:
        if word in result_list[i]:
            purchases[i].append(word)
print('purchases:',purchases)
te = TransactionEncoder()
oht_ary = te.fit(purchases).transform(purchases,sparse=True)
sparse_df = pd.DataFrame.sparse.from_spmatrix(oht_ary,columns=te.columns_)
frequent_itemsets = apriori(sparse_df, min_support=0.0525, use_colnames=True,verbose=1)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
rules = association_rules(frequent_itemsets, metric='lift', min_threshold=1)
print('rules:',rules)
rules_h = rules.sort_values(by='lift',ascending=False).head(20)
print('rules_h:\n',rules_h)
result_list = re.split(r'。',text20)
print('result_list:',result_list)
cut_word = jieba.lcut(text20)
all_words = [word for word in cut_word if len(word) > 1 and word not in stop_words]
wordcount = Counter(all_words)
# Collect the distinct items (segmented words); deduplicates all_words
test = [word for word, count in wordcount.most_common()]
print('test:',test)
length_test = len(test)
length_result_list = len(result_list)
# Initialise purchases (transactions), one per sentence
purchases = [[] for i in range(length_result_list)]
# For each sentence, test every distinct word against it;
# matched words join that sentence's transaction
for i in range(length_result_list):
    for word in test:
        if word in result_list[i]:
            purchases[i].append(word)
print('purchases:',purchases)
te = TransactionEncoder()
oht_ary = te.fit(purchases).transform(purchases,sparse=True)
sparse_df = pd.DataFrame.sparse.from_spmatrix(oht_ary,columns=te.columns_)
frequent_itemsets = apriori(sparse_df, min_support=0.0525, use_colnames=True,verbose=1)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
rules = association_rules(frequent_itemsets, metric='lift', min_threshold=1)
lift_rules = rules.sort_values(by='lift',ascending=False).head(20)
print('lift_rules:\n',lift_rules)
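Why the original top-20 selection went wrong: `sort_values` reorders the rows but keeps their original index labels, and `.loc` selects by label, not by position. A small pandas demonstration with toy `lift` values:

```python
import pandas as pd

# Toy rules table; the highest lift sits at index label 1
rules = pd.DataFrame({"lift": [1.1, 3.0, 2.0]})
by_lift = rules.sort_values(by="lift", ascending=False)

# Label-based selection ignores the sorted order
wrong = by_lift.loc[[0, 1]]   # rows labelled 0 and 1 -> lifts [1.1, 3.0]
# Positional selection respects it
right = by_lift.head(2)       # top two rows after sorting -> lifts [3.0, 2.0]
print(wrong["lift"].tolist(), right["lift"].tolist())
```

`reset_index(drop=True)` after sorting would also make label order and positional order coincide, at which point either selection works.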