[Data Mining] Jilin University Data Mining Course Project: Word Frequency Statistics

Assignment requirements:
For each of the three reports, compute the frequency of every word it uses (i.e. the number of times the word appears in the text divided by the total number of word occurrences in the text) and list the 10 words with the highest frequency in each report, as well as the 10 words with the highest cumulative frequency across the three reports (i.e. merge the three reports into one document and count again).
Treat each word as an item and each sentence as a transaction record. Find the 20 association rules with the highest lift in each report.
Based on the mining results above, give a pattern evaluation and discuss what knowledge can be drawn from them.
Submit the answer as a zip archive containing: 1. the preprocessed report data; 2. the program source code; 3. an experiment report with the mining results and the analysis and discussion of the pattern evaluation.


Notes before you read on:

Before starting the experiment, it is worth looking up the basics of word-frequency counting
(just search for "word frequency counting", and get familiar with the Chinese word segmentation tool "jieba" and its basic usage)

The program below is complete and runs
(file directories differ from machine to machine, so you will need to adjust the paths yourself)

Counting word frequencies works without problems, and building the transaction records (purchases) also works
(but the nested matching loop is expensive, so that step takes a very, very long time to run)

Selecting the 20 rules with the highest lift is easy to get wrong: after sort_values the rows must be taken by position
(the code below uses .head(); .loc[0:39] would pick rows by their original index labels rather than by lift)

This is only one way to approach the problem, offered for reference
(honestly the code is bloated and copy-paste heavy)

Chinese word segmentation tool: jieba
(other segmentation tools work too, but the code below uses jieba)
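A minimal jieba example (jieba.lcut returns the segmentation as a list of tokens; the exact split can vary with the dictionary version):

import jieba
print(jieba.lcut("统计词频"))   # a list of tokens, e.g. something like ['统计', '词频']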

The stop-word file 停用词.txt was found online and copy-pasted into a txt document

三合一 ("three-in-one") is simply the three report texts merged into a single txt file
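A minimal sketch of producing the merged file, assuming the same desktop paths used in the code below:

paths = [r"C:\Users\78042\Desktop\十八大.txt",
         r"C:\Users\78042\Desktop\十九大.txt",
         r"C:\Users\78042\Desktop\二十大.txt"]
with open(r"C:\Users\78042\Desktop\三合一.txt", "w", encoding="UTF-8") as out:
    for path in paths:
        with open(path, encoding="UTF-8") as f:
            out.write(f.read() + "\n")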
# -*- coding: utf-8 -*-
# time: 2022/10/23 11:14
# file: dm.py
# author: Bill
import jieba
from collections import Counter
import re
from mlxtend.frequent_patterns import association_rules
from mlxtend.frequent_patterns import apriori
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
# Question 1: for each of the three reports, compute the frequency of each word it uses
#             (the word's count divided by the total number of word occurrences in the text)
#             and list the 10 words with the highest frequency in each report
# Load the 18th Congress report
with open(r"C:\Users\78042\Desktop\十八大.txt",encoding="UTF-8") as f:
    text18 = f.read()
# Load the 19th Congress report
with open(r"C:\Users\78042\Desktop\十九大.txt",encoding="UTF-8") as f:
    text19 = f.read()
# Load the 20th Congress report
with open(r"C:\Users\78042\Desktop\二十大.txt",encoding="UTF-8") as f:
    text20 = f.read()
# Load the merged three-in-one text
with open(r"C:\Users\78042\Desktop\三合一.txt",encoding="UTF-8") as f:
    text3 = f.read()
# Load the stop words
with open(r"C:\Users\78042\Desktop\停用词.txt",encoding="UTF-8") as f:
    stop_words = f.read().split()
stop_words.extend(['\n','\u3000'])
# Use a set for fast stop-word filtering
stop_words = set(stop_words)
cut_word = jieba.lcut(text18)
all_words = [word for word in cut_word if len(word) > 1 and word not in stop_words]
wordcount = Counter(all_words)
print("十八大:")
print(wordcount.most_common(10),"\n")

cut_word = jieba.lcut(text19)
all_words = [word for word in cut_word if len(word) > 1 and word not in stop_words]
wordcount = Counter(all_words)
print("十九大:")
print(wordcount.most_common(10),"\n")

cut_word = jieba.lcut(text20)
all_words = [word for word in cut_word if len(word) > 1 and word not in stop_words]
wordcount = Counter(all_words)
print("二十大:")
print(wordcount.most_common(10),"\n")

cut_word = jieba.lcut(text3)
all_words = [word for word in cut_word if len(word) > 1 and word not in stop_words]
wordcount = Counter(all_words)
print("三合一:")
print(wordcount.most_common(10),"\n")
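
# The task defines word frequency as the count divided by the total number of
# word occurrences; a minimal sketch of that normalisation on top of the
# Counter above (the helper name top_frequencies is my own, not part of the
# assignment; dividing by len(cut_word) instead would count unfiltered tokens):
def top_frequencies(counter, k=10):
    total = sum(counter.values())  # total number of filtered tokens
    return [(word, count / total) for word, count in counter.most_common(k)]

print("三合一 (normalised):")
print(top_frequencies(wordcount), "\n")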

# Question 2: treat each word as an item and each sentence as a transaction record,
# and find the 20 association rules with the highest lift in each report.

# 18th Congress report
# split the text into sentences on the Chinese full stop '。'
result_list = re.split(r'。',text18)
print('result_list:',result_list)
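# (Sketch) Splitting only on '。' keeps headings and long compound sentences as
# single transactions; a finer split could also break on question/exclamation
# marks and semicolons, e.g.:
# result_list = re.split(r'[。;!?]', text18)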
cut_word = jieba.lcut(text18)
all_words = [word for word in cut_word if len(word) > 1 and word not in stop_words]
wordcount = Counter(all_words)

'''
Collect every distinct item (word token), i.e. deduplicate all_words
(for the 18th Congress report this drops 8744 tokens to 2498 unique words)
'''
test = [word for word, count in wordcount.most_common()]
print('test:',test)
length_test = len(test)

length_result_list = len(result_list)
'''
Number of sentences (586 for the 18th Congress report)
'''

'''
Initialise purchases (the transactions): one empty list per sentence
'''
purchases = [[] for i in range(length_result_list)]

'''
Outer loop over the sentences, inner loop over every unique word; if a word
occurs in the sentence, add it to that sentence's transaction
'''
for i in range(length_result_list):
    for word in test:
        # plain substring matching: faster than re.search and not affected by
        # regex metacharacters that might appear in a token
        if word in result_list[i]:
            purchases[i].append(word)
print('purchases:',purchases)
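# (Sketch) A faster way to build comparable transactions is to segment each
# sentence with jieba and intersect its tokens with the item set (exact-token
# matches instead of substring matches):
#
#     item_set = set(test)
#     purchases = [[w for w in set(jieba.lcut(s)) if w in item_set]
#                  for s in result_list]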
#print("@^_^@")
te = TransactionEncoder()
oht_ary = te.fit(purchases).transform(purchases,sparse=True)
sparse_df = pd.DataFrame.sparse.from_spmatrix(oht_ary,columns=te.columns_)

# Use the Apriori algorithm to generate the frequent itemsets
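# min_support=0.1 keeps only itemsets that occur in at least 10% of the
# sentences (transactions); lowering it yields more candidates but is slower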
frequent_itemsets = apriori(sparse_df, min_support=0.1, use_colnames=True,verbose=1)
#print('frequent_itemsets:',frequent_itemsets)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
rules = association_rules(frequent_itemsets, metric='lift', min_threshold=1)
# top 20 rules by lift; head() selects by position after sorting
rules_h = rules.sort_values(by='lift',ascending=False).head(20)
print('rules_h:\n',rules_h)
#print("@^_^@")

# 19th Congress report
result_list = re.split(r'。',text19)
print('result_list:',result_list)
cut_word = jieba.lcut(text19)
all_words = [word for word in cut_word if len(word) > 1 and word not in stop_words]
wordcount = Counter(all_words)
'''
Collect every distinct item (word token), i.e. deduplicate all_words
'''
test = [word for word, count in wordcount.most_common()]
print('test:',test)
length_test = len(test)
#print("length_test:",length_test)
length_result_list = len(result_list)
'''
Number of sentences in this report
'''
#print("length_result_list:",length_result_list)
'''
Initialise purchases (the transactions): one empty list per sentence
'''
purchases = [[] for i in range(length_result_list)]
#print(purchases)
'''
Outer loop over the sentences, inner loop over every unique word; if a word
occurs in the sentence, add it to that sentence's transaction
'''
for i in range(length_result_list):
    for word in test:
        if word in result_list[i]:
            purchases[i].append(word)
print('purchases:',purchases)
#print("@^_^@")
te = TransactionEncoder()
oht_ary = te.fit(purchases).transform(purchases,sparse=True)
sparse_df = pd.DataFrame.sparse.from_spmatrix(oht_ary,columns=te.columns_)
# Use the Apriori algorithm to generate the frequent itemsets
frequent_itemsets = apriori(sparse_df, min_support=0.0525, use_colnames=True,verbose=1)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
rules = association_rules(frequent_itemsets, metric='lift', min_threshold=1)
print('rules:',rules)
# top 20 rules by lift; head() selects by position after sorting
rules_h = rules.sort_values(by='lift',ascending=False).head(20)
print('rules_h:\n',rules_h)
#print("@^_^@")

# 20th Congress report
result_list = re.split(r'。',text20)
print('result_list:',result_list)
cut_word = jieba.lcut(text20)
all_words = [word for word in cut_word if len(word) > 1 and word not in stop_words]
wordcount = Counter(all_words)
'''
Collect every distinct item (word token), i.e. deduplicate all_words
'''
test = [word for word, count in wordcount.most_common()]
print('test:',test)
length_test = len(test)
length_result_list = len(result_list)
'''
Number of sentences in this report
'''
'''
Initialise purchases (the transactions): one empty list per sentence
'''
purchases = [[] for i in range(length_result_list)]
'''
Outer loop over the sentences, inner loop over every unique word; if a word
occurs in the sentence, add it to that sentence's transaction
'''
for i in range(length_result_list):
    for word in test:
        if word in result_list[i]:
            purchases[i].append(word)
print('purchases:',purchases)
te = TransactionEncoder()
oht_ary = te.fit(purchases).transform(purchases,sparse=True)
sparse_df = pd.DataFrame.sparse.from_spmatrix(oht_ary,columns=te.columns_)
# Use the Apriori algorithm to generate the frequent itemsets
frequent_itemsets = apriori(sparse_df, min_support=0.0525, use_colnames=True,verbose=1)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
rules = association_rules(frequent_itemsets, metric='lift', min_threshold=1)
# top 20 rules by lift; head() selects by position after sorting
lift_rules = rules.sort_values(by='lift',ascending=False).head(20)
print('lift_rules:\n',lift_rules)
#print("@^_^@")
