[Data Mining] Jilin University Data Mining Course Project: Word Frequency Statistics

Assignment requirements:
For each of the three reports, compute the frequency of every word it uses (i.e. the number of times the word appears in the text divided by the total number of word occurrences in the text) and list the 10 words with the highest frequency in each report, as well as the 10 words with the highest cumulative frequency across the three reports (i.e. merge the three reports into one document and count again).
Treat each word as an item and each sentence as a transaction record. Find the 20 association rules with the highest lift in each report.
Based on the mining results above, give a pattern evaluation and discuss what knowledge can be drawn from them.
Submit the answer as a zip archive containing: 1. the preprocessed report data; 2. the program source code; 3. an experiment report with the mining results and the analysis and discussion of the pattern evaluation.


Notes before you read on:

Before starting the experiment, it is worth looking up the basics of word-frequency counting
(just search for "word frequency counting", and get familiar with the Chinese word segmentation tool "jieba" and its basic usage)

The program below is complete and runs
(file directories differ from machine to machine, so you will need to adjust the paths yourself)

Counting word frequencies works without problems, and building the transaction records (purchases) also works
(but the nested matching loop is expensive, so that step takes a very, very long time to run)

Selecting the 20 rules with the highest lift is easy to get wrong: after sort_values the rows must be taken by position
(the code below uses .head(); .loc[0:39] would pick rows by their original index labels rather than by lift)

This is only one way to approach the problem, offered for reference
(honestly the code is bloated and copy-paste heavy)

Chinese word segmentation tool: jieba
(other segmentation tools work too, but the code below uses jieba)
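A minimal jieba example (jieba.lcut returns the segmentation as a list of tokens; the exact split can vary with the dictionary version):

import jieba
print(jieba.lcut("统计词频"))   # a list of tokens, e.g. something like ['统计', '词频']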

The stop-word file 停用词.txt was found online and copy-pasted into a txt document

三合一 ("three-in-one") is simply the three report texts merged into a single txt file
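A minimal sketch of producing the merged file, assuming the same desktop paths used in the code below:

paths = [r"C:\Users\78042\Desktop\十八大.txt",
         r"C:\Users\78042\Desktop\十九大.txt",
         r"C:\Users\78042\Desktop\二十大.txt"]
with open(r"C:\Users\78042\Desktop\三合一.txt", "w", encoding="UTF-8") as out:
    for path in paths:
        with open(path, encoding="UTF-8") as f:
            out.write(f.read() + "\n")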
# -*- coding: utf-8 -*-
# time: 2022/10/23 11:14
# file: dm.py
# author: Bill
import jieba
from collections import Counter
import re
from mlxtend.frequent_patterns import association_rules
from mlxtend.frequent_patterns import apriori
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
# Question 1: for each of the three reports, compute the frequency of each word it uses
#             (the word's count divided by the total number of word occurrences in the text)
#             and list the 10 words with the highest frequency in each report
# Load the 18th Congress report
with open(r"C:\Users\78042\Desktop\十八大.txt",encoding="UTF-8") as f:
    text18 = f.read()
# Load the 19th Congress report
with open(r"C:\Users\78042\Desktop\十九大.txt",encoding="UTF-8") as f:
    text19 = f.read()
# Load the 20th Congress report
with open(r"C:\Users\78042\Desktop\二十大.txt",encoding="UTF-8") as f:
    text20 = f.read()
# Load the merged three-in-one text
with open(r"C:\Users\78042\Desktop\三合一.txt",encoding="UTF-8") as f:
    text3 = f.read()
# Load the stop words
with open(r"C:\Users\78042\Desktop\停用词.txt",encoding="UTF-8") as f:
    stop_words = f.read().split()
stop_words.extend(['\n','\u3000'])
# Use a set for fast stop-word filtering
stop_words = set(stop_words)
cut_word = jieba.lcut(text18)
all_words = [word for word in cut_word if len(word) > 1 and word not in stop_words]
wordcount = Counter(all_words)
print("十八大:")
print(wordcount.most_common(10),"\n")

cut_word = jieba.lcut(text19)
all_words = [word for word in cut_word if len(word) > 1 and word not in stop_words]
wordcount = Counter(all_words)
print("十九大:")
print(wordcount.most_common(10),"\n")

cut_word = jieba.lcut(text20)
all_words = [word for word in cut_word if len(word) > 1 and word not in stop_words]
wordcount = Counter(all_words)
print("二十大:")
print(wordcount.most_common(10),"\n")

cut_word = jieba.lcut(text3)
all_words = [word for word in cut_word if len(word) > 1 and word not in stop_words]
wordcount = Counter(all_words)
print("三合一:")
print(wordcount.most_common(10),"\n")
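
# The task defines word frequency as the count divided by the total number of
# word occurrences; a minimal sketch of that normalisation on top of the
# Counter above (the helper name top_frequencies is my own, not part of the
# assignment; dividing by len(cut_word) instead would count unfiltered tokens):
def top_frequencies(counter, k=10):
    total = sum(counter.values())  # total number of filtered tokens
    return [(word, count / total) for word, count in counter.most_common(k)]

print("三合一 (normalised):")
print(top_frequencies(wordcount), "\n")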

# Question 2: treat each word as an item and each sentence as a transaction record,
# and find the 20 association rules with the highest lift in each report.

# 18th Congress report
# split the text into sentences on the Chinese full stop '。'
result_list = re.split(r'。',text18)
print('result_list:',result_list)
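# (Sketch) Splitting only on '。' keeps headings and long compound sentences as
# single transactions; a finer split could also break on question/exclamation
# marks and semicolons, e.g.:
# result_list = re.split(r'[。;!?]', text18)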
cut_word = jieba.lcut(text18)
all_words = [word for word in cut_word if len(word) > 1 and word not in stop_words]
wordcount = Counter(all_words)

'''
Collect every distinct item (word token), i.e. deduplicate all_words
(for the 18th Congress report this drops 8744 tokens to 2498 unique words)
'''
test = [word for word, count in wordcount.most_common()]
print('test:',test)
length_test = len(test)

length_result_list = len(result_list)
'''
Number of sentences (586 for the 18th Congress report)
'''

'''
Initialise purchases (the transactions): one empty list per sentence
'''
purchases = [[] for i in range(length_result_list)]

'''
Outer loop over the sentences, inner loop over every unique word; if a word
occurs in the sentence, add it to that sentence's transaction
'''
for i in range(length_result_list):
    for word in test:
        # plain substring matching: faster than re.search and not affected by
        # regex metacharacters that might appear in a token
        if word in result_list[i]:
            purchases[i].append(word)
print('purchases:',purchases)
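# (Sketch) A faster way to build comparable transactions is to segment each
# sentence with jieba and intersect its tokens with the item set (exact-token
# matches instead of substring matches):
#
#     item_set = set(test)
#     purchases = [[w for w in set(jieba.lcut(s)) if w in item_set]
#                  for s in result_list]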
#print("@^_^@")
te = TransactionEncoder()
oht_ary = te.fit(purchases).transform(purchases,sparse=True)
sparse_df = pd.DataFrame.sparse.from_spmatrix(oht_ary,columns=te.columns_)

# Use the Apriori algorithm to generate the frequent itemsets
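# min_support=0.1 keeps only itemsets that occur in at least 10% of the
# sentences (transactions); lowering it yields more candidates but is slower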
frequent_itemsets = apriori(sparse_df, min_support=0.1, use_colnames=True,verbose=1)
#print('frequent_itemsets:',frequent_itemsets)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
rules = association_rules(frequent_itemsets, metric='lift', min_threshold=1)
# top 20 rules by lift; head() selects by position after sorting
rules_h = rules.sort_values(by='lift',ascending=False).head(20)
print('rules_h:\n',rules_h)
#print("@^_^@")

# 19th Congress report
result_list = re.split(r'。',text19)
print('result_list:',result_list)
cut_word = jieba.lcut(text19)
all_words = [word for word in cut_word if len(word) > 1 and word not in stop_words]
wordcount = Counter(all_words)
'''
Collect every distinct item (word token), i.e. deduplicate all_words
'''
test = [word for word, count in wordcount.most_common()]
print('test:',test)
length_test = len(test)
#print("length_test:",length_test)
length_result_list = len(result_list)
'''
Number of sentences in this report
'''
#print("length_result_list:",length_result_list)
'''
Initialise purchases (the transactions): one empty list per sentence
'''
purchases = [[] for i in range(length_result_list)]
#print(purchases)
'''
Outer loop over the sentences, inner loop over every unique word; if a word
occurs in the sentence, add it to that sentence's transaction
'''
for i in range(length_result_list):
    for word in test:
        if word in result_list[i]:
            purchases[i].append(word)
print('purchases:',purchases)
#print("@^_^@")
te = TransactionEncoder()
oht_ary = te.fit(purchases).transform(purchases,sparse=True)
sparse_df = pd.DataFrame.sparse.from_spmatrix(oht_ary,columns=te.columns_)
# Use the Apriori algorithm to generate the frequent itemsets
frequent_itemsets = apriori(sparse_df, min_support=0.0525, use_colnames=True,verbose=1)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
rules = association_rules(frequent_itemsets, metric='lift', min_threshold=1)
print('rules:',rules)
# top 20 rules by lift; head() selects by position after sorting
rules_h = rules.sort_values(by='lift',ascending=False).head(20)
print('rules_h:\n',rules_h)
#print("@^_^@")

# 20th Congress report
result_list = re.split(r'。',text20)
print('result_list:',result_list)
cut_word = jieba.lcut(text20)
all_words = [word for word in cut_word if len(word) > 1 and word not in stop_words]
wordcount = Counter(all_words)
'''
Collect every distinct item (word token), i.e. deduplicate all_words
'''
test = [word for word, count in wordcount.most_common()]
print('test:',test)
length_test = len(test)
length_result_list = len(result_list)
'''
Number of sentences in this report
'''
'''
Initialise purchases (the transactions): one empty list per sentence
'''
purchases = [[] for i in range(length_result_list)]
'''
Outer loop over the sentences, inner loop over every unique word; if a word
occurs in the sentence, add it to that sentence's transaction
'''
for i in range(length_result_list):
    for word in test:
        if word in result_list[i]:
            purchases[i].append(word)
print('purchases:',purchases)
te = TransactionEncoder()
oht_ary = te.fit(purchases).transform(purchases,sparse=True)
sparse_df = pd.DataFrame.sparse.from_spmatrix(oht_ary,columns=te.columns_)
# Use the Apriori algorithm to generate the frequent itemsets
frequent_itemsets = apriori(sparse_df, min_support=0.0525, use_colnames=True,verbose=1)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
rules = association_rules(frequent_itemsets, metric='lift', min_threshold=1)
# top 20 rules by lift; head() selects by position after sorting
lift_rules = rules.sort_values(by='lift',ascending=False).head(20)
print('lift_rules:\n',lift_rules)
#print("@^_^@")
