6.1 Using association rules to find patterns in a survey questionnaire

Article link: https://blog.csdn.net/qq_45047246/article/details/107608472

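Association analysis

Here we apply association analysis to a survey questionnaire: a dozen or so single-choice questions are mined together to discover rules that link the answers.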

## Load packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy as sp
## Render figures inline in Jupyter Notebook
%matplotlib inline
## Figure format ("retina" is the high-resolution format on Mac); "png" or "svg" also work
%config InlineBackend.figure_format = "retina"
## Font used to render Chinese text in figures (raw string so the backslashes are not treated as escapes)
from matplotlib.font_manager import FontProperties
fonts = FontProperties(fname = r"D:\Desktop\python在机器学习中的应用\方正粗黑宋简体.ttf",size=14)
## Import 3D axes support
from mpl_toolkits.mplot3d import Axes3D
## The cm module provides a large set of colormaps
from matplotlib import cm
import matplotlib as mpl
## Mining frequent itemsets and association rules
from mlxtend.frequent_patterns import apriori,association_rules
from mlxtend.preprocessing import TransactionEncoder

## Read the data (raw string so the backslashes in the Windows path are not treated as escapes)
datadf = pd.read_excel(r"D:\Desktop\python在机器学习中的应用\调查问卷2.xls")
datadf.head()

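## Across all questions, look at how often each option was chosen
## (the first column is skipped; the remaining columns hold the answers)
dataflatten = np.array(datadf.iloc[:, 1:]).flatten()
dataflatten = pd.DataFrame({"value":dataflatten})
dataflatten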

## Count how often each option occurs
datafre = dataflatten.groupby(by=["value"])["value"].count()
## Arrange the counts as a table, sorted by descending frequency
datafre = pd.DataFrame({"Item":datafre.index,"Freq":datafre.values}).sort_values("Freq",ascending=False)
## Draw a bar chart
datafre.plot(kind = "bar",figsize = (12,6),legend=None)
plt.title("Frequency of each option",fontproperties = fonts)
plt.ylabel("Frequency",fontproperties = fonts,size = 12)
plt.xlabel("")
plt.xticks(range(len(datafre)),datafre["Item"],rotation=90,fontproperties = fonts,size = 9)
plt.show()
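
The counting step above can also be sketched with the standard library alone, which makes it easy to check by hand. The answers below are made-up toy data (an assumption for illustration, not rows from the questionnaire file), and `answers`, `flat`, and `freq` are hypothetical names.

```python
from collections import Counter

# Hypothetical toy answers: one row per respondent, one chosen option per question.
answers = [
    ["male", "science", "rarely"],
    ["female", "arts", "often"],
    ["male", "science", "often"],
]

# Flatten the rows into a single list of options, then count each option.
flat = [option for row in answers for option in row]
freq = Counter(flat)

# most_common() lists options by descending count, the same order used for the bar chart.
for item, count in freq.most_common():
    print(item, count)
```

This mirrors the flatten, group, and sort steps; the pandas version is more convenient once the data is already in a DataFrame.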

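## Encode the dataset: each respondent's answers form one transaction
datanew = np.array(datadf.iloc[:, 1:])
oht = TransactionEncoder() # a cell is True if the transaction contains that option, otherwise False
oht_ary = oht.fit(datanew).transform(datanew)
## Turn the encoded array into a table with one column per option
df = pd.DataFrame(oht_ary, columns=oht.columns_)
df.head()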
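
To make the encoding concrete, here is a minimal pure-Python sketch of the kind of boolean table this step produces. It is not mlxtend's implementation; the transactions are made-up toy data and the names `transactions`, `columns`, and `encoded` are hypothetical.

```python
# Hypothetical toy transactions: one list of chosen options per respondent.
transactions = [
    ["male", "science", "rarely"],
    ["female", "arts", "often"],
    ["male", "science", "often"],
]

# One column per distinct option, in sorted order.
columns = sorted({item for row in transactions for item in row})

# One boolean row per transaction: True where the option was chosen.
encoded = [[item in row for item in columns] for row in transactions]

print(columns)
for row in encoded:
    print(row)
```

Each respondent becomes a row of True/False flags, which is the input shape that the frequent itemset search expects.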

## Find frequent itemsets with a minimum support of 0.3
df_fre = apriori(df, min_support=0.3,use_colnames=True)
## Add the number of items in each frequent itemset
df_fre["length"] = df_fre["itemsets"].apply(len)
print(df_fre.shape)
## 138 frequent itemsets reach the minimum support of 0.3
## Keep the frequent itemsets that contain at least two items
df_fre_len2 = df_fre[df_fre["length"]>1]
print(df_fre_len2.shape)
df_fre_len2.sample(5)
## 120 of the frequent itemsets contain at least two items
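
What "frequent itemset" and "support" mean can be sketched on toy data. The example below is a brute-force enumeration, not the Apriori algorithm (which prunes candidates instead of testing every combination), and the transactions, threshold, and helper `support` are assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical toy transactions: the set of options chosen by each respondent.
transactions = [
    {"male", "science", "rarely"},
    {"female", "arts", "often"},
    {"male", "science", "often"},
    {"male", "arts", "rarely"},
]
min_support = 0.5  # illustrative threshold, not the 0.3 used above


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


# Brute force: test every combination of items against the threshold.
items = sorted(set().union(*transactions))
frequent = {}
for size in range(1, len(items) + 1):
    for combo in combinations(items, size):
        s = support(frozenset(combo), transactions)
        if s >= min_support:
            frequent[frozenset(combo)] = s

for itemset, s in frequent.items():
    print(sorted(itemset), s)
```

Brute force is only practical for a handful of items, since the number of combinations doubles with every added option; that growth is the reason a pruning algorithm is used on real data.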

## Find association rules, using a lift threshold to select them
rule1 = association_rules(df_fre, metric="lift", min_threshold=1.1)
## Compute the number of items in each antecedent
rule1["antelen"] = rule1["antecedents"].apply(len)
rule1

There are 12 rules in total with a lift above 1.1; the result also lists the support, confidence, and lift of each rule.
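
For reference, the three measures can be computed by hand for a single rule. The sketch below uses the standard definitions (support of the combined itemset, confidence = support(A and C) / support(A), lift = confidence / support(C)) on made-up toy transactions; the helper names `support` and `rule_metrics` are hypothetical, and the code assumes the antecedent and consequent each occur at least once.

```python
# Hypothetical toy transactions: the set of options chosen by each respondent.
transactions = [
    {"male", "science", "rarely"},
    {"female", "arts", "often"},
    {"male", "science", "often"},
    {"male", "arts", "rarely"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def rule_metrics(antecedent, consequent, transactions):
    """Support, confidence, and lift of the rule antecedent -> consequent."""
    a = frozenset(antecedent)
    c = frozenset(consequent)
    s_a = support(a, transactions)
    s_c = support(c, transactions)
    s_ac = support(a | c, transactions)
    confidence = s_ac / s_a   # assumes the antecedent occurs at least once
    lift = confidence / s_c   # assumes the consequent occurs at least once
    return {"support": s_ac, "confidence": confidence, "lift": lift}


metrics = rule_metrics({"science"}, {"male"}, transactions)
print(metrics)
```

A lift above 1 means the consequent appears more often alongside the antecedent than it does overall, which is why a lift threshold is a reasonable filter for interesting rules.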

## Select rules with confidence > 0.7 and more than one item in the antecedent
rule1[(rule1.antelen>1)&(rule1.confidence>0.7)]

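These rules suggest that men who show little interest in reality shows (answering that watching once is enough, or that they only occasionally discuss them) are usually science and engineering students.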