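I have recently been learning about data mining with Python, mainly by working through the book 《Python数据挖掘入门与实践》 (the Chinese edition of Learning Data Mining with Python). This post collects my notes on the book and on some of the problems I ran into.
Data mining aims to let a computer make decisions based on existing data.
The first step in data mining is usually to build a dataset, which consists of two main parts:
(1) Samples: objects in the real world
(2) Features: descriptions of the samples in the dataset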
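The first technique I studied is affinity analysis, which uses the similarity between samples to determine how closely they are related.
The example uses a dataset of product purchases covering five products: bread, milk, cheese, apples, and bananas.
Each feature can only take the value 0 or 1. It indicates whether the product was bought, not how many were bought.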
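To make the 0/1 encoding concrete, here is a minimal sketch of how a few shopping baskets could be turned into such a matrix. The baskets and the helper name `encode_baskets` are made up for illustration; they do not come from the book or from the real dataset file.

```python
import numpy as np

features = ["bread", "milk", "cheese", "apple", "banana"]

def encode_baskets(baskets, features):
    """Return a samples-by-features 0/1 matrix (hypothetical helper)."""
    X = np.zeros((len(baskets), len(features)), dtype=int)
    for row, basket in enumerate(baskets):
        for item in basket:
            # 1 means "bought at least once"; repeated items change nothing
            X[row, features.index(item)] = 1
    return X

# Made-up baskets; the third one contains milk twice on purpose
baskets = [
    ["bread", "milk"],
    ["cheese", "apple", "banana"],
    ["milk", "milk", "banana"],
]
X_demo = encode_baskets(baskets, features)
print(X_demo)
```

Each row is one sample (one purchase) and each column is one feature (one product), which is the layout the rest of the post assumes.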
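Once we have the samples and features, we look for rules such as "if a customer buys X, they are likely to also buy Y".
After finding the rules we still need to judge how good they are, using two metrics: support and confidence. Support is the number of times a rule holds in the dataset, and confidence is the proportion of cases matching the premise in which the rule holds.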
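As a small illustration of the two metrics, the sketch below computes them for a single rule on a tiny made-up dataset. The data and the function name `rule_stats` are assumptions for the example. Support is treated as a raw count, as in this post, although some texts divide it by the total number of samples.

```python
import numpy as np

# Made-up data: columns are bread, milk, cheese, apple, banana
X_toy = np.array([
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 1],
    [0, 0, 1, 1, 1],
    [0, 1, 1, 0, 1],
    [0, 0, 1, 1, 0],
])

def rule_stats(X, premise, conclusion):
    """Support and confidence of 'premise -> conclusion' (hypothetical helper)."""
    has_premise = X[:, premise] == 1
    # Support: samples where both the premise and the conclusion hold
    support = int(np.sum(has_premise & (X[:, conclusion] == 1)))
    n_premise = int(np.sum(has_premise))
    # Confidence: share of premise cases in which the conclusion also holds
    confidence = support / n_premise if n_premise else 0.0
    return support, confidence

# Rule: "if a person buys cheese (column 2) they will also buy banana (column 4)"
support_cb, confidence_cb = rule_stats(X_toy, premise=2, conclusion=4)
print(support_cb, confidence_cb)
```

In this toy data cheese appears in four baskets and three of them also contain bananas, so the rule has support 3 and confidence 0.75.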
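The code is as follows:
"""
《Python数据挖掘入门与实践》
Affinity analysis
Each column of the dataset records whether an item was bought: bread, milk, cheese, apple, banana
support: the number of times a rule holds
confidence: the proportion of cases matching the premise in which the rule holds
"""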
import numpy as np
from collections import defaultdict  # dict that returns a default value (0 for int) for missing keys
from operator import itemgetter  # builds the sort key used to order dictionary items

dataset_filename = r'F:\Python\pycharm\DataAnalysis_test\data\affinity_dataset.txt'
X = np.loadtxt(dataset_filename)
# print(X[:15])  # show the first 15 rows
features = ["bread", "milk", "cheese", "apple", "banana"]  # feature names, in column order

# Count how many people bought apples
# num_apple_buy = 0
# for sample in X:
#     if sample[3] == 1:
#         num_apple_buy += 1
# print("{0} people bought Apples".format(num_apple_buy))

# Build the rule dictionaries
valid_rules = defaultdict(int)  # times each rule held
invalid_rules = defaultdict(int)  # times each rule failed
num_occurrences = defaultdict(int)  # times each premise (the "if ..." part) was true
n_features = 5  # number of features
for sample in X:
    for premise in range(n_features):
        if sample[premise] == 0:
            continue
        num_occurrences[premise] += 1  # the premise holds for this sample
        for conclusion in range(n_features):
            if premise == conclusion:
                continue
            if sample[conclusion] == 1:
                valid_rules[(premise, conclusion)] += 1  # rule held
            else:
                invalid_rules[(premise, conclusion)] += 1  # rule failed

# Compute each rule's confidence (how accurate the rule is)
# and support (how many times the rule held)
support = valid_rules
confidence = defaultdict(float)
for (premise, conclusion) in valid_rules.keys():
    rule = (premise, conclusion)
    confidence[rule] = valid_rules[rule] / num_occurrences[premise]


def print_rule(premise, conclusion, support, confidence, features):
    """Print one rule together with its confidence and support."""
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print("rule: if a person buys {0} they will also buy {1}".format(premise_name, conclusion_name))
    print("置信度confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
    print("支持度support:{0}".format(support[(premise, conclusion)]))


def best_rule():
    """Sort the rules and print the best ones."""
    sorted_support = sorted(support.items(),
                            key=itemgetter(1),  # sort by the dictionary values
                            reverse=True)  # descending order
    # The same ranking by confidence; it is computed but not used below
    sorted_confidence = sorted(confidence.items(), key=itemgetter(1), reverse=True)
    for index in range(5):  # print the five rules with the highest support
        print("RULE #{0}".format(index + 1))
        premise, conclusion = sorted_support[index][0]
        print_rule(premise, conclusion, support, confidence, features)


if __name__ == '__main__':
    premise = 2
    conclusion = 4
    # print_rule(premise, conclusion, support, confidence, features)
    best_rule()
    # print(valid_rules)
The output lists the five rules with the highest support, together with each rule's confidence:
RULE #1
rule: if a person buys cheese they will also buy banana
置信度confidence: 0.659
支持度support:27
RULE #2
rule: if a person buys banana they will also buy cheese
置信度confidence: 0.458
支持度support:27
RULE #3
rule: if a person buys apple they will also buy cheese
置信度confidence: 0.694
支持度support:25
RULE #4
rule: if a person buys cheese they will also buy apple
置信度confidence: 0.610
支持度support:25
RULE #5
rule: if a person buys banana they will also buy apple
置信度confidence: 0.356
支持度support:21
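The nested loops can also be written with matrix operations. The following is only a sketch under the assumption that the data is a 0/1 matrix like the one described in this post. The function name `all_rule_stats` and the toy data are made up, so the numbers it produces do not correspond to the output above. It also ranks the rules by confidence instead of support.

```python
import numpy as np

def all_rule_stats(X):
    """Return (support, confidence) matrices; entry [p, c] describes rule p -> c."""
    X = np.asarray(X, dtype=int)
    support = X.T @ X  # co-occurrence counts for every pair of features
    premise_counts = np.diag(support).copy()  # how often each item was bought
    np.fill_diagonal(support, 0)  # ignore trivial rules of the form X -> X
    with np.errstate(divide="ignore", invalid="ignore"):
        confidence = support / premise_counts[:, None]
    confidence = np.nan_to_num(confidence)  # items never bought get confidence 0
    return support, confidence

# Made-up data: columns are bread, milk, cheese, apple, banana
features = ["bread", "milk", "cheese", "apple", "banana"]
X_toy = np.array([
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 1],
    [0, 0, 1, 1, 1],
    [0, 1, 1, 0, 1],
    [0, 0, 1, 1, 0],
])
support_m, confidence_m = all_rule_stats(X_toy)

# Keep rules that held at least once, then rank them by confidence
n = len(features)
rules = [(p, c) for p in range(n) for c in range(n)
         if p != c and support_m[p, c] > 0]
top = sorted(rules, key=lambda r: confidence_m[r], reverse=True)[:3]
for p, c in top:
    print("{0} -> {1}: confidence {2:.3f}, support {3}".format(
        features[p], features[c], confidence_m[p, c], support_m[p, c]))
```

Ranking by confidence alone tends to favour rules whose premise is rare, so in practice it may make sense to look at both metrics side by side.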
Download link for the dataset used in this example: product purchase dataset download