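Affinity analysis with Python
Affinity analysis measures how strongly two products are related by how often they are bought together, which in turn suggests which products are worth selling side by side.
The code was written for Python 3.6.4. Before starting, install the numpy, scipy and scikit-learn libraries (the code in this post only imports numpy).
Load the dataset, affi.txt. It has been uploaded to Baidu Wenku; download it and convert it to txt format yourself: https://wenku.baidu.com/view/5ba316c9710abb68a98271fe910ef12d2af9a987
Count the number of transactions in the dataset and give the columns names. Each sample is one transaction, and sample[3] indicates whether apples were bought.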
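The loading step can be sketched as follows. The real affi.txt is not reproduced here, so the five rows below are a made-up stand-in, assuming the file is a whitespace-separated table of 0/1 flags with one transaction per row.

```python
import io

import numpy as np

# Hypothetical stand-in for affi.txt (NOT the real data): one transaction
# per row, one 0/1 flag per product.
sample_text = """\
0 0 1 1 1
1 1 0 1 0
1 0 1 1 0
0 0 1 1 1
0 1 0 0 1
"""

# np.loadtxt accepts a file name or a file-like object.
X = np.loadtxt(io.StringIO(sample_text))
n_samples, n_features = X.shape

# Column names, in the same order as the columns of X.
features = ["bread", "milk", "cheese", "apples", "bananas"]

print("This dataset has {0} samples and {1} features".format(n_samples, n_features))
print("Column 3 is {0}".format(features[3]))
```

With the real file, replace the StringIO object with the path "affi.txt".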
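Count how many transactions include apples by checking whether sample[3] is 1.
rule_valid is the number of people who bought apples (sample[3]) and also bought bananas (sample[4]); rule_invalid is the number who bought apples but did not buy bananas. Together these give one association rule between apples and bananas.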
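A minimal sketch of this counting step, run on a small made-up array rather than the real dataset:

```python
import numpy as np

# Hypothetical toy transactions (not the real affi.txt).
# Columns: bread, milk, cheese, apples, bananas
X = np.array([
    [0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
])

num_apple_purchases = 0
rule_valid = 0    # bought apples and bananas
rule_invalid = 0  # bought apples but not bananas

for sample in X:
    if sample[3] == 1:          # this person bought apples
        num_apple_purchases += 1
        if sample[4] == 1:      # ...and also bought bananas
            rule_valid += 1
        else:
            rule_invalid += 1

print("{0} people bought apples".format(num_apple_purchases))
print("{0} cases of the rule being valid".format(rule_valid))
print("{0} cases of the rule being invalid".format(rule_invalid))
```

Note that rule_valid and rule_invalid always add up to the number of apple purchases, since only transactions that satisfy the premise are counted.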
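The quality of a rule is usually measured by its support and its confidence, both of which can be computed from how often the rule occurs in the dataset. In this post, support is the number of transactions in which the rule holds, and confidence is that number divided by the number of transactions in which the premise holds.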
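The two measures can be sketched as a small helper. The function name support_and_confidence and the toy array are illustrative assumptions, not part of the original code; support is taken as a raw count, as in the full listing at the end of this post.

```python
import numpy as np


def support_and_confidence(X, premise, conclusion):
    """Return (support, confidence) for the rule 'premise -> conclusion'.

    support    = number of rows where both columns are 1
    confidence = support / number of rows where the premise column is 1
    """
    bought_premise = X[:, premise] == 1
    bought_both = bought_premise & (X[:, conclusion] == 1)
    support = int(bought_both.sum())
    n_premise = int(bought_premise.sum())
    # Guard against a premise that never occurs.
    confidence = support / n_premise if n_premise else 0.0
    return support, confidence


# Hypothetical toy transactions; columns: bread, milk, cheese, apples, bananas
X = np.array([
    [0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
])

support, confidence = support_and_confidence(X, premise=3, conclusion=4)
print("The support is {0} and the confidence is {1:.3f}".format(support, confidence))
```

Some references define support as a fraction of all transactions instead; dividing the count by the number of rows gives that variant.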
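This yields the support and confidence of the rule from apples to bananas. To collect statistics for every possible rule in the dataset, create dictionaries that store the counts for the two cases, rule valid and rule invalid. Each key is a tuple made of the premise and the conclusion. A defaultdict is used here so that looking up a key that does not exist yet returns a default value instead of raising an error.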
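The difference between a plain dict and a defaultdict can be seen in a short sketch (the keys used are arbitrary examples):

```python
from collections import defaultdict

plain_counts = {}
try:
    plain_counts[(3, 4)] += 1   # the key does not exist yet
    plain_failed = False
except KeyError:
    plain_failed = True         # a plain dict raises KeyError here

valid_rules = defaultdict(int)  # missing keys start at int() == 0
valid_rules[(3, 4)] += 1        # works without any existence check

print(plain_failed)             # True
print(valid_rules[(3, 4)])      # 1
print(valid_rules[(0, 1)])      # 0 (reading a missing key also creates it)
```

One side effect to keep in mind: merely reading a missing key inserts it into the defaultdict with the default value.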
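The computation is a nested loop over every transaction and over every feature within it.
The first feature is the premise of the rule: the customer bought a given item, sample[premise].
Check whether the customer bought that item; if not, continue and test the next premise.
If they did, add 1 to the occurrence count of that premise. While iterating, skip pairs whose premise and conclusion are the same: a rule such as "if a customer buys apples, they also buy apples"
is meaningless. If the rule holds for the transaction, increment its entry in the valid_rules dictionary; otherwise increment its entry in invalid_rules.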
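The loop described above can be sketched as a function and tried on a small made-up array. Wrapping it in count_rules is an illustrative choice; the full listing at the end of this post runs the same loop at module level.

```python
from collections import defaultdict

import numpy as np


def count_rules(X):
    """Count valid/invalid cases for every (premise, conclusion) pair."""
    n_features = X.shape[1]
    valid_rules = defaultdict(int)
    invalid_rules = defaultdict(int)
    num_occurrences = defaultdict(int)
    for sample in X:
        for premise in range(n_features):
            if sample[premise] == 0:
                continue                    # premise not bought: skip it
            num_occurrences[premise] += 1
            for conclusion in range(n_features):
                if premise == conclusion:
                    continue                # "apples -> apples" is meaningless
                if sample[conclusion] == 1:
                    valid_rules[(premise, conclusion)] += 1
                else:
                    invalid_rules[(premise, conclusion)] += 1
    return valid_rules, invalid_rules, num_occurrences


# Hypothetical toy transactions; columns: bread, milk, cheese, apples, bananas
X = np.array([
    [0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
])

valid_rules, invalid_rules, num_occurrences = count_rules(X)

# Confidence of each rule = valid cases / occurrences of its premise.
confidence = {
    rule: count / num_occurrences[rule[0]]
    for rule, count in valid_rules.items()
}

print(valid_rules[(2, 3)], confidence[(2, 3)])  # cheese -> apples
print(valid_rules[(3, 4)], confidence[(3, 4)])  # apples -> bananas
```

Every premise that appears in valid_rules has been counted in num_occurrences at least once, so the division cannot hit zero.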
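Print the support and confidence of every rule.
Print one specific rule.
Many of the rules have very low support and are of no practical use, so sort the rules by support and print the top five.
In Python 3, the dictionary method items() returns an iterable view of (key, value) tuples; it can be passed straight to sorted() or converted with list().
Print the top five rules sorted by support.
Print the top five rules sorted by confidence.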
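The sorting step can be sketched like this. The support counts below are made up for illustration and are not results from affi.txt.

```python
from operator import itemgetter

features = ["bread", "milk", "cheese", "apples", "bananas"]

# Hypothetical support counts keyed by (premise, conclusion) column indices.
support = {(0, 1): 1, (2, 3): 3, (3, 4): 2, (1, 4): 1, (4, 2): 2}

# items() yields (key, value) pairs; itemgetter(1) sorts them by the value.
sorted_support = sorted(support.items(), key=itemgetter(1), reverse=True)

top_n = 3
for index, ((premise, conclusion), count) in enumerate(sorted_support[:top_n]):
    print("Rule #{0}: if a person buys {1}, they will also buy {2} (support: {3})".format(
        index + 1, features[premise], features[conclusion], count))
```

Slicing with [:top_n] instead of indexing inside range(5) avoids an IndexError when fewer rules exist than requested. Sorting by confidence works the same way with the confidence dictionary in place of support.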
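The results show that apples, cheese and bananas are strongly associated with one another, so in practice placing these products together makes shopping more convenient for customers.
As for promotions, the rule between apples and cheese has the highest confidence, so even with a discount the sales of these two products are unlikely to rise much, because customers
already tend to buy the two together. The complete code is as follows: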
from collections import defaultdict
from operator import itemgetter
from pprint import pprint

import numpy as np

dataset_filename = "affi.txt"
X = np.loadtxt(dataset_filename)
n_samples, n_features = X.shape
print(X[:10])
print("this dataset has {0} samples and {1} features".format(n_samples, n_features))
features = ["bread", "milk", "cheese", "apples", "bananas"]  # names of the features

# example: count apple purchases
num_apple_purchase = 0
for sample in X:
    if sample[3] == 1:
        num_apple_purchase += 1
print("{0} people bought Apples".format(num_apple_purchase))

# cases where a person who bought apples did or did not also buy bananas
rule_valid = 0
rule_invalid = 0
for sample in X:
    if sample[3] == 1:
        if sample[4] == 1:
            rule_valid += 1
        else:
            rule_invalid += 1
print("{0} cases of the rule being valid were discovered".format(rule_valid))
print("{0} cases of the rule being invalid were discovered".format(rule_invalid))
print("*" * 100)

# compute the support and confidence
support = rule_valid
confidence = rule_valid / num_apple_purchase
# .3f formats the number with 3 digits after the decimal point
print("the support is {0} and the confidence is {1:.3f}".format(support, confidence))
print(confidence)
print("*" * 100)

# compute all possible rules
# defaultdict: if a key does not exist, the default value is used
valid_rules = defaultdict(int)
invalid_rules = defaultdict(int)
num_occurences = defaultdict(int)

# each rule consists of a premise and a conclusion
for sample in X:
    for premise in range(n_features):
        if sample[premise] == 0:
            continue
        num_occurences[premise] += 1
        for conclusion in range(n_features):
            if premise == conclusion:
                continue
            if sample[conclusion] == 1:
                valid_rules[(premise, conclusion)] += 1
            else:
                invalid_rules[(premise, conclusion)] += 1

support = valid_rules
confidence = defaultdict(float)
for premise, conclusion in valid_rules.keys():
    rule = (premise, conclusion)
    confidence[rule] = valid_rules[rule] / num_occurences[premise]

for premise, conclusion in confidence:
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print("Rule if person buy {0},they will also buy {1}".format(premise_name, conclusion_name))
    print("- confidence:{0:.3f}".format(confidence[(premise, conclusion)]))
    print("- support:{0}".format(support[(premise, conclusion)]))
    print("")
print("*" * 100)


# wrap the rule output in a function
def print_rule(premise, conclusion, support, confidence, features):
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print("Rule if person buy {0},they will also buy {1}".format(premise_name, conclusion_name))
    print("- confidence:{0:.3f}".format(confidence[(premise, conclusion)]))
    print("- support:{0}".format(support[(premise, conclusion)]))
    print("")


print("*" * 100)

# rule: a person who buys milk also buys apples
premise = 1
conclusion = 3
print_rule(premise, conclusion, support, confidence, features)

# sort by support
# dict.items() yields the (key, value) pairs
pprint(list(support.items()))
sorted_support = sorted(support.items(), key=itemgetter(1), reverse=True)
for index in range(5):
    print("Rule #{0}".format(index + 1))
    (premise, conclusion) = sorted_support[index][0]
    print_rule(premise, conclusion, support, confidence, features)

# sort by confidence
pprint(list(confidence.items()))
sorted_confidence = sorted(confidence.items(), key=itemgetter(1), reverse=True)
for index in range(5):
    print("Rule #{0}".format(index + 1))
    (premise, conclusion) = sorted_confidence[index][0]
    print_rule(premise, conclusion, support, confidence, features)