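Affinity analysis:
A common application of data mining is this: when a customer buys one product, the retailer can take the opportunity to learn what else they want to buy, and then sell together the products that most customers are willing to buy at the same time, which raises sales. Once the retailer has collected enough data, it can run an affinity analysis on it to determine which products are suited to being sold together.
In essence: affinity analysis determines how closely related sample individuals (objects) are, based on the similarity between them.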
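The example below looks at 5 products and asks how likely it is that a person who buys one of them will also buy another.
To uncover those likelihoods, we first have to create rules.
If a customer buys product X, then they may also be willing to buy product Y (this is the form our rules take).
Every rule has a support and a confidence.
Support: how often the rule holds in the data. For example, if there are 10 people in total and 5 of them like apples, the support of apples is 5/10. Note that the code in this post records support as a raw count (the number of samples in which the rule holds); dividing that count by the number of samples gives the ratio.
Confidence: how accurate the rule is, that is, how likely the rule is to be correct whenever its premise applies.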
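To make the two definitions concrete, here is a minimal sketch on six made-up shoppers (the `toy` rows are invented for illustration and are not taken from the dataset). It checks the rule "if a shopper buys apples, they also buy bananas" and reports support both as a count and as a ratio.

```python
# Toy data: each row is one shopper, columns are (apples, bananas); 1 = bought.
# These six rows are made up for illustration only.
toy = [
    (1, 1),
    (1, 0),
    (1, 1),
    (0, 1),
    (1, 1),
    (0, 0),
]

# Rule under test: "if a shopper buys apples, they also buy bananas".
premise_count = sum(1 for apples, bananas in toy if apples == 1)
rule_count = sum(1 for apples, bananas in toy if apples == 1 and bananas == 1)

support_count = rule_count                      # rows in which the rule holds
support_ratio = rule_count / float(len(toy))    # the same thing as a fraction of all rows
confidence = rule_count / float(premise_count)  # how often the rule holds when the premise does

print(support_count, support_ratio, confidence)
```

Four shoppers bought apples and three of those also bought bananas, so the support is 3 (or 3/6 as a ratio) and the confidence is 3/4.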
The code is as follows:
#coding:utf-8
import numpy as np
dataset_filename = "affinity_dataset.txt"  # the sample set to load
X = np.loadtxt(dataset_filename)
n_samples, n_features = X.shape  # n_samples: number of samples; n_features: number of features per sample
#print(n_features, n_samples)
# The names of the features, for your reference.
features = ["bread", "milk", "cheese", "apples", "bananas"]
print(X[:5])
[[0. 0. 1. 1. 1.]
[1. 1. 0. 1. 0.]
[1. 0. 1. 1. 0.]
[0. 0. 1. 1. 1.]
[0. 1. 0. 0. 1.]]
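The post does not show the contents of affinity_dataset.txt. Judging from the `np.loadtxt` call and the rows printed above, it is presumably a whitespace-separated grid of 0s and 1s, one shopper per line. Under that assumption, the sketch below writes a random stand-in file and loads it the same way; `X_demo`, `demo_path` and the file name `affinity_demo.txt` are hypothetical names used only here.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for affinity_dataset.txt: 100 shoppers x 5 products,
# each cell drawn independently as 0 or 1. Real purchase data would not be
# independent like this; the point is only the file layout.
rng = np.random.RandomState(14)
X_demo = rng.randint(0, 2, size=(100, 5))

demo_path = os.path.join(tempfile.mkdtemp(), "affinity_demo.txt")
np.savetxt(demo_path, X_demo, fmt="%d")  # one row per line, space-separated

X_loaded = np.loadtxt(demo_path)  # loadtxt returns floats by default
print(X_loaded.shape)
print(X_loaded[:5])
```

Because `np.loadtxt` returns floating-point values by default, the rows print as `0.` and `1.`, which matches the style of the output above.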
# First, count how many rows contain our premise: that a person bought apples
# (this count is the basis for the support of apples).
num_apple_purchases = 0
for sample in X:
    if sample[3] == 1:  # This person bought apples
        num_apple_purchases += 1
# Support of apples as a ratio = num_apple_purchases / n_samples
print("{0} people bought Apples".format(num_apple_purchases))
36 people bought Apples
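The same count can be obtained without an explicit loop by comparing a whole column at once. The sketch below uses only the five rows printed earlier (so it reports 4, not 36); `X_small` and `apples_col` are names introduced here for illustration.

```python
import numpy as np

# The five rows printed earlier
# (columns: bread, milk, cheese, apples, bananas).
X_small = np.array([
    [0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
], dtype=float)

apples_col = 3
# Boolean mask of the rows where apples were bought, summed to get the count.
num_apple_purchases = int((X_small[:, apples_col] == 1).sum())
apple_ratio = num_apple_purchases / float(X_small.shape[0])

print("{0} of {1} people bought apples ({2:.0%})".format(
    num_apple_purchases, X_small.shape[0], apple_ratio))
```

On the full array `X` the same expression, `(X[:, 3] == 1).sum()`, should give the count produced by the loop.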
# Keep the rule counts and the number of occurrences of each feature in integer defaultdicts
from collections import defaultdict
valid_rules = defaultdict(int)
invalid_rules = defaultdict(int)
num_occurances = defaultdict(int)
# Walk through the samples and count, for every pair of features,
# how often the rule holds and how often it fails
for sample in X:
    for premise in range(n_features):
        if sample[premise] == 0:
            continue
        num_occurances[premise] += 1
        for conclusion in range(n_features):
            if premise == conclusion:  # skip rules whose premise and conclusion are the same, e.g. "people who buy apples also buy apples"
                continue
            if sample[conclusion] == 1:
                valid_rules[(premise, conclusion)] += 1
            else:
                invalid_rules[(premise, conclusion)] += 1
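# The support of a rule is the number of times it holds
support = valid_rules
confidence = defaultdict(float)
# Compute the confidence of each rule
for premise, conclusion in valid_rules.keys():
    rule = (premise, conclusion)  # feature indices of the premise and the conclusion
    confidence[rule] = float(valid_rules[rule]) / num_occurances[premise]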
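As a cross-check on the nested loops, the same counts can be written as a matrix product. This is an alternative formulation, not the approach used in this post: for a 0/1 matrix, entry (i, j) of the transpose times the matrix is the number of rows in which items i and j were both bought. The sketch runs on the five rows printed earlier; `X_small`, `co_counts`, `num_occurrences` and `confidence_matrix` are names introduced here.

```python
import numpy as np

# The five rows printed earlier
# (columns: bread, milk, cheese, apples, bananas).
X_small = np.array([
    [0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
], dtype=float)

# co_counts[i, j] = number of rows where both item i and item j were bought.
co_counts = X_small.T.dot(X_small)
# The diagonal holds how often each item was bought on its own terms.
num_occurrences = np.diag(co_counts)

# confidence_matrix[i, j] = co_counts[i, j] / num_occurrences[i];
# the row index is the premise, the column index is the conclusion.
# This assumes every item was bought at least once (no division by zero).
confidence_matrix = co_counts / num_occurrences[:, np.newaxis]
np.fill_diagonal(confidence_matrix, 0)  # ignore trivial rules such as apples -> apples

print(co_counts[3, 2], confidence_matrix[3, 2])
```

In these five rows, apples and cheese were bought together 3 times and apples were bought 4 times, so the rule apples -> cheese has support 3 and confidence 0.75.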
Print all the rules:
for premise, conclusion in confidence:
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print("Rule: If a person buys {0} they will also buy {1}".format(premise_name, conclusion_name))
    print(" - Support: {0}".format(support[(premise, conclusion)]))
    print(" - Confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
Rule: If a person buys bread they will also buy milk
 - Support: 14
 - Confidence: 0.519
Rule: If a person buys milk they will also buy cheese
 - Support: 7
 - Confidence: 0.152
Rule: If a person buys apples they will also buy cheese
 - Support: 25
 - Confidence: 0.694
Rule: If a person buys milk they will also buy apples
 - Support: 9
 - Confidence: 0.196
Rule: If a person buys bread they will also buy apples
 - Support: 5
 - Confidence: 0.185
Rule: If a person buys apples they will also buy bread
 - Support: 5
 - Confidence: 0.139
Rule: If a person buys apples they will also buy bananas
 - Support: 21
 - Confidence: 0.583
Rule: If a person buys apples they will also buy milk
 - Support: 9
 - Confidence: 0.250
Rule: If a person buys milk they will also buy bananas
 - Support: 19
 - Confidence: 0.413
Rule: If a person buys cheese they will also buy bananas
 - Support: 27
 - Confidence: 0.659
Rule: If a person buys cheese they will also buy bread
 - Support: 4
 - Confidence: 0.098
Rule: If a person buys cheese they will also buy apples
 - Support: 25
 - Confidence: 0.610
Rule: If a person buys cheese they will also buy milk
 - Support: 7
 - Confidence: 0.171
Rule: If a person buys bananas they will also buy apples
 - Support: 21
 - Confidence: 0.356
Rule: If a person buys bread they will also buy bananas
 - Support: 17
 - Confidence: 0.630
Rule: If a person buys bananas they will also buy cheese
 - Support: 27
 - Confidence: 0.458
Rule: If a person buys milk they will also buy bread
 - Support: 14
 - Confidence: 0.304
Rule: If a person buys bananas they will also buy milk
 - Support: 19
 - Confidence: 0.322
Rule: If a person buys bread they will also buy cheese
 - Support: 4
 - Confidence: 0.148
Rule: If a person buys bananas they will also buy bread
 - Support: 17
 - Confidence: 0.288
# Print the support and confidence of one specific rule
def print_rule(premise, conclusion, support, confidence, features):
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print("Rule: If a person buys {0} they will also buy {1}".format(premise_name, conclusion_name))
    print(" - Support: {0}".format(support[(premise, conclusion)]))
    print(" - Confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
Print a specific rule:
premise = 1
conclusion = 3
print_rule(premise, conclusion, support, confidence, features)
Rule: If a person buys milk they will also buy apples
 - Support: 9
 - Confidence: 0.196
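# Sort the rules by support (the number of samples in which each rule holds), from highest to lowest
from operator import itemgetter
# support.items() yields the dictionary's (key, value) pairs: [(), ()]
# itemgetter(1) sorts by the dictionary's values (not its keys), i.e. by support
# reverse=True reverses the order, giving descending order (the default is ascending)
sorted_support = sorted(support.items(), key=itemgetter(1), reverse=True)
print(support)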
defaultdict(<type 'int'>, {(0, 1): 14, (1, 2): 7, (3, 2): 25, (1, 3): 9, (3, 0): 5, (4, 1): 19, (3, 1): 9, (1, 4): 19, (0, 2): 4, (2, 0): 4, (2, 3): 25, (2, 1): 7, (4, 3): 21, (0, 4): 17, (1, 0): 14, (4, 2): 27, (0, 3): 5, (3, 4): 21, (2, 4): 27, (4, 0): 17})
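To see what the `sorted` call does in isolation, here is a small sketch on four entries copied from the dictionary printed above. `toy_support` and `by_value_desc` are names introduced here.

```python
from operator import itemgetter

# Four entries copied from the printed support dictionary,
# keyed by (premise index, conclusion index).
toy_support = {(0, 1): 14, (2, 4): 27, (3, 2): 25, (0, 2): 4}

# items() gives (key, value) pairs; itemgetter(1) picks the value as the sort key.
by_value_desc = sorted(toy_support.items(), key=itemgetter(1), reverse=True)
print(by_value_desc)

# Each element is ((premise, conclusion), count), so the rule is at index 0.
top_rule, top_count = by_value_desc[0]
print(top_rule, top_count)
```

Since `sorted` is stable, rules with equal support (such as the two rules with a count of 27 in the full dictionary) keep the relative order in which the dictionary yields them.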
Print the 5 rules with the highest support:
for index in range(5):
    print("Rule #{0}".format(index + 1))
    premise, conclusion = sorted_support[index][0]
    print_rule(premise, conclusion, support, confidence, features)
Rule #1
Rule: If a person buys bananas they will also buy cheese
 - Support: 27
 - Confidence: 0.458
Rule #2
Rule: If a person buys cheese they will also buy bananas
 - Support: 27
 - Confidence: 0.659
Rule #3
Rule: If a person buys apples they will also buy cheese
 - Support: 25
 - Confidence: 0.694
Rule #4
Rule: If a person buys cheese they will also buy apples
 - Support: 25
 - Confidence: 0.610
Rule #5
Rule: If a person buys bananas they will also buy apples
 - Support: 21
 - Confidence: 0.356
# Sort the rules by confidence, from highest to lowest
sorted_confidence = sorted(confidence.items(), key=itemgetter(1), reverse=True)
Print the 5 rules with the highest confidence:
for index in range(5):
    print("Rule #{0}".format(index + 1))
    premise, conclusion = sorted_confidence[index][0]
    print_rule(premise, conclusion, support, confidence, features)
Rule #1
Rule: If a person buys apples they will also buy cheese
 - Support: 25
 - Confidence: 0.694
Rule #2
Rule: If a person buys cheese they will also buy bananas
 - Support: 27
 - Confidence: 0.659
Rule #3
Rule: If a person buys bread they will also buy bananas
 - Support: 17
 - Confidence: 0.630
Rule #4
Rule: If a person buys cheese they will also buy apples
 - Support: 25
 - Confidence: 0.610
Rule #5
Rule: If a person buys apples they will also buy bananas
 - Support: 21
 - Confidence: 0.583
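Judging from the sorted results, the two rules "if a customer buys apples, they will also buy cheese" and "if a customer buys cheese, they will also buy bananas" both have high support and high confidence. A supermarket manager can use rules like these to adjust where products are placed. For example, if apples are on promotion this week, put cheese next to them. Promoting bananas and cheese at the same time, however, makes little sense: we found that close to 66% of the customers who buy cheese buy bananas even without a promotion, so a promotion would not lift sales by much.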
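One way to pick out rules that score well on both measures is to apply a minimum support and a minimum confidence together. The sketch below does this on five rules copied from the output above; the cut-offs `min_support = 20` and `min_confidence = 0.6` are hypothetical values chosen only for illustration, and `rules` and `strong_rules` are names introduced here.

```python
# A few (support, confidence) pairs copied from the printed rules,
# keyed by (premise, conclusion) names for readability.
rules = {
    ("apples", "cheese"): (25, 0.694),
    ("cheese", "bananas"): (27, 0.659),
    ("bread", "bananas"): (17, 0.630),
    ("bananas", "cheese"): (27, 0.458),
    ("cheese", "bread"): (4, 0.098),
}

# Hypothetical cut-offs; suitable values depend on the data and the question asked.
min_support = 20
min_confidence = 0.6

# Keep rules that pass both thresholds, ordered by confidence (highest first).
strong_rules = sorted(
    (rule for rule, (sup, conf) in rules.items()
     if sup >= min_support and conf >= min_confidence),
    key=lambda rule: rules[rule][1],
    reverse=True,
)

for premise_name, conclusion_name in strong_rules:
    print("{0} -> {1}".format(premise_name, conclusion_name))
```

With these cut-offs only apples -> cheese and cheese -> bananas remain, the same two rules singled out in the discussion above.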
Reference book: Learning Data Mining with Python (Chinese edition: Python数据挖掘入门与实践)