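Chapter 1: Getting Started with Data Mining
1.3 Affinity Analysis
1.3.1 What is affinity analysis?
In short, affinity analysis estimates how likely a customer who bought A is to also buy B, based on the relationships found in earlier purchases of A and B.
This section relies on two concepts, support and confidence; both are introduced through the examples.
1.3.2 Case study
The data is a list of purchases made by different customers, covering five products: bread, milk, cheese, apples, and bananas. It is stored in a two-dimensional array: each row is one customer's purchase record, each column corresponds to one product, and each entry says whether that customer bought that product (1 means bought, 0 means not bought).
From this data we want to work out a sensible promotion plan, that is, which two products are most likely to sell together when placed side by side.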
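To make the layout concrete, here is a minimal sketch of such a table in plain Python. The three rows are copied from the first rows of the dataset output shown later in this section; the helper name `basket_items` is an assumption made for illustration.

```python
# Column order assumed to match the product list above.
features = ["bread", "milk", "cheese", "apples", "bananas"]

# One row per purchase record, one column per product (1 = bought, 0 = not bought).
toy = [
    [0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
]

def basket_items(row, names):
    """Hypothetical helper: return the product names bought in one row."""
    return [name for name, bought in zip(names, row) if bought == 1]

first_basket = basket_items(toy[0], features)
print(first_basket)
```

Reading a row this way is only for inspection; the analysis below works on the 0/1 values directly.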
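PS: I am not using much of the book's technical terminology. First, I am new to this myself and have not memorized the terms yet; second, it is not necessary, since they will stick by the end anyway and there is no rush. Experts, please go easy on me.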
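1.3.3 Code analysis and implementation
Example 1: print the first five rows of the dataset (loading the data with NumPy)
Code:
import numpy as np
# Load the whitespace-separated 0/1 matrix from the text file
dataset_filename = "affinity_dataset.txt"
X = np.loadtxt(dataset_filename)
# Rows are purchase records (samples), columns are products (features)
n_samples, n_features = X.shape
print("This dataset has {0} samples and {1} features".format(n_samples, n_features))
print(X[:5])
Output: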
[[ 0. 0. 1. 1. 1.]
[ 1. 1. 0. 1. 0.]
[ 1. 0. 1. 1. 0.]
[ 0. 0. 1. 1. 1.]
[ 0. 1. 0. 0. 1.]]
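Summary:
This simply loads the dataset with NumPy and prints the first 5 rows.
Example 2: count the number of people who bought apples
Code:
# Names of the features (columns)
features = ["bread", "milk", "cheese", "apples", "bananas"]
# Count the people who bought apples
num_apple_purchases = 0
for sample in X:
    if sample[3] == 1:  # This person bought apples
        num_apple_purchases += 1
# Print the result
print("{0} people bought Apples".format(num_apple_purchases))
Output: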
36 people bought Apples
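The same count can be obtained without an explicit loop. The following is a minimal sketch on a small hand-made array (an assumption for illustration, not the real `affinity_dataset.txt`), comparing the loop with a vectorized NumPy version.

```python
import numpy as np

# Toy data for illustration; column 3 is "apples" as in the features list.
toy = np.array([
    [0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1],
])

# Loop version, mirroring the example above.
loop_count = 0
for sample in toy:
    if sample[3] == 1:
        loop_count += 1

# Vectorized version: compare the whole column at once and count the True values.
vectorized_count = int((toy[:, 3] == 1).sum())

print(loop_count, vectorized_count)
```

Both approaches give the same number; the vectorized form is usually shorter and faster on large arrays.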
Example 3: count the people who bought apples and also bought bananas
Code:
rule_valid = 0
rule_invalid = 0
for sample in X:
    if sample[3] == 1:  # This person bought Apples
        if sample[4] == 1:
            # This person bought both Apples and Bananas
            rule_valid += 1
        else:
            # This person bought Apples, but not Bananas
            rule_invalid += 1
print("{0} cases of the rule being valid were discovered".format(rule_valid))
print("{0} cases of the rule being invalid were discovered".format(rule_invalid))
Output:
21 cases of the rule being valid were discovered
15 cases of the rule being invalid were discovered
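Summary:
This is a rule of the form premise -> conclusion. The nested if/else syntax is similar to Java.
Example 4: compute support and confidence from the example above
Support:
The number of times the rule holds in the dataset, i.e. the premise and the conclusion are both true.
In the example above the support is 21 (the premise is "bought apples", the conclusion is "bought bananas").
Confidence:
The number of times the rule holds divided by the number of times the premise occurs, i.e. how accurate the rule is.
E.g. 21/36
Code:
# Support and confidence
support = rule_valid
confidence = rule_valid / num_apple_purchases
print("The support is {0} and the confidence is {1:.3f}.".format(support, confidence))
# Show the confidence as a percentage
print("As a percentage, that is {0:.1f}%.".format(100 * confidence))
Output: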
The support is 21 and the confidence is 0.583.
As a percentage, that is 58.3%.
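Examples 3 and 4 hard-code apples and bananas. As a minimal sketch, the same counting can be wrapped in a function that accepts any premise and conclusion column. The function name `rule_stats` is an assumption for illustration, and the toy data is just the five rows printed in Example 1, so the numbers differ from the full-dataset results.

```python
# The five rows printed in Example 1, as plain integers.
toy = [
    [0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
]

def rule_stats(data, premise, conclusion):
    """Hypothetical helper: return (support, confidence) for premise -> conclusion."""
    premise_count = 0  # rows where the premise item was bought
    support = 0        # rows where both premise and conclusion were bought
    for row in data:
        if row[premise] == 1:
            premise_count += 1
            if row[conclusion] == 1:
                support += 1
    # Avoid dividing by zero when the premise never occurs.
    confidence = support / premise_count if premise_count else 0.0
    return support, confidence

# apples (column 3) -> bananas (column 4)
support, confidence = rule_stats(toy, 3, 4)
print(support, confidence)
```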
Worked example 1: compute support and confidence for every rule
Code:
from collections import defaultdict
# Counts of rule hits, rule misses, and premise occurrences.
# defaultdict(int) avoids a KeyError the first time a key is seen.
valid_rules = defaultdict(int)
invalid_rules = defaultdict(int)
num_occurences = defaultdict(int)
for sample in X:
    # Premise
    for premise in range(n_features):
        # This person did not buy the premise item, so skip it
        if sample[premise] == 0:
            continue
        # Record one more occurrence of the premise
        num_occurences[premise] += 1
        # Conclusion
        for conclusion in range(n_features):
            # "Buys apples -> buys apples" (X -> X) is meaningless
            if premise == conclusion:
                continue
            if sample[conclusion] == 1:
                # Bought the premise and the conclusion: the rule holds once
                valid_rules[(premise, conclusion)] += 1
            else:
                # Bought the premise but not the conclusion: the rule fails once
                invalid_rules[(premise, conclusion)] += 1
# Support and confidence
support = valid_rules
confidence = defaultdict(float)
# Look up each rule by its (premise, conclusion) key
for premise, conclusion in valid_rules.keys():
    confidence[(premise, conclusion)] = valid_rules[(premise, conclusion)] / num_occurences[premise]
# Print the results
for premise, conclusion in confidence:
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print("Rule: If a person buys {0} they will also buy {1}".format(premise_name, conclusion_name))
    print(" - Confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
    print(" - Support: {0}".format(support[(premise, conclusion)]))
    print("")
Output (partial):
Rule: If a person buys bread they will also buy milk
- Confidence: 0.519
- Support: 14
Rule: If a person buys milk they will also buy cheese
- Confidence: 0.152
- Support: 7
Rule: If a person buys apples they will also buy cheese
- Confidence: 0.694
- Support: 25
Rule: If a person buys milk they will also buy apples
- Confidence: 0.196
- Support: 9
Rule: If a person buys bread they will also buy apples
- Confidence: 0.185
- Support: 5
Summary:
Everything is computed directly from the definitions; the main thing to watch is the Python syntax.
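As an alternative to the nested loops (not the approach used in the code above), the pair counts can be obtained with one matrix product: for a 0/1 matrix, entry (i, j) of `X.T @ X` is the number of rows where both item i and item j were bought, and the diagonal holds how often each item occurs. A minimal sketch on the five rows printed in Example 1; the variable names are assumptions for illustration.

```python
import numpy as np

# The five rows printed in Example 1, as integers.
toy = np.array([
    [0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
])

# cooccurrence[i, j] = number of rows where items i and j were both bought.
cooccurrence = toy.T @ toy

# The diagonal counts how often each item appears (the premise counts).
premise_counts = np.diag(cooccurrence)

# confidence[i, j] = support of rule i -> j divided by occurrences of i.
# Assumes every item occurs at least once; otherwise this divides by zero.
confidence_matrix = cooccurrence / premise_counts[:, np.newaxis]

# apples (3) -> bananas (4)
print(cooccurrence[3, 4], confidence_matrix[3, 4])
```

The diagonal of `confidence_matrix` corresponds to the meaningless X -> X rules and should be ignored.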
Worked example 2: define a printing function
Code:
def print_rule(premise, conclusion, support, confidence, features):
    # Translate the column indices into product names
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    # Print the rule with its confidence and support
    print("Rule: If a person buys {0} they will also buy {1}".format(premise_name, conclusion_name))
    print(" - Confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
    print(" - Support: {0}".format(support[(premise, conclusion)]))
    print("")
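Worked example 3: sort to find the best rules, by support and by confidence
Notes:
We sort the support dictionary and the confidence dictionary.
items() returns all (key, value) pairs of a dictionary.
itemgetter(1) is used as the sort key; it picks the element at index 1 of each pair, i.e. the value (the support or the confidence).
reverse=True sorts in descending order.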
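To see these three pieces in isolation, here is a minimal sketch that sorts a small dictionary. The three support values are taken from the outputs shown in this section; the variable names `toy_support` and `ranked` are assumptions for illustration.

```python
from operator import itemgetter

# (premise, conclusion) index pairs mapped to support counts.
# (2, 4) = cheese -> bananas, (3, 4) = apples -> bananas, (0, 1) = bread -> milk
toy_support = {(3, 4): 21, (2, 4): 27, (0, 1): 14}

# items() yields (key, value) pairs; itemgetter(1) selects the value of each pair,
# and reverse=True puts the largest value first.
ranked = sorted(toy_support.items(), key=itemgetter(1), reverse=True)

for (premise, conclusion), count in ranked:
    print(premise, conclusion, count)
```

Each element of `ranked` is a `((premise, conclusion), value)` tuple, which is why the code below reads the rule from index `[0]` of each entry.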
Code:
# Sort by support
from operator import itemgetter
sorted_support = sorted(support.items(), key=itemgetter(1), reverse=True)
# Print the top five rules by support
for index in range(5):
    print("Rule #{0}".format(index + 1))
    (premise, conclusion) = sorted_support[index][0]
    print_rule(premise, conclusion, support, confidence, features)
# Sort by confidence
sorted_confidence = sorted(confidence.items(), key=itemgetter(1), reverse=True)
# Print the top five rules by confidence
for index in range(5):
    print("Rule #{0}".format(index + 1))
    (premise, conclusion) = sorted_confidence[index][0]
    print_rule(premise, conclusion, support, confidence, features)
Output:
By support:
Rule #1
Rule: If a person buys cheese they will also buy bananas
- Confidence: 0.659
- Support: 27
Rule #2
Rule: If a person buys bananas they will also buy cheese
- Confidence: 0.458
- Support: 27
Rule #3
Rule: If a person buys apples they will also buy cheese
- Confidence: 0.694
- Support: 25
Rule #4
Rule: If a person buys cheese they will also buy apples
- Confidence: 0.610
- Support: 25
Rule #5
Rule: If a person buys bananas they will also buy apples
- Confidence: 0.356
- Support: 21
By confidence:
Rule #1
Rule: If a person buys apples they will also buy cheese
- Confidence: 0.694
- Support: 25
Rule #2
Rule: If a person buys cheese they will also buy bananas
- Confidence: 0.659
- Support: 27
Rule #3
Rule: If a person buys bread they will also buy bananas
- Confidence: 0.630
- Support: 17
Rule #4
Rule: If a person buys cheese they will also buy apples
- Confidence: 0.610
- Support: 25
Rule #5
Rule: If a person buys apples they will also buy bananas
- Confidence: 0.583
- Support: 21
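1.3.4 Takeaways
Confidence shows that some products have a high chance of being bought together, for example apples and cheese. We could therefore place cheese next to apples that are on promotion: customers who buy apples are then likely to pick up cheese as well (69.4% of apple buyers in this dataset did), so promoting apples also raises cheese sales.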
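The decision described above amounts to picking, for a given premise, the conclusion with the highest confidence. A minimal sketch, using only the two confidence values for the premise "apples" that appear in the outputs above; the names `apple_rules` and `best_companion` are assumptions for illustration.

```python
# Confidence of rules "apples -> X", copied from the outputs in this section.
apple_rules = {"cheese": 0.694, "bananas": 0.583}

# Choose the product whose rule has the highest confidence.
best_companion = max(apple_rules, key=apple_rules.get)

print("Place {0} next to apples ({1:.1f}% confidence)".format(
    best_companion, 100 * apple_rules[best_companion]))
```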