Applications of affinity analysis include fraud detection, customer segmentation, software optimization, product recommendation, and more.
1. Overview of the Apriori algorithm
Apriori is the classic affinity-analysis algorithm. It builds frequent itemsets only from items that frequently occur together in the dataset; once the frequent itemsets have been found, generating association rules from them is straightforward.
A frequent itemset is a set of items that reaches the minimum support; an association rule consists of a premise and a conclusion.
How Apriori works: first, it guarantees that a rule has enough support in the dataset. For example, to generate the frequent itemset (A, B) with a minimum support of 30, both A and B must each appear in the dataset at least 30 times.
Larger frequent itemsets obey the same constraint: to generate the frequent itemset (A, B, C, D), the subset (A, B, C) must itself be frequent (and D must also meet the minimum support threshold). Once the frequent itemsets have been generated, itemsets that cannot be frequent (and there are many of them) are never considered again, which greatly reduces the time needed to test new rules.
Apriori algorithm steps:
(1) Put each item into an itemset containing only itself, producing the initial frequent itemsets. Use only items that reach the minimum support.
(2) Build supersets of the existing frequent itemsets to discover new candidate itemsets.
(3) Test how frequent each newly generated candidate itemset is; discard those that are not frequent enough. If no new frequent itemsets were found, jump to the last step.
(4) Store the newly discovered frequent itemsets and go back to step (2).
(5) Return all frequent itemsets that were discovered.
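The steps above can be sketched on a toy transaction list. The transactions and the minimum support value here are made up purely for illustration; the candidate-generation step is simplified (it extends each itemset with single frequent items rather than joining pairs of (k-1)-itemsets):

```python
# Toy transactions and minimum support (both illustrative assumptions)
transactions = [frozenset("ABC"), frozenset("AB"), frozenset("AC"), frozenset("BD")]
min_support = 2

def support(itemset):
    # Count the transactions that contain every item in the itemset
    return sum(1 for t in transactions if itemset <= t)

# Step (1): length-1 frequent itemsets
items = {i for t in transactions for i in t}
frequent = {1: {frozenset((i,)) for i in items if support(frozenset((i,))) >= min_support}}

# Steps (2)-(4): grow supersets from the previous level, keep only the frequent ones
k = 2
while frequent[k - 1]:
    candidates = {a | b for a in frequent[k - 1] for b in frequent[1] if not b <= a}
    frequent[k] = {c for c in candidates if support(c) >= min_support}
    k += 1

# Step (5): collect every frequent itemset that was found
all_frequent = [s for level in frequent.values() for s in level]
```

With these toy transactions, {A, B} and {A, C} come out frequent while {B, C} does not, since only one transaction contains both B and C.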
2. Implementing movie recommendations with Apriori
Loading the dataset
Dataset: http://grouplens.org/datasets/movielens/
import pandas as pd
import sys

all_ratings = pd.read_csv('./ml-100k/u.data',
                          delimiter='\t',
                          header=None,
                          names=["UserID", "MovieID", "Rating", "Datetime"])
# Parse the Unix timestamps with pd.to_datetime
all_ratings["Datetime"] = pd.to_datetime(all_ratings['Datetime'], unit='s')
print(all_ratings[:5])
The goal of this chapter's data mining is to generate rules of the form: if a user likes certain movies, they will also like this movie. As an extension of such rules, we will also ask whether users who like a given set of movies like some other movie. We create a new feature, Favorable, whose value is True when the user liked the movie.
# 1. Create the new feature Favorable
all_ratings["Favorable"] = all_ratings["Rating"] > 3
print(all_ratings[10:15])
# 2. Take a subset of the data (the first 200 UserIDs) as the training set
ratings = all_ratings[all_ratings["UserID"].isin(range(200))]
# Build a dataset containing only the rows where the user liked the movie
favorable_ratings = ratings[ratings["Favorable"]]
# 3. Build the dictionary favorable_reviews_by_users, recording which movies each user liked
# v.values is stored as a frozenset, so we can quickly test whether a user rated a given movie
favorable_reviews_by_users = dict((k, frozenset(v.values)) for k, v in favorable_ratings.groupby("UserID")["MovieID"])
"""
用户给哪些电影打分情况:favorable_reviews_by_users
用户id为1的为电影id为1, 3, 6, 7, 9, 12,...的打分(喜欢它)...
{1: frozenset({1, 3, 6, 7, 9, 12, ...}),
2: frozenset({257, 1, 13, 14, 269, ...}),
3: frozenset({320, 321, 260, 327, ...}),
...
}
"""
# 4. Build the DataFrame num_favorable_by_movie to see how many fans each movie has;
#    the summed "Favorable" column is each movie's support
num_favorable_by_movie = ratings[["MovieID", "Favorable"]].groupby("MovieID").sum()
num_favorable_by_movie now holds one row per MovieID with the number of users who rated it favorably.
Implementing Apriori
2.1 First step of the Apriori algorithm: generate a singleton itemset for each movie and test whether it is frequent. The movie IDs are wrapped in frozensets because we need set operations later, and because frozensets, unlike plain sets, can be used as dictionary keys.
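A quick illustration of why frozenset is used here: unlike a plain set, it is immutable and hashable, so it supports the usual set operations and can also serve as a dictionary key (the movie IDs below are arbitrary):

```python
movies = frozenset((1, 50))

# frozenset supports the usual set operations...
assert movies | frozenset((7,)) == frozenset((1, 7, 50))
assert frozenset((1,)).issubset(movies)

# ...and, being immutable, it can key a dictionary; a plain set cannot
counts = {movies: 2}
try:
    {set((1, 50)): 2}
except TypeError:
    print("plain sets are unhashable")
```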
# Dictionary for the frequent itemsets found so far (keyed by itemset length), and the minimum support
frequent_itemsets = {}
min_support = 50
# 1. Generate a singleton itemset for each movie and keep it only if it is frequent enough.
#    frozenset is used because we need set operations later and because, unlike a plain set,
#    it can be used as a dictionary key.
frequent_itemsets[1] = dict((frozenset((movie_id,)), row["Favorable"])
                            for movie_id, row in num_favorable_by_movie.iterrows() if row["Favorable"] > min_support)
"""
频繁项集保存 frequent_itemsets: 字典嵌套字典,内字典的键是frozenset类型。
项集长度为1的,电影id为1的有66人喜欢,电影id为7的有67人喜欢...
{1: {frozenset({1}): 66,
frozenset({7}): 67,
frozenset({9}): 53,
frozentset({50}): 100,
frozenset({56}): 67,
...}
}
"""
2.2 Next, a function implements steps (2) and (3): it takes the newly discovered frequent itemsets, builds their supersets, and tests how frequent they are.
# 2. Function implementing steps 2 and 3: take the newly discovered frequent itemsets, build supersets,
#    and test their frequency. (Arguments: the users' favorable reviews, the frequent itemsets of
#    length k-1, and the minimum support.)
from collections import defaultdict
def find_frequent_itemsets(favorable_reviews_by_users, k_1_itemsets, min_support):
    counts = defaultdict(int)
    for user, reviews in favorable_reviews_by_users.items():  # e.g. user 1, reviews frozenset({1, 3, 6, 7, 9, 12, ...})
        for itemset in k_1_itemsets:  # each frequent itemset of length k-1, e.g. frozenset({1})
            if itemset.issubset(reviews):
                # Extend the itemset with every other movie this user liked
                for other_reviewed_movie in reviews - itemset:
                    current_superset = itemset | frozenset((other_reviewed_movie,))  # e.g. {1} | {7} -> {1, 7}
                    # Count each superset to measure how frequent it is
                    counts[current_superset] += 1
    return dict([(itemset, frequency) for itemset, frequency in counts.items() if frequency >= min_support])
for k in range(2, 20):
    cur_frequent_itemsets = find_frequent_itemsets(favorable_reviews_by_users, frequent_itemsets[k-1], min_support)
    frequent_itemsets[k] = cur_frequent_itemsets
    if len(cur_frequent_itemsets) == 0:
        print("Did not find any frequent itemsets of length {}".format(k))
        sys.stdout.flush()
        break
    else:
        print("I found {} frequent itemsets of length {}".format(len(cur_frequent_itemsets), k))
        # Sort the frequent itemsets by support and print the top 5
        # cur_frequent_itemsets = sorted(cur_frequent_itemsets.items(), key=lambda x: x[1], reverse=True)
        # print(cur_frequent_itemsets[:5])
"""
I found 93 frequent itemsets of length 2
[(frozenset({50, 181}), 138), (frozenset({50, 174}), 128), (frozenset({50, 100}), 110), (frozenset({50, 172}), 108), (frozenset({172, 174}), 104)]
I found 295 frequent itemsets of length 3
[(frozenset({50, 172, 174}), 147), (frozenset({50, 181, 174}), 144), (frozenset({50, 172, 181}), 135), (frozenset({56, 50, 174}), 132), (frozenset({50, 174, 98}), 126)]
I found 593 frequent itemsets of length 4
[(frozenset({50, 172, 181, 174}), 164), (frozenset({56, 50, 172, 174}), 136), (frozenset({56, 50, 174, 98}), 132), (frozenset({56, 50, 174, 7}), 128), (frozenset({56, 50, 100, 174}), 128)]
I found 785 frequent itemsets of length 5
[(frozenset({50, 181, 56, 172, 174}), 145), (frozenset({50, 7, 56, 172, 174}), 140), (frozenset({50, 181, 172, 174, 79}), 140), (frozenset({50, 98, 181, 172, 174}), 140), (frozenset({50, 181, 7, 172, 174}), 130)]
I found 677 frequent itemsets of length 6
[(frozenset({50, 181, 7, 56, 172, 174}), 144), (frozenset({50, 98, 181, 56, 172, 174}), 138), (frozenset({64, 50, 98, 181, 172, 174}), 138), (frozenset({50, 7, 56, 172, 174, 79}), 132), (frozenset({64, 50, 98, 7, 56, 174}), 126)]
I found 373 frequent itemsets of length 7
[(frozenset({50, 181, 7, 56, 172, 174, 79}), 133), (frozenset({50, 98, 181, 7, 56, 172, 174}), 133), (frozenset({64, 50, 98, 181, 56, 172, 174}), 133), (frozenset({50, 98, 258, 181, 56, 172, 174}), 133), (frozenset({64, 50, 98, 7, 56, 172, 174}), 126)]
I found 126 frequent itemsets of length 8
[(frozenset({50, 98, 181, 7, 56, 172, 174, 79}), 136), (frozenset({64, 50, 98, 181, 7, 56, 172, 174}), 128), (frozenset({64, 50, 98, 7, 56, 172, 174, 79}), 120), (frozenset({64, 50, 98, 181, 7, 56, 174, 79}), 120), (frozenset({64, 50, 181, 7, 56, 172, 174, 79}), 120)]
I found 24 frequent itemsets of length 9
[(frozenset({64, 98, 7, 172, 174, 79, 50, 181, 56}), 126), (frozenset({98, 100, 7, 172, 174, 79, 50, 181, 56}), 117), (frozenset({98, 258, 7, 172, 174, 79, 50, 181, 56}), 117), (frozenset({1, 98, 7, 172, 174, 79, 50, 181, 56}), 108), (frozenset({64, 98, 258, 7, 172, 174, 50, 181, 56}), 108)]
I found 2 frequent itemsets of length 10
[(frozenset({64, 1, 98, 7, 172, 174, 79, 50, 181, 56}), 100), (frozenset({64, 98, 100, 7, 172, 174, 79, 50, 181, 56}), 100)]
"""
sys.stdout.flush()
# Association rules need at least two items (a premise and a conclusion), so discard the length-1 itemsets
del frequent_itemsets[1]
Extracting rules and computing their confidence
# 4. Extract association rules and compute their confidence
candidate_rules = []
for itemset_length, itemset_counts in frequent_itemsets.items():
    for itemset in itemset_counts.keys():
        # Take each movie in the itemset in turn as the conclusion, with the remaining movies
        # as the premise, and combine them into a candidate rule
        for conclusion in itemset:
            premise = itemset - set((conclusion,))
            candidate_rules.append((premise, conclusion))
print(candidate_rules[:5])
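For a single itemset, the loop above produces one candidate rule per possible conclusion. On an illustrative 3-movie itemset (the IDs are taken from the sample output above, but any would do):

```python
itemset = frozenset((50, 172, 174))  # an illustrative 3-movie frequent itemset

toy_rules = []
for conclusion in itemset:
    # The remaining movies form the premise of the rule
    premise = itemset - set((conclusion,))
    toy_rules.append((premise, conclusion))
# Three rules: ({172, 174} -> 50), ({50, 174} -> 172), ({50, 172} -> 174)
```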
# Compute each rule's confidence
valid_rules = defaultdict(int)    # counts cases where the rule holds (premise and conclusion both liked)
invalid_rules = defaultdict(int)  # counts cases where the premise holds but the conclusion does not
for user, reviews in favorable_reviews_by_users.items():
    for candidate_rule in candidate_rules:
        premise, conclusion = candidate_rule
        if premise.issubset(reviews):
            if conclusion in reviews:
                valid_rules[candidate_rule] += 1
            else:
                invalid_rules[candidate_rule] += 1
rule_confidence = {candidate_rule: valid_rules[candidate_rule] / float(invalid_rules[candidate_rule] + valid_rules[candidate_rule])
                   for candidate_rule in candidate_rules}
sorted_confidence = sorted(rule_confidence.items(), key=lambda x: x[1], reverse=True)
for index in range(5):
    print("Rule #{0}".format(index + 1))
    (premise, conclusion) = sorted_confidence[index][0]
    print("Rule: If a person recommends {0} they will also recommend {1}".format(premise, conclusion))
    print(" - Confidence: {0}".format(rule_confidence[(premise, conclusion)]))
Replacing movie IDs with movie titles in the output
movie_name_data = pd.read_csv('./ml-100k/u.item', delimiter='|', header=None, encoding='mac-roman')
movie_name_data.columns = ["movie id", "movie title", "release date", "video release date",
                           "IMDb URL", "unknown", "Action", "Adventure", "Animation", "Children's",
                           "Comedy", "Crime", "Documentary", "Drama", "Fantasy",
                           "Film-Noir", "Horror", "Musical", "Mystery", "Romance", "Sci-Fi",
                           "Thriller", "War", "Western"]
def get_movie_name(movie_id):
    # Select the row whose "movie id" matches and return its title
    title_object = movie_name_data[movie_name_data["movie id"] == movie_id]["movie title"]
    title = title_object.values[0]
    return title
for index in range(5):
    print("Rule #{0}".format(index + 1))
    (premise, conclusion) = sorted_confidence[index][0]
    premise_name = ", ".join(get_movie_name(idx) for idx in premise)
    conclusion_name = get_movie_name(conclusion)
    print("Rule: If a person recommends {0} they will also recommend {1}".format(premise_name, conclusion_name))
    print(" - Confidence: {0}".format(rule_confidence[(premise, conclusion)]))
Evaluation
The first 200 UserIDs form the training set; the remaining users form the test set.
# The first 200 UserIDs were used for training; the rest are used for testing.
# In pandas the ~ operator is logical NOT: it negates a boolean condition.
test_dataset = all_ratings[~all_ratings["UserID"].isin(range(200))]
test_favorable = test_dataset[test_dataset["Favorable"]]
test_favorable_by_users = dict((k, frozenset(v.values)) for k, v in test_favorable.groupby("UserID")["MovieID"])
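The `~` negation combined with `isin` can be seen on a toy frame (the UserID values here are chosen only to straddle the 200 boundary):

```python
import pandas as pd

toy = pd.DataFrame({"UserID": [0, 199, 200, 500]})

# Keep only the users NOT in the training range 0-199
toy_test = toy[~toy["UserID"].isin(range(200))]
```

Only the rows with UserID 200 and 500 survive the filter.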
valid_rules = defaultdict(int)
invalid_rules = defaultdict(int)
for user, reviews in test_favorable_by_users.items():
    for candidate_rule in candidate_rules:
        premise, conclusion = candidate_rule
        if premise.issubset(reviews):
            if conclusion in reviews:
                valid_rules[candidate_rule] += 1
            else:
                invalid_rules[candidate_rule] += 1
test_confidence = {candidate_rule: valid_rules[candidate_rule] / float(valid_rules[candidate_rule] + invalid_rules[candidate_rule])
                   for candidate_rule in candidate_rules}
for index in range(5):
    print("Rule #{0}".format(index + 1))
    (premise, conclusion) = sorted_confidence[index][0]
    premise_name = ', '.join(get_movie_name(idx) for idx in premise)
    conclusion_name = get_movie_name(conclusion)
    print("Rule: If a person recommends {0} they will also recommend {1}".format(premise_name, conclusion_name))
    print(" - Train Confidence: {0:.3f}".format(rule_confidence.get((premise, conclusion), -1)))
    print(" - Test Confidence: {0:.3f}".format(test_confidence.get((premise, conclusion), -1)))
    print("")
The top five rules have a confidence of 1.0 on the training set and still score high confidence on the test set, so they should work well for recommending movies.
Summary
This chapter applied affinity analysis to movie recommendation, mining association rules from a large set of movie ratings. The process has two stages. First, the Apriori algorithm finds the frequent itemsets in the data; then, association rules are generated from those frequent itemsets.
Because the dataset is large and the number of possible itemsets grows exponentially, an algorithm like Apriori that prunes the search space is essential.
We discovered the association rules on one part of the data, the training set, and tested them on the remaining data, the test set. The cross-validation techniques from earlier chapters could be used to evaluate each rule more thoroughly.