This data mining project tackles a movie recommendation problem: the goal is to find items that occur together, i.e., sets of movies that the same users like at the same time.
We use the most basic form of the Apriori algorithm.
import os
import pandas as pd
import numpy as np
import sys
from operator import itemgetter
from collections import defaultdict
1. Loading the data and taking a first look
# The file's extension really is .data; don't append .csv or the read will fail
all_ratings = pd.read_csv("u.data", delimiter='\t', header=None, names=["UserID","MovieID","Rating","Datetime"])
# Peek at the tip of the iceberg
all_ratings.head()
| | UserID | MovieID | Rating | Datetime |
|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |
| 3 | 244 | 51 | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |
# What is each column's dtype, and are there any missing records?
all_ratings.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 UserID 100000 non-null int64
1 MovieID 100000 non-null int64
2 Rating 100000 non-null int64
3 Datetime 100000 non-null int64
dtypes: int64(4)
memory usage: 3.1 MB
Every column is int64 and nothing is missing, so no missing-value handling is needed.
# Check the dataset's size; info() shows this too, but shape gives it at a glance
all_ratings.shape
(100000, 4)
# Deduplicate, just in case there are repeated rows
print("Shape before dedup: {0}".format(all_ratings.shape))
all_ratings.drop_duplicates(keep="first", inplace=True)
print("Shape after dedup: {0}".format(all_ratings.shape))
Shape before dedup: (100000, 4)
Shape after dedup: (100000, 4)
No duplicates at all.
# Which users rated movies, and how many ratings did each leave? (Personally, I never bother rating movies)
all_ratings["UserID"].value_counts()
405 737
655 685
13 636
450 540
276 518
...
147 20
19 20
572 20
636 20
895 20
Name: UserID, Length: 943, dtype: int64
There are 943 users in total. User #405 rated a remarkable 737 movies, and even the least active users rated 20.
# Now let's see which movies were rated
all_ratings["MovieID"].value_counts()
50 583
258 509
100 508
181 507
294 485
...
1648 1
1571 1
1329 1
1457 1
1663 1
Name: MovieID, Length: 1682, dtype: int64
A total of 1682 movies were rated. Movie #50 was rated the most, 583 times, while some movies fared rather poorly and were rated only once, such as movie #1663.
# Check which rating levels occur
all_ratings["Rating"].unique().tolist()
[3, 1, 2, 4, 5]
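unique() only shows the distinct values. Since we will later treat ratings above 3 as "favorable", the distribution across those levels is also worth a glance (a quick sketch; output omitted here):
# How many ratings fall at each level, from 1 to 5
all_ratings["Rating"].value_counts().sort_index()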
# Parse the Unix timestamps; dates are much easier to read than raw numbers
all_ratings["Datetime"] = pd.to_datetime(all_ratings["Datetime"], unit='s')
all_ratings.head()
| | UserID | MovieID | Rating | Datetime |
|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 1997-12-04 15:55:49 |
| 1 | 186 | 302 | 3 | 1998-04-04 19:22:22 |
| 2 | 22 | 377 | 1 | 1997-11-07 07:18:36 |
| 3 | 244 | 51 | 2 | 1997-11-27 05:02:03 |
| 4 | 166 | 346 | 1 | 1998-02-02 05:33:16 |
These are clearly quite old ratings; the sample rows all fall in 1997-1998.
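head() only shows a few dates; to confirm the overall span, a quick min/max check works (output omitted here):
# Earliest and latest rating in the dataset; requires Datetime parsed as above
print(all_ratings["Datetime"].min(), "to", all_ratings["Datetime"].max())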
2. Implementing the Apriori algorithm
# First decide whether a user likes a movie: a rating above 3 counts as favorable
all_ratings['Favorable'] = all_ratings['Rating'] > 3
# Take a subset of the data as the training set; this shrinks the search space and speeds up the algorithm
# Training on the first 200 UserIDs froze my machine; start with a value below 50 and raise it only if things run smoothly
ratings = all_ratings[all_ratings['UserID'].isin(range(20))]
# Keep only the rows where the user liked the movie
favorable_ratings = ratings[ratings['Favorable']]
# Map each user to the set of movies they liked; storing v.values as a frozenset makes it fast to check whether a user liked a given movie
favorable_reviews_by_users = dict((k, frozenset(v.values)) for k, v in favorable_ratings.groupby('UserID')['MovieID'])
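As a quick usage check (assuming UserID 1 made it into the sample; the first clause guards against the case where it didn't), membership tests on these frozensets are constant-time:
# Did user 1 like movie #50?
1 in favorable_reviews_by_users and 50 in favorable_reviews_by_users[1]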
# How many users liked each movie; summing the boolean column counts the True values
num_favorable_by_movie = ratings[['MovieID','Favorable']].groupby('MovieID').sum()
num_favorable_by_movie.sort_values(by='Favorable', axis=0, ascending=False)[:5]
| MovieID | Favorable |
|---|---|
| 50 | 14.0 |
| 100 | 12.0 |
| 174 | 11.0 |
| 127 | 10.0 |
| 56 | 10.0 |
2.1 Implementation
frequent_itemsets = {}
# Minimum support threshold
min_support = 10
# Treat each movie as a 1-itemset and keep the frequent ones.
# Note the strict '>' here (the later passes use '>='): movies #127 and #56
# sit exactly at 10 favorable reviews and are therefore dropped.
frequent_itemsets[1] = dict((frozenset((MovieID,)), row['Favorable'])
                            for MovieID, row in num_favorable_by_movie.iterrows()
                            if row['Favorable'] > min_support)
frequent_itemsets[1]
{frozenset({50}): 14.0, frozenset({100}): 12.0, frozenset({174}): 11.0}
# Take the frequent (k-1)-itemsets, extend them into candidate supersets,
# and return those that reach the minimum support as the frequent k-itemsets
def find_frequent_itemsets(favorable_reviews_by_users, k_1_itemsets, min_support):
    counts = defaultdict(int)
    for user, reviews in favorable_reviews_by_users.items():
        for itemset in k_1_itemsets:
            if itemset.issubset(reviews):
                # For every liked movie not already in the itemset, form the
                # superset and bump its count. Note that a k-candidate is reached
                # from each of its frequent (k-1)-subsets, so a supporting user
                # can be counted up to k times (see the deduplicated sketch below).
                for other_reviewed_movie in reviews - itemset:
                    current_superset = itemset | frozenset((other_reviewed_movie,))
                    counts[current_superset] += 1
    return dict([(itemset, frequency) for itemset, frequency in counts.items() if frequency >= min_support])
find_frequent_itemsets(favorable_reviews_by_users, frequent_itemsets[1], min_support)
{frozenset({50, 100}): 18, frozenset({50, 174}): 20, frozenset({100, 174}): 18}
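These counts look inflated, and they are: a candidate like frozenset({50, 100}) is generated once from {50} and once from {100}, so every supporting user is counted twice here (18 corresponds to 9 users, and the 3-itemset count of 24 below corresponds to 8 users, matching the correct_counts we compute later). If you want counts equal to the true number of supporting users, a deduplicated variant along these lines should work (a sketch, not a drop-in replacement: it reports smaller counts, so the same min_support prunes more aggressively):
def find_frequent_itemsets_dedup(favorable_reviews_by_users, k_1_itemsets, min_support):
    # Track the set of users behind each candidate so nobody is counted twice
    candidate_users = defaultdict(set)
    for user, reviews in favorable_reviews_by_users.items():
        for itemset in k_1_itemsets:
            if itemset.issubset(reviews):
                for other_movie in reviews - itemset:
                    candidate_users[itemset | frozenset((other_movie,))].add(user)
    return dict((itemset, len(users)) for itemset, users in candidate_users.items()
                if len(users) >= min_support)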
# Grow the frequent itemsets level by level, storing each level as we find it
for k in range(2, 5):
    cur_frequent_itemsets = find_frequent_itemsets(favorable_reviews_by_users, frequent_itemsets[k-1], min_support)
    frequent_itemsets[k] = cur_frequent_itemsets
    if len(cur_frequent_itemsets) == 0:
        print("Did not find any frequent itemsets of length {0}".format(k))
        # Flush stdout so progress shows up while the loop is still running;
        # don't overuse it, since flushing (and printing) slows the loop down
        sys.stdout.flush()
        break
    else:
        print("Found {0} frequent itemsets of length {1}".format(len(cur_frequent_itemsets), k))
        sys.stdout.flush()
Found 3 frequent itemsets of length 2
Found 1 frequent itemsets of length 3
Did not find any frequent itemsets of length 4
# 1-itemsets can't form a rule (a rule needs both a premise and a conclusion), so drop them
del frequent_itemsets[1]
frequent_itemsets
{2: {frozenset({50, 100}): 18,
frozenset({50, 174}): 20,
frozenset({100, 174}): 18},
3: {frozenset({50, 100, 174}): 24},
4: {}}
2.2 Extracting association rules
A frequent itemset is a set of items that reaches the minimum support. The association rules extracted from it have the form: if a user likes every movie in the premise, they will also like the movie in the conclusion.
candidate_rules = []
# Generate candidate rules from the frequent itemsets of every length
for itemset_length, itemset_counts in frequent_itemsets.items():
    # Within each itemset, try every movie as the conclusion; the rest form the premise
    for itemset in itemset_counts.keys():
        for conclusion in itemset:
            premise = itemset - set((conclusion,))
            candidate_rules.append((premise, conclusion))
candidate_rules
[(frozenset({100}), 50),
(frozenset({50}), 100),
(frozenset({174}), 50),
(frozenset({50}), 174),
(frozenset({174}), 100),
(frozenset({100}), 174),
(frozenset({100, 174}), 50),
(frozenset({50, 174}), 100),
(frozenset({50, 100}), 174)]
correct_counts = defaultdict(int)
incorrect_counts = defaultdict(int)
# For every user, check each rule whose premise they satisfy and record
# whether they also liked the movie in the conclusion
for user, reviews in favorable_reviews_by_users.items():
    for candidate_rule in candidate_rules:
        premise, conclusion = candidate_rule
        if premise.issubset(reviews):
            if conclusion in reviews:
                correct_counts[candidate_rule] += 1
            else:
                incorrect_counts[candidate_rule] += 1
correct_counts
defaultdict(int,
{(frozenset({100}), 50): 9,
(frozenset({50}), 100): 9,
(frozenset({174}), 50): 10,
(frozenset({50}), 174): 10,
(frozenset({174}), 100): 9,
(frozenset({100}), 174): 9,
(frozenset({100, 174}), 50): 8,
(frozenset({50, 174}), 100): 8,
(frozenset({50, 100}), 174): 8})
incorrect_counts
defaultdict(int,
{(frozenset({50}), 174): 4,
(frozenset({100}), 174): 3,
(frozenset({50, 100}), 174): 1,
(frozenset({50}), 100): 5,
(frozenset({174}), 100): 2,
(frozenset({50, 174}), 100): 2,
(frozenset({100}), 50): 3,
(frozenset({174}), 50): 1,
(frozenset({100, 174}), 50): 1})
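As a spot check before computing them all: the rule (frozenset({174}) → 50) was correct for 10 users and incorrect for 1, so its confidence should come out to 10 / (10 + 1) ≈ 0.909.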
# Confidence of a rule = correct / (correct + incorrect)
rule_confidence = {candidate_rule:
                   correct_counts[candidate_rule] / (correct_counts[candidate_rule] + incorrect_counts[candidate_rule])
                   for candidate_rule in candidate_rules}
rule_confidence
{(frozenset({100}), 50): 0.75,
(frozenset({50}), 100): 0.6428571428571429,
(frozenset({174}), 50): 0.9090909090909091,
(frozenset({50}), 174): 0.7142857142857143,
(frozenset({174}), 100): 0.8181818181818182,
(frozenset({100}), 174): 0.75,
(frozenset({100, 174}), 50): 0.8888888888888888,
(frozenset({50, 174}), 100): 0.8,
(frozenset({50, 100}), 174): 0.8888888888888888}
# Sort the rules by confidence, descending
sorted_confidence = sorted(rule_confidence.items(), key=itemgetter(1), reverse=True)
sorted_confidence
[((frozenset({174}), 50), 0.9090909090909091),
((frozenset({100, 174}), 50), 0.8888888888888888),
((frozenset({50, 100}), 174), 0.8888888888888888),
((frozenset({174}), 100), 0.8181818181818182),
((frozenset({50, 174}), 100), 0.8),
((frozenset({100}), 50), 0.75),
((frozenset({100}), 174), 0.75),
((frozenset({50}), 174), 0.7142857142857143),
((frozenset({50}), 100), 0.6428571428571429)]
# Print the five rules with the highest confidence
for index in range(5):
    print('Rule #{0}'.format(index + 1))
    premise, conclusion = sorted_confidence[index][0]
    print('Rule: If a person recommends: {0}, they will also recommend: {1}'.format(premise, conclusion))
    print('- Confidence: {0:.3f}'.format(rule_confidence[(premise, conclusion)]))
    print("")
Rule #1
Rule: If a person recommends: frozenset({174}), they will also recommend: 50
- Confidence: 0.909
Rule #2
Rule: If a person recommends: frozenset({100, 174}), they will also recommend: 50
- Confidence: 0.889
Rule #3
Rule: If a person recommends: frozenset({50, 100}), they will also recommend: 174
- Confidence: 0.889
Rule #4
Rule: If a person recommends: frozenset({174}), they will also recommend: 100
- Confidence: 0.818
Rule #5
Rule: If a person recommends: frozenset({50, 174}), they will also recommend: 100
- Confidence: 0.800
The output shows only movie IDs rather than titles, which isn't very friendly, so next we map each movie ID to its title.
# u.item is pipe-delimited; its titles contain accented characters that aren't
# valid UTF-8, hence an explicit encoding. Keep only MovieID and Title.
movie_name_data = pd.read_csv("u.item", delimiter='|', header=None, encoding='mac_roman')
movie_name_data = movie_name_data.iloc[:, :2]
movie_name_data.columns = ['MovieID','Title']
# Look up a movie title by its ID
def get_movie_name(movie_id):
    title_object = movie_name_data[movie_name_data['MovieID'] == movie_id]['Title']
    title = title_object.values[0]
    return title
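get_movie_name assumes the ID is always present; title_object.values[0] raises an IndexError otherwise. A guarded variant (a sketch; the fallback label is my own choice) sidesteps that:
# Return a placeholder instead of raising when the MovieID is unknown
def get_movie_name_safe(movie_id):
    titles = movie_name_data.loc[movie_name_data['MovieID'] == movie_id, 'Title'].values
    return titles[0] if len(titles) else 'Unknown #{0}'.format(movie_id)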
# Print the top five rules again, now with movie titles
for index in range(5):
    print('Rule #{0}'.format(index + 1))
    premise, conclusion = sorted_confidence[index][0]
    premise_name = ', '.join(get_movie_name(idx) for idx in premise)
    conclusion_name = get_movie_name(conclusion)
    print('Rule: If a person recommends: {0}, \nthey will also recommend: {1}'.format(premise_name, conclusion_name))
    print('- Confidence: {0:.3f}'.format(rule_confidence[(premise, conclusion)]))
    print("")
Rule #1
Rule: If a person recommends: Raiders of the Lost Ark (1981),
they will also recommend: Star Wars (1977)
- Confidence: 0.909
Rule #2
Rule: If a person recommends: Fargo (1996), Raiders of the Lost Ark (1981),
they will also recommend: Star Wars (1977)
- Confidence: 0.889
Rule #3
Rule: If a person recommends: Star Wars (1977), Fargo (1996),
they will also recommend: Raiders of the Lost Ark (1981)
- Confidence: 0.889
Rule #4
Rule: If a person recommends: Raiders of the Lost Ark (1981),
they will also recommend: Fargo (1996)
- Confidence: 0.818
Rule #5
Rule: If a person recommends: Star Wars (1977), Raiders of the Lost Ark (1981),
they will also recommend: Fargo (1996)
- Confidence: 0.800
3. Evaluation
# Use the ratings of users #100 through #109 as the test set (disjoint from the training users)
test_dataset = all_ratings[all_ratings['UserID'].isin(range(100,110,1))]
test_favorable = test_dataset[test_dataset['Favorable']]
test_favorable_by_users = dict((k, frozenset(v.values)) for k, v in test_favorable.groupby('UserID')['MovieID'])
correct_counts = defaultdict(int)
incorrect_counts = defaultdict(int)
# Count how often each rule holds (and fails) on the test set
for user, reviews in test_favorable_by_users.items():
    for candidate_rule in candidate_rules:
        premise, conclusion = candidate_rule
        if premise.issubset(reviews):
            if conclusion in reviews:
                correct_counts[candidate_rule] += 1
            else:
                incorrect_counts[candidate_rule] += 1
# Compute each rule's confidence on the test set
test_confidence = {candidate_rule:
                   correct_counts[candidate_rule] / (correct_counts[candidate_rule] + incorrect_counts[candidate_rule])
                   for candidate_rule in candidate_rules}
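One caveat: if a rule's premise never shows up in the test set, both counters stay at 0 and this comprehension raises a ZeroDivisionError (it happens not to with these ten users). A defensive version (a sketch) falls back to 0.0:
test_confidence = {}
for candidate_rule in candidate_rules:
    total = correct_counts[candidate_rule] + incorrect_counts[candidate_rule]
    test_confidence[candidate_rule] = correct_counts[candidate_rule] / total if total else 0.0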
# Print the top rules with train vs. test confidence
for index in range(5):
    print('Rule #{0}'.format(index + 1))
    premise, conclusion = sorted_confidence[index][0]
    premise_name = ', '.join(get_movie_name(idx) for idx in premise)
    conclusion_name = get_movie_name(conclusion)
    print('Rule: If a person recommends: {0}, \nthey will also recommend: {1}'.format(premise_name, conclusion_name))
    print('- Train Confidence: {0:.3f}'.format(rule_confidence[(premise, conclusion)]))
    print('- Test Confidence: {0:.3f}'.format(test_confidence[(premise, conclusion)]))
    print("")
Rule #1
Rule: If a person recommends: Raiders of the Lost Ark (1981),
they will also recommend: Star Wars (1977)
- Train Confidence: 0.909
- Test Confidence: 1.000
Rule #2
Rule: If a person recommends: Fargo (1996), Raiders of the Lost Ark (1981),
they will also recommend: Star Wars (1977)
- Train Confidence: 0.889
- Test Confidence: 1.000
Rule #3
Rule: If a person recommends: Star Wars (1977), Fargo (1996),
they will also recommend: Raiders of the Lost Ark (1981)
- Train Confidence: 0.889
- Test Confidence: 0.333
Rule #4
Rule: If a person recommends: Raiders of the Lost Ark (1981),
they will also recommend: Fargo (1996)
- Train Confidence: 0.818
- Test Confidence: 0.500
Rule #5
Rule: If a person recommends: Star Wars (1977), Raiders of the Lost Ark (1981),
they will also recommend: Fargo (1996)
- Train Confidence: 0.800
- Test Confidence: 0.500
4. Summary
We found association rules usable for movie recommendation from the rating data. The process has two stages: first use the Apriori algorithm to find frequent itemsets in the data, then generate association rules from those frequent itemsets. We discovered the rules on a training subset and then measured them on a separate test set. Note how some rules hold up well (Rule #1 even rises to a test confidence of 1.000) while others degrade sharply (Rule #3 falls from 0.889 to 0.333): high training confidence on a small sample is no guarantee of generalization.