Movie Recommendation: Association Analysis with the Apriori Algorithm

This data-mining project tackles movie recommendation. The goal is to find items that occur together, that is, cases where a user likes several movies at the same time.

We use the most basic form of the Apriori algorithm. Its key property is downward closure: every subset of a frequent itemset is itself frequent, so size-k candidates only need to be built from the frequent itemsets of size k-1.
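A toy illustration of that property, with made-up liked-movie sets (not taken from the dataset):

# Downward closure: any user counted for {1, 2, 3} is also counted for each of
# its subsets, so a superset can never be more frequent than its subsets.
users = [frozenset({1, 2, 3}), frozenset({1, 2}), frozenset({1, 3})]
def toy_support(itemset):
    return sum(1 for liked in users if itemset.issubset(liked))
assert toy_support(frozenset({1, 2})) >= toy_support(frozenset({1, 2, 3}))  # 2 >= 1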

import os
import pandas as pd
import numpy as np
import sys
from operator import itemgetter
from collections import defaultdict

I. Loading and Inspecting the Data

# The file extension is just .data; don't tack .csv onto it, or the read will fail
all_ratings = pd.read_csv("u.data", delimiter='\t', header=None, names=["UserID","MovieID","Rating","Datetime"])
# Let's take a peek at the tip of the iceberg
all_ratings.head()
   UserID  MovieID  Rating   Datetime
0     196      242       3  881250949
1     186      302       3  891717742
2      22      377       1  878887116
3     244       51       2  880606923
4     166      346       1  886397596
# What type is each column, and are there any missing records?
all_ratings.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
 #   Column    Non-Null Count   Dtype
---  ------    --------------   -----
 0   UserID    100000 non-null  int64
 1   MovieID   100000 non-null  int64
 2   Rating    100000 non-null  int64
 3   Datetime  100000 non-null  int64
dtypes: int64(4)
memory usage: 3.1 MB

Every column is int64 and nothing is missing. Great, no missing-value handling is needed.

# Check the shape too; info() already shows it, but this is more direct, you can see it at a glance
all_ratings.shape
(100000, 4)
# Deduplicate, just in case there are repeated rows
print("Shape before deduplication: {0}".format(all_ratings.shape))
all_ratings.drop_duplicates(keep="first", inplace=True)
print("Shape after deduplication: {0}".format(all_ratings.shape))
Shape before deduplication: (100000, 4)
Shape after deduplication: (100000, 4)

No duplicates at all.

# Let's see which users love movies enough to rate them (I never rate movies myself)
all_ratings["UserID"].value_counts()
405    737
655    685
13     636
450    540
276    518
      ... 
147     20
19      20
572     20
636     20
895     20
Name: UserID, Length: 943, dtype: int64

There are 943 users in total. User #405 rated a remarkable 737 movies, and even the least active users rated 20.

# Now let's see which movies have been rated
all_ratings["MovieID"].value_counts()
50      583
258     509
100     508
181     507
294     485
       ... 
1648      1
1571      1
1329      1
1457      1
1663      1
Name: MovieID, Length: 1682, dtype: int64

A total of 1682 movies were watched and rated. Movie #50 was rated most often, 583 times, while some movies fared rather badly and were rated only once, such as movie #1663.

# Check which rating levels exist
all_ratings["Rating"].unique().tolist()
[3, 1, 2, 4, 5]
# Parse the timestamps; dates are far easier to read than a string of digits
all_ratings["Datetime"] = pd.to_datetime(all_ratings["Datetime"], unit='s')
all_ratings.head()
   UserID  MovieID  Rating            Datetime
0     196      242       3 1997-12-04 15:55:49
1     186      302       3 1998-04-04 19:22:22
2      22      377       1 1997-11-07 07:18:36
3     244       51       2 1997-11-27 05:02:03
4     166      346       1 1998-02-02 05:33:16

These all look like quite old records.

II. Implementing the Apriori Algorithm

# First decide whether a user likes a movie: a rating above 3 counts as liking it
all_ratings['Favorable'] = all_ratings['Rating'] > 3
# Take a subset of the data as the training set; this shrinks the search space and speeds the algorithm up
# Training on the first 200 UserIDs froze my machine outright; start below 50 and raise it if things run smoothly
ratings = all_ratings[all_ratings['UserID'].isin(range(20))]
# Keep only the rows where the user liked the movie
favorable_ratings = ratings[ratings['Favorable']]
# For each user, store the liked MovieIDs (v.values) as a frozenset for fast membership tests
favorable_reviews_by_users = dict((k, frozenset(v.values)) for k, v in favorable_ratings.groupby('UserID')['MovieID'])
# Count how many users like each movie
num_favorable_by_movie = ratings[['MovieID','Favorable']].groupby('MovieID').sum()
num_favorable_by_movie.sort_values(by='Favorable', axis=0, ascending=False)[:5]
         Favorable
MovieID
50            14.0
100           12.0
174           11.0
127           10.0
56            10.0

1. Implementation

frequent_itemsets = {}
# Minimum support
min_support = 10
# Treat each movie as a one-item itemset and check whether it is frequent
# (note the strict >, so a movie liked by exactly min_support users is excluded)
frequent_itemsets[1] = dict((frozenset((MovieID,)), row['Favorable'])
                            for MovieID, row in num_favorable_by_movie.iterrows() if row['Favorable'] > min_support)
frequent_itemsets[1]
{frozenset({50}): 14.0, frozenset({100}): 12.0, frozenset({174}): 11.0}
# Take the (k-1)-item frequent itemsets, build their supersets, count them, and return the k-item frequent itemsets
def find_frequent_itemsets(favorable_reviews_by_users, k_1_itemsets, min_support):
    counts = defaultdict(int)
    for user, reviews in favorable_reviews_by_users.items():
        for itemset in k_1_itemsets:
            if itemset.issubset(reviews):
                # Extend the itemset with each liked movie not already in it, and bump that superset's count
                for other_reviewed_movie in reviews - itemset:
                    current_superset = itemset | frozenset((other_reviewed_movie,))
                    counts[current_superset] += 1
    return dict([(itemset, frequency) for itemset, frequency in counts.items() if frequency >= min_support])
find_frequent_itemsets(favorable_reviews_by_users, frequent_itemsets[1], min_support)
{frozenset({50, 100}): 18, frozenset({50, 174}): 20, frozenset({100, 174}): 18}
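One caveat about the function above: a user who likes every movie in a k-item superset contributes to its count once for each matching (k-1)-subset, so the counts are inflated roughly k-fold relative to the true user support. That is why {50, 100} shows 18 here even though only 9 users like both movies (compare correct_counts below). If you want counts that equal true support, a minimal variant of my own (not from the original post) collects each user's candidates in a set first:

def find_frequent_itemsets_unique(favorable_reviews_by_users, k_1_itemsets, min_support):
    counts = defaultdict(int)
    for user, reviews in favorable_reviews_by_users.items():
        # Collect this user's candidate supersets in a set, so duplicates collapse
        supersets = set()
        for itemset in k_1_itemsets:
            if itemset.issubset(reviews):
                for other_reviewed_movie in reviews - itemset:
                    supersets.add(itemset | frozenset((other_reviewed_movie,)))
        # Each user now adds at most 1 to each candidate's count
        for superset in supersets:
            counts[superset] += 1
    return dict((itemset, frequency) for itemset, frequency in counts.items()
                if frequency >= min_support)

With this variant, min_support counts actual users, so the threshold of 10 would keep only {50, 174} at k = 2.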
# Run passes for increasing k, storing the newly found frequent itemsets
for k in range(2, 5):
    cur_frequent_itemsets = find_frequent_itemsets(favorable_reviews_by_users, frequent_itemsets[k-1], min_support)
    frequent_itemsets[k] = cur_frequent_itemsets
    if len(cur_frequent_itemsets) == 0:
        print("did not find any frequent itemsets of length {0}".format(k))
        # Flush so buffered output reaches the terminal while the code is still running;
        # don't overdo it, since flushing (and printing) slows the program down
        sys.stdout.flush()
        break
    else:
        print("I found {0} frequent itemsets of length {1}".format(len(cur_frequent_itemsets), k))
        sys.stdout.flush()
I found 3 frequent itemsets of length 2
I found 1 frequent itemsets of length 3
did not find any frequent itemsets of length 4
# Drop the 1-itemsets: a rule needs at least two movies (a premise and a conclusion)
del frequent_itemsets[1]
frequent_itemsets
{2: {frozenset({50, 100}): 18,
  frozenset({50, 174}): 20,
  frozenset({100, 174}): 18},
 3: {frozenset({50, 100, 174}): 24},
 4: {}}

2. Extracting Association Rules

A frequent itemset is a group of items that reaches the minimum support. An association rule extracted from it has the form: if a user likes all the movies in the premise, they will also like the movie in the conclusion. Here we only generate rules whose conclusion is a single movie.

candidate_rules = []
# Walk over the frequent itemsets of each length, generating rules from every itemset
for itemset_length, itemset_counts in frequent_itemsets.items():
    # For each itemset, try every member as the conclusion, with the rest as the premise
    for itemset in itemset_counts.keys():
        for conclusion in itemset:
            premise = itemset - set((conclusion,))
            candidate_rules.append((premise, conclusion))
candidate_rules
[(frozenset({100}), 50),
 (frozenset({50}), 100),
 (frozenset({174}), 50),
 (frozenset({50}), 174),
 (frozenset({174}), 100),
 (frozenset({100}), 174),
 (frozenset({100, 174}), 50),
 (frozenset({50, 174}), 100),
 (frozenset({50, 100}), 174)]
correct_counts = defaultdict(int)
incorrect_counts = defaultdict(int)
# Iterate over users and their liked movies, counting each rule: whenever the premise holds, check whether the user also likes the conclusion movie
for user, reviews in favorable_reviews_by_users.items():
    for candidate_rule in candidate_rules:
        premise, conclusion = candidate_rule
        if premise.issubset(reviews):
            if conclusion in reviews:
                correct_counts[candidate_rule] += 1
            else:
                incorrect_counts[candidate_rule] += 1
correct_counts
defaultdict(int,
            {(frozenset({100}), 50): 9,
             (frozenset({50}), 100): 9,
             (frozenset({174}), 50): 10,
             (frozenset({50}), 174): 10,
             (frozenset({174}), 100): 9,
             (frozenset({100}), 174): 9,
             (frozenset({100, 174}), 50): 8,
             (frozenset({50, 174}), 100): 8,
             (frozenset({50, 100}), 174): 8})
incorrect_counts
defaultdict(int,
            {(frozenset({50}), 174): 4,
             (frozenset({100}), 174): 3,
             (frozenset({50, 100}), 174): 1,
             (frozenset({50}), 100): 5,
             (frozenset({174}), 100): 2,
             (frozenset({50, 174}), 100): 2,
             (frozenset({100}), 50): 3,
             (frozenset({174}), 50): 1,
             (frozenset({100, 174}), 50): 1})
# Compute each rule's confidence
rule_confidence = {candidate_rule: 
                   correct_counts[candidate_rule]/(correct_counts[candidate_rule] + incorrect_counts[candidate_rule])
                   for candidate_rule in candidate_rules}
rule_confidence
{(frozenset({100}), 50): 0.75,
 (frozenset({50}), 100): 0.6428571428571429,
 (frozenset({174}), 50): 0.9090909090909091,
 (frozenset({50}), 174): 0.7142857142857143,
 (frozenset({174}), 100): 0.8181818181818182,
 (frozenset({100}), 174): 0.75,
 (frozenset({100, 174}), 50): 0.8888888888888888,
 (frozenset({50, 174}), 100): 0.8,
 (frozenset({50, 100}), 174): 0.8888888888888888}
# Sort by confidence, descending
sorted_confidence = sorted(rule_confidence.items(), key=itemgetter(1), reverse=True)
sorted_confidence
[((frozenset({174}), 50), 0.9090909090909091),
 ((frozenset({100, 174}), 50), 0.8888888888888888),
 ((frozenset({50, 100}), 174), 0.8888888888888888),
 ((frozenset({174}), 100), 0.8181818181818182),
 ((frozenset({50, 174}), 100), 0.8),
 ((frozenset({100}), 50), 0.75),
 ((frozenset({100}), 174), 0.75),
 ((frozenset({50}), 174), 0.7142857142857143),
 ((frozenset({50}), 100), 0.6428571428571429)]
# Print the five rules with the highest confidence
for index in range(5):
    print('Rule #{0}'.format(index + 1))
    premise, conclusion = sorted_confidence[index][0]
    print('Rule: If a person recommends: {0}, they will also recommend: {1}'.format(premise, conclusion))
    print('- Confidence: {0:.3f}'.format(rule_confidence[(premise, conclusion)]))
    print("")
Rule #1
Rule: If a person recommends: frozenset({174}), they will also recommend: 50
- Confidence: 0.909

Rule #2
Rule: If a person recommends: frozenset({100, 174}), they will also recommend: 50
- Confidence: 0.889

Rule #3
Rule: If a person recommends: frozenset({50, 100}), they will also recommend: 174
- Confidence: 0.889

Rule #4
Rule: If a person recommends: frozenset({174}), they will also recommend: 100
- Confidence: 0.818

Rule #5
Rule: If a person recommends: frozenset({50, 174}), they will also recommend: 100
- Confidence: 0.800

The output shows only movie IDs, not titles, which is not very reader-friendly. Next we map each movie ID to its title.

# u.item is pipe-delimited; its first two columns are the movie ID and the title
# (the file is not UTF-8, so an 8-bit codec such as mac_roman is needed)
movie_name_data = pd.read_csv("u.item", delimiter='|', header=None, encoding='mac_roman')
movie_name_data = movie_name_data.iloc[:, :2]
movie_name_data.columns = ['MovieID','Title']
def get_movie_name(movie_id):
    # Look up the row with this MovieID and return its title
    title_object = movie_name_data[movie_name_data['MovieID'] == movie_id]['Title']
    title = title_object.values[0]
    return title
# Print the five rules with the highest confidence, now with movie titles
for index in range(5):
    print('Rule #{0}'.format(index + 1))
    premise, conclusion = sorted_confidence[index][0]
    premise_name = ', '.join(get_movie_name(idx) for idx in premise)
    conclusion_name = get_movie_name(conclusion)
    print('Rule: If a person recommends: {0}, \nthey will also recommend: {1}'.format(premise_name, conclusion_name))
    print('- Confidence: {0:.3f}'.format(rule_confidence[(premise, conclusion)]))
    print("")
Rule #1
Rule: If a person recommends: Raiders of the Lost Ark (1981), 
they will also recommend: Star Wars (1977)
- Confidence: 0.909

Rule #2
Rule: If a person recommends: Fargo (1996), Raiders of the Lost Ark (1981), 
they will also recommend: Star Wars (1977)
- Confidence: 0.889

Rule #3
Rule: If a person recommends: Star Wars (1977), Fargo (1996), 
they will also recommend: Raiders of the Lost Ark (1981)
- Confidence: 0.889

Rule #4
Rule: If a person recommends: Raiders of the Lost Ark (1981), 
they will also recommend: Fargo (1996)
- Confidence: 0.818

Rule #5
Rule: If a person recommends: Star Wars (1977), Raiders of the Lost Ark (1981), 
they will also recommend: Fargo (1996)
- Confidence: 0.800

III. Evaluation

# Use the ratings from users #100 through #109 as the test set
test_dataset = all_ratings[all_ratings['UserID'].isin(range(100,110,1))]
test_favorable = test_dataset[test_dataset['Favorable']]
test_favorable_by_users = dict((k, frozenset(v.values)) for k, v in test_favorable.groupby('UserID')['MovieID'])
correct_counts = defaultdict(int)
incorrect_counts = defaultdict(int)
# Count how often each rule holds on the test set
for user, reviews in test_favorable_by_users.items():
    for candidate_rule in candidate_rules:
        premise, conclusion = candidate_rule
        if premise.issubset(reviews):
            if conclusion in reviews:
                correct_counts[candidate_rule] += 1
            else:
                incorrect_counts[candidate_rule] += 1
# Compute confidence on the test set
test_confidence = {candidate_rule: 
                   correct_counts[candidate_rule]/(correct_counts[candidate_rule] + incorrect_counts[candidate_rule])
                   for candidate_rule in candidate_rules}
# Print the rules with train and test confidence
for index in range(5):
    print('Rule #{0}'.format(index + 1))
    premise, conclusion = sorted_confidence[index][0]
    premise_name = ', '.join(get_movie_name(idx) for idx in premise)
    conclusion_name = get_movie_name(conclusion)
    print('Rule: If a person recommends: {0}, \nthey will also recommend: {1}'.format(premise_name, conclusion_name))
    print('- Train Confidence: {0:.3f}'.format(rule_confidence[(premise, conclusion)]))
    print('- Test Confidence: {0:.3f}'.format(test_confidence[(premise, conclusion)]))
    print("")
Rule #1
Rule: If a person recommends: Raiders of the Lost Ark (1981), 
they will also recommend: Star Wars (1977)
- Train Confidence: 0.909
- Test Confidence: 1.000

Rule #2
Rule: If a person recommends: Fargo (1996), Raiders of the Lost Ark (1981), 
they will also recommend: Star Wars (1977)
- Train Confidence: 0.889
- Test Confidence: 1.000

Rule #3
Rule: If a person recommends: Star Wars (1977), Fargo (1996), 
they will also recommend: Raiders of the Lost Ark (1981)
- Train Confidence: 0.889
- Test Confidence: 0.333

Rule #4
Rule: If a person recommends: Raiders of the Lost Ark (1981), 
they will also recommend: Fargo (1996)
- Train Confidence: 0.818
- Test Confidence: 0.500

Rule #5
Rule: If a person recommends: Star Wars (1977), Raiders of the Lost Ark (1981), 
they will also recommend: Fargo (1996)
- Train Confidence: 0.800
- Test Confidence: 0.500

IV. Summary

We found association rules usable for movie recommendation in the rating data. The process has two stages: first use the Apriori algorithm to find frequent itemsets in the data, then generate association rules from those itemsets. We used part of the data as a training set to discover the rules, and tested them on a separate test set.
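For reference, the two measures the whole pipeline rests on can be written compactly; this is my own sketch (the function names are illustrative, not from the post):

# support(X): number of users whose liked-movie set contains every movie in X
def support(itemset, favorable_reviews_by_users):
    return sum(1 for reviews in favorable_reviews_by_users.values()
               if itemset.issubset(reviews))

# confidence(X -> y): support(X | {y}) / support(X)
def confidence(premise, conclusion, favorable_reviews_by_users):
    return (support(premise | frozenset((conclusion,)), favorable_reviews_by_users)
            / support(premise, favorable_reviews_by_users))

On the training users, confidence(frozenset({174}), 50, favorable_reviews_by_users) reproduces the 0.909 of Rule #1.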
