Understanding the Competition Task + Baseline

Latest recommended article published on 2022-07-04 20:59:50

migushu3

Category: Recommender Systems. Tags: recommender systems

Article link: https://blog.csdn.net/migushu3/article/details/110147711

Introduction to Recommender Systems

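I. Common evaluation metrics:

User satisfaction
Prediction accuracy, which splits into two cases:

Rating prediction:
accuracy is usually measured with RMSE (root mean squared error) and MAE (mean absolute error).
Top-N recommendation:
accuracy is measured with precision and recall.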
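
The accuracy metrics listed above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names and the input layout (`recommended` maps each user to a recommendation list, `relevant` maps each user to the set of items the user actually liked) are assumptions made for this example, not something taken from the competition code.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted ratings."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error between true and predicted ratings."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall(recommended, relevant):
    """Precision and recall of Top-N lists, pooled over all users in `recommended`."""
    hits = n_rec = n_rel = 0
    for user, rec_items in recommended.items():
        liked = relevant.get(user, set())
        hits += len(set(rec_items) & liked)   # recommended items the user liked
        n_rec += len(rec_items)               # everything that was recommended
        n_rel += len(liked)                   # everything the user liked
    precision = hits / n_rec if n_rec else 0.0
    recall = hits / n_rel if n_rel else 0.0
    return precision, recall

# Hypothetical toy data for illustration only
recommended = {'u1': ['a', 'b'], 'u2': ['c', 'd']}
relevant = {'u1': {'a', 'x'}, 'u2': {'c', 'd', 'e'}}
print(precision_recall(recommended, relevant))  # (0.75, 0.6)
```

Precision divides the hits by the number of recommended items, while recall divides the same hits by the number of items the users actually liked, so longer lists tend to raise recall and lower precision.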

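Coverage:
can be defined via information entropy or via the Gini coefficient.
Diversity
Novelty
AUC (area under the ROC curve):
computed from the confusion-matrix counts TP, FN, FP, and TN.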
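
The coverage and AUC metrics named above can be sketched as follows. The helper names are assumptions for this example. The Gini formula used here is the common rank-based form over recommendation counts sorted in ascending order, and, as a simplification, items that are never recommended are left out of the entropy and Gini computations.

```python
import math
from collections import Counter

def catalog_coverage(rec_lists, n_items):
    """Share of the catalog that appears in at least one recommendation list."""
    return len({item for items in rec_lists for item in items}) / n_items

def entropy(rec_lists):
    """Information entropy of how often each item is recommended."""
    counts = Counter(item for items in rec_lists for item in items)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def gini(rec_lists):
    """Gini coefficient of recommendation counts (0 means perfectly even)."""
    counts = sorted(Counter(item for items in rec_lists for item in items).values())
    n, total = len(counts), sum(counts)
    if n < 2:
        return 0.0
    # j is the 1-based rank of an item in ascending order of popularity
    return sum((2 * j - n - 1) * c for j, c in enumerate(counts, start=1)) / ((n - 1) * total)

def confusion_counts(labels, scores, threshold):
    """Return (TP, FN, FP, TN) when scores >= threshold are predicted positive."""
    tp = fn = fp = tn = 0
    for y, s in zip(labels, scores):
        predicted = s >= threshold
        if y and predicted:
            tp += 1
        elif y:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn

def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

print(auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1]))  # 0.75
```

The pairwise form of AUC is equivalent to the area under the ROC curve, which is traced out by sweeping the threshold in `confusion_counts` and plotting TP / (TP + FN) against FP / (FP + TN).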

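II. Core algorithm layers of a recommender system

1. Recall layer:
Shrinks the candidate set. Because it runs over the full, very large item pool, it uses few features and simple models.
Mainstream approaches:

Multi-channel recall: several recall strategies run in parallel; the specific strategies depend on the business.
Embedding recall
Purpose of embeddings: convert sparse vectors into dense vectors, which amounts to smoothing the one-hot encoding.
Common embedding techniques:
text embedding:
static vectors (word2vec, fastText, GloVe) and dynamic, context-dependent vectors (ELMo, GPT, BERT)
image embedding
graph embedding
2. Ranking layer:
Ranks the reduced candidate set precisely, using more features and more complex models.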
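
To make the embedding idea above concrete, the sketch below shows that multiplying a one-hot vector by an embedding matrix simply selects one dense row, and that recall can then pick the items whose embeddings score highest against a user vector. The 4-item, 2-dimensional embedding table and the function names are hypothetical values chosen for illustration; a real system would learn the table and use an approximate nearest-neighbour index.

```python
def one_hot(index, size):
    """Sparse representation: all zeros except a single 1 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def row_times_matrix(vec, matrix):
    """Multiply a row vector by a matrix stored as a list of rows."""
    return [sum(v * row[j] for v, row in zip(vec, matrix)) for j in range(len(matrix[0]))]

# Hypothetical embedding table: 4 items, 2 latent dimensions
item_embeddings = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.0, 1.0],
]

# one-hot x embedding matrix == lookup of the corresponding dense row
dense = row_times_matrix(one_hot(2, 4), item_embeddings)
print(dense == item_embeddings[2])  # True

def recall_top_k(user_vec, embeddings, k):
    """Return the indices of the k items with the largest dot product to user_vec."""
    scored = [(sum(u * e for u, e in zip(user_vec, emb)), idx)
              for idx, emb in enumerate(embeddings)]
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:k]]

print(recall_top_k([1.0, 0.0], item_embeddings, k=2))  # [0, 1]
```

The ranking layer would then rescore only these few recalled candidates with a richer model.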

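III. Collaborative filtering
1. Categories
- UserCF: user-based collaborative filtering
- ItemCF: item-based collaborative filtering
2. Similarity measures
- Jaccard similarity coefficient
- Cosine similarity
- Pearson correlation coefficient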
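
The three similarity measures listed above can be sketched in plain Python as follows. The function names are chosen for this example. Pearson correlation is written as cosine similarity of mean-centered vectors, which is one standard way to express it.

```python
import math

def jaccard(set_a, set_b):
    """Size of the intersection divided by size of the union of two item sets."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def cosine(x, y):
    """Cosine of the angle between two rating vectors of equal length."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def pearson(x, y):
    """Pearson correlation: cosine similarity after subtracting each vector's mean."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])

print(jaccard({1, 2, 3}, {2, 3, 4}))  # 0.5
# Two users' ratings on the same four items
print(round(pearson([5, 3, 4, 4], [3, 1, 2, 3]), 4))  # 0.8528
```

Jaccard only looks at which items were interacted with, so it suits implicit feedback, while cosine and Pearson use the rating values. Pearson additionally removes each user's rating bias by centering.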

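The UserCF algorithm

Find the set of users whose interests are similar to the target user's.
Recommend to the target user the items that users in this set like and that the target user has not yet encountered.
Implementing UserCF: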

import pandas as pd

def loadData():
    # Item-centric view: item -> {user: rating}. User 1 has not rated item E.
    items = {
        'A': {1: 5, 2: 3, 3: 4, 4: 3, 5: 1},
        'B': {1: 3, 2: 1, 3: 3, 4: 3, 5: 5},
        'C': {1: 4, 2: 2, 3: 4, 4: 1, 5: 5},
        'D': {1: 4, 2: 3, 3: 3, 4: 5, 5: 2},
        'E': {2: 3, 3: 5, 4: 4, 5: 1}
    }
    # User-centric view of the same ratings: user -> {item: rating}.
    users = {
        1: {'A': 5, 'B': 3, 'C': 4, 'D': 4},
        2: {'A': 3, 'B': 1, 'C': 2, 'D': 3, 'E': 3},
        3: {'A': 4, 'B': 3, 'C': 4, 'D': 3, 'E': 5},
        4: {'A': 3, 'B': 3, 'C': 1, 'D': 5, 'E': 4},
        5: {'A': 1, 'B': 5, 'C': 5, 'D': 2, 'E': 1}
    }

    return items, users

if __name__ == '__main__':
    items, users = loadData()
    # Transpose so that each row is one item (item_df) or one user (user_df).
    item_df = pd.DataFrame(items).T
    user_df = pd.DataFrame(users).T
    print(item_df)
    print(user_df)

Output:

     1    2    3    4    5
A  5.0  3.0  4.0  3.0  1.0
B  3.0  1.0  3.0  3.0  5.0
C  4.0  2.0  4.0  1.0  5.0
D  4.0  3.0  3.0  5.0  2.0
E  NaN  3.0  5.0  4.0  1.0
     A    B    C    D    E
1  5.0  3.0  4.0  4.0  NaN
2  3.0  1.0  2.0  3.0  3.0
3  4.0  3.0  4.0  3.0  5.0
4  3.0  3.0  1.0  5.0  4.0
5  1.0  5.0  5.0  2.0  1.0

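Compute the user similarity matrix

import numpy as np

# Pairwise Pearson correlation between users, computed over their co-rated items.
sm_matrix = pd.DataFrame(np.zeros((len(users), len(users))), index=list(users), columns=list(users))

for userID in users:
    for otheruserId in users:
        if userID == otheruserId:
            continue  # leave the diagonal at 0 so a user is never picked as their own neighbour
        vec_user = []
        vec_otheruser = []
        for itemID in items:
            itemRatings = items[itemID]
            # Only items rated by both users contribute to the correlation.
            if userID in itemRatings and otheruserId in itemRatings:
                vec_user.append(itemRatings[userID])
                vec_otheruser.append(itemRatings[otheruserId])
        # np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation.
        sm_matrix.loc[userID, otheruserId] = np.corrcoef(np.array(vec_user), np.array(vec_otheruser))[0][1]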

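Find the top-2 most similar users

n = 2
# Column 1 holds every user's similarity to user 1; keep the n largest.
sim_users = sm_matrix[1].sort_values(ascending=False)[:n].index.tolist()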

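Compute the final score

# Start from user 1's own mean rating, then add the similarity-weighted
# deviations of the neighbours' ratings for item E from their own means.
base_score = np.mean(np.array([value for value in users[1].values()]))
weighted_scores = 0.
cor_values_sum = 0.

for user in sim_users:
    corr_value = sm_matrix[1][user]
    mean_user_score = np.mean(np.array([value for value in users[user].values()]))
    weighted_scores += corr_value * (users[user]['E'] - mean_user_score)
    cor_values_sum += corr_value
final_scores = base_score + weighted_scores / cor_values_sum
print('Predicted rating of user 1 for item E:', final_scores)
# Use a single .loc call; chained indexing (loc[1]['E']) may write to a copy.
user_df.loc[1, 'E'] = final_scores
print(user_df)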
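
The score computed by the code above corresponds to the mean-centered weighted average often used in user-based collaborative filtering. The notation below is introduced only for this note: $N(u)$ is the set of top-$n$ neighbours, $s_{u,v}$ the Pearson similarity, and $\bar r_u$ the mean rating of user $u$. The numbers are worked by hand from the toy data with similarities rounded to three decimals, so treat them as approximate.

```latex
\hat r_{u,i} = \bar r_u + \frac{\sum_{v \in N(u)} s_{u,v}\,\bigl(r_{v,i} - \bar r_v\bigr)}{\sum_{v \in N(u)} s_{u,v}}

% Toy data: u = 1, i = E, N(1) = \{2, 3\}, s_{1,2} \approx 0.853, s_{1,3} \approx 0.707
% \bar r_1 = 4.0, \quad \bar r_2 = 2.4, \quad \bar r_3 = 3.8, \quad r_{2,E} = 3, \quad r_{3,E} = 5
\hat r_{1,E} \approx 4.0 + \frac{0.853\,(3 - 2.4) + 0.707\,(5 - 3.8)}{0.853 + 0.707} \approx 4.87
```

Subtracting each neighbour's mean compensates for users who rate systematically high or low.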

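Weaknesses of UserCF

Data sparsity: when users share few co-rated items, the similarity estimates become unreliable.
Scalability:
the user similarity matrix has to be maintained, so UserCF is not suited to settings with a very large number of users.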

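Steps of item-based collaborative filtering

Compute item-to-item similarity by analysing users' behaviour records.
Generate a recommendation list for each user from that user's behaviour history.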
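
The two ItemCF steps above can be sketched on the same toy ratings used in the UserCF example. This is a simplified illustration, not the article's implementation: it assumes cosine similarity over co-rating users and predicts a rating as the similarity-weighted average of the target user's ratings on the `k` most similar items. All function names are hypothetical.

```python
import math
from collections import defaultdict

# Same toy ratings as the UserCF example: user -> {item: rating}
users = {
    1: {'A': 5, 'B': 3, 'C': 4, 'D': 4},
    2: {'A': 3, 'B': 1, 'C': 2, 'D': 3, 'E': 3},
    3: {'A': 4, 'B': 3, 'C': 4, 'D': 3, 'E': 5},
    4: {'A': 3, 'B': 3, 'C': 1, 'D': 5, 'E': 4},
    5: {'A': 1, 'B': 5, 'C': 5, 'D': 2, 'E': 1},
}

def invert(users):
    """Turn user -> {item: rating} into item -> {user: rating}."""
    items = defaultdict(dict)
    for user, ratings in users.items():
        for item, rating in ratings.items():
            items[item][user] = rating
    return dict(items)

def item_similarity(ratings_i, ratings_j):
    """Step 1: cosine similarity of two items over the users who rated both."""
    common = ratings_i.keys() & ratings_j.keys()
    if not common:
        return 0.0
    dot = sum(ratings_i[u] * ratings_j[u] for u in common)
    norm_i = math.sqrt(sum(ratings_i[u] ** 2 for u in common))
    norm_j = math.sqrt(sum(ratings_j[u] ** 2 for u in common))
    return dot / (norm_i * norm_j)

def predict(users, user, target, k=2):
    """Step 2: weighted average of the user's ratings on the k items most similar to target."""
    items = invert(users)
    neighbours = sorted(
        ((item_similarity(items[target], items[i]), i) for i in users[user] if i != target),
        reverse=True,
    )[:k]
    weight_sum = sum(abs(sim) for sim, _ in neighbours)
    if not weight_sum:
        return 0.0
    return sum(sim * users[user][i] for sim, i in neighbours) / weight_sum

print('ItemCF prediction of user 1 for item E:', predict(users, 1, 'E'))
```

Because item similarities change more slowly than user similarities, they are often precomputed offline, which is one reason ItemCF tends to scale better when users greatly outnumber items.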

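Drawbacks of collaborative filtering:
Weak generalization ability.
A pronounced head effect in recommendations (popular items dominate), together with a weak ability to handle sparse vectors.
It makes no use at all of the attributes of the items or users themselves.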

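IV. The MF (matrix factorization) model

Algorithm principle
Matrix factorization decomposes the co-occurrence matrix used in collaborative filtering to obtain latent vectors for users and items.
It factorizes the m × n co-occurrence matrix R into the product of an m × k user matrix U and a k × n item matrix V, where m is the number of users, n is the number of items, and k is the dimension of the latent vectors, that is, the number of latent features. These latent features are no longer interpretable: we do not know what each one means, and the model has to learn them by itself. The size of k determines how expressive the latent vectors are: the larger k is, the more information they can carry, which can be understood as dividing user interests and item categories at a finer granularity.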
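
The factorization described above can be sketched with stochastic gradient descent on the observed ratings only. This is a minimal sketch under stated assumptions rather than the article's method: the learning rate, regularization strength, epoch count, and function names are illustrative choices, and `V` is stored as n rows of length k (the transpose of the k × n matrix in the text) so that each item's latent vector is one row.

```python
import math
import random

def init_factors(n_users, n_items, k, seed=0):
    """Random starting values for the user matrix U (m x k) and item matrix V (n x k)."""
    rng = random.Random(seed)
    U = [[rng.random() for _ in range(k)] for _ in range(n_users)]
    V = [[rng.random() for _ in range(k)] for _ in range(n_items)]
    return U, V

def predict(U, V, u, i):
    """Predicted rating: dot product of the user's and the item's latent vectors."""
    return sum(pu * qi for pu, qi in zip(U[u], V[i]))

def rmse(ratings, U, V):
    """Root mean squared error over the observed (user, item, rating) triples."""
    return math.sqrt(sum((r - predict(U, V, u, i)) ** 2 for u, i, r in ratings) / len(ratings))

def sgd_train(ratings, U, V, epochs=500, lr=0.01, reg=0.02):
    """Update U and V in place by SGD on squared error with L2 regularization."""
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(U, V, u, i)
            for f in range(len(U[u])):
                pu, qi = U[u][f], V[i][f]
                U[u][f] += lr * (err * qi - reg * pu)
                V[i][f] += lr * (err * pu - reg * qi)

# Toy ratings from the UserCF example, as 0-based (user, item, rating) triples
users = {
    1: {'A': 5, 'B': 3, 'C': 4, 'D': 4},
    2: {'A': 3, 'B': 1, 'C': 2, 'D': 3, 'E': 3},
    3: {'A': 4, 'B': 3, 'C': 4, 'D': 3, 'E': 5},
    4: {'A': 3, 'B': 3, 'C': 1, 'D': 5, 'E': 4},
    5: {'A': 1, 'B': 5, 'C': 5, 'D': 2, 'E': 1},
}
ratings = [(u - 1, 'ABCDE'.index(item), r) for u, rs in users.items() for item, r in rs.items()]

U, V = init_factors(n_users=5, n_items=5, k=2)
print('RMSE before training:', rmse(ratings, U, V))
sgd_train(ratings, U, V)
print('RMSE after training:', rmse(ratings, U, V))
print('Predicted rating of user 1 for item E:', predict(U, V, 0, 4))
```

Only observed entries of R enter the loss, so the missing rating of user 1 for item E is filled in by the learned latent vectors. Raising `k` lets the model fit finer distinctions, at the cost of a higher risk of overfitting on such a small matrix.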