Recommender Systems Study Notes 01: Collaborative Filtering, Item-Based Song Recommendation

Jennie_J

Article link: https://blog.csdn.net/weixin_43685844/article/details/104321094

I recently needed a recommender system for work, so I surveyed the topic. These are my study notes, kept for future reference.

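I. Overview
The approach mentioned most often is collaborative filtering.
Collaborative filtering falls into three categories:
user-based, item-based, and model-based.

This post covers only a first-pass build of an item-based recommender.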

II. Project Background and Goal
Recommend suitable songs to a given user based on that user's listening history.

III. Data
User listening history: train_triplets.txt, 48,373,586 records, each with three fields: 'user_id', 'song_id', 'play_count'.

Song metadata table: track_metadata.db, covering about a million songs, each with fields such as 'track_id', 'title', 'song_id', 'artist', 'release'.

IV. Data Exploration (the exploration itself is omitted here)
Notes:
(1) Reading the .txt file

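import pandas as pd

# data_home is the directory that holds the data files (defined elsewhere).
# The file is tab-separated and has no header row, so column names are given explicitly.
triplet_dataset = pd.read_csv(filepath_or_buffer=data_home+'train_triplets.txt',
                             sep='\t',header=None,
                             names=['user','song','play_count'])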

triplet_dataset.shape

(48373586, 3)

From here on, the DataFrame is handled the same way as one loaded from a CSV file.

(2) Handling the .db file

import sqlite3

# Open the SQLite database and list the tables it contains
conn = sqlite3.connect(data_home+'track_metadata.db')
cur = conn.cursor()
cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
cur.fetchall()

[('songs',)]

track_metadata_df = pd.read_sql(con=conn,sql='select * from songs')

After that, processing is the same as for a CSV file.
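
The usage example later in this post passes 'title' as the item column, which suggests the song metadata was joined onto the listening records at some point. That step is not shown, so the following is only a sketch of how such a join might look. The toy frames and their values are made up; only the column names 'user', 'song', 'play_count', 'song_id' and 'title' come from the text above.

```python
import pandas as pd

# Toy stand-ins for triplet_dataset and track_metadata_df (values are made up).
triplets = pd.DataFrame({
    "user": ["u1", "u1", "u2"],
    "song": ["s1", "s2", "s1"],
    "play_count": [3, 1, 5],
})
metadata = pd.DataFrame({
    "song_id": ["s1", "s2", "s2"],   # a repeated song_id, assumed here for illustration
    "title": ["Song A", "Song B", "Song B"],
})

# Keep one metadata row per song so the join does not duplicate listening records.
metadata_unique = metadata.drop_duplicates(subset="song_id")

# Left join: every listening record is kept, with the title attached when known.
merged = triplets.merge(metadata_unique, how="left", left_on="song", right_on="song_id")
merged = merged.drop(columns="song_id")
```

Deduplicating the metadata first keeps the row count of the listening records unchanged after the join.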

(3) To keep the computation manageable, only a small subset of the data is used for modeling.
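
The post does not say how the subset was chosen. One plausible approach, shown here purely as an assumption, is to keep only the most active users and the most played songs. The helper name top_subset and the toy data are hypothetical.

```python
import pandas as pd

def top_subset(df, n_users, n_songs):
    """Keep rows whose user and song are both among the most-played ones."""
    user_totals = df.groupby("user")["play_count"].sum()
    song_totals = df.groupby("song")["play_count"].sum()
    top_users = user_totals.nlargest(n_users).index
    top_songs = song_totals.nlargest(n_songs).index
    mask = df["user"].isin(top_users) & df["song"].isin(top_songs)
    return df[mask].reset_index(drop=True)

# Made-up listening records.
toy = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u3", "u3"],
    "song": ["s1", "s2", "s1", "s3", "s1"],
    "play_count": [10, 2, 7, 1, 1],
})
subset = top_subset(toy, n_users=2, n_songs=1)
```

On the real data, n_users and n_songs would be chosen so that the similarity computation finishes in reasonable time.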

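V. Model Construction (item-based collaborative filtering)
(1) For the target user user_id, look up the listening history (train_data.csv) and collect every song the user has listened to: user_songs
(2) For each song the user has listened to, find all of its listeners: user_songs_listeners
(3) Collect every song in the training data train_data, all_songs, and the listeners of each of those songs, all_songs_listeners
(4) Compute the similarity matrix:
use user_songs_listeners and all_songs_listeners to compute the similarity between user_songs and all_songs
(5) Use the similarity matrix to give each song in all_songs a score, sort by score, and recommend the top K songs (excluding those already in user_songs) to the target user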
[Figure: handwritten walkthrough of the matrix computation]

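
Steps (1) to (5) can be traced on a tiny example. This is a minimal sketch with made-up songs and listeners, not the module's code; the helper name jaccard is chosen here for illustration.

```python
def jaccard(a, b):
    """Jaccard index of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical listeners per song (made-up data).
listeners = {
    "A": {"u1", "u2"},
    "B": {"u1", "u2", "u3"},
    "C": {"u3"},
    "D": {"u2", "u4"},
}
user_songs = ["A", "B"]          # songs the target user has heard
all_songs = list(listeners)      # every song in the training data

# Steps (1)-(4): similarity matrix of shape len(user_songs) x len(all_songs).
matrix = [[jaccard(listeners[i], listeners[j]) for j in all_songs]
          for i in user_songs]

# Step (5): score each song by its column mean, drop heard songs, rank.
scores = {song: sum(row[col] for row in matrix) / len(user_songs)
          for col, song in enumerate(all_songs)}
ranked = sorted((s for s in all_songs if s not in user_songs),
                key=lambda s: scores[s], reverse=True)
print(ranked)
```

Song D scores (1/3 + 1/4) / 2 = 7/24 and song C scores (0 + 1/3) / 2 = 1/6, so D is ranked ahead of C.
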
VI. Code
1. Write the module RecommendersJsq.py
Usage:

import RecommendersJsq as RecommendersJsq

is_model = RecommendersJsq.item_similarity_recommender_py()
is_model.creat(train_data,'user','title')

user_id = list(train_data.user)[7]
recommendations = is_model.recommend(user_id)
print(recommendations)

The output looks like this:
[Figure: screenshot of the recommendation output; image not preserved]

2. The main methods of RecommendersJsq.py are as follows:
(1) recommend()

    def recommend(self,user_id):
        # A. Get all songs the target user has listened to
        user_songs = self.get_user_items(user_id)
        print('by JSQ No. of unique songs for the user: %d' % len(user_songs))

        # B. Get all unique songs in the training data
        all_songs = self.get_all_items_train_data()
        print('by JSQ No. of unique songs in the training set: %d' % len(all_songs))

        # C. Construct the item cooccurence matrix of size
        #    len(user_songs) x len(all_songs)
        cooccurence_matrix = self.get_cooccurence_matrix(user_songs,all_songs)

        # D. Use the cooccurence matrix to make recommendations
        df_recommendations = self.get_top_recommendations(user_id,cooccurence_matrix,all_songs,user_songs)

        return df_recommendations

(2) get_user_items()
For the target user user_id, look up the listening history (train_data.csv) and collect every song the user has listened to: user_songs

    def get_user_items(self,user_id):
        user_items_data = self.train_data[self.train_data[self.user_id] == user_id]
        user_items = list((user_items_data[self.item_id]).unique())
        return user_items

(3) get_all_items_train_data()
Collect every song in the training data train_data: all_songs

    # Get unique items (songs) in the training data
    def get_all_items_train_data(self):
        all_items = list(self.train_data[self.item_id].unique())
        return all_items

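(4) get_cooccurence_matrix()
Compute the similarity matrix (len(user_songs) x len(all_songs)).
Use user_songs_listeners and all_songs_listeners to compute the similarity between user_songs and all_songs.
Entry (i, j) = [size of the intersection of the listeners of the i-th song in user_songs and the listeners of the j-th song in all_songs] / [size of the union of those two listener sets], i.e. the Jaccard index.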

    # Construct the cooccurence (Jaccard similarity) matrix
    def get_cooccurence_matrix(self,user_songs,all_songs):

        # Listeners of each song the target user has heard
        user_songs_listeners = [self.get_item_users(song) for song in user_songs]

        # Listeners of each song in the training data
        all_songs_listeners = [self.get_item_users(song) for song in all_songs]

        # Initialize the matrix with zeros, size len(user_songs) x len(all_songs).
        # Requires `import numpy as np` at the top of the module.
        cooccurence_matrix = np.matrix(np.zeros((len(user_songs),len(all_songs))))

        # Similarity between each user song and every unique song
        # in the training data
        for i in range(len(user_songs)):
            # Unique listeners (users) of song (item) i
            users_i = user_songs_listeners[i]

            for j in range(len(all_songs)):
                # Unique listeners (users) of song (item) j
                users_j = all_songs_listeners[j]

                # Listeners shared by songs i and j
                intersection = users_i.intersection(users_j)

                # Jaccard index: |intersection| / |union|.
                # The entry stays 0 when the two songs share no listeners.
                if len(intersection) != 0:
                    union = users_i.union(users_j)
                    cooccurence_matrix[i,j] = float(len(intersection))/float(len(union))

        return cooccurence_matrix
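
The double loop above performs one set intersection per pair of songs. As an alternative that is not part of the original module, the same Jaccard values can be computed with matrix operations on a binary item-by-user array. This is a sketch under the assumption that such an array fits in memory; the function name jaccard_matrix and the toy data are hypothetical.

```python
import numpy as np

def jaccard_matrix(user_rows, all_rows):
    """Jaccard similarities between rows of two binary item-by-user arrays."""
    user_rows = np.asarray(user_rows, dtype=float)
    all_rows = np.asarray(all_rows, dtype=float)
    intersection = user_rows @ all_rows.T          # counts of shared listeners
    sizes_user = user_rows.sum(axis=1)[:, None]    # listeners per user song
    sizes_all = all_rows.sum(axis=1)[None, :]      # listeners per candidate song
    union = sizes_user + sizes_all - intersection
    # Entries stay 0 where the union is empty, which avoids dividing by zero.
    return np.divide(intersection, union,
                     out=np.zeros_like(intersection), where=union > 0)

# Rows are songs, columns are users u1..u4 (made-up data).
all_rows = np.array([
    [1, 1, 0, 0],   # song A
    [1, 1, 1, 0],   # song B
    [0, 0, 1, 0],   # song C
    [0, 1, 0, 1],   # song D
])
user_rows = all_rows[:2]   # the target user has heard A and B
sim = jaccard_matrix(user_rows, all_rows)
```

The union size is obtained from the identity |A ∪ B| = |A| + |B| - |A ∩ B|, so no explicit set operations are needed.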

This relies on get_item_users(), which finds all listeners of a given song:

    # Get unique users for a given item (song)
    def get_item_users(self,item_id):
        train_data_sub = self.train_data[self.train_data[self.item_id] == item_id]
        item_users = set(train_data_sub[self.user_id].unique())
        return item_users
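
get_item_users() filters the whole DataFrame once per song, which becomes slow as the number of songs grows. One possible alternative, not used in the original module, is to build the item-to-listeners lookup in a single pass with groupby. The function name build_item_users and the toy data are assumptions for illustration.

```python
import pandas as pd

def build_item_users(train_data, user_col, item_col):
    """Map each item to the set of users who interacted with it, in one pass."""
    return train_data.groupby(item_col)[user_col].apply(set).to_dict()

# Made-up listening records.
toy = pd.DataFrame({
    "user": ["u1", "u2", "u1", "u3"],
    "title": ["Song A", "Song A", "Song B", "Song B"],
})
item_users = build_item_users(toy, "user", "title")
```

Each lookup such as item_users["Song A"] is then a dictionary access instead of a scan over the full table.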

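(5) get_top_recommendations()
Use the similarity matrix to give each song in all_songs a score, sort by score, and recommend the top K songs (excluding those already in user_songs) to the target user.
Here score = column sum of the matrix / total number of rows, i.e. the mean similarity between a candidate song and each song in user_songs. See the handwritten figure above for details.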

    def get_top_recommendations(self,user,cooccurence_matrix,all_songs,user_songs):
        non_zero = np.count_nonzero(cooccurence_matrix)
        print('No. of non-zero values in the cooccurence matrix: %d' %non_zero)

        # Score each song by its mean similarity to the user's songs
        # (column sum divided by the number of rows).
        score = cooccurence_matrix.sum(axis=0)/float(cooccurence_matrix.shape[0])
        score = np.array(score)[0].tolist()

        # Sort (score, index) pairs by score, highest first
        sorted_index = sorted(((e,i) for i,e in enumerate(score)),reverse=True)

        # Result table
        column = ['user','song','score','rank']
        df = pd.DataFrame(columns=column)

        # Fill the table with the top 10 songs the user has not already heard
        rank = 1
        for value, idx in sorted_index:
            if rank > 10:
                break
            if not np.isnan(value) and all_songs[idx] not in user_songs:
                df.loc[df.shape[0]] = [user,all_songs[idx],value,rank]
                rank += 1

        if df.shape[0] == 0:
            print('No songs could be recommended for user: %s' %user)
            return -1
        else:
            return df
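
The methods above read self.train_data, self.user_id and self.item_id, but the post does not show where these are set. The following skeleton is only a guess at the missing part, inferred from the usage example; the real constructor and creat() may differ.

```python
class item_similarity_recommender_py:
    """Skeleton only: holds the state that the methods listed above rely on."""

    def __init__(self):
        self.train_data = None   # DataFrame of listening records
        self.user_id = None      # name of the user column
        self.item_id = None      # name of the item (song) column

    def creat(self, train_data, user_id, item_id):
        # Method name spelled as in the usage example in this post.
        self.train_data = train_data
        self.user_id = user_id
        self.item_id = item_id

model = item_similarity_recommender_py()
# A plain list stands in for the DataFrame in this illustration.
model.creat([("u1", "Song A")], "user", "title")
```

In the full module, recommend(), get_user_items() and the other methods shown above would be defined on this class, with numpy and pandas imported at the top of the file.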