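I recently needed a recommender system for work, so I surveyed the topic. These are my study notes, kept for future reference.
I. Overview
The most frequently mentioned technique is collaborative filtering.
Collaborative filtering falls into three categories:
user-based, item-based, and model-based.
This post only covers a first-pass build of an item-based recommender, for reference.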
II. Project background and goal
Given each user's listening history, recommend suitable songs to a specific user.
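III. Data
User listening history: train_triplets.txt, 48,373,586 records, each with three fields: 'user_id', 'song_id', and 'play_count'.
Song metadata: track_metadata.db, with information on a million songs; each song has fields such as 'track_id', 'title', 'song_id', 'artist', and 'release'.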
IV. Data exploration (the exploration itself is omitted here)
Notes:
(1) Reading the .txt file
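import pandas as pd

# data_home: path prefix of the data directory (defined elsewhere)
triplet_dataset = pd.read_csv(filepath_or_buffer=data_home+'train_triplets.txt',
                              sep='\t', header=None,
                              names=['user','song','play_count'])
triplet_dataset.shape
(48373586, 3)
From here on, the data is handled just like a CSV.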
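With close to 50 million rows, loading the whole file at once may not fit comfortably in memory on every machine. A minimal sketch of reading it in chunks with the `chunksize` parameter of `pd.read_csv` is shown below. The helper name `total_plays_per_user` and the tiny in-memory sample are assumptions made for illustration; they are not part of the original notes.

```python
import io
import pandas as pd

def total_plays_per_user(path_or_buffer, chunksize=1_000_000):
    """Sum play_count per user while reading the file chunk by chunk."""
    totals = None
    reader = pd.read_csv(path_or_buffer, sep='\t', header=None,
                         names=['user', 'song', 'play_count'],
                         chunksize=chunksize)
    for chunk in reader:
        part = chunk.groupby('user')['play_count'].sum()
        # Merge this chunk's partial sums into the running totals
        totals = part if totals is None else totals.add(part, fill_value=0)
    return totals.astype(int)

# Hypothetical three-line sample in the same tab-separated layout
sample = io.StringIO("u1\ts1\t3\nu2\ts1\t1\nu1\ts2\t2\n")
totals = total_plays_per_user(sample, chunksize=2)
print(totals)
```

For the real file, the first argument would be the path `data_home+'train_triplets.txt'`.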
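(2) Reading the .db file
import sqlite3

conn = sqlite3.connect(data_home+'track_metadata.db')
cur = conn.cursor()
# List the tables in the SQLite database
cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
cur.fetchall()
[('songs',)]
# Load the whole songs table into a DataFrame
track_metadata_df = pd.read_sql(con=conn, sql='select * from songs')
From here on, the data is handled just like a CSV.
(3) To keep the computation manageable, only a small subset of the data was used for modeling.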
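The notes do not say how the subset was chosen. One plausible way, shown here purely as an assumption, is to keep only the most active users and the most played songs. The function name `take_subset`, its parameters, and the toy data are hypothetical.

```python
import pandas as pd

def take_subset(triplets, n_users, n_songs):
    """Keep rows whose user AND song are both among the most played."""
    user_totals = triplets.groupby('user')['play_count'].sum()
    song_totals = triplets.groupby('song')['play_count'].sum()
    top_users = user_totals.nlargest(n_users).index
    top_songs = song_totals.nlargest(n_songs).index
    mask = triplets['user'].isin(top_users) & triplets['song'].isin(top_songs)
    return triplets[mask].reset_index(drop=True)

# Hypothetical toy data with the same three columns as triplet_dataset
toy = pd.DataFrame({
    'user': ['u1', 'u1', 'u2', 'u3', 'u3'],
    'song': ['s1', 's2', 's1', 's3', 's1'],
    'play_count': [5, 3, 1, 1, 4],
})
subset = take_subset(toy, n_users=2, n_songs=1)
print(subset)
```

Filtering on both users and songs shrinks both dimensions of the similarity computation, which is what dominates the running time of the model below.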
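V. Model building (item-based collaborative filtering)
(1) For the target user user_id, look up in the listening history (train_data.csv) every song the user has listened to: user_songs.
(2) For each of those songs, find all of its listeners: user_songs_listeners.
(3) Collect every song in the training data train_data, all_songs, and the listeners of each of them, all_songs_listeners.
(4) Compute the similarity matrix:
from user_songs_listeners and all_songs_listeners, compute the similarity matrix between user_songs and all_songs.
(5) Use the similarity matrix to derive a score for every song in all_songs, sort by score, and recommend the top K songs that are not already in user_songs to the target user.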
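The five steps above can be sketched end to end on a toy dataset. This is a compact illustration under stated assumptions, not the module presented later: `recommend_sketch` and the toy `train_data` are hypothetical, and the column names 'user' and 'title' simply follow the usage example in this post.

```python
import pandas as pd

# Hypothetical toy listening history
train_data = pd.DataFrame({
    'user':  ['u1', 'u1', 'u2', 'u2', 'u3', 'u3', 'u4'],
    'title': ['A',  'B',  'A',  'C',  'B',  'C',  'D'],
})

def recommend_sketch(train_data, user_id, k=2):
    # Steps 2 and 3: listeners of every song, as {title: set of users}
    listeners = train_data.groupby('title')['user'].apply(set).to_dict()
    # Step 1: songs the target user has listened to
    user_songs = [s for s, users in listeners.items() if user_id in users]
    if not user_songs:
        return []
    scores = {}
    for candidate, cand_users in listeners.items():
        if candidate in user_songs:
            continue  # never recommend something already heard
        # Step 4: Jaccard similarity with each of the user's songs
        sims = [len(listeners[s] & cand_users) / len(listeners[s] | cand_users)
                for s in user_songs]
        # Step 5: score = mean similarity over the user's songs
        scores[candidate] = sum(sims) / len(sims)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

recs = recommend_sketch(train_data, 'u1')
print(recs)
```

Here u1 has heard A and B. Song C shares one listener with each of them out of three distinct listeners, so it scores 1/3, while D shares none and scores 0.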
VI. Code
1. Write the module RecommendersJsq.py
Usage:
import RecommendersJsq as RecommendersJsq
is_model = RecommendersJsq.item_similarity_recommender_py()
is_model.creat(train_data,'user','title')
user_id = list(train_data.user)[7]
recommendations = is_model.recommend(user_id)
print(recommendations)
This prints the recommendations returned for that user.
2. The main methods of RecommendersJsq.py are as follows:
(1) recommend()
def recommend(self, user_id):
    # A. Get all songs the user has listened to
    user_songs = self.get_user_items(user_id)
    print('by JSQ No. of unique songs for the user: %d' % len(user_songs))

    # B. Get all songs in train_data
    all_songs = self.get_all_items_train_data()
    print('by JSQ No. of unique songs in the training set: %d' % len(all_songs))

    # C. Construct the item cooccurence matrix of size
    #    len(user_songs) x len(all_songs)
    cooccurence_matrix = self.get_cooccurence_matrix(user_songs, all_songs)

    # D. Use the cooccurence matrix to make recommendations
    df_recommendations = self.get_top_recommendations(user_id, cooccurence_matrix, all_songs, user_songs)
    return df_recommendations
(2) get_user_items()
For the target user user_id, look up in the listening history (train_data.csv) every song the user has listened to: user_songs.
def get_user_items(self, user_id):
    # Rows of the training data that belong to this user
    user_items_data = self.train_data[self.train_data[self.user_id] == user_id]
    user_items = list((user_items_data[self.item_id]).unique())
    return user_items
(3) get_all_items_train_data()
Collect every song in the training data train_data: all_songs.
# Get unique items (songs) in the training data
def get_all_items_train_data(self):
    all_items = list(self.train_data[self.item_id].unique())
    return all_items
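(4) get_cooccurence_matrix()
Compute the similarity matrix, of size len(user_songs) x len(all_songs).
From user_songs_listeners and all_songs_listeners, compute the similarity matrix between user_songs and all_songs.
Entry (i, j) = [size of the intersection of the listeners of the i-th song in user_songs and the listeners of the j-th song in all_songs] / [size of the union of those two listener sets]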
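Written as a formula, this is the Jaccard index. The symbols $U_i$ (listeners of the i-th song in user_songs) and $V_j$ (listeners of the j-th song in all_songs) are introduced here only for illustration:

```latex
\[
\mathrm{sim}(i, j) = \frac{|U_i \cap V_j|}{|U_i \cup V_j|},
\qquad
\mathrm{sim}(i, j) = 0 \ \text{ when } \ U_i \cap V_j = \varnothing .
\]
```

For example, with $U_i = \{u_1, u_2\}$ and $V_j = \{u_2, u_3\}$, the intersection has one element and the union has three, so $\mathrm{sim}(i, j) = 1/3$.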
# Construct cooccurence matrix (the module needs `import numpy as np`)
def get_cooccurence_matrix(self, user_songs, all_songs):
    # Get the listeners of each of the user's songs
    user_songs_listeners = []
    for i in range(len(user_songs)):
        user_songs_listeners.append(self.get_item_users(user_songs[i]))

    # Get the listeners of every song in train_data
    all_songs_listeners = []
    for i in range(len(all_songs)):
        all_songs_listeners.append(self.get_item_users(all_songs[i]))

    # Initialize the item cooccurence matrix of size
    # len(user_songs) x len(all_songs)
    cooccurence_matrix = np.matrix(np.zeros((len(user_songs), len(all_songs))))

    # Calculate similarity between the user's songs and all unique songs
    # in the training data
    for i in range(len(user_songs)):
        # Unique listeners (users) of song (item) i
        users_i = user_songs_listeners[i]
        for j in range(len(all_songs)):
            # Unique listeners (users) of song (item) j
            users_j = all_songs_listeners[j]
            # Intersection of the listeners of songs i and j
            intersection = users_i.intersection(users_j)
            # cooccurence_matrix[i, j] is the Jaccard index
            if len(intersection) != 0:
                # Union of the listeners of songs i and j
                union = users_i.union(users_j)
                cooccurence_matrix[i, j] = float(len(intersection)) / float(len(union))
            else:
                cooccurence_matrix[i, j] = 0
    return cooccurence_matrix
This relies on finding all listeners of a given song: get_item_users()
# Get unique users for a given item (song)
def get_item_users(self, item_id):
    # Rows of the training data that belong to this item
    train_data_sub = self.train_data[self.train_data[self.item_id] == item_id]
    item_users = set(train_data_sub[self.user_id].unique())
    return item_users
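Note that get_item_users() filters the whole DataFrame on every call, and get_cooccurence_matrix() calls it once per song. A possible speed-up, sketched here as an assumption rather than as part of the module, is to build all listener sets in a single groupby pass and look them up afterwards. The helper name `build_item_users` and the toy data are hypothetical.

```python
import pandas as pd

def build_item_users(train_data, user_col, item_col):
    """Map every item to the set of its users in one pass over the data."""
    return train_data.groupby(item_col)[user_col].apply(set).to_dict()

# Hypothetical toy listening history
train_data = pd.DataFrame({
    'user':  ['u1', 'u1', 'u2', 'u2', 'u3'],
    'title': ['A',  'B',  'A',  'C',  'B'],
})
item_users = build_item_users(train_data, 'user', 'title')
print(item_users['A'])  # listeners of song A
```

With such a dictionary, each lookup in the double loop becomes a constant-time `item_users[song]` instead of a scan of the training data.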
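(5) get_top_recommendations()
Use the similarity matrix to derive a score for every song in all_songs, sort by score, and recommend the top K songs that are not already in user_songs to the target user.
Here, score = (sum of each column of the matrix) / (number of rows of the matrix), i.e. the mean similarity between a candidate song and every song in user_songs. See the hand-drawn diagram above for details.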
def get_top_recommendations(self, user, cooccurence_matrix, all_songs, user_songs):
    non_zero = np.count_nonzero(cooccurence_matrix)
    print('No. of non-zero values in the cooccurence matrix: %d' % non_zero)
    # Average the similarity scores over all of the user's songs (column mean)
    score = cooccurence_matrix.sum(axis=0) / float(cooccurence_matrix.shape[0])
    score = np.array(score)[0].tolist()
    # Sort (score, index) pairs by score, highest first
    sorted_index = sorted(((e, i) for i, e in enumerate(score)), reverse=True)
    # Create a DataFrame to hold the recommendations
    columns = ['user', 'song', 'score', 'rank']
    df = pd.DataFrame(columns=columns)
    # Fill the DataFrame with the top 10 item-based recommendations
    rank = 1
    for i in range(len(sorted_index)):
        if not np.isnan(sorted_index[i][0]) and all_songs[sorted_index[i][1]] not in user_songs and rank <= 10:
            df.loc[df.shape[0]] = [user, all_songs[sorted_index[i][1]], sorted_index[i][0], rank]
            rank += 1
    # Handle the case where nothing could be recommended
    if df.shape[0] == 0:
        print('No songs could be recommended for user: %s' % user)
        return -1
    else:
        return df
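As a small worked example of the scoring rule used in get_top_recommendations(), the sketch below computes the column means of a made-up 2 x 3 similarity matrix and ranks the songs the user has not heard. All values and names are hypothetical, and a plain NumPy array is used here instead of np.matrix.

```python
import numpy as np

# Hypothetical similarity matrix: rows = the user's songs (A, B),
# columns = all songs (A, B, C)
cooccurence = np.array([[1.0, 0.5, 0.0],
                        [0.5, 1.0, 0.25]])
all_songs = ['A', 'B', 'C']
user_songs = ['A', 'B']

# score[j] = mean of column j = column sum / number of rows
scores = cooccurence.mean(axis=0)

# Sort column indices by descending score, then drop songs already heard
order = np.argsort(-scores, kind='stable')
ranked = [(all_songs[j], float(scores[j]))
          for j in order if all_songs[j] not in user_songs]
print(scores)
print(ranked)
```

Songs A and B score highest because each is identical to itself, which is why the filter on user_songs is needed before recommending anything.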