音乐推荐系统

最新推荐文章于 2020-07-27 20:55:30 发布

小智rando

最新推荐文章于 2020-07-27 20:55:30 发布

阅读量6k

点赞数 3

分类专栏：机器学习实战 project 文章标签：推荐系统协同过滤

本文链接：https://blog.csdn.net/qq_38016957/article/details/93323453

版权

project 同时被 2 个专栏收录

10 篇文章 0 订阅

订阅专栏

机器学习实战

7 篇文章 0 订阅

订阅专栏

推荐系统

音乐数据处理
基于商品相似性的推荐
基于SVD矩阵分解的推荐

1、数据概况：

在数据中有用户，歌曲，播放量，以下为数据记录总量
shape：(48373586, 3)
memory usage: 1.1+ GB

2、数据处理：

（1）对每一个用户，分别统计播放总量

key：用户
value：播放量
查询该用户是否在字典中，若在则更新value，否则新用户加入字典

数据中会存在很多惰性用户，行为量很少，可能会对拉低结果的平均效果，所以需要对播放量排序，筛选

output_dict = {}
with open(data_home+'train_triplets.txt') as f:
    for line_number, line in enumerate(f):
        user = line.split('\t')[0]
        play_count = int(line.split('\t')[2])
        if user in output_dict:
            play_count +=output_dict[user]
            output_dict.update({user:play_count})
        output_dict.update({user:play_count})
output_list = [{'user':k,'play_count':v} for k,v in output_dict.items()]
play_count_df = pd.DataFrame(output_list)
play_count_df = play_count_df.sort_values(by = 'play_count', ascending = False)

（2）对于每一首歌，分别统计播放总量

分别统计它的播放总量，与上面相同的做法

（3）滤去惰性用户、音乐

用户：按大小排好序，取一部分用户的作为我们的实验数据
依据：前n个用户播放量总和 / 音乐全部的播放量总和 = 前n个用户行为占全部用户行为的百分比
取个百分比的阈值即可滤去惰性用户

total_play_count = sum(song_count_df.play_count)
print ((float(play_count_df.head(n=100000).play_count.sum())/total_play_count)*100)
play_count_subset = play_count_df.head(n=100000)

取数据：取10W个用户，3W首歌

user_subset = list(play_count_subset.user)
song_subset = list(song_count_subset.song)

对原文件过滤：

triplet_dataset = pd.read_csv(filepath_or_buffer=data_home+'train_triplets.txt',sep='\t', 
                              header=None, names=['user','song','play_count'])
triplet_dataset_sub = triplet_dataset[triplet_dataset.user.isin(user_subset) ]
del(triplet_dataset)
triplet_dataset_sub_song = triplet_dataset_sub[triplet_dataset_sub.song.isin(song_subset)]
del(triplet_dataset_sub)

（3）加入音乐详细信息

音乐详细信息存在数据库中，首先连接数据库，将正常用户的信息提取出来，并转成csv文件。

conn = sqlite3.connect(data_home+'track_metadata.db')
cur = conn.cursor()
cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
cur.fetchall()

track_metadata_df = pd.read_sql(con=conn, sql='select * from songs')
track_metadata_df_sub = track_metadata_df[track_metadata_df.song_id.isin(song_subset)]

得到详细信息之后，将详细信息表和用户表通过歌名merge，并清除无用信息，得到最终数据。

del(track_metadata_df_sub['track_id'])
del(track_metadata_df_sub['artist_mbid'])
track_metadata_df_sub = track_metadata_df_sub.drop_duplicates(['song_id'])
triplet_dataset_sub_song_merged = pd.merge(triplet_dataset_sub_song, track_metadata_df_sub, how='left', left_on='song', right_on='song_id')
triplet_dataset_sub_song_merged.rename(columns={'play_count':'listen_count'},inplace=True)

del(triplet_dataset_sub_song_merged['song_id'])
del(triplet_dataset_sub_song_merged['artist_id'])
del(triplet_dataset_sub_song_merged['duration'])
del(triplet_dataset_sub_song_merged['artist_familiarity'])
del(triplet_dataset_sub_song_merged['artist_hotttnesss'])
del(triplet_dataset_sub_song_merged['track_7digitalid'])
del(triplet_dataset_sub_song_merged['shs_perf'])
del(triplet_dataset_sub_song_merged['shs_work'])


triplet_dataset_sub_song_merged.head(n=10)

（3）推荐

排行榜单推荐

简单暴力，排行榜单推荐：根据播放量进行排名推荐

基于歌曲相似度的推荐

# 1、获得用户歌单
# 2、获取全部歌曲
# 3、构建相似矩阵：
	设用户歌单有100首歌，歌单库有10000首，构建100x10000的矩阵
	对于i行j列的值，即为用户歌单的第i首歌，与歌单库的第j首歌之间的相似度（基于用户）
	相似度的计算：将听过第i首歌的用户与听过第j首歌的用户做交集/并集，数值越大，则相似度越大
# 4、将矩阵的值全部算出，然后对每一列的值求和取均值，即为用户歌单与库中的某首歌的相似程度，哪一首均值高则推荐哪一首

矩阵构建：

    #Construct cooccurence matrix
    def construct_cooccurence_matrix(self, user_songs, all_songs):
            
        ####################################
        #Get users for all songs in user_songs.
        ####################################
        user_songs_users = []        
        for i in range(0, len(user_songs)):
            user_songs_users.append(self.get_item_users(user_songs[i]))
            
        ###############################################
        #Initialize the item cooccurence matrix of size 
        #len(user_songs) X len(songs)
        ###############################################
        cooccurence_matrix = np.matrix(np.zeros(shape=(len(user_songs), len(all_songs))), float)
           
        #############################################################
        #Calculate similarity between user songs and all unique songs
        #in the training data
        #############################################################
        for i in range(0,len(all_songs)):
            #Calculate unique listeners (users) of song (item) i
            songs_i_data = self.train_data[self.train_data[self.item_id] == all_songs[i]]
            users_i = set(songs_i_data[self.user_id].unique())
            
            for j in range(0,len(user_songs)):       
                    
                #Get unique listeners (users) of song (item) j
                users_j = user_songs_users[j]
                    
                #Calculate intersection of listeners of songs i and j
                users_intersection = users_i.intersection(users_j)
                
                #Calculate cooccurence_matrix[i,j] as Jaccard Index
                if len(users_intersection) != 0:
                    #Calculate union of listeners of songs i and j
                    users_union = users_i.union(users_j)
                    
                    cooccurence_matrix[j,i] = float(len(users_intersection))/float(len(users_union))
                else:
                    cooccurence_matrix[j,i] = 0
                    
        
        return cooccurence_matrix

基于矩阵分解（SVD）的推荐

先计算歌曲 ( 被当前用户播放量 / 用户播放总量 ) 当做分值(相当于对电影的评分)
*歌曲被当前用户播放量占比越高，则用户越喜欢这首歌曲

triplet_dataset_sub_song_merged_sum_df=
triplet_dataset_sub_song_merged[['user','listen_count']].groupby('user').sum().reset_index()

triplet_dataset_sub_song_merged_sum_df.rename(columns={'listen_count':'total_listen_count'},inplace=True)
triplet_dataset_sub_song_merged = pd.merge(triplet_dataset_sub_song_merged,triplet_dataset_sub_song_merged_sum_df)
triplet_dataset_sub_song_merged.head()

triplet_dataset_sub_song_merged['fractional_play_count'] = triplet_dataset_sub_song_merged['listen_count']/triplet_dataset_sub_song_merged['total_listen_count']

将用户、歌曲、信息的名字设置为较短的字符，并转成稀疏矩阵

from scipy.sparse import coo_matrix

# 将用户、歌曲、信息的名字设置为较短的字符
small_set = triplet_dataset_sub_song_merged
user_codes = small_set.user.drop_duplicates().reset_index()
song_codes = small_set.song.drop_duplicates().reset_index()
user_codes.rename(columns={'index':'user_index'}, inplace=True)
song_codes.rename(columns={'index':'song_index'}, inplace=True)

# 将值换为之前的index
song_codes['so_index_value'] = list(song_codes.index)
user_codes['us_index_value'] = list(user_codes.index)

small_set = pd.merge(small_set,song_codes,how='left')
small_set = pd.merge(small_set,user_codes,how='left')
mat_candidate = small_set[['us_index_value','so_index_value','fractional_play_count']]
data_array = mat_candidate.fractional_play_count.values
row_array = mat_candidate.us_index_value.values
col_array = mat_candidate.so_index_value.values

data_sparse = coo_matrix((data_array, (row_array, col_array)),dtype=float)

将用户得分的稀疏矩阵进行svd分解

import math as mt
from scipy.sparse.linalg import * #used for matrix multiplication
from scipy.sparse.linalg import svds
from scipy.sparse import csc_matrix

def compute_svd(urm, K):
    U, s, Vt = svds(urm, K)
    
    # 转换的S为50个值，将其转换成矩阵对角线的形式
    dim = (len(s), len(s))
    S = np.zeros(dim, dtype=np.float32)
    for i in range(0, len(s)):
        S[i,i] = mt.sqrt(s[i])

    # 转换成稀疏矩阵的形式
    U = csc_matrix(U, dtype=np.float32)
    S = csc_matrix(S, dtype=np.float32)
    Vt = csc_matrix(Vt, dtype=np.float32)
    
    return U, S, Vt

# svd特征值的数量
K=50
# 用户得分的稀疏矩阵
urm = data_sparse
# 最大歌曲数量
MAX_PID = urm.shape[1]
# 最大用户数量
MAX_UID = urm.shape[0]

U, S, Vt = compute_svd(urm, K)

核心部分：将分解得到的U对应的推荐用户的行数据，乘上SxV，得到还原的用户得分数据，还原的数据中会将用户没有评分的歌曲（即稀疏的地方）更新一个新的值，并将这些值排序取最大的前10个，即为需要给用户推荐的歌曲。

def compute_estimated_matrix(urm, U, S, Vt, uTest, K, test):
    rightTerm = S*Vt 
    max_recommendation = 10
    estimatedRatings = np.zeros(shape=(MAX_UID, MAX_PID), dtype=np.float16)
    recomendRatings = np.zeros(shape=(MAX_UID,max_recommendation ), dtype=np.float16)
    for userTest in uTest:
        prod = U[userTest, :]*rightTerm
        estimatedRatings[userTest, :] = prod.todense()
        recomendRatings[userTest, :] = (-estimatedRatings[userTest, :]).argsort()[:max_recommendation]
    return recomendRatings

# 推荐的用户
uTest = [4,5,6,7,8,873,23]

uTest_recommended_items = compute_estimated_matrix(urm, U, S, Vt, uTest, K, True)

推荐结果：

for user in uTest:
    print("Recommendation for user with user id {}". format(user))
    rank_value = 1
    for i in uTest_recommended_items[user,0:10]:
        song_details = small_set[small_set.so_index_value == i].drop_duplicates('so_index_value')[['title','artist_name']]
        print("The number {} recommended song is {} BY {}".format(rank_value, list(song_details['title'])[0],list(song_details['artist_name'])[0]))
        rank_value+=1

Recommendation for user with user id 4
The number 1 recommended song is Fireflies BY Charttraxx Karaoke
The number 2 recommended song is Hey_ Soul Sister BY Train
The number 3 recommended song is OMG BY Usher featuring will.i.am
The number 4 recommended song is Lucky (Album Version) BY Jason Mraz & Colbie Caillat
The number 5 recommended song is Vanilla Twilight BY Owl City
The number 6 recommended song is Crumpshit BY Philippe Rochard
The number 7 recommended song is Billionaire [feat. Bruno Mars] (Explicit Album Version) BY Travie McCoy
The number 8 recommended song is Love Story BY Taylor Swift
The number 9 recommended song is TULENLIEKKI BY M.A. Numminen
The number 10 recommended song is Use Somebody BY Kings Of Leon
Recommendation for user with user id 5
The number 1 recommended song is Sehr kosmisch BY Harmonia
The number 2 recommended song is Ain’t Misbehavin BY Sam Cooke
The number 3 recommended song is Dog Days Are Over (Radio Edit) BY Florence + The Machine
The number 4 recommended song is Revelry BY Kings Of Leon
The number 5 recommended song is Undo BY BjÃ¶rk
The number 6 recommended song is Cosmic Love BY Florence + The Machine
The number 7 recommended song is Home BY Edward Sharpe & The Magnetic Zeros
The number 8 recommended song is You’ve Got The Love BY Florence + The Machine
The number 9 recommended song is Bring Me To Life BY Evanescence
The number 10 recommended song is Tighten Up BY The Black Keys
Recommendation for user with user id 6
The number 1 recommended song is Crumpshit BY Philippe Rochard
The number 2 recommended song is Marry Me BY Train
The number 3 recommended song is Hey_ Soul Sister BY Train
The number 4 recommended song is Lucky (Album Version) BY Jason Mraz & Colbie Caillat
The number 5 recommended song is One On One BY the bird and the bee
The number 6 recommended song is I Never Told You BY Colbie Caillat
The number 7 recommended song is Canada BY Five Iron Frenzy
The number 8 recommended song is Fireflies BY Charttraxx Karaoke
The number 9 recommended song is TULENLIEKKI BY M.A. Numminen
The number 10 recommended song is Bring Me To Life BY Evanescence
Recommendation for user with user id 7
The number 1 recommended song is Behind The Sea [Live In Chicago] BY Panic At The Disco
The number 2 recommended song is The City Is At War (Album Version) BY Cobra Starship
The number 3 recommended song is Dead Souls BY Nine Inch Nails
The number 4 recommended song is Una Confusion BY LU
The number 5 recommended song is Home BY Edward Sharpe & The Magnetic Zeros
The number 6 recommended song is Climbing Up The Walls BY Radiohead
The number 7 recommended song is Tighten Up BY The Black Keys
The number 8 recommended song is Tive Sim BY Cartola
The number 9 recommended song is West One (Shine On Me) BY The Ruts
The number 10 recommended song is Cosmic Love BY Florence + The Machine
Recommendation for user with user id 8
The number 1 recommended song is Undo BY BjÃ¶rk
The number 2 recommended song is Canada BY Five Iron Frenzy
The number 3 recommended song is Better To Reign In Hell BY Cradle Of Filth
The number 4 recommended song is Unite (2009 Digital Remaster) BY Beastie Boys
The number 5 recommended song is Behind The Sea [Live In Chicago] BY Panic At The Disco
The number 6 recommended song is Rockin’ Around The Christmas Tree BY Brenda Lee
The number 7 recommended song is Devil’s Slide BY Joe Satriani
The number 8 recommended song is Revelry BY Kings Of Leon
The number 9 recommended song is 16 Candles BY The Crests
The number 10 recommended song is Catch You Baby (Steve Pitron & Max Sanna Radio Edit) BY Lonnie Gordon
Recommendation for user with user id 873
The number 1 recommended song is The Scientist BY Coldplay
The number 2 recommended song is Yellow BY Coldplay
The number 3 recommended song is Clocks BY Coldplay
The number 4 recommended song is Fix You BY Coldplay
The number 5 recommended song is In My Place BY Coldplay
The number 6 recommended song is Shiver BY Coldplay
The number 7 recommended song is Speed Of Sound BY Coldplay
The number 8 recommended song is Creep (Explicit) BY Radiohead
The number 9 recommended song is Sparks BY Coldplay
The number 10 recommended song is Use Somebody BY Kings Of Leon
Recommendation for user with user id 23
The number 1 recommended song is Garden Of Eden BY Guns N’ Roses
The number 2 recommended song is Don’t Speak BY John DahlbÃ¤ck
The number 3 recommended song is Master Of Puppets BY Metallica
The number 4 recommended song is TULENLIEKKI BY M.A. Numminen
The number 5 recommended song is Bring Me To Life BY Evanescence
The number 6 recommended song is Kryptonite BY 3 Doors Down
The number 7 recommended song is Make Her Say BY Kid Cudi / Kanye West / Common
The number 8 recommended song is Night Village BY Deep Forest
The number 9 recommended song is Better To Reign In Hell BY Cradle Of Filth
The number 10 recommended song is Xanadu BY Olivia Newton-John;Electric Light Orchestra