[Python ~ Machine Learning] --- A Basic Understanding of Music Recommendation Engines

Music Recommendation Engine

 

Datasets

The Million Song Dataset

The Million Song Dataset can be downloaded from https://labrosa.ee.columbia.edu/millionsong/. The raw data contains quantified audio features for a million songs spanning many years. It is a collaboration between The Echo Nest and LabROSA.
We won't be using the entire dataset here, only a portion of it.
Several other datasets have been derived from it. One is The Echo Nest Taste Profile subset, which records anonymized users' song play counts. Even though it is only a subset of the Million Song Dataset, it is still huge: it contains 48 million rows of triplets:
(user id, song id, play counts)

The data covers roughly 1 million users' play records for 384,000 songs.
You can download it from http://labrosa.ee.columbia.edu/millionsong/sites/default/files/challenge/train_triplets.txt.zip. The compressed file is about 500MB and expands to about 3.5GB.

Data exploration

Loading & trimming the data

For work on a single machine, this dataset is too big. A commercial server, even a single one, can handle far more than this, never mind large companies with compute clusters.
Even so, in real work we routinely sample from a large dataset and do the analysis and modeling on one machine: with less data every operation runs fast, and we can quickly validate whether what we want to do is feasible at all.
So here, too, we need to trim the data down:

In [1]:

import pandas as pd
import numpy as np
import time
import sqlite3

import datetime
import math
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab


%matplotlib inline

plt.rcParams['font.sans-serif']=['SimHei']     # render CJK characters in labels correctly
plt.rcParams['axes.unicode_minus']=False       # render the minus sign correctly

In [2]:

triplet_dataset = pd.read_csv(filepath_or_buffer='./data/train_triplets.txt', 
                              nrows=10000,sep='\t', header=None, 
                              names=['user','song','play_count'])     # the file itself has no header row

In [3]:

triplet_dataset.head()

Out[3]:

                                       user                song  play_count
0  b80344d063b5ccb3212f76538f3d9e43d87dca9e  SOAKIMP12A8C130995           1
1  b80344d063b5ccb3212f76538f3d9e43d87dca9e  SOAPDEY12A81C210A9           1
2  b80344d063b5ccb3212f76538f3d9e43d87dca9e  SOBBMDR12A8C13253B           2
3  b80344d063b5ccb3212f76538f3d9e43d87dca9e  SOBFNSP12AF72A0E22           1
4  b80344d063b5ccb3212f76538f3d9e43d87dca9e  SOBFOVM12A58A7D494           1

With a dataset of this size, the first thing to work out is how many users (or songs) we actually need to consider. The original data covers about 1 million users, but do we really need all of them? If, say, 20% of the users account for 80% of all plays, then considering just those 20% of users gets us most of the way there.

Normally a cumulative sum over the play counts tells us how many users make up 80% of total plays. But the raw file is far too large to load into pandas wholesale, so we read it line by line ourselves and build up the totals piece by piece:

In [4]:

output_dict = {}
with open('./data/train_triplets.txt') as f:
    for line in f:
        user = line.split('\t')[0]
        play_count = int(line.split('\t')[2])
        # accumulate this user's running total of plays
        if user in output_dict:
            play_count += output_dict[user]
        output_dict[user] = play_count
output_list = [{'user':k,'play_count':v} for k,v in output_dict.items()]
play_count_df = pd.DataFrame(output_list)
play_count_df = play_count_df.sort_values(by = 'play_count', ascending = False)

play_count_df.to_csv(path_or_buf='./data/user_playcount_df.csv', index = False)
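
As an aside, you can get the same aggregation without leaving pandas by streaming the file in chunks; a minimal sketch (the chunk size is our choice, not from the original):

import pandas as pd

# Stream the 48M-row triplets file in 1M-row chunks and accumulate per-user totals
chunks = pd.read_csv('./data/train_triplets.txt', sep='\t', header=None,
                     names=['user', 'song', 'play_count'], chunksize=1000000)
user_counts = None
for chunk in chunks:
    partial = chunk.groupby('user')['play_count'].sum()
    user_counts = partial if user_counts is None else user_counts.add(partial, fill_value=0)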

In [5]:

play_count_df = pd.read_csv('./data/user_playcount_df.csv')
play_count_df.head()

Out[5]:

   play_count                                      user
0       13132  093cb74eb3c517c5179ae24caf0ebec51b24d2a2
1        9884  119b7c88d58d0c6eb051365c103da5caf817bea6
2        8210  3fa44653315697f42410a30cb766a4eb102080bb
3        7015  a2679496cd0af9779a92a13ff7c6af5c81ea8c7b
4        6494  d7d2d888ae04d16e994d6964214a1de81392ee04

In [ ]:

output_dict = {}
with open('./data/train_triplets.txt') as f:
    for line in f:
        song = line.split('\t')[1]
        play_count = int(line.split('\t')[2])
        # accumulate this song's running total of plays
        if song in output_dict:
            play_count += output_dict[song]
        output_dict[song] = play_count
output_list = [{'song':k,'play_count':v} for k,v in output_dict.items()]
song_count_df = pd.DataFrame(output_list)
song_count_df = song_count_df.sort_values(by = 'play_count', ascending = False)

song_count_df.to_csv(path_or_buf='./data/song_playcount_df.csv', index = False)

In [6]:

song_count_df = pd.read_csv(filepath_or_buffer='./data/song_playcount_df.csv')
song_count_df.head()

Out[6]:

   play_count                song
0      726885  SOBONKR12A58A7A7E0
1      648239  SOAUWYT12A81C206F1
2      527893  SOSXLTC12AF72A7F54
3      425463  SOFRQTD12A81C233C0
4      389880  SOEGIYH12A6D4FC0E3

With these two summaries in hand, the first task is to find how many top users account for 40% of total plays. That "40%" is just a number we picked; in your own work you can choose it however you like, the point is to keep the dataset size under control. Of course, with an efficient Presto cluster (HiveQL-compatible, but purely in-memory) you could compute such statistics over the full dataset quickly as well.

For our dataset, roughly the top 100,000 users account for 40% of all plays.
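
How do we find that cutoff? One quick way (a sketch we're adding here, not in the original notebook) is a cumulative sum over the aggregated, already-sorted play_count_df, which at about 1 million rows is small enough for pandas:

import numpy as np

# Cumulative share of total plays, with users sorted by play count descending
cum_share = play_count_df.play_count.cumsum() / float(play_count_df.play_count.sum())
# Position of the first user at which the cumulative share reaches 40%
print(np.searchsorted(cum_share.values, 0.40) + 1)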

In [7]:

total_play_count = sum(song_count_df.play_count)
print (float(play_count_df.head(n=100000).play_count.sum())/total_play_count)*100

play_count_subset = play_count_df.head(n=100000)
40.8807280501

Likewise, we find that about 30,000 songs account for roughly 80% of total plays. That alone is valuable information: about 10% of the songs draw 80% of the plays.
With a few conditions like these we can extract the most representative part of the original dataset and keep the volume of data we have to process within a manageable range.

In [8]:

print (float(song_count_df.head(n=30000).play_count.sum())/total_play_count)*100

song_count_subset = song_count_df.head(n=30000)
78.3931536665

In [9]:

# the target user set and the target song set
user_subset = list(play_count_subset.user)
song_subset = list(song_count_subset.song)

In [11]:

triplet_dataset = pd.read_csv(filepath_or_buffer='./data/train_triplets.txt',sep='\t', 
                              header=None, names=['user','song','play_count'])

# keep only the target users
triplet_dataset_sub = triplet_dataset[triplet_dataset.user.isin(user_subset) ]
del(triplet_dataset)

# filter out songs that are not in the target set
triplet_dataset_sub_song = triplet_dataset_sub[triplet_dataset_sub.song.isin(song_subset)]
del(triplet_dataset_sub)

triplet_dataset_sub_song.to_csv('./data/triplet_dataset_sub_song.csv', index=False)

In [12]:

triplet_dataset_sub_song = pd.read_csv(filepath_or_buffer='./data/triplet_dataset_sub_song.csv')

In [13]:

triplet_dataset_sub_song.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10775200 entries, 0 to 10775199
Data columns (total 3 columns):
user          object
song          object
play_count    int64
dtypes: int64(1), object(2)
memory usage: 246.6+ MB

Additional information

The data we've loaded so far is just the triplets: we know neither the song titles nor the artists, not even the album names. The dataset does, however, ship this extra per-song information, such as song title, artist name, and album name, as a SQLite database file. The original download link is: http://labrosa.ee.columbia.edu/millionsong/sites/default/files/AdditionalFiles/track_metadata.db

In [14]:

conn = sqlite3.connect('./data/track_metadata.db')
cur = conn.cursor()
cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
cur.fetchall()

Out[14]:

[(u'songs',)]

In [15]:

track_metadata_df = pd.read_sql(con=conn, sql='select * from songs')
track_metadata_df_sub = track_metadata_df[track_metadata_df.song_id.isin(song_subset)]

In [16]:

track_metadata_df_sub.shape

Out[16]:

(30447, 14)

In [17]:

track_metadata_df_sub.head()

Out[17]:

               track_id                                       title             song_id  \
115  TRMMGCB128E079651D  Get Along (Feat: Pace Won) (Instrumental)  SOHNWIM12A67ADF7D9
123  TRMMGTX128F92FB4D9                                       Viejo  SOECFIW12A8C144546
145  TRMMGDP128F933E59A                       I Say A Little Prayer  SOGWEOB12AB018A4D0
172  TRMMHBF12903CF6E59                     At the Ball_ That's All  SOJGCRL12A8C144187
191  TRMMHKG12903CDB1B5                                  Black Gold  SOHNFBA12AB018CD1D

                                                release           artist_id  \
115                                            Charango  ARU3C671187FB3F71B
123                                            Caraluna  ARPAAPH1187FB3601B
145  The Legendary Hi Records Albums_ Volume 3: Ful...  ARNNRN31187B9AE7B7
172         Best of Laurel & Hardy - The Lonesome Pine  AR1FEUF1187B9AF3E3
191                                  Total Life Forever  ARVXV1J1187FB5BF88

                              artist_mbid     artist_name   duration  \
115  067102ea-9519-4622-9077-57ca4164cfbb       Morcheeba  227.47383
123  f69d655c-ffd6-4bee-8c2a-3086b2be2fc6         Bacilos  307.51302
145  fb7272ba-f130-4f0a-934d-6eeea4c18c9a        Al Green  133.58975
172  4a8ae4fd-ad6f-4912-851f-093f12ee3572  Laurel & Hardy  123.71546
191  6a65d878-fcd0-42cf-aff9-ca1d636a8bcc           Foals  386.32444

     artist_familiarity  artist_hotttnesss  year  track_7digitalid  shs_perf  shs_work
115            0.819087           0.533117  2002            185967        -1         0
123            0.595554           0.400705     0           6825058        -1         0
145            0.779490           0.599210  1978           5211723        -1     11898
172            0.438709           0.307120     0           8645877        -1         0
191            0.842578           0.514523  2010           9007438        -1         0

In [18]:

# merge in the metadata
del(track_metadata_df_sub['track_id'])
del(track_metadata_df_sub['artist_mbid'])
track_metadata_df_sub = track_metadata_df_sub.drop_duplicates(['song_id'])
triplet_dataset_sub_song_merged = pd.merge(triplet_dataset_sub_song, track_metadata_df_sub, how='left', left_on='song', right_on='song_id')
triplet_dataset_sub_song_merged.rename(columns={'play_count':'listen_count'},inplace=True)

In [19]:

# drop fields we won't use
del(triplet_dataset_sub_song_merged['song_id'])
del(triplet_dataset_sub_song_merged['artist_id'])
del(triplet_dataset_sub_song_merged['duration'])
del(triplet_dataset_sub_song_merged['artist_familiarity'])
del(triplet_dataset_sub_song_merged['artist_hotttnesss'])
del(triplet_dataset_sub_song_merged['track_7digitalid'])
del(triplet_dataset_sub_song_merged['shs_perf'])
del(triplet_dataset_sub_song_merged['shs_work'])

In [20]:

triplet_dataset_sub_song_merged.head()

Out[20]:

                                       user                song  listen_count  \
0  d6589314c0a9bcbca4fee0c93b14bc402363afea  SOADQPP12A67020C82            12
1  d6589314c0a9bcbca4fee0c93b14bc402363afea  SOAFTRR12AF72A8D4D             1
2  d6589314c0a9bcbca4fee0c93b14bc402363afea  SOANQFY12AB0183239             1
3  d6589314c0a9bcbca4fee0c93b14bc402363afea  SOAYATB12A6701FD50             1
4  d6589314c0a9bcbca4fee0c93b14bc402363afea  SOBOAFP12A8C131F36             7

                           title                              release  \
0               You And Me Jesus                 Tribute To Jake Hess
1  Harder Better Faster Stronger                            Discovery
2                       Uprising                             Uprising
3         Breakfast At Tiffany's                                 Home
4          Lucky (Album Version)  We Sing. We Dance. We Steal Things.

                   artist_name  year
0                    Jake Hess  2004
1                    Daft Punk  2007
2                         Muse     0
3          Deep Blue Something  1993
4  Jason Mraz & Colbie Caillat     0

In [78]:

# save for reuse later
triplet_dataset_sub_song_merged.to_csv('./data/triplet_dataset_sub_song_merged.csv',encoding='utf-8', index=False)

The most popular songs

In [28]:

popular_songs = triplet_dataset_sub_song_merged[['title','listen_count']].groupby('title').sum().reset_index()
popular_songs_top_20 = popular_songs.sort_values('listen_count', ascending=False).head(n=20)
 
objects = (list(popular_songs_top_20['title']))
y_pos = np.arange(len(objects))
performance = list(popular_songs_top_20['listen_count'])

plt.figure(figsize=(16,8)) 
plt.bar(y_pos, performance, align='center', alpha=0.5)
plt.xticks(y_pos, objects, rotation='vertical',fontsize=12)
plt.ylabel('Play count')
plt.title('Most popular songs')
 
plt.show()

The most popular albums

In [26]:

popular_release = triplet_dataset_sub_song_merged[['release','listen_count']].groupby('release').sum().reset_index()
popular_release_top_20 = popular_release.sort_values('listen_count', ascending=False).head(n=20)

objects = (list(popular_release_top_20['release']))
y_pos = np.arange(len(objects))
performance = list(popular_release_top_20['listen_count'])
 
plt.figure(figsize=(16,8)) 
plt.bar(y_pos, performance, align='center', alpha=0.5)
plt.xticks(y_pos, objects, rotation='vertical',fontsize=12)
plt.ylabel('Play count')
plt.title('Most popular albums')
 
plt.show()

The most popular artists

In [29]:

popular_artist = triplet_dataset_sub_song_merged[['artist_name','listen_count']].groupby('artist_name').sum().reset_index()
popular_artist_top_20 = popular_artist.sort_values('listen_count', ascending=False).head(n=20)

objects = (list(popular_artist_top_20['artist_name']))
y_pos = np.arange(len(objects))
performance = list(popular_artist_top_20['listen_count'])
 
plt.figure(figsize=(16,8)) 
plt.bar(y_pos, performance, align='center', alpha=0.5)
plt.xticks(y_pos, objects, rotation='vertical',fontsize=12)
plt.ylabel('Play count')
plt.title('Most popular artists')
 
plt.show()

If you know this music, though, you may notice something odd: although Coldplay is the most popular band, none of their singles shows up among the hottest songs.
Look closer and you'll find that their plays are spread very evenly across their catalog, so their total puts them first overall even though no single track makes the top 20.
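
You can check this directly with a quick sketch over the merged frame (not in the original):

# Coldplay's per-song totals: a large overall volume spread across many tracks
coldplay = triplet_dataset_sub_song_merged[triplet_dataset_sub_song_merged.artist_name == 'Coldplay']
print(coldplay.groupby('title')['listen_count'].sum().sort_values(ascending=False).head(10))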

Distribution of distinct songs per user

In [30]:

# Note that we use `count` rather than `sum`, so the result is the number of distinct songs each user has played
user_song_count_distribution = triplet_dataset_sub_song_merged[['user','title']].groupby('user').count().reset_index().sort_values(by='title', ascending=False)
user_song_count_distribution.title.describe()

Out[30]:

count    99996.000000
mean       107.756310
std         79.737279
min          1.000000
25%         53.000000
50%         89.000000
75%        141.000000
max       1189.000000
Name: title, dtype: float64

In [34]:

x = user_song_count_distribution.title
plt.figure(figsize=(12,6))
n, bins, patches = plt.hist(x, 50, facecolor='green', alpha=0.75)
plt.xlabel('Number of distinct songs played')
plt.ylabel('Number of users')
plt.grid(True)

There is plenty more visualization we could do on this dataset, such as analyzing plays by release year or the popularity of an artist's albums.
You should by now be well equipped to explore the data with visualizations of your own and dig out more interesting facts.
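
For instance, here is a minimal sketch of the plays-by-release-year idea (in this dataset year == 0 marks an unknown year, so we drop those rows):

# Total listen counts per release year, excluding songs with unknown year
plays_by_year = triplet_dataset_sub_song_merged[
    triplet_dataset_sub_song_merged.year > 0].groupby('year')['listen_count'].sum()

plt.figure(figsize=(12,6))
plt.plot(plays_by_year.index, plays_by_year.values)
plt.xlabel('Release year')
plt.ylabel('Play count')
plt.show()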

Recommendation engines

What a recommendation engine needs to do is obvious by now: recommend!
There are many ways to do it; the three most commonly discussed are:

  1. User-based recommendation engines
    Here the user is the most important entity. The basic logic is to find similarities between users and use them as the basis for recommendations.
  2. Content-based recommendation engines
    Here, naturally, the content is the most important entity; in our case, the songs. The algorithm extracts content features, builds item-to-item similarities from them, and recommends on top of those similarities.
  3. Hybrid recommendation engines
    These combine user-based and content-based signals; collaborative filtering techniques often serve as one of the ingredients.

The code below adapts code from https://github.com/llSourcell.

Popularity-based recommendation engine

This is the easiest kind of engine to build. Its logic is plain: if many people like something, recommending it to more people usually won't turn out too badly.

In [36]:

import Recommenders as Recommenders                       # adapted from https://github.com/llSourcell
from sklearn.model_selection import train_test_split

In [41]:

train_data, test_data = train_test_split(triplet_dataset_sub_song_merged, test_size = 0.40, random_state=0)

In [42]:

train_data.head()

Out[42]:

                                             user                song  listen_count  \
8742296  8272a3530646a31ef5e49ea894f928d0d6b9b31b  SOBTVDE12AF72A3DE5             1
4911823  74d54aded8585b89ef5e3d86f73bf4ce15a46e44  SOBBCWG12AF72AB9CB             1
5503975  a85cbab8153c5d9ef3dc40496602f2f6aa500acb  SOWYYUQ12A6701D68D             3
7775708  6d24ea4af5d394408f2dbcc977bbb29d356e000d  SOXNFHG12A8C135C55             2
3343780  3931fe199c4c42920ed84d72f57196d6c6046878  SOUGACV12A6D4F84E0             1

                      title             release      artist_name  year
8742296  Wish You Were Here        Morning View          Incubus  2001
4911823            Brothers      One Life Stand         Hot Chip  2010
5503975        It's My Life               Crush         Bon Jovi  2000
7775708                Drop  Labcabincalifornia     The Pharcyde  1995
3343780           Mysteries     Show Your Bones  Yeah Yeah Yeahs  2006

In [43]:

def create_popularity_recommendation(train_data, user_id, item_id):
    # Score each item by the number of distinct user interactions it received
    train_data_grouped = train_data.groupby([item_id]).agg({user_id: 'count'}).reset_index()
    train_data_grouped.rename(columns = {user_id: 'score'},inplace=True)
    
    # Sort items by score (descending), breaking ties by item id
    train_data_sort = train_data_grouped.sort_values(['score', item_id], ascending = [0,1])
    
    # Assign ranks 1..n in score order
    train_data_sort['Rank'] = train_data_sort['score'].rank(ascending=0, method='first')
        
    # Everyone gets the same top-20 list: that's the nature of a popularity recommender
    popularity_recommendations = train_data_sort.head(20)
    return popularity_recommendations

In [44]:

recommendations = create_popularity_recommendation(triplet_dataset_sub_song_merged,'user','title')
recommendations

Out[44]:

                                                   title  score  Rank
19580                                      Sehr kosmisch  18628   1.0
5780                      Dog Days Are Over (Radio Edit)  17638   2.0
27314                                     You're The One  16083   3.0
19542                                            Secrets  15136   4.0
18636                                            Revelry  14943   5.0
25070                                               Undo  14681   6.0
7531                                           Fireflies  13084   7.0
9641                                    Hey_ Soul Sister  12996   8.0
25216                                       Use Somebody  12791   9.0
9922   Horn Concerto No. 4 in E flat K495: II. Romanc...  12343  10.0
24291                                           Tive Sim  11829  11.0
3629                                              Canada  11592  12.0
23468                                      The Scientist  11538  13.0
4194                                              Clocks  11360  14.0
12136                                         Just Dance  11061  15.0
26974                                             Yellow  10922  16.0
16438                                                OMG  10818  17.0
9845                                                Home  10513  18.0
3296                                         Bulletproof  10381  19.0
4760                                    Creep (Explicit)  10242  20.0

Recommendation based on song similarity

We've just built the simplest possible popularity recommender. Now let's develop something slightly more sophisticated: an algorithm based on computing song-to-song similarity. The similarity we use is also simple, the Jaccard index over the two songs' listener sets:

similarity_{ij}=\frac{|users_i \cap users_j|}{|users_i \cup users_j|}

To recommend songs to user k, we then:

  1. Find the songs user k has listened to
  2. Compute, for each of those songs, its similarity to every song in the catalog (see the sketch below)
  3. Rank candidates by similarity and recommend the most similar songs

Step 2 is clearly the compute-intensive part: as the catalog grows, the amount of computation explodes. So we shrink the song pool once more:
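
To make step 2 concrete, here is a minimal sketch of the similarity for a single pair of songs (the function name is ours, for illustration; the Recommenders module computes the whole co-occurrence matrix at once):

def jaccard_similarity(song_i, song_j, df):
    # Sets of users who have listened to each song
    users_i = set(df[df.song == song_i].user)
    users_j = set(df[df.song == song_j].user)
    union = users_i | users_j
    # Jaccard index: shared listeners over combined listeners
    return float(len(users_i & users_j)) / len(union) if union else 0.0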

In [45]:

song_count_subset = song_count_df.head(n=5000) # keep the 5,000 most popular songs
user_subset = list(play_count_subset.user)
song_subset = list(song_count_subset.song)
triplet_dataset_sub_song_merged_sub = triplet_dataset_sub_song_merged[triplet_dataset_sub_song_merged.song.isin(song_subset)]

In [46]:

triplet_dataset_sub_song_merged_sub.head()

Out[46]:

                                       user                song  listen_count  \
0  d6589314c0a9bcbca4fee0c93b14bc402363afea  SOADQPP12A67020C82            12
1  d6589314c0a9bcbca4fee0c93b14bc402363afea  SOAFTRR12AF72A8D4D             1
2  d6589314c0a9bcbca4fee0c93b14bc402363afea  SOANQFY12AB0183239             1
3  d6589314c0a9bcbca4fee0c93b14bc402363afea  SOAYATB12A6701FD50             1
4  d6589314c0a9bcbca4fee0c93b14bc402363afea  SOBOAFP12A8C131F36             7

                           title                              release  \
0               You And Me Jesus                 Tribute To Jake Hess
1  Harder Better Faster Stronger                            Discovery
2                       Uprising                             Uprising
3         Breakfast At Tiffany's                                 Home
4          Lucky (Album Version)  We Sing. We Dance. We Steal Things.

                   artist_name  year
0                    Jake Hess  2004
1                    Daft Punk  2007
2                         Muse     0
3          Deep Blue Something  1993
4  Jason Mraz & Colbie Caillat     0

In [47]:

train_data, test_data = train_test_split(triplet_dataset_sub_song_merged_sub, test_size = 0.30, random_state=0)
is_model = Recommenders.item_similarity_recommender_py()
is_model.create(train_data, 'user', 'title')

In [45]:

# Recommend for one user; even with just 5,000 songs this is substantial computation, roughly 1 hour
user_id = list(train_data.user)[7]
user_items = is_model.get_user_items(user_id)
is_model.recommend(user_id)
No. of unique songs for the user: 82
no. of unique songs in the training set: 4879
Non zero values in cooccurence_matrix :378241

Out[45]:

                                    user_id                                               song     score  rank
0  8ffa9a13c6fa5a3c04a95c449b148183dd51ebb7                                               Halo  0.046176     1
1  8ffa9a13c6fa5a3c04a95c449b148183dd51ebb7                                       Use Somebody  0.045396     2
2  8ffa9a13c6fa5a3c04a95c449b148183dd51ebb7                                            Secrets  0.043963     3
3  8ffa9a13c6fa5a3c04a95c449b148183dd51ebb7                                    I Kissed A Girl  0.043809     4
4  8ffa9a13c6fa5a3c04a95c449b148183dd51ebb7                                           Marry Me  0.043104     5
5  8ffa9a13c6fa5a3c04a95c449b148183dd51ebb7                 The Only Exception (Album Version)  0.042511     6
6  8ffa9a13c6fa5a3c04a95c449b148183dd51ebb7                                          Fireflies  0.042496     7
7  8ffa9a13c6fa5a3c04a95c449b148183dd51ebb7   Billionaire [feat. Bruno Mars] (Explicit Albu...  0.042447     8
8  8ffa9a13c6fa5a3c04a95c449b148183dd51ebb7                                     Drop The World  0.042319     9
9  8ffa9a13c6fa5a3c04a95c449b148183dd51ebb7                                             Clocks  0.042271    10

Note that we used only a song's listeners as its features; we didn't use any attributes of the songs themselves, although all of them could feed into a song-similarity measure.
In real industrial settings, similarity measures typically combine a very large and varied set of features.

Matrix-factorization-based recommendation engine

We solve iteratively for a content-feature matrix X and a matrix \Theta of user interest in those features.

                       Zhang San (1)  Li Si (2)  Wang Er (3)  Ma Zi (4)
Titanic                            5          5            0          0
Gone with the Wind                 5          ?            ?          0
Roman Holiday                      ?          4            0          ?
Infernal Affairs                   0          0            5          4
The Lord of the Rings              0          0            5          ?

Y=\begin{bmatrix}5 & 5 & 0 & 0 \\ 5 & ? & ? & 0 \\ ? & 4 & 0 & ? \\ 0 & 0 & 5 & 4 \\ 0 & 0 & 5 & ?\end{bmatrix}

The predicted ratings are:

\begin{bmatrix}(\theta^{(1)})^T(x^{(1)}) & (\theta^{(2)})^T(x^{(1)}) & \cdots & (\theta^{(n_u)})^T(x^{(1)}) \\ (\theta^{(1)})^T(x^{(2)}) & (\theta^{(2)})^T(x^{(2)}) & \cdots & (\theta^{(n_u)})^T(x^{(2)}) \\ \vdots & \vdots & \ddots & \vdots \\ (\theta^{(1)})^T(x^{(n_m)}) & (\theta^{(2)})^T(x^{(n_m)}) & \cdots & (\theta^{(n_u)})^T(x^{(n_m)}) \end{bmatrix}

Hence, Y = X\Theta^T.

Given this identity, we could use linear algebra to solve for X and \Theta directly instead of iteratively. That said, given the computational complexity of matrix decomposition, in practice the iterative solution discussed in the theory lectures is usually preferred.

As an extension, let's try the direct matrix-decomposition route anyway. The only factorization we know so far is the SVD we met briefly while using PCA for dimensionality reduction. If you recall how we used the top K entries of the S matrix to pick the most important projection directions, you can likewise read those top K entries as the most important latent features. So we can build our two factors from the SVD. The basic steps are:

  1. Compute the SVD of the user play matrix, obtaining the U, S, V matrices
  2. Keep the top K (diagonal) entries of S
  3. Take the square root of S_k to get S_k^{1/2}
  4. Use U*S_k^{1/2} as the user-preference matrix and S_k^{1/2}*V as the content-feature matrix (see the identity below)
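
Putting the steps together: because singular values are non-negative, S_k = S_k^{1/2} S_k^{1/2}, so the rank-K reconstruction factors exactly into the two matrices we want (users sit on the rows of our play matrix here, so the user-preference factor comes first):

Y \approx U_k S_k V_k^T = (U_k S_k^{1/2})(S_k^{1/2} V_k^T)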

Because of memory constraints, it is best to "Restart" the kernel before running the code below.

In [2]:

import pandas as pd
import numpy as np

triplet_dataset_sub_song_merged = pd.read_csv('./data/triplet_dataset_sub_song_merged.csv', encoding='utf-8')

In [3]:

# We have no explicit ratings, only play records, so we use each song's share of a user's total plays as the rating
triplet_dataset_sub_song_merged_sum_df = triplet_dataset_sub_song_merged[['user','listen_count']].groupby('user').sum().reset_index()
triplet_dataset_sub_song_merged_sum_df.rename(columns={'listen_count':'total_listen_count'},inplace=True)
triplet_dataset_sub_song_merged = pd.merge(triplet_dataset_sub_song_merged,triplet_dataset_sub_song_merged_sum_df)
triplet_dataset_sub_song_merged['fractional_play_count'] = triplet_dataset_sub_song_merged['listen_count']/triplet_dataset_sub_song_merged['total_listen_count']

In [3]:

triplet_dataset_sub_song_merged[triplet_dataset_sub_song_merged.user =='d6589314c0a9bcbca4fee0c93b14bc402363afea'][['user','song','listen_count','fractional_play_count']].head()

Out[3]:

                                       user                song  listen_count  fractional_play_count
0  d6589314c0a9bcbca4fee0c93b14bc402363afea  SOADQPP12A67020C82            12               0.036474
1  d6589314c0a9bcbca4fee0c93b14bc402363afea  SOAFTRR12AF72A8D4D             1               0.003040
2  d6589314c0a9bcbca4fee0c93b14bc402363afea  SOANQFY12AB0183239             1               0.003040
3  d6589314c0a9bcbca4fee0c93b14bc402363afea  SOAYATB12A6701FD50             1               0.003040
4  d6589314c0a9bcbca4fee0c93b14bc402363afea  SOBOAFP12A8C131F36             7               0.021277

In [4]:

# build the user-song "rating" matrix
from scipy.sparse import coo_matrix

small_set = triplet_dataset_sub_song_merged
user_codes = small_set.user.drop_duplicates().reset_index()
song_codes = small_set.song.drop_duplicates().reset_index()
user_codes.rename(columns={'index':'user_index'}, inplace=True)
song_codes.rename(columns={'index':'song_index'}, inplace=True)
song_codes['so_index_value'] = list(song_codes.index)
user_codes['us_index_value'] = list(user_codes.index)
small_set = pd.merge(small_set,song_codes,how='left')
small_set = pd.merge(small_set,user_codes,how='left')
mat_candidate = small_set[['us_index_value','so_index_value','fractional_play_count']]
data_array = mat_candidate.fractional_play_count.values
row_array = mat_candidate.us_index_value.values
col_array = mat_candidate.so_index_value.values

data_sparse = coo_matrix((data_array, (row_array, col_array)),dtype=float)

In [5]:

data_sparse

Out[5]:

<99996x30000 sparse matrix of type '<type 'numpy.float64'>'
	with 10775200 stored elements in COOrdinate format>

In [6]:

user_codes[user_codes.user =='2a2f776cbac6df64d6cb505e7e834e01684673b6']

Out[6]:

       user_index                                      user  us_index_value
27514     2981481  2a2f776cbac6df64d6cb505e7e834e01684673b6           27514

In [7]:

import math as mt
from scipy.sparse.linalg import svds
from scipy.sparse import csc_matrix

In [8]:

def compute_svd(urm, K):
    # Truncated SVD of the user rating matrix, keeping the top K singular values
    U, s, Vt = svds(urm, K)

    dim = (len(s), len(s))
    S = np.zeros(dim, dtype=np.float32)
    for i in range(0, len(s)):
        S[i,i] = mt.sqrt(s[i]) # square root of the singular value

    U = csc_matrix(U, dtype=np.float32)
    S = csc_matrix(S, dtype=np.float32)
    Vt = csc_matrix(Vt, dtype=np.float32)
    
    return U, S, Vt

def compute_estimated_matrix(urm, U, S, Vt, uTest, K):
    # S*Vt is the content-feature factor; U[user]*S*Vt reconstructs that user's row of ratings
    rightTerm = S*Vt 
    max_recommendation = 250
    # MAX_UID and MAX_PID are globals, set from urm.shape in the next cell
    estimatedRatings = np.zeros(shape=(MAX_UID, MAX_PID), dtype=np.float16)
    recomendRatings = np.zeros(shape=(MAX_UID, max_recommendation), dtype=np.float16)
    for userTest in uTest:
        prod = U[userTest, :]*rightTerm
        estimatedRatings[userTest, :] = prod.todense()
        # song indices sorted by estimated rating, best first
        recomendRatings[userTest, :] = (-estimatedRatings[userTest, :]).argsort()[:max_recommendation]
    return recomendRatings

In [9]:

K=50
urm = data_sparse
MAX_PID = urm.shape[1]
MAX_UID = urm.shape[0]

U, S, Vt = compute_svd(urm, K)

In [10]:

uTest = [4,5,6,7,8,873,23]

uTest_recommended_items = compute_estimated_matrix(urm, U, S, Vt, uTest, K)

In [11]:

for user in uTest:
    print u"Recommendation for user with user id {}". format(user)
    rank_value = 1
    for i in uTest_recommended_items[user,0:10]:
        song_details = small_set[small_set.so_index_value == i].drop_duplicates('so_index_value')[['title','artist_name']]
        print u"The number {} recommended song is {} BY {}".format(rank_value, list(song_details['title'])[0],list(song_details['artist_name'])[0])
        rank_value+=1
Recommendation for user with user id 4
The number 1 recommended song is Fireflies BY Charttraxx Karaoke
The number 2 recommended song is Hey_ Soul Sister BY Train
The number 3 recommended song is OMG BY Usher featuring will.i.am
The number 4 recommended song is Lucky (Album Version) BY Jason Mraz & Colbie Caillat
The number 5 recommended song is Vanilla Twilight BY Owl City
The number 6 recommended song is Crumpshit BY Philippe Rochard
The number 7 recommended song is Billionaire [feat. Bruno Mars]  (Explicit Album Version) BY Travie McCoy
The number 8 recommended song is Love Story BY Taylor Swift
The number 9 recommended song is TULENLIEKKI BY M.A. Numminen
The number 10 recommended song is Use Somebody BY Kings Of Leon
Recommendation for user with user id 5
The number 1 recommended song is Sehr kosmisch BY Harmonia
The number 2 recommended song is Dog Days Are Over (Radio Edit) BY Florence + The Machine
The number 3 recommended song is Ain't Misbehavin BY Sam Cooke
The number 4 recommended song is Revelry BY Kings Of Leon
The number 5 recommended song is Undo BY Björk
The number 6 recommended song is Cosmic Love BY Florence + The Machine
The number 7 recommended song is Home BY Edward Sharpe & The Magnetic Zeros
The number 8 recommended song is You've Got The Love BY Florence + The Machine
The number 9 recommended song is Bring Me To Life BY Evanescence
The number 10 recommended song is Tighten Up BY The Black Keys
Recommendation for user with user id 6
The number 1 recommended song is Crumpshit BY Philippe Rochard
The number 2 recommended song is Marry Me BY Train
The number 3 recommended song is Hey_ Soul Sister BY Train
The number 4 recommended song is Lucky (Album Version) BY Jason Mraz & Colbie Caillat
The number 5 recommended song is One On One BY the bird and the bee
The number 6 recommended song is I Never Told You BY Colbie Caillat
The number 7 recommended song is Canada BY Five Iron Frenzy
The number 8 recommended song is Fireflies BY Charttraxx Karaoke
The number 9 recommended song is TULENLIEKKI BY M.A. Numminen
The number 10 recommended song is Bring Me To Life BY Evanescence
Recommendation for user with user id 7
The number 1 recommended song is Behind The Sea [Live In Chicago] BY Panic At The Disco
The number 2 recommended song is The City Is At War (Album Version) BY Cobra Starship
The number 3 recommended song is Dead Souls BY Nine Inch Nails
The number 4 recommended song is Una Confusion BY LU
The number 5 recommended song is Home BY Edward Sharpe & The Magnetic Zeros
The number 6 recommended song is Climbing Up The Walls BY Radiohead
The number 7 recommended song is Tighten Up BY The Black Keys
The number 8 recommended song is Tive Sim BY Cartola
The number 9 recommended song is West One (Shine On Me) BY The Ruts
The number 10 recommended song is Cosmic Love BY Florence + The Machine
Recommendation for user with user id 8
The number 1 recommended song is Undo BY Björk
The number 2 recommended song is Canada BY Five Iron Frenzy
The number 3 recommended song is Better To Reign In Hell BY Cradle Of Filth
The number 4 recommended song is Unite (2009 Digital Remaster) BY Beastie Boys
The number 5 recommended song is Behind The Sea [Live In Chicago] BY Panic At The Disco
The number 6 recommended song is Rockin' Around The Christmas Tree BY Brenda Lee
The number 7 recommended song is Tautou BY Brand New
The number 8 recommended song is Revelry BY Kings Of Leon
The number 9 recommended song is 16 Candles BY The Crests
The number 10 recommended song is Catch You Baby (Steve Pitron & Max Sanna Radio Edit) BY Lonnie Gordon
Recommendation for user with user id 873
The number 1 recommended song is The Scientist BY Coldplay
The number 2 recommended song is Yellow BY Coldplay
The number 3 recommended song is Clocks BY Coldplay
The number 4 recommended song is Fix You BY Coldplay
The number 5 recommended song is In My Place BY Coldplay
The number 6 recommended song is Shiver BY Coldplay
The number 7 recommended song is Speed Of Sound BY Coldplay
The number 8 recommended song is Creep (Explicit) BY Radiohead
The number 9 recommended song is Sparks BY Coldplay
The number 10 recommended song is Use Somebody BY Kings Of Leon
Recommendation for user with user id 23
The number 1 recommended song is Garden Of Eden BY Guns N' Roses
The number 2 recommended song is Don't Speak BY John Dahlbäck
The number 3 recommended song is Master Of Puppets BY Metallica
The number 4 recommended song is TULENLIEKKI BY M.A. Numminen
The number 5 recommended song is Bring Me To Life BY Evanescence
The number 6 recommended song is Kryptonite BY 3 Doors Down
The number 7 recommended song is Make Her Say BY Kid Cudi / Kanye West / Common
The number 8 recommended song is Night Village BY Deep Forest
The number 9 recommended song is Better To Reign In Hell BY Cradle Of Filth
The number 10 recommended song is Xanadu BY Olivia Newton-John;Electric Light Orchestra

Open-source recommendation libraries

We've built a simple matrix-factorization recommender here; basic as it is, I hope it gives you a clear picture of the idea.
Of course, Python also has open-source recommendation libraries (a quick scikit-surprise sketch follows the list):

  • scikit-surprise
  • lightfm
  • crab
  • rec_sys
  • ...
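
To give a flavor, here is roughly what our matrix-factorization step looks like in scikit-surprise; a sketch based on its documented API, feeding in the fractional play counts from earlier as implicit ratings:

from surprise import SVD, Dataset, Reader

# Fractional play counts lie in [0, 1], so declare that as the rating scale
reader = Reader(rating_scale=(0, 1))
data = Dataset.load_from_df(
    triplet_dataset_sub_song_merged[['user', 'song', 'fractional_play_count']], reader)

algo = SVD(n_factors=50)              # 50 latent factors, matching our K above
algo.fit(data.build_full_trainset())

# Estimated score for one (user, song) pair seen earlier in the data
print(algo.predict('b80344d063b5ccb3212f76538f3d9e43d87dca9e', 'SOAKIMP12A8C130995').est)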