Building a Song Recommender System with GraphLab Create

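Course overview:

GraphLab Create is a well-known machine learning toolkit with a C++ core. It supports a variety of data structures and data mining algorithms, and building a recommender with it takes only a few steps while giving reasonably good results. This lesson walks through building a song recommender system with GraphLab Create.

Course objectives:

Get to know GraphLab Create
Learn the workflow and analysis techniques for building a recommender system with GraphLab Create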

Loading the music data

In [2]:

import graphlab

song_data = graphlab.SFrame('song_data.gl')
# For loading other data formats, see https://dato.com/learn/userguide/sframe/sframe-intro.html

[INFO] graphlab.cython.cy_server: GraphLab Create v2.1 started. Logging: /tmp/graphlab_server_1472092802.log

Exploratory data analysis

In [3]:

song_data.head()

Out[3]:

user_id	song_id	listen_count	title	artist
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...	SOAKIMP12A8C130995	1	The Cove	Jack Johnson
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...	SOBBMDR12A8C13253B	2	Entre Dos Aguas	Paco De Lucia
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...	SOBXHDL12A81C204C0	1	Stronger	Kanye West
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...	SOBYHAJ12A6701BF1D	1	Constellations	Jack Johnson
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...	SODACBL12A8C13C273	1	Learn To Fly	Foo Fighters
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...	SODDNQT12A6D4F5F7E	5	Apuesta Por El Rock 'N' Roll ...	Héroes del Silencio
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...	SODXRTY12AB0180F3B	1	Paper Gangsta	Lady GaGa
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...	SOFGUAY12AB017B0A8	1	Stacked Actors	Foo Fighters
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...	SOFRQTD12A81C233C0	1	Sehr kosmisch	Harmonia
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...	SOHQWYZ12A6D4FA701	1	Heaven's gonna burn your eyes ...	Thievery Corporation feat. Emiliana Torrini ...

song
The Cove - Jack Johnson
Entre Dos Aguas - Paco De Lucia ...
Stronger - Kanye West
Constellations - Jack Johnson ...
Learn To Fly - Foo Fighters ...
Apuesta Por El Rock 'N' Roll - Héroes del ...
Paper Gangsta - Lady GaGa
Stacked Actors - Foo Fighters ...
Sehr kosmisch - Harmonia
Heaven's gonna burn your eyes - Thievery ...

[10 rows x 6 columns]

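The song data has six columns: user ID (user_id), song ID (song_id), listen count (listen_count), song title (title), artist name (artist), and a combined label (song) made of the title and the artist.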

In [4]:

graphlab.canvas.set_target('ipynb') # set the output target for Canvas views
                                    # with the 'ipynb' target, calling .show() renders an output cell in the IPython Notebook

In [5]:

song_data['song'].show() # show how often each song appears and its share of the rows

In [6]:

len(song_data)

Out[6]:

Counting the users

In [7]:

users = song_data["user_id"].unique()

In [8]:

len(users)

Out[8]:

Creating the song recommenders

In [9]:

train_data, test_data = song_data.random_split(.8, seed=0) # randomly split the rows into a training set (about 80%) and a test set
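A row-wise random split can be pictured roughly as follows. This is a minimal pure-Python sketch, not GraphLab's implementation: the helper `random_split` and its toy input are assumptions made for illustration. It shows two properties the lesson relies on: the split is only approximately 80/20, and a fixed seed makes it reproducible.

```python
import random

def random_split(rows, fraction, seed=0):
    """Send each row to the first part with probability `fraction` (hypothetical helper)."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second

rows = list(range(1000))  # stand-in for the listening records
train, test = random_split(rows, 0.8, seed=0)
train_again, _ = random_split(rows, 0.8, seed=0)

print(len(train), len(test))  # sizes are close to, but not exactly, 800 and 200
print(train == train_again)   # same seed, same split
```

Because the split is made per row rather than per user, the same user can have rows in both parts, which is what allows the test set to check recommendations for users seen during training.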

A simple popularity-based recommender

In [10]:

popularity_model = graphlab.popularity_recommender.create(train_data,
                                                         user_id='user_id',
                                                         item_id='song')

Recsys training: model = popularity

Warning: Ignoring columns song_id, listen_count, title, artist;

    To use one of these as a target column, set target =

    and use a method that allows the use of a target.

Preparing data set.

    Data has 893580 observations with 66085 users and 9952 items.

    Data prepared in: 1.06654s

893580 observations to process; with 9952 unique items.

Making recommendations with the popularity model

In [11]:

popularity_model.recommend(users=[users[0]]) # ranked list of recommended songs for this user; the top 10 are returned by default

Out[11]:

user_id	song	score	rank
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...	Sehr kosmisch - Harmonia	4754.0	1
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...	Undo - Björk	4227.0	2
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...	You're The One - Dwight Yoakam ...	3781.0	3
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...	Dog Days Are Over (Radio Edit) - Florence + The ...	3633.0	4
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...	Revelry - Kings Of Leon	3527.0	5
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...	Horn Concerto No. 4 in E flat K495: II. Romance ...	3161.0	6
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...	Secrets - OneRepublic	3148.0	7
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...	Hey_ Soul Sister - Train	2538.0	8
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...	Fireflies - Charttraxx Karaoke ...	2532.0	9
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...	Tive Sim - Cartola	2521.0	10

[10 rows x 4 columns]

In [12]:

popularity_model.recommend(users=[users[1]])

Out[12]:

user_id	song	score	rank
c067c22072a17d33310d7223d 7b79f819e48cf42 ...	Sehr kosmisch - Harmonia	4754.0	1
c067c22072a17d33310d7223d 7b79f819e48cf42 ...	Undo - Björk	4227.0	2
c067c22072a17d33310d7223d 7b79f819e48cf42 ...	You're The One - Dwight Yoakam ...	3781.0	3
c067c22072a17d33310d7223d 7b79f819e48cf42 ...	Dog Days Are Over (Radio Edit) - Florence + The ...	3633.0	4
c067c22072a17d33310d7223d 7b79f819e48cf42 ...	Revelry - Kings Of Leon	3527.0	5
c067c22072a17d33310d7223d 7b79f819e48cf42 ...	Horn Concerto No. 4 in E flat K495: II. Romance ...	3161.0	6
c067c22072a17d33310d7223d 7b79f819e48cf42 ...	Secrets - OneRepublic	3148.0	7
c067c22072a17d33310d7223d 7b79f819e48cf42 ...	Hey_ Soul Sister - Train	2538.0	8
c067c22072a17d33310d7223d 7b79f819e48cf42 ...	Fireflies - Charttraxx Karaoke ...	2532.0	9
c067c22072a17d33310d7223d 7b79f819e48cf42 ...	Tive Sim - Cartola	2521.0	10

[10 rows x 4 columns]
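Both users above receive the same ten songs with the same scores, which is what a popularity model is expected to do: it ranks songs globally and ignores individual taste. The sketch below illustrates the idea in plain Python. It assumes that the score is the number of training observations per song and that songs a user already has are skipped; the function `popularity_recommend` and the toy data are hypothetical and are not GraphLab's code.

```python
from collections import Counter

def popularity_recommend(observations, user, k=10):
    """Rank items by observation count, skipping items the user already has."""
    counts = Counter(item for _, item in observations)
    seen = {item for u, item in observations if u == user}
    # Sort by descending count, breaking ties by item name for determinism.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [(item, count) for item, count in ranked if item not in seen][:k]

# Toy (user, song) pairs; all names are made up for illustration.
observations = [
    ("u1", "A"), ("u2", "A"), ("u3", "A"),
    ("u1", "B"), ("u2", "B"),
    ("u3", "C"),
    ("u4", "D"),
]
recs_u4 = popularity_recommend(observations, "u4", k=2)
recs_u3 = popularity_recommend(observations, "u3", k=2)
print(recs_u4)  # [('A', 3), ('B', 2)]
print(recs_u3)  # [('B', 2), ('D', 1)]
```

In this sketch two users get different lists only when one of them already has a globally popular song.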

Building a personalized song recommender

In [13]:

personalized_model = graphlab.item_similarity_recommender.create(train_data,
                                                                 user_id='user_id',
                                                                 item_id='song')

Recsys training: model = item_similarity

Warning: Ignoring columns song_id, listen_count, title, artist;

    To use one of these as a target column, set target =

    and use a method that allows the use of a target.

Preparing data set.

    Data has 893580 observations with 66085 users and 9952 items.

    Data prepared in: 1.09829s

Training model from provided data.

Gathering per-item and per-user statistics.

+--------------------------------+------------+
| Elapsed Time (Item Statistics) | % Complete |
+--------------------------------+------------+
| 10.313ms                       | 3          |
| 70.996ms                       | 100        |
+--------------------------------+------------+

Setting up lookup tables.

Processing data in one pass using dense lookup tables.

+-------------------------------------+------------------+-----------------+
| Elapsed Time (Constructing Lookups) | Total % Complete | Items Processed |
+-------------------------------------+------------------+-----------------+
| 458.742ms                           | 0                | 0               |
| 1.10s                               | 100              | 9952            |
+-------------------------------------+------------------+-----------------+

Finalizing lookup tables.

Generating candidate set for working with new users.

Finished training in 2.23668s

Recommending songs with the personalized model

In [14]:

personalized_model.recommend(users=[users[0]])

Out[14]:

user_id	song	score	rank
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...	Riot In Cell Block Number Nine - Dr Feelgood ...	0.0374999940395	1
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...	Sei Lá Mangueira - Elizeth Cardoso ...	0.0331632643938	2
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...	The Stallion - Ween	0.0322580635548	3
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...	Rain - Subhumans	0.0314159244299	4
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...	West One (Shine On Me) - The Ruts ...	0.0306771993637	5
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...	Back Against The Wall - Cage The Elephant ...	0.0301204770803	6
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...	Life Less Frightening - Rise Against ...	0.0284431129694	7
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...	A Beggar On A Beach Of Gold - Mike And The ...	0.0230024904013	8
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...	Audience Of One - Rise Against ...	0.0193938463926	9
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...	Blame It On The Boogie - The Jacksons ...	0.0189873427153	10

[10 rows x 4 columns]

In [15]:

personalized_model.recommend(users=[users[1]])

Out[15]:

user_id	song	score	rank
c067c22072a17d33310d7223d 7b79f819e48cf42 ...	Grind With Me (Explicit Version) - Pretty Ricky ...	0.0459424376488	1
c067c22072a17d33310d7223d 7b79f819e48cf42 ...	There Goes My Baby - Usher ...	0.0331920742989	2
c067c22072a17d33310d7223d 7b79f819e48cf42 ...	Panty Droppa [Intro] (Album Version) - Trey ...	0.0318566203117	3
c067c22072a17d33310d7223d 7b79f819e48cf42 ...	Nobody (Featuring Athena Cage) (LP Version) - ...	0.0278467655182	4
c067c22072a17d33310d7223d 7b79f819e48cf42 ...	Youth Against Fascism - Sonic Youth ...	0.0262914180756	5
c067c22072a17d33310d7223d 7b79f819e48cf42 ...	Nice & Slow - Usher	0.0239639401436	6
c067c22072a17d33310d7223d 7b79f819e48cf42 ...	Making Love (Into The Night) - Usher ...	0.0238176941872	7
c067c22072a17d33310d7223d 7b79f819e48cf42 ...	Naked - Marques Houston	0.0228925704956	8
c067c22072a17d33310d7223d 7b79f819e48cf42 ...	I.nner Indulgence - DESTRUCTION ...	0.0220767498016	9
c067c22072a17d33310d7223d 7b79f819e48cf42 ...	Love Lost (Album Version) - Trey Songz ...	0.0204497694969	10

[10 rows x 4 columns]

The model can also recommend similar songs

In [16]:

personalized_model.get_similar_items(['With Or Without You - U2'])

Out[16]:

song	similar	score	rank
With Or Without You - U2	I Still Haven't Found What I'm Looking For ...	0.042857170105	1
With Or Without You - U2	Hold Me_ Thrill Me_ Kiss Me_ Kill Me - U2 ...	0.0337349176407	2
With Or Without You - U2	Window In The Skies - U2	0.0328358411789	3
With Or Without You - U2	Vertigo - U2	0.0300751924515	4
With Or Without You - U2	Sunday Bloody Sunday - U2	0.0271317958832	5
With Or Without You - U2	Bad - U2	0.0251798629761	6
With Or Without You - U2	A Day Without Me - U2	0.0237154364586	7
With Or Without You - U2	Another Time Another Place - U2 ...	0.0203251838684	8
With Or Without You - U2	Walk On - U2	0.0202020406723	9
With Or Without You - U2	Get On Your Boots - U2	0.0196850299835	10

[10 rows x 4 columns]

In [17]:

personalized_model.get_similar_items(['Chan Chan (Live) - Buena Vista Social Club'])

Out[17]:

song	similar	score	rank
Chan Chan (Live) - Buena Vista Social Club ...	Murmullo - Buena Vista Social Club ...	0.188118815422	1
Chan Chan (Live) - Buena Vista Social Club ...	La Bayamesa - Buena Vista Social Club ...	0.18719214201	2
Chan Chan (Live) - Buena Vista Social Club ...	Amor de Loca Juventud - Buena Vista Social Club ...	0.184834122658	3
Chan Chan (Live) - Buena Vista Social Club ...	Diferente - Gotan Project	0.0214592218399	4
Chan Chan (Live) - Buena Vista Social Club ...	Mistica - Orishas	0.0205761194229	5
Chan Chan (Live) - Buena Vista Social Club ...	Hotel California - Gipsy Kings ...	0.0193049907684	6
Chan Chan (Live) - Buena Vista Social Club ...	Nací Orishas - Orishas	0.0191571116447	7
Chan Chan (Live) - Buena Vista Social Club ...	Le Moulin - Yann Tiersen	0.018796980381	8
Chan Chan (Live) - Buena Vista Social Club ...	Gitana - Willie Colon	0.018796980381	9
Chan Chan (Live) - Buena Vista Social Club ...	Criminal - Gotan Project	0.0187793374062	10

[10 rows x 4 columns]
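One common way to score item-to-item similarity from listening data is the Jaccard index: the number of users who have both songs divided by the number of users who have either. The sketch below assumes that measure for illustration; whether the model above uses exactly this formula depends on its similarity setting, and the function `jaccard_similar_items` and the toy data are hypothetical.

```python
from collections import defaultdict

def jaccard_similar_items(observations, item, k=3):
    """Return the k items whose listener sets overlap most with `item`."""
    users_by_item = defaultdict(set)
    for user, it in observations:
        users_by_item[it].add(user)
    target = users_by_item[item]
    scores = []
    for other, users in users_by_item.items():
        if other == item:
            continue
        union = len(target | users)
        score = len(target & users) / union if union else 0.0
        scores.append((other, score))
    # Highest similarity first; ties broken by item name.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:k]

# Toy (user, song) pairs; all names are made up for illustration.
observations = [
    ("u1", "X"), ("u2", "X"), ("u3", "X"),
    ("u1", "Y"), ("u2", "Y"),
    ("u3", "Z"), ("u4", "Z"),
]
similar = jaccard_similar_items(observations, "X")
print(similar)  # Y shares 2 of 3 listeners with X; Z shares 1 of 4
```

Under this measure, songs by the same artist tend to score high simply because the same listeners play them, which is consistent with the U2 and Buena Vista Social Club lists above.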

Quantitative comparison of the recommenders

In [18]:

%matplotlib inline
model_performance = graphlab.recommender.util.compare_models(test_data,
                                                            [popularity_model, personalized_model],
                                                            user_sample=0.05)
# Compare the recommenders on the test set; user_sample sets the fraction of test users sampled for the evaluation

compare_models: using 2931 users to estimate model performance
PROGRESS: Evaluate model M0

recommendations finished on 1000/2931 queries. users per second: 14813.5

recommendations finished on 2000/2931 queries. users per second: 17580.4

Precision and recall summary statistics by cutoff
+--------+-----------------+------------------+
| cutoff |  mean_precision |   mean_recall    |
+--------+-----------------+------------------+
|   1    | 0.0341180484476 | 0.00823292386752 |
|   2    | 0.0317297850563 | 0.0160935708402  |
|   3    | 0.0274081655863 | 0.0204478560031  |
|   4    | 0.0246502900034 | 0.0243574988073  |
|   5    |  0.021494370522 | 0.0265069481676  |
|   6    | 0.0201296485841 | 0.0301672042844  |
|   7    | 0.0190573670615 | 0.0327615421653  |
|   8    | 0.0182105083589 | 0.0354908506521  |
|   9    | 0.0171348421093 | 0.0376075481628  |
|   10   |  0.016308427158 | 0.0397835214749  |
+--------+-----------------+------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M1

recommendations finished on 1000/2931 queries. users per second: 13598.8

recommendations finished on 2000/2931 queries. users per second: 17541.9

Precision and recall summary statistics by cutoff
+--------+-----------------+-----------------+
| cutoff |  mean_precision |   mean_recall   |
+--------+-----------------+-----------------+
|   1    |  0.191743432276 | 0.0585482156874 |
|   2    |  0.162060730126 | 0.0951129247189 |
|   3    |  0.141703627886 |  0.120469458609 |
|   4    |  0.125213237803 |  0.139421451307 |
|   5    |  0.113613101331 |  0.155305732877 |
|   6    |  0.103093369726 |  0.167448943357 |
|   7    | 0.0959691962763 |  0.181352449533 |
|   8    | 0.0896451722961 |  0.193948807268 |
|   9    | 0.0848023048637 |  0.205693986157 |
|   10   | 0.0805868304333 |  0.216014609656 |
+--------+-----------------+-----------------+
[10 rows x 3 columns]
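In the tables above, the item-similarity model (M1) reaches a mean precision of about 0.081 and a mean recall of about 0.216 at cutoff 10, against about 0.016 and 0.040 for the popularity model (M0). The sketch below shows how precision and recall at a cutoff k are commonly computed for a single user; the averages in the tables would then be taken over users. It assumes precision divides the hits by k and recall divides them by the number of test-set songs; the function name and toy data are hypothetical.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the top-k recommendations for one user."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

recommended = ["A", "B", "C", "D", "E"]  # ranked list produced by a model
relevant = {"B", "E", "F"}               # songs the user has in the test set

p_at_2, r_at_2 = precision_recall_at_k(recommended, relevant, 2)
p_at_5, r_at_5 = precision_recall_at_k(recommended, relevant, 5)
print(p_at_2, r_at_2)  # 1 hit in the top 2
print(p_at_5, r_at_5)  # 2 hits in the top 5
```

As k grows, recall can only stay flat or rise while precision usually falls, which matches the trend in both tables.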

Plotting the precision-recall curves of the models

In [19]:

import matplotlib.pyplot as plt
%matplotlib inline

fig, ax = plt.subplots()

# One overall precision-recall table per model, in the order passed to compare_models
pr_curves_by_model = [res['precision_recall_overall'] for res in model_performance]

pr_curve = pr_curves_by_model[0].sort('recall')
ax.plot(list(pr_curve['recall']), list(pr_curve['precision']),
        'blue', label='M0 (popularity)')

pr_curve = pr_curves_by_model[1].sort('recall')
ax.plot(list(pr_curve['recall']), list(pr_curve['precision']),
        'green', label='M1 (item similarity)')

ax.set_title('Precision-Recall Averaged Over Users')
ax.set_xlabel('Recall')
ax.set_ylabel('Precision')
ax.legend()

plt.show() # with the inline backend, fig.show() only raises a "non-GUI backend" UserWarning

In [20]:

dir(song_data) # list all attributes of the SFrame song_data

Out[20]:

['_SFrame__construct_ctr',
 '_SFrame__dropna_errchk',
 '_SFrame__get_graphlabutil_reference_on_spark_unity_jar',
 '__class__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__doc__',
 '__eq__',
 '__format__',
 '__get_column_description__',
 '__get_pretty_tables__',
 '__get_staging_dir__',
 '__getattribute__',
 '__getitem__',
 '__has_size__',
 '__hash__',
 '__init__',
 '__is_materialized__',
 '__iter__',
 '__len__',
 '__materialize__',
 '__module__',
 '__ne__',
 '__new__',
 '__nonzero__',
 '__proxy__',
 '__query_plan_str__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__slots__',
 '__str__',
 '__str_impl__',
 '__subclasshook__',
 '_cache',
 '_group',
 '_imagecols_to_stringcols',
 '_infer_column_types_from_lines',
 '_proxy',
 '_read_csv_impl',
 '_repr_html_',
 '_row_selector',
 '_save_reference',
 'add_column',
 'add_columns',
 'add_row_number',
 'append',
 'apply',
 'column_names',
 'column_types',
 'copy',
 'dropna',
 'dropna_split',
 'dtype',
 'export_csv',
 'export_json',
 'fillna',
 'filter_by',
 'flat_map',
 'from_odbc',
 'from_rdd',
 'from_sql',
 'groupby',
 'head',
 'is_materialized',
 'join',
 'materialize',
 'num_cols',
 'num_columns',
 'num_rows',
 'pack_columns',
 'print_rows',
 'random_split',
 'read_csv',
 'read_csv_with_errors',
 'read_json',
 'remove_column',
 'remove_columns',
 'rename',
 'sample',
 'save',
 'select_column',
 'select_columns',
 'shape',
 'show',
 'sort',
 'split_datetime',
 'stack',
 'swap_columns',
 'tail',
 'to_dataframe',
 'to_numpy',
 'to_odbc',
 'to_rdd',
 'to_spark_dataframe',
 'to_sql',
 'topk',
 'unique',
 'unpack',
 'unstack']

Building the user set for a specific artist

In [21]:

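users_foo = song_data[song_data["artist"]=="Foo Fighters"]
# Taking the band "Foo Fighters" as an example, keep the rows whose "artist" is "Foo Fighters" in users_foo,
# i.e. build the "Foo Fighters" user set users_foo; the other artists below are handled the same way.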

In [22]:

users_foo = users_foo.unique()

In [23]:

users_foo

Out[23]:

artist	listen_count	song	song_id	title
Foo Fighters	6	Next Year - Foo Fighters	SOYYIZT12A8C1408CA	Next Year
Foo Fighters	8	Breakout - Foo Fighters	SOMSQJY12A8C138539	Breakout
Foo Fighters	1	Exhausted - Foo Fighters	SONMPJJ12AB0183AF8	Exhausted
Foo Fighters	1	Next Year - Foo Fighters	SOYYIZT12A8C1408CA	Next Year
Foo Fighters	1	The Pretender - Foo Fighters ...	SOQLUTQ12A8AE48037	The Pretender
Foo Fighters	13	Everlong - Foo Fighters	SOXVVSM12A8C142224	Everlong
Foo Fighters	1	Exhausted - Foo Fighters	SONMPJJ12AB0183AF8	Exhausted
Foo Fighters	2	Virginia Moon - Foo Fighters ...	SOKQTHF12B0B80B306	Virginia Moon
Foo Fighters	1	Low - Foo Fighters	SOXGFMC12A8C1386EC	Low
Foo Fighters	4	Everlong - Foo Fighters	SOXVVSM12A8C142224	Everlong

user_id
7e3f6e77217967868c52338eb dc793b33ba28eb9 ...
6952b87aedaaddb57f99c2207 d0fac06ca1bd86f ...
dfb04c4a166ddd0e53cafccb9 f0540a7744e2346 ...
843209628fc05b104ccd0d841 1cd41faa58a4d6e ...
d6fe20c5b749f74e43595caf3 f9b61b5bcded1ed ...
932ad377a3bb70d0d32969f49 41f0268d560afc9 ...
7e3f6e77217967868c52338eb dc793b33ba28eb9 ...
d8dde5d48711ad8ae25253b5b 310b6adec906f42 ...
875988a37dd8d37da406d652f d8063a60a238b93 ...
608f53c1ec24eecf1a12e34d5 4a60f4d8a8b7591 ...

[3429 rows x 6 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

In [24]:

users_kanye = song_data[song_data["artist"]=="Kanye West"]

In [25]:

users_kanye = users_kanye.unique()

In [26]:

users_taylor = song_data[song_data["artist"]=="Taylor Swift"]

In [27]:

users_taylor = users_taylor.unique()

In [28]:

users_gaga = song_data[song_data["artist"]=="Lady GaGa"]

In [29]:

users_gaga = users_gaga.unique()

Checking the size of each artist's set (distinct rows at this point, not yet distinct users)

In [30]:

len(users_gaga)

Out[30]:

In [31]:

len(users_foo)

Out[31]:

In [32]:

len(users_taylor)

Out[32]:

In [33]:

len(users_kanye)

Out[33]:

In [34]:

song_data[1]

Out[34]:

{'artist': 'Paco De Lucia',
 'listen_count': 2,
 'song': 'Entre Dos Aguas - Paco De Lucia',
 'song_id': 'SOBBMDR12A8C13253B',
 'title': 'Entre Dos Aguas',
 'user_id': 'b80344d063b5ccb3212f76538f3d9e43d87dca9e'}

Removing duplicate user IDs from each set

In [35]:

users_gaga = users_gaga["user_id"].unique()

After removing duplicate user IDs, check the number of users in each set

In [36]:

len(users_gaga)

Out[36]:

In [37]:

users_kanye = users_kanye["user_id"].unique()
users_foo = users_foo["user_id"].unique()

In [38]:

users_taylor = users_taylor["user_id"].unique()
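The two-step deduplication matters because a user who played several songs by the same artist contributes several rows, so counting distinct rows overstates the number of listeners. The plain-Python sketch below makes the difference concrete; the helper `listeners_of` and the toy rows are hypothetical and only stand in for the SFrame operations used in the cells.

```python
def listeners_of(rows, artist):
    """Return the set of distinct user IDs that have a row for `artist`."""
    return {row["user_id"] for row in rows if row["artist"] == artist}

# Toy rows; the IDs are made up for illustration.
rows = [
    {"user_id": "u1", "artist": "Foo Fighters", "title": "Everlong"},
    {"user_id": "u1", "artist": "Foo Fighters", "title": "Low"},
    {"user_id": "u2", "artist": "Foo Fighters", "title": "Everlong"},
    {"user_id": "u3", "artist": "Kanye West", "title": "Stronger"},
]

foo_rows = [row for row in rows if row["artist"] == "Foo Fighters"]
foo_users = listeners_of(rows, "Foo Fighters")
print(len(foo_rows))   # 3 rows for the artist
print(len(foo_users))  # 2 distinct listeners
```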

In [39]:

len(users_kanye)

Out[39]:

In [40]:

len(users_foo)

Out[40]:

In [41]:

len(users_taylor)

Out[41]:

Total listen counts per artist

In [42]:

listen_counts = song_data.groupby(key_columns='artist', operations={'total_count': graphlab.aggregate.SUM('listen_count')})
# Group by 'artist' and sum 'listen_count' to get each artist's total number of listens, 'total_count'

In [43]:

listen_counts.sort('total_count', ascending=False) # descending order; the printed head is the top 10

Out[43]:

artist	total_count
Kings Of Leon	43218
Dwight Yoakam	40619
Björk	38889
Coldplay	35362
Florence + The Machine	33387
Justin Bieber	29715
Alliance Ethnik	26689
OneRepublic	25754
Train	25402
The Black Keys	22184

[3375 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

In [44]:

listen_counts.sort('total_count', ascending=True) # ascending order; the printed head is the bottom 10

Out[44]:

artist	total_count
William Tabbert	14
Reel Feelings	24
Beyoncé feat. Bun B and Slim Thug ...	26
Boggle Karaoke	30
Diplo	30
harvey summers	31
Nâdiya	36
Jody Bernal	38
Aneta Langerova	38
Kanye West / Talib Kweli / Q-Tip / Common / ...	38

[3375 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
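The group-and-sum step used for the artist totals can be sketched in plain Python as follows. This only illustrates what the aggregation computes, not how SFrame does it; the helper `total_by_key` and the toy rows are hypothetical.

```python
from collections import defaultdict

def total_by_key(rows, key, value):
    """Sum `value` over all rows that share the same `key`."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)

# Toy rows; artist names are placeholders.
rows = [
    {"artist": "Artist A", "listen_count": 2},
    {"artist": "Artist B", "listen_count": 5},
    {"artist": "Artist A", "listen_count": 1},
]

totals = total_by_key(rows, "artist", "listen_count")
ranked = sorted(totals.items(), key=lambda pair: pair[1], reverse=True)
print(ranked)  # [('Artist B', 5), ('Artist A', 3)]
```

Sorting the totals in descending or ascending order then gives the most and least played artists, as in the two tables above.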

Using the personalized model to recommend songs for 10,000 user IDs from the test set

In [45]:

subset_test_users = test_data['user_id'].unique()[0:10000]

In [46]:

subset_recommendations = personalized_model.recommend(subset_test_users,k=1)

recommendations finished on 1000/10000 queries. users per second: 13895.8

recommendations finished on 2000/10000 queries. users per second: 20416.5

recommendations finished on 3000/10000 queries. users per second: 24293.7

recommendations finished on 4000/10000 queries. users per second: 26516.8

recommendations finished on 5000/10000 queries. users per second: 27759.7

recommendations finished on 6000/10000 queries. users per second: 28610.5

recommendations finished on 7000/10000 queries. users per second: 29651.7

recommendations finished on 8000/10000 queries. users per second: 30588.4

recommendations finished on 9000/10000 queries. users per second: 30825.3

recommendations finished on 10000/10000 queries. users per second: 27610

In [47]:

subset_recommendations.head()

Out[47]:

user_id	song	score	rank
c067c22072a17d33310d7223d 7b79f819e48cf42 ...	Grind With Me (Explicit Version) - Pretty Ricky ...	0.0459424376488	1
696787172dd3f5169dc94deef 97e427cee86147d ...	Senza Una Donna (Without A Woman) - Zucchero / ...	0.017026577677	1
532e98155cbfd1e1a474a28ed 96e59e50f7c5baf ...	Jive Talkin' (Album Version) - Bee Gees ...	0.0118288653237	1
18325842a941bc58449ee71d6 59a08d1c1bd2383 ...	Goodnight And Goodbye - Jonas Brothers ...	0.0159257985651	1
507433946f534f5d25ad1be30 2edb9a2376f503c ...	Find The Cost Of Freedom - Crosby_ Stills_ Nash & ...	0.0165806589303	1
18fafad477f9d72ff86f7d0bd 838a6573de0f64a ...	Rabbit Heart (Raise It Up) - Florence + The ...	0.0799399726093	1
fe85b96ba1983219b296f6b48 69dd29eb2b72ff9 ...	Secrets - OneRepublic	0.0788827141126	1
225ea420b4bede50919d1bfe2 4a599691522d176 ...	Clocks - Coldplay	0.0271030251796	1
95dc7e2b188b1148b2d25f4e6 b6e94afacc4efc3 ...	Bust a Move - Infected Mushroom ...	0.0534738540649	1
4a3a1ae2748f12f7ab921a47d 6d79abf82e3e325 ...	Isis (Spam Remix) - Alaska Y Dinarama ...	0.04180302118	1

[10 rows x 4 columns]

Counting how often each song is recommended, then sorting

In [48]:

most_recommended = subset_recommendations.groupby(key_columns='song', operations={'total_count': graphlab.aggregate.COUNT()})

In [49]:

most_recommended.sort('total_count', ascending=False) # descending order; the printed head is the top 10

Out[49]:

song	total_count
Secrets - OneRepublic	392
Undo - Björk	379
Revelry - Kings Of Leon	260
Horn Concerto No. 4 in E flat K495: II. Romance ...	139
Sehr kosmisch - Harmonia	124
Fireflies - Charttraxx Karaoke ...	114
Hey_ Soul Sister - Train	108
You're The One - Dwight Yoakam ...	75
OMG - Usher featuring will.i.am ...	55
Clocks - Coldplay	42

[3154 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

In [50]:

most_recommended.sort('total_count', ascending=True) # ascending order; the printed head is the bottom 10

Out[50]:

song	total_count
Arco Arena - Cake	1
Say Goodbye (Album Version) - Skillet ...	1
Back Against The Wall - Cage The Elephant ...	1
Leave The Bourbon On The Shelf - The Killers ...	1
Nomenclature - Andrew Bird ...	1
Wish You Were Here - Incubus ...	1
Change - Blind Melon	1
Get:On - Moguai	1
Big Brother - Kanye West	1
Perfectly Lonely - John Mayer ...	1

[3154 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
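With k=1 each of the 10,000 users receives exactly one song, and the output above shows 3,154 distinct songs among those recommendations, the most frequent one being recommended 392 times. The counting step itself can be sketched with the standard library as below; the toy list and the `distinct_share` measure are illustrative assumptions and not part of the original notebook.

```python
from collections import Counter

# Toy top-1 recommendations, one per user; song names are placeholders.
top_recommendations = ["S1", "S2", "S1", "S3", "S1", "S2"]

counts = Counter(top_recommendations)
most_common = counts.most_common(2)  # the two most frequently recommended songs
distinct_share = len(counts) / len(top_recommendations)  # distinct songs per recommendation

print(most_common)     # [('S1', 3), ('S2', 2)]
print(distinct_share)  # 0.5
```

A low share of distinct songs would mean the recommender keeps suggesting the same few titles; a higher share suggests more varied recommendations.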
