Building a Song Recommender with GraphLab Create

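Course overview:

GraphLab is a well-known recommender and graph-analytics system built on a C++ engine. It originated as an open-source project and supports a range of data structures and data-mining algorithms. Building a recommender with it takes only a few steps and gives reasonable results. This section shows how to build a song recommender with GraphLab Create.

Course objectives:

  • Understand the GraphLab system
  • Learn the workflow and analysis techniques for building a recommender with GraphLab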

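Background

About GraphLab Create

Well-known open-source recommender projects include SVDFeature, Crab, EasyRec, GraphLab, Lenskit, Mahout, and LibMF. Among them, GraphLab is a high-performance distributed graph processing and mining system written in C++. Its strength is parallel execution of iterative computations, which makes it well suited to random-walk and other graph-based recommendation algorithms on large data sets.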

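Supported data structures

  • SFrame: a disk-backed tabular structure of rows and columns in GraphLab Create, similar to a Pandas or R DataFrame, which makes very large data sets manageable.
  • SArray: an array structure similar to a Pandas Series.
  • Canvas: a browser-based interactive GUI for exploratory analysis of tabular data, among other things.
  • SGraph: GraphLab's own graph structure, similar to a NetworkX Graph, made up of vertices and edges.
  • GraphLab Create also works well with Pandas: a Pandas DataFrame can be read directly to build a model.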

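Supported data-mining algorithms

  • recommender: builds recommendation engines
  • graph_analytics: includes connected components, graph coloring, and others
  • clustering: currently K-Means and DBSCAN
  • regression: currently linear regression and boosted decision trees
  • classification: currently neural networks, decision trees, random forests, SVM, and others
  • text: text-mining models; currently topic models

For details on the data structures and methods, see https://dato.com/products/create/docs/index.html

Before use, install the graphlab-create Python package (https://dato.com/download/install.html) and import the graphlab module. Because GraphLab Create is commercial software, installation differs from that of an ordinary Python package.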

Installation:

!pip install --upgrade --no-cache-dir https://get.graphlab.com/GraphLab-Create/2.1/your registered email address here/your product key here/GraphLab-Create-License.tar.gz
In [1]:
import graphlab

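Building a recommender with GraphLab Create takes three steps: load the data, create a recommender model, and generate recommendations. The sections below follow this workflow.

Loading the music data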

In [2]:
song_data = graphlab.SFrame('song_data.gl')
# For loading data in other formats, see https://dato.com/learn/userguide/sframe/sframe-intro.html
[INFO] graphlab.cython.cy_server: GraphLab Create v2.1 started. Logging: /tmp/graphlab_server_1472092802.log

Exploratory data analysis

In [3]:
song_data.head()
Out[3]:
user_id                                      | song_id            | listen_count | title                             | artist
b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOAKIMP12A8C130995 | 1            | The Cove                          | Jack Johnson
b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOBBMDR12A8C13253B | 2            | Entre Dos Aguas                   | Paco De Lucia
b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOBXHDL12A81C204C0 | 1            | Stronger                          | Kanye West
b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOBYHAJ12A6701BF1D | 1            | Constellations                    | Jack Johnson
b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SODACBL12A8C13C273 | 1            | Learn To Fly                      | Foo Fighters
b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SODDNQT12A6D4F5F7E | 5            | Apuesta Por El Rock 'N' Roll ...  | Héroes del Silencio
b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SODXRTY12AB0180F3B | 1            | Paper Gangsta                     | Lady GaGa
b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOFGUAY12AB017B0A8 | 1            | Stacked Actors                    | Foo Fighters
b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOFRQTD12A81C233C0 | 1            | Sehr kosmisch                     | Harmonia
b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOHQWYZ12A6D4FA701 | 1            | Heaven's gonna burn your eyes ... | Thievery Corporation feat. Emiliana Torrini ...

song
The Cove - Jack Johnson
Entre Dos Aguas - Paco De Lucia ...
Stronger - Kanye West
Constellations - Jack Johnson ...
Learn To Fly - Foo Fighters ...
Apuesta Por El Rock 'N' Roll - Héroes del ...
Paper Gangsta - Lady GaGa
Stacked Actors - Foo Fighters ...
Sehr kosmisch - Harmonia
Heaven's gonna burn your eyes - Thievery ...
[10 rows x 6 columns]
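The song data has six columns: user ID (user_id), song ID (song_id), listen count (listen_count), song title (title), artist name (artist), and a combined song label (song) of the form "title - artist".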
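In the head() output above, each value in the song column reads as the title followed by " - " and the artist. A minimal sketch of how such a label could be derived, assuming plain string concatenation (the helper name make_song_label and the sample rows are illustrative and not part of GraphLab):

```python
# Hypothetical rows mirroring two records from the head() output.
records = [
    {"title": "The Cove", "artist": "Jack Johnson"},
    {"title": "Stronger", "artist": "Kanye West"},
]


def make_song_label(title, artist):
    """Join title and artist the way the 'song' column appears to be built."""
    return "{} - {}".format(title, artist)


for row in records:
    row["song"] = make_song_label(row["title"], row["artist"])

print([row["song"] for row in records])
```

A combined label of this kind would keep recordings that share a title but come from different artists apart when it is used as the item identifier.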
In [4]:
graphlab.canvas.set_target('ipynb') # set the output target for Canvas views
                                    # with 'ipynb', calling .show() renders the result in an output cell of the IPython Notebook
In [5]:
song_data['song'].show() # show how often each song appears and its share of all rows
In [6]:
len(song_data)
Out[6]:
1116609

Counting users

In [7]:
users = song_data["user_id"].unique()
In [8]:
len(users)
Out[8]:
66346

Building the song recommender

In [9]:
train_data, test_data = song_data.random_split(.8, seed=0) # randomly split the rows into a training set (about 80%) and a test set
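The training log further down reports 893580 of the 1116609 rows in the training set, about 80.03% rather than exactly 80%, which suggests that each row is assigned to one side independently at random. The sketch below illustrates that kind of split in plain Python; the function random_split defined here is a stand-in written for illustration and is not GraphLab's implementation.

```python
import random


def random_split(rows, fraction, seed=None):
    """Send each row to the first part with probability `fraction`."""
    rng = random.Random(seed)  # a private generator keeps the split reproducible
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second


rows = list(range(1000))  # stand-in for the rows of a table
train, test = random_split(rows, 0.8, seed=0)
print("{} training rows, {} test rows".format(len(train), len(test)))
```

With a fixed seed the same rows land in the same part on every run, which keeps later comparisons between models repeatable.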

A simple popularity-based recommender

In [10]:
popularity_model = graphlab.popularity_recommender.create(train_data,
                                                         user_id='user_id',
                                                         item_id='song')
Recsys training: model = popularity
Warning: Ignoring columns song_id, listen_count, title, artist;
    To use one of these as a target column, set target = 
    and use a method that allows the use of a target.
Preparing data set.
    Data has 893580 observations with 66085 users and 9952 items.
    Data prepared in: 1.06654s
893580 observations to process; with 9952 unique items.

Making recommendations with the popularity model

In [11]:
popularity_model.recommend(users=[users[0]]) # ranked list of recommended songs for the given user; the top 10 by default
Out[11]:
user_id                                      | song                                                | score  | rank
279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Sehr kosmisch - Harmonia                            | 4754.0 | 1
279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Undo - Björk                                        | 4227.0 | 2
279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | You're The One - Dwight Yoakam ...                  | 3781.0 | 3
279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Dog Days Are Over (Radio Edit) - Florence + The ... | 3633.0 | 4
279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Revelry - Kings Of Leon                             | 3527.0 | 5
279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Horn Concerto No. 4 in E flat K495: II. Romance ... | 3161.0 | 6
279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Secrets - OneRepublic                               | 3148.0 | 7
279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Hey_ Soul Sister - Train                            | 2538.0 | 8
279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Fireflies - Charttraxx Karaoke ...                  | 2532.0 | 9
279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Tive Sim - Cartola                                  | 2521.0 | 10
[10 rows x 4 columns]
In [12]:
popularity_model.recommend(users=[users[1]])
Out[12]:
user_id                                      | song                                                | score  | rank
c067c22072a17d33310d7223d7b79f819e48cf42 ... | Sehr kosmisch - Harmonia                            | 4754.0 | 1
c067c22072a17d33310d7223d7b79f819e48cf42 ... | Undo - Björk                                        | 4227.0 | 2
c067c22072a17d33310d7223d7b79f819e48cf42 ... | You're The One - Dwight Yoakam ...                  | 3781.0 | 3
c067c22072a17d33310d7223d7b79f819e48cf42 ... | Dog Days Are Over (Radio Edit) - Florence + The ... | 3633.0 | 4
c067c22072a17d33310d7223d7b79f819e48cf42 ... | Revelry - Kings Of Leon                             | 3527.0 | 5
c067c22072a17d33310d7223d7b79f819e48cf42 ... | Horn Concerto No. 4 in E flat K495: II. Romance ... | 3161.0 | 6
c067c22072a17d33310d7223d7b79f819e48cf42 ... | Secrets - OneRepublic                               | 3148.0 | 7
c067c22072a17d33310d7223d7b79f819e48cf42 ... | Hey_ Soul Sister - Train                            | 2538.0 | 8
c067c22072a17d33310d7223d7b79f819e48cf42 ... | Fireflies - Charttraxx Karaoke ...                  | 2532.0 | 9
c067c22072a17d33310d7223d7b79f819e48cf42 ... | Tive Sim - Cartola                                  | 2521.0 | 10
[10 rows x 4 columns]
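Both users above receive the same ten songs with the same scores, which is what a popularity model is expected to do: every user is offered the globally most frequent items. The following plain-Python sketch assumes that the score is the number of training observations per song and that songs a user has already listened to are skipped; the function names and toy data are hypothetical and this is not GraphLab's code.

```python
from collections import Counter, defaultdict

# Toy (user, song) observations; purely illustrative.
interactions = [
    ("u1", "A"), ("u2", "A"), ("u3", "A"),
    ("u1", "B"), ("u2", "B"),
    ("u4", "D"), ("u5", "E"),
]


def train_popularity(pairs):
    """Return per-song observation counts and each user's known songs."""
    counts = Counter(song for _, song in pairs)
    seen = defaultdict(set)
    for user, song in pairs:
        seen[user].add(song)
    return counts, seen


def recommend_popular(counts, seen, user, k=10):
    """Top-k most frequent songs that the user has not listened to yet."""
    known = seen.get(user, set())
    ranked = [(song, n) for song, n in counts.most_common() if song not in known]
    return ranked[:k]


counts, seen = train_popularity(interactions)
print(recommend_popular(counts, seen, "u4", k=2))
print(recommend_popular(counts, seen, "u5", k=2))
```

In this toy data u4 and u5 have different histories but neither has heard the two most frequent songs, so both receive the same list.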

Building a personalized song recommender

In [13]:
personalized_model = graphlab.item_similarity_recommender.create(train_data,
                                                                 user_id='user_id',
                                                                 item_id='song')
Recsys training: model = item_similarity
Warning: Ignoring columns song_id, listen_count, title, artist;
    To use one of these as a target column, set target = 
    and use a method that allows the use of a target.
Preparing data set.
    Data has 893580 observations with 66085 users and 9952 items.
    Data prepared in: 1.09829s
Training model from provided data.
Gathering per-item and per-user statistics.
+--------------------------------+------------+
| Elapsed Time (Item Statistics) | % Complete |
+--------------------------------+------------+
| 10.313ms                       | 3          |
| 70.996ms                       | 100        |
+--------------------------------+------------+
Setting up lookup tables.
Processing data in one pass using dense lookup tables.
+-------------------------------------+------------------+-----------------+
| Elapsed Time (Constructing Lookups) | Total % Complete | Items Processed |
+-------------------------------------+------------------+-----------------+
| 458.742ms                           | 0                | 0               |
| 1.10s                               | 100              | 9952            |
+-------------------------------------+------------------+-----------------+
Finalizing lookup tables.
Generating candidate set for working with new users.
Finished training in 2.23668s

Recommending songs with the personalized model

In [14]:
personalized_model.recommend(users=[users[0]])
Out[14]:
user_id                                      | song                                             | score           | rank
279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Riot In Cell Block Number Nine - Dr Feelgood ... | 0.0374999940395 | 1
279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Sei Lá Mangueira - Elizeth Cardoso ...           | 0.0331632643938 | 2
279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | The Stallion - Ween                              | 0.0322580635548 | 3
279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Rain - Subhumans                                 | 0.0314159244299 | 4
279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | West One (Shine On Me) - The Ruts ...            | 0.0306771993637 | 5
279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Back Against The Wall - Cage The Elephant ...    | 0.0301204770803 | 6
279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Life Less Frightening - Rise Against ...         | 0.0284431129694 | 7
279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | A Beggar On A Beach Of Gold - Mike And The ...   | 0.0230024904013 | 8
279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Audience Of One - Rise Against ...               | 0.0193938463926 | 9
279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Blame It On The Boogie - The Jacksons ...        | 0.0189873427153 | 10
[10 rows x 4 columns]
In [15]:
personalized_model.recommend(users=[users[1]])
Out[15]:
user_id                                      | song                                                | score           | rank
c067c22072a17d33310d7223d7b79f819e48cf42 ... | Grind With Me (Explicit Version) - Pretty Ricky ... | 0.0459424376488 | 1
c067c22072a17d33310d7223d7b79f819e48cf42 ... | There Goes My Baby - Usher ...                      | 0.0331920742989 | 2
c067c22072a17d33310d7223d7b79f819e48cf42 ... | Panty Droppa [Intro] (Album Version) - Trey ...     | 0.0318566203117 | 3
c067c22072a17d33310d7223d7b79f819e48cf42 ... | Nobody (Featuring Athena Cage) (LP Version) - ...   | 0.0278467655182 | 4
c067c22072a17d33310d7223d7b79f819e48cf42 ... | Youth Against Fascism - Sonic Youth ...             | 0.0262914180756 | 5
c067c22072a17d33310d7223d7b79f819e48cf42 ... | Nice & Slow - Usher                                 | 0.0239639401436 | 6
c067c22072a17d33310d7223d7b79f819e48cf42 ... | Making Love (Into The Night) - Usher ...            | 0.0238176941872 | 7
c067c22072a17d33310d7223d7b79f819e48cf42 ... | Naked - Marques Houston                             | 0.0228925704956 | 8
c067c22072a17d33310d7223d7b79f819e48cf42 ... | I.nner Indulgence - DESTRUCTION ...                 | 0.0220767498016 | 9
c067c22072a17d33310d7223d7b79f819e48cf42 ... | Love Lost (Album Version) - Trey Songz ...          | 0.0204497694969 | 10
[10 rows x 4 columns]

The model can also recommend similar songs

In [16]:
personalized_model.get_similar_items(['With Or Without You - U2'])
Out[16]:
song                     | similar                                        | score           | rank
With Or Without You - U2 | I Still Haven't Found What I'm Looking For ... | 0.042857170105  | 1
With Or Without You - U2 | Hold Me_ Thrill Me_ Kiss Me_ Kill Me - U2 ...  | 0.0337349176407 | 2
With Or Without You - U2 | Window In The Skies - U2                       | 0.0328358411789 | 3
With Or Without You - U2 | Vertigo - U2                                   | 0.0300751924515 | 4
With Or Without You - U2 | Sunday Bloody Sunday - U2                      | 0.0271317958832 | 5
With Or Without You - U2 | Bad - U2                                       | 0.0251798629761 | 6
With Or Without You - U2 | A Day Without Me - U2                          | 0.0237154364586 | 7
With Or Without You - U2 | Another Time Another Place - U2 ...            | 0.0203251838684 | 8
With Or Without You - U2 | Walk On - U2                                   | 0.0202020406723 | 9
With Or Without You - U2 | Get On Your Boots - U2                         | 0.0196850299835 | 10
[10 rows x 4 columns]
In [17]:
personalized_model.get_similar_items(['Chan Chan (Live) - Buena Vista Social Club'])
Out[17]:
song                                           | similar                                             | score           | rank
Chan Chan (Live) - Buena Vista Social Club ... | Murmullo - Buena Vista Social Club ...              | 0.188118815422  | 1
Chan Chan (Live) - Buena Vista Social Club ... | La Bayamesa - Buena Vista Social Club ...           | 0.18719214201   | 2
Chan Chan (Live) - Buena Vista Social Club ... | Amor de Loca Juventud - Buena Vista Social Club ... | 0.184834122658  | 3
Chan Chan (Live) - Buena Vista Social Club ... | Diferente - Gotan Project                           | 0.0214592218399 | 4
Chan Chan (Live) - Buena Vista Social Club ... | Mistica - Orishas                                   | 0.0205761194229 | 5
Chan Chan (Live) - Buena Vista Social Club ... | Hotel California - Gipsy Kings ...                  | 0.0193049907684 | 6
Chan Chan (Live) - Buena Vista Social Club ... | Nací Orishas - Orishas                              | 0.0191571116447 | 7
Chan Chan (Live) - Buena Vista Social Club ... | Le Moulin - Yann Tiersen                            | 0.018796980381  | 8
Chan Chan (Live) - Buena Vista Social Club ... | Gitana - Willie Colon                               | 0.018796980381  | 9
Chan Chan (Live) - Buena Vista Social Club ... | Criminal - Gotan Project                            | 0.0187793374062 | 10
[10 rows x 4 columns]
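The two calls used with the personalized model, recommend and get_similar_items, can be illustrated with a small item-similarity sketch. It assumes Jaccard similarity between the sets of users who listened to each song, and it scores a candidate song for a user by its mean similarity to the songs that user already has. GraphLab's actual scoring and defaults may differ, and all names and data below are made up for illustration.

```python
from collections import defaultdict

# Toy (user, song) observations; purely illustrative.
interactions = [
    ("u1", "A"), ("u1", "B"),
    ("u2", "A"), ("u2", "B"), ("u2", "C"),
    ("u3", "B"), ("u3", "C"),
    ("u4", "C"), ("u4", "D"),
]


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = len(a | b)
    return len(a & b) / float(union) if union else 0.0


def index_by_song(pairs):
    """Map each song to the set of users who listened to it."""
    song_users = defaultdict(set)
    for user, song in pairs:
        song_users[song].add(user)
    return dict(song_users)


def similar_songs(song_users, song, k=10):
    """Rank the other songs by Jaccard similarity to `song`."""
    scores = [(other, jaccard(song_users[song], users))
              for other, users in song_users.items() if other != song]
    return sorted(scores, key=lambda kv: (-kv[1], kv[0]))[:k]


def recommend(song_users, user, k=10):
    """Score unseen songs by mean similarity to the user's own songs."""
    own = [s for s, users in song_users.items() if user in users]
    scores = []
    for cand, users in song_users.items():
        if cand in own or not own:
            continue  # skip known songs; users without history get no scores
        sims = [jaccard(users, song_users[s]) for s in own]
        scores.append((cand, sum(sims) / len(sims)))
    return sorted(scores, key=lambda kv: (-kv[1], kv[0]))[:k]


song_users = index_by_song(interactions)
print(similar_songs(song_users, "A", k=2))
print(recommend(song_users, "u1", k=2))
```

Because similarity here depends only on who listened to what, songs by the same artist tend to score highly when they share an audience, which is consistent with the U2 and Buena Vista Social Club results above.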

Quantitative comparison of the recommender models

In [18]:
%matplotlib inline
model_performance = graphlab.recommender.util.compare_models(test_data,
                                                            [popularity_model, personalized_model],
                                                            user_sample=0.05)
# Compare the recommendation quality of the models on the test set; user_sample sets the fraction of users sampled for the evaluation
compare_models: using 2931 users to estimate model performance
PROGRESS: Evaluate model M0
recommendations finished on 1000/2931 queries. users per second: 14813.5
recommendations finished on 2000/2931 queries. users per second: 17580.4
Precision and recall summary statistics by cutoff
+--------+-----------------+------------------+
| cutoff |  mean_precision |   mean_recall    |
+--------+-----------------+------------------+
|   1    | 0.0341180484476 | 0.00823292386752 |
|   2    | 0.0317297850563 | 0.0160935708402  |
|   3    | 0.0274081655863 | 0.0204478560031  |
|   4    | 0.0246502900034 | 0.0243574988073  |
|   5    |  0.021494370522 | 0.0265069481676  |
|   6    | 0.0201296485841 | 0.0301672042844  |
|   7    | 0.0190573670615 | 0.0327615421653  |
|   8    | 0.0182105083589 | 0.0354908506521  |
|   9    | 0.0171348421093 | 0.0376075481628  |
|   10   |  0.016308427158 | 0.0397835214749  |
+--------+-----------------+------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M1
recommendations finished on 1000/2931 queries. users per second: 13598.8
recommendations finished on 2000/2931 queries. users per second: 17541.9
Precision and recall summary statistics by cutoff
+--------+-----------------+-----------------+
| cutoff |  mean_precision |   mean_recall   |
+--------+-----------------+-----------------+
|   1    |  0.191743432276 | 0.0585482156874 |
|   2    |  0.162060730126 | 0.0951129247189 |
|   3    |  0.141703627886 |  0.120469458609 |
|   4    |  0.125213237803 |  0.139421451307 |
|   5    |  0.113613101331 |  0.155305732877 |
|   6    |  0.103093369726 |  0.167448943357 |
|   7    | 0.0959691962763 |  0.181352449533 |
|   8    | 0.0896451722961 |  0.193948807268 |
|   9    | 0.0848023048637 |  0.205693986157 |
|   10   | 0.0805868304333 |  0.216014609656 |
+--------+-----------------+-----------------+
[10 rows x 3 columns]
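The tables above report mean precision and mean recall at each cutoff. As a rough sketch of those metrics, assuming precision at cutoff k is the number of hits divided by k and recall is the number of hits divided by the number of a user's test-set songs (the function names and sample data are illustrative):

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the first k recommendations for one user."""
    hits = sum(1 for song in recommended[:k] if song in relevant)
    precision = hits / float(k)
    recall = hits / float(len(relevant)) if relevant else 0.0
    return precision, recall


def mean_precision_recall(recs_by_user, relevant_by_user, k):
    """Average the per-user values over all evaluated users."""
    pairs = [precision_recall_at_k(recs_by_user[u], relevant_by_user[u], k)
             for u in sorted(relevant_by_user)]
    n = float(len(pairs))
    return (sum(p for p, _ in pairs) / n, sum(r for _, r in pairs) / n)


# Hypothetical ranked recommendations and held-out test songs per user.
recs_by_user = {"u1": ["A", "B", "C", "D"], "u2": ["C", "A", "E", "F"]}
relevant_by_user = {"u1": {"B", "D"}, "u2": {"C"}}

for k in (1, 2, 4):
    print("cutoff {}: {}".format(k, mean_precision_recall(recs_by_user, relevant_by_user, k)))
```

In the output above, M1 (item similarity) reaches a mean precision of about 0.081 at cutoff 10 against about 0.016 for M0 (popularity), and for both models precision falls while recall rises as the cutoff grows.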

Plotting precision-recall curves for the models

In [19]:
import matplotlib.pyplot as plt
%matplotlib inline

fig, ax = plt.subplots()

pr_curves_by_model = [res['precision_recall_overall'] for res in model_performance]

# compare_models labels the models M0 (popularity) and M1 (item similarity)
pr_curve = pr_curves_by_model[0].sort('recall')
ax.plot(list(pr_curve['recall']), list(pr_curve['precision']),
        'blue', label='M0 (popularity)')

pr_curve = pr_curves_by_model[1].sort('recall')
ax.plot(list(pr_curve['recall']), list(pr_curve['precision']),
        'green', label='M1 (item similarity)')

ax.set_title('Precision-Recall Averaged Over Users')
ax.set_xlabel('Recall')
ax.set_ylabel('Precision')
ax.legend()

plt.show()  # with the inline backend, fig.show() only emits a "non-GUI backend" UserWarning
In [20]:
dir(song_data) # list all attributes and methods of the SFrame song_data
Out[20]:
['_SFrame__construct_ctr',
 '_SFrame__dropna_errchk',
 '_SFrame__get_graphlabutil_reference_on_spark_unity_jar',
 '__class__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__doc__',
 '__eq__',
 '__format__',
 '__get_column_description__',
 '__get_pretty_tables__',
 '__get_staging_dir__',
 '__getattribute__',
 '__getitem__',
 '__has_size__',
 '__hash__',
 '__init__',
 '__is_materialized__',
 '__iter__',
 '__len__',
 '__materialize__',
 '__module__',
 '__ne__',
 '__new__',
 '__nonzero__',
 '__proxy__',
 '__query_plan_str__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__slots__',
 '__str__',
 '__str_impl__',
 '__subclasshook__',
 '_cache',
 '_group',
 '_imagecols_to_stringcols',
 '_infer_column_types_from_lines',
 '_proxy',
 '_read_csv_impl',
 '_repr_html_',
 '_row_selector',
 '_save_reference',
 'add_column',
 'add_columns',
 'add_row_number',
 'append',
 'apply',
 'column_names',
 'column_types',
 'copy',
 'dropna',
 'dropna_split',
 'dtype',
 'export_csv',
 'export_json',
 'fillna',
 'filter_by',
 'flat_map',
 'from_odbc',
 'from_rdd',
 'from_sql',
 'groupby',
 'head',
 'is_materialized',
 'join',
 'materialize',
 'num_cols',
 'num_columns',
 'num_rows',
 'pack_columns',
 'print_rows',
 'random_split',
 'read_csv',
 'read_csv_with_errors',
 'read_json',
 'remove_column',
 'remove_columns',
 'rename',
 'sample',
 'save',
 'select_column',
 'select_columns',
 'shape',
 'show',
 'sort',
 'split_datetime',
 'stack',
 'swap_columns',
 'tail',
 'to_dataframe',
 'to_numpy',
 'to_odbc',
 'to_rdd',
 'to_spark_dataframe',
 'to_sql',
 'topk',
 'unique',
 'unpack',
 'unstack']

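Building user sets for specific artists

In [21]:
users_foo = song_data[song_data["artist"]=="Foo Fighters"]
# Taking the band "Foo Fighters" as an example, keep the rows whose "artist" is "Foo Fighters" in users_foo,
# i.e. build the set of "Foo Fighters" listeners; the other artists below are handled the same way.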
In [22]:
users_foo = users_foo.unique()
In [23]:
users_foo
Out[23]:
artist       | listen_count | song                             | song_id            | title         | user_id
Foo Fighters | 6            | Next Year - Foo Fighters         | SOYYIZT12A8C1408CA | Next Year     | 7e3f6e77217967868c52338ebdc793b33ba28eb9 ...
Foo Fighters | 8            | Breakout - Foo Fighters          | SOMSQJY12A8C138539 | Breakout      | 6952b87aedaaddb57f99c2207d0fac06ca1bd86f ...
Foo Fighters | 1            | Exhausted - Foo Fighters         | SONMPJJ12AB0183AF8 | Exhausted     | dfb04c4a166ddd0e53cafccb9f0540a7744e2346 ...
Foo Fighters | 1            | Next Year - Foo Fighters         | SOYYIZT12A8C1408CA | Next Year     | 843209628fc05b104ccd0d8411cd41faa58a4d6e ...
Foo Fighters | 1            | The Pretender - Foo Fighters ... | SOQLUTQ12A8AE48037 | The Pretender | d6fe20c5b749f74e43595caf3f9b61b5bcded1ed ...
Foo Fighters | 13           | Everlong - Foo Fighters          | SOXVVSM12A8C142224 | Everlong      | 932ad377a3bb70d0d32969f4941f0268d560afc9 ...
Foo Fighters | 1            | Exhausted - Foo Fighters         | SONMPJJ12AB0183AF8 | Exhausted     | 7e3f6e77217967868c52338ebdc793b33ba28eb9 ...
Foo Fighters | 2            | Virginia Moon - Foo Fighters ... | SOKQTHF12B0B80B306 | Virginia Moon | d8dde5d48711ad8ae25253b5b310b6adec906f42 ...
Foo Fighters | 1            | Low - Foo Fighters               | SOXGFMC12A8C1386EC | Low           | 875988a37dd8d37da406d652fd8063a60a238b93 ...
Foo Fighters | 4            | Everlong - Foo Fighters          | SOXVVSM12A8C142224 | Everlong      | 608f53c1ec24eecf1a12e34d54a60f4d8a8b7591 ...
[3429 rows x 6 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
In [24]:
users_kanye = song_data[song_data["artist"]=="Kanye West"]
In [25]:
users_kanye = users_kanye.unique()
In [26]:
users_taylor = song_data[song_data["artist"]=="Taylor Swift"]
In [27]:
users_taylor = users_taylor.unique()
In [28]:
users_gaga = song_data[song_data["artist"]=="Lady GaGa"]
In [29]:
users_gaga = users_gaga.unique()

Size of each artist's set (distinct rows, before user IDs are deduplicated)

In [30]:
len(users_gaga)
Out[30]:
4129
In [31]:
len(users_foo)
Out[31]:
3429
In [32]:
len(users_taylor)
Out[32]:
6227
In [33]:
len(users_kanye)
Out[33]:
3775
In [34]:
song_data[1]
Out[34]:
{'artist': 'Paco De Lucia',
 'listen_count': 2,
 'song': 'Entre Dos Aguas - Paco De Lucia',
 'song_id': 'SOBBMDR12A8C13253B',
 'title': 'Entre Dos Aguas',
 'user_id': 'b80344d063b5ccb3212f76538f3d9e43d87dca9e'}

Deduplicating user IDs within each set

In [35]:
users_gaga = users_gaga["user_id"].unique()

Number of users in each set after deduplicating user IDs

In [36]:
len(users_gaga)
Out[36]:
2928
In [37]:
users_kanye = users_kanye["user_id"].unique()
users_foo = users_foo["user_id"].unique()
In [38]:
users_taylor = users_taylor["user_id"].unique()
In [39]:
len(users_kanye)
Out[39]:
2522
In [40]:
len(users_foo)
Out[40]:
2055
In [41]:
len(users_taylor)
Out[41]:
3246
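The cells above take two steps: unique() on the filtered SFrame drops duplicate rows, and unique() on the user_id column then yields the distinct users (for Lady GaGa, 4129 rows become 2928 users). A plain-Python sketch of the same idea, collecting user IDs into a set in one pass; the helper name and toy rows are hypothetical:

```python
# Toy listening records; user IDs are made up.
rows = [
    {"user_id": "u1", "artist": "Foo Fighters", "song": "Everlong"},
    {"user_id": "u1", "artist": "Foo Fighters", "song": "Low"},
    {"user_id": "u2", "artist": "Foo Fighters", "song": "Everlong"},
    {"user_id": "u2", "artist": "Kanye West", "song": "Stronger"},
    {"user_id": "u3", "artist": "Kanye West", "song": "Stronger"},
]


def listeners_of(records, artist):
    """Distinct user IDs with at least one record for the given artist."""
    return {r["user_id"] for r in records if r["artist"] == artist}


foo_rows = [r for r in rows if r["artist"] == "Foo Fighters"]
foo_users = listeners_of(rows, "Foo Fighters")
print("{} rows, {} distinct users".format(len(foo_rows), len(foo_users)))
```

The row count exceeds the user count whenever one user has listened to several songs by the same artist, which is the gap visible between the two sets of lengths above.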

Total listen counts per artist

In [42]:
listen_counts = song_data.groupby(key_columns='artist', operations={'total_count': graphlab.aggregate.SUM('listen_count')})
# group by 'artist' and sum 'listen_count' within each group to get the total 'total_count'
In [43]:
listen_counts.sort('total_count', ascending=False) # sort descending; the printed head shows the top 10
Out[43]:
artist                 | total_count
Kings Of Leon          | 43218
Dwight Yoakam          | 40619
Björk                  | 38889
Coldplay               | 35362
Florence + The Machine | 33387
Justin Bieber          | 29715
Alliance Ethnik        | 26689
OneRepublic            | 25754
Train                  | 25402
The Black Keys         | 22184
[3375 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
In [44]:
listen_counts.sort('total_count', ascending=True) # sort ascending; the printed head shows the bottom 10
Out[44]:
artist                                          | total_count
William Tabbert                                 | 14
Reel Feelings                                   | 24
Beyoncé feat. Bun B and Slim Thug ...           | 26
Boggle Karaoke                                  | 30
Diplo                                           | 30
harvey summers                                  | 31
Nâdiya                                          | 36
Jody Bernal                                     | 38
Aneta Langerova                                 | 38
Kanye West / Talib Kweli / Q-Tip / Common / ... | 38
[3375 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
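The groupby call above sums listen_count within each artist and the two sort calls order the result. An equivalent plain-Python sketch on made-up rows, using a dictionary of running totals (the function name total_listens is illustrative):

```python
from collections import defaultdict

# Toy rows; the listen counts are invented for the example.
rows = [
    {"artist": "Kings Of Leon", "listen_count": 3},
    {"artist": "Coldplay", "listen_count": 1},
    {"artist": "Kings Of Leon", "listen_count": 2},
    {"artist": "Coldplay", "listen_count": 5},
    {"artist": "Train", "listen_count": 4},
]


def total_listens(records):
    """Sum listen_count per artist, like a group-by with a SUM aggregate."""
    totals = defaultdict(int)
    for r in records:
        totals[r["artist"]] += r["listen_count"]
    return dict(totals)


totals = total_listens(rows)
# Sort by total descending, breaking ties by artist name.
ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
print(ranked)
```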

Using the personalized model to recommend songs for 10,000 user IDs from the test set

In [45]:
subset_test_users = test_data['user_id'].unique()[0:10000]
In [46]:
subset_recommendations = personalized_model.recommend(subset_test_users,k=1)
recommendations finished on 1000/10000 queries. users per second: 13895.8
recommendations finished on 2000/10000 queries. users per second: 20416.5
recommendations finished on 3000/10000 queries. users per second: 24293.7
recommendations finished on 4000/10000 queries. users per second: 26516.8
recommendations finished on 5000/10000 queries. users per second: 27759.7
recommendations finished on 6000/10000 queries. users per second: 28610.5
recommendations finished on 7000/10000 queries. users per second: 29651.7
recommendations finished on 8000/10000 queries. users per second: 30588.4
recommendations finished on 9000/10000 queries. users per second: 30825.3
recommendations finished on 10000/10000 queries. users per second: 27610
In [47]:
subset_recommendations.head()
Out[47]:
user_id                                      | song                                                  | score           | rank
c067c22072a17d33310d7223d7b79f819e48cf42 ... | Grind With Me (Explicit Version) - Pretty Ricky ...   | 0.0459424376488 | 1
696787172dd3f5169dc94deef97e427cee86147d ... | Senza Una Donna (Without A Woman) - Zucchero / ...    | 0.017026577677  | 1
532e98155cbfd1e1a474a28ed96e59e50f7c5baf ... | Jive Talkin' (Album Version) - Bee Gees ...           | 0.0118288653237 | 1
18325842a941bc58449ee71d659a08d1c1bd2383 ... | Goodnight And Goodbye - Jonas Brothers ...            | 0.0159257985651 | 1
507433946f534f5d25ad1be302edb9a2376f503c ... | Find The Cost Of Freedom - Crosby_ Stills_ Nash & ... | 0.0165806589303 | 1
18fafad477f9d72ff86f7d0bd838a6573de0f64a ... | Rabbit Heart (Raise It Up) - Florence + The ...       | 0.0799399726093 | 1
fe85b96ba1983219b296f6b4869dd29eb2b72ff9 ... | Secrets - OneRepublic                                 | 0.0788827141126 | 1
225ea420b4bede50919d1bfe24a599691522d176 ... | Clocks - Coldplay                                     | 0.0271030251796 | 1
95dc7e2b188b1148b2d25f4e6b6e94afacc4efc3 ... | Bust a Move - Infected Mushroom ...                   | 0.0534738540649 | 1
4a3a1ae2748f12f7ab921a47d6d79abf82e3e325 ... | Isis (Spam Remix) - Alaska Y Dinarama ...             | 0.04180302118   | 1
[10 rows x 4 columns]

Counting how often each song is recommended, then sorting

In [48]:
most_recommended = subset_recommendations.groupby(key_columns='song', operations={'total_count': graphlab.aggregate.COUNT()})
In [49]:
most_recommended.sort('total_count', ascending=False) # sort descending; the printed head shows the top 10
Out[49]:
song                                                | total_count
Secrets - OneRepublic                               | 392
Undo - Björk                                        | 379
Revelry - Kings Of Leon                             | 260
Horn Concerto No. 4 in E flat K495: II. Romance ... | 139
Sehr kosmisch - Harmonia                            | 124
Fireflies - Charttraxx Karaoke ...                  | 114
Hey_ Soul Sister - Train                            | 108
You're The One - Dwight Yoakam ...                  | 75
OMG - Usher featuring will.i.am ...                 | 55
Clocks - Coldplay                                   | 42
[3154 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
In [50]:
most_recommended.sort('total_count', ascending=True) # sort ascending; the printed head shows the bottom 10
Out[50]:
song                                             | total_count
Arco Arena - Cake                                | 1
Say Goodbye (Album Version) - Skillet ...        | 1
Back Against The Wall - Cage The Elephant ...    | 1
Leave The Bourbon On The Shelf - The Killers ... | 1
Nomenclature - Andrew Bird ...                   | 1
Wish You Were Here - Incubus ...                 | 1
Change - Blind Melon                             | 1
Get:On - Moguai                                  | 1
Big Brother - Kanye West                         | 1
Perfectly Lonely - John Mayer ...                | 1
[3154 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
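The counts above show how the top-1 recommendations are distributed: the most recommended song is picked for 392 of the 10000 users, and 3154 distinct songs appear overall. A small sketch of computing such figures from a list of top-1 picks, using made-up data and illustrative variable names:

```python
from collections import Counter

# Hypothetical top-1 recommendation per user, one entry per user.
top1_picks = ["Secrets", "Undo", "Secrets", "Clocks", "Secrets", "Undo"]

counts = Counter(top1_picks)              # how often each song is recommended
song, n = counts.most_common(1)[0]        # the most recommended song and its count
share = n / float(len(top1_picks))        # fraction of users who receive that song
distinct = len(counts)                    # number of different songs recommended

print("{} is recommended to {:.0%} of users; {} distinct songs".format(song, share, distinct))
```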