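Today we will learn how to build a recommender system using nothing but singular value decomposition (SVD). If you are not very familiar with SVD, the earlier post "Finally Understanding Singular Value Decomposition (SVD): Principles and Applications" is recommended reading.
Singular value decomposition is a very popular linear algebra technique for breaking a matrix down into the product of a few smaller matrices. It has many uses. One of them is to mine the relationships between items with SVD and build a recommender system from the result.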
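This article covers:
How to compute the singular value decomposition of a matrix
How to interpret the result of a singular value decomposition
What data a recommender system needs, and how to analyze it with SVD
How to use the result of SVD to make recommendations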
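Introduction to singular value decomposition
Just as the integer 24 can be factored as 24 = 2×3×4, a matrix can be expressed as the product of other matrices. Because matrices are arrays of numbers with their own multiplication rules, there are several different ways to factor them, known as decompositions. QR decomposition and LU decomposition are common examples. Another is singular value decomposition, which places no restriction on the shape or properties of the matrix being decomposed.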
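Suppose a matrix $M$ (say, an $m \times n$ matrix) is decomposed as
$$M = U \Sigma V^T$$
where $U$ is an $m \times m$ matrix, $\Sigma$ is an $m \times n$ diagonal matrix, and $V^T$ is an $n \times n$ matrix. The diagonal matrix $\Sigma$ may be non-square, but only the entries on its diagonal can be non-zero. The matrices $U$ and $V$ are orthogonal matrices, meaning that the columns of $U$ and of $V$ are unit vectors that are mutually orthogonal. Two vectors are orthogonal if their dot product is zero, and a vector is a unit vector if its L2 norm is 1. A property of an orthogonal matrix is that its transpose is its inverse. In other words, since $U$ is orthogonal, $U^T = U^{-1}$, or $U U^T = U^T U = I$, where $I$ is the identity matrix.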
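Singular value decomposition gets its name from the diagonal entries of $\Sigma$, which are called the singular values of $M$. They are in fact the square roots of the eigenvalues of $M M^T$. Just as factoring a number into primes reveals its structure, the singular value decomposition of a matrix reveals the structure of that matrix.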
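The properties above can be checked numerically. The following is a minimal sketch using NumPy on a small made-up 4×3 matrix (the matrix and the variable names are illustrative assumptions, not part of the dataset used later):

```python
import numpy as np

# Hypothetical 4x3 example matrix (not taken from the review data)
M = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 1.0],
              [1.0, 1.0, 5.0],
              [0.0, 2.0, 4.0]])

# Full SVD: U is 4x4, Vt is 3x3, s holds the min(4, 3) = 3 singular values
U, s, Vt = np.linalg.svd(M, full_matrices=True)

# Build the 4x3 "diagonal" matrix Sigma with s on its main diagonal
Sigma = np.zeros(M.shape)
np.fill_diagonal(Sigma, s)

# U and V are orthogonal: the transpose is the inverse
u_is_orthogonal = np.allclose(U @ U.T, np.eye(4))
v_is_orthogonal = np.allclose(Vt @ Vt.T, np.eye(3))

# Multiplying the three factors recovers M
reconstructs = np.allclose(U @ Sigma @ Vt, M)

# The singular values are the square roots of the eigenvalues of M^T M,
# which are also the non-zero eigenvalues of M M^T
eigvals = np.linalg.eigvalsh(M.T @ M)[::-1]  # reversed into descending order
matches = np.allclose(s, np.sqrt(eigvals))

print(u_is_orthogonal, v_is_orthogonal, reconstructs, matches)
```

The sketch uses $M^T M$ (3×3) rather than $M M^T$ (4×4) only to avoid taking the square root of an eigenvalue that is zero up to rounding error.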
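What is described above is called the full SVD. There is another version called the reduced SVD or compact SVD. The formula is still $M = U \Sigma V^T$, but now $\Sigma$ is an $r \times r$ square diagonal matrix, where $r$ is the rank of $M$, which is at most the smaller of $m$ and $n$. The matrix $U$ is $m \times r$ and $V^T$ is $r \times n$. Because $U$ and $V$ are non-square, they are called semi-orthogonal, meaning $U^T U = I$ and $V^T V = I$, with $I$ in both cases the $r \times r$ identity matrix.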
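As an illustration of the reduced form, here is a small sketch on a made-up rank-2 matrix. Note that `numpy.linalg.svd(..., full_matrices=False)` returns min(m, n) singular values rather than exactly r of them, so the sketch drops the numerically zero ones by hand; the `1e-10` tolerance is an arbitrary assumption.

```python
import numpy as np

# Hypothetical 4x3 matrix of rank 2: the third column is the sum of the first two
M = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 1.0],
              [2.0, 0.0, 2.0],
              [1.0, 1.0, 2.0]])

# Thin SVD: U is 4x3, s has 3 entries, Vt is 3x3
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Keep only the r singular values that are numerically non-zero
r = int(np.sum(s > 1e-10))
U_r, s_r, Vt_r = U[:, :r], s[:r], Vt[:r, :]

# Semi-orthogonality: both products give the r x r identity matrix
left_ok = np.allclose(U_r.T @ U_r, np.eye(r))
right_ok = np.allclose(Vt_r @ Vt_r.T, np.eye(r))

# The r x r diagonal Sigma still reproduces M
reconstructs = np.allclose(U_r @ np.diag(s_r) @ Vt_r, M)

print(r, U_r.shape, Vt_r.shape, left_ok, right_ok, reconstructs)
```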
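The meaning of singular value decomposition in recommender systems
If the rank of $M$ is $r$, it can be shown that $M M^T$ and $M^T M$ both have rank $r$. In the (reduced) singular value decomposition, the columns of $U$ are eigenvectors of $M M^T$, and the rows of $V^T$ are eigenvectors of $M^T M$. Interestingly, $M M^T$ and $M^T M$ may have different sizes (because $M$ can be non-square), yet they share the same set of non-zero eigenvalues, namely the squares of the values on the diagonal of $\Sigma$.
This is why the result of a singular value decomposition can reveal so much about the matrix $M$.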
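Suppose we have collected some book reviews, with books as columns, people as rows, and each entry being the rating a person gave to a book. In that case, $M M^T$ is a person-to-person table in which each entry is the sum, over all books, of the product of the ratings two people gave. Similarly, $M^T M$ is a book-to-book table in which each entry is the sum, over all people, of the product of the ratings two books received. What is the hidden connection between people and books? It could be the genre, the author, or something of a similar nature.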
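A tiny worked example may make the two tables concrete. The ratings below are invented for illustration (3 people, 4 books, 0 meaning "not rated"):

```python
import numpy as np

# Hypothetical ratings: rows are people, columns are books, 0 = not rated
M = np.array([[5.0, 0.0, 3.0, 0.0],
              [4.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 4.0, 5.0]])

person_to_person = M @ M.T  # 3x3 table
book_to_book = M.T @ M      # 4x4 table

# Person 0 vs person 1: 5*4 + 0*2 + 3*0 + 0*0 = 20
pp_01 = person_to_person[0, 1]

# Book 0 vs book 2: 5*3 + 4*0 + 0*4 = 15
bb_02 = book_to_book[0, 2]

# Both tables share the same non-zero eigenvalues: the squared singular values
s = np.linalg.svd(M, compute_uv=False)
pp_eig = np.linalg.eigvalsh(person_to_person)[::-1]
bb_eig = np.linalg.eigvalsh(book_to_book)[::-1][:3]  # drop the extra zero eigenvalue
same_eigenvalues = np.allclose(pp_eig, s**2) and np.allclose(bb_eig, s**2)

print(pp_01, bb_02, same_eigenvalues)
```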
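Building the recommender system
The dataset
Next, let's see how to use the result of SVD to build a recommender system. First, download the dataset from this link (note: it is about 600 MB).
This dataset is the "Social Recommendation Data"[2] from "Recommender Systems and Personalization Datasets"[1]. It contains the reviews users gave to books on Librarything[3]. What we are interested in is the number of "stars" a user gave to a book.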
If you open this tar file, you will see a large file named "reviews.json". You can extract it, or read the contained file on the fly:
import tarfile

# To get the data: reply "lthing_data" in the backend of the WeChat account 机器学习研习院
with tarfile.open("lthing_data.tar.gz") as tar:
    print("Files in tar archive:")
    tar.list()

    # Read the first four records straight from the archive without extracting it
    with tar.extractfile("lthing_data/reviews.json") as file:
        count = 0
        for line in file:
            print(line)
            count += 1
            if count > 3:
                break
The above prints:
Files in tar archive:
?rwxr-xr-x julian/julian 0 2016-09-30 17:58:55 lthing_data/
?rw-r--r-- julian/julian 4824989 2014-01-02 13:55:12 lthing_data/edges.txt
?rw-rw-r-- julian/julian 1604368260 2016-09-30 17:58:25 lthing_data/reviews.json
b"{'work': '3206242', 'flags': [], 'unixtime': 1194393600, 'stars': 5.0, 'nhelpful': 0, 'time': 'Nov 7, 2007', 'comment': 'This a great book for young readers to be introduced to the world of Middle Earth. ', 'user': 'van_stef'}\n"
b"{'work': '12198649', 'flags': [], 'unixtime': 1333756800, 'stars': 5.0, 'nhelpful': 0, 'time': 'Apr 7, 2012', 'comment': 'Help Wanted: Tales of On The Job Terror from Evil Jester Press is a fun and scary read. This book is edited by Peter Giglio and has short stories by Joe McKinney, Gary Brandner, Henry Snider and many more. As if work wasnt already scary enough, this book gives you more reasons to be scared. Help Wanted is an excellent anthology that includes some great stories by some master storytellers.\\nOne of the stories includes Agnes: A Love Story by David C. Hayes, which tells the tale of a lawyer named Jack who feels unappreciated at work and by his wife so he starts a relationship with a photocopier. They get along well until the photocopier starts wanting the lawyer to kill for it. The thing I liked about this story was how the author makes you feel sorry for Jack. His two co-workers are happily married and love their jobs while Jack is married to a paranoid alcoholic and he hates and works at a job he cant stand. You completely understand how he can fall in love with a copier because he is a lonely soul that no one understands except the copier of course.\\nAnother story in Help Wanted is Work Life Balance by Jeff Strand. In this story a man works for a company that starts to let their employees do what they want at work. It starts with letting them come to work a little later than usual, then the employees are allowed to hug and kiss on the job. Things get really out of hand though when the company starts letting employees carry knives and stab each other, as long as it doesnt interfere with their job. This story is meant to be more funny then scary but still has its scary moments. Jeff Strand does a great job mixing humor and horror in this story.\\nAnother good story in Help Wanted: On The Job Terror is The Chapel Of Unrest by Stephen Volk. 
This is a gothic horror story that takes place in the 1800s and has to deal with an undertaker who has the duty of capturing and embalming a ghoul who has been eating dead bodies in a graveyard. Stephen Volk through his use of imagery in describing the graveyard, the chapel and the clothes of the time, transports you into an 1800s gothic setting that reminded me of Bram Stokers Dracula.\\nOne more story in this anthology that I have to mention is Expulsion by Eric Shapiro which tells the tale of a mad man going into a office to kill his fellow employees. This is a very short but very powerful story that gets you into the mind of a disgruntled employee but manages to end on a positive note. Though there were stories I didnt like in Help Wanted, all in all its a very good anthology. I highly recommend this book ', 'user': 'dwatson2'}\n"
b"{'work': '12533765', 'flags': [], 'unixtime': 1352937600, 'nhelpful': 0, 'time': 'Nov 15, 2012', 'comment': 'Magoon, K. (2012). Fire in the streets. New York: Simon and Schuster/Aladdin. 336 pp. ISBN: 978-1-4424-2230-8. (Hardcover); $16.99.\\nKekla Magoon is an author to watch (http://www.spicyreads.org/Author_Videos.html- scroll down). One of my favorite books from 2007 is Magoons The Rock and the River. At the time, I mentioned in reviews that we have very few books that even mention the Black Panther Party, let alone deal with them in a careful, thorough way. Fire in the Streets continues the story Magoon began in her debut book. While her familys financial fortunes drip away, not helped by her mothers drinking and assortment of boyfriends, the Panthers provide a very real respite for Maxie. Sam is still dealing with the death of his brother. Maxies relationship with Sam only serves to confuse and upset them both. Her friends, Emmalee and Patrice, are slowly drifting away. The Panther Party is the only thing that seems to make sense and she basks in its routine and consistency. She longs to become a full member of the Panthers and constantly battles with her Panther brother Raheem over her maturity and ability to do more than office tasks. Maxie wants to have her own gun. When Maxie discovers that there is someone working with the Panthers that is leaking information to the government about Panther activity, Maxie investigates. Someone is attempting to destroy the only place that offers her shelter. Maxie is determined to discover the identity of the traitor, thinking that this will prove her worth to the organization. However, the truth is not simple and it is filled with pain. Unfortunately we still do not have many teen books that deal substantially with the Democratic National Convention in Chicago, the Black Panther Party, and the social problems in Chicago that lead to the civil unrest. 
Thankfully, Fire in the Streets lives up to the standard Magoon set with The Rock and the River. Readers will feel like they have stepped back in time. Magoons factual tidbits add journalistic realism to the story and only improves the atmosphere. Maxie has spunk. Readers will empathize with her Atlas-task of trying to hold onto her world. Fire in the Streets belongs in all middle school and high school libraries. While readers are able to read this story independently of The Rock and the River, I strongly urge readers to read both and in order. Magoons recognition by the Coretta Scott King committee and the NAACP Image awards are NOT mistakes!', 'user': 'edspicer'}\n"
b'{\'work\': \'12981302\', \'flags\': [], \'unixtime\': 1364515200, \'stars\': 4.0, \'nhelpful\': 0, \'time\': \'Mar 29, 2013\', \'comment\': "Well, I definitely liked this book better than the last in the series. There was less fighting and more story. I liked both Toni and Ricky Lee and thought they were pretty good together. The banter between the two was sweet and often times funny. I enjoyed seeing some of the past characters and of course it\'s always nice to be introduced to new ones. I just wonder how many more of these books there will be. At least two hopefully, one each for Rory and Reece. ", \'user\': \'amdrane2\'}\n'
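Extracting the data
Each line in reviews.json is one record. We will extract the "user", "work", and "stars" fields of each record, provided none of the three is missing. Despite the file name, the records are not strictly well-formed JSON strings; in particular, they use single quotes instead of double quotes. We therefore cannot use Python's json package and use ast to decode the strings instead: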
import ast

reviews = []
with tarfile.open("lthing_data.tar.gz") as tar:
    with tar.extractfile("lthing_data/reviews.json") as file:
        for line in file:
            # Each line is a Python-literal dict, not strict JSON
            record = ast.literal_eval(line.decode("utf8"))
            # Skip records that lack any of the three fields we need
            if any(x not in record for x in ['user', 'work', 'stars']):
                continue
            reviews.append([record['user'], record['work'], record['stars']])
print(len(reviews), "records retrieved")
1387209 records retrieved
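To see why `json` is not an option here, the sketch below parses one shortened record in the same single-quoted style as the lines of reviews.json (the record is abbreviated for illustration). `json.loads` rejects it, while `ast.literal_eval` reads it as a Python dictionary literal:

```python
import ast
import json

# A shortened record in the same style as the lines of reviews.json
line = b"{'work': '3206242', 'flags': [], 'stars': 5.0, 'user': 'van_stef'}\n"
text = line.decode("utf8")

# Strict JSON requires double-quoted strings, so this fails
try:
    json.loads(text)
    parsed_as_json = True
except json.JSONDecodeError:
    parsed_as_json = False

# literal_eval safely evaluates Python literals (dicts, lists, strings, numbers)
record = ast.literal_eval(text)

print(parsed_as_json, record["user"], record["work"], record["stars"])
```

Unlike `eval`, `ast.literal_eval` only accepts literals, so it does not execute arbitrary code found in the file.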
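Building the data frame
Now we create a matrix that records how the different users rated each book. We use the pandas library to turn the collected data into a table: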
import pandas as pd
reviews = pd.DataFrame(reviews, columns=["user", "work", "stars"])
print(reviews.head())
user work stars
0 van_stef 3206242 5.0
1 dwatson2 12198649 5.0
2 amdrane2 12981302 4.0
3 Lila_Gustavus 5231009 3.0
4 skinglist 184318 2.0
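Filtering the data
To save time and memory, we do not use all of the data here. We only consider users who reviewed at least 50 books and books that were reviewed by at least 50 users. This trims the dataset to less than 15% of its original size:
Find the users who reviewed at least 50 books: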
usercount = reviews[["work","user"]].groupby("user").count()
usercount = usercount[usercount["work"] >= 50]
print(usercount.head())
work
user
84
-Eva- 602
06nwingert 370
1983mk 63
1dragones 194
Find the books that were reviewed by at least 50 users:
workcount = reviews[["work","user"]].groupby("work").count()
workcount = workcount[workcount["user"] >= 50]
print(workcount.head())
user
work
10000 106
10001 53
1000167 186
10001797 53
10005525 134
Keep only the popular books and the active users:
reviews = reviews[reviews["user"].isin(usercount.index) & reviews["work"].isin(workcount.index)]
print(reviews)
user work stars
0 van_stef 3206242 5.0
6 justine 3067 4.5
18 stephmo 1594925 4.0
19 Eyejaybee 2849559 5.0
35 LisaMaria_C 452949 4.5
... ... ... ...
1387161 connie53 1653 4.0
1387177 BruderBane 24623 4.5
1387192 StuartAston 8282225 4.0
1387202 danielx 9759186 4.0
1387206 jclark88 8253945 3.0
[205110 rows x 3 columns]
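The same filtering logic can be tried on a toy table. This is a sketch under assumptions: the data are invented, the threshold is lowered from 50 to 2, and `value_counts` is used in place of the `groupby(...).count()` calls above, which gives the same per-user and per-book counts here.

```python
import pandas as pd

# Hypothetical toy review table
toy = pd.DataFrame({
    "user":  ["ann", "ann", "bob", "bob", "cat", "dan"],
    "work":  ["b1",  "b2",  "b1",  "b2",  "b1",  "b3"],
    "stars": [5.0,   4.0,   3.0,   4.5,   2.0,   5.0],
})
min_reviews = 2  # stands in for the threshold of 50 used on the real data

# Count reviews per user and per book
user_counts = toy["user"].value_counts()
work_counts = toy["work"].value_counts()
active_users = user_counts[user_counts >= min_reviews].index
popular_works = work_counts[work_counts >= min_reviews].index

# Keep rows whose user is active AND whose book is popular
filtered = toy[toy["user"].isin(active_users) & toy["work"].isin(popular_works)]
print(filtered)
```

Both counts are taken on the unfiltered table, so after filtering a remaining user or book may end up with fewer reviews than the threshold. For this tutorial that does not matter, since the goal is only to shrink the data.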
Transforming the data
Then we use the pivot function in pandas to convert the table into a matrix:
reviewmatrix = reviews.pivot(index="user", columns="work", values="stars").fillna(0)
The result is a matrix with 5593 rows and 2898 columns.
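To make the reshaping concrete, here is the same `pivot(...).fillna(0)` call on a small invented table. Users become rows, books become columns, and a missing rating becomes 0:

```python
import pandas as pd

# Hypothetical long-format reviews: one row per (user, book) rating
toy = pd.DataFrame({
    "user":  ["ann", "ann", "bob", "cat"],
    "work":  ["b1",  "b2",  "b1",  "b3"],
    "stars": [5.0,   4.0,   3.0,   2.0],
})

# Wide format: index = users, columns = books, values = stars
matrix = toy.pivot(index="user", columns="work", values="stars").fillna(0)
print(matrix)
```

Note that `pivot` raises a `ValueError` if the same (user, work) pair appears more than once; in that situation `pivot_table` with an aggregation function is one possible alternative.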
Applying singular value decomposition
Here we represent 5593 users and 2898 books in one matrix. Then we apply SVD (this takes a while):
from numpy.linalg import svd
matrix = reviewmatrix.values
u, s, vh = svd(matrix, full_matrices=False)
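By default, svd() returns a full singular value decomposition. We choose the reduced version so that smaller matrices are used and memory is saved. The columns of vh correspond to the books. Based on the vector space model, we can find which book is most similar to the one we are looking at: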
import numpy as np

def cosine_similarity(v, u):
    return (v @ u) / (np.linalg.norm(v) * np.linalg.norm(u))

highest_similarity = -np.inf
highest_sim_col = -1
for col in range(1, vh.shape[1]):
    similarity = cosine_similarity(vh[:,0], vh[:,col])
    if similarity > highest_similarity:
        highest_similarity = similarity
        highest_sim_col = col

print("Column %d is most similar to column 0" % highest_sim_col)
Column 906 is most similar to column 0
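In the example above, we look for the book that best matches the first column. The result is column 906.
In a recommender system, when a user picks a book, we can use the cosine similarity computed above to show her a few other books that are most similar to the one she picked.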
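Showing "a few other books" means ranking all columns rather than keeping only the best one. The sketch below is one way to do that with vectorized NumPy. The function name `top_k_similar` is hypothetical, and the small random matrix stands in for the real `vh`, with one column deliberately made parallel to column 0 so the expected winner is known:

```python
import numpy as np

def top_k_similar(vh, col, k=5):
    """Return the indices of the k columns of vh most similar to column `col`."""
    norms = np.linalg.norm(vh, axis=0)               # L2 norm of every column
    sims = (vh[:, col] @ vh) / (norms[col] * norms)  # cosine similarity to every column
    sims[col] = -np.inf                              # exclude the query column itself
    return np.argsort(sims)[::-1][:k]                # highest similarity first

# Hypothetical stand-in for vh: 6 latent features x 8 books
rng = np.random.default_rng(0)
vh_demo = rng.normal(size=(6, 8))
vh_demo[:, 3] = 2 * vh_demo[:, 0]  # make book 3 point in the same direction as book 0

recommended = top_k_similar(vh_demo, 0, k=3)
print(recommended)
```

Mapping the returned column positions back to book IDs would use the column labels of the pivoted matrix (`reviewmatrix.columns` in the code above).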
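Depending on the dataset, we may want to use truncated SVD to reduce the dimension of the matrix vh. In essence, before using vh to compute similarity, we remove the rows of vh whose corresponding singular values in s are small. This may make the prediction more accurate, because the less important features of a book are excluded from consideration.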
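A minimal sketch of the truncation follows, using an invented 20×6 ratings matrix built from two latent "genres" (all names and numbers here are assumptions for illustration). One point worth checking on your own data: when the ratings matrix has at least as many rows as columns, the `vh` returned with `full_matrices=False` is square and orthogonal, so its full-length columns are mutually orthogonal and their cosine similarities are zero up to rounding error. Keeping only the first k rows is what makes the comparison informative.

```python
import numpy as np

def cosine_similarity(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(42)

# Hypothetical data: 20 users, 6 books, 2 latent genres plus a little noise.
# Books 0, 1, 4 lean towards genre A; books 2, 3, 5 lean towards genre B.
user_taste = rng.random((20, 2))
book_genre = np.array([[1.0, 0.9, 0.0, 0.1, 1.0, 0.0],
                       [0.0, 0.1, 1.0, 0.9, 0.0, 1.0]])
matrix = user_taste @ book_genre + 0.01 * rng.normal(size=(20, 6))

u, s, vh = np.linalg.svd(matrix, full_matrices=False)  # vh is 6x6 here

# s is sorted in descending order, so the first k rows of vh
# belong to the k largest singular values
k = 2
vh_k = vh[:k, :]

full_sim = cosine_similarity(vh[:, 0], vh[:, 4])       # about 0: vh is orthogonal
same_genre = cosine_similarity(vh_k[:, 0], vh_k[:, 4])  # close to 1
diff_genre = cosine_similarity(vh_k[:, 0], vh_k[:, 2])  # much lower

print(full_sim, same_genre, diff_genre)
```

How many rows to keep is a judgment call; inspecting how quickly the values in `s` decay is one common way to pick k.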
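Note that in the decomposition $M = U \Sigma V^T$, we know the rows of $U$ are the users and the columns of $V^T$ are the books, but we cannot identify what the columns of $U$ or the rows of $V^T$ mean. We know they could be genres, for example, providing some latent connection between users and books, but we cannot be sure what exactly they are. This does not prevent us from using them as features in a recommender system.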
References
[1] Recommender Systems and Personalization Datasets: https://gitee.com/yunduodatastudio/picture/raw/master/data.png
[2] Social Recommendation Data: https://gitee.com/yunduodatastudio/picture/raw/master/data.png
[3] Librarything: https://www.librarything.com/