[Repost] Content-Based Recommender Systems (with Python Code), Concise Version

http://www.ryanzhang.info/archives/2594
The core idea of a content-based recommender system is to recommend to user x items that are similar to the items x has rated highly.

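The approach is as follows:

Build a profile for each item (item profiles).
Build a profile for each user from the ratings that user gave to items (user profiles).
At recommendation time, recommend items according to the similarity between the user profile and the item profiles.
Using the documents from the earlier post as an example, the TF-IDF matrix can serve as the item profiles: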


from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from scipy.sparse import csr_matrix

items = ['this is the first document, this really is',
         'nothing will stop this from been the second doument, second is not a bad order',
         'I wonder if three documents would be ok as an example, example like this is stupid',
         'ok I think four documents is enough, I I I I think so.']

# Use the TF-IDF matrix directly as the item profiles.
# After fit_transform: row = item (document), column = feature (term).
vectorizer = CountVectorizer(min_df=1)
counts = vectorizer.fit_transform(items)
# After transposing: column = item (document), row = feature (term).
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(counts).transpose()

print(tfidf.shape)

(32, 4)


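The four documents in this example are the four items that can be recommended to users. Each document corresponds to one column of the matrix, each row corresponds to a term, and each entry is a TF-IDF value that measures how important that term is within the document.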

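If the items were movies instead, each column could represent a movie and each row an actor, a director, or a genre, with entries of 1 or 0 indicating whether the movie has that feature (for example, whether the actor appears in it). This binary matrix also needs to be normalized, for example by dividing every entry in a row by that row's sum.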
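The row normalization described for the movie case can be sketched as follows. The feature names and the 3 x 3 matrix are hypothetical, made up purely for illustration; only the "divide each row by its row sum" step comes from the text.

```python
import numpy as np

# Hypothetical data: rows = features, columns = three movies.
# An entry of 1 means the feature is present in the movie.
features = np.array([
    [1, 1, 0],   # hypothetical actor_a appears in movies 0 and 1
    [0, 1, 1],   # hypothetical actor_b appears in movies 1 and 2
    [1, 0, 0],   # hypothetical director_c directed movie 0
], dtype=float)

# Divide each row by its row sum.
# Rows that are entirely zero are left unchanged to avoid dividing by zero.
row_sums = features.sum(axis=1, keepdims=True)
row_sums[row_sums == 0] = 1.0
item_profiles = features / row_sums

print(item_profiles)
```

After this step every non-empty row sums to 1, so a feature shared by many movies contributes less to each of them than a feature that is unique to one movie.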

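Now suppose four users each rate one or more of the four items on a scale from 0 to 5. In the ratings matrix each column is a user and each row is a document. Multiplying the item-profile matrix above by this ratings matrix gives the user profiles. The result can be read as how much each user likes each component (term) of the documents; in the movie example it would be how much each user likes each actor, director, or genre.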


# Users' ratings for the four documents.
# column = user, row = item (document)
ratings = csr_matrix([[5, 0, 0, 2], [3, 3, 0, 0], [2, 1, 1, 1], [0, 0, 1, 1]], dtype='double')
# Center the ratings: subtract each user's mean rating (the column mean, zeros included).
usersn = ratings - ratings.mean(0)

# (features x items) . (items x users) = (features x users)
userprofile = tfidf.dot(usersn)

userprofile = csr_matrix(userprofile)


To make a recommendation we need a measure of how similar a user profile i is to an item profile x, for example the cosine similarity:
U(x, i) = cos(θ) = (x · i) / (‖x‖ ‖i‖)

import numpy as np

# A larger score means the user profile and the item profile are more similar.
for i in range(4):
    # iterate over the 4 users
    scores = []
    u = userprofile[:, i]
    for j in range(4):
        # iterate over the 4 documents
        v = tfidf[:, j]
        # cosine similarity between user profile u and item profile v
        cosine = u.transpose().dot(v)[0, 0] / np.linalg.norm(u.todense()) / np.linalg.norm(v.todense())
        scores.append(cosine)
    print("document recommended for user {0} is document number {1}".format(i + 1, scores.index(max(scores)) + 1))

document recommended for user 1 is document number 1
document recommended for user 2 is document number 2
document recommended for user 3 is document number 4
document recommended for user 4 is document number 1
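
The nested loop above computes one cosine similarity at a time. As a rough sketch, the same scores can be computed for all user and item pairs at once with dense NumPy arrays. The function name `cosine_similarity_matrix` and the toy data are hypothetical and not part of the original post; the layout (features in rows, one user or item per column) follows the post.

```python
import numpy as np

def cosine_similarity_matrix(user_profiles, item_profiles):
    """Return S where S[i, j] is the cosine similarity of user i and item j.

    Both inputs are dense arrays with features in rows and one
    user / item per column.
    """
    u = np.asarray(user_profiles, dtype=float)
    v = np.asarray(item_profiles, dtype=float)
    u_norm = np.linalg.norm(u, axis=0)
    v_norm = np.linalg.norm(v, axis=0)
    # Guard against all-zero profiles to avoid division by zero.
    u_norm[u_norm == 0] = 1.0
    v_norm[v_norm == 0] = 1.0
    return (u.T @ v) / np.outer(u_norm, v_norm)

# Hypothetical toy data: 3 features, 2 users, 2 items.
users = np.array([[1.0, 0.0],
                  [0.0, 2.0],
                  [0.0, 0.0]])
items = np.array([[2.0, 0.0],
                  [0.0, 1.0],
                  [0.0, 1.0]])

sim = cosine_similarity_matrix(users, items)
best = sim.argmax(axis=1)  # index of the most similar item for each user
print(sim)
print(best)
```

With the sparse matrices from this post, one option would be to pass `userprofile.toarray()` and `tfidf.toarray()`; that is reasonable for a matrix this small, though for large vocabularies a sparse-aware approach would be preferable.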

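In addition, the item profiles let us compute the similarity between pairs of items (for example with the same cosine measure used above). We can therefore take the items a user rated highly and recommend other items that are most similar to them.
For example, take three movies and one user's ratings of them, and compute the pairwise similarity between the movies. Each row of the matrix is a movie, the first five columns are movie features (actors, directors, and so on), and the last column is user A's rating. The rating is included in the similarity computation, scaled by a coefficient a (printed as alpha in the output):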

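import numpy as np
from pandas import DataFrame

avals = [0, 0.5, 1, 2]
for aval in avals:
    # rows = movies A, B, C; columns 0-4 = features, column 5 = the user's rating
    df = DataFrame([[1, 0, 1, 0, 1, 2], [1, 1, 0, 0, 1, 6], [0, 1, 0, 1, 0, 2]],
                   index=['A', 'B', 'C'])
    # scale the rating column by the coefficient
    df[5] = df[5] * aval
    a = df.loc['A']
    b = df.loc['B']
    c = df.loc['C']
    # cosine similarity of each pair of movies
    ab = np.dot(a, b) / np.linalg.norm(a) / np.linalg.norm(b)
    ac = np.dot(a, c) / np.linalg.norm(a) / np.linalg.norm(c)
    bc = np.dot(b, c) / np.linalg.norm(b) / np.linalg.norm(c)
    # report the angle in radians; clip guards against rounding outside [-1, 1]
    print('alpha is {0}'.format(aval))
    print('Distance between AB is {:.12g}'.format(np.arccos(np.clip(ab, -1, 1))))
    print('Distance between AC is {:.12g}'.format(np.arccos(np.clip(ac, -1, 1))))
    print('Distance between BC is {:.12g}'.format(np.arccos(np.clip(bc, -1, 1))))
    print()


alpha is 0
Distance between AB is 0.841068670568
Distance between AC is 1.57079632679
Distance between BC is 1.15026199151

alpha is 0.5
Distance between AB is 0.764558797186
Distance between AC is 1.27795355507
Distance between BC is 0.841068670568

alpha is 1
Distance between AB is 0.559880567744
Distance between AC is 0.905600271782
Distance between BC is 0.555121167557

alpha is 2
Distance between AB is 0.329838880928
Distance between AC is 0.525285295294
Distance between BC is 0.309193320943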
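
The item-to-item idea described above (find an item the user rated highly, then recommend the items most similar to it) can be sketched roughly as follows. The function name `recommend_similar`, the convention that a rating of 0 means "not rated", and the example ratings are all assumptions made for illustration; the five-column feature rows are the ones used for movies A, B, and C above.

```python
import numpy as np

def recommend_similar(item_profiles, user_ratings):
    """Recommend the unrated item most similar to the user's top-rated item.

    item_profiles: array with one row per item and one column per feature.
    user_ratings:  1-D array; 0 is assumed here to mean "not rated".
    Returns the index of the recommended item, or None if every item is rated.
    """
    profiles = np.asarray(item_profiles, dtype=float)
    ratings = np.asarray(user_ratings, dtype=float)
    favorite = int(ratings.argmax())

    # Cosine similarity between the favorite item and every item.
    norms = np.linalg.norm(profiles, axis=1)
    norms[norms == 0] = 1.0  # guard against all-zero profiles
    sims = profiles @ profiles[favorite] / (norms * norms[favorite])

    # Never recommend something the user has already rated.
    sims[ratings > 0] = -np.inf
    if np.all(np.isneginf(sims)):
        return None
    return int(sims.argmax())

# Feature rows of movies A, B, C from the example above (rating column dropped).
movie_profiles = [[1, 0, 1, 0, 1],
                  [1, 1, 0, 0, 1],
                  [0, 1, 0, 1, 0]]

# Hypothetical ratings: the user has only rated movie B.
print(recommend_similar(movie_profiles, [0, 6, 0]))
```

In this sketch movie A is chosen for a user who only rated B, because A shares two features with B while C shares one. This agrees with the alpha = 0 case above, where the AB distance is smaller than the BC distance.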