Recommendation System


Dataset

In this project, I use the Amazon Review Data (2018) dataset.

For now (2022-03-31), I have only downloaded Office_Products.csv and tried to process it with NMF, so all the ideas below are based on that file.

Office_Products.csv (ratings only)

There are 1,048,575 rows (ratings).

They cover 11,210 products and 799,315 users.

The fraction of filled entries in the item–user rating matrix is 0.00011702426536352439 (1,048,575 / (11,210 × 799,315) ≈ 1.17 × 10⁻⁴), so the matrix is extremely sparse.

The row format is |item|user|rating|timestamp|.
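For reference, here is a minimal sketch of loading the file with pandas and checking these numbers (the column names are assumptions; the ratings-only CSV has no header row):

import pandas as pd

# Ratings-only CSV: no header, columns assumed to be item, user, rating, timestamp.
df = pd.read_csv('Office_Products.csv', names=['item', 'user', 'rating', 'timestamp'])

n_ratings = len(df)
n_items = df['item'].nunique()
n_users = df['user'].nunique()
print(n_ratings, n_items, n_users)
print('fill ratio:', n_ratings / (n_items * n_users))   # ~1.17e-4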

Sparse Matrix

In order to hold such a large matrix in memory, I need to store it as a sparse matrix:
from scipy.sparse import csc_matrix

For this dataset, a CSC sparse matrix is very suitable: each row of the CSV maps directly to one entry of the matrix.
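A minimal sketch of how the CSV rows can be turned into a CSC matrix; the factorize step that maps string IDs to row/column indices is my assumption, not code from the original project:

import pandas as pd
from scipy.sparse import csc_matrix

df = pd.read_csv('Office_Products.csv', names=['item', 'user', 'rating', 'timestamp'])

# Map the string item/user IDs to contiguous integer indices.
item_idx, item_ids = pd.factorize(df['item'])
user_idx, user_ids = pd.factorize(df['user'])

# Rows are items, columns are users; only the ~1M observed ratings are stored.
csc = csc_matrix((df['rating'].to_numpy(), (item_idx, user_idx)),
                 shape=(len(item_ids), len(user_ids)))
print(csc.shape, csc.nnz)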

Learn more about sparse matrices

Use NMF

I use scikit-learn (version 1.0.2):

from sklearn.decomposition import NMF

with the following parameters:

model = NMF(n_components=2, init='random', random_state=0, verbose=True)

With n_components=2, model.reconstruction_err_ comes out to 4556.902650495943, and the whole fit took 119.2437425 seconds.

I tried to change the parameters:

n_components | beta-divergence | time consuming (secs)
2            | 4556            | 119
20           | ?               | ?
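The missing entries could be filled in with a small sweep like the following sketch (assuming csc is the sparse rating matrix built above):

import time
from sklearn.decomposition import NMF

for k in (2, 20):
    start = time.perf_counter()
    model = NMF(n_components=k, init='random', random_state=0, verbose=True)
    W = model.fit_transform(csc)
    elapsed = time.perf_counter() - start
    # reconstruction_err_ is the beta-divergence of the fit (Frobenius norm with the default loss).
    print(k, model.reconstruction_err_, round(elapsed, 1))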

About the Result

Get the transformed data with:

model = NMF(n_components=20, init='random', random_state=0, verbose=True)
W = model.fit_transform(csc)

And the factorization matrix:

H = model.components_
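One natural next step (not shown in the original post, just a sketch) is to approximate the full rating matrix from the two factors and rank unseen items for a user:

import numpy as np

# Predicted affinity of every item for user j (column j of the item-user matrix).
j = 0
scores = W @ H[:, j]

# Hide items the user has already rated, then take the top 10 as recommendations.
already_rated = csc[:, j].toarray().ravel() > 0
scores[already_rated] = -np.inf
top_items = np.argsort(scores)[::-1][:10]
print(top_items)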

To show the process more clearly, I walk through it on a small toy example.

The toy matrix looks like this:

    uA  uB  uC  uD  uE  uF  uG  uH  uI  uJ  uK  uL  uM  uN  uO
iA   5   5   3   0   5   5   4   3   2   1   4   1   3   4   5
iB   5   0   4   0   4   4   3   2   1   2   4   4   3   4   0
iC   0   3   0   5   4   5   0   4   4   5   3   0   0   0   0
iD   5   4   3   3   5   5   0   1   1   3   4   5   0   2   4
iE   5   4   3   3   5   5   3   3   3   4   5   0   5   2   4
iF   5   4   2   2   0   5   3   3   3   4   4   4   5   2   5
iG   5   4   3   3   2   0   0   0   0   0   0   0   2   1   0
iH   5   4   3   3   2   0   0   0   0   0   0   0   1   0   1
iI   5   4   3   3   1   0   0   0   0   0   0   0   0   2   2
iJ   5   4   3   3   1   0   0   0   0   0   0   0   0   1   1

The rows are items A to J, the columns are users A to O, and each value is the user's rating of that item.

Let's see what happens in W and H.

W

[[0.81240799 0.71153396 0.47062388 0.43807017 1.39456425 2.24323719
  1.02417204 1.25356481 1.10517661 1.47624595 1.84626347 0.97437242
  1.14921406 0.8159644  1.14200028]
 [2.23910382 1.70186882 1.34300272 1.09192602 0.68045441 0.
  0.0542231  0.         0.         0.         0.04426552 0.12260418
  0.34109613 0.51642843 0.6157604 ]]

H

[[2.20401687 1.53852775]
 [1.9083879  0.83214869]
 [1.95596132 0.        ]
 [1.87637018 1.65573674]
 [2.48959328 1.41632377]
 [2.38108536 1.08460665]
 [0.         2.29342959]
 [0.         2.27353353]
 [0.         2.32513876]
 [0.         2.23196277]]

Graphic (generated by H)

[Figure: the distribution of the items]

It is obvious that the items are split into two clusters.

This is because we set n_components to 2 at the beginning.
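The figure can be reproduced with a sketch like the following (matplotlib assumed); each item is plotted at its two component values taken from H above:

import numpy as np
import matplotlib.pyplot as plt

# Two-component coordinates of items A-J, copied from H above.
item_coords = np.array([
    [2.20401687, 1.53852775],
    [1.9083879,  0.83214869],
    [1.95596132, 0.0],
    [1.87637018, 1.65573674],
    [2.48959328, 1.41632377],
    [2.38108536, 1.08460665],
    [0.0, 2.29342959],
    [0.0, 2.27353353],
    [0.0, 2.32513876],
    [0.0, 2.23196277],
])

plt.scatter(item_coords[:, 0], item_coords[:, 1])
for (x, y), name in zip(item_coords, 'ABCDEFGHIJ'):
    plt.annotate('i' + name, (x, y))
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.title('distribution of the items')
plt.show()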

Making It Faster

  1. Zero-masking: not tried yet…

Use Timestamp?
