recosystem: Recommender System Using Parallel Matrix Factorization

A Quick View of Recommender System

The main task of recommender system is to predict unknown entries in the rating matrix based on observed values, as is shown in the table below:

Rating matrix

Each cell with number in it is the rating given by some user on a specific item, while those marked with question marks are unknown ratings that need to be predicted. In some other literatures, this problem may be named collaborative filtering, matrix completion, matrix recovery, etc.

A popular technique to solve the recommender system problem is the matrix factorization method. The idea is to approximate the whole rating matrix  Rm×n Rm×n by the product of two matrices of lower dimensions,  Pk×m Pk×m and  Qk×n Qk×n, such that

RPQ R≈P′Q

Let  pu pu be the  u u-th column of  P P, and  qv qv be the  v v-th column of  Q Q, then the rating given by user  u uon item  v v would be predicted as  puqv pu′qv.

A typical solution for  P P and  Q Q is given by the following optimization problem [12]:

minP,Q(u,v)R[f(pu,qv;ru,v)+μP||pu||1+μQ||qv||1+λP2||pu||22+λQ2||qv||22] minP,Q∑(u,v)∈R[f(pu,qv;ru,v)+μP||pu||1+μQ||qv||1+λP2||pu||22+λQ2||qv||22]

where  (u,v) (u,v) are locations of observed entries in  R R ru,v ru,v is the observed rating,  f f is the loss function, and  μP,μQ,λP,λQ μP,μQ,λP,λQ are penalty parameters to avoid overfitting.

The process of solving the matrices  P P and  Q Q is called model training, and the selection of penalty parameters is parameter tuning. After obtaining  P P and  Q Q, we can then do the prediction of  R̂ u,v=puqv R^u,v=pu′qv.

LIBMF and recosystem

LIBMF is an open source C++ library for recommender system using parallel matrix factorization, developed by Dr. Chih-Jen Lin and his research group. [3]

LIBMF is a parallelized library, meaning that users can take advantage of multi-core CPUs to speed up the computation. It also utilizes some advanced CPU features to further improve the performance.

recosystem (Github) is an R wrapper of the LIBMF library that inherits most of its features. Additionally, this package provides a number of user-friendly R functions to simplify data processing and model building. Also, unlike most other R packages for statistical modeling that store the whole dataset and model object in memory, LIBMF (and hence recosystem) can significantly reduce memory use, for instance the constructed model that contains information for prediction can be stored in the hard disk, and output result can also be directly written into a file rather than be kept in memory.

Overview of recosystem

The usage of recosystem is quite simple, mainly consisting of the following steps:

  1. Create a model object (a Reference Class object in R) by calling Reco().
  2. Specify the data source, either from a data file or from R objects in memory.
  3. Train the model by calling the $train() method. A number of parameters can be set inside the function.
  4. (Optionally) Call the $tune() method to select best tuning parameters along a set of candidate values, in order to achieve better model performance.
  5. (Optionally) Export the model via $output(), i.e. write the factorization matrices  P P and  Q Qinto files or return them as R objects.
  6. Use the $predict() method to compute predicted values.

More details are covered in the package vignette and the help pages ?recosystem::Reco?recosystem::data_source?recosystem::train?recosystem::tune?recosystem::output, and ?recosystem::predict.

In the next section we will demonstrate how to use recosystem to analyze a real movie recommendation data set.

MovieLens Data

The MovieLens website collected many movie rating data for research use. [4] In this article we download the MovieLens 1M Dataset from grouplens, which contains 1 million ratings from 6000 users and 4000 movies.

The rating data file, ratings.dat, looks like below:

1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
1::2355::5::978824291
...

Each line has the format UserID::MovieID::Rating::Timestamp, for example the first line says that User #1 gave Movie #1193 a rating of 5 at certain time point.

In recosystem, we will not use the time information, and the required data format is UserID MovieID Rating, i.e., the columns are space-separated, and columns after Rating will be ignored. Therefore, we first transform the data file into a format that is supported byrecosystem. On Unix-like OS’s, we can use the sed command to replace :: by a space:

sed -e 's/::/ /g' ratings.dat > ratings2.dat

Then we can start to train a recommender, as the following code shows:

library(recosystem)                                             # 1
r = Reco()                                                      # 2
train_set = data_file("ratings2.dat",  index1 = TRUE)           # 3
r$train(train_set, opts = list(dim = 20,                        # 4
                               costp_l1 = 0, costp_l2 = 0.01,   # 5
                               costq_l1 = 0, costq_l2 = 0.01,   # 6
                               niter = 10,                      # 7
                               nthread = 4))                    # 8

In the code above, line 2 creates a model object such that the training function $train() can be called from it. Line 3 specifies the data source – a data file on hard disk. Since in our data user ID and movie ID start from 1 rather than 0, we use the index1 = TRUE options in the function.

The data can also be read from memory, if the UserID, MovieID and Rating columns are stored as R vectors. Below shows an alternative way to provide the training set.

dat = read.table("ratings2.dat", sep = " ", header = FALSE,
                 colClasses = c(rep("integer", 3), "NULL"))
train_set = data_memory(user_index = dat[, 1],
                        item_index = dat[, 2],
                        rating = dat[, 3], index1 = TRUE)

Line 4 to line 6 set the relevant model parameters:  k,μP,μQ,λP k,μP,μQ,λP, and  λQ λQ, and Line 7 gives the number of iterations. Finally as I have mentioned previously, LIBMF is a parallelized library, so users can specify the number of threads that will be working simultaneously via the nthreadparameter. However, when nthread > 1, the training result is NOT guaranteed to be reproducible, even if a random seed is set.

Now everything looks good, except one inadequacy: the setting of tuning parameters is ad-hoc, which may make the model sub-optimal. To tune these parameters, we can call the $tune()function to test a set of candidate values and use cross validation to evaluate their performance. Below shows this process:

opts_tune = r$tune(train_set,                                   # 9
                   opts = list(dim      = c(10, 20, 30),        # 10
                               costp_l2 = c(0.01, 0.1),         # 11
                               costq_l2 = c(0.01, 0.1),         # 12
                               costp_l1 = 0,                    # 13
                               costq_l1 = 0,                    # 14
                               lrate    = c(0.01, 0.1),         # 15
                               nthread  = 4,                    # 16
                               niter    = 10,                   # 17
                               verbose  = TRUE))                # 18
r$train(train_set, opts = c(opts_tune$min,                      # 19
                            niter = 100, nthread = 4))          # 20

The options in line 9 to line 15 are tuning parameters. The tuning function will evaluate each combination of them and calculate the associated cross-validated RMSE. The parameter set with the smallest RMSE will be contained in the returned value, which can then be passed to$train() (Line 19-20).

Finally, we can use the model object to do predictions. The code below shows how to predict ratings given by the first 20 users on the first 20 movies.

user = 1:20
movie = 1:20
pred = expand.grid(user = user, movie = movie)
test_set = data_memory(pred$user, pred$movie, index1 = TRUE)
pred$rating = r$predict(test_set, out_memory())

library(ggplot2)
ggplot(pred, aes(x = movie, y = user, fill = rating)) +
    geom_raster() +
    scale_fill_gradient("Rating", low = "#d6e685", high = "#1e6823") +
    xlab("Movie ID") + ylab("User ID") +
    coord_fixed() +
    theme_bw(base_size = 22)
Predicted ratings

Performance

To make the best use of recosystem, the parallel computing option nthread should be used in the training and tuning step. Also, LIBMF and recosystem can make use of some advanced CPU features to speed-up computation, if you compile the package from source and turn on some compiler options.

To build recosystem, one needs a C++ compiler that supports the C++11 standard. Then you can edit src/Makevars (src/Makevars.win for Windows system) according to the following guideline:

  • The default Makevars provides generic options that should apply to most CPUs.
  • If your CPU supports SSE3 (a list of supported CPUs), add
PKG_CPPFLAGS += -DUSESSE
PKG_CXXFLAGS += -msse3
PKG_CPPFLAGS += -DUSEAVX
PKG_CXXFLAGS += -mavx

After editing the Makevars file, run R CMD INSTALL recosystem to install recosystem.

The plot below shows the effect of parallel computing and the compiler option on the performance of computation. The y axis is the elapsed time of the model tuning procedure in the previous example.

Performance

References

[1] Chin, Wei-Sheng, Yong Zhuang, Yu-Chin Juan, and Chih-Jen Lin. 2015a. A Fast Parallel Stochastic Gradient Method for Matrix Factorization in Shared Memory Systems. ACM TIST.

[2] Chin, Wei-Sheng, Yong Zhuang, Yu-Chin Juan, and Chih-Jen Lin. 2015b. A Learning-Rate Schedule for Stochastic Gradient Methods to Matrix Factorization. ACM TIST.

[3] Lin, Chih-Jen, Yu-Chin Juan, Yong Zhuang, and Wei-Sheng Chin. 2015. LIBMF: A Matrix-Factorization Library for Recommender Systems.

[4] F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages.

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
毕业设计,基于SpringBoot+Vue+MySQL开发的公寓报修管理系统,源码+数据库+毕业论文+视频演示 现代经济快节奏发展以及不断完善升级的信息化技术,让传统数据信息的管理升级为软件存储,归纳,集中处理数据信息的管理方式。本公寓报修管理系统就是在这样的大环境下诞生,其可以帮助管理者在短时间内处理完毕庞大的数据信息,使用这种软件工具可以帮助管理人员提高事务处理效率,达到事半功倍的效果。此公寓报修管理系统利用当下成熟完善的Spring Boot框架,使用跨平台的可开发大型商业网站的Java语言,以及最受欢迎的RDBMS应用软件之一的MySQL数据库进行程序开发。公寓报修管理系统有管理员,住户,维修人员。管理员可以管理住户信息和维修人员信息,可以审核维修人员的请假信息,住户可以申请维修,可以对维修结果评价,维修人员负责住户提交的维修信息,也可以请假。公寓报修管理系统的开发根据操作人员需要设计的界面简洁美观,在功能模块布局上跟同类型网站保持一致,程序在实现基本要求功能时,也为数据信息面临的安全问题提供了一些实用的解决方案。可以说该程序在帮助管理者高效率地处理工作事务的同时,也实现了数据信息的整体化,规范化与自动化。 关键词:公寓报修管理系统;Spring Boot框架;MySQL;自动化;VUE
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值