Mahout version: 0.7; Hadoop version: 1.0.4; JDK: 1.7.0_25, 64-bit.
Learning is always a process of pain mixed with joy...
Today I will give a brief introduction to Collaborative Filtering with ALS-WR in Mahout. If you ask me what this algorithm is, the most I can tell you at this point is that it is a recommendation algorithm; beyond that I do not know yet. The main reference here is the official introduction, Collaborative Filtering with ALS-WR.
This post is a hands-on walkthrough: first get the algorithm running and observe what it produces, and leave the analysis of the implementation for later. The official documentation says this example is driven by examples/bin/factorize-movielens-1M.sh, so let's open that file and take a look:
# Instructions:
#
# Before using this script, you have to download and extract the Movielens 1M dataset
# from http://www.grouplens.org/node/73
#
# To run: change into the mahout directory and type:
# examples/bin/factorize-movielens-1M.sh /path/to/ratings.dat
if [ "$1" = "--help" ] || [ "$1" = "--?" ]; then
echo "This script runs the Alternating Least Squares Recommender on the Grouplens data set (size 1M)."
echo "Syntax: $0 /path/to/ratings.dat\n"
exit
fi
if [ $# -ne 1 ]
then
echo -e "\nYou have to download the Movielens 1M dataset from http://www.grouplens.org/node/73 before"
echo -e "you can run this example. After that extract it and supply the path to the ratings.dat file.\n"
echo -e "Syntax: $0 /path/to/ratings.dat\n"
exit -1
fi
MAHOUT="../../bin/mahout"
WORK_DIR=/tmp/mahout-work-${USER}
echo "creating work directory at ${WORK_DIR}"
mkdir -p ${WORK_DIR}/movielens
echo "Converting ratings..."
cat $1 |sed -e s/::/,/g| cut -d, -f1,2,3 > ${WORK_DIR}/movielens/ratings.csv
# create a 90% percent training set and a 10% probe set
$MAHOUT splitDataset --input ${WORK_DIR}/movielens/ratings.csv --output ${WORK_DIR}/dataset \
--trainingPercentage 0.9 --probePercentage 0.1 --tempDir ${WORK_DIR}/dataset/tmp
# run distributed ALS-WR to factorize the rating matrix defined by the training set
$MAHOUT parallelALS --input ${WORK_DIR}/dataset/trainingSet/ --output ${WORK_DIR}/als/out \
--tempDir ${WORK_DIR}/als/tmp --numFeatures 20 --numIterations 10 --lambda 0.065
# compute predictions against the probe set, measure the error
$MAHOUT evaluateFactorization --input ${WORK_DIR}/dataset/probeSet/ --output ${WORK_DIR}/als/rmse/ \
--userFeatures ${WORK_DIR}/als/out/U/ --itemFeatures ${WORK_DIR}/als/out/M/ --tempDir ${WORK_DIR}/als/tmp
# compute recommendations
$MAHOUT recommendfactorized --input ${WORK_DIR}/als/out/userRatings/ --output ${WORK_DIR}/recommendations/ \
--userFeatures ${WORK_DIR}/als/out/U/ --itemFeatures ${WORK_DIR}/als/out/M/ \
--numRecommendations 6 --maxRating 5
# print the error
echo -e "\nRMSE is:\n"
cat ${WORK_DIR}/als/rmse/rmse.txt
echo -e "\n"
echo -e "\nSample recommendations:\n"
shuf ${WORK_DIR}/recommendations/part-m-00000 |head
echo -e "\n\n"
echo "removing work directory"
rm -rf ${WORK_DIR}
From the script we can see five steps: (1) convert the raw data into the format we need; (2) split the dataset; (3) run parallel ALS; (4) evaluate the factorization; (5) generate recommendations. Let's walk through them one by one:
(1) Convert the data. Download the raw MovieLens Data Sets (the 1M dataset is used here), extract the archive, and open ratings.dat; you will see data like this:
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
1::2355::5::978824291
1::1197::3::978302268
1::1287::5::978302039
1::2804::5::978300719
1::594::4::978302268
1::919::4::978301368
Then use the Linux command cat ratings.dat | sed -e s/::/,/g | cut -d, -f1,2,3 > ratings.csv to convert the data into the following form:
1,1193,5
1,661,3
1,914,3
1,3408,4
1,2355,5
1,1197,3
1,1287,5
1,2804,5
1,594,4
1,919,4
A quick note on the format: each line of ratings.dat has the structure UserID::MovieID::Rating::Timestamp, and after conversion each line has the structure UserID,MovieID,Rating.
Then upload the generated ratings.csv to HDFS, ready for the next step.
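Conceptually, the sed/cut pipeline above does the following to each line; here is a minimal Python equivalent, for illustration only:

```python
def convert_line(line):
    """Turn 'UserID::MovieID::Rating::Timestamp' into 'UserID,MovieID,Rating'."""
    user, movie, rating, _timestamp = line.strip().split("::")
    return ",".join((user, movie, rating))

sample = "1::1193::5::978300760"
print(convert_line(sample))  # 1,1193,5
```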
(2) Split the dataset into a training set and a probe set: go to the Mahout root directory and use the splitDataset command. Its options are listed below:
usage: <command> [Generic Options] [Job-Specific Options]
Generic Options:
-archives <paths> comma separated archives to be unarchived
on the compute machines.
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-files <paths> comma separated files to be copied to the
map reduce cluster
-fs <local|namenode:port> specify a namenode
-jt <local|jobtracker:port> specify a job tracker
-libjars <paths> comma separated jar files to include in
the classpath.
-tokenCacheFile <tokensFile> name of the file with the tokens
Job-Specific Options:
--input (-i) input Path to job input directory.
--output (-o) output The directory pathname for
output.
--trainingPercentage (-t) trainingPercentage percentage of the data to use
as training set (default:
0.9)
--probePercentage (-p) probePercentage percentage of the data to use
as probe set (default: 0.1)
--help (-h) Print out help
--tempDir tempDir Intermediate output directory
--startPhase startPhase First phase to run
--endPhase endPhase Last phase to run
The command is: ./mahout splitDataset -i input/ratings.csv -o output/als -t 0.9 -p 0.1 --tempDir temp. After it finishes, you can see that it ran three jobs and produced three outputs: (a) the conversion of the raw data, with 1000209 map input records and 1000209 output records; (b) the training set, with 1000209 input records and 900362 output records; (c) the probe set, with 1000209 input records and 99847 output records.
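Conceptually, splitDataset assigns each rating to one of the two sets at random according to the given percentages. A rough Python sketch of the idea (not Mahout's actual implementation):

```python
import random

def split_dataset(lines, training_pct=0.9, seed=42):
    """Randomly assign each rating line to a training or probe set.
    A conceptual sketch of what splitDataset does, not Mahout's code."""
    rng = random.Random(seed)
    training, probe = [], []
    for line in lines:
        (training if rng.random() < training_pct else probe).append(line)
    return training, probe

# 1000 toy rating lines, split roughly 90/10
ratings = [f"{u},{m},5" for u in range(100) for m in range(10)]
train, probe = split_dataset(ratings)
print(len(train), len(probe))
```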
(3) Parallel ALS: the command is ./mahout parallelALS. First look at its usage and options:
usage: <command> [Generic Options] [Job-Specific Options]
Generic Options:
-archives <paths> comma separated archives to be unarchived
on the compute machines.
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-files <paths> comma separated files to be copied to the
map reduce cluster
-fs <local|namenode:port> specify a namenode
-jt <local|jobtracker:port> specify a job tracker
-libjars <paths> comma separated jar files to include in
the classpath.
-tokenCacheFile <tokensFile> name of the file with the tokens
Job-Specific Options:
--input (-i) input Path to job input directory.
--output (-o) output The directory pathname for output.
--lambda lambda regularization parameter
--implicitFeedback implicitFeedback data consists of implicit feedback?
--alpha alpha confidence parameter (only used on
implicit feedback)
--numFeatures numFeatures dimension of the feature space
--numIterations numIterations number of iterations
--help (-h) Print out help
--tempDir tempDir Intermediate output directory
--startPhase startPhase First phase to run
--endPhase endPhase Last phase to run
Then run: ./mahout parallelALS -i output/als/trainingSet -o output/als/als --tempDir temp/als --numFeatures 20 --numIterations 10 --lambda 0.065
From the parameters you would expect ten iterations, but after running the command you will find that Mahout launched more than 10 jobs. It first ran 3 jobs, then printed the following messages (one per job):
13/10/03 21:27:24 INFO als.ParallelALSFactorizationJob: Recomputing U (iteration 0/10)
13/10/03 21:27:50 INFO als.ParallelALSFactorizationJob: Recomputing M (iteration 0/10)
13/10/03 21:28:20 INFO als.ParallelALSFactorizationJob: Recomputing U (iteration 1/10)
...
13/10/03 21:35:51 INFO als.ParallelALSFactorizationJob: Recomputing U (iteration 9/10)
13/10/03 21:36:17 INFO als.ParallelALSFactorizationJob: Recomputing M (iteration 9/10)
The output directory contains three folders: M, U, and userRatings; the temp directory contains U0~U8, M0~M8, M--1, averageRatings, and itemRatings.
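The alternation between "Recomputing U" and "Recomputing M" in the log is the heart of ALS-WR: with the item-feature matrix M fixed, each user's feature vector is the solution of a small regularized least-squares problem, and vice versa; the "WR" part weights the regularization term lambda by the number of ratings of each user/item. A small numpy sketch of one such alternation (an illustration under my own simplifications, using a dense ratings matrix, not Mahout's distributed code):

```python
import numpy as np

def als_wr_step(R, U, M, lam):
    """One alternation: fix M and re-solve every user row of U,
    then fix U and re-solve every item row of M.
    R is a dense ratings matrix where 0 means 'unrated'."""
    k = U.shape[1]
    for u in range(R.shape[0]):
        rated = R[u] > 0
        n_u = rated.sum()
        if n_u == 0:
            continue
        Mi = M[rated]                               # features of items this user rated
        A = Mi.T @ Mi + lam * n_u * np.eye(k)       # lambda weighted by n_u (ALS-WR)
        U[u] = np.linalg.solve(A, Mi.T @ R[u, rated])
    for m in range(R.shape[1]):
        rated = R[:, m] > 0
        n_m = rated.sum()
        if n_m == 0:
            continue
        Ui = U[rated]
        A = Ui.T @ Ui + lam * n_m * np.eye(k)
        M[m] = np.linalg.solve(A, Ui.T @ R[rated, m])

# tiny demo: factorize a 5x4 ratings matrix with numFeatures=2, 10 iterations
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4.]])
rng = np.random.default_rng(0)
U, M = rng.random((5, 2)), rng.random((4, 2))
for _ in range(10):
    als_wr_step(R, U, M, lam=0.065)
mask = R > 0
rmse = np.sqrt((((U @ M.T) - R)[mask] ** 2).mean())
print("training RMSE:", rmse)
```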
(4) Evaluate the factorization: the Mahout command is evaluateFactorization. First look at its usage and options:
usage: <command> [Generic Options] [Job-Specific Options]
Generic Options:
-archives <paths> comma separated archives to be unarchived
on the compute machines.
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-files <paths> comma separated files to be copied to the
map reduce cluster
-fs <local|namenode:port> specify a namenode
-jt <local|jobtracker:port> specify a job tracker
-libjars <paths> comma separated jar files to include in
the classpath.
-tokenCacheFile <tokensFile> name of the file with the tokens
Job-Specific Options:
--input (-i) input Path to job input directory.
--userFeatures userFeatures path to the user feature matrix
--itemFeatures itemFeatures path to the item feature matrix
--output (-o) output The directory pathname for output.
--help (-h) Print out help
--tempDir tempDir Intermediate output directory
--startPhase startPhase First phase to run
--endPhase endPhase Last phase to run
Run: ./mahout evaluateFactorization -i output/als/probeSet -o output/rmse --userFeatures output/als/als/U --itemFeatures output/als/als/M --tempDir temp/rmse. When the command finishes, the file output/rmse/rmse.txt on HDFS shows a root mean squared error of 0.8548619405669956 (that seems like a rather small RMSE, doesn't it?).
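The number in rmse.txt is just the root mean squared difference between the predicted ratings and the actual ratings over the probe set; a minimal sketch:

```python
import math

def rmse(predictions, actuals):
    """Root mean squared error: sqrt of the mean squared prediction error.
    This is the quantity evaluateFactorization reports (conceptual sketch)."""
    se = sum((p - a) ** 2 for p, a in zip(predictions, actuals))
    return math.sqrt(se / len(actuals))

print(rmse([4.2, 3.1, 5.0], [4, 3, 5]))
```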
(5) Recommend: the command is recommendfactorized; its usage and options are:
usage: <command> [Generic Options] [Job-Specific Options]
Generic Options:
-archives <paths> comma separated archives to be unarchived
on the compute machines.
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-files <paths> comma separated files to be copied to the
map reduce cluster
-fs <local|namenode:port> specify a namenode
-jt <local|jobtracker:port> specify a job tracker
-libjars <paths> comma separated jar files to include in
the classpath.
-tokenCacheFile <tokensFile> name of the file with the tokens
Job-Specific Options:
--input (-i) input Path to job input directory.
--userFeatures userFeatures path to the user feature matrix
--itemFeatures itemFeatures path to the item feature matrix
--numRecommendations numRecommendations number of recommendations per user
--maxRating maxRating maximum rating available
--output (-o) output The directory pathname for output.
--help (-h) Print out help
--tempDir tempDir Intermediate output directory
--startPhase startPhase First phase to run
--endPhase endPhase Last phase to run
Run: ./mahout recommendfactorized -i output/als/als/userRatings -o output/recommendations --userFeatures output/als/als/U --itemFeatures output/als/als/M --numRecommendations 6 --maxRating 5. When it finishes, the terminal shows 6040 map output records, which exactly matches the number of users in the dataset, and you can inspect the recommendation output on HDFS:
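Per user, this step essentially scores every item by the dot product of the user's and the item's feature vectors, caps scores at --maxRating, skips items the user already rated, and keeps the top --numRecommendations. A hypothetical sketch of that scoring (names and toy factors are mine, not Mahout's):

```python
import numpy as np

def recommend(user_vec, M, already_rated, n=6, max_rating=5):
    """Score items as user-feature dot item-feature, clip to max_rating,
    drop already-rated items, return the top n as (item, score) pairs."""
    scores = np.minimum(M @ user_vec, max_rating)
    candidates = [(i, s) for i, s in enumerate(scores) if i not in already_rated]
    candidates.sort(key=lambda t: t[1], reverse=True)
    return candidates[:n]

# toy item-feature matrix (4 items, 2 features) and one user vector
M = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.9, 0.1]])
user = np.array([2.0, 1.0])
print(recommend(user, M, already_rated={0}, n=2))
```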
Share, grow, be happy.
Please credit the original blog when reposting: http://blog.csdn.net/fansy1990