Mahout version: 0.7; Hadoop version: 1.0.4; JDK: 1.7.0_25, 64-bit.
Learning is always a process of pain mixed with joy...
Today I will give a brief introduction to Collaborative Filtering with ALS-WR in Mahout. If you ask me what this algorithm is, the most I can tell you at this point is that it is a recommendation algorithm; beyond that I do not know yet. The main reference here is the official introduction, Collaborative Filtering with ALS-WR.
This post is a hands-on walkthrough: first get the algorithm running and observe what it produces, and leave the analysis of the implementation for later. The official documentation says this example is driven by examples/bin/factorize-movielens-1M.sh, so let's open that file and take a look:
# Instructions:
#
# Before using this script, you have to download and extract the Movielens 1M dataset
# from http://www.grouplens.org/node/73
#
# To run: change into the mahout directory and type:
# examples/bin/factorize-movielens-1M.sh /path/to/ratings.dat
if [ "$1" = "--help" ] || [ "$1" = "--?" ]; then
echo "This script runs the Alternating Least Squares Recommender on the Grouplens data set (size 1M)."
echo "Syntax: $0 /path/to/ratings.dat\n"
exit
fi
if [ $# -ne 1 ]
then
echo -e "\nYou have to download the Movielens 1M dataset from http://www.grouplens.org/node/73 before"
echo -e "you can run this example. After that extract it and supply the path to the ratings.dat file.\n"
echo -e "Syntax: $0 /path/to/ratings.dat\n"
exit -1
fi
MAHOUT="../../bin/mahout"
WORK_DIR=/tmp/mahout-work-${USER}
echo "creating work directory at ${WORK_DIR}"
mkdir -p ${WORK_DIR}/movielens
echo "Converting ratings..."
cat $1 |sed -e s/::/,/g| cut -d, -f1,2,3 > ${WORK_DIR}/movielens/ratings.csv
# create a 90% percent training set and a 10% probe set
$MAHOUT splitDataset --input ${WORK_DIR}/movielens/ratings.csv --output ${WORK_DIR}/dataset \
--trainingPercentage 0.9 --probePercentage 0.1 --tempDir ${WORK_DIR}/dataset/tmp
# run distributed ALS-WR to factorize the rating matrix defined by the training set
$MAHOUT parallelALS --input ${WORK_DIR}/dataset/trainingSet/ --output ${WORK_DIR}/als/out \
--tempDir ${WORK_DIR}/als/tmp --numFeatures 20 --numIterations 10 --lambda 0.065
# compute predictions against the probe set, measure the error
$MAHOUT evaluateFactorization --input ${WORK_DIR}/dataset/probeSet/ --output ${WORK_DIR}/als/rmse/ \
--userFeatures ${WORK_DIR}/als/out/U/ --itemFeatures ${WORK_DIR}/als/out/M/ --tempDir ${WORK_DIR}/als/tmp
# compute recommendations
$MAHOUT recommendfactorized --input ${WORK_DIR}/als/out/userRatings/ --output ${WORK_DIR}/recommendations/ \
--userFeatures ${WORK_DIR}/als/out/U/ --itemFeatures ${WORK_DIR}/als/out/M/ \
--numRecommendations 6 --maxRating 5
# print the error
echo -e "\nRMSE is:\n"
cat ${WORK_DIR}/als/rmse/rmse.txt
echo -e "\n"
echo -e "\nSample recommendations:\n"
shuf ${WORK_DIR}/recommendations/part-m-00000 |head
echo -e "\n\n"
echo "removing work directory"
rm -rf ${WORK_DIR}
From the script we can see five steps: (1) convert the raw data into the format we need; (2) split the dataset; (3) run parallel ALS; (4) evaluate the factorization; (5) generate recommendations. Let's walk through them one by one:
(1) Convert the data. Download the raw MovieLens Data Sets (the 1M dataset is used here), extract the archive, and open ratings.dat; you will see data like this:
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
1::2355::5::978824291
1::1197::3::978302268
1::1287::5::978302039
1::2804::5::978300719
1::594::4::978302268
1::919::4::978301368
Then use the Linux command cat ratings.dat | sed -e s/::/,/g | cut -d, -f1,2,3 > ratings.csv to convert the data into the following form:
1,1193,5
1,661,3
1,914,3
1,3408,4
1,2355,5
1,1197,3
1,1287,5
1,2804,5
1,594,4
1,919,4
A quick note on the format: each line of ratings.dat has the structure UserID::MovieID::Rating::Timestamp, and after conversion each line has the structure UserID,MovieID,Rating.
Then upload the generated ratings.csv to HDFS, ready for the next step.
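Conceptually, the sed/cut pipeline above does the following to each line; here is a minimal Python equivalent, for illustration only:

```python
def convert_line(line):
    """Turn 'UserID::MovieID::Rating::Timestamp' into 'UserID,MovieID,Rating'."""
    user, movie, rating, _timestamp = line.strip().split("::")
    return ",".join((user, movie, rating))

sample = "1::1193::5::978300760"
print(convert_line(sample))  # 1,1193,5
```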
(2) Split the dataset into a training set and a probe set: go to the Mahout root directory and use the splitDataset command. Its options are listed below:
usage: <command> [Generic Options] [Job-Specific Options]
Generic Options:
-archives <paths> comma separated archives to be unarchived
on the compute machines.
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-files <paths> comma separated files to be copied to the
map reduce cluster
-fs <local|namenode:port> specify a namenode
-jt <local|jobtracker:port> specify a job tracker
-libjars <paths> comma separated jar files to include in
the classpath.
-tokenCacheFile <tokensFile> name of the file with the tokens
Job-Specific Options:
--input (-i) input Path to job input directory.
--output (-o) output The directory pathname for
output.
--trainingPercentage (-t) trainingPercentage percentage of the data to use
as training set (default:
0.9)
--probePercentage (-p) probePercentage percentage of the data to use
as probe set (default: 0.1)
--help (-h) Print out help
--tempDir tempDir Intermediate output directory
--startPhase startPhase First phase to run
--endPhase endPhase Last phase to run
The command is: ./mahout splitDataset -i input/ratings.csv -o output/als -t 0.9 -p 0.1 --tempDir temp. After it finishes, you can see that it ran three jobs and produced three outputs: (a) the conversion of the raw data, with 1000209 map input records and 1000209 output records; (b) the training set, with 1000209 input records and 900362 output records; (c) the probe set, with 1000209 input records and 99847 output records.
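Conceptually, splitDataset assigns each rating to one of the two sets at random according to the given percentages. A rough Python sketch of the idea (not Mahout's actual implementation):

```python
import random

def split_dataset(lines, training_pct=0.9, seed=42):
    """Randomly assign each rating line to a training or probe set.
    A conceptual sketch of what splitDataset does, not Mahout's code."""
    rng = random.Random(seed)
    training, probe = [], []
    for line in lines:
        (training if rng.random() < training_pct else probe).append(line)
    return training, probe

# 1000 toy rating lines, split roughly 90/10
ratings = [f"{u},{m},5" for u in range(100) for m in range(10)]
train, probe = split_dataset(ratings)
print(len(train), len(probe))
```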
(3) Parallel ALS: the command is ./mahout parallelALS. First look at its usage and options:
usage: <command> [Generic Options] [Job-Specific Options]
Generic Options:
-archives <paths> comma separated archives to be unarchived
on the compute machines.
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-files <paths> comma separated files to be copied to the
map reduce cluster
-fs <local|namenode:port> specify a namenode
-jt <local|jobtracker:port> specify a job tracker
-libjars <paths> comma separated jar files to include in
the classpath.
-tokenCacheFile <tokensFile> name of the file with the tokens
Job-Specific Options:
--input (-i) input Path to job input directory.
--output (-o) output The directory pathname for output.
--lambda lambda regularization parameter
--implicitFeedback implicitFeedback data consists of implicit feedback?
--alpha alpha confidence parameter (only used on
implicit feedback)
--numFeatures numFeatures dimension of the feature space
--numIterations numIterations number of iterations
--help (-h) Print out help
--tempDir tempDir Intermediate output directory
--startPhase startPhase First phase to run
--endPhase endPhase Last phase to run
Then run: ./mahout parallelALS -i output/als/trainingSet -o output/als/als --tempDir temp/als --numFeatures 20 --numIterations 10 --lambda 0.065
From the parameters you would expect ten iterations, but after running the command you will find that Mahout launched more than 10 jobs. It first ran 3 jobs, then printed the following messages (one per job):
13/10/03 21:27:24 INFO als.ParallelALSFactorizationJob: Recomputing U (iteration 0/10)
13/10/03 21:27:50 INFO als.ParallelALSFactorizationJob: Recomputing M (iteration 0/10)
13/10/03 21:28:20 INFO als.ParallelALSFactorizationJob: Recomputing U (iteration 1/10)
...
13/10/03 21:35:51 INFO als.ParallelALSFactorizationJob: Recomputing U (iteration 9/10)
13/10/03 21:36:17 INFO als.ParallelALSFactorizationJob: Recomputing M (iteration 9/10)
The output directory contains three folders: M, U, and userRatings; the temp directory contains U0~U8, M0~M8, M--1, averageRatings, and itemRatings.
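The alternation between "Recomputing U" and "Recomputing M" in the log is the heart of ALS-WR: with the item-feature matrix M fixed, each user's feature vector is the solution of a small regularized least-squares problem, and vice versa; the "WR" part weights the regularization term lambda by the number of ratings of each user/item. A small numpy sketch of one such alternation (an illustration under my own simplifications, using a dense ratings matrix, not Mahout's distributed code):

```python
import numpy as np

def als_wr_step(R, U, M, lam):
    """One alternation: fix M and re-solve every user row of U,
    then fix U and re-solve every item row of M.
    R is a dense ratings matrix where 0 means 'unrated'."""
    k = U.shape[1]
    for u in range(R.shape[0]):
        rated = R[u] > 0
        n_u = rated.sum()
        if n_u == 0:
            continue
        Mi = M[rated]                               # features of items this user rated
        A = Mi.T @ Mi + lam * n_u * np.eye(k)       # lambda weighted by n_u (ALS-WR)
        U[u] = np.linalg.solve(A, Mi.T @ R[u, rated])
    for m in range(R.shape[1]):
        rated = R[:, m] > 0
        n_m = rated.sum()
        if n_m == 0:
            continue
        Ui = U[rated]
        A = Ui.T @ Ui + lam * n_m * np.eye(k)
        M[m] = np.linalg.solve(A, Ui.T @ R[rated, m])

# tiny demo: factorize a 5x4 ratings matrix with numFeatures=2, 10 iterations
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4.]])
rng = np.random.default_rng(0)
U, M = rng.random((5, 2)), rng.random((4, 2))
for _ in range(10):
    als_wr_step(R, U, M, lam=0.065)
mask = R > 0
rmse = np.sqrt((((U @ M.T) - R)[mask] ** 2).mean())
print("training RMSE:", rmse)
```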
(4) Evaluate the factorization: the Mahout command is evaluateFactorization. First look at its usage and options:
usage: <command> [Generic Options] [Job-Specific Options]
Generic Options:
-archives <paths> comma separated archives to be unarchived
on the compute machines.
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-files <paths> comma separated files to be copied to the
map reduce cluster
-fs <local|namenode:port> specify a namenode
-jt <local|jobtracker:port> specify a job tracker
-libjars <paths> comma separated jar files to include in
the classpath.
-tokenCacheFile <tokensFile> name of the file with the tokens
Job-Specific Options:
--input (-i) input Path to job input directory.
--userFeatures userFeatures path to the user feature matrix
--itemFeatures itemFeatures path to the item feature matrix
--output (-o) output The directory pathname for output.
--help (-h) Print out help
--tempDir tempDir Intermediate output directory
--startPhase startPhase First phase to run
--endPhase endPhase Last phase to run
Run: ./mahout evaluateFactorization -i output/als/probeSet -o output/rmse --userFeatures output/als/als/U --itemFeatures output/als/als/M --tempDir temp/rmse. When the command finishes, the file output/rmse/rmse.txt on HDFS shows a root mean squared error of 0.8548619405669956 (that seems like a rather small RMSE, doesn't it?).
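The number in rmse.txt is just the root mean squared difference between the predicted ratings and the actual ratings over the probe set; a minimal sketch:

```python
import math

def rmse(predictions, actuals):
    """Root mean squared error: sqrt of the mean squared prediction error.
    This is the quantity evaluateFactorization reports (conceptual sketch)."""
    se = sum((p - a) ** 2 for p, a in zip(predictions, actuals))
    return math.sqrt(se / len(actuals))

print(rmse([4.2, 3.1, 5.0], [4, 3, 5]))
```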
(5) Recommend: the command is recommendfactorized; its usage and options are:
usage: <command> [Generic Options] [Job-Specific Options]
Generic Options:
-archives <paths> comma separated archives to be unarchived
on the compute machines.
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-files <paths> comma separated files to be copied to the
map reduce cluster
-fs <local|namenode:port> specify a namenode
-jt <local|jobtracker:port> specify a job tracker
-libjars <paths> comma separated jar files to include in
the classpath.
-tokenCacheFile <tokensFile> name of the file with the tokens
Job-Specific Options:
--input (-i) input Path to job input directory.
--userFeatures userFeatures path to the user feature matrix
--itemFeatures itemFeatures path to the item feature matrix
--numRecommendations numRecommendations number of recommendations per user
--maxRating maxRating maximum rating available
--output (-o) output The directory pathname for output.
--help (-h) Print out help
--tempDir tempDir Intermediate output directory
--startPhase startPhase First phase to run
--endPhase endPhase Last phase to run
Run: ./mahout recommendfactorized -i output/als/als/userRatings -o output/recommendations --userFeatures output/als/als/U --itemFeatures output/als/als/M --numRecommendations 6 --maxRating 5. When it finishes, the terminal shows 6040 map output records, which exactly matches the number of users in the dataset, and you can inspect the recommendation output on HDFS:
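Per user, this step essentially scores every item by the dot product of the user's and the item's feature vectors, caps scores at --maxRating, skips items the user already rated, and keeps the top --numRecommendations. A hypothetical sketch of that scoring (names and toy factors are mine, not Mahout's):

```python
import numpy as np

def recommend(user_vec, M, already_rated, n=6, max_rating=5):
    """Score items as user-feature dot item-feature, clip to max_rating,
    drop already-rated items, return the top n as (item, score) pairs."""
    scores = np.minimum(M @ user_vec, max_rating)
    candidates = [(i, s) for i, s in enumerate(scores) if i not in already_rated]
    candidates.sort(key=lambda t: t[1], reverse=True)
    return candidates[:n]

# toy item-feature matrix (4 items, 2 features) and one user vector
M = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.9, 0.1]])
user = np.array([2.0, 1.0])
print(recommend(user, M, already_rated={0}, n=2))
```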
Share, grow, be happy.
Please credit the original blog when reposting: http://blog.csdn.net/fansy1990