Machine Learning with Spark, Chapter 4: Building a Recommendation Engine with Spark

4.2 Extracting Effective Features

Load the MovieLens dataset:

val rawData = sc.textFile("C:\\Users\\13798\\Desktop\\dataset\\ml-100k\\u.data")
rawData.first()

(The log below shows a job named "first", so the `rawData.first()` call is what produces this output.) The output should look similar to the following:
14/03/30 11:42:41 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform… using builtin-java classes where applicable
14/03/30 11:42:41 WARN LoadSnappy: Snappy native library not loaded
14/03/30 11:42:41 INFO FileInputFormat: Total input paths to process : 1
14/03/30 11:42:41 INFO SparkContext: Starting job: first at :15
14/03/30 11:42:41 INFO DAGScheduler: Got job 0 (first at :15)
with 1 output partitions (allowLocal=true)
14/03/30 11:42:41 INFO DAGScheduler: Final stage: Stage 0 (first at :15)
14/03/30 11:42:41 INFO DAGScheduler: Parents of final stage: List()
14/03/30 11:42:41 INFO DAGScheduler: Missing parents: List()
14/03/30 11:42:41 INFO DAGScheduler: Computing the requested partition locally
14/03/30 11:42:41 INFO HadoopRDD: Input split: file:/Users/Nick/
workspace/datasets/ml-100k/u.data:0+1979173
14/03/30 11:42:41 INFO SparkContext: Job finished: first at :15,
took 0.030533 s
res0: String = 196 242 3 881250949

The data consists of a user ID, a movie ID, a rating, and a timestamp, in that order, with the fields separated by tab characters. The timestamp is not needed for training the model, so we extract only the first three fields:

val rawRatings = rawData.map(_.split("\t").take(3))

Calling `rawRatings.first()` returns only the first record of the new RDD to the driver program, so we can use it to inspect the new RDD. Its output is as follows:
14/03/30 12:24:00 INFO SparkContext: Starting job: first at :21
14/03/30 12:24:00 INFO DAGScheduler: Got job 1 (first at :21)
with 1 output partitions (allowLocal=true)
14/03/30 12:24:00 INFO DAGScheduler: Final stage: Stage 1 (first at :21)
14/03/30 12:24:00 INFO DAGScheduler: Parents of final stage: List()
14/03/30 12:24:00 INFO DAGScheduler: Missing parents: List()
14/03/30 12:24:00 INFO DAGScheduler: Computing the requested partition locally
14/03/30 12:24:00 INFO HadoopRDD: Input split: file:/Users/Nick/
workspace/datasets/ml-100k/u.data:0+1979173
14/03/30 12:24:00 INFO SparkContext: Job finished: first at :21,
took 0.00391 s
res6: Array[String] = Array(196, 242, 3)
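The split-and-take transformation applied above can be checked on the sample record alone, without Spark; a minimal plain-Scala sketch (the `record` value is just the first line shown earlier):

```scala
// Standalone check of the same logic the map applies to every line:
val record = "196\t242\t3\t881250949"   // sample record from rawData.first()
val fields = record.split("\t").take(3)
// fields: Array("196", "242", "3") -> user ID, movie ID, rating
```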

Next, we use Spark's MLlib to train a model:
import org.apache.spark.mllib.recommendation.ALS

The function we will use here is `train`. If you enter just `ALS.train` and press Enter, the console reports an error, but the error message includes the function's signature:

<console>:28: error: ambiguous reference to overloaded definition,
both method train in object ALS of type (ratings: org.apache.spark.rdd.RDD[org.apache.spark.mllib.recommendation.Rating], rank: Int, iterations: Int)org.apache.spark.mllib.recommendation.MatrixFactorizationModel
and  method train in object ALS of type (ratings: org.apache.spark.rdd.RDD[org.apache.spark.mllib.recommendation.Rating], rank: Int, iterations: Int, lambda: Double)org.apache.spark.mllib.recommendation.MatrixFactorizationModel
match expected type ?
       ALS.train
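From the signature in the error above, `train` expects an `RDD[Rating]` plus the rank, iteration count, and (optionally) a regularization parameter `lambda`. A sketch of completing the call; the hyperparameter values (rank 50, 10 iterations, lambda 0.01) are illustrative assumptions, not tuned settings:

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Map each Array(user, movie, rating) from rawRatings into MLlib's
// Rating case class, as the RDD[Rating] parameter requires.
val ratings = rawRatings.map { case Array(user, movie, rating) =>
  Rating(user.toInt, movie.toInt, rating.toDouble)
}

// rank = 50 latent factors, 10 iterations, lambda = 0.01 regularization
// (assumed example values; tune them for your data).
val model = ALS.train(ratings, 50, 10, 0.01)
```

The returned `MatrixFactorizationModel` holds the learned user and item factor matrices, which is what the subsequent recommendation steps build on.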