4.2 Extracting Useful Features
First, load the MovieLens dataset:
val rawData = sc.textFile("C:\\Users\\13798\\Desktop\\dataset\\ml-100k\\u.data")
Calling first() on this RDD returns the first record to the driver; the output is similar to the following:
14/03/30 11:42:41 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform… using builtin-java classes where applicable
14/03/30 11:42:41 WARN LoadSnappy: Snappy native library not loaded
14/03/30 11:42:41 INFO FileInputFormat: Total input paths to process : 1
14/03/30 11:42:41 INFO SparkContext: Starting job: first at :15
14/03/30 11:42:41 INFO DAGScheduler: Got job 0 (first at :15)
with 1 output partitions (allowLocal=true)
14/03/30 11:42:41 INFO DAGScheduler: Final stage: Stage 0 (first at :15)
14/03/30 11:42:41 INFO DAGScheduler: Parents of final stage: List()
14/03/30 11:42:41 INFO DAGScheduler: Missing parents: List()
14/03/30 11:42:41 INFO DAGScheduler: Computing the requested partition locally
14/03/30 11:42:41 INFO HadoopRDD: Input split: file:/Users/Nick/
workspace/datasets/ml-100k/u.data:0+1979173
14/03/30 11:42:41 INFO SparkContext: Job finished: first at :15,
took 0.030533 s
res0: String = 196 242 3 881250949
Each record consists of the user ID, movie ID, rating, and timestamp fields, in that order, separated by tabs. The timestamp is not needed for training the model, so we extract only the first three fields:
val rawRatings = rawData.map(_.split("\t").take(3))
Calling rawRatings.first() returns only the first record of the new RDD to the driver, which lets us inspect it. The command's output is as follows:
14/03/30 12:24:00 INFO SparkContext: Starting job: first at :21
14/03/30 12:24:00 INFO DAGScheduler: Got job 1 (first at :21)
with 1 output partitions (allowLocal=true)
14/03/30 12:24:00 INFO DAGScheduler: Final stage: Stage 1 (first at :21)
14/03/30 12:24:00 INFO DAGScheduler: Parents of final stage: List()
14/03/30 12:24:00 INFO DAGScheduler: Missing parents: List()
14/03/30 12:24:00 INFO DAGScheduler: Computing the requested partition locally
14/03/30 12:24:00 INFO HadoopRDD: Input split: file:/Users/Nick/
workspace/datasets/ml-100k/u.data:0+1979173
14/03/30 12:24:00 INFO SparkContext: Job finished: first at :21,
took 0.00391 s
res6: Array[String] = Array(196, 242, 3)
Next, we use Spark's MLlib to train the model:
import org.apache.spark.mllib.recommendation.ALS
The function we will use here is train. If you type only ALS.train and press Enter, the console reports an error; however, this error contains the function's signature information:
<console>:28: error: ambiguous reference to overloaded definition,
both method train in object ALS of type (ratings: org.apache.spark.rdd.RDD[org.apache.spark.mllib.recommendation.Rating], rank: Int, iterations: Int)org.apache.spark.mllib.recommendation.MatrixFactorizationModel
and method train in object ALS of type (ratings: org.apache.spark.rdd.RDD[org.apache.spark.mllib.recommendation.Rating], rank: Int, iterations: Int, lambda: Double)org.apache.spark.mllib.recommendation.MatrixFactorizationModel
match expected type ?
ALS.train
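As the error message shows, train expects an RDD of Rating objects together with the rank, the number of iterations, and optionally the regularization parameter lambda. A minimal sketch of the full call, assuming rawRatings from above, might look like this (rank 50, 10 iterations, and lambda 0.01 are illustrative values, not tuned hyperparameters):

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Map each Array(user, movie, rating) of strings into MLlib's Rating case class,
// converting the IDs to Int and the rating to Double as the signature requires.
val ratings = rawRatings.map { case Array(user, movie, rating) =>
  Rating(user.toInt, movie.toInt, rating.toDouble)
}

// Train a matrix factorization model with illustrative hyperparameters:
// rank = 50 latent factors, 10 iterations, lambda = 0.01 regularization.
val model = ALS.train(ratings, 50, 10, 0.01)
```

This produces a MatrixFactorizationModel, matching the return type shown in the overloaded definitions above.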