Problem: Spark ML requires the features column fed to an estimator to be non-nullable (nullable=false / containsNull=false), for example:
StructType(StructField(id,IntegerType,false), StructField(features,ArrayType(DoubleType,false),true))
[id: int, features: array<double>]
However, most of our data sources (such as ALS's itemFactors) come back with StructField nullable=true by default when read into a DataFrame. If we fit the source df for clustering directly, like:
val kModel=new KMeans()
.setK(3)
.setFeaturesCol("features")
.setPredictionCol("prediction")
.fit(df)
then it fails with: Column features must be .... array<float> but was actually of type array<float>
At first glance our array<float> clearly matches the required type; the error message is simply incomplete. The real mismatch is the hidden containsNull flag: the estimator wants array<float> with containsNull=false, while the column we read in has containsNull=true, and both render as plain array<float> in the message.
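The hidden difference can be seen by comparing the two ArrayTypes directly. A minimal sketch (assuming only spark-sql types on the classpath; `required` and `actual` are illustrative names, not from the original post):

```scala
import org.apache.spark.sql.types.{ArrayType, FloatType}

// What the estimator's schema check demands: non-null elements.
val required = ArrayType(FloatType, containsNull = false)
// What a typical file read produces: nullable elements.
val actual = ArrayType(FloatType, containsNull = true)

// Both render identically, which is why the error looks self-contradictory;
// only the containsNull flag differs.
println(required.simpleString) // array<float>
println(actual.simpleString)   // array<float>
println(required == actual)    // false
```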
So how do we get it to run correctly? Creating the DataFrame by hand, as below, does satisfy nullable=false:
val df: DataFrame = spark.sqlContext.createDataFrame(Seq(
(6, Array(1.6, 0.6, 0.2))
)).toDF("id", "features")
If instead we save (6, Array(1.6, 0.6, 0.2)) to a file 123.txt and read it back with
val df = spark.sqlContext.read.format("json").load(kmPath + "123.txt").selectExpr("cast(id as int)", "cast(features as array<double>)")
then nullable=true comes back and fitting directly throws the error above, which is rather painful. The official example sidesteps this by loading a libsvm-format file; see the docs: http://spark.apache.org/docs/latest/ml-clustering.html
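Before abandoning spark.ml entirely, one workaround worth trying (a sketch, not from the original post) is to convert the array column into an ml.linalg Vector with a udf; a features column of vector type passes the KMeans schema check regardless of the array's containsNull flag:

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.{col, udf}

// Wrap the nullable array<double> into a dense ml Vector.
val toVec = udf((xs: Seq[Double]) => Vectors.dense(xs.toArray))
val vecDF = df.withColumn("features", toVec(col("features")))
// vecDF's features column is now of type vector, so the earlier
// new KMeans().fit(vecDF) call can proceed.
```

Alternatively, the DataFrame can be rebuilt with an explicit schema whose ArrayType has containsNull = false via spark.createDataFrame(df.rdd, schema).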
In the end we had to fall back to MLlib. One thing to note: the feature-vector part of alsModel.itemFactors cannot be read directly with r.getAs[Array[Double]](1), e.g.:
import scala.collection.mutable
import org.apache.spark.ml.recommendation.ALSModel
import org.apache.spark.mllib.linalg.Vectors
val alsModel = ALSModel.load(alsPath)
val songDF: DataFrame = alsModel.itemFactors
// Map over the underlying RDD[Row]: there is no Dataset encoder for the
// mllib Vector type, so songDF.map(...).rdd would not compile.
val songRDD = songDF.rdd.map { r =>
  // (r.getInt(0), Vectors.dense( r.getAs[Array[Double]](1) ) ) // not like this
  val f: mutable.WrappedArray[Float] = r.getAs[mutable.WrappedArray[Float]](1)
  (r.getInt(0), Vectors.dense(f.map(_.toDouble).toArray))
}
Otherwise training later fails with another incomplete error, saying the vectors must be Array[Double]...:
val kmodel=new KMeans()
.setK(10)
.run(songRDD.map(_._2))
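For completeness, once run returns, per-song cluster assignments can be read off the model (a short sketch using the songRDD built above):

```scala
// kmodel.predict maps each mllib Vector to the index of its nearest centroid.
val assignments = songRDD.map { case (id, vec) => (id, kmodel.predict(vec)) }
assignments.take(5).foreach(println)
```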