Hitting an error with VectorAssembler in pyspark
import pyspark.ml.feature as ft
vectorAssembler = ft.VectorAssembler(inputCols=['cust_sex', 'cust_age'], outputCol='features')
Check the input data types:
df1.printSchema()
It turns out the inputCols fields are of type string, while VectorAssembler only accepts numeric columns (e.g. float or int).
So cast the types first:
df1 = df1.withColumn('device_number', df1.device_number.astype("int"))
df1 = df1.withColumn('cust_sex', df1.cust_sex.astype("int"))
df1 = df1.withColumn('cust_age', df1.cust_age.astype("int"))
Then run:
ft.VectorAssembler(inputCols=['cust_sex', 'cust_age'], outputCol='features', handleInvalid='keep').transform(df1).show()
Success. Note also that if the source columns contain null, handleInvalid must be set to 'keep' or 'skip'; otherwise you get:
Caused by: org.apache.spark.SparkException: Encountered null while assembling a row with handleInvalid = "error". Consider removing nulls from dataset or using handleInvalid = "keep" or "skip".