When creating a DataFrame in Spark with a user-defined schema, I ran into the error shown in the title.
Cause analysis:

First, define the RDD:
val dCom = spark.read.orc(pth1 + "/202006")
  .select("uid", "publish_time", "content_length")
  .join(isTrueUser, Seq("uid"))
  .rdd.map(row => {
    val uid = row.getAs[String]("uid")
    val dtime = row.getAs[String]("publish_time")
    val dlength = row.getAs[Long]("content_length").toInt
    // Note: this yields an RDD of plain tuples, i.e. RDD[(String, String, Int, Int, String)]
    (uid, dtime, dlength, 0, "0")
  })
Then define the schema:
import org.apache.spark.sql.types._

val commSchema = StructType(List(
  StructField("uid", StringType, nullable = true),
  StructField("time", StringType, nullable = true),
  StructField("length", IntegerType, nullable = true),
  StructField("topic", IntegerType, nullable = true),
  StructField("type", StringType, nullable = true)
))
val df = spark.createDataFrame(dCom, commSchema)
Running this produced the error shown in the title. The root cause: the createDataFrame(rowRDD, schema) overload expects its first argument to be an RDD[Row], while the map above returns an RDD of plain tuples. Changing (uid, dtime, dlength, 0, "0") in the RDD definition to Row(uid, dtime, dlength, 0, "0") (which also requires import org.apache.spark.sql.Row) resolves the problem.
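
For reference, here is a minimal sketch of the corrected pipeline, assuming the same source path, column names, and the commSchema defined above:

import org.apache.spark.sql.Row

val dCom = spark.read.orc(pth1 + "/202006")
  .select("uid", "publish_time", "content_length")
  .join(isTrueUser, Seq("uid"))
  .rdd.map(row => {
    val uid = row.getAs[String]("uid")
    val dtime = row.getAs[String]("publish_time")
    val dlength = row.getAs[Long]("content_length").toInt
    // Wrap the values in a Row so the result is RDD[Row],
    // matching the createDataFrame(rowRDD, schema) signature
    Row(uid, dtime, dlength, 0, "0")
  })

val df = spark.createDataFrame(dCom, commSchema)

As an aside (an alternative, not what this post does): if you keep the tuple form, Spark can infer the schema from the tuple's element types instead of you writing one by hand:

import spark.implicits._
// toDF on an RDD of tuples infers the column types (String, String, Int, Int, String)
val df2 = dComTuples.toDF("uid", "time", "length", "topic", "type")

Here dComTuples is a hypothetical name standing in for the original tuple-based RDD; note the inferred nullability flags may differ from the hand-written commSchema.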