Three ways to add an auto-increment ID to a DataFrame
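1. Add it with the monotonically_increasing_id() function
import org.apache.spark.sql.functions.monotonically_increasing_id
dataFrame.withColumn("id", monotonically_increasing_id())
This is the quickest method and needs no shuffle. The first partition starts at 0, and the IDs are unique and increasing, but they are consecutive only within a single partition, because the partition index is encoded in the upper bits of each value.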
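To make the gaps between partitions concrete, here is a minimal sketch of the ID layout described in the Spark documentation for monotonically_increasing_id(): the partition index goes in the upper 31 bits and the row position within the partition in the lower 33 bits. The object MonotonicIdLayout and the helper expectedId are hypothetical names used only for illustration; this is a plain Scala model, not Spark's source code.

```scala
// Illustrative model only: this is not Spark's implementation.
object MonotonicIdLayout {
  // Number of low bits reserved for the row position inside a partition.
  val RowBits: Int = 33

  // Hypothetical helper: the ID expected for a given partition index
  // and row position within that partition.
  def expectedId(partitionIndex: Int, rowInPartition: Long): Long =
    (partitionIndex.toLong << RowBits) + rowInPartition

  def main(args: Array[String]): Unit = {
    // Model a DataFrame with two partitions of three rows each.
    val ids = for {
      p <- 0 until 2
      r <- 0 until 3
    } yield expectedId(p, r.toLong)
    // prints: 0, 1, 2, 8589934592, 8589934593, 8589934594
    println(ids.mkString(", "))
  }
}
```

So the IDs are safe to use as unique keys, but not as row numbers: the second partition starts at 8589934592 rather than 3.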
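2. Add it with the zipWithIndex operator
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField}
// Extend the existing schema with a Long "id" column
val schemaType = resDF.schema.add(StructField("id", LongType))
val dfRDD = resDF.rdd.zipWithIndex()
// Append the 0-based index to the end of each row
val resRDD = dfRDD.map { case (row, index) => Row.fromSeq(row.toSeq :+ index) }
val addIndexDF = spark.createDataFrame(resRDD, schemaType)
The IDs are consecutive and start at 0, at the cost of a round trip through the RDD API.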
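The following sketch models why zipWithIndex produces consecutive IDs where method 1 does not: rows are ordered by partition index first and by position inside the partition second, and each partition starts where the previous ones ended. This also hints at why the RDD method has to run an extra Spark job to count partition sizes when there is more than one partition. The object and function names are hypothetical, and plain Scala collections stand in for RDD partitions; it is a model of the behavior, not Spark's code.

```scala
// Illustrative model only: plain Scala sequences stand in for RDD partitions.
object ZipWithIndexModel {
  // Hypothetical helper: index rows by partition first,
  // then by position inside the partition.
  def zipWithIndexModel[T](partitions: Seq[Seq[T]]): Seq[(T, Long)] = {
    // Starting index of each partition = total size of all earlier partitions.
    val offsets = partitions.map(_.size.toLong).scanLeft(0L)(_ + _)
    partitions.zip(offsets).flatMap { case (rows, offset) =>
      rows.zipWithIndex.map { case (row, i) => (row, offset + i) }
    }
  }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(Seq("a", "b"), Seq("c"), Seq("d", "e"))
    // prints (a,0), (b,1), (c,2), (d,3), (e,4), one pair per line
    zipWithIndexModel(partitions).foreach(println)
  }
}
```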
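3. Add it with row_number()
import org.apache.spark.sql.expressions.Window
dataFrame.withColumn("id", row_number().over(Window.orderBy(monotonically_increasing_id())))
row_number() requires an ordered window, so a bare over() is rejected; ordering by monotonically_increasing_id() is one common choice that follows the DataFrame's current partition order. Both functions come from org.apache.spark.sql.functions. The IDs are consecutive and start at 1.
Note: because the window has no partitionBy, Spark moves all rows into a single partition to number them, so on large datasets this can cause an out-of-memory (OOM) error.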