Adding an index to an RDD
rdd.zipWithIndex()
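After the index is added, each element of the RDD is a (row, index) pair, so converting it directly to a DataFrame yields only two columns: one holding all of the original row data and one holding the index. The RDD has to be mapped into a flat, multi-column form first.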
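def getOneDF(x):
    # x is a (row, index) pair produced by zipWithIndex()
    return x[0]['a'], x[0]['b'], x[0]['c'], x[0]['d'], x[1]
# a is assumed to be the RDD returned by rdd.zipWithIndex()
a.map(getOneDF).toDF().show()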
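To see why the flattening step is needed, the shape of the data can be sketched with plain Python lists standing in for the RDD. This is only an illustration: zip with range mimics the (element, index) pairs that zipWithIndex() yields, and the sample rows are made up.

```python
# Plain-Python stand-in for an RDD of dict-like rows (sample data is made up).
rows = [
    {'a': 1, 'b': 2, 'c': 3, 'd': 4},
    {'a': 5, 'b': 6, 'c': 7, 'd': 8},
]

# Mimics the shape of rdd.zipWithIndex(): each element becomes (row, index).
indexed = list(zip(rows, range(len(rows))))

def getOneDF(x):
    # Unpack the (row, index) pair into one flat tuple of five values.
    return x[0]['a'], x[0]['b'], x[0]['c'], x[0]['d'], x[1]

flat = [getOneDF(x) for x in indexed]

print(len(indexed[0]))  # each indexed element has two parts: row and index
print(flat)             # each flattened element has five values
```

Each element of indexed has two parts, which is why a direct conversion gives two columns, while each element of flat has five values, one per intended column.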
Adding an index to a DataFrame
from pyspark.sql.functions import monotonically_increasing_id
dataframe.withColumn('id',monotonically_increasing_id())
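The IDs generated by monotonically_increasing_id() are guaranteed to be unique and monotonically increasing, but not consecutive. Spark puts the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits, so once the data spans more than one partition the index jumps: rows in the second partition start at 8589934592 (2^33).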
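The value 8589934592 equals 2^33, which fits the layout given in the Spark documentation (partition ID in the upper 31 bits, per-partition record number in the lower 33 bits). A minimal pure-Python sketch of that arithmetic follows; sketch_id and split_id are hypothetical helper names used only for illustration, not Spark functions.

```python
# Illustrative model of the ID layout; this is not Spark code.
RECORD_BITS = 33  # lower 33 bits hold the record number within a partition

def sketch_id(partition_id, record_number):
    """Combine a partition ID and a per-partition record number into one ID."""
    return (partition_id << RECORD_BITS) + record_number

def split_id(generated_id):
    """Recover (partition_id, record_number) from an ID built by sketch_id."""
    mask = (1 << RECORD_BITS) - 1
    return generated_id >> RECORD_BITS, generated_id & mask

# First two IDs in each of three partitions.
first_ids = [sketch_id(p, r) for p in range(3) for r in range(2)]
print(first_ids)
# [0, 1, 8589934592, 8589934593, 17179869184, 17179869185]
```

If consecutive indexes 0, 1, 2, ... are required, the rdd.zipWithIndex() approach described earlier is the one that provides them.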