This need usually comes up when you want to horizontally merge two PySpark DataFrames. Unlike pandas DataFrames, PySpark DataFrames cannot be glued together with a concat-style function, and the usual workaround, monotonically_increasing_id, breaks down once the data spans multiple partitions: the generated ids stop being consecutive and suddenly jump to a much larger value, so a join on those ids produces a DataFrame that is empty or missing rows.
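To see the jump, here is a minimal sketch (not from the original post; it assumes a local SparkSession) that forces the data across several partitions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id

spark = SparkSession.builder.master("local[4]").getOrCreate()

# Force four partitions so the generated ids cross partition boundaries.
df = spark.range(0, 8).repartition(4)
df.withColumn("gen_id", monotonically_increasing_id()).show()
# The ids are unique and increasing, but the upper 31 bits encode the
# partition index, so they jump by roughly 2^33 at every partition
# boundary and will not line up row-for-row between two DataFrames.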
The fix is as follows:
from commons import sparkSession  # project-local helper that exposes a ready SparkSession
from pyspark.sql.functions import monotonically_increasing_id
from pyspark.sql.types import StructField, LongType


def add_id(df):
    # Append a gap-free, 0-based "id" column via zipWithIndex, which numbers
    # rows consecutively across partitions (unlike monotonically_increasing_id).
    schema = df.schema.add(StructField("id", LongType()))
    rdd = df.rdd.zipWithIndex()  # yields (Row, index) pairs

    def flat(l):
        # Flatten the (Row, index) pair into one flat record.
        for k in l:
            if not isinstance(k, (list, tuple)):
                yield k
            else:
                yield from flat(k)

    rdd = rdd.map(lambda x: list(flat(x)))
    return sparkSession.createDataFrame(rdd, schema)


if __name__ == '__main__':
    df1 = sparkSession.createDataFrame([('abcd', '123'), ('abd', '13'), ('abc', '12')], ['s', 'd'])
    # df1 = df1.withColumn("id", monotonically_increasing_id())  # the approach that breaks
    df1.show()
    df2 = sparkSession.createDataFrame([('mnb', '456'), ('mgb', '56'), ('ngb', '45')], ['r', 't'])
    df1 = add_id(df1)
    df1.show()
    df2 = add_id(df2)
    df2.show()
    # Join the two DataFrames on the synthetic id to "concat" them horizontally.
    join_df = df1.join(df2, df1.id == df2.id)
    join_df.show()
Result:
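The joined DataFrame should look roughly like this (row order may vary, and note that both synthetic id columns survive the join):

+----+---+---+---+---+---+
|   s|  d| id|  r|  t| id|
+----+---+---+---+---+---+
|abcd|123|  0|mnb|456|  0|
| abd| 13|  1|mgb| 56|  1|
| abc| 12|  2|ngb| 45|  2|
+----+---+---+---+---+---+

For completeness, another common workaround (not from the original post) is to rank rows with row_number over a global window. It only suits small data, since the global window pulls every row into a single partition:

from pyspark.sql import Window
from pyspark.sql.functions import monotonically_increasing_id, row_number

def add_id_via_window(df):
    # Hypothetical alternative to add_id: order by the gapped but still
    # increasing monotonically_increasing_id, then use row_number (1-based)
    # minus 1 to get a consecutive 0-based id.
    w = Window.orderBy(monotonically_increasing_id())
    return df.withColumn("id", row_number().over(w) - 1)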