Reference
1. https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.html
1. Removing duplicate rows
pyspark.sql.DataFrame.dropDuplicates(subset=None)
Purpose: returns a new DataFrame with duplicate rows removed, optionally considering only a subset of columns when comparing rows.
Examples
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
    Row(name='Alice', age=5, height=80),
    Row(name='Alice', age=5, height=80),
    Row(name='Alice', age=10, height=80)])
df.dropDuplicates().show()
| name  | age | height |
|-------|-----|--------|
| Alice | 5   | 80     |
| Alice | 10  | 80     |
df.dropDuplicates(['name', 'height']).show()
| name  | age | height |
|-------|-----|--------|
| Alice | 5   | 80     |
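The subset behavior above can be sketched in plain Python without Spark: treat the subset columns as the deduplication key and keep one row per key. Note that Spark does not guarantee which of the duplicate rows is retained; this illustrative sketch (function name and data are assumptions, not Spark API) keeps the first occurrence for determinism.

```python
# Minimal pure-Python sketch of dropDuplicates(subset) semantics.
# NOT Spark itself: keeps the first row seen for each distinct key.
rows = [
    {"name": "Alice", "age": 5, "height": 80},
    {"name": "Alice", "age": 5, "height": 80},
    {"name": "Alice", "age": 10, "height": 80},
]

def drop_duplicates(rows, subset=None):
    seen = set()
    out = []
    for row in rows:
        # With no subset, compare on all columns; otherwise only the subset.
        key = tuple(row[c] for c in (subset or list(row)))
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

print(drop_duplicates(rows))                      # two rows remain (age 5 and age 10)
print(drop_duplicates(rows, ["name", "height"]))  # one row remains
```

Comparing on `["name", "height"]` collapses all three rows into one, because they share the same name and height even though the ages differ.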