PySpark: the difference between distinct and dropDuplicates

yujkss

Original link: https://blog.csdn.net/helloworld0906/article/details/108966193

Contents

Spark Distinct Function
Spark dropDuplicates() Function

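distinct: removes duplicate rows and takes no arguments
distinct returns the Row records of the current DataFrame with duplicates removed, comparing every column. It gives the same result as dropDuplicates() called without any column names.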
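The whole-row comparison can be illustrated without a cluster. The following is a minimal plain-Python sketch of the semantics, not Spark code: the helper name `distinct` is made up for illustration, and the rows are the sample rows from the Spark example further down. Spark itself does not guarantee the order of the returned rows, whereas this model keeps first-seen order.

```python
# Plain-Python model of DataFrame.distinct(): a row is dropped only when
# every column matches a row that was already seen.
rows = [
    ("Smith", 23, 5.3),
    ("Rashmi", 27, 5.8),
    ("Smith", 23, 5.3),   # exact duplicate of the first row
    ("Payal", 27, 5.8),   # same Age/Height as Rashmi, but a different Name
]


def distinct(rows):
    """Return rows with exact duplicates removed, keeping first-seen order."""
    # dict.fromkeys keeps one entry per distinct tuple, in insertion order.
    return list(dict.fromkeys(rows))


unique_rows = distinct(rows)
print(unique_rows)
```

Only the repeated Smith row disappears; Rashmi and Payal both stay because their names differ.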

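dropDuplicates: accepts arguments and removes duplicates based on the specified columns
Unlike distinct, this method can deduplicate on a chosen subset of columns. All columns are kept in the result, and when several rows share the same values in those columns, only one of them is retained; which one is not guaranteed.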

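For example, to drop repeated records where the same user placed orders through the same channel (Scala syntax shown; in PySpark, pass the column names as a list, i.e. df.dropDuplicates(["user", "type"])):

df.dropDuplicates("user", "type").show()

Output: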
+---+----+----+--------------------+
| id|user|type|           visittime|
+---+----+----+--------------------+
|  8|   3|APP2|2017-08-03 13:44:...|
|  1|   1| 助手1|2017-08-10 13:44:...|
|  7|   3| 助手2|2017-08-14 13:44:...|
| 12|   1| 助手2|2017-07-07 13:45:...|
|  3|   2| 助手1|2017-08-05 13:44:...|
|  5|   3|APP1|2017-08-02 13:44:...|
|  9|   2|APP2|2017-08-11 13:44:...|
|  2|   1|APP1|2017-08-04 13:44:...|
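The keyed deduplication above can be sketched in plain Python. This is a model of the behaviour under stated assumptions rather than Spark code: the `orders` records and the `drop_duplicates` helper are hypothetical, and the model keeps the first row seen per key, whereas Spark does not promise which row of a duplicate group survives.

```python
# Hypothetical order records; the values are made up for illustration.
orders = [
    {"id": 1, "user": 1, "type": "APP1"},
    {"id": 2, "user": 1, "type": "APP1"},  # same user and channel as id 1
    {"id": 3, "user": 1, "type": "APP2"},
    {"id": 4, "user": 2, "type": "APP1"},
    {"id": 5, "user": 2, "type": "APP1"},  # same user and channel as id 4
]


def drop_duplicates(rows, subset):
    """Keep one row per distinct combination of the `subset` columns.

    Assumption: the first row seen for each key is the one that is kept.
    """
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[col] for col in subset)
        if key not in seen:
            seen.add(key)
            kept.append(row)  # the full row is kept, not just the key columns
    return kept


deduped = drop_duplicates(orders, ["user", "type"])
print([row["id"] for row in deduped])
```

Each surviving record still carries its `id`, which is the practical difference from projecting the key columns first and then deduplicating.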

Spark Distinct Function

// toDF requires spark.implicits._ (imported automatically in spark-shell)
val dfTN = Seq(("Smith", 23, 5.3), ("Rashmi", 27, 5.8), ("Smith", 23, 5.3), ("Payal", 27, 5.8)).toDF("Name", "Age", "Height")
dfTN.show()

+------+---+------+
|  Name|Age|Height|
+------+---+------+
| Smith| 23|   5.3|
|Rashmi| 27|   5.8|
| Smith| 23|   5.3|
| Payal| 27|   5.8|
+------+---+------+

// Remove rows that are identical in every column
dfTN.distinct().show()
+------+---+------+
|  Name|Age|Height|
+------+---+------+
| Smith| 23|   5.3|
| Payal| 27|   5.8|
|Rashmi| 27|   5.8|
+------+---+------+

// Keep only Age and Height, then remove duplicate (Age, Height) pairs
dfTN.select("Age", "Height").distinct().show()
+---+------+
|Age|Height|
+---+------+
| 23|   5.3|
| 27|   5.8|
+---+------+

Spark dropDuplicates() Function

val dfTN = Seq(("Smith",23,5.3),("Rashmi",27,5.8),("Smith",23,5.3),("Payal",27,5.8)).toDF("Name","Age","Height")
dfTN.show()
+------+---+------+
|  Name|Age|Height|
+------+---+------+
| Smith| 23|   5.3|
|Rashmi| 27|   5.8|
| Smith| 23|   5.3|
| Payal| 27|   5.8|
+------+---+------+

// Applying dropDuplicates() to the entire DataFrame gives the same result as distinct
dfTN.dropDuplicates().show()
+------+---+------+
|  Name|Age|Height|
+------+---+------+
| Smith| 23|   5.3|
| Payal| 27|   5.8|
|Rashmi| 27|   5.8|
+------+---+------+

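// Deduplicate on Age and Height only; all columns are kept, and which row survives per pair is not guaranteed
dfTN.dropDuplicates("Age", "Height").show()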
+------+---+------+
|  Name|Age|Height|
+------+---+------+
| Smith| 23|   5.3|
|Rashmi| 27|   5.8|
+------+---+------+
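The two outputs for Age and Height differ in shape: selecting the columns and then calling distinct returns only those columns, while dropDuplicates on the same columns returns whole rows. A small plain-Python model makes the contrast concrete. The helper names `select_distinct` and `drop_duplicates` are invented for this sketch, and it assumes the first row seen per key is the one kept, which happens to match the output shown above but is not something Spark guarantees.

```python
columns = ("Name", "Age", "Height")
rows = [
    ("Smith", 23, 5.3),
    ("Rashmi", 27, 5.8),
    ("Smith", 23, 5.3),
    ("Payal", 27, 5.8),
]


def select_distinct(rows, columns, keep):
    """Model of select(cols).distinct(): project first, then deduplicate."""
    idx = [columns.index(c) for c in keep]
    projected = (tuple(row[i] for i in idx) for row in rows)
    return list(dict.fromkeys(projected))


def drop_duplicates(rows, columns, subset):
    """Model of dropDuplicates(cols): deduplicate by key, keep full rows."""
    idx = [columns.index(c) for c in subset]
    first_by_key = {}
    for row in rows:
        # setdefault stores a row only the first time its key appears.
        first_by_key.setdefault(tuple(row[i] for i in idx), row)
    return list(first_by_key.values())


pairs_only = select_distinct(rows, columns, ["Age", "Height"])
full_rows = drop_duplicates(rows, columns, ["Age", "Height"])
print(pairs_only)
print(full_rows)
```

Both results have two entries, but only the second still has the Name column, so dropDuplicates is the one to reach for when the other columns are needed downstream.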

References:
https://understandingbigdata.com/spark-distinct-and-dropduplicates/
https://blog.csdn.net/helloworld0906/article/details/108966193
