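distinct: removes duplicate rows; takes no arguments
distinct returns the Row records of the current DataFrame with duplicates removed. It gives the same result as the dropDuplicates() method described next when that method is called without specifying any columns.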
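dropDuplicates: takes arguments; removes duplicates based on the specified columns
Unlike distinct, this method can deduplicate on a chosen subset of columns.
For example, to drop repeated records where the same user placed orders through the same channel: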
df.dropDuplicates("user","type").show()
Output:
+---+----+----+--------------------+
| id|user|type| visittime|
+---+----+----+--------------------+
| 8| 3|APP2|2017-08-03 13:44:...|
| 1| 1| 助手1|2017-08-10 13:44:...|
| 7| 3| 助手2|2017-08-14 13:44:...|
| 12| 1| 助手2|2017-07-07 13:45:...|
| 3| 2| 助手1|2017-08-05 13:44:...|
| 5| 3|APP1|2017-08-02 13:44:...|
| 9| 2|APP2|2017-08-11 13:44:...|
| 2| 1|APP1|2017-08-04 13:44:...|
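The df used above is not defined in these notes, so the following is only a minimal sketch of the same call on made-up data. The rows, the `orders` name, and the local SparkSession setup are assumptions for illustration; only `dropDuplicates("user", "type")` comes from the example above.

```scala
import org.apache.spark.sql.SparkSession

// Local session for experimentation (spark-shell already provides `spark`).
val spark = SparkSession.builder()
  .appName("dedup-by-columns")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical order records: (id, user, type). Not the data from the notes.
val orders = Seq(
  (1, 1, "APP1"),
  (2, 1, "APP1"), // same user and same channel as id 1
  (3, 1, "APP2"),
  (4, 2, "APP1")
).toDF("id", "user", "type")

// Keep one row per (user, type) pair; all columns are retained.
val deduped = orders.dropDuplicates("user", "type")

// Number of distinct (user, type) pairs, computed independently.
val pairCount = orders.select("user", "type").distinct().count()

deduped.show()
```

Which of the rows with id 1 and id 2 survives is not something this sketch relies on; it only expects one row per (user, type) pair.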
Spark distinct Function
val dfTN = Seq(("Smith",23,5.3),("Rashmi",27,5.8),("Smith",23,5.3),("Payal",27,5.8)).toDF("Name","Age","Height")
dfTN.show()
+------+---+------+
| Name|Age|Height|
+------+---+------+
| Smith| 23| 5.3|
|Rashmi| 27| 5.8|
| Smith| 23| 5.3|
| Payal| 27| 5.8|
+------+---+------+
dfTN.distinct.show()
+------+---+------+
| Name|Age|Height|
+------+---+------+
| Smith| 23| 5.3|
| Payal| 27| 5.8|
|Rashmi| 27| 5.8|
+------+---+------+
dfTN.select('Age,'Height).distinct.show()
+---+------+
|Age|Height|
+---+------+
| 23| 5.3|
| 27| 5.8|
+---+------+
Spark dropDuplicates() Function
val dfTN = Seq(("Smith",23,5.3),("Rashmi",27,5.8),("Smith",23,5.3),("Payal",27,5.8)).toDF("Name","Age","Height")
dfTN.show()
+------+---+------+
| Name|Age|Height|
+------+---+------+
| Smith| 23| 5.3|
|Rashmi| 27| 5.8|
| Smith| 23| 5.3|
| Payal| 27| 5.8|
+------+---+------+
// Applying dropDuplicates() to the entire DataFrame gives the same result as distinct
dfTN.dropDuplicates().show()
+------+---+------+
| Name|Age|Height|
+------+---+------+
| Smith| 23| 5.3|
| Payal| 27| 5.8|
|Rashmi| 27| 5.8|
+------+---+------+
dfTN.dropDuplicates("Age","Height").show()
+------+---+------+
| Name|Age|Height|
+------+---+------+
| Smith| 23| 5.3|
|Rashmi| 27| 5.8|
+------+---+------+
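The two column-based examples differ in what they return: select(...).distinct keeps only the selected columns, while dropDuplicates("Age","Height") keeps every column. The sketch below checks this on the same sample data, assuming a local SparkSession; the names `people`, `projected`, and `deduped` are introduced here for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Local session for experimentation (spark-shell already provides `spark`).
val spark = SparkSession.builder()
  .appName("distinct-vs-dropDuplicates")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Same sample rows as dfTN in the notes.
val people = Seq(
  ("Smith", 23, 5.3),
  ("Rashmi", 27, 5.8),
  ("Smith", 23, 5.3),
  ("Payal", 27, 5.8)
).toDF("Name", "Age", "Height")

// Projection followed by distinct: result has only Age and Height.
val projected = people.select("Age", "Height").distinct()

// Deduplication on a column subset: result still has Name, Age, Height.
val deduped = people.dropDuplicates("Age", "Height")

// With no arguments, dropDuplicates() and distinct() yield the same rows,
// so the set difference between them is empty.
val difference = people.distinct().except(people.dropDuplicates())

projected.printSchema()
deduped.printSchema()
```

For the (27, 5.8) group, the row that dropDuplicates retains (Rashmi or Payal) should not be relied upon; the sketch only checks column sets and row counts.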
References:
https://understandingbigdata.com/spark-distinct-and-dropduplicates/
https://blog.csdn.net/helloworld0906/article/details/108966193