PySpark DataFrame Data Processing: Removing Duplicate Data (Part 1)

Data_IT_Farmer

1. Duplicate data: an example

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataDeal").getOrCreate()
df = spark.createDataFrame([
    (1, 144.5, 5.9, 33, 'M'),
    (2, 167.2, 5.4, 45, 'M'),
    (3, 124.1, 5.2, 23, 'F'),
    (4, 144.5, 5.9, 33, 'M'),
    (5, 133.2, 5.7, 54, 'F'),
    (3, 124.1, 5.2, 23, 'F'),
    (5, 129.2, 5.3, 42, 'M'),
], ['id', 'weight', 'height', 'age', 'gender'])

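>>> df.show()
+---+------+------+---+------+
| id|weight|height|age|gender|
+---+------+------+---+------+
|  1| 144.5|   5.9| 33|     M|
|  2| 167.2|   5.4| 45|     M|
|  3| 124.1|   5.2| 23|     F|
|  4| 144.5|   5.9| 33|     M|
|  5| 133.2|   5.7| 54|     F|
|  3| 124.1|   5.2| 23|     F|
|  5| 129.2|   5.3| 42|     M|
+---+------+------+---+------+

The data above has three problems:

Two rows have id 3 and are identical in every column.

The rows with id 1 and id 4 hold the same values and differ only in id, so we can assume they describe the same person.

Two rows have id 5, but their other values differ. This looks like an anomaly, because they do not appear to describe the same person.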

2. Checking for duplicate rows with .distinct()

>>> print('Count of rows:{0}'.format(df.count()))
Count of rows:7
>>> print('Count of distinct rows:{0}'.format(df.distinct().count()))
Count of distinct rows:6
>>> df.columns
['id', 'weight', 'height', 'age', 'gender']

The two counts differ: the DataFrame has 7 rows but only 6 distinct rows. The dataset therefore contains at least one row that is an exact copy of another.
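The comparison above can be modelled without Spark. The following is a minimal plain-Python sketch, not PySpark code: it treats each row as a tuple and uses a set as a stand-in for .distinct(). The variable names are illustrative assumptions.

```python
# Plain-Python model of df.count() versus df.distinct().count().
# The rows are the same seven records used to build the DataFrame.
rows = [
    (1, 144.5, 5.9, 33, 'M'),
    (2, 167.2, 5.4, 45, 'M'),
    (3, 124.1, 5.2, 23, 'F'),
    (4, 144.5, 5.9, 33, 'M'),
    (5, 133.2, 5.7, 54, 'F'),
    (3, 124.1, 5.2, 23, 'F'),
    (5, 129.2, 5.3, 42, 'M'),
]

total_rows = len(rows)           # stand-in for df.count()
distinct_rows = len(set(rows))   # stand-in for df.distinct().count()

# The difference is the number of rows that are exact copies of another row.
fully_duplicated = total_rows - distinct_rows

print(total_rows, distinct_rows, fully_duplicated)
```

A non-zero difference only says that exact copies exist. It says nothing about rows that differ only in id, which is why the later steps repeat the check on a subset of columns.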

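3. Removing duplicates with .dropDuplicates()

1) Remove rows that are identical in every column

# Remove fully duplicated rows
df = df.dropDuplicates()
# Inspect the deduplicated data
df.show()

>>> df.show()
+---+------+------+---+------+
| id|weight|height|age|gender|
+---+------+------+---+------+
|  4| 144.5|   5.9| 33|     M|
|  1| 144.5|   5.9| 33|     M|
|  5| 129.2|   5.3| 42|     M|
|  5| 133.2|   5.7| 54|     F|
|  2| 167.2|   5.4| 45|     M|
|  3| 124.1|   5.2| 23|     F|
+---+------+------+---+------+

The result shows that one of the two rows with id 3 was removed. The row order differs from the input because deduplication shuffles the data and no ordering is guaranteed.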

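2) Repeat the check, this time looking for rows that are duplicates once id is ignored

# Compare every column except id.
# Despite the "ids" labels, both numbers below are row counts.
print("Count of ids:{0}".format(df.count()))
print("Count of distinct ids:{0}".format(df.select([c for c in df.columns if c != 'id']).distinct().count()))
Count of ids:6
Count of distinct ids:5

.dropDuplicates() can remove these rows as well, but it needs the subset parameter so that every column except id is considered. subset tells .dropDuplicates() to compare only the listed columns when deciding whether two rows are duplicates. The surviving row keeps all of its columns, and which row of a duplicate group survives is not guaranteed.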

# Drop rows whose non-id attributes are all identical
df = df.dropDuplicates(subset=[c for c in df.columns if c != 'id'])
df.show()
>>> df.show()
+---+------+------+---+------+
| id|weight|height|age|gender|
+---+------+------+---+------+
|  5| 133.2|   5.7| 54|     F|
|  4| 144.5|   5.9| 33|     M|
|  2| 167.2|   5.4| 45|     M|
|  3| 124.1|   5.2| 23|     F|
|  5| 129.2|   5.3| 42|     M|
+---+------+------+---+------+

The result shows that no row is duplicated any more: there are neither exact copies nor rows that match in everything except id. Of the two matching rows (id 1 and id 4), Spark happened to keep id 4 here.
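To make the role of subset concrete, here is a small plain-Python sketch of the same idea. It is a model under stated assumptions, not Spark's implementation: the helper drop_duplicates is hypothetical, and it always keeps the first row it meets, whereas Spark does not guarantee which row survives.

```python
def drop_duplicates(rows, columns, subset=None):
    """Keep one row per distinct combination of the `subset` columns.

    Hypothetical helper that models the semantics only; it keeps the
    first row seen for each key.
    """
    subset = list(columns) if subset is None else list(subset)
    key_positions = [columns.index(name) for name in subset]
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[i] for i in key_positions)
        if key not in seen:
            seen.add(key)
            kept.append(row)  # the whole row is kept, not just the key
    return kept


columns = ['id', 'weight', 'height', 'age', 'gender']
rows = [
    (1, 144.5, 5.9, 33, 'M'),
    (2, 167.2, 5.4, 45, 'M'),
    (3, 124.1, 5.2, 23, 'F'),
    (4, 144.5, 5.9, 33, 'M'),
    (5, 133.2, 5.7, 54, 'F'),
    (3, 124.1, 5.2, 23, 'F'),
    (5, 129.2, 5.3, 42, 'M'),
]

# Step 1: compare all columns (removes the second copy of id 3).
step1 = drop_duplicates(rows, columns)

# Step 2: compare everything except id (id 1 and id 4 collapse into one row).
non_id = [c for c in columns if c != 'id']
step2 = drop_duplicates(step1, columns, subset=non_id)

print(len(step1), len(step2))
print(sorted(row[0] for row in step2))
```

In this model the row with id 1 survives because it comes first, while the Spark run above kept id 4. If the choice matters, the data would need an explicit ordering or tie-breaking rule before deduplication.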

# Deduplicate on weight, height and age only (both id and gender are ignored)

>>> df = df.dropDuplicates(subset=['weight', 'height', 'age'])
>>> df.show()
+---+------+------+---+------+
| id|weight|height|age|gender|
+---+------+------+---+------+
|  2| 167.2|   5.4| 45|     M|
|  5| 133.2|   5.7| 54|     F|
|  5| 129.2|   5.3| 42|     M|
|  3| 124.1|   5.2| 23|     F|
|  4| 144.5|   5.9| 33|     M|
+---+------+------+---+------+

All five rows remain, because no two of them share the same weight, height and age.

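3) Check for duplicate ids

# Count the id values and the distinct id values
import pyspark.sql.functions as fn
>>> df.agg(fn.count('id').alias('count'),fn.countDistinct('id').alias('distinct')).show()
+-----+--------+
|count|distinct|
+-----+--------+
|    5|       4|
+-----+--------+

fn.count('id') counts the non-null id values, which here equals the number of rows, and fn.countDistinct('id') counts the distinct id values. .alias() gives each returned column a name.

The result shows 5 records but only 4 distinct ids. Assuming that the repeated id is an accident (an anomaly in the data), the fix is to give every row its own unique id.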
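The two aggregates say that an id is repeated but not which one. A plain-Python sketch with collections.Counter shows the idea on the five ids from the table above; the variable names are illustrative. In PySpark, a comparable result would typically come from grouping by id, counting, and filtering on the count.

```python
from collections import Counter

# The id column of the five remaining rows, as shown in the output above.
ids = [5, 4, 2, 3, 5]

id_counts = Counter(ids)

total = len(ids)            # stand-in for fn.count('id')
distinct = len(id_counts)   # stand-in for fn.countDistinct('id')

# Ids that occur more than once, with their number of occurrences.
repeated = {value: n for value, n in id_counts.items() if n > 1}

print(total, distinct, repeated)
```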

4) Assign a new id to every row

df.withColumn('new_id',fn.monotonically_increasing_id()).show()
+---+------+------+---+------+-------------+
| id|weight|height|age|gender|       new_id|
+---+------+------+---+------+-------------+
|  2| 167.2|   5.4| 45|     M|  68719476736|
|  5| 133.2|   5.7| 54|     F| 395136991232|
|  5| 129.2|   5.3| 42|     M| 884763262976|
|  3| 124.1|   5.2| 23|     F| 962072674304|
|  4| 144.5|   5.9| 33|     M|1331439861760|
+---+------+------+---+------+-------------+

fn.monotonically_increasing_id() gives each record a unique, monotonically increasing 64-bit id. The ids are not consecutive, as the output shows. Note that the statement above only displays the new column; to keep it, assign the result of .withColumn() back to df.
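The large gaps between the new_id values have a simple explanation. According to the Spark documentation, the current implementation puts the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. This is an implementation detail that may change, so the sketch below is only a way to read the values shown above, not something to rely on. The function name decode_new_id is an assumption made for illustration.

```python
# Assumption taken from the Spark documentation of the current implementation:
# upper 31 bits = partition ID, lower 33 bits = record number in the partition.
RECORD_BITS = 33
RECORD_MASK = (1 << RECORD_BITS) - 1


def decode_new_id(new_id):
    """Split a generated id into (partition_id, record_number)."""
    partition_id = new_id >> RECORD_BITS
    record_number = new_id & RECORD_MASK
    return partition_id, record_number


# The new_id values from the output above.
new_ids = [68719476736, 395136991232, 884763262976, 962072674304, 1331439861760]

decoded = [decode_new_id(value) for value in new_ids]
for value, (partition_id, record_number) in zip(new_ids, decoded):
    print(value, partition_id, record_number)
```

Under this reading, each of the five rows is the first record of a different partition, which is why the ids are unique and increasing but far apart.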

Reference: https://blog.csdn.net/xiaoql520/article/details/78774581
