https://zhuanlan.zhihu.com/p/34901683
1 Screening for Duplicate Data
Goals for this section:
- Filter out duplicate rows.
- For a single column, find duplicated values.
- For a combination of columns, find duplicated values.
1.1 Duplicate rows
from pyspark.sql import SparkSession
# Create a SparkSession via the builder:
# .appName("testapp") names the application; .getOrCreate() creates a new SparkSession or returns an existing one
spark = SparkSession.builder.appName("testapp").getOrCreate()
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import col
# Build the sample data
data = [
    (0, 1, Vectors.dense([-1.0, -1.0])),
    (1, 0, Vectors.dense([-1.0, 1.0])),
    (3, 1, Vectors.dense([1.0, 1.0])),
    (2, 0, Vectors.dense([1.0, -1.0])),
    (3, 1, Vectors.dense([1.0, 1.0])),
]
df = spark.createDataFrame(data, ["id", "label", "features"])
df.show()
+---+-----+-----------+
| id|label| features|
+---+-----+-----------+
| 0| 1|[-1.0,-1.0]|
| 1| 0| [-1.0,1.0]|
| 3| 1| [1.0,1.0]|
| 2| 0| [1.0,-1.0]|
| 3| 1| [1.0,1.0]|
+---+-----+-----------+
The DataFrame above contains a duplicate row; we want to find it and remove it.
# Compare the row count before and after deduplication
print('Count of distinct rows:',df.distinct().count())
print('Count of rows:',df.count())
Count of distinct rows: 4
Count of rows: 5
# Drop duplicate rows
df.drop_duplicates().show()
+---+-----+-----------+
| id|label| features|
+---+-----+-----------+
| 3| 1| [1.0,1.0]|
| 0| 1|[-1.0,-1.0]|
| 1| 0| [-1.0,1.0]|
| 2| 0| [1.0,-1.0]|
+---+-----+-----------+
# drop_duplicates is an alias of dropDuplicates: same behavior, different name
df.dropDuplicates().show()
+---+-----+-----------+
| id|label| features|
+---+-----+-----------+
| 3| 1| [1.0,1.0]|
| 0| 1|[-1.0,-1.0]|
| 1| 0| [-1.0,1.0]|
| 2| 0| [1.0,-1.0]|
+---+-----+-----------+
# You can also deduplicate on just a subset of columns (rows with identical values in those columns)
df.dropDuplicates(['id','label']).show()
+---+-----+-----------+
| id|label| features|
+---+-----+-----------+
| 1| 0| [-1.0,1.0]|
| 0| 1|[-1.0,-1.0]|
| 2| 0| [1.0,-1.0]|
| 3| 1| [1.0,1.0]|
+---+-----+-----------+
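The examples above only drop duplicates; to actually pull out the duplicated rows (the first goal listed at the top), one possible approach is to group on key columns, keep the groups that occur more than once, and join back. A minimal sketch, grouping on id and label since grouping directly on the vector column may not be supported:
# Keep the (id, label) combinations that occur more than once
dup_keys = df.groupBy('id', 'label').count().filter(col('count') > 1)
dup_keys.show()
# Join back on those keys to recover the full duplicated rows
df.join(dup_keys.select('id', 'label'), on=['id', 'label']).show()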
1.2 Duplicate values in a column
# Use the helpers in pyspark.sql.functions
from pyspark.sql import functions as fn
# Aggregate with agg
# alias names the result, like AS in SQL
# The idea is similar to value_counts in pandas
df.agg(
    fn.count('id').alias('id_count'),
    fn.countDistinct('id').alias('id_distinctcount'),
    fn.count('label').alias('label_count'),
    fn.countDistinct('label').alias('label_distinctcount'),
).show()
+--------+----------------+-----------+-------------------+
|id_count|id_distinctcount|label_count|label_distinctcount|
+--------+----------------+-----------+-------------------+
| 5| 4| 5| 2|
+--------+----------------+-----------+-------------------+
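The aggregate above only says that id has 5 rows but 4 distinct values; to see which id value actually repeats, a small follow-up sketch:
# Group on the single column and keep the values that occur more than once
df.groupBy('id').count().filter(fn.col('count') > 1).show()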
2 Missing Values
2.1 Count missing values per row
df = spark.createDataFrame([
(1, 143.5, 5.6, 28, 'M', 100000),
(2, 167.2, 5.4, 45, 'M', None),
(3, None, 5.2, None, None, None),
(4, 144.5, 5.9, 33, 'M', None),
(5, 133.2, 5.7, 54, 'F', None),
(6, 124.1, 5.2, None, 'F', None),
(7, 129.2, 5.3, 42, 'M', 76000),
], ['id', 'weight', 'height', 'age', 'gender', 'income'])
df.show()
+---+------+------+----+------+------+
| id|weight|height| age|gender|income|
+---+------+------+----+------+------+
| 1| 143.5| 5.6| 28| M|100000|
| 2| 167.2| 5.4| 45| M| null|
| 3| null| 5.2|null| null| null|
| 4| 144.5| 5.9| 33| M| null|
| 5| 133.2| 5.7| 54| F| null|
| 6| 124.1| 5.2|null| F| null|
| 7| 129.2| 5.3| 42| M| 76000|
+---+------+------+----+------+------+
To count missing values per row, use the map function on the RDD.
# Convert to an RDD to use map (similar to the apply family in R)
# Note the lambda: the per-row logic reads much like pandas
df.rdd.map(lambda x: [x['id'], sum([c is None for c in x])]).toDF(['id', 'null_num']).show()
+---+--------+
| id|null_num|
+---+--------+
| 1| 0|
| 2| 1|
| 3| 4|
| 4| 1|
| 5| 1|
| 6| 2|
| 7| 0|
+---+--------+
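A possible alternative that stays in the DataFrame API (avoiding the RDD round-trip) is to turn isnull into a 0/1 flag per column and add the flags up; a minimal sketch:
from functools import reduce
from operator import add
# One 0/1 indicator per column; their sum is the per-row null count
null_flags = [fn.isnull(c).cast('int') for c in df.columns]
df.select('id', reduce(add, null_flags).alias('null_num')).show()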
2.2 Count missing values per column
isnull flags whether a value is missing, returning true or false.
# Flag missing values
df.select('weight',
          fn.isnull('weight').alias('weight_null'),
          'age',
          fn.isnull('age').alias('age_null')).show()
+------+-----------+----+--------+
|weight|weight_null| age|age_null|
+------+-----------+----+--------+
| 143.5| false| 28| false|
| 167.2| false| 45| false|
| null| true|null| true|
| 144.5| false| 33| false|
| 133.2| false| 54| false|
| 124.1| false|null| true|
| 129.2| false| 42| false|
+------+-----------+----+--------+
To count missing values per column the logic is SQL-like: count() on a column skips nulls, so the missing count is count('*') minus count(column).
# count() only counts non-null values, so these are the non-null counts per column
df.select(
    fn.count('weight').alias('weight_count'),
    fn.count('age').alias('age_count'),
    fn.count('*').alias('row_num')).show()
+------------+---------+-------+
|weight_count|age_count|row_num|
+------------+---------+-------+
| 6| 5| 7|
+------------+---------+-------+
For every column, compute the fraction of values that are missing:
df.select([(1 - fn.count(i) / fn.count('*')).alias(i + '_missing')
for i in df.columns]).show()
+----------+------------------+--------------+------------------+------------------+------------------+
|id_missing| weight_missing|height_missing| age_missing| gender_missing| income_missing|
+----------+------------------+--------------+------------------+------------------+------------------+
| 0.0|0.1428571428571429| 0.0|0.2857142857142857|0.1428571428571429|0.7142857142857143|
+----------+------------------+--------------+------------------+------------------+------------------+
Convert to percentages. Tacking on a literal percent sign is a bit awkward here; one option is shown in the sketch after the output below.
df.select([
fn.round((1 - fn.count(i) / fn.count('*')) * 100,
2).alias(i + '_missingrate') for i in df.columns
]).show()
+--------------+------------------+------------------+---------------+------------------+------------------+
|id_missingrate|weight_missingrate|height_missingrate|age_missingrate|gender_missingrate|income_missingrate|
+--------------+------------------+------------------+---------------+------------------+------------------+
| 0.0| 14.29| 0.0| 28.57| 14.29| 71.43|
+--------------+------------------+------------------+---------------+------------------+------------------+
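One way to attach the percent sign without switching to SQL is to cast the rounded number to a string and concatenate fn.lit('%'); a minimal sketch (the resulting columns are strings, so they are for display only):
df.select([
    fn.concat(
        fn.round((1 - fn.count(i) / fn.count('*')) * 100, 2).cast('string'),
        fn.lit('%')
    ).alias(i + '_missingrate') for i in df.columns
]).show()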
The summary method is also convenient:
df.summary("count").show()
+-------+---+------+------+---+------+------+
|summary| id|weight|height|age|gender|income|
+-------+---+------+------+---+------+------+
| count| 7| 6| 7| 5| 6| 2|
+-------+---+------+------+---+------+------+
2.3 Dropping rows with missing values
Drop rows whose missing values exceed some threshold. See the documentation: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameNaFunctions
drop(how='any', thresh=None, subset=None)
Parameters:
how – 'any' or 'all'. If 'any', drop a row if it contains any nulls. If 'all', drop a row only if all its values are null.
[any: drop the row if it has any missing value; all: drop the row only if every value is missing]
thresh – int, default None. If specified, drop rows that have less than thresh non-null values. This overwrites the how parameter.
[thresh is the minimum number of non-null values a row must have to be kept; e.g. thresh=4 drops every row with fewer than 4 non-null values]
subset – optional list of column names to consider.
[only the listed columns are checked for nulls; the test need not involve every column]
# For each row, count the non-null fields and join that count back on; len(x) is the number of columns (6 here)
df.join(df.rdd.map(lambda x: [x['id'], len(x) - sum([c is None for c in x])]).toDF(['id', 'not_null']), on='id').show()
+---+------+------+----+------+------+--------+
| id|weight|height| age|gender|income|not_null|
+---+------+------+----+------+------+--------+
| 7| 129.2| 5.3| 42| M| 76000| 6|
| 6| 124.1| 5.2|null| F| null| 4|
| 5| 133.2| 5.7| 54| F| null| 5|
| 1| 143.5| 5.6| 28| M|100000| 6|
| 3| null| 5.2|null| null| null| 2|
| 2| 167.2| 5.4| 45| M| null| 5|
| 4| 144.5| 5.9| 33| M| null| 5|
+---+------+------+----+------+------+--------+
Drop rows with fewer than 5 non-null values.
Here:
the row with id=3 has 4 missing values and 2 non-missing;
the row with id=6 has 2 missing values and 4 non-missing.
df.dropna(thresh=5).show()
+---+------+------+---+------+------+
| id|weight|height|age|gender|income|
+---+------+------+---+------+------+
| 1| 143.5| 5.6| 28| M|100000|
| 2| 167.2| 5.4| 45| M| null|
| 4| 144.5| 5.9| 33| M| null|
| 5| 133.2| 5.7| 54| F| null|
| 7| 129.2| 5.3| 42| M| 76000|
+---+------+------+---+------+------+
The null check can also be restricted to a few columns, counting missing values only there.
Within the four columns 'height', 'age', 'gender', 'income':
the row with id=3 has only 1 non-null column;
the row with id=6 has only 2 non-null columns.
df.dropna(how='any', thresh=3, subset=('height', 'age', 'gender', 'income')).show()
+---+------+------+---+------+------+
| id|weight|height|age|gender|income|
+---+------+------+---+------+------+
| 1| 143.5| 5.6| 28| M|100000|
| 2| 167.2| 5.4| 45| M| null|
| 4| 144.5| 5.9| 33| M| null|
| 5| 133.2| 5.7| 54| F| null|
| 7| 129.2| 5.3| 42| M| 76000|
+---+------+------+---+------+------+
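For completeness, how='all' is never exercised above; a minimal sketch that drops a row only when every column in the chosen subset is null (with this subset, only the id=3 row qualifies):
# Drop a row only if weight, age, gender and income are all null
df.dropna(how='all', subset=('weight', 'age', 'gender', 'income')).show()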
2.4 Filling missing values
The fillna method does this; its parameters:
fillna(value, subset=None)
Parameters:
value – int, long, float, string, bool or dict. Value to replace null values with. If the value is a dict, then subset is ignored and value must be a mapping from column name (string) to replacement value. The replacement value must be an int, long, float, boolean, or string.
subset – optional list of column names to consider. Columns specified in subset that do not have matching data type are ignored. For example, if value is a string, and subset contains a non-string column, then the non-string column is simply ignored.
The English docstring says it all; passing value as a dict is the most direct and convenient option. What value to fill with is a separate question and not covered here.
df.fillna(value = {'age':-999,'gender':'missing'}).show()
+---+------+------+----+-------+------+
| id|weight|height| age| gender|income|
+---+------+------+----+-------+------+
| 1| 143.5| 5.6| 28| M|100000|
| 2| 167.2| 5.4| 45| M| null|
| 3| null| 5.2|-999|missing| null|
| 4| 144.5| 5.9| 33| M| null|
| 5| 133.2| 5.7| 54| F| null|
| 6| 124.1| 5.2|-999| F| null|
| 7| 129.2| 5.3| 42| M| 76000|
+---+------+------+----+-------+------+
3 Descriptive Statistics
3.1 Print the column types
df = spark.createDataFrame([
(1, 143.5, 5.6, 28, 'M', 100000),
(2, 167.2, 5.4, 45, 'M', None),
(3, None, 5.2, None, None, None),
(4, 144.5, 5.9, 33, 'M', None),
(5, 133.2, 5.7, 54, 'F', None),
(6, 124.1, 5.2, None, 'F', None),
(7, 129.2, 5.3, 42, 'M', 76000),
], ['id', 'weight', 'height', 'age', 'gender', 'income'])
Similar to DESCRIBE TABLE in SQL:
df.printSchema()
root
|-- id: long (nullable = true)
|-- weight: double (nullable = true)
|-- height: double (nullable = true)
|-- age: long (nullable = true)
|-- gender: string (nullable = true)
|-- income: long (nullable = true)
3.2 Summary statistics
Use the describe method, similar to summary in R and DataFrame.describe in pandas:
df.describe().show()
+-------+-----------------+------------------+-------------------+-----------------+------+-----------------+
|summary| id| weight| height| age|gender| income|
+-------+-----------------+------------------+-------------------+-----------------+------+-----------------+
| count| 7| 6| 7| 5| 6| 2|
| mean| 4.0|140.28333333333333| 5.471428571428572| 40.4| null| 88000.0|
| stddev|2.160246899469287|15.411348632312055|0.26903708365381984|10.21273714534943| null|16970.56274847714|
| min| 1| 124.1| 5.2| 28| F| 76000|
| max| 7| 167.2| 5.9| 54| M| 100000|
+-------+-----------------+------------------+-------------------+-----------------+------+-----------------+
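describe is limited to count/mean/stddev/min/max; the summary method also accepts percentile strings, which is handy for a quick look at the quartiles. A minimal sketch (percentiles for non-numeric columns come back as null):
df.summary('min', '25%', '50%', '75%', 'max').show()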
4 Column Operations
4.1 Create the data
from pyspark.sql import SparkSession
# Create a SparkSession via the builder:
# .appName("testapp") names the application; .getOrCreate() creates a new SparkSession or returns an existing one
spark = SparkSession.builder.appName("testapp").getOrCreate()
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import col
# Build the sample data
data = [
    (0, 1, Vectors.dense([-1.0, -1.0])),
    (1, 0, Vectors.dense([-1.0, 1.0])),
    (3, 1, Vectors.dense([1.0, 1.0])),
    (2, 0, Vectors.dense([1.0, -1.0])),
    (3, 1, Vectors.dense([1.0, 1.0])),
]
df = spark.createDataFrame(data, ["id", "label", "features"])
df.show()
+---+-----+-----------+
| id|label| features|
+---+-----+-----------+
| 0| 1|[-1.0,-1.0]|
| 1| 0| [-1.0,1.0]|
| 3| 1| [1.0,1.0]|
| 2| 0| [1.0,-1.0]|
| 3| 1| [1.0,1.0]|
+---+-----+-----------+
4.2 Renaming columns
4.2.1 selectExpr
This approach is rather clumsy and not very flexible; it is workable when the table has only a few columns, renaming them via aliases.
selectExpr processes columns with SQL-like expressions; note that it returns a new DataFrame.
df.selectExpr('id','label as label_sql').show()
+---+---------+
| id|label_sql|
+---+---------+
| 0| 1|
| 1| 0|
| 3| 1|
| 2| 0|
| 3| 1|
+---+---------+
Or keep every column with '*' and alias on top, which effectively copies the column under the new name:
df.selectExpr('*','label as label_sql').show()
+---+-----+-----------+---------+
| id|label| features|label_sql|
+---+-----+-----------+---------+
| 0| 1|[-1.0,-1.0]| 1|
| 1| 0| [-1.0,1.0]| 0|
| 3| 1| [1.0,1.0]| 1|
| 2| 0| [1.0,-1.0]| 0|
| 3| 1| [1.0,1.0]| 1|
+---+-----+-----------+---------+
4.2.2 select
selectExpr is a variant of select; the logic is essentially the same, but the select version is a little more verbose.
df.select('*',df.label.alias('label_sql')).show()
+---+-----+-----------+---------+
| id|label| features|label_sql|
+---+-----+-----------+---------+
| 0| 1|[-1.0,-1.0]| 1|
| 1| 0| [-1.0,1.0]| 0|
| 3| 1| [1.0,1.0]| 1|
| 2| 0| [1.0,-1.0]| 0|
| 3| 1| [1.0,1.0]| 1|
+---+-----+-----------+---------+
df.select('*',col("label").alias("label_sql")).show()
+---+-----+-----------+---------+
| id|label| features|label_sql|
+---+-----+-----------+---------+
| 0| 1|[-1.0,-1.0]| 1|
| 1| 0| [-1.0,1.0]| 0|
| 3| 1| [1.0,1.0]| 1|
| 2| 0| [1.0,-1.0]| 0|
| 3| 1| [1.0,1.0]| 1|
+---+-----+-----------+---------+
4.2.3 Renaming a specific column
withColumnRenamed renames the given column directly and conveniently.
Arguably the best option.
df.withColumnRenamed("label", "label_sql").show()
+---+---------+-----------+
| id|label_sql| features|
+---+---------+-----------+
| 0| 1|[-1.0,-1.0]|
| 1| 0| [-1.0,1.0]|
| 3| 1| [1.0,1.0]|
| 2| 0| [1.0,-1.0]|
| 3| 1| [1.0,1.0]|
+---+---------+-----------+
4.2.4 The toDF method
This method suits renaming several columns at once (whether toDF costs extra memory, I am not sure).
From the documentation:
toDF(*cols)
Returns a new DataFrame with the new specified column names
Parameters: cols – list of new column names (string)
df.toDF(*['id1','label1','feature1']).show()
+---+------+-----------+
|id1|label1| feature1|
+---+------+-----------+
| 0| 1|[-1.0,-1.0]|
| 1| 0| [-1.0,1.0]|
| 3| 1| [1.0,1.0]|
| 2| 0| [1.0,-1.0]|
| 3| 1| [1.0,1.0]|
+---+------+-----------+
df.toDF('id1','label1','feature1').show()
+---+------+-----------+
|id1|label1| feature1|
+---+------+-----------+
| 0| 1|[-1.0,-1.0]|
| 1| 0| [-1.0,1.0]|
| 3| 1| [1.0,1.0]|
| 2| 0| [1.0,-1.0]|
| 3| 1| [1.0,1.0]|
+---+------+-----------+
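Because toDF takes the complete list of names, it pairs well with a list comprehension when many columns need the same change; a small sketch that appends a hypothetical _v1 suffix to every column:
# Rename every column in one go by rebuilding the whole name list
df.toDF(*[c + '_v1' for c in df.columns]).show()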
The four methods above cover most day-to-day needs; other approaches can be found with a quick search.
Reference: https://stackoverflow.com/questions/34077353/how-to-change-dataframe-column-names-in-pyspark
4.3 Adding a column
Adding a column by itself is rarely the point; more often you compute something from existing columns and store the result as a new column.
A new column can come from two places:
- values derived from the DataFrame's own columns
- values assigned from another DataFrame
4.3.1 selectExpr
Anything SQL can compute can be written with selectExpr, but only from existing columns:
df.selectExpr('*','label+10 AS label_sql').show()
+---+-----+-----------+---------+
| id|label| features|label_sql|
+---+-----+-----------+---------+
| 0| 1|[-1.0,-1.0]| 11|
| 1| 0| [-1.0,1.0]| 10|
| 3| 1| [1.0,1.0]| 11|
| 2| 0| [1.0,-1.0]| 10|
| 3| 1| [1.0,1.0]| 11|
+---+-----+-----------+---------+
4.3.2 More complex computations on existing columns
This is roughly the apply pattern in pandas: operate on the columns row by row.
When this runs on a real cluster, numpy must be installed on every node, because the toArray/toDF conversion used here relies on it.
# Unpack the dense vector into separate columns
def extract(row):
    return (row.id, row.label) + tuple(row.features.toArray().tolist())
df.rdd.map(lambda row: extract(row)).toDF(["id", "label", "feature_a", "feature_b"]).show()
+---+-----+---------+---------+
| id|label|feature_a|feature_b|
+---+-----+---------+---------+
| 0| 1| -1.0| -1.0|
| 1| 0| -1.0| 1.0|
| 3| 1| 1.0| 1.0|
| 2| 0| 1.0| -1.0|
| 3| 1| 1.0| 1.0|
+---+-----+---------+---------+
df.rdd.map(
    lambda x: [x['id'], x['label']] + x['features'].toArray().tolist()).toDF(
        ['id', 'label', 'feature_a', 'feature_b']).show()
+---+-----+---------+---------+
| id|label|feature_a|feature_b|
+---+-----+---------+---------+
| 0| 1| -1.0| -1.0|
| 1| 0| -1.0| 1.0|
| 3| 1| 1.0| 1.0|
| 2| 0| 1.0| -1.0|
| 3| 1| 1.0| 1.0|
+---+-----+---------+---------+
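On Spark 3.0 or later, a possible pure-DataFrame alternative is pyspark.ml.functions.vector_to_array, which avoids the RDD round-trip entirely; a minimal sketch:
from pyspark.ml.functions import vector_to_array
from pyspark.sql.functions import col
# Turn the vector into an array column, then pull its elements out as ordinary columns
df.withColumn('arr', vector_to_array('features')).select(
    'id', 'label',
    col('arr')[0].alias('feature_a'),
    col('arr')[1].alias('feature_b')).show()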
I no longer remember when this was, or when these notes were taken. Time keeps turning and memories keep scattering. Does life, in the end, give more than it takes away?
2022-04-24, Jiulonghu, Jiangning District, Nanjing