PysparkNote102 --- Common DataFrame Operations 2

https://zhuanlan.zhihu.com/p/34901683

1 Checking for Duplicate Data

  The goals here are:

  1. Find duplicate rows.
  2. For a single field, find the duplicated values.
  3. For a combination of fields, find the duplicated values.

1.1 Duplicate rows

from pyspark.sql import SparkSession
# Create a SparkSession object via the builder:
# .appName("testapp") names the application; .getOrCreate() creates a SparkSession or returns one that already exists
spark = SparkSession.builder.appName("testapp").getOrCreate()
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import col
# Build the data
data = [
    (0, 1, Vectors.dense([-1.0, -1.0])),
    (1, 0, Vectors.dense([-1.0, 1.0])),
    (3, 1, Vectors.dense([1.0, 1.0])),
    (2, 0, Vectors.dense([1.0, -1.0])),
    (3, 1, Vectors.dense([1.0, 1.0]))]
df = spark.createDataFrame(data, ["id", "label", "features"])
df.show()
+---+-----+-----------+
| id|label|   features|
+---+-----+-----------+
|  0|    1|[-1.0,-1.0]|
|  1|    0| [-1.0,1.0]|
|  3|    1|  [1.0,1.0]|
|  2|    0| [1.0,-1.0]|
|  3|    1|  [1.0,1.0]|
+---+-----+-----------+

  The dataframe above contains duplicate rows; we need to find them and drop them.

# Check whether the row count changes after deduplication
print('Count of distinct rows:',df.distinct().count())
print('Count of rows:',df.count())
Count of distinct rows: 4
Count of rows: 5
# Drop duplicate rows
df.drop_duplicates().show()
+---+-----+-----------+
| id|label|   features|
+---+-----+-----------+
|  3|    1|  [1.0,1.0]|
|  0|    1|[-1.0,-1.0]|
|  1|    0| [-1.0,1.0]|
|  2|    0| [1.0,-1.0]|
+---+-----+-----------+
# drop_duplicates and dropDuplicates are the same method under two names
df.dropDuplicates().show()
+---+-----+-----------+
| id|label|   features|
+---+-----+-----------+
|  3|    1|  [1.0,1.0]|
|  0|    1|[-1.0,-1.0]|
|  1|    0| [-1.0,1.0]|
|  2|    0| [1.0,-1.0]|
+---+-----+-----------+
# You can also check whether a subset of fields has completely identical values
df.dropDuplicates(['id','label']).show()
+---+-----+-----------+
| id|label|   features|
+---+-----+-----------+
|  1|    0| [-1.0,1.0]|
|  0|    1|[-1.0,-1.0]|
|  2|    0| [1.0,-1.0]|
|  3|    1|  [1.0,1.0]|
+---+-----+-----------+

1.2 Duplicated field values

# Use functions from the pyspark.sql.functions module
from pyspark.sql import functions as fn
# This relies on the aggregation method agg
# alias gives a column an alias, like AS in SQL
# Conceptually similar to value_counts in pandas
df.agg(
    fn.count('id').alias('id_count'),
    fn.countDistinct('id').alias('id_distinctcount'),
    fn.count('label').alias('label_count'),
    fn.countDistinct('label').alias('label_distinctcount'),
).show()
+--------+----------------+-----------+-------------------+
|id_count|id_distinctcount|label_count|label_distinctcount|
+--------+----------------+-----------+-------------------+
|       5|               4|          5|                  2|
+--------+----------------+-----------+-------------------+
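
To see which specific values of a field are duplicated (goal 2 above), a groupBy/count/filter does the job. A minimal sketch, reusing the fn alias imported above:

# Group by the field, count occurrences, keep only values that appear more than once
df.groupBy('id').count().filter(fn.col('count') > 1).show()
# Expected result: id=3 with count=2, since id 3 appears twice in the data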

2 Missing Values

2.1 Counting missing values per row

df = spark.createDataFrame([
    (1, 143.5, 5.6, 28, 'M', 100000),
    (2, 167.2, 5.4, 45, 'M', None),
    (3, None, 5.2, None, None, None),
    (4, 144.5, 5.9, 33, 'M', None),
    (5, 133.2, 5.7, 54, 'F', None),
    (6, 124.1, 5.2, None, 'F', None),
    (7, 129.2, 5.3, 42, 'M', 76000),
], ['id', 'weight', 'height', 'age', 'gender', 'income'])
df.show()
+---+------+------+----+------+------+
| id|weight|height| age|gender|income|
+---+------+------+----+------+------+
|  1| 143.5|   5.6|  28|     M|100000|
|  2| 167.2|   5.4|  45|     M|  null|
|  3|  null|   5.2|null|  null|  null|
|  4| 144.5|   5.9|  33|     M|  null|
|  5| 133.2|   5.7|  54|     F|  null|
|  6| 124.1|   5.2|null|     F|  null|
|  7| 129.2|   5.3|  42|     M| 76000|
+---+------+------+----+------+------+

To count missing values per row, use the map function.

# Convert to an RDD to use map, which is similar to the apply family of functions in R
# Note the lambda body: it reads a lot like pandas syntax, fairly intuitive
df.rdd.map(lambda x: [x['id'], sum([c is None for c in x])]).toDF(['id', 'null_num']).show()
+---+--------+
| id|null_num|
+---+--------+
|  1|       0|
|  2|       1|
|  3|       4|
|  4|       1|
|  5|       1|
|  6|       2|
|  7|       0|
+---+--------+
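
The same per-row count can be done without dropping to the RDD API, by summing per-field is-null indicators cast to integers. A sketch, using the fn alias from above:

from functools import reduce
# Build a single Column expression that adds up one is-null flag per field
null_num = reduce(lambda a, b: a + b,
                  [fn.isnull(c).cast('int') for c in df.columns])
df.select('id', null_num.alias('null_num')).show()
# Gives the same counts as the RDD version above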

2.2 Counting missing values per column

Check whether each value is missing; this returns true or false.

# Check whether the value is missing
df.select('weight',
          fn.isnull('weight').alias('weight_null'), 
          'age',
          fn.isnull('age').alias('age_null')).show()
+------+-----------+----+--------+
|weight|weight_null| age|age_null|
+------+-----------+----+--------+
| 143.5|      false|  28|   false|
| 167.2|      false|  45|   false|
|  null|       true|null|    true|
| 144.5|      false|  33|   false|
| 133.2|      false|  54|   false|
| 124.1|      false|null|    true|
| 129.2|      false|  42|   false|
+------+-----------+----+--------+

Count the non-null values per column; the logic is SQL-like: COUNT over a column skips nulls, so comparing it with count('*') reveals how many values are missing.

# SQL-style counting: fn.count over a column skips nulls,
# so the gap between it and count('*') is the number of missing values
df.select(
    fn.count('weight').alias('weight_notnull'),
    fn.count('age').alias('age_notnull'),
    fn.count('*').alias('row_num')).show()
+--------------+-----------+-------+
|weight_notnull|age_notnull|row_num|
+--------------+-----------+-------+
|             6|          5|      7|
+--------------+-----------+-------+
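
If you would rather count the nulls directly instead of the non-nulls, summing an is-null indicator works too. A small sketch:

# fn.isnull gives a boolean; cast it to int and sum to count missing values per column
df.select([fn.sum(fn.isnull(c).cast('int')).alias(c + '_nullcount')
           for c in ['weight', 'age']]).show()
# Expected: weight_nullcount = 1, age_nullcount = 2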

For every column, compute the share of values that are missing.

df.select([(1 - fn.count(i) / fn.count('*')).alias(i + '_missing')
           for i in df.columns]).show()
+----------+------------------+--------------+------------------+------------------+------------------+
|id_missing|    weight_missing|height_missing|       age_missing|    gender_missing|    income_missing|
+----------+------------------+--------------+------------------+------------------+------------------+
|       0.0|0.1428571428571429|           0.0|0.2857142857142857|0.1428571428571429|0.7142857142857143|
+----------+------------------+--------------+------------------+------------------+------------------+

Convert the ratios to percentages. Concatenating a literal '%' sign is somewhat awkward this way; one option would be to switch to SQL, another is sketched after the output below.

df.select([
    fn.round((1 - fn.count(i) / fn.count('*')) * 100,
             2).alias(i + '_missingrate') for i in df.columns
]).show()
+--------------+------------------+------------------+---------------+------------------+------------------+
|id_missingrate|weight_missingrate|height_missingrate|age_missingrate|gender_missingrate|income_missingrate|
+--------------+------------------+------------------+---------------+------------------+------------------+
|           0.0|             14.29|               0.0|          28.57|             14.29|             71.43|
+--------------+------------------+------------------+---------------+------------------+------------------+
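
As for appending the '%' sign mentioned above, one way that stays in the DataFrame API is to cast the rounded number to a string and concatenate a literal. A sketch:

# Round, cast to string, then concatenate a literal '%' with fn.concat / fn.lit
df.select([
    fn.concat(fn.round((1 - fn.count(i) / fn.count('*')) * 100, 2).cast('string'),
              fn.lit('%')).alias(i + '_missingrate') for i in df.columns
]).show()
# The columns become strings such as '14.29%'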

The summary method is also handy for this.

df.summary("count").show()
+-------+---+------+------+---+------+------+
|summary| id|weight|height|age|gender|income|
+-------+---+------+------+---+------+------+
|  count|  7|     6|     7|  5|     6|     2|
+-------+---+------+------+---+------+------+

2.3 Dropping rows with missing values

  Drop rows whose number of missing values exceeds a threshold. See the function documentation: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameNaFunctions

drop(how='any', thresh=None, subset=None)
Parameters:
    how – 'any' or 'all'. If 'any', drop a row if it contains any nulls. If 'all', drop a row only if all its values are null.
    [any - drop the row if it contains any missing value; all - drop the row only if every value is missing]
    thresh – int, default None If specified, drop rows that have less than thresh non-null values. This overwrites the how parameter.
    [thresh is the minimum number of non-null fields a row needs in order to be kept, e.g. thresh=4 drops every row with fewer than 4 non-null fields]
    subset – optional list of column names to consider.
    [the fields used for the null check, i.e. the check does not have to cover every field]
   
# Join on a per-row count of non-null fields (df has 6 columns)
df.join(df.rdd.map(lambda x: [x['id'], len(x) - sum([c is None for c in x])]).toDF(['id', 'notnull_num']), on='id').show()
+---+------+------+----+------+------+-----------+
| id|weight|height| age|gender|income|notnull_num|
+---+------+------+----+------+------+-----------+
|  7| 129.2|   5.3|  42|     M| 76000|          6|
|  6| 124.1|   5.2|null|     F|  null|          4|
|  5| 133.2|   5.7|  54|     F|  null|          5|
|  1| 143.5|   5.6|  28|     M|100000|          6|
|  3|  null|   5.2|null|  null|  null|          2|
|  2| 167.2|   5.4|  45|     M|  null|          5|
|  4| 144.5|   5.9|  33|     M|  null|          5|
+---+------+------+----+------+------+-----------+

Drop rows that have fewer than 5 non-missing values.
Here:
the row with id=3 has 4 missing values and 2 non-missing values,
the row with id=6 has 2 missing values and 4 non-missing values,
so both rows are removed by thresh=5.

df.dropna(thresh=5).show()
+---+------+------+---+------+------+
| id|weight|height|age|gender|income|
+---+------+------+---+------+------+
|  1| 143.5|   5.6| 28|     M|100000|
|  2| 167.2|   5.4| 45|     M|  null|
|  4| 144.5|   5.9| 33|     M|  null|
|  5| 133.2|   5.7| 54|     F|  null|
|  7| 129.2|   5.3| 42|     M| 76000|
+---+------+------+---+------+------+

You can also restrict the null check to a few fields.
Among the four fields 'height', 'age', 'gender' and 'income':
the row with id=3 has only 1 non-missing field,
the row with id=6 has only 2 non-missing fields.

df.dropna(how='any',thresh = 3,subset =('height', 'age', 'gender','income')).show()
+---+------+------+---+------+------+
| id|weight|height|age|gender|income|
+---+------+------+---+------+------+
|  1| 143.5|   5.6| 28|     M|100000|
|  2| 167.2|   5.4| 45|     M|  null|
|  4| 144.5|   5.9| 33|     M|  null|
|  5| 133.2|   5.7| 54|     F|  null|
|  7| 129.2|   5.3| 42|     M| 76000|
+---+------+------+---+------+------+
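
For completeness, the how parameter also works on its own, without thresh. A quick sketch on the same data (results described in comments rather than shown):

# how='any': drop a row if it contains any null; only id 1 and id 7 are fully populated, so only they survive
df.dropna(how='any').show()
# how='all': drop a row only if every field is null; even the id=3 row keeps its id, so nothing is dropped here
df.dropna(how='all').show()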

2.4 Filling missing values

  The fillna method handles this; its parameters are as follows:

fillna(value, subset=None)
Parameters:	
    value – int, long, float, string, bool or dict. Value to replace null values with. If the value is a dict, then subset is ignored and value must be a mapping from column name (string) to replacement value. The replacement value must be an int, long, float, boolean, or string.
    subset – optional list of column names to consider. Columns specified in subset that do not have matching data type are ignored. For example, if value is a string, and subset contains a non-string column, then the non-string column is simply ignored.

The English docstring above explains it well enough; passing a dict as value is the most direct and convenient option. Which values to impute with is beyond the scope of this note.

df.fillna(value = {'age':-999,'gender':'missing'}).show()
+---+------+------+----+-------+------+
| id|weight|height| age| gender|income|
+---+------+------+----+-------+------+
|  1| 143.5|   5.6|  28|      M|100000|
|  2| 167.2|   5.4|  45|      M|  null|
|  3|  null|   5.2|-999|missing|  null|
|  4| 144.5|   5.9|  33|      M|  null|
|  5| 133.2|   5.7|  54|      F|  null|
|  6| 124.1|   5.2|-999|      F|  null|
|  7| 129.2|   5.3|  42|      M| 76000|
+---+------+------+----+-------+------+
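
As one illustration (not a recommendation), numeric columns can be filled with their own means by building a dict from an agg and passing it to fillna. A sketch; note that Spark casts each replacement to the column's type, so integer columns end up with truncated means:

# Compute per-column means once, turn the single result Row into a dict, and feed it to fillna
num_cols = ['weight', 'height', 'age', 'income']
means = df.agg(*[fn.mean(c).alias(c) for c in num_cols]).first().asDict()
df.fillna(means).show()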

3 Descriptive Statistics

3.1 Printing field types

df = spark.createDataFrame([
    (1, 143.5, 5.6, 28, 'M', 100000),
    (2, 167.2, 5.4, 45, 'M', None),
    (3, None, 5.2, None, None, None),
    (4, 144.5, 5.9, 33, 'M', None),
    (5, 133.2, 5.7, 54, 'F', None),
    (6, 124.1, 5.2, None, 'F', None),
    (7, 129.2, 5.3, 42, 'M', 76000),
], ['id', 'weight', 'height', 'age', 'gender', 'income'])

Similar to DESCRIBE TABLE in SQL

df.printSchema()
root
 |-- id: long (nullable = true)
 |-- weight: double (nullable = true)
 |-- height: double (nullable = true)
 |-- age: long (nullable = true)
 |-- gender: string (nullable = true)
 |-- income: long (nullable = true)
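
If you only need the column names and types as a Python list rather than the printed tree, df.dtypes has them:

# List of (column name, type) tuples
print(df.dtypes)
# e.g. [('id', 'bigint'), ('weight', 'double'), ('height', 'double'), ('age', 'bigint'), ('gender', 'string'), ('income', 'bigint')]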

3.2 Numeric summaries

Use the describe method, similar to summary in R and describe in pandas.

df.describe().show()
+-------+-----------------+------------------+-------------------+-----------------+------+-----------------+
|summary|               id|            weight|             height|              age|gender|           income|
+-------+-----------------+------------------+-------------------+-----------------+------+-----------------+
|  count|                7|                 6|                  7|                5|     6|                2|
|   mean|              4.0|140.28333333333333|  5.471428571428572|             40.4|  null|          88000.0|
| stddev|2.160246899469287|15.411348632312055|0.26903708365381984|10.21273714534943|  null|16970.56274847714|
|    min|                1|             124.1|                5.2|               28|     F|            76000|
|    max|                7|             167.2|                5.9|               54|     M|           100000|
+-------+-----------------+------------------+-------------------+-----------------+------+-----------------+
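
describe does not report quantiles; the summary method accepts the statistics you want, including approximate percentiles. A sketch adding the quartiles (output omitted):

# Ask summary for specific statistics, including percentiles
df.summary("count", "mean", "25%", "50%", "75%").show()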

4 Column Operations

4.1 Creating the data

from pyspark.sql import SparkSession
# Create a SparkSession object via the builder:
# .appName("testapp") names the application; .getOrCreate() creates a SparkSession or returns one that already exists
spark = SparkSession.builder.appName("testapp").getOrCreate()
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import col
# Build the data
data = [
    (0, 1, Vectors.dense([-1.0, -1.0])),
    (1, 0, Vectors.dense([-1.0, 1.0])),
    (3, 1, Vectors.dense([1.0, 1.0])),
    (2, 0, Vectors.dense([1.0, -1.0])),
    (3, 1, Vectors.dense([1.0, 1.0]))]
df = spark.createDataFrame(data, ["id", "label", "features"])
df.show()
+---+-----+-----------+
| id|label|   features|
+---+-----+-----------+
|  0|    1|[-1.0,-1.0]|
|  1|    0| [-1.0,1.0]|
|  3|    1|  [1.0,1.0]|
|  2|    0| [1.0,-1.0]|
|  3|    1|  [1.0,1.0]|
+---+-----+-----------+

4.2 Renaming columns

4.2.1 selectExpr

This approach is somewhat clumsy and not very flexible; it is workable when the DataFrame has few columns, and it renames through aliases.
The selectExpr function accepts SQL-like expressions; note that it produces a new DataFrame.

df.selectExpr('id','label as label_sql').show()
+---+---------+
| id|label_sql|
+---+---------+
|  0|        1|
|  1|        0|
|  3|        1|
|  2|        0|
|  3|        1|
+---+---------+

Or alias alongside '*', which effectively copies the column under the new name.

df.selectExpr('*','label as label_sql').show()
+---+-----+-----------+---------+
| id|label|   features|label_sql|
+---+-----+-----------+---------+
|  0|    1|[-1.0,-1.0]|        1|
|  1|    0| [-1.0,1.0]|        0|
|  3|    1|  [1.0,1.0]|        1|
|  2|    0| [1.0,-1.0]|        0|
|  3|    1|  [1.0,1.0]|        1|
+---+-----+-----------+---------+

4.2.2 select

select is a variant of selectExpr with essentially the same logic, just slightly more verbose code.

df.select('*',df.label.alias('label_sql')).show()
+---+-----+-----------+---------+
| id|label|   features|label_sql|
+---+-----+-----------+---------+
|  0|    1|[-1.0,-1.0]|        1|
|  1|    0| [-1.0,1.0]|        0|
|  3|    1|  [1.0,1.0]|        1|
|  2|    0| [1.0,-1.0]|        0|
|  3|    1|  [1.0,1.0]|        1|
+---+-----+-----------+---------+
df.select('*',col("label").alias("label_sql")).show()
+---+-----+-----------+---------+
| id|label|   features|label_sql|
+---+-----+-----------+---------+
|  0|    1|[-1.0,-1.0]|        1|
|  1|    0| [-1.0,1.0]|        0|
|  3|    1|  [1.0,1.0]|        1|
|  2|    0| [1.0,-1.0]|        0|
|  3|    1|  [1.0,1.0]|        1|
+---+-----+-----------+---------+

4.2.3 Renaming a specified column

Rename the specified column directly; simple and convenient.
Easily the best option.

df.withColumnRenamed("label", "label_sql").show()
+---+---------+-----------+
| id|label_sql|   features|
+---+---------+-----------+
|  0|        1|[-1.0,-1.0]|
|  1|        0| [-1.0,1.0]|
|  3|        1|  [1.0,1.0]|
|  2|        0| [1.0,-1.0]|
|  3|        1|  [1.0,1.0]|
+---+---------+-----------+
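
withColumnRenamed renames one column per call, so renaming several columns means chaining calls. A small sketch:

# Chain one withColumnRenamed call per column to rename
df.withColumnRenamed("id", "id_new") \
  .withColumnRenamed("label", "label_new").show()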

4.2.4 The toDF method

This method is well suited to renaming many columns at once; whether toDF costs extra memory, I am not sure.
Documentation:

 toDF(*cols) 

    Returns a new class:DataFrame with the new specified column names
    Parameters:	cols – list of new column names (string)
    
df.toDF(*['id1','label1','feature1']).show()
+---+------+-----------+
|id1|label1|   feature1|
+---+------+-----------+
|  0|     1|[-1.0,-1.0]|
|  1|     0| [-1.0,1.0]|
|  3|     1|  [1.0,1.0]|
|  2|     0| [1.0,-1.0]|
|  3|     1|  [1.0,1.0]|
+---+------+-----------+
df.toDF('id1','label1','feature1').show()
+---+------+-----------+
|id1|label1|   feature1|
+---+------+-----------+
|  0|     1|[-1.0,-1.0]|
|  1|     0| [-1.0,1.0]|
|  3|     1|  [1.0,1.0]|
|  2|     0| [1.0,-1.0]|
|  3|     1|  [1.0,1.0]|
+---+------+-----------+

The four methods above cover most day-to-day needs; there are other approaches you can Google.
Reference: https://stackoverflow.com/questions/34077353/how-to-change-dataframe-column-names-in-pyspark

4.3 Adding a column

Adding a column for its own sake is rarely needed; in practice you compute something from existing columns and store the result as a new column.
There are two ways to add a column:

  1. Derive it from the DataFrame's own columns.
  2. Assign it from another DataFrame.

4.3.1 selectExpr

Any computation SQL can express works in selectExpr, but only over existing columns.

df.selectExpr('*','label+10 AS label_sql').show()
+---+-----+-----------+---------+
| id|label|   features|label_sql|
+---+-----+-----------+---------+
|  0|    1|[-1.0,-1.0]|       11|
|  1|    0| [-1.0,1.0]|       10|
|  3|    1|  [1.0,1.0]|       11|
|  2|    0| [1.0,-1.0]|       10|
|  3|    1|  [1.0,1.0]|       11|
+---+-----+-----------+---------+
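
The more common DataFrame API for adding a derived column is withColumn, which takes the new column name and a Column expression. A sketch equivalent to the selectExpr line above:

# Add a derived column computed from an existing one
df.withColumn('label_sql', df.label + 10).show()
# Same result as selectExpr('*', 'label+10 AS label_sql')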

4.3.2 Complex computations on existing columns

This is roughly the apply pattern from pandas, applied to columns.
When running on a real cluster, every node needs numpy installed, because the toArray()/toDF() step here relies on numpy.

def extract(row):
    return (row.id,row.label, ) + tuple(row.features.toArray().tolist())
df.rdd.map(lambda row: extract(row)).toDF(["id","label",'feature_a','feature_b']).show() 
+---+-----+---------+---------+
| id|label|feature_a|feature_b|
+---+-----+---------+---------+
|  0|    1|     -1.0|     -1.0|
|  1|    0|     -1.0|      1.0|
|  3|    1|      1.0|      1.0|
|  2|    0|      1.0|     -1.0|
|  3|    1|      1.0|      1.0|
+---+-----+---------+---------+
df.rdd.map(
    lambda x: [x['id'], x['label']] + x['features'].toArray().tolist()).toDF(
        ['id', 'label', 'feature_a', 'feature_b']).show()
+---+-----+---------+---------+
| id|label|feature_a|feature_b|
+---+-----+---------+---------+
|  0|    1|     -1.0|     -1.0|
|  1|    0|     -1.0|      1.0|
|  3|    1|      1.0|      1.0|
|  2|    0|      1.0|     -1.0|
|  3|    1|      1.0|      1.0|
+---+-----+---------+---------+
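
On Spark 3.0 or later, the RDD round-trip can be avoided: pyspark.ml.functions.vector_to_array converts the vector column into an array column, whose elements can then be selected by index. A sketch under that version assumption:

from pyspark.ml.functions import vector_to_array
# Turn the ml Vector column into an array column, then pull out its elements
df.withColumn('arr', vector_to_array(col('features'))) \
  .select('id', 'label',
          col('arr')[0].alias('feature_a'),
          col('arr')[1].alias('feature_b')).show()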

I no longer remember when this happened, or when these notes were written. Time keeps turning, memories keep fading. Does life, in the end, give more than it takes away?

                                2022-04-24, Jiulonghu, Jiangning District, Nanjing
