https://zhuanlan.zhihu.com/p/34901683
1 Screening for Duplicate Data
Goals for this section:
- Filter out duplicate rows.
- For a single column, find duplicated values.
- For a combination of columns, find duplicated values.
1.1 Duplicate rows
from pyspark.sql import SparkSession
# Create a SparkSession via the builder:
# .appName("testapp") names the application; .getOrCreate() creates a new SparkSession or returns an existing one
spark = SparkSession.builder.appName("testapp").getOrCreate()
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import col
# Build the sample data
data = [
    (0, 1, Vectors.dense([-1.0, -1.0])),
    (1, 0, Vectors.dense([-1.0, 1.0])),
    (3, 1, Vectors.dense([1.0, 1.0])),
    (2, 0, Vectors.dense([1.0, -1.0])),
    (3, 1, Vectors.dense([1.0, 1.0])),
]
df = spark.createDataFrame(data, ["id", "label", "features"])
df.show()
+---+-----+-----------+
| id|label| features|
+---+-----+-----------+
| 0| 1|[-1.0,-1.0]|
| 1| 0| [-1.0,1.0]|
| 3| 1| [1.0,1.0]|
| 2| 0| [1.0,-1.0]|
| 3| 1| [1.0,1.0]|
+---+-----+-----------+
The DataFrame above contains a duplicate row; we want to find it and remove it.
# Compare the row count before and after deduplication
print('Count of distinct rows:',df.distinct().count())
print('Count of rows:',df.count())
Count of distinct rows: 4
Count of rows: 5
# Drop duplicate rows
df.drop_duplicates().show()
+---+-----+-----------+
| id|label| features|
+---+-----+-----------+
| 3| 1| [1.0,1.0]|
| 0| 1|[-1.0,-1.0]|
| 1| 0| [-1.0,1.0]|
| 2| 0| [1.0,-1.0]|
+---+-----+-----------+
# drop_duplicates is an alias of dropDuplicates: same behavior, different name
df.dropDuplicates().show()
+---+-----+-----------+
| id|label| features|
+---+-----+-----------+
| 3| 1| [1.0,1.0]|
| 0| 1|[-1.0,-1.0]|
| 1| 0| [-1.0,1.0]|
| 2| 0| [1.0,-1.0]|
+---+-----+-----------+
# You can also deduplicate on just a subset of columns (rows with identical values in those columns)
df.dropDuplicates(['id','label']).show()
+---+-----+-----------+
| id|label| features|
+---+-----+-----------+
| 1| 0| [-1.0,1.0]|
| 0| 1|[-1.0,-1.0]|
| 2| 0| [1.0,-1.0]|
| 3| 1| [1.0,1.0]|
+---+-----+-----------+
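The examples above only drop duplicates; to actually pull out the duplicated rows (the first goal listed at the top), one possible approach is to group on key columns, keep the groups that occur more than once, and join back. A minimal sketch, grouping on id and label since grouping directly on the vector column may not be supported:
# Keep the (id, label) combinations that occur more than once
dup_keys = df.groupBy('id', 'label').count().filter(col('count') > 1)
dup_keys.show()
# Join back on those keys to recover the full duplicated rows
df.join(dup_keys.select('id', 'label'), on=['id', 'label']).show()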
1.2 Duplicate values in a column
# Use the helpers in pyspark.sql.functions
from pyspark.sql import functions as fn
# Aggregate with agg
# alias names the result, like AS in SQL
# The idea is similar to value_counts in pandas
df.agg(
    fn.count('id').alias('id_count'),
    fn.countDistinct('id').alias('id_distinctcount'),
    fn.count('label').alias('label_count'),
    fn.countDistinct('label').alias('label_distinctcount'),
).show()
+--------+----------------+-----------+-------------------+
|id_count|id_distinctcount|label_count|label_distinctcount|
+--------+----------------+-----------+-------------------+
| 5| 4| 5| 2|
+--------+----------------+-----------+-------------------+
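The aggregate above only says that id has 5 rows but 4 distinct values; to see which id value actually repeats, a small follow-up sketch:
# Group on the single column and keep the values that occur more than once
df.groupBy('id').count().filter(fn.col('count') > 1).show()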
2 Missing Values
2.1 Count missing values per row
df = spark.createDataFrame([
(1, 143.5, 5.6, 28, 'M', 100000),
(2, 167.2, 5.4, 45, 'M', None),
(3, None, 5.2, None, None, None),
(4, 144.5, 5.9, 33, 'M', None),
(5, 133.2, 5.7, 54, 'F', None),
(6, 124.1, 5.2, None, 'F', None),
(7, 129.2, 5.3, 42, 'M', 76000),
], ['id', 'weight', 'height', 'age', 'gender', 'income'])
df.show()
+---+------+------+----+------+------+
| id|weight|height| age|gender|income|
+---+------+------+----+------+------+
| 1| 143.5| 5.6| 28| M|100000|
| 2| 167.2| 5.4| 45| M| null|
| 3| null| 5.2|null| null| null|
| 4| 144.5| 5.9| 33| M| null|
| 5| 133.2| 5.7| 54| F| null|
| 6| 124.1| 5.2|null| F| null|
| 7| 129.2| 5.3| 42| M| 76000|
+---+------+------+----+------+------+
To count missing values per row, use the map function on the RDD.
# Convert to an RDD to use map (similar to the apply family in R)
# Note the lambda: the per-row logic reads much like pandas
df.rdd.map(lambda x: [x['id'], sum([c is None for c in x])]).toDF(['id', 'null_num']).show()
+---+--------+
| id|null_num|
+---+--------+
| 1| 0|
| 2| 1|
| 3| 4|
| 4| 1|
| 5| 1|
| 6| 2|
| 7| 0|
+---+--------+
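A possible alternative that stays in the DataFrame API (avoiding the RDD round-trip) is to turn isnull into a 0/1 flag per column and add the flags up; a minimal sketch:
from functools import reduce
from operator import add
# One 0/1 indicator per column; their sum is the per-row null count
null_flags = [fn.isnull(c).cast('int') for c in df.columns]
df.select('id', reduce(add, null_flags).alias('null_num')).show()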
2.2 Count missing values per column
isnull flags whether a value is missing, returning true or false.
# Flag missing values
df.select('weight',
          fn.isnull('weight').alias('weight_null'),
          'age',
          fn.isnull('age').alias('age_null')).show()
+------+-----------+----+--------+
|weight|weight_null| age|age_null|
+------+-----------+----+--------+
| 143.5| false| 28| false|
| 167.2| false| 45| false|
| null| true|null| true|
| 144.5| false| 33| false|
| 133.2| false| 54| false|
| 124.1| false|null| true|
| 129.2| false| 42| false|
+------+-----------+----+--------+
To count missing values per column the logic is SQL-like: count() on a column skips nulls, so the missing count is count('*') minus count(column).
# count() only counts non-null values, so these are the non-null counts per column
df.select(
    fn.count('weight').alias('weight_count'),
    fn.count('age').alias('age_count'),
    fn.count('*').alias('row_num')).show()
+------------+---------+-------+
|weight_count|age_count|row_num|
+------------+---------+-------+
| 6| 5| 7|
+------------+---------+-------+
For every column, compute the fraction of values that are missing:
df.select([(1 - fn.count(i) / fn.count('*')).alias(i + '_missing')
for i in df.columns]).show()
+----------+------------------+--------------+------------------+------------------+------------------+
|id_missing| weight_missing|height_missing| age_missing| gender_missing| income_missing|
+----------+------------------+--------------+------------------+------------------+------------------+
| 0.0|0.1428571428571429| 0.0|0.2857142857142857|0.1428571428571429|0.7142857142857143|
+----------+------------------+--------------+------------------+------------------+------------------+
Convert to percentages. Tacking on a literal percent sign is a bit awkward here; one option is shown in the sketch after the output below.
df.select([
fn.round((1 - fn.count(i) / fn.count('*')) * 100,
2).alias(i + '_missingrate') for i in df.columns
]).show()
+--------------+------------------+------------------+---------------+------------------+------------------+
|id_missingrate|weight_missingrate|height_missingrate|age_missingrate|gender_missingrate|income_missingrate|
+--------------+------------------+------------------+---------------+------------------+------------------+
| 0.0| 14.29| 0.0| 28.57| 14.29| 71.43|
+--------------+------------------+------------------+---------------+------------------+------------------+
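One way to attach the percent sign without switching to SQL is to cast the rounded number to a string and concatenate fn.lit('%'); a minimal sketch (the resulting columns are strings, so they are for display only):
df.select([
    fn.concat(
        fn.round((1 - fn.count(i) / fn.count('*')) * 100, 2).cast('string'),
        fn.lit('%')
    ).alias(i + '_missingrate') for i in df.columns
]).show()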
The summary method is also convenient:
df.summary("count").show()
+-------+---+------+------+---+------+------+
|summary| id|weight|height|age|gender|income|
+-------+---+------+------+---+------+------+
| count| 7| 6| 7| 5| 6| 2|
+-------+---+------+------+---+------+------+
2.3 Dropping rows with missing values
Drop rows whose missing values exceed some threshold. See the documentation: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameNaFunctions
drop(how='any', thresh=None, subset=None)
Parameters:
how – 'any' or 'all'. If 'any', drop a row if it contains any nulls. If 'all', drop a row only if all its values are null.
[any: drop the row if it has any missing value; all: drop the row only if every value is missing]
thresh – int, default None. If specified, drop rows that have less than thresh non-null values. This overwrites the how parameter.
[thresh is the minimum number of non-null values a row must have to be kept; e.g. thresh=4 drops every row with fewer than 4 non-null values]
subset – optional list of column names to consider.
[only the listed columns are checked for nulls; the test need not involve every column]
# For each row, count the non-null fields and join that count back on; len(x) is the number of columns (6 here)
df.join(df.rdd.map(lambda x: [x['id'], len(x) - sum([c is None for c in x])]).toDF(['id', 'not_null']), on='id').show()
+---+------+------+----+------+------+--------+
| id|weight|height| age|gender|income|not_null|
+---+------+------+----+------+------+--------+
| 7| 129.2| 5.3| 42| M| 76000| 6|
| 6| 124.1| 5.2|null| F| null| 4|
| 5| 133.2| 5.7| 54| F| null| 5|
| 1| 143.5| 5.6| 28| M|100000| 6|
| 3| null| 5.2|null| null| null| 2|
| 2| 167.2| 5.4| 45| M| null| 5|
| 4| 144.5| 5.9| 33| M| null| 5|
+---+------+------+----+------+------+--------+
Drop rows with fewer than 5 non-null values.
Here:
the row with id=3 has 4 missing values and 2 non-missing;
the row with id=6 has 2 missing values and 4 non-missing.
df.dropna(thresh=5).show()
+---+------+------+---+------+------+
| id|weight|height|age|gender|income|
+---+------+------+---+------+------+
| 1| 143.5| 5.6| 28| M|100000|
| 2| 167.2| 5.4| 45| M| null|
| 4| 144.5| 5.9| 33| M| null|
| 5| 133.2| 5.7| 54| F| null|
| 7| 129.2| 5.3| 42| M| 76000|
+---+------+------+---+------+------+
The null check can also be restricted to a few columns, counting missing values only there.
Within the four columns 'height', 'age', 'gender', 'income':
the row with id=3 has only 1 non-null column;
the row with id=6 has only 2 non-null columns.
df.dropna(how='any', thresh=3, subset=('height', 'age', 'gender', 'income')).show()
+---+------+------+---+------+------+
| id|weight|height|age|gender|income|
+---+------+------+---+------+------+
| 1| 143.5| 5.6| 28| M|100000|
| 2| 167.2| 5.4| 45| M| null|
| 4| 144.5| 5.9| 33| M| null|
| 5| 133.2| 5.7| 54| F| null|
| 7| 129.2| 5.3| 42| M| 76000|
+---+------+------+---+------+------+
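For completeness, how='all' is never exercised above; a minimal sketch that drops a row only when every column in the chosen subset is null (with this subset, only the id=3 row qualifies):
# Drop a row only if weight, age, gender and income are all null
df.dropna(how='all', subset=('weight', 'age', 'gender', 'income')).show()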
2.4 Filling missing values
The fillna method does this; its parameters:
fillna(value, subset=None)
Parameters:
value – int, long, float, string, bool or dict. Value to replace null values with. If the value is a dict, then subset is ignored and value must be a mapping from column name (string) to replacement value. The replacement value must be an int, long, float, boolean, or string.
subset – optional list of column names to consider. Columns specified in subset that do not have matching data type are ignored. For example, if value is a string, and subset contains a non-string column, then the non-string column is simply ignored.
The English docstring says it all; passing value as a dict is the most direct and convenient option. What value to fill with is a separate question and not covered here.
df.fillna(value = {'age':-999,'gender':'missing'}).show()
+---+------+------+----+-------+------+
| id|weight|height| age| gender|income|
+---+------+------+----+-------+------+
| 1| 143.5| 5.6| 28| M|100000|
| 2| 167.2| 5.4| 45| M| null|
| 3| null| 5.2|-999|missing| null|
| 4| 144.5| 5.9| 33| M| null|
| 5| 133.2| 5.7| 54| F| null|
| 6| 124.1| 5.2|-999| F| null|
| 7| 129.2| 5.3| 42| M| 76000|
+---+------+------+----+-------+------+
3 Descriptive Statistics
3.1 Print the column types
df = spark.createDataFrame([
(1, 143.5, 5.6, 28, 'M', 100000),
(2, 167.2, 5.4, 45, 'M', None),
(3, None, 5.2, None, None, None),
(4, 144.5, 5.9, 33, 'M', None),
(5, 133.2, 5.7, 54, 'F', None),
(6, 124.1, 5.2, None, 'F', None),
(7, 129.2, 5.3, 42, 'M', 76000),
], ['id', 'weight', 'height', 'age', 'gender', 'income'])
Similar to DESCRIBE TABLE in SQL:
df.printSchema()
root
|-- id: long (nullable = true)
|-- weight: double (nullable = true)
|-- height: double (nullable = true)
|-- age: long (nullable = true)
|-- gender: string (nullable = true)
|-- income: long (nullable = true)
3.2 Summary statistics
Use the describe method, similar to summary in R and DataFrame.describe in pandas:
df.describe().show()
+-------+-----------------+------------------+-------------------+-----------------+------+-----------------+
|summary| id| weight| height| age|gender| income|
+-------+-----------------+------------------+-------------------+-----------------+------+-----------------+
| count| 7| 6| 7| 5| 6| 2|
| mean| 4.0|140.28333333333333| 5.471428571428572| 40.4| null| 88000.0|
| stddev|2.160246899469287|15.411348632312055|0.26903708365381984|10.21273714534943| null|16970.56274847714|
| min| 1| 124.1| 5.2| 28| F| 76000|
| max| 7| 167.2| 5.9| 54| M| 100000|
+-------+-----------------+------------------+-------------------+-----------------+------+-----------------+
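describe is limited to count/mean/stddev/min/max; the summary method also accepts percentile strings, which is handy for a quick look at the quartiles. A minimal sketch (percentiles for non-numeric columns come back as null):
df.summary('min', '25%', '50%', '75%', 'max').show()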
4 Column Operations
4.1 Create the data
from pyspark.sql import SparkSession
# Create a SparkSession via the builder:
# .appName("testapp") names the application; .getOrCreate() creates a new SparkSession or returns an existing one
spark = SparkSession.builder.appName("testapp").getOrCreate()
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import col
# Build the sample data
data = [
    (0, 1, Vectors.dense([-1.0, -1.0])),
    (1, 0, Vectors.dense([-1.0, 1.0])),
    (3, 1, Vectors.dense([1.0, 1.0])),
    (2, 0, Vectors.dense([1.0, -1.0])),
    (3, 1, Vectors.dense([1.0, 1.0])),
]
df = spark.createDataFrame(data, ["id", "label", "features"])
df.show()
+---+-----+-----------+
| id|label| features|
+---+-----+-----------+
| 0| 1|[-1.0,-1.0]|
| 1| 0| [-1.0,1.0]|
| 3| 1| [1.0,1.0]|
| 2| 0| [1.0,-1.0]|
| 3| 1| [1.0,1.0]|
+---+-----+-----------+
4.2 Renaming columns
4.2.1 selectExpr
This approach is rather clumsy and not very flexible; it is workable when the table has only a few columns, renaming them via aliases.
selectExpr processes columns with SQL-like expressions; note that it returns a new DataFrame.
df.selectExpr('id','label as label_sql').show()
+---+---------+
| id|label_sql|
+---+---------+
| 0| 1|
| 1| 0|
| 3| 1|
| 2| 0|
| 3| 1|
+---+---------+
Or keep every column with '*' and alias on top, which effectively copies the column under the new name:
df.selectExpr('*','label as label_sql').show()
+---+-----+-----------+---------+
| id|label| features|label_sql|
+---+-----+-----------+---------+
| 0| 1|[-1.0,-1.0]| 1|
| 1| 0| [-1.0,1.0]| 0|
| 3| 1| [1.0,1.0]| 1|
| 2| 0| [1.0,-1.0]| 0|
| 3| 1| [1.0,1.0]| 1|
+---+-----+-----------+---------+
4.2.2 select
selectExpr is a variant of select; the logic is essentially the same, but the select version is a little more verbose.
df.select('*',df.label.alias('label_sql')).show()
+---+-----+-----------+---------+
| id|label| features|label_sql|
+---+-----+-----------+---------+
| 0| 1|[-1.0,-1.0]| 1|
| 1| 0| [-1.0,1.0]| 0|
| 3| 1| [1.0,1.0]| 1|
| 2| 0| [1.0,-1.0]| 0|
| 3| 1| [1.0,1.0]| 1|
+---+-----+-----------+---------+
df.select('*',col("label").alias("label_sql")).show()
+---+-----+-----------+---------+
| id|label| features|label_sql|
+---+-----+-----------+---------+
| 0| 1|[-1.0,-1.0]| 1|
| 1| 0| [-1.0,1.0]| 0|
| 3| 1| [1.0,1.0]| 1|
| 2| 0| [1.0,-1.0]| 0|
| 3| 1| [1.0,1.0]| 1|
+---+-----+-----------+---------+
4.2.3 Renaming a specific column
withColumnRenamed renames the given column directly and conveniently.
Arguably the best option.
df.withColumnRenamed("label", "label_sql").show()
+---+---------+-----------+
| id|label_sql| features|
+---+---------+-----------+
| 0| 1|[-1.0,-1.0]|
| 1| 0| [-1.0,1.0]|
| 3| 1| [1.0,1.0]|
| 2| 0| [1.0,-1.0]|
| 3| 1| [1.0,1.0]|
+---+---------+-----------+
4.2.4 The toDF method
This method suits renaming several columns at once (whether toDF costs extra memory, I am not sure).
From the documentation:
toDF(*cols)
Returns a new DataFrame with the new specified column names
Parameters: cols – list of new column names (string)
df.toDF(*['id1','label1','feature1']).show()
+---+------+-----------+
|id1|label1| feature1|
+---+------+-----------+
| 0| 1|[-1.0,-1.0]|
| 1| 0| [-1.0,1.0]|
| 3| 1| [1.0,1.0]|
| 2| 0| [1.0,-1.0]|
| 3| 1| [1.0,1.0]|
+---+------+-----------+
df.toDF('id1','label1','feature1').show()
+---+------+-----------+
|id1|label1| feature1|
+---+------+-----------+
| 0| 1|[-1.0,-1.0]|
| 1| 0| [-1.0,1.0]|
| 3| 1| [1.0,1.0]|
| 2| 0| [1.0,-1.0]|
| 3| 1| [1.0,1.0]|
+---+------+-----------+
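Because toDF takes the complete list of names, it pairs well with a list comprehension when many columns need the same change; a small sketch that appends a hypothetical _v1 suffix to every column:
# Rename every column in one go by rebuilding the whole name list
df.toDF(*[c + '_v1' for c in df.columns]).show()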
The four methods above cover most day-to-day needs; other approaches can be found with a quick search.
Reference: https://stackoverflow.com/questions/34077353/how-to-change-dataframe-column-names-in-pyspark
4.3 Adding a column
Adding a column by itself is rarely the point; more often you compute something from existing columns and store the result as a new column.
A new column can come from two places:
- values derived from the DataFrame's own columns
- values assigned from another DataFrame
4.3.1 selectExpr
Anything SQL can compute can be written with selectExpr, but only from existing columns:
df.selectExpr('*','label+10 AS label_sql').show()
+---+-----+-----------+---------+
| id|label| features|label_sql|
+---+-----+-----------+---------+
| 0| 1|[-1.0,-1.0]| 11|
| 1| 0| [-1.0,1.0]| 10|
| 3| 1| [1.0,1.0]| 11|
| 2| 0| [1.0,-1.0]| 10|
| 3| 1| [1.0,1.0]| 11|
+---+-----+-----------+---------+
4.3.2 More complex computations on existing columns
This is roughly the apply pattern in pandas: operate on the columns row by row.
When this runs on a real cluster, numpy must be installed on every node, because the toArray/toDF conversion used here relies on it.
# Unpack the dense vector into separate columns
def extract(row):
    return (row.id, row.label) + tuple(row.features.toArray().tolist())
df.rdd.map(lambda row: extract(row)).toDF(["id", "label", "feature_a", "feature_b"]).show()
+---+-----+---------+---------+
| id|label|feature_a|feature_b|
+---+-----+---------+---------+
| 0| 1| -1.0| -1.0|
| 1| 0| -1.0| 1.0|
| 3| 1| 1.0| 1.0|
| 2| 0| 1.0| -1.0|
| 3| 1| 1.0| 1.0|
+---+-----+---------+---------+
df.rdd.map(
    lambda x: [x['id'], x['label']] + x['features'].toArray().tolist()).toDF(
        ['id', 'label', 'feature_a', 'feature_b']).show()
+---+-----+---------+---------+
| id|label|feature_a|feature_b|
+---+-----+---------+---------+
| 0| 1| -1.0| -1.0|
| 1| 0| -1.0| 1.0|
| 3| 1| 1.0| 1.0|
| 2| 0| 1.0| -1.0|
| 3| 1| 1.0| 1.0|
+---+-----+---------+---------+
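On Spark 3.0 or later, a possible pure-DataFrame alternative is pyspark.ml.functions.vector_to_array, which avoids the RDD round-trip entirely; a minimal sketch:
from pyspark.ml.functions import vector_to_array
from pyspark.sql.functions import col
# Turn the vector into an array column, then pull its elements out as ordinary columns
df.withColumn('arr', vector_to_array('features')).select(
    'id', 'label',
    col('arr')[0].alias('feature_a'),
    col('arr')[1].alias('feature_b')).show()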
I no longer remember when this was, or when these notes were taken. Time keeps turning and memories keep scattering. Does life, in the end, give more than it takes away?
2022-04-24, Jiulonghu, Jiangning District, Nanjing