Spark-SQL note

Spark-SQL

Spark SQL is the Spark module for working with structured data.

Manipulating RDDs from Python is slow: the Python code has to be turned into something the executors can run inside the JVM, so every operation involves switching between language engines and doing inter-process communication, both of which are time-consuming.
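The difference can be seen by writing the same computation both ways; a minimal sketch (the toy data and app name below are illustrative, not from this note). The RDD lambda is pickled and executed in separate Python worker processes, while the DataFrame expression is evaluated entirely inside the JVM:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('rdd_vs_df').getOrCreate()
sc = spark.sparkContext

nums_rdd = sc.parallelize(range(100))
even_rdd = nums_rdd.filter(lambda x: x % 2 == 0).count()    # Python <-> JVM round trips

nums_df = spark.range(100)                                   # DataFrame with a single 'id' column
even_df = nums_df.filter(nums_df['id'] % 2 == 0).count()     # runs in the JVM, no Python workers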

DataFrame

  • A distributed dataset built on top of RDDs, similar to a two-dimensional table in a relational database
  • DataFrame introduces a schema and off-heap storage (see the schema sketch after this list)
    • 1. This fixes the drawbacks of RDDs:
      • large serialization and deserialization overhead
      • frequent object creation and destruction causing heavy GC pressure
    • 2. But it gives up some advantages of RDDs:
      • RDDs are type-checked at compile time
      • RDDs retain the object-oriented programming style
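A minimal sketch of what "introducing a schema" looks like in practice, assuming the `spark` session created in the hands-on section below (column names here are just examples):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# declare column names and types explicitly instead of letting Spark infer them
schema = StructType([
    StructField('alpha', StringType(), True),    # nullable string column
    StructField('digit', IntegerType(), True),   # nullable integer column
])
df_schema = spark.createDataFrame([('a', 1), ('b', 2)], schema=schema)
df_schema.printSchema()   # the declared types show up in the schema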

RDD transformations written through Spark SQL run about 1x slower (roughly twice the runtime) than the same transformations written directly with Spark Core.

Spark SQL features

  • 1. Easy to integrate
  • 2. Unified access to data sources (see the reader sketch after this list)
  • 3. Compatible with Hive
  • 4. Standard database connectivity (JDBC/ODBC)
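For point 2, the same reader API covers different formats; a small sketch assuming an existing SparkSession named `spark` (the file paths are placeholders):

df_csv     = spark.read.format('csv').option('header', 'true').load('data.csv')
df_json    = spark.read.format('json').load('data.json')
df_parquet = spark.read.format('parquet').load('data.parquet')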

Characteristics of DataFrame:

  • Once created, an RDD or DataFrame cannot be modified; transformations only produce new RDDs or DataFrames
  • Lazy evaluation: transformations are only executed when an action is triggered (see the sketch after this list)
  • Both DataFrames and RDDs are distributed
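A tiny sketch of lazy evaluation, assuming a DataFrame `df` with an integer column `digit` like the one built in the hands-on section below:

doubled = df.withColumn('digit_x2', df['digit'] * 2)   # transformation: nothing is computed yet
doubled.show()                                         # action: triggers the actual job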

An RDD is a distributed collection of Java objects, while a DataFrame is a distributed collection of Row objects. Besides offering richer operators than RDDs, the more important advantages of DataFrames are higher execution efficiency, reduced data reads, and optimized execution plans.
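The optimized execution plan can be inspected directly; a sketch against the same toy DataFrame assumed above:

df.filter(df['digit'] > 1).select('alpha').explain()   # prints the physical plan chosen by the optimizer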

Hands-on:

Create a DataFrame from an RDD
>>> from pyspark.sql import SparkSession
>>> from pyspark.sql import Row
>>> spark = SparkSession.builder.appName('test').getOrCreate()
>>> sc = spark.sparkContext
>>> l = [('a', 1), ('b', 2), ('c', 3), ('d', 4)]
>>> rdd = sc.parallelize(l)
>>> p = rdd.map(lambda t: Row(alpha=t[0], digit=int(t[1])))
>>> df = spark.createDataFrame(p)
>>> df.show()                                                                   
+-----+-----+
|alpha|digit|
+-----+-----+
|    a|    1|
|    b|    2|
|    c|    3|
|    d|    4|
+-----+-----+
Read data from a CSV file
>>> df = spark.read.format("csv").option("header", "true").load("test.csv")
>>> df.printSchema()
root
 |-- id: string (nullable = true)
 |-- value1: string (nullable = true)
 |-- value2: string (nullable = true)
 |-- value3: string (nullable = true)

>>> df.show() # pass a number to limit how many rows are displayed
+---+------+------+------+
| id|value1|value2|value3|
+---+------+------+------+
|  1|   123|   1.4|    23|
|  2|   110|  null|    18|
|  3|  null|   2.1|    19|
+---+------+------+------+

>>> df.count()
3
>>> df.columns
['id', 'value1', 'value2', 'value3']
>>> 
Basic DataFrame operations
>>> df.withColumn('value4', df['value1']*3).show() # add a new column
+---+------+------+------+------+
| id|value1|value2|value3|value4|
+---+------+------+------+------+
|  1|   123|   1.4|    23| 369.0|
|  2|   110|  null|    18| 330.0|
|  3|  null|   2.1|    19|  null|
+---+------+------+------+------+

>>> df.drop('value3').show() # drop a column
+---+------+------+
| id|value1|value2|
+---+------+------+
|  1|   123|   1.4|
|  2|   110|  null|
|  3|  null|   2.1|
+---+------+------+

>>> df.describe().show() # summary statistics
+-------+---+-----------------+------------------+------------------+
|summary| id|           value1|            value2|            value3|
+-------+---+-----------------+------------------+------------------+
|  count|  3|                2|                 2|                 3|
|   mean|2.0|            116.5|              1.75|              20.0|
| stddev|1.0|9.192388155425117|0.4949747468305834|2.6457513110645907|
|    min|  1|              110|               1.4|                18|
|    max|  3|              123|               2.1|                23|
+-------+---+-----------------+------------------+------------------+

>>> df.describe('value1').show() # summary statistics for a single column
+-------+-----------------+
|summary|           value1|
+-------+-----------------+
|  count|                2|
|   mean|            116.5|
| stddev|9.192388155425117|
|    min|              110|
|    max|              123|
+-------+-----------------+

>>> df.select('value1', 'value2').show() # select a subset of columns
+------+------+
|value1|value2|
+------+------+
|   123|   1.4|
|   110|  null|
|  null|   2.1|
+------+------+

>>> df.select('value1').distinct().count() # operate on a single column
3
Grouped aggregation
# grouped aggregation: groupby(colname).agg({'col': 'fun', 'col2': 'fun2'})
>>> df.groupby('value1').agg({'value2':'mean','value3':'min'}).show()
+------+-----------+-----------+                                                
|value1|avg(value2)|min(value3)|
+------+-----------+-----------+
|   110|       null|         18|
|  null|        2.1|         19|
|   123|        1.4|         23|
+------+-----------+-----------+


# avg(), count(), countDistinct(), first(), kurtosis(),
# max(), mean(), min(), skewness(), stddev(), stddev_pop(),
# stddev_samp(), sum(), sumDistinct(), var_pop(), var_samp() and variance()
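The listed functions can also be called directly from pyspark.sql.functions, which additionally lets you rename the result columns; a sketch against the same df:

from pyspark.sql import functions as F

df.groupby('value1').agg(
    F.mean('value2').alias('value2_mean'),   # same as avg(value2) above, but with a readable name
    F.min('value3').alias('value3_min'),
).show()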
Splitting the dataset
>>> train, test = df.randomSplit([0.8, 0.2])
User-defined functions (udf)
>>> df.printSchema()
root
 |-- id: string (nullable = true)
 |-- value1: string (nullable = true)
 |-- value2: string (nullable = true)
 |-- value3: string (nullable = true)

>>> from pyspark.sql.types import StringType
>>> from pyspark.sql.functions import udf
>>> def is_eighteen(arg):
...     return '1' if arg == '18' else '0'
... 
>>> check = udf(is_eighteen, StringType())
>>> _df = df.withColumn('is_18', check(df['value3'])).filter('value3 <> 0') # '<>' means !=
>>> _df.show()
+---+------+------+------+-----+
| id|value1|value2|value3|is_18|
+---+------+------+------+-----+
|  1|   123|   1.4|    23|    0|
|  2|   110|  null|    18|    1|
|  3|  null|   2.1|    19|    0|
+---+------+------+------+-----+

Data cleaning
>>> df = spark.createDataFrame([
...   (1, 144.5, 5.9, 33, 'M'),
...   (2, 167.2, 5.4, 45, 'M'),
...   (3, 124.1, 5.2, 23, 'F'),
...   (4, 144.5, 5.9, 33, 'M'),
...   (5, 133.2, 5.7, 54, 'F'),
...   (3, 124.1, 5.2, 23, 'F'),
...   (5, 129.2, 5.3, 42, 'M'),
... ], ['id', 'weight', 'height', 'age', 'gender'])
>>> df.show()
+---+------+------+---+------+
| id|weight|height|age|gender|
+---+------+------+---+------+
|  1| 144.5|   5.9| 33|     M|
|  2| 167.2|   5.4| 45|     M|
|  3| 124.1|   5.2| 23|     F|
|  4| 144.5|   5.9| 33|     M|
|  5| 133.2|   5.7| 54|     F|
|  3| 124.1|   5.2| 23|     F|
|  5| 129.2|   5.3| 42|     M|
+---+------+------+---+------+

>>> # drop duplicate rows
>>> df2 = df.dropDuplicates()
>>> df2.show()
+---+------+------+---+------+                                                  
| id|weight|height|age|gender|
+---+------+------+---+------+
|  5| 133.2|   5.7| 54|     F|
|  5| 129.2|   5.3| 42|     M|
|  1| 144.5|   5.9| 33|     M|
|  4| 144.5|   5.9| 33|     M|
|  2| 167.2|   5.4| 45|     M|
|  3| 124.1|   5.2| 23|     F|
+---+------+------+---+------+

>>> # drop duplicates based on a specific column
>>> df3 = df2.dropDuplicates(subset=['height'])
>>> df3.show()
+---+------+------+---+------+                                                  
| id|weight|height|age|gender|
+---+------+------+---+------+
|  2| 167.2|   5.4| 45|     M|
|  5| 133.2|   5.7| 54|     F|
|  5| 129.2|   5.3| 42|     M|
|  1| 144.5|   5.9| 33|     M|
|  3| 124.1|   5.2| 23|     F|
+---+------+------+---+------+

>>> 
Handling missing values
>>> df_miss = spark.createDataFrame([
... (1, 143.5, 5.6, 28,'M', 100000),
... (2, 167.2, 5.4, 45,'M', None),
... (3, None , 5.2, None, None, None),
... (4, 144.5, 5.9, 33, 'M', None),
... (5, 133.2, 5.7, 54, 'F', None),
... (6, 124.1, 5.2, None, 'F', None),
... (7, 129.2, 5.3, 42, 'M', 76000),],
...  ['id', 'weight', 'height', 'age', 'gender', 'income'])
>>> # count the missing values in each record
>>> # iterate through the underlying rdd
>>> df_miss.rdd.map(lambda row: (row['id'], sum([c is None for c in row]))).collect()
[(1, 0), (2, 1), (3, 4), (4, 1), (5, 1), (6, 2), (7, 0)]
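Missing values can also be counted per column without going through the RDD; a sketch using the DataFrame API against the same df_miss:

from pyspark.sql import functions as F

df_miss.select([
    # when() yields a value only where the column is null; count() tallies those rows
    F.count(F.when(F.col(c).isNull(), c)).alias(c + '_missing')
    for c in df_miss.columns
]).show()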
>>> 
>>> df_miss.dropna().show()
+---+------+------+---+------+------+
| id|weight|height|age|gender|income|
+---+------+------+---+------+------+
|  1| 143.5|   5.6| 28|     M|100000|
|  7| 129.2|   5.3| 42|     M| 76000|
+---+------+------+---+------+------+
>>>
>>> df_miss.fillna(1).show()
+---+------+------+---+------+------+
| id|weight|height|age|gender|income|
+---+------+------+---+------+------+
|  1| 143.5|   5.6| 28|     M|100000|
|  2| 167.2|   5.4| 45|     M|     1|
|  3|   1.0|   5.2|  1|  null|     1|
|  4| 144.5|   5.9| 33|     M|     1|
|  5| 133.2|   5.7| 54|     F|     1|
|  6| 124.1|   5.2|  1|     F|     1|
|  7| 129.2|   5.3| 42|     M| 76000|
+---+------+------+---+------+------+
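fillna also accepts a per-column dict, so different columns can get different fill values, including computed ones; a sketch against df_miss (the fill values are arbitrary, and the aggregate column is named avg(weight) as in the grouped-aggregation output above):

mean_weight = df_miss.agg({'weight': 'mean'}).collect()[0]['avg(weight)']
df_miss.fillna({'weight': mean_weight, 'gender': 'unknown', 'income': 0}).show()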

Outlier detection
>>> df_outliers = spark.createDataFrame([
... (1, 143.5, 5.3, 28),
... (2, 154.2, 5.5, 45),
... (3, 342.3, 5.1, 99),
... (4, 144.5, 5.5, 33),
... (5, 133.2, 5.4, 54),
... (6, 124.1, 5.1, 21),
... (7, 129.2, 5.3, 42),
... ], ['id', 'weight', 'height', 'age'])
>>> # approxQuantile takes three arguments: the column name; the quantile(s) to compute, either a single value or a list of values between 0 and 1; and the acceptable relative error, where 0 means fully exact computation.
>>> # outlier detection (flag values outside 1.5 * IQR)
>>> cols = [i for i in df_outliers.columns if i != 'id']
>>> cols
['weight', 'height', 'age']
>>> bounds = dict()
>>> for col in cols:
...     quantiles = df_outliers.approxQuantile(col, [0.25, 0.75], 0.05)
...     IQR = quantiles[1] - quantiles[0]
...     bounds[col] = [quantiles[0]-1.5*IQR, quantiles[1]+1.5*IQR]
... 
>>> bounds
{'weight': [91.69999999999999, 191.7], 'height': [4.499999999999999, 6.1000000000000005], 'age': [-11.0, 93.0]}
>>> # alias() renames the resulting boolean columns
>>> outliers = df_outliers.select(*['id']+[((df_outliers[c] < bounds[c][0]) | \
... (df_outliers[c] > bounds[c][1])).alias(c + '_o') for c in cols])
>>> outliers
DataFrame[id: bigint, weight_o: boolean, height_o: boolean, age_o: boolean]
>>> outliers.show()
+---+--------+--------+-----+
| id|weight_o|height_o|age_o|
+---+--------+--------+-----+
|  1|   false|   false|false|
|  2|   false|   false|false|
|  3|    true|   false| true|
|  4|   false|   false|false|
|  5|   false|   false|false|
|  6|   false|   false|false|
|  7|   false|   false|false|
+---+--------+--------+-----+

>>> # in [*[4]+[1,2,3]] the lists are concatenated first and then unpacked, as in the select(*...) call above
>>> df_outliers = df_outliers.join(outliers, on='id')
>>> df_outliers.filter('weight_o').select('id', 'weight').show()
+---+------+                                                                    
| id|weight|
+---+------+
|  3| 342.3|
+---+------+

>>> 