Spark-SQL
Spark SQL is the module in Spark for processing structured data.
Operating on RDDs from Python is slow: the code must cross between language engines (Python worker processes and the JVM), and the resulting inter-process communication and serialization are expensive.
DataFrame
- A distributed dataset built on top of RDDs, similar to a two-dimensional table in a relational database
- DataFrame introduces a schema and off-heap storage
- 1. Fixes drawbacks of RDDs:
    - heavy serialization/deserialization overhead
    - frequent object creation and destruction, causing lots of GC
- 2. But loses some advantages of RDDs:
    - RDDs are type-checked at compile time
    - RDDs support object-oriented programming
In Python, transformations written with Spark SQL run roughly twice as fast as the equivalent transformations written with Spark Core RDDs, since the optimized plan executes inside the JVM.
Spark SQL features
- 1. Easy to integrate
- 2. Unified data source access
- 3. Hive compatibility
- 4. Standard database connectivity (JDBC/ODBC)
DataFrame characteristics:
- Once created, an RDD or DataFrame cannot be modified; a transformation always produces a new RDD/DataFrame
- Lazy evaluation: transformations only execute when an action is triggered
- DataFrames, like RDDs, are distributed
An RDD is a distributed collection of Java objects, while a DataFrame is a distributed collection of Row objects. Besides providing richer operators than RDDs, DataFrames more importantly improve execution efficiency, reduce the amount of data read, and allow the execution plan to be optimized.
Practice:
Creating a DataFrame from an RDD
>>> from pyspark.sql import SparkSession
>>> from pyspark.sql import Row
>>> spark = SparkSession.builder.appName('test').getOrCreate()
>>> sc = spark.sparkContext
>>> l = [('a', 1), ('b', 2), ('c', 3), ('d', 4)]
>>> rdd = sc.parallelize(l)
>>> p = rdd.map(lambda t: Row(alpha=t[0], digit=int(t[1])))
>>> df = spark.createDataFrame(p)
>>> df.show()
+-----+-----+
|alpha|digit|
+-----+-----+
| a| 1|
| b| 2|
| c| 3|
| d| 4|
+-----+-----+
Reading data from CSV
>>> df = spark.read.format("csv").option("header", "true").load("test.csv")
>>> df.printSchema()
root
|-- id: string (nullable = true)
|-- value1: string (nullable = true)
|-- value2: string (nullable = true)
|-- value3: string (nullable = true)
>>> df.show() # pass a number to show just the first n rows
+---+------+------+------+
| id|value1|value2|value3|
+---+------+------+------+
| 1| 123| 1.4| 23|
| 2| 110| null| 18|
| 3| null| 2.1| 19|
+---+------+------+------+
>>> df.count()
3
>>> df.columns
['id', 'value1', 'value2', 'value3']
>>>
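Note that every column in the schema above is a string: a CSV file carries no type information, and Spark does not infer types unless asked (e.g. `option("inferSchema", "true")`). Plain Python's `csv` module behaves the same way; the file contents below are an assumption reconstructed from the session above:

```python
import csv
import io

# Hypothetical contents of test.csv, reconstructed from the session above
raw = "id,value1,value2,value3\n1,123,1.4,23\n2,110,,18\n3,,2.1,19\n"

# Like header='true', DictReader takes the first line as column names;
# every field comes back as a str, and missing fields are empty strings
rows = list(csv.DictReader(io.StringIO(raw)))

assert rows[0] == {'id': '1', 'value1': '123', 'value2': '1.4', 'value3': '23'}
assert rows[1]['value2'] == ''  # what Spark shows as null
```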
Basic DataFrame operations
>>> df.withColumn('value4', df['value1']*3).show() # add a column
+---+------+------+------+------+
| id|value1|value2|value3|value4|
+---+------+------+------+------+
| 1| 123| 1.4| 23| 369.0|
| 2| 110| null| 18| 330.0|
| 3| null| 2.1| 19| null|
+---+------+------+------+------+
>>> df.drop('value3').show() # drop a column
+---+------+------+
| id|value1|value2|
+---+------+------+
| 1| 123| 1.4|
| 2| 110| null|
| 3| null| 2.1|
+---+------+------+
>>> df.describe().show() # summary statistics
+-------+---+-----------------+------------------+------------------+
|summary| id| value1| value2| value3|
+-------+---+-----------------+------------------+------------------+
| count| 3| 2| 2| 3|
| mean|2.0| 116.5| 1.75| 20.0|
| stddev|1.0|9.192388155425117|0.4949747468305834|2.6457513110645907|
| min| 1| 110| 1.4| 18|
| max| 3| 123| 2.1| 23|
+-------+---+-----------------+------------------+------------------+
>>> df.describe('value1').show() # summary statistics for one column
+-------+-----------------+
|summary| value1|
+-------+-----------------+
| count| 2|
| mean| 116.5|
| stddev|9.192388155425117|
| min| 110|
| max| 123|
+-------+-----------------+
>>> df.select('value1', 'value2').show() # select a subset of columns
+------+------+
|value1|value2|
+------+------+
| 123| 1.4|
| 110| null|
| null| 2.1|
+------+------+
>>> df.select('value1').distinct().count() # operate on a single column
3
Grouped aggregation
# grouped aggregation: groupby(colname).agg({'col': 'fun', 'col2': 'fun2'})
>>> df.groupby('value1').agg({'value2':'mean','value3':'min'}).show()
+------+-----------+-----------+
|value1|avg(value2)|min(value3)|
+------+-----------+-----------+
| 110| null| 18|
| null| 2.1| 19|
| 123| 1.4| 23|
+------+-----------+-----------+
# avg(), count(), countDistinct(), first(), kurtosis(),
# max(), mean(), min(), skewness(), stddev(), stddev_pop(),
# stddev_samp(), sum(), sumDistinct(), var_pop(), var_samp() and variance()
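Conceptually, `groupby(col).agg({...})` partitions the rows by the grouping key and applies one aggregate per listed column. A plain-Python sketch of what the call above computes (rows hardcoded from the session; note that `avg` skips nulls, so a group whose only value2 is null yields None):

```python
from collections import defaultdict

# The same rows as in the CSV session above
rows = [
    {'id': '1', 'value1': '123', 'value2': 1.4, 'value3': 23},
    {'id': '2', 'value1': '110', 'value2': None, 'value3': 18},
    {'id': '3', 'value1': None, 'value2': 2.1, 'value3': 19},
]

# Partition the rows by the grouping key value1
groups = defaultdict(list)
for r in rows:
    groups[r['value1']].append(r)

# Aggregate each group: mean of value2 (ignoring None, as Spark's avg
# does) and min of value3
result = {}
for key, grp in groups.items():
    v2 = [r['value2'] for r in grp if r['value2'] is not None]
    result[key] = {
        'avg(value2)': sum(v2) / len(v2) if v2 else None,
        'min(value3)': min(r['value3'] for r in grp),
    }

print(result['123'])  # {'avg(value2)': 1.4, 'min(value3)': 23}
```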
Splitting the dataset
>>> train, test = df.randomSplit([0.8, 0.2])
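`randomSplit` does not cut the data at an exact 80/20 boundary: each row is assigned independently, so the split sizes are only approximately proportional to the weights. A plain-Python sketch of the idea (the seed value is illustrative; `randomSplit` also accepts an optional seed argument):

```python
import random

rows = list(range(100))

rng = random.Random(42)  # illustrative seed
train, test = [], []
for row in rows:
    # each row independently goes to train with probability 0.8
    (train if rng.random() < 0.8 else test).append(row)

# the split is disjoint and exhaustive, but only approximately 80/20
assert len(train) + len(test) == len(rows)
```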
User-defined functions (udf)
>>> df.printSchema()
root
|-- id: string (nullable = true)
|-- value1: string (nullable = true)
|-- value2: string (nullable = true)
|-- value3: string (nullable = true)
>>> from pyspark.sql.types import StringType
>>> from pyspark.sql.functions import udf
>>> def is_eighteen(arg):
... return '1' if arg == '18' else '0' # return a str to match StringType; a mismatched return type yields null
...
>>> check = udf(is_eighteen, StringType())
>>> _df = df.withColumn('is_18', check(df['value3'])).filter('value3 <> 0') # <> means !=
>>> _df.show()
+---+------+------+------+-----+
| id|value1|value2|value3|is_18|
+---+------+------+------+-----+
| 1| 123| 1.4| 23| 0|
| 2| 110| null| 18| 1|
| 3| null| 2.1| 19| 0|
+---+------+------+------+-----+
Data cleaning
>>> df = spark.createDataFrame([
... (1, 144.5, 5.9, 33, 'M'),
... (2, 167.2, 5.4, 45, 'M'),
... (3, 124.1, 5.2, 23, 'F'),
... (4, 144.5, 5.9, 33, 'M'),
... (5, 133.2, 5.7, 54, 'F'),
... (3, 124.1, 5.2, 23, 'F'),
... (5, 129.2, 5.3, 42, 'M'),
... ], ['id', 'weight', 'height', 'age', 'gender'])
>>> df.show()
+---+------+------+---+------+
| id|weight|height|age|gender|
+---+------+------+---+------+
| 1| 144.5| 5.9| 33| M|
| 2| 167.2| 5.4| 45| M|
| 3| 124.1| 5.2| 23| F|
| 4| 144.5| 5.9| 33| M|
| 5| 133.2| 5.7| 54| F|
| 3| 124.1| 5.2| 23| F|
| 5| 129.2| 5.3| 42| M|
+---+------+------+---+------+
>>> # remove duplicate rows
...
>>>
>>> df2 = df.dropDuplicates()
>>> df2.show()
+---+------+------+---+------+
| id|weight|height|age|gender|
+---+------+------+---+------+
| 5| 133.2| 5.7| 54| F|
| 5| 129.2| 5.3| 42| M|
| 1| 144.5| 5.9| 33| M|
| 4| 144.5| 5.9| 33| M|
| 2| 167.2| 5.4| 45| M|
| 3| 124.1| 5.2| 23| F|
+---+------+------+---+------+
>>> # deduplicate on specific columns only
...
>>> df3 = df2.dropDuplicates(subset=['height'])
>>> df3.show()
+---+------+------+---+------+
| id|weight|height|age|gender|
+---+------+------+---+------+
| 2| 167.2| 5.4| 45| M|
| 5| 133.2| 5.7| 54| F|
| 5| 129.2| 5.3| 42| M|
| 1| 144.5| 5.9| 33| M|
| 3| 124.1| 5.2| 23| F|
+---+------+------+---+------+
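`dropDuplicates(subset=['height'])` keeps a single row per distinct height; which of the duplicate rows survives is not guaranteed. The same logic outside Spark (rows hardcoded from df2 above, keeping the first occurrence per key):

```python
# Rows of df2 from the session above
rows = [
    (5, 133.2, 5.7, 54, 'F'),
    (5, 129.2, 5.3, 42, 'M'),
    (1, 144.5, 5.9, 33, 'M'),
    (4, 144.5, 5.9, 33, 'M'),
    (2, 167.2, 5.4, 45, 'M'),
    (3, 124.1, 5.2, 23, 'F'),
]

# Keep only the first row seen for each distinct height (index 2)
seen = set()
deduped = []
for row in rows:
    height = row[2]
    if height not in seen:
        seen.add(height)
        deduped.append(row)

assert len(deduped) == 5  # the two height-5.9 rows collapse into one
```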
>>>
Handling missing values
>>> df_miss = spark.createDataFrame([
... (1, 143.5, 5.6, 28,'M', 100000),
... (2, 167.2, 5.4, 45,'M', None),
... (3, None , 5.2, None, None, None),
... (4, 144.5, 5.9, 33, 'M', None),
... (5, 133.2, 5.7, 54, 'F', None),
... (6, 124.1, 5.2, None, 'F', None),
... (7, 129.2, 5.3, 42, 'M', 76000),],
... ['id', 'weight', 'height', 'age', 'gender', 'income'])
>>> # count the missing values in each record
...
>>> # one approach: iterate via the underlying rdd
...
>>>
>>> df_miss.rdd.map(lambda row:(row['id'], sum([c==None for c in row]))).collect()
[(1, 0), (2, 1), (3, 4), (4, 1), (5, 1), (6, 2), (7, 0)]
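The lambda above is ordinary Python: summing a list of booleans counts the True values. The same per-record count outside Spark (rows hardcoded from df_miss):

```python
# Rows of df_miss from the session above
rows = [
    (1, 143.5, 5.6, 28, 'M', 100000),
    (2, 167.2, 5.4, 45, 'M', None),
    (3, None, 5.2, None, None, None),
    (4, 144.5, 5.9, 33, 'M', None),
    (5, 133.2, 5.7, 54, 'F', None),
    (6, 124.1, 5.2, None, 'F', None),
    (7, 129.2, 5.3, 42, 'M', 76000),
]

# For each record, count how many fields are None
# (each True counts as 1 when summed)
missing = [(row[0], sum(v is None for v in row)) for row in rows]
print(missing)  # [(1, 0), (2, 1), (3, 4), (4, 1), (5, 1), (6, 2), (7, 0)]
```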
>>>
>>> df_miss.dropna().show()
+---+------+------+---+------+------+
| id|weight|height|age|gender|income|
+---+------+------+---+------+------+
| 1| 143.5| 5.6| 28| M|100000|
| 7| 129.2| 5.3| 42| M| 76000|
+---+------+------+---+------+------+
>>>
>>> df_miss.fillna(1).show()
+---+------+------+---+------+------+
| id|weight|height|age|gender|income|
+---+------+------+---+------+------+
| 1| 143.5| 5.6| 28| M|100000|
| 2| 167.2| 5.4| 45| M| 1|
| 3| 1.0| 5.2| 1| null| 1|
| 4| 144.5| 5.9| 33| M| 1|
| 5| 133.2| 5.7| 54| F| 1|
| 6| 124.1| 5.2| 1| F| 1|
| 7| 129.2| 5.3| 42| M| 76000|
+---+------+------+---+------+------+
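Note that gender stays null in the output above: fillna(1) only fills columns whose type matches the fill value, so the integer 1 never touches the string column. A plain-Python sketch of that behavior (column names and the numeric set mirror df_miss; this is an illustration, not Spark's implementation):

```python
# Columns of df_miss and the subset with a numeric type
columns = ['id', 'weight', 'height', 'age', 'gender', 'income']
numeric = {'id', 'weight', 'height', 'age', 'income'}

row = (3, None, 5.2, None, None, None)

# Replace None with 1 only in numeric columns; the string column
# gender is left untouched, matching the null in the output above
filled = tuple(1 if (v is None and c in numeric) else v
               for c, v in zip(columns, row))
print(filled)  # (3, 1, 5.2, 1, None, 1)
```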
Outlier detection
>>> df_outliers = spark.createDataFrame([
... (1, 143.5, 5.3, 28),
... (2, 154.2, 5.5, 45),
... (3, 342.3, 5.1, 99),
... (4, 144.5, 5.5, 33),
... (5, 133.2, 5.4, 54),
... (6, 124.1, 5.1, 21),
... (7, 129.2, 5.3, 42),
... ], ['id', 'weight', 'height', 'age'])
>>> # approxQuantile takes three arguments: the column name; the quantile(s) to compute, either a single probability or a list of values between 0 and 1; and the acceptable relative error, where 0 means exact (and expensive) computation.
>>> # outlier detection
...
>>> cols = [i for i in df_outliers.columns if i != 'id']
>>> cols
['weight', 'height', 'age']
>>> bounds = dict()
>>> for col in cols:
... quantiles = df_outliers.approxQuantile(col, [0.25, 0.75], 0.05)
... IQR = quantiles[1] - quantiles[0]
... bounds[col] = [quantiles[0]-1.5*IQR, quantiles[1]+1.5*IQR]
...
>>> bounds
{'weight': [91.69999999999999, 191.7], 'height': [4.499999999999999, 6.1000000000000005], 'age': [-11.0, 93.0]}
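With a relative error of 0.05 on such a small sample, approxQuantile returns actual sample elements near the requested rank. A plain-Python nearest-rank quantile reproduces the same bounds here; this is a simplification for illustration, not the sketch-based algorithm Spark uses internally:

```python
import math

# The three non-id columns of df_outliers from the session above
data = {
    'weight': [143.5, 154.2, 342.3, 144.5, 133.2, 124.1, 129.2],
    'height': [5.3, 5.5, 5.1, 5.5, 5.4, 5.1, 5.3],
    'age': [28, 45, 99, 33, 54, 21, 42],
}

def nearest_rank(values, p):
    # Nearest-rank quantile: the element at rank ceil(p * n).
    # On this small sample it agrees with approxQuantile's output above.
    s = sorted(values)
    return s[max(math.ceil(p * len(s)), 1) - 1]

# Tukey's rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
bounds = {}
for col, values in data.items():
    q1, q3 = nearest_rank(values, 0.25), nearest_rank(values, 0.75)
    iqr = q3 - q1
    bounds[col] = [q1 - 1.5 * iqr, q3 + 1.5 * iqr]

print(bounds['age'])  # [-11.0, 93.0]
```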
>>> # alias() renames the resulting column
>>> outliers = df_outliers.select(*['id']+[((df_outliers[c] < bounds[c][0]) | \
... (df_outliers[c] > bounds[c][1])).alias(c + '_o') for c in cols])
>>> outliers
DataFrame[id: bigint, weight_o: boolean, height_o: boolean, age_o: boolean]
>>> outliers.show()
+---+--------+--------+-----+
| id|weight_o|height_o|age_o|
+---+--------+--------+-----+
| 1| false| false|false|
| 2| false| false|false|
| 3| true| false| true|
| 4| false| false|false|
| 5| false| false|false|
| 6| false| false|false|
| 7| false| false|false|
+---+--------+--------+-----+
>>> # [*[4]+[1,2,3]]: the lists are concatenated first, then unpacked
...
>>> df_outliers = df_outliers.join(outliers, on='id')
>>> df_outliers.filter('weight_o').select('id', 'weight').show()
+---+------+
| id|weight|
+---+------+
| 3| 342.3|
+---+------+
>>>