PySpark - Spark SQL Basics


Spark SQL is Apache Spark's module for working with structured data.

I. Initializing a SparkSession

A SparkSession is used to create DataFrames, register DataFrames as tables, run SQL queries over them, cache tables, and read Parquet files.

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
sc = spark.sparkContext

II. Creating DataFrames

1. From an RDD

from pyspark.sql.types import * brings in the following types:
__all__ = [
    "DataType", "NullType", "StringType", "BinaryType", "BooleanType", "DateType",
    "TimestampType", "DecimalType", "DoubleType", "FloatType", "ByteType", "IntegerType",
    "LongType", "ShortType", "ArrayType", "MapType", "StructField", "StructType"]
from pyspark.sql import Row
from pyspark.sql.types import *
from collections import namedtuple

(1) Inferring the Schema

lines = sc.textFile("people.txt")                             # read the text file as an RDD of lines
parts = lines.map(lambda l: l.split(","))                     # split each line on commas
people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))   # build Row objects
peopledf = spark.createDataFrame(people)                      # schema is inferred from the Rows
peopledf.show()
+---------+---+
|     name|age|
+---------+---+
|    Danae| 27|
|  Claudia| 26|
|Desdemona| 25|
|    Chloe| 30|
|  Felicia| 25|
+---------+---+
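The same inference also works on a plain Python list of Row objects or namedtuples, which is what the namedtuple import above is for. A minimal sketch with made-up values (Person and localdf are illustrative names, not from the original):

Person = namedtuple("Person", ["name", "age"])
localdf = spark.createDataFrame([Person("Danae", 27), Person("Claudia", 26)])
localdf.printSchema()
# root
#  |-- name: string (nullable = true)
#  |-- age: long (nullable = true)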

(2) Specifying the Schema

people = parts.map(lambda p: (p[0], p[1].strip()))   # keep both fields as strings to match the all-string schema below
schemaString = "name age"
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)
df = spark.createDataFrame(people, schema)
print(type(df))
<class 'pyspark.sql.dataframe.DataFrame'>
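If the schema should carry real types instead of treating every field as a string, the fields can be declared explicitly. A minimal sketch (typed_rows, typed_schema and df_typed are illustrative names):

typed_rows = parts.map(lambda p: (p[0], int(p[1].strip())))
typed_schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
df_typed = spark.createDataFrame(typed_rows, typed_schema)
df_typed.printSchema()
# root
#  |-- name: string (nullable = true)
#  |-- age: integer (nullable = true)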

2. From Spark data sources

(1) JSON files

df = spark.read.json("customer.json")
df.show()
# +-------+---+---------+--------+-----------+
# |address|age|firstName|lastName|phoneNumber|
# +-------+---+---------+--------+-----------+
# |BeiJing| 29|      Tom|   Smith|13911112222|
# |NanJing| 30|     July|     Doe|13922223333|
# +-------+---+---------+--------+-----------+
df2 = spark.read.load("people.json", format="json")
df2.show()
# +---+---------+
# |age|     name|
# +---+---------+
# | 27|    Danae|
# | 26|  Claudia|
# | 25|Desdemona|
# | 30|    Chloe|
# | 25|  Felicia|
# +---+---------+

(2) Parquet files

df3 = spark.read.load("people.parquet")
df3.show()
# +---------+---+
# |     name|age|
# +---------+---+
# |    Danae| 27|
# |  Claudia| 26|
# |Desdemona| 25|
# |    Chloe| 30|
# |  Felicia| 25|
# +---------+---+

(3) Text files

df4 = spark.read.text("people.txt")
df4.show()
# +------------+
# |       value|
# +------------+
# |    Danae,27|
# |  Claudia,26|
# |Desdemona,25|
# |    Chloe,30|
# |  Felicia,25|
# +------------+
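Since spark.read.text yields a single value column, the lines usually still need to be parsed. A minimal sketch using built-in functions, assuming the comma-separated layout shown above (parsed is an illustrative name):

from pyspark.sql import functions as F

parsed = df4.select(
    F.split(df4.value, ",").getItem(0).alias("name"),
    F.split(df4.value, ",").getItem(1).cast("int").alias("age"),
)
parsed.show()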

3. Inspecting the data

df.dtypes               # Return the column names and data types of df
df.show()               # Display the contents of df
df.head(2)              # Return the first n rows (head() with no argument returns only the first row)
df.first()              # Return the first row
df.take(2)              # Return the first n rows
df.schema               # Return the schema of df
df.describe().show()    # Show summary statistics
df.columns              # Return the column names of df
df.count()              # Return the number of rows in df
df.distinct().count()   # Return the number of distinct rows in df
df.printSchema()        # Print the schema of df
df.explain()            # Print the physical plan (explain(True) also shows the logical plans)
Column names and data types of df:
[('address', 'string'), ('age', 'bigint'), ('firstName', 'string'), ('lastName', 'string'), ('phoneNumber', 'string')]
Contents of df:
+-------+---+---------+--------+-----------+
|address|age|firstName|lastName|phoneNumber|
+-------+---+---------+--------+-----------+
|BeiJing| 29|      Tom|   Smith|13911112222|
|NanJing| 30|     July|     Doe|13922223333|
+-------+---+---------+--------+-----------+
The first rows returned by head(2):
[Row(address='BeiJing', age=29, firstName='Tom', lastName='Smith', phoneNumber='13911112222'), Row(address='NanJing', age=30, firstName='July', lastName='Doe', phoneNumber='13922223333')]
The first row returned by first():
Row(address='BeiJing', age=29, firstName='Tom', lastName='Smith', phoneNumber='13911112222')
The first rows returned by take(2):
[Row(address='BeiJing', age=29, firstName='Tom', lastName='Smith', phoneNumber='13911112222'), Row(address='NanJing', age=30, firstName='July', lastName='Doe', phoneNumber='13922223333')]
Schema of df:
StructType(List(StructField(address,StringType,true),StructField(age,LongType,true),StructField(firstName,StringType,true),StructField(lastName,StringType,true),StructField(phoneNumber,StringType,true)))
Summary statistics:
+-------+-------+------------------+---------+--------+-----------------+
|summary|address|               age|firstName|lastName|      phoneNumber|
+-------+-------+------------------+---------+--------+-----------------+
|  count|      2|                 2|        2|       2|                2|
|   mean|   null|              29.5|     null|    null| 1.39166677775E10|
| stddev|   null|0.7071067811865476|     null|    null|7856741.934616441|
|    min|BeiJing|                29|     July|     Doe|      13911112222|
|    max|NanJing|                30|      Tom|   Smith|      13922223333|
+-------+-------+------------------+---------+--------+-----------------+
Column names of df: ['address', 'age', 'firstName', 'lastName', 'phoneNumber']
Number of rows in df: 2
Schema printed by printSchema():
root
 |-- address: string (nullable = true)
 |-- age: long (nullable = true)
 |-- firstName: string (nullable = true)
 |-- lastName: string (nullable = true)
 |-- phoneNumber: string (nullable = true)

Physical plan printed by explain():
== Physical Plan ==
*(1) FileScan json [address#6,age#7L,firstName#8,lastName#9,phoneNumber#10] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/D:/Mypython/spark/customer.json], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<address:string,age:bigint,firstName:string,lastName:string,phoneNumber:string>

4. Duplicate values

df = df.dropDuplicates()
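dropDuplicates can also deduplicate on a subset of columns. A minimal sketch (the column choice here is illustrative):

df.dropDuplicates(["firstName", "lastName"]).show()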

5. Queries

(1) Select

df.select("firstName", "lastName").show()                # 显示 firstName、age 的所有条目
+---------+--------+
|firstName|lastName|
+---------+--------+
|      Tom|   Smith|
|     July|     Doe|
+---------+--------+
df.select("firstName", "age", F.explode("phoneNumber").alias("contactInfo"))\
    .select("contactInfo.type", "firstName", "age")\
    .show()                                              # 显示 firstName、age 的所有条目和类型
df.select(df["firstName"], df["age"] + 1).show()         # 显示 firstName 和 age 列的所有记录,并对 age 记录添加1
+---------+---------+
|firstName|(age + 1)|
+---------+---------+
|      Tom|       30|
|     July|       31|
+---------+---------+
df.select(df['age'] > 29).show()                         # For each record, show whether age is greater than 29
+----------+
|(age > 29)|
+----------+
|     false|
|      true|
+----------+

(2) When

df.select("firstName", F.when(df.age > 29, 1).otherwise(0)).show()
+---------+--------------------------------------+
|firstName|CASE WHEN (age > 29) THEN 1 ELSE 0 END|
+---------+--------------------------------------+
|      Tom|                                     0|
|     July|                                     1|
+---------+--------------------------------------+
df[df.firstName.isin("Tom", "July")].show()              # Keep rows whose firstName is in the given list
+-------+---+---------+--------+-----------+
|address|age|firstName|lastName|phoneNumber|
+-------+---+---------+--------+-----------+
|BeiJing| 29|      Tom|   Smith|13911112222|
|NanJing| 30|     July|     Doe|13922223333|
+-------+---+---------+--------+-----------+

(3) Like

df.select("firstName", df.lastName.like("Smith")).show()
+---------+-------------------+
|firstName|lastName LIKE Smith|
+---------+-------------------+
|      Tom|               true|
|     July|              false|
+---------+-------------------+

(4) Startswith & Endswith

df.select("firstName", df.lastName.startswith("Sm")).show()
+---------+------------------------+
|firstName|startswith(lastName, Sm)|
+---------+------------------------+
|      Tom|                    true|
|     July|                   false|
+---------+------------------------+
df.select(df.lastName.endswith("th")).show()
+----------------------+
|endswith(lastName, th)|
+----------------------+
|                  true|
|                 false|
+----------------------+

(5) Substring

df.select(df.firstName.substr(1, 3).alias("name")).show()    # alias renames the resulting column
+----+
|name|
+----+
| Tom|
| Jul|
+----+

(6) Between

df.select(df.age.between(22, 29)).show()
+-----------------------------+
|((age >= 22) AND (age <= 29))|
+-----------------------------+
|                         true|
|                        false|
+-----------------------------+

6. Column operations

(1) Adding columns

df = df.withColumn('telePhoneNumber', F.explode(df.phoneNumber.number))   # assumes phoneNumber is an array of structs with a "number" field (original DataCamp schema)
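For the flat customer schema used in the rest of this post, a simpler (illustrative) withColumn example derives a new column from an existing one:

df.withColumn("ageNextYear", df.age + 1).show()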

(2) Renaming columns

df = df.withColumnRenamed('phoneNumber', 'callNumber')
df.show()
+-------+---+---------+--------+-----------+
|address|age|firstName|lastName| callNumber|
+-------+---+---------+--------+-----------+
|BeiJing| 29|      Tom|   Smith|13911112222|
|NanJing| 30|     July|     Doe|13922223333|
+-------+---+---------+--------+-----------+

(3) Dropping columns

df = df.drop("address", "callNumber")
df = df.drop(df.address).drop(df.callNumber)
df.show()
+---+---------+--------+
|age|firstName|lastName|
+---+---------+--------+
| 29|      Tom|   Smith|
| 30|     July|     Doe|
+---+---------+--------+

7. GroupBy

df.groupBy("age").count().show()  # 按 age 列分组,统计每组人数
+---+-----+
|age|count|
+---+-----+
| 29|    1|
| 30|    1|
+---+-----+
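groupBy also supports other aggregations through agg. A minimal sketch (grouping and aggregation columns are illustrative; F is pyspark.sql.functions, imported earlier):

df.groupBy("lastName") \
  .agg(F.avg("age").alias("avgAge"), F.count("*").alias("n")) \
  .show()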

8. Filter

df.filter(df["age"] > 29).show()    # 按 age 列筛选,保留年龄大于29岁的
+---+---------+--------+
|age|firstName|lastName|
+---+---------+--------+
| 30|     July|     Doe|
+---+---------+--------+
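filter (and its alias where) accepts either Column expressions or SQL strings, and conditions can be combined with & and |. A minimal sketch (the condition values are illustrative):

df.filter((df.age > 25) & (df.lastName == "Doe")).show()
df.where("age > 25 AND lastName = 'Doe'").show()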

9. Sort

df.sort(df.age.desc()).show()
df.sort("age", ascending=False).show()
df.orderBy(["age", "city"], ascending=[0, 1]).show()   # assumes the DataFrame also has a "city" column
+---+---------+--------+
|age|firstName|lastName|
+---+---------+--------+
| 30|     July|     Doe|
| 29|      Tom|   Smith|
+---+---------+--------+

10. Replacing missing values

df.na.fill(50).show()           # Replace null values with a single value
df.na.drop().show()             # Drop rows containing null values
df.na.replace(10, 20).show()    # Replace one value with another
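fill also accepts a per-column dict, and replace can be limited to a subset of columns. A minimal sketch (the column names and replacement values are illustrative):

df.na.fill({"age": 0, "firstName": "unknown"}).show()
df.na.replace(["Smith"], ["SMITH"], "lastName").show()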

11. Repartitioning

df.repartition(10).rdd.getNumPartitions()   # Split df into 10 partitions
df.coalesce(1).rdd.getNumPartitions()       # Merge df into 1 partition

12. Running SQL queries

(1) Registering DataFrames as views

df.createGlobalTempView("people")       # Global temporary view, shared across SparkSessions
df.createTempView("customer")           # Temporary view, scoped to this SparkSession
df.createOrReplaceTempView("customer")  # Same, but replaces an existing view with that name

(2) Querying views

spark.sql("SELECT * FROM customer").show()
spark.sql("SELECT * FROM global_temp.people").show()   # global temporary views live in the global_temp database
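spark.sql returns an ordinary DataFrame, so the result can be transformed or collected like any other. A minimal sketch against the customer view registered above (the WHERE condition is illustrative):

adults = spark.sql("SELECT firstName, age FROM customer WHERE age >= 30")
print(adults.collect())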

13. Output

(1) Converting to other data structures

rdd1 = df.rdd           # Convert df to an RDD of Row objects
df.toJSON().first()     # Convert df to an RDD of JSON strings and take the first one
df.toPandas()           # Convert df to a pandas DataFrame (collects all data to the driver)

(2) Saving to files

df.select("firstName", "address").write.save("nameAndAddress.parquet")
df.select("firstName", "age").write.save("namesAndAges.json", format="json")

14. Stopping the SparkSession

spark.stop()

Adapted from DataCamp, "Learn Python for Data Science Interactively".
