PySpark - Spark SQL Basics


Spark SQL is Apache Spark's module for working with structured data.

I. Initializing a SparkSession

A SparkSession is used to create DataFrames, register DataFrames as tables, run SQL queries over them, cache tables, and read Parquet files.

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
sc = spark.sparkContext

II. Creating DataFrames

1. From an RDD

from pyspark.sql.types import * brings in the following types:
__all__ = [
    "DataType", "NullType", "StringType", "BinaryType", "BooleanType", "DateType",
    "TimestampType", "DecimalType", "DoubleType", "FloatType", "ByteType", "IntegerType",
    "LongType", "ShortType", "ArrayType", "MapType", "StructField", "StructType"]
from pyspark.sql import Row
from pyspark.sql.types import *
from collections import namedtuple

(1) Inferring the Schema

lines = sc.textFile("people.txt")                             # read the text file as an RDD of lines
parts = lines.map(lambda l: l.split(","))                     # split each line on commas
people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))   # build Row objects
peopledf = spark.createDataFrame(people)                      # schema is inferred from the Rows
peopledf.show()
+---------+---+
|     name|age|
+---------+---+
|    Danae| 27|
|  Claudia| 26|
|Desdemona| 25|
|    Chloe| 30|
|  Felicia| 25|
+---------+---+
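The same inference also works on a plain Python list of Row objects or namedtuples, which is what the namedtuple import above is for. A minimal sketch with made-up values (Person and localdf are illustrative names, not from the original):

Person = namedtuple("Person", ["name", "age"])
localdf = spark.createDataFrame([Person("Danae", 27), Person("Claudia", 26)])
localdf.printSchema()
# root
#  |-- name: string (nullable = true)
#  |-- age: long (nullable = true)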

(2) Specifying the Schema

people = parts.map(lambda p: (p[0], p[1].strip()))   # keep both fields as strings to match the all-string schema below
schemaString = "name age"
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)
df = spark.createDataFrame(people, schema)
print(type(df))
<class 'pyspark.sql.dataframe.DataFrame'>
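If the schema should carry real types instead of treating every field as a string, the fields can be declared explicitly. A minimal sketch (typed_rows, typed_schema and df_typed are illustrative names):

typed_rows = parts.map(lambda p: (p[0], int(p[1].strip())))
typed_schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
df_typed = spark.createDataFrame(typed_rows, typed_schema)
df_typed.printSchema()
# root
#  |-- name: string (nullable = true)
#  |-- age: integer (nullable = true)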

2. From Spark data sources

(1) JSON files

df = spark.read.json("customer.json")
df.show()
# +-------+---+---------+--------+-----------+
# |address|age|firstName|lastName|phoneNumber|
# +-------+---+---------+--------+-----------+
# |BeiJing| 29|      Tom|   Smith|13911112222|
# |NanJing| 30|     July|     Doe|13922223333|
# +-------+---+---------+--------+-----------+
df2 = spark.read.load("people.json", format="json")
df2.show()
# +---+---------+
# |age|     name|
# +---+---------+
# | 27|    Danae|
# | 26|  Claudia|
# | 25|Desdemona|
# | 30|    Chloe|
# | 25|  Felicia|
# +---+---------+

(2) Parquet files

df3 = spark.read.load("people.parquet")
df3.show()
# +---------+---+
# |     name|age|
# +---------+---+
# |    Danae| 27|
# |  Claudia| 26|
# |Desdemona| 25|
# |    Chloe| 30|
# |  Felicia| 25|
# +---------+---+

(3) Text files

df4 = spark.read.text("people.txt")
df4.show()
# +------------+
# |       value|
# +------------+
# |    Danae,27|
# |  Claudia,26|
# |Desdemona,25|
# |    Chloe,30|
# |  Felicia,25|
# +------------+
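Since spark.read.text yields a single value column, the lines usually still need to be parsed. A minimal sketch using built-in functions, assuming the comma-separated layout shown above (parsed is an illustrative name):

from pyspark.sql import functions as F

parsed = df4.select(
    F.split(df4.value, ",").getItem(0).alias("name"),
    F.split(df4.value, ",").getItem(1).cast("int").alias("age"),
)
parsed.show()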

3. Inspecting the data

df.dtypes               # Return the column names and data types of df
df.show()               # Display the contents of df
df.head(2)              # Return the first n rows (head() with no argument returns only the first row)
df.first()              # Return the first row
df.take(2)              # Return the first n rows
df.schema               # Return the schema of df
df.describe().show()    # Show summary statistics
df.columns              # Return the column names of df
df.count()              # Return the number of rows in df
df.distinct().count()   # Return the number of distinct rows in df
df.printSchema()        # Print the schema of df
df.explain()            # Print the physical plan (explain(True) also shows the logical plans)
Column names and data types of df:
[('address', 'string'), ('age', 'bigint'), ('firstName', 'string'), ('lastName', 'string'), ('phoneNumber', 'string')]
Contents of df:
+-------+---+---------+--------+-----------+
|address|age|firstName|lastName|phoneNumber|
+-------+---+---------+--------+-----------+
|BeiJing| 29|      Tom|   Smith|13911112222|
|NanJing| 30|     July|     Doe|13922223333|
+-------+---+---------+--------+-----------+
The first rows returned by head(2):
[Row(address='BeiJing', age=29, firstName='Tom', lastName='Smith', phoneNumber='13911112222'), Row(address='NanJing', age=30, firstName='July', lastName='Doe', phoneNumber='13922223333')]
The first row returned by first():
Row(address='BeiJing', age=29, firstName='Tom', lastName='Smith', phoneNumber='13911112222')
The first rows returned by take(2):
[Row(address='BeiJing', age=29, firstName='Tom', lastName='Smith', phoneNumber='13911112222'), Row(address='NanJing', age=30, firstName='July', lastName='Doe', phoneNumber='13922223333')]
Schema of df:
StructType(List(StructField(address,StringType,true),StructField(age,LongType,true),StructField(firstName,StringType,true),StructField(lastName,StringType,true),StructField(phoneNumber,StringType,true)))
Summary statistics:
+-------+-------+------------------+---------+--------+-----------------+
|summary|address|               age|firstName|lastName|      phoneNumber|
+-------+-------+------------------+---------+--------+-----------------+
|  count|      2|                 2|        2|       2|                2|
|   mean|   null|              29.5|     null|    null| 1.39166677775E10|
| stddev|   null|0.7071067811865476|     null|    null|7856741.934616441|
|    min|BeiJing|                29|     July|     Doe|      13911112222|
|    max|NanJing|                30|      Tom|   Smith|      13922223333|
+-------+-------+------------------+---------+--------+-----------------+
Column names of df: ['address', 'age', 'firstName', 'lastName', 'phoneNumber']
Number of rows in df: 2
Schema printed by printSchema():
root
 |-- address: string (nullable = true)
 |-- age: long (nullable = true)
 |-- firstName: string (nullable = true)
 |-- lastName: string (nullable = true)
 |-- phoneNumber: string (nullable = true)

Physical plan printed by explain():
== Physical Plan ==
*(1) FileScan json [address#6,age#7L,firstName#8,lastName#9,phoneNumber#10] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/D:/Mypython/spark/customer.json], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<address:string,age:bigint,firstName:string,lastName:string,phoneNumber:string>

4. Duplicate values

df = df.dropDuplicates()
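dropDuplicates can also deduplicate on a subset of columns. A minimal sketch (the column choice here is illustrative):

df.dropDuplicates(["firstName", "lastName"]).show()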

5. Queries

(1) Select

df.select("firstName", "lastName").show()                # 显示 firstName、age 的所有条目
+---------+--------+
|firstName|lastName|
+---------+--------+
|      Tom|   Smith|
|     July|     Doe|
+---------+--------+
df.select("firstName", "age", F.explode("phoneNumber").alias("contactInfo"))\
    .select("contactInfo.type", "firstName", "age")\
    .show()                                              # 显示 firstName、age 的所有条目和类型
df.select(df["firstName"], df["age"] + 1).show()         # 显示 firstName 和 age 列的所有记录,并对 age 记录添加1
+---------+---------+
|firstName|(age + 1)|
+---------+---------+
|      Tom|       30|
|     July|       31|
+---------+---------+
df.select(df['age'] > 29).show()                         # For each record, show whether age is greater than 29
+----------+
|(age > 29)|
+----------+
|     false|
|      true|
+----------+

(2) When

df.select("firstName", F.when(df.age > 29, 1).otherwise(0)).show()
+---------+--------------------------------------+
|firstName|CASE WHEN (age > 29) THEN 1 ELSE 0 END|
+---------+--------------------------------------+
|      Tom|                                     0|
|     July|                                     1|
+---------+--------------------------------------+
df[df.firstName.isin("Tom", "July")].show()              # Keep rows whose firstName is in the given list
+-------+---+---------+--------+-----------+
|address|age|firstName|lastName|phoneNumber|
+-------+---+---------+--------+-----------+
|BeiJing| 29|      Tom|   Smith|13911112222|
|NanJing| 30|     July|     Doe|13922223333|
+-------+---+---------+--------+-----------+

(3) Like

df.select("firstName", df.lastName.like("Smith")).show()
+---------+-------------------+
|firstName|lastName LIKE Smith|
+---------+-------------------+
|      Tom|               true|
|     July|              false|
+---------+-------------------+

(4) Startswith & Endswith

df.select("firstName", df.lastName.startswith("Sm")).show()
+---------+------------------------+
|firstName|startswith(lastName, Sm)|
+---------+------------------------+
|      Tom|                    true|
|     July|                   false|
+---------+------------------------+
df.select(df.lastName.endswith("th")).show()
+----------------------+
|endswith(lastName, th)|
+----------------------+
|                  true|
|                 false|
+----------------------+

(5) Substring

df.select(df.firstName.substr(1, 3).alias("name")).show()    # alias renames the resulting column
+----+
|name|
+----+
| Tom|
| Jul|
+----+

(6) Between

df.select(df.age.between(22, 29)).show()
+-----------------------------+
|((age >= 22) AND (age <= 29))|
+-----------------------------+
|                         true|
|                        false|
+-----------------------------+

6. Column operations

(1) Adding columns

df = df.withColumn('telePhoneNumber', F.explode(df.phoneNumber.number))   # assumes phoneNumber is an array of structs with a "number" field (original DataCamp schema)
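For the flat customer schema used in the rest of this post, a simpler (illustrative) withColumn example derives a new column from an existing one:

df.withColumn("ageNextYear", df.age + 1).show()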

(2) Renaming columns

df = df.withColumnRenamed('phoneNumber', 'callNumber')
df.show()
+-------+---+---------+--------+-----------+
|address|age|firstName|lastName| callNumber|
+-------+---+---------+--------+-----------+
|BeiJing| 29|      Tom|   Smith|13911112222|
|NanJing| 30|     July|     Doe|13922223333|
+-------+---+---------+--------+-----------+

(3) Dropping columns

df = df.drop("address", "callNumber")
df = df.drop(df.address).drop(df.callNumber)
df.show()
+---+---------+--------+
|age|firstName|lastName|
+---+---------+--------+
| 29|      Tom|   Smith|
| 30|     July|     Doe|
+---+---------+--------+

7. GroupBy

df.groupBy("age").count().show()  # 按 age 列分组,统计每组人数
+---+-----+
|age|count|
+---+-----+
| 29|    1|
| 30|    1|
+---+-----+
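groupBy also supports other aggregations through agg. A minimal sketch (grouping and aggregation columns are illustrative; F is pyspark.sql.functions, imported earlier):

df.groupBy("lastName") \
  .agg(F.avg("age").alias("avgAge"), F.count("*").alias("n")) \
  .show()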

8. Filter

df.filter(df["age"] > 29).show()    # 按 age 列筛选,保留年龄大于29岁的
+---+---------+--------+
|age|firstName|lastName|
+---+---------+--------+
| 30|     July|     Doe|
+---+---------+--------+
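filter (and its alias where) accepts either Column expressions or SQL strings, and conditions can be combined with & and |. A minimal sketch (the condition values are illustrative):

df.filter((df.age > 25) & (df.lastName == "Doe")).show()
df.where("age > 25 AND lastName = 'Doe'").show()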

9. Sort

df.sort(df.age.desc()).show()
df.sort("age", ascending=False).show()
df.orderBy(["age", "city"], ascending=[0, 1]).show()   # assumes the DataFrame also has a "city" column
+---+---------+--------+
|age|firstName|lastName|
+---+---------+--------+
| 30|     July|     Doe|
| 29|      Tom|   Smith|
+---+---------+--------+

10. Replacing missing values

df.na.fill(50).show()           # Replace null values with a single value
df.na.drop().show()             # Drop rows containing null values
df.na.replace(10, 20).show()    # Replace one value with another
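fill also accepts a per-column dict, and replace can be limited to a subset of columns. A minimal sketch (the column names and replacement values are illustrative):

df.na.fill({"age": 0, "firstName": "unknown"}).show()
df.na.replace(["Smith"], ["SMITH"], "lastName").show()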

11. Repartitioning

df.repartition(10).rdd.getNumPartitions()   # Split df into 10 partitions
df.coalesce(1).rdd.getNumPartitions()       # Merge df into 1 partition

12. Running SQL queries

(1) Registering DataFrames as views

df.createGlobalTempView("people")       # Global temporary view, shared across SparkSessions
df.createTempView("customer")           # Temporary view, scoped to this SparkSession
df.createOrReplaceTempView("customer")  # Same, but replaces an existing view with that name

(2) Querying views

spark.sql("SELECT * FROM customer").show()
spark.sql("SELECT * FROM global_temp.people").show()   # global temporary views live in the global_temp database
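spark.sql returns an ordinary DataFrame, so the result can be transformed or collected like any other. A minimal sketch against the customer view registered above (the WHERE condition is illustrative):

adults = spark.sql("SELECT firstName, age FROM customer WHERE age >= 30")
print(adults.collect())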

13. Output

(1) Converting to other data structures

rdd1 = df.rdd           # Convert df to an RDD of Row objects
df.toJSON().first()     # Convert df to an RDD of JSON strings and take the first one
df.toPandas()           # Convert df to a pandas DataFrame (collects all data to the driver)

(2) Saving to files

df.select("firstName", "address").write.save("nameAndAddress.parquet")
df.select("firstName", "age").write.save("namesAndAges.json", format="json")

14. Stopping the SparkSession

spark.stop()

Adapted from DataCamp, "Learn Python for Data Science Interactively".
