PySpark: Using Spark SQL (Part 7)

风雨「83」

Categories: Big Data, python. Tags: flink, kafka, hadoop, spark, mapreduce

Article link: https://blog.csdn.net/wywinstonwy/article/details/106347174

1. Introduction to Spark SQL

Spark SQL is Apache Spark's module for working with structured data.

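Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL give Spark more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform additional optimizations. There are several ways to interact with Spark SQL, including SQL and the Dataset API. The same execution engine is used to compute a result, regardless of which API or language expresses the computation. This unification means developers can easily switch back and forth between APIs, choosing whichever gives the most natural way to express a given transformation.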

All examples on this page use the sample data included in the Spark distribution and can be run in spark-shell, the pyspark shell, or the sparkR shell.

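Integrated

Seamlessly mix SQL queries with Spark programs.

Spark SQL lets you query structured data inside Spark programs, using either SQL or the familiar DataFrame API. It is usable from Java, Scala, Python, and R.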

results = spark.sql(
  "SELECT * FROM people")
# A PySpark DataFrame has no .map() since Spark 2.0; go through .rdd instead.
names = results.rdd.map(lambda p: p.name)
Apply functions to results of SQL queries.

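Uniform Data Access

Connect to any data source the same way.

DataFrames and SQL provide a common way to access a variety of data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC. You can even join data across these sources.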

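# The trailing backslash continues the method chain onto the next line.
# createOrReplaceTempView replaces the deprecated registerTempTable.
spark.read.json("s3n://...") \
  .createOrReplaceTempView("json")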
results = spark.sql(
  """SELECT *
     FROM people
     JOIN json ...""")

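Hive Integration

Run SQL or HiveQL queries on existing warehouses.

Spark SQL supports the HiveQL syntax as well as Hive SerDes and UDFs, allowing you to access existing Hive warehouses.

Standard Connectivity

Connect through JDBC or ODBC.

A server mode provides industry-standard JDBC and ODBC connectivity for business intelligence tools.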

2. Spark SQL Operations

Basic Spark SQL operations

from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.types import *

# Basic Spark SQL operations
def test():
    # spark = SparkSession.builder.appName('spark005').getOrCreate()
    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL basic example") \
        .config("spark.some.config.option",
                "some-value") \
        .getOrCreate()

    df = spark.read.json('file:///Users/wangyun/Documents/BigData/script/data/people.json')

    df.show()
    df.printSchema()
    df.select('name').show()
    df.select(df['name'], df['age'] + 1).show()
    df.filter(df['age'] > 21).show()
    # Count people by age
    df.groupBy("age").count().show()

    print('createOrReplaceTempView')
    df.createOrReplaceTempView("people")
    sqlDF = spark.sql("SELECT * FROM people")
    sqlDF.show()

    spark.stop()

Output:
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)

+-------+
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
+-------+

+-------+---------+
|   name|(age + 1)|
+-------+---------+
|Michael|     null|
|   Andy|       31|
| Justin|       20|
+-------+---------+

+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+

+----+-----+
| age|count|
+----+-----+
|  30|    1|
|  19|    1|
|null|    1|
+----+-----+

createOrReplaceTempView
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+


Process finished with exit code 0

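Spark SQL supports two different methods for converting existing RDDs into Datasets. The first method uses reflection to infer the schema of an RDD that contains specific types of objects. This reflection-based approach leads to more concise code and works well when you already know the schema while writing your Spark application.

The second method for creating Datasets is a programmatic interface that lets you construct a schema and then apply it to an existing RDD. Although this method is more verbose, it lets you construct Datasets when the columns and their types are not known until runtime.

Converting an RDD to a DataFrame with a schema

Spark SQL can convert an RDD of Row objects to a DataFrame, inferring the data types. Rows are constructed by passing a list of key/value pairs as kwargs to the Row class. The keys of this list define the column names of the table, and the types are inferred by sampling the whole dataset, similar to the inference performed on JSON files.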

from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.types import *

def schemeReflection():
    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL basic example") \
        .config("spark.some.config.option",
                "some-value") \
        .getOrCreate()
    sc = spark.sparkContext
    lines = sc.textFile(
        "file:///Users/wangyun/Documents/BigData/script/data/people.txt")
    parts = lines.map(lambda l: l.split(","))
    print(parts.collect())

    people = parts.map(
        lambda p: Row(name=p[0], age=int(p[1])))
    # Infer the schema, and register the DataFrame as a table.
    schemaPeople = spark.createDataFrame(people)
    schemaPeople.createOrReplaceTempView("people")
    # SQL can be run over DataFrames that have been registered as a table.
    teenagers = spark.sql(
        "SELECT name FROM people WHERE age >= 13 AND age <= 19")

    # The results of SQL queries are Dataframe objects.
    # rdd returns the content as an :class:`pyspark.RDD` of :class:`Row`.
    teenNames = teenagers.rdd.map(
        lambda p: "Name: " + p.name).collect()
    for name in teenNames:
        print(name)
    # Name: Justin

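The schemeReflection function above infers the schema using reflection; the programmascheme function below builds the schema explicitly and applies it to the RDD.

# Create a schema and apply it to an RDD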
def programmascheme():
    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL basic example") \
        .config("spark.some.config.option",
                "some-value") \
        .getOrCreate()
    sc = spark.sparkContext

    # Load a text file and convert each line to a Row.
    lines = sc.textFile(
        "file:///Users/wangyun/Documents/BigData/script/data/people.txt")
    parts = lines.map(lambda l: l.split(","))
    # Each line is converted to a tuple.
    people = parts.map(
        lambda p: (p[0], p[1].strip()))

    # The schema is encoded in a string.
    schemaString = "name age"

    fields = [
        StructField(field_name, StringType(),
                    True) for field_name in
        schemaString.split()]
    schema = StructType(fields)

    # Apply the schema to the RDD.
    schemaPeople = spark.createDataFrame(people,
                                         schema)

    # Creates a temporary view using the DataFrame
    schemaPeople.createOrReplaceTempView("people")

    # SQL can be run over DataFrames that have been registered as a table.
    results = spark.sql("SELECT name FROM people")

    results.show()

    results = spark.sql("SELECT * FROM people")

    results.show()