Spark SQL---入门（一）

最新推荐文章于 2023-11-21 05:27:29 发布

Zhouxk96

最新推荐文章于 2023-11-21 05:27:29 发布

阅读量448

点赞数

本文链接：https://blog.csdn.net/weixin_42051034/article/details/105568875

版权

Spark SQL---入门

1. 入门
参考

1. 入门

1.1 起点：SparkSession

SparkSession类是Spark中所有功能的入口点。要创建一个基本的SparkSession，只需使用SparkSession.builder：

python

from pyspark.sql  import SparkSession

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

在Spark存储库中的“ examples / src / main / python / sql / basic.py”中找到完整的示例代码。

scala

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()
//对于隐式转换，例如将RDD转换为DataFrames
import spark.implicits._

在Spark存储库中的“ examples / src / main / scala / org / apache / spark / examples / sql / SparkSQLExample.scala”中找到完整的示例代码。

1.2 创建数据框

使用SparkSession，应用程序可以从现有的RDD，Hive表的或Spark数据源创建DataFrame 。

例如，以下内容基于JSON文件的内容创建一个DataFrame：

python

# spark is an existing SparkSession
df = spark.read.json("examples/src/main/resources/people.json")
# Displays the content of the DataFrame to stdout
df.show()
# +----+-------+
# | age|   name|
# +----+-------+
# |null|Michael|
# |  30|   Andy|
# |  19| Justin|
# +----+-------+

在Spark存储库中的“ examples / src / main / python / sql / basic.py”中找到完整的示例代码。

scala

val df = spark.read.json("examples/src/main/resources/people.json")

// Displays the content of the DataFrame to stdout
df.show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+

在Spark存储库中的“ examples / src / main / scala / org / apache / spark / examples / sql / SparkSQLExample.scala”中找到完整的示例代码。

1.3 无类型的数据集操作（又名DataFrame操作）

DataFrames为Scala，Java，Python和R中的结构化数据操作提供了一种特定于域的语言。

如上所述，在Spark 2.0中，DataFrames只是RowScala和Java API中的的数据集。与强类型的Scala / Java数据集附带的“类型转换”相反，这些操作也称为“非类型转换”。

这里我们包括一些使用数据集进行结构化数据处理的基本示例：

python
在Python中，可以通过属性（df.age）或通过索引（df[‘age’]）访问DataFrame的列。尽管前者便于交互式数据探索，但强烈建议用户使用后者形式，后者是未来的证明，并且不会与列名保持一致，列名也是DataFrame类的属性。

# spark, df are from the previous example
# Print the schema in a tree format
df.printSchema()
# root
# |-- age: long (nullable = true)
# |-- name: string (nullable = true)

# Select only the "name" column
df.select("name").show()
# +-------+
# |   name|
# +-------+
# |Michael|
# |   Andy|
# | Justin|
# +-------+

# Select everybody, but increment the age by 1
df.select(df['name'], df['age'] + 1).show()
# +-------+---------+
# |   name|(age + 1)|
# +-------+---------+
# |Michael|     null|
# |   Andy|       31|
# | Justin|       20|
# +-------+---------+

# Select people older than 21
df.filter(df['age'] > 21).show()
# +---+----+
# |age|name|
# +---+----+
# | 30|Andy|
# +---+----+

# Count people by age
df.groupBy("age").count().show()
# +----+-----+
# | age|count|
# +----+-----+
# |  19|    1|
# |null|    1|
# |  30|    1|
# +----+-----+

在Spark存储库中的“ examples / src / main / python / sql / basic.py”中找到完整的示例代码。

scala

// This import is needed to use the $-notation
import spark.implicits._
// Print the schema in a tree format
df.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)

// Select only the "name" column
df.select("name").show()
// +-------+
// |   name|
// +-------+
// |Michael|
// |   Andy|
// | Justin|
// +-------+

// Select everybody, but increment the age by 1
df.select($"name", $"age" + 1).show()
// +-------+---------+
// |   name|(age + 1)|
// +-------+---------+
// |Michael|     null|
// |   Andy|       31|
// | Justin|       20|
// +-------+---------+

// Select people older than 21
df.filter($"age" > 21).show()
// +---+----+
// |age|name|
// +---+----+
// | 30|Andy|
// +---+----+

// Count people by age
df.groupBy("age").count().show()
// +----+-----+
// | age|count|
// +----+-----+
// |  19|    1|
// |null|    1|
// |  30|    1|
// +----+-----+

在Spark存储库中的“ examples / src / main / scala / org / apache / spark / examples / sql / SparkSQLExample.scala”中找到完整的示例代码。

1.4 以编程方式运行SQL查询

上述的 sql 函数的 SparkSession 使应用程序能够以编程方式运行SQL查询，并以形式返回结果DataFrame。

python

# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")
sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()
# +----+-------+
# | age|   name|
# +----+-------+
# |null|Michael|
# |  30|   Andy|
# |  19| Justin|
# +----+-------+

在Spark存储库中的“ examples / src / main / python / sql / basic.py”中找到完整的示例代码。

scala

// Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")

val sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+

在Spark存储库中的“ examples / src / main / scala / org / apache / spark / examples / sql / SparkSQLExample.scala”中找到完整的示例代码。

1.5 全局临时视图

Spark SQL中的临时视图是会话作用域的，如果创建它的会话终止，它将消失。如果要在所有会话之间共享一个临时视图并保持活动状态，直到Spark应用程序终止，则可以创建全局临时视图。全局临时视图与系统保留的数据库相关联global_temp，我们必须使用限定名称来引用它，例如SELECT * FROM global_temp.view1。

python

# Register the DataFrame as a global temporary view
df.createGlobalTempView("people")

# Global temporary view is tied to a system preserved database `global_temp`
spark.sql("SELECT * FROM global_temp.people").show()
# +----+-------+
# | age|   name|
# +----+-------+
# |null|Michael|
# |  30|   Andy|
# |  19| Justin|
# +----+-------+

# Global temporary view is cross-session
spark.newSession().sql("SELECT * FROM global_temp.people").show()
# +----+-------+
# | age|   name|
# +----+-------+
# |null|Michael|
# |  30|   Andy|
# |  19| Justin|
# +----+-------+

在Spark存储库中的“ examples / src / main / python / sql / basic.py”中找到完整的示例代码。

scala

// Register the DataFrame as a global temporary view
df.createGlobalTempView("people")

// Global temporary view is tied to a system preserved database `global_temp`
spark.sql("SELECT * FROM global_temp.people").show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+

// Global temporary view is cross-session
spark.newSession().sql("SELECT * FROM global_temp.people").show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+

在Spark存储库中的“ examples / src / main / scala / org / apache / spark / examples / sql / SparkSQLExample.scala”中找到完整的示例代码。

1.5 创建数据集

数据集与RDD相似，但是它们不是使用Java序列化或Kryo，而是使用专用的Encoder对对象进行序列化以进行网络处理或传输。虽然编码器和标准序列化都负责将对象转换为字节，但是编码器是动态生成的代码，并使用一种格式，该格式允许Spark执行许多操作，例如过滤，排序和哈希处理，而无需将字节反序列化为对象。

scala

case class Person(name: String, age: Long)

// Encoders are created for case classes
val caseClassDS = Seq(Person("Andy", 32)).toDS()
caseClassDS.show()
// +----+---+
// |name|age|
// +----+---+
// |Andy| 32|
// +----+---+

// Encoders for most common types are automatically provided by importing spark.implicits._
val primitiveDS = Seq(1, 2, 3).toDS()
primitiveDS.map(_ + 1).collect() // Returns: Array(2, 3, 4)

// DataFrames can be converted to a Dataset by providing a class. Mapping will be done by name
val path = "examples/src/main/resources/people.json"
val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+

在Spark存储库中的“ examples / src / main / scala / org / apache / spark / examples / sql / SparkSQLExample.scala”中找到完整的示例代码。

java

import java.util.Arrays;
import java.util.Collections;
import java.io.Serializable;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;

public static class Person implements Serializable {
  private String name;
  private int age;

  public String getName() {
    return name;
  }

  public void setName(String name) {
    this.name = name;
  }

  public int getAge() {
    return age;
  }

  public void setAge(int age) {
    this.age = age;
  }
}

// Create an instance of a Bean class
Person person = new Person();
person.setName("Andy");
person.setAge(32);

// Encoders are created for Java beans
Encoder<Person> personEncoder = Encoders.bean(Person.class);
Dataset<Person> javaBeanDS = spark.createDataset(
  Collections.singletonList(person),
  personEncoder
);
javaBeanDS.show();
// +---+----+
// |age|name|
// +---+----+
// | 32|Andy|
// +---+----+

// Encoders for most common types are provided in class Encoders
Encoder<Integer> integerEncoder = Encoders.INT();
Dataset<Integer> primitiveDS = spark.createDataset(Arrays.asList(1, 2, 3), integerEncoder);
Dataset<Integer> transformedDS = primitiveDS.map(
    (MapFunction<Integer, Integer>) value -> value + 1,
    integerEncoder);
transformedDS.collect(); // Returns [2, 3, 4]

// DataFrames can be converted to a Dataset by providing a class. Mapping based on name
String path = "examples/src/main/resources/people.json";
Dataset<Person> peopleDS = spark.read().json(path).as(personEncoder);
peopleDS.show();
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+

在Spark存储库中的“ examples / src / main / java / org / apache / spark / examples / sql / JavaSparkSQLExample.java”中找到完整的示例代码。

1.6 与RDD互操作

Spark SQL有两种方法将RDD转为DataFrame。

使用反射机制，推导包含指定类型对象RDD的schema。这种基于反射机制的方法使代码更简洁，而且如果你事先知道数据schema，推荐使用这种方式；
编程方式构建一个schema，然后应用到指定RDD上。这种方式更啰嗦，但如果你事先不知道数据有哪些字段，或者数据schema是运行时读取进来的，那么你很可能需要用这种方式。

使用反射推导Schema
Spark SQL的Scala接口支持自动将包含case class对象的RDD转为DataFrame。对应的case class定义了表的schema。case class的参数名通过反射，映射为表的字段名。case class还可以嵌套一些复杂类型，如Seq和Array。RDD隐式转换成DataFrame后，可以进一步注册成表。随后，你就可以对表中数据使用SQL语句查询了。
编程方式定义Schema
如果不能事先通过case class定义schema（例如，记录的字段结构是保存在一个字符串，或者其他文本数据集中，需要先解析，又或者字段对不同用户有所不同），那么你可能需要按以下三个步骤，以编程方式的创建一个DataFrame：

从已有的RDD创建一个包含Row对象的RDD
用StructType创建一个schema，和步骤1中创建的RDD的结构相匹配
把得到的schema应用于包含Row对象的RDD，调用这个方法来实现这一步：SQLContext.createDataFrame

参考

https://spark.apache.org/docs/latest/sql-programming-guide.html

Zhouxk96

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
Spark SQL---入门（一）

Spark SQL---入门1. 入门1.1 起点：SparkSession1.2 创建数据框1.3 无类型的数据集操作（又名DataFrame操作）1.4 以编程方式运行SQL查询1.5 全局临时视图1.5 创建数据集1.6 与RDD互操作1. 入门1.1 起点：SparkSessionSparkSession类是Spark中所有功能的入口点。要创建一个基本的SparkSession，只需...
复制链接

扫一扫