Spark SQL: Reading and Writing Data (Part 12)

java大数据编程

Category: spark. Tags: SparkSQL, Parquet, JSON, Hive, JDBC

Article link: https://blog.csdn.net/cold_wolfie/article/details/82117100

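Introduction

Spark SQL supports a variety of structured data sources and makes it easy to read Row objects from them. These sources include Parquet, JSON, Hive tables, and relational databases.

When a query uses only a subset of the fields, Spark SQL can scan just those fields instead of naively scanning all of the data, as the hadoopFile method does.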
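To make the benefit of reading only the needed fields concrete, here is a toy sketch in plain Python. It is not how Spark or Parquet is implemented. The sample records and the helper names (`read_field_row_layout`, `to_column_layout`, `read_field_column_layout`) are made up for illustration. The sketch simply counts how many stored values a reader touches to fetch one field under a row-oriented layout versus a column-oriented layout.

```python
# Toy illustration only: NOT Spark's or Parquet's actual implementation.
# Sample records are invented for this sketch.
rows = [
    {"name": "Michael", "age": None},
    {"name": "Andy", "age": 30},
    {"name": "Justin", "age": 19},
]


def read_field_row_layout(records, field):
    """Row layout: every record is scanned in full, then one field is kept."""
    touched = 0
    values = []
    for record in records:
        touched += len(record)  # all fields of the record are read
        values.append(record[field])
    return values, touched


def to_column_layout(records):
    """Pivot records into one list per field (a hypothetical columnar store)."""
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


def read_field_column_layout(columns, field):
    """Column layout: only the requested column is read."""
    values = columns[field]
    return list(values), len(values)


row_values, row_touched = read_field_row_layout(rows, "name")
col_values, col_touched = read_field_column_layout(to_column_layout(rows), "name")

# Same result either way, but the columnar reader touches fewer values.
print(row_values == col_values, row_touched, col_touched)
```

With two fields per record the saving is small, but it grows with the number of fields the query does not need.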

Parquet

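Parquet is a popular columnar storage format that efficiently stores records with nested fields. Parquet files automatically preserve the schema of the original data. When writing Parquet files, all columns are automatically converted to nullable.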

scala

// Encoders for most common types are automatically provided by importing spark.implicits._
import spark.implicits._

val peopleDF = spark.read.json("examples/src/main/resources/people.json")

// DataFrames can be saved as Parquet files, maintaining the schema information
peopleDF.write.parquet("people.parquet")

// Read in the parquet file created above
// Parquet files are self-describing so the schema is preserved
// The result of loading a Parquet file is also a DataFrame
val parquetFileDF = spark.read.parquet("people.parquet")

// Parquet files can also be used to create a temporary view and then used in SQL statements
parquetFileDF.createOrReplaceTempView("parquetFile")
val namesDF = spark.sql("SELECT name FROM parquetFile WHERE age BETWEEN 13 AND 19")
namesDF.map(attributes => "Name: " + attributes(0)).show()
// +------------+
// |       value|
// +------------+
// |Name: Justin|
// +------------+

java

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> peopleDF = spark.read().json("examples/src/main/resources/people.json");

// DataFrames can be saved as Parquet files, maintaining the schema information
peopleDF.write().parquet("people.parquet");

// Read in the Parquet file created above.
// Parquet files are self-describing so the schema is preserved
// The result of loading a parquet file is also a DataFrame
Dataset<Row> parquetFileDF = spark.read().parquet("people.parquet");

// Parquet files can also be used to create a temporary view and then used in SQL statements
parquetFileDF.createOrReplaceTempView("parquetFile");
Dataset<Row> namesDF = spark.sql("SELECT name FROM parquetFile WHERE age BETWEEN 13 AND 19");
Dataset<String> namesDS = namesDF.map(
    (MapFunction<Row, String>) row -> "Name: " + row.getString(0),
    Encoders.STRING());
namesDS.show();
// +------------+
// |       value|
// +------------+
// |Name: Justin|
// +------------+

python

peopleDF = spark.read.json("examples/src/main/resources/people.json")

# DataFrames can be saved as Parquet files, maintaining the schema information.
peopleDF.write.parquet("people.parquet")

# Read in the Parquet file created above.
# Parquet files are self-describing so the schema is preserved.
# The result of loading a parquet file is also a DataFrame.
parquetFile = spark.read.parquet("people.parquet")

# Parquet files can also be used to create a temporary view and then used in SQL statements.
parquetFile.createOrReplaceTempView("parquetFile")
teenagers = spark.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
teenagers.show()
# +------+
# |  name|
# +------+
# |Justin|
# +------+

sql

CREATE TEMPORARY VIEW parquetTable
USING org.apache.spark.sql.parquet
OPTIONS (
  path "examples/src/main/resources/people.parquet"
)

SELECT * FROM parquetTable

JSON

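Spark SQL can automatically infer the schema of a JSON dataset and load it as a Dataset of Row objects.

By default, the JSON file that Spark SQL reads is not a conventional JSON file: each line must contain a separate, self-contained, valid JSON object. For a regular multi-line JSON file, set the multiLine option to true.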
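The difference between the two file layouts can be sketched with Python's standard `json` module alone. This is an illustration of the formats, not of Spark's reader. The sample text and the helper names `parse_json_lines` and `parse_multi_line` are assumptions made for this example.

```python
import json

# Hypothetical sample content, invented for this sketch.
json_lines_text = '{"name": "Andy", "age": 30}\n{"name": "Justin", "age": 19}\n'
multi_line_text = """[
  {"name": "Andy", "age": 30},
  {"name": "Justin", "age": 19}
]"""


def parse_json_lines(text):
    """Line-delimited JSON: each non-empty line is its own JSON document."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def parse_multi_line(text):
    """Regular JSON file: the whole text is parsed as a single document."""
    doc = json.loads(text)
    return doc if isinstance(doc, list) else [doc]


records_a = parse_json_lines(json_lines_text)
records_b = parse_multi_line(multi_line_text)
print(records_a == records_b)  # both layouts describe the same records

# A line-by-line parser cannot handle the multi-line layout:
# the first line, "[", is not a complete JSON document on its own.
try:
    parse_json_lines(multi_line_text)
    line_parse_ok = True
except json.JSONDecodeError:
    line_parse_ok = False
print(line_parse_ok)
```

In PySpark, the multi-line case corresponds to `spark.read.option("multiLine", True).json(path)`. How Spark itself reports a malformed line may differ from the exception raised in this sketch.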

scala

// Primitive types (Int, String, etc) and Product types (case classes) encoders are
// supported by importing this when creating a Dataset.
import spark.implicits._

// A JSON dataset is pointed to by path.
// The path can be either a single text file or a directory storing text files
val path = "examples/src/main/resources/people.json"
val peopleDF = spark.read.json(path)

// The inferred schema can be visualized using the printSchema() method
peopleDF.printSchema()
// root
//  |-- age: long (nullable = true)
//  |-- name: string (nullable = true)
