Reading External Data Sources with Spark SQL

明天你好lk

Category: Big Data

Article link: https://blog.csdn.net/likaiasddsa/article/details/92246433

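1. Spark SQL can load data from a wide range of sources, such as MySQL, Hive, HDFS, and HBase, and it supports many formats, including JSON, Parquet, Avro, and CSV.
2. Reading data of any of these formats through the external data source API yields a DataFrame, which can then be processed with either the DataFrame API or the SQL API.
3. Save operations can optionally take a SaveMode, which specifies how to handle data that already exists at the target.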
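
As a rough illustration of point 2, the sketch below runs the same filter once through the DataFrame API and once through SQL. The in-memory rows, the view name `people`, and the local session settings are assumptions made for this example, not part of the original setup; in spark-shell, `getOrCreate()` simply returns the session that already exists.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation; spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .appName("external-source-sketch")
  .master("local[*]")
  .getOrCreate()

// Hypothetical stand-in for a DataFrame loaded from an external source.
val people = spark.createDataFrame(Seq[(String, Option[Long])](
  ("Michael", None),
  ("Andy", Some(30L)),
  ("Justin", Some(19L))
)).toDF("name", "age")

// DataFrame API: rows with a null age do not satisfy the predicate.
val viaApi = people.filter(people("age") > 20).select("name")

// SQL API: register the DataFrame as a temporary view, then query it.
people.createOrReplaceTempView("people")
val viaSql = spark.sql("SELECT name FROM people WHERE age > 20")

viaApi.show()
viaSql.show()
```

Both queries go through the same optimizer, so choosing between them is mostly a matter of style.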

Reading a JSON file

// Standard form
val df = spark.read.format("json").load("path")
// Shorthand form
spark.read.json("path")

scala> val df = spark.read.format("json").load("file:///opt/software/spark-2.2.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
scala> df.printSchema
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
scala> df.show
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

Reading Parquet data

scala> val df = spark.read.format("parquet").load("file:///opt/software/spark-2.2.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources/users.parquet")
df: org.apache.spark.sql.DataFrame = [name: string, favorite_color: string ... 1 more field]
scala> df.show
+------+--------------+----------------+
|  name|favorite_color|favorite_numbers|
+------+--------------+----------------+
|Alyssa|          null|  [3, 9, 15, 20]|
|   Ben|           red|              []|
+------+--------------+----------------+

Reading data from Hive

scala> spark.sql("show tables").show
+--------+----------+-----------+
|database| tableName|isTemporary|
+--------+----------+-----------+
| default|states_raw|      false|
| default|states_seq|      false|
| default|        t1|      false|
+--------+----------+-----------+
scala> spark.table("states_raw").show
+-----+------+
| code|  name|
+-----+------+
|hello|  java|
|hello|hadoop|
|hello|  hive|
|hello| sqoop|
|hello|  hdfs|
|hello| spark|
+-----+------+
scala> spark.sql("select name from states_raw").show
+------+
|  name|
+------+
|  java|
|hadoop|
|  hive|
| sqoop|
|  hdfs|
| spark|
+------+

Reading data from MySQL

val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306")
  .option("dbtable", "basic01.tbls")
  .option("user", "root")
  .option("password", "123456")
  .load()
scala> jdbcDF.printSchema
root
|-- TBL_ID: long (nullable = false)
|-- CREATE_TIME: integer (nullable = false)
|-- DB_ID: long (nullable = true)
|-- LAST_ACCESS_TIME: integer (nullable = false)
|-- OWNER: string (nullable = true)
|-- RETENTION: integer (nullable = false)
|-- SD_ID: long (nullable = true)
|-- TBL_NAME: string (nullable = true)
|-- TBL_TYPE: string (nullable = true)
|-- VIEW_EXPANDED_TEXT: string (nullable = true)
|-- VIEW_ORIGINAL_TEXT: string (nullable = true)
jdbcDF.show

Partition Discovery

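Table partitioning is a common optimization in systems such as Hive. In a partitioned table, data is usually stored in different directories, with the partition column values encoded in the path of each partition directory. All built-in file sources (including Text, CSV, JSON, ORC, and Parquet) can discover and infer partition information automatically. For example, we create the following directory structure: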

hdfs dfs -mkdir -p /user/hive/warehouse/gender=male/country=CN
Add a JSON file:
people.json
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
hdfs dfs -put people.json /user/hive/warehouse/gender=male/country=CN
Now read it as an external data source with Spark SQL:
scala> val df = spark.read.format("json").load("/user/hive/warehouse/gender=male/country=CN/people.json")
scala> df.printSchema
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
scala> df.show
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
Next, point the read at a directory one level higher:
scala> val df = spark.read.format("json").load("/user/hive/warehouse/gender=male/")
scala> df.printSchema
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
 |-- country: string (nullable = true)
scala> df.show
+----+-------+-------+
| age|   name|country|
+----+-------+-------+
|null|Michael|     CN|
|  30|   Andy|     CN|
|  19| Justin|     CN|
+----+-------+-------+
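Spark SQL extracts the partition information from the path automatically. Because the load path starts inside gender=male, only country shows up as a partition column; gender sits above the load path and is not discovered. Note that the data types of partition columns are inferred automatically; currently numeric types, date, timestamp, and string are supported.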
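
To also get the partition levels above the load path back as columns, the file sources accept a `basePath` option that tells partition discovery where the table root is. The sketch below shows the difference; the temporary directory (standing in for /user/hive/warehouse), the sample rows, and the local session settings are assumptions for illustration only.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation; spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .appName("partition-discovery-sketch")
  .master("local[*]")
  .getOrCreate()

// Hypothetical sample rows.
val rows = spark.createDataFrame(Seq(
  ("Andy", "male", "CN"),
  ("Justin", "male", "US")
)).toDF("name", "gender", "country")

// A fresh local directory used as the table root (file:// URI, not yet existing).
val table = Files.createTempDirectory("warehouse").resolve("people").toUri.toString

// partitionBy writes the gender=.../country=... directory layout.
rows.write.partitionBy("gender", "country").json(table)

// Loading a sub-directory: only partition levels below the path are discovered.
val withoutBase = spark.read.json(s"$table/gender=male")

// With basePath, discovery starts at the table root, so gender is recovered too.
val withBase = spark.read.option("basePath", table).json(s"$table/gender=male")

withoutBase.printSchema()
withBase.printSchema()
```

The second read still scans only the gender=male directory; `basePath` changes which path segments are interpreted as partition columns, not which files are read.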

Saving data

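Notes: 1. With the default save mode (SaveMode.ErrorIfExists), the output directory must not already exist, otherwise the write fails.
2. The text format can only save a single column, otherwise the write fails.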

// Save in JSON format
df.write.format("json").save("file:///home/hadoop/data/out1")
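
Both notes can be worked around explicitly. A SaveMode (ErrorIfExists, Append, Overwrite, or Ignore) controls what happens when the target already exists, and projecting down to one string column makes a DataFrame acceptable to the text format. The following is a minimal sketch under assumed conditions: the sample rows, the temporary output directories, and the local session settings are illustrative and not taken from the original environment.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{SaveMode, SparkSession}

// Assumption: a local session for experimentation; spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .appName("save-mode-sketch")
  .master("local[*]")
  .getOrCreate()

// Hypothetical sample rows.
val people = spark.createDataFrame(Seq(("Andy", 30L), ("Justin", 19L))).toDF("name", "age")

// Output locations under a fresh temporary directory (file:// URIs, not yet existing).
val outDir = Files.createTempDirectory("spark-out")
val jsonOut = outDir.resolve("people_json").toUri.toString
val textOut = outDir.resolve("people_text").toUri.toString

// First write: the target does not exist yet, so the default mode (ErrorIfExists) succeeds.
people.write.format("json").save(jsonOut)

// Overwrite replaces what is already there instead of failing.
people.write.mode(SaveMode.Overwrite).format("json").save(jsonOut)

// Append adds new files next to the existing ones.
people.write.mode(SaveMode.Append).format("json").save(jsonOut)

// The text source accepts exactly one string column, so select `name` first.
people.select("name").write.format("text").save(textOut)
```

After these steps the JSON directory holds the two rows twice (one overwrite plus one append), and the text directory holds one name per line.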
