Background
- Every Spark application starts with loading data and ends with saving data
- Loading and saving data is not easy:
- Raw data must be parsed: text/json/parquet
- Data often has to be converted between formats
- Datasets are stored in a variety of formats and systems
(Diagram: datasets stored across a variety of systems)
From the user's point of view the requirement is clear: conveniently and quickly load data from different sources (JSON, Parquet, RDBMS), process them together (e.g. join JSON with Parquet), and then write the result back in a chosen format (JSON, Parquet) to a target system (HDFS, S3).
Spark SQL 1.2 introduced the external Data Sources API for exactly this purpose.
Overview
- An extension mechanism for integrating a variety of external data sources into Spark SQL
- Can read and write DataFrames using a variety of formats and storage systems
- The Data Sources API can automatically prune columns and push filters down to the source, e.g. Parquet/JDBC
- New API introduced in 1.2
(Diagram: the data flow after the external Data Sources API is introduced)
Goals
- Developer: build libraries for various data sources
- User: easy loading/saving DataFrames
For developers: does the code have to be merged into Spark?
For example, if we implement a data source that reads Weibo data, does that code need to be contributed back into Spark itself?
No. The library only has to be supplied at run time, e.g. via --jars.
For users, the API looks like this:
Read:
spark.read.format(format)
The format falls into two categories:
- built-in: json, parquet, jdbc, csv (csv is built in as of Spark 2.x; in 1.x it required an external package)
- packages: external data sources that are not bundled with Spark
More packages can be found at https://spark-packages.org/
Write:
people.write.format("parquet").save("path")
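Putting read and write together, a minimal sketch of the user-facing flow (assuming an existing SparkSession named spark; the paths and the name column are hypothetical):
// hypothetical paths; any supported format can be plugged in via format(...)
val df = spark.read.format("json").load("file:///path/to/people.json")
df.select("name").write.format("parquet").save("file:///path/to/parquet_out")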
Working with Parquet data
Processing with the Scala API
import org.apache.spark.sql.SparkSession

/**
 * Parquet data source example
 */
object ParquetApp {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ParquetApp")
      .master("local[2]")
      .getOrCreate()

    // spark.read.format("parquet").load(...) is the standard way to read Parquet
    val userDF = spark.read.format("parquet")
      .load("file:///home/hadoop/app/spark-2.2.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources/users.parquet")

    userDF.printSchema()
    userDF.show()

    userDF.select("name", "favorite_color").show()
    userDF.select("name", "favorite_color")
      .write.format("json").save("file:///home/hadoop/tmp/jsonout")

    // works without format(): Parquet is Spark SQL's default data source
    spark.read.load("file:///home/hadoop/app/spark-2.2.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources/users.parquet").show()

    // fails: no format() given, so Spark SQL assumes Parquet, but this file is JSON
    spark.read.load("file:///home/hadoop/app/spark-2.2.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources/people.json").show()

    // the path can also be passed as an option instead of to load()
    spark.read.format("parquet")
      .option("path", "file:///home/hadoop/app/spark-2.2.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources/users.parquet")
      .load().show()

    spark.stop()
  }
}
Processing with SQL
-- note the USING clause
CREATE TEMPORARY VIEW parquetTable
USING org.apache.spark.sql.parquet
OPTIONS (
path "/home/hadoop/app/spark-2.2.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources/users.parquet"
)
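Once the view is registered (e.g. by running the statement above through spark.sql, or in the spark-sql shell), it can be queried like an ordinary table; a minimal sketch:
// query the temporary view registered above; users.parquet has name/favorite_color columns
spark.sql("SELECT name, favorite_color FROM parquetTable").show()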
Notes
You may see this error:
RuntimeException: file:/home/hadoop/app/spark-2.2.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources/people.json is not a Parquet file
Cause: Spark SQL's default data source is Parquet, as defined in SQLConf:
val DEFAULT_DATA_SOURCE_NAME = SQLConfigBuilder("spark.sql.sources.default")
.doc("The default data source to use in input/output.")
.stringConf
.createWithDefault("parquet")
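Because the default comes from this configuration entry, it can be overridden per session; a minimal sketch (the JSON path is hypothetical):
// make JSON the default data source for this session, so load() without format() reads JSON
spark.conf.set("spark.sql.sources.default", "json")
spark.read.load("file:///path/to/people.json").show()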
Working with Hive table data
- Read:
spark.table(tableName)
- Write:
df.write.saveAsTable(tableName)
Writing a query result into a Hive table:
spark.sql("select deptno, count(1) as mount from emp where group by deptno").filter("deptno is not null").write.saveAsTable("hive_table")
Note: in production, always tune spark.sql.shuffle.partitions; the default is 200 (see the sketch below).
For how Spark SQL is set up to access Hive tables, see the post "Spark SQL整合Hive使用" (Integrating Spark SQL with Hive).
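A minimal sketch that puts the Hive read/write APIs together, assuming a SparkSession built with enableHiveSupport(), an existing emp table, and 10 as a purely illustrative partition count (the output table name is hypothetical):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("HiveTableApp")
  .enableHiveSupport()   // needed so spark.table / saveAsTable go through the Hive metastore
  .getOrCreate()

// lower the shuffle parallelism before the aggregation (default is 200)
spark.conf.set("spark.sql.shuffle.partitions", "10")

spark.table("emp")
  .groupBy("deptno").count()
  .filter("deptno is not null")
  .write.saveAsTable("hive_table_agg")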
Working with MySQL table data
Approach 1
spark.read.format("jdbc").option("url", "jdbc:mysql://localhost:3306/hive")
.option("dbtable", "hive.TBLS")
.option("user", "root")
.option("password", "root")
.option("driver", "com.mysql.jdbc.Driver")
.load()
Approach 2
import java.util.Properties

val connectionProperties = new Properties()
connectionProperties.put("user", "root")
connectionProperties.put("password", "root")
connectionProperties.put("driver", "com.mysql.jdbc.Driver")

val jdbcDF2 = spark.read.jdbc("jdbc:mysql://localhost:3306", "hive.TBLS", connectionProperties)
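The same connection properties can be reused to write a DataFrame back to MySQL; a minimal sketch (the target table spark_tbls_copy is hypothetical):
// write back over JDBC; the default SaveMode fails if the table already exists
jdbcDF2.write.jdbc("jdbc:mysql://localhost:3306", "hive.spark_tbls_copy", connectionProperties)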
Approach 3 (SQL)
CREATE TEMPORARY VIEW jdbcTable
USING org.apache.spark.sql.jdbc
OPTIONS (
url "jdbc:mysql://localhost:3306",
dbtable "hive.TBLS",
user "root",
password "root",
driver "com.mysql.jdbc.Driver"
)
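As with the Parquet view, this statement can be run through spark.sql, after which jdbcTable can be joined with views backed by other sources, which is exactly the mixed-source processing described in the background. A small sketch, assuming both views are registered in the same session (the join itself is only illustrative):
// join a JDBC-backed view with a Parquet-backed view in a single query
spark.sql(
  """SELECT t.TBL_NAME, p.name
    |FROM jdbcTable t CROSS JOIN parquetTable p""".stripMargin).show()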