1. External Data Sources Supported by Spark SQL
Spark SQL exposes a set of interfaces for connecting to external data sources, so that developers can have Spark SQL load data from almost anywhere, such as MySQL, Hive, HDFS, and HBase, and it supports many formats, such as JSON, Parquet, Avro, and CSV.
The DataFrame interface of Spark SQL supports operating on a variety of data sources. A DataFrame can be operated on in the same way as an RDD, and it can also be registered as a temporary table. Once a DataFrame has been registered as a temporary table, SQL queries can be executed against it.
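For example, a minimal sketch of that flow, assuming the customer_details.csv file used later in this section and an illustrative view name customer:
import org.apache.spark.sql.SparkSession

object Demo_TempView {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName(this.getClass.getName)
      .getOrCreate()
    // Load an external source as a DataFrame
    val df = spark.read.format("csv").option("header", true)
      .load("data/customer_details.csv")
    // Register it as a temporary view, then query it with SQL
    df.createOrReplaceTempView("customer")
    spark.sql("select customer_id, first_name from customer where gender = 'Female'").show()
  }
}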
2. Spark SQL —> CSV
2.1 Reading CSV Files
a. With a header row
e.g.
package cn.kgc.spark.Test

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object Demo_LoadCSVFile {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName(this.getClass.getName)
      .setMaster("local[4]")
    // Create the SparkSession, which is used to load csv/json files
    val spark = SparkSession.builder().config(conf).getOrCreate()
    // Load the csv file
    // .option("header", true) tells Spark the first line of the file holds the column names
    val df = spark.read.format("csv").option("header", true)
      .load("data/customer_details.csv")
    df.show()
    df.printSchema()
  }
}
result:
"C:\Program Files\Java\jdk1.8.0_231\bin\java.exe" ...
2021-01-09 13:01:07,596 WARN [org.apache.spark.SparkContext] - Using
an existing SparkContext; some configuration may not take effect.
+-----------+----------+-----------+--------------------+------+--------------------+-------+---------+--------------------+--------------------+--------------------+
|customer_id|first_name| last_name| email|gender| address|country| language| job| credit_type| credit_no|
+-----------+----------+-----------+--------------------+------+--------------------+-------+---------+--------------------+--------------------+--------------------+
| 1| Spencer| Raffeorty|sraffeorty0@dropb...| Male| 9274 Lyons Court| China| Khmer|Safety Technician...| jcb| 3589373385487660 |
| 2| Cherye| Poynor| cpoynor1@51.la|Female|1377 Anzinger Avenue| China| Czech| Research Nurse| instapayment| 6376594861844530 |
| 3| Natasha| Abendroth|nabendroth2@scrib...|Female| 2913 Evergreen Lane| China| Yiddish|Budget/Accounting...| visa| 4041591905616350 |
| 4| Huntley| Seally| hseally3@prlog.org| Male| 694 Del Sol Lane| China| Albanian|Environmental Spe...| laser| 677118310740263000 |
| 5| Druci| Coad| dcoad4@weibo.com|Female| 16 Debs Way| China| Hebrew| Teacher| jcb| 3537287259845040 |
| 6| Sayer| Brizell|sbrizell5@opensou...| Male| 71 Banding Terrace| China| Maltese| Accountant IV| americanexpress| 379709885387687 |
| 7| Becca| Brawley|bbrawley6@sitemet...|Female|7 Doe Crossing Ju...| China| Czech|Payment Adjustmen...| jcb| 3545380000000000 |
| 8| Michele| Bastable| mbastable7@sun.com|Female|98 Clyde Gallaghe...| China|Malayalam| Tax Accountant| jcb| 3588130000000000 |
| 9| Marla|Brotherhood|mbrotherhood8@ill...|Female|4538 Fair Oaks Trail| China| Dari| Design Engineer| china-unionpay|5602230000000000000 |
| 10| Lionello| Gogarty|lgogarty9@histats...| Male| 800 Sage Alley| China| Danish| Clinical Specialist|diners-club-carte...| 30290800000000 |
| 11| Camile| Ringer| cringera@army.mil|Female|5060 Fairfield Alley| China| Punjabi| Junior Executive| china-unionpay| 5602210000000000 |
| 12| Gillan| Banbridge|gbanbridgeb@wikip...|Female| 91030 Havey Point| China| Kurdish| Chemical Engineer| jcb| 3555950000000000 |
| 13| Guinna| Damsell|gdamsellc@spiegel.de|Female| 869 Ohio Park| China| Fijian| Analyst Programmer| jcb| 3532010000000000 |
| 14| Octavia| McDugal|omcdugald@rambler.ru|Female| 413 Forster Center| China| English|Desktop Support T...| maestro| 502018000000000000 |
| 15| Anjanette| Penk| apenke@lulu.com|Female| 8154 Schiller Road| China| Swedish| VP Sales| jcb| 3548040000000000 |
| 16| Maura| Teesdale|mteesdalef@globo.com|Female| 9568 Quincy Alley| China| Dutch| Dental Hygienist| jcb| 3582890000000000 |
| 17| Chastity| Neylon| cneylong@wix.com|Female|11952 Northwester...| China| Gujarati| Geologist I| jcb| 3543180000000000 |
| 18| Ralph| Coils|rcoilsh@artisteer...| Male| 5070 Hooker Pass| China| Khmer| Speech Pathologist| mastercard| 5505790000000000 |
| 19| Jehanna| Whybrow| jwhybrowi@wp.com|Female| 83 Maywood Way| China| Fijian| Assistant Professor| visa-electron| 4844570000000000 |
| 20| Ingeberg| Sutehall|isutehallj@feedbu...|Female| 272 Butternut Drive| China| Kazakh| Assistant Manager| visa-electron| 4405830000000000 |
+-----------+----------+-----------+--------------------+------+--------------------+-------+---------+--------------------+--------------------+--------------------+
only showing top 20 rows
root
|-- customer_id: string (nullable = true)
|-- first_name: string (nullable = true)
|-- last_name: string (nullable = true)
|-- email: string (nullable = true)
|-- gender: string (nullable = true)
|-- address: string (nullable = true)
|-- country: string (nullable = true)
|-- language: string (nullable = true)
|-- job: string (nullable = true)
|-- credit_type: string (nullable = true)
|-- credit_no: string (nullable = true)
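With only the header option, every column comes back as a string, as the schema above shows. If typed columns are wanted without writing a schema by hand, the inferSchema option can be added to the read above (a sketch, assuming the same file):
// Ask Spark to sample the file and infer column types instead of defaulting to string
val dfTyped = spark.read.format("csv")
  .option("header", true)
  .option("inferSchema", true)
  .load("data/customer_details.csv")
dfTyped.printSchema()   // e.g. customer_id would now be inferred as an integer type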
b. Without a header row
When the CSV file being loaded has no header row, the schema needs to be specified explicitly;
e.g.
"C:\Program Files\Java\jdk1.8.0_231\bin\java.exe" ...
package SparkSQL

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object Demo9_LoadCSVFile {
  def main(args: Array[String]): Unit = {
    // Create the SparkSession
    val spark = SparkSession.builder()
      .master("local[4]")              // set the master
      .appName(this.getClass.getName)  // set the appName
      .getOrCreate()
    // Build the schema
    val schema = StructType(Array(
      StructField("number", StringType, true),
      StructField("date", StringType, true),
      StructField("orderNo", StringType, true),
      StructField("status", StringType, true)
    ))
    val df = spark.read.format("csv")
      .schema(schema)
      .load("file:///F:\\notes\\java\\SparkFirst\\data2\\orders.csv")
    df.show()
    df.printSchema()
  }
}
result:
+------+-------------------+-------+---------------+
|number| date|orderNo| status|
+------+-------------------+-------+---------------+
| 1|2013-07-25 00:00:00| 11599| CLOSED|
| 2|2013-07-25 00:00:00| 256|PENDING_PAYMENT|
| 3|2013-07-25 00:00:00| 12111| COMPLETE|
| 4|2013-07-25 00:00:00| 8827| CLOSED|
| 5|2013-07-25 00:00:00| 11318| COMPLETE|
| 6|2013-07-25 00:00:00| 7130| COMPLETE|
| 7|2013-07-25 00:00:00| 4530| COMPLETE|
| 8|2013-07-25 00:00:00| 2911| PROCESSING|
| 9|2013-07-25 00:00:00| 5657|PENDING_PAYMENT|
| 10|2013-07-25 00:00:00| 5648|PENDING_PAYMENT|
| 11|2013-07-25 00:00:00| 918| PAYMENT_REVIEW|
| 12|2013-07-25 00:00:00| 1837| CLOSED|
| 13|2013-07-25 00:00:00| 9149|PENDING_PAYMENT|
| 14|2013-07-25 00:00:00| 9842| PROCESSING|
| 15|2013-07-25 00:00:00| 2568| COMPLETE|
| 16|2013-07-25 00:00:00| 7276|PENDING_PAYMENT|
| 17|2013-07-25 00:00:00| 2667| COMPLETE|
| 18|2013-07-25 00:00:00| 1205| CLOSED|
| 19|2013-07-25 00:00:00| 9488|PENDING_PAYMENT|
| 20|2013-07-25 00:00:00| 9198| PROCESSING|
+------+-------------------+-------+---------------+
only showing top 20 rows
root
|-- number: string (nullable = true)
|-- date: string (nullable = true)
|-- orderNo: string (nullable = true)
|-- status: string (nullable = true)
Process finished with exit code 0
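Because every field above was declared as StringType, number and date stay strings in the schema. The StructFields can declare richer types directly; a sketch, assuming the same orders.csv layout (the timestampFormat option is set here to match the sample data):
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType, TimestampType}

// Declare typed fields so the CSV parser converts the strings while loading
val typedSchema = StructType(Array(
  StructField("number", IntegerType, true),
  StructField("date", TimestampType, true),
  StructField("orderNo", IntegerType, true),
  StructField("status", StringType, true)
))
val typedDf = spark.read.format("csv")
  .schema(typedSchema)
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")  // matches 2013-07-25 00:00:00
  .load("file:///F:\\notes\\java\\SparkFirst\\data2\\orders.csv")
typedDf.printSchema()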
2.2 Writing CSV Files
When writing a DataFrame out as a CSV file, in addition to the save mode you can also specify the field delimiter;
df.write.format("csv").mode("overwrite").option("sep", "\t")
  .save("F:\\notes\\java\\SparkFirst\\output\\CSVdownload")
e.g.
package SparkSQL

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object Demo10_WriteCSVFile {
  def main(args: Array[String]): Unit = {
    // Create the SparkSession
    val spark = SparkSession.builder()
      .master("local[4]")              // set the master
      .appName(this.getClass.getName)  // set the appName
      .getOrCreate()
    val sc = spark.sparkContext
    import spark.implicits._
    val peopleRDD = sc.textFile("data2/people.txt")
    // Associate the RDD with schema information to obtain a DataFrame.
    // The schema is built with StructType; each StructField describes one field:
    // the first argument is the field name, the second the field type, the third whether it is nullable
    val schema = StructType(Array(
      StructField("name", StringType, true),
      StructField("age", IntegerType, true)
    ))
    // Split each line on commas and convert it into an RDD[Row]
    val peopleRowRDD = peopleRDD.map(_.split(","))
      .map(x => Row(x(0), x(1).trim.toInt))
    // Combine the RDD[Row] with the schema to create a DataFrame
    val df = spark.createDataFrame(peopleRowRDD, schema)
    df.write.format("csv").mode("overwrite").save("F:\\notes\\java\\SparkFirst\\output\\CSVdownload")
  }
}
result: the DataFrame is written out as CSV part files under the output\CSVdownload directory.
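To sanity-check the write, the output directory can be loaded back with the CSV reader; a sketch that reuses the name/age schema defined above (no header is written by default, so the schema is supplied again):
// Read the freshly written part files back; the column order matches the schema used when writing
val checkDf = spark.read.format("csv")
  .schema(schema)
  .load("F:\\notes\\java\\SparkFirst\\output\\CSVdownload")
checkDf.show()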
3. Spark SQL —> JSON
3.1 Reading JSON Files
Similar to reading a CSV file:
e.g.
package SparkSQL

import org.apache.spark.sql.SparkSession

object Demo4_LoadJSONFile {
  def main(args: Array[String]): Unit = {
    // Create the SparkSession
    val spark = SparkSession.builder()
      .master("local[4]")              // set the master
      .appName(this.getClass.getName)  // set the appName
      .getOrCreate()
    // Two ways to load json/csv files: spark.read.format("json").load("PATH")
    // or spark.read.json("PATH")
    val dataFrame = spark.read.json("data/users.json")
    dataFrame.show()
  }
}
3.2 Writing JSON Files
As with CSV, JSON output is written through the DataFrame.write interface:
df.write.format("json").mode("overwrite")
.save("F:\\notes\\java\\SparkFirst\\output\\JSONdownload")
4. Spark SQL —> Parquet
Reading & Writing Parquet Files
Parquet is a popular columnar storage format. Files are stored in binary form and contain both the data and its metadata.
e.g.
package sparksql2

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

object Demo1_ReadWriteParquet {
  def main(args: Array[String]): Unit = {
    // Create the SparkSession
    val spark = SparkSession.builder()
      .master("local[4]")              // set the master
      .appName(this.getClass.getName)  // set the appName
      .getOrCreate()
    val sc = spark.sparkContext
    val rdd = sc.parallelize(List(("curry", "yellow", Array(3, 9, 15, 20)), ("jack", "red", null)))
    // 1. Define the schema
    val schema = StructType(Array(
      StructField("name", StringType, true),
      StructField("favorite_color", StringType, true),
      StructField("favorite_numbers", ArrayType(IntegerType), true)
    ))
    // 2. Prepare an RDD of Row
    val rowRDD = rdd.map(x => Row(x._1, x._2, x._3))
    // 3. Create a DataFrame from the RDD[Row] and the schema
    val df = spark.createDataFrame(rowRDD, schema)
    // 4. Writing without specifying a format defaults to Parquet
    //    (with the default save mode this fails if the path already exists)
    df.write.save("F:\\notes\\java\\SparkFirst\\output\\parquet")
    // The parquet format can also be specified explicitly.
    // .mode sets the save mode: overwrite (replace) or append (add to existing data)
    df.write.mode("overwrite").parquet("F:\\notes\\java\\SparkFirst\\output\\parquet")
    // 5. spark.read.parquet reads parquet files back into a DataFrame
    val df2 = spark.read.parquet("F:\\notes\\java\\SparkFirst\\output\\parquet")
    df2.show()
  }
}
result: the Parquet files are written to the output directory, and df2.show() prints the rows that were read back.
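Parquet output can also be partitioned by a column; a hedged sketch using the same df from the example above (the path and partition column choice are illustrative):
// Each distinct favorite_color value gets its own sub-directory under the output path
df.write
  .mode("overwrite")
  .partitionBy("favorite_color")
  .parquet("F:\\notes\\java\\SparkFirst\\output\\parquet_by_color")
// Reading the top-level directory restores favorite_color as a normal column
spark.read.parquet("F:\\notes\\java\\SparkFirst\\output\\parquet_by_color").show()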
5. Spark SQL —> Hive
5.1 Integrating Spark with Hive
Copy hive-site.xml from the hive/conf directory into the spark/conf directory;
[root@single ~]# cp /opt/software/hadoop/hive110/conf/hive-site.xml /opt/software/hadoop/spark244/conf
Copy mysql-connector-java-5.1.32.jar from the hive/lib directory into the spark/jars directory;
[root@single lib]# cp mysql-connector-java-5.1.32.jar /opt/software/hadoop/spark244/jars/
Start spark-shell with the copied jar;
spark-shell --jars /opt/software/hadoop/spark244/jars/mysql-connector-java-5.1.32.jar
Access Hive (remember to start the Hive metastore service first):
scala> spark.sql("show databases").show()
5.2 Connecting to Hive from IDEA
Add the dependencies to the Maven project (for other versions see https://mvnrepository.com/);
<!-- spark-hive -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.11</artifactId>
<version>2.4.4</version>
</dependency>
<!-- mysql-connector-java -->
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.31</version>
</dependency>
Copy the hive-site.xml file into the resources folder and fill in the full metastore address;
Create the project and connect to Hive:
package sparksql2

import org.apache.spark.sql.SparkSession

object Demo2_ConnectHive {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")              // set the master
      .appName(this.getClass.getName)  // set the appName
      .enableHiveSupport()             // enable Hive support
      .getOrCreate()
    val df = spark.sql("select * from pv")
    df.show()
  }
}
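With Hive support enabled, a DataFrame can also be saved back as a Hive table through saveAsTable; a minimal sketch (the target table name pv_copy is made up for illustration):
// Persist the query result as a managed Hive table (created if it does not exist)
df.write.mode("overwrite").saveAsTable("pv_copy")
spark.sql("show tables").show()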
TIPS: a problem that may come up: MySQL access denied! One option is to log in to MySQL on Linux, re-grant privileges to the login user and reset its password, then run flush privileges to refresh MySQL's privilege tables; alternatively, restart the MySQL server so the change takes effect.
mysql> grant all on *.* to 'root'@'%' identified by 'root';
mysql> grant all on *.* to 'root'@'localhost' identified by 'root';
mysql> flush privileges;
6. Spark SQL —> MySQL
Create a project and connect to MySQL:
package sparksql2

import java.util.Properties

import org.apache.spark.sql.SparkSession

object Demo3_ConnectMysql {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")              // set the master
      .appName(this.getClass.getName)  // set the appName
      .enableHiveSupport()
      .getOrCreate()
    // Set the url of the MySQL instance to access, and the table name
    val url = "jdbc:mysql://192.168.182.130:3306/mysqltest"
    val tablename = "student"
    val props = new Properties()
    // Set the MySQL user name, password, and JDBC driver
    props.setProperty("user", "root")
    props.setProperty("password", "root")
    props.setProperty("driver", "com.mysql.jdbc.Driver")
    // Read the MySQL table through spark.read.jdbc
    val df = spark.read.jdbc(url, tablename, props)
    df.show()
  }
}
"C:\Program Files\Java\jdk1.8.0_231\bin\java.exe"...
+---+--------+------+--------+
|sid| sname|gender|class_id|
+---+--------+------+--------+
| 1| 孙尚香| 女| 1|
| 2| 貂蝉| 女| 1|
| 3| 刘备| 男| 2|
| 4| 孙二娘| 女| 2|
| 5| 张飞| 男| 3|
| 6| 关羽| 男| 4|
| 7| 林黛玉| 女| 5|
| 8| 薛宝钗| 女| 6|
| 9| 宋江| 男| 6|
| 10| 白骨精| 女| 7|
| 11| 猪八戒| 男| 8|
| 12| 王熙凤| 女| 1|
| 13| 李师师| 女| 2|
| 14| 金翠莲| 女| 9|
| 15| 如花| 女| 1|
| 16| 沙僧| 男| 2|
| 17| 李美丽| 女| 3|
| 18|金角大王| 男| 4|
+---+--------+------+--------+
Process finished with exit code 0
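Writing back to MySQL uses the same url and connection properties through DataFrame.write.jdbc; a sketch assuming an illustrative target table student_copy:
// Append the rows to student_copy (Spark creates the table on first write if it is missing)
df.write
  .mode("append")
  .jdbc(url, "student_copy", props)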