Spark SQL
- Spark SQL provides the DataFrame API, which supports relational operations on a wide range of internal and external data sources
- It supports many data sources and data-analysis algorithms, and can be used together with Spark MLlib
DataFrame
The data abstraction used by Spark SQL is the DataFrame
SparkSession
- Since Spark 2.0, SparkSession is the unified entry point to Spark SQL; in the pyspark shell it is already available as the variable spark, while a standalone program must create it itself (shown below)
Creating a DataFrame
- Example
>>> df = spark.read.json("file:///usr/local/spark/examples/src/main/resources/people.json")
>>> df.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
#A standalone application must create the SparkSession object itself
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
spark = SparkSession.builder.config(conf = SparkConf()).getOrCreate()
- Formats
spark.read.text("people.txt") #spark.read.format("text").load("people.txt")
spark.read.json("people.json") #spark.read.format("json").load("people.json")
spark.read.parquet("people.parquet") #spark.read.format("parquet").load("people.parquet")
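#Other formats follow the same pattern, e.g. CSV (a sketch; the file name and the header option are assumptions about your data)
spark.read.format("csv").option("header", "true").load("people.csv")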
Saving a DataFrame
- Formats
df.write.text("people.txt") #df.write.format("text").save("people.txt")
df.write.json("people.json") #df.write.format("json").save("people.json")
df.write.parquet("people.parquet") #df.write.format("parquet").save("people.parquet")
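#By default, save() fails if the output path already exists; a save mode can be specified (a minimal sketch, the paths are hypothetical)
df.write.mode("overwrite").json("people.json") #replace any existing output
df.write.mode("append").parquet("people.parquet") #add to existing output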
- Example
>>> peopleDF = spark.read.json("file:///usr/local/spark/examples/src/main/resources/people.json")
>>> peopleDF = peopleDF.select("name") #note: this overwrites peopleDF with only the name column; using a new variable would be clearer
>>> peopleDF.show()
+-------+
| name|
+-------+
|Michael|
| Andy|
| Justin|
+-------+
>>> peopleDF.write.text("file:///usr/local/spark/mycode/sparksql/newpeople.txt")
$ cat /usr/local/spark/mycode/sparksql/newpeople.txt
cat: /usr/local/spark/mycode/sparksql/newpeople.txt: Is a directory
$ cd /usr/local/spark/mycode/sparksql/newpeople.txt/
$ cat part-00000-316126cb-f866-417b-bdb1-13e87135112c-c000.txt
Michael
Andy
Justin
#To read the data back later, simply read the newpeople.txt directory, as in the sketch below
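#A minimal sketch of reading the saved output back (the variable name textDF is arbitrary)
>>> textDF = spark.read.text("file:///usr/local/spark/mycode/sparksql/newpeople.txt")
>>> textDF.show()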
Common DataFrame Operations
>>> df = spark.read.json("file:///usr/local/spark/examples/src/main/resources/people.json")
#printSchema() prints the schema of the DataFrame
>>> df.printSchema()
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
#select() selects a subset of columns (column expressions are allowed)
>>> df.select(df["name"], df["age"] + 1).show()
+-------+---------+
| name|(age + 1)|
+-------+---------+
|Michael| null|
| Andy| 31|
| Justin| 20|
+-------+---------+
#filter() performs a conditional query
>>> df.filter(df["age"] > 20).show()
+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+
#groupBy() groups the records
>>> df.groupBy("age").count().show()
+----+-----+
| age|count|
+----+-----+
| 19| 1|
|null| 1|
| 30| 1|
+----+-----+
#sort() orders the records
>>> df.sort(df["age"].desc()).show()
+----+-------+
| age| name|
+----+-------+
| 30| Andy|
| 19| Justin|
|null|Michael|
+----+-------+
>>> df.sort(df["name"].desc(),df["age"].desc()).show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 19| Justin|
| 30| Andy|
+----+-------+
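#The operations above can be chained into a single query (a minimal sketch; the age threshold 20 is arbitrary)
>>> df.filter(df["age"] > 20).select("name", "age").sort("age", ascending=False).show()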
Converting an RDD to a DataFrame
- Inferring the RDD schema via reflection
>>> from pyspark.sql import Row
>>> people = spark.sparkContext.\
... textFile("file:///usr/local/spark/examples/src/main/resources/people.txt").\
... map(lambda line: line.split(",")).\
... map(lambda p: Row(name=p[0], age=int(p[1])))
>>> schemaPeople = spark.createDataFrame(people)
#The DataFrame must be registered as a temporary view before it can be queried below; the view is named people
>>> schemaPeople.createOrReplaceTempView("people")
>>> personsDF = spark.sql("select name,age from people where age > 20")
#Each element of the DataFrame is a row with the fields name and age, accessed as p.name and p.age
>>> personsRDD=personsDF.rdd.map(lambda p:"Name: "+p.name+ ","+"Age: "+str(p.age))
>>> personsRDD.foreach(print)
Name: Michael,Age: 29
Name: Andy,Age: 30
- Defining the RDD schema programmatically
>>> from pyspark.sql.types import *
>>> from pyspark.sql import Row
#Build the "table header" (the schema)
#A schema contains the field names, field types, and whether null values are allowed
>>> schemaString = "name age"
>>> fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split(" ")]
>>> schema = StructType(fields)
#Build the "table records"
#Each record is wrapped in a Row object, and all the Row objects are kept in a single RDD
>>> lines = spark.sparkContext.\
... textFile("file:///usr/local/spark/examples/src/main/resources/people.txt")
>>> parts = lines.map(lambda x: x.split(","))
>>> people = parts.map(lambda p: Row(p[0], p[1].strip()))
#Combine the "table header" with the "table records"
>>> schemaPeople = spark.createDataFrame(people, schema)
#Register a temporary view for the query below
>>> schemaPeople.createOrReplaceTempView("people")
>>> results = spark.sql("SELECT name,age FROM people") #a standard SQL statement executed by Spark SQL
>>> results.show()
+-------+---+
| name|age|
+-------+---+
|Michael| 29|
| Andy| 30|
| Justin| 19|
+-------+---+
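#A sketch of the same example with age declared as an integer instead of a string (the names typedSchema and typedPeople are hypothetical; the value must be converted when each Row is built)
>>> typedSchema = StructType([StructField("name", StringType(), True), StructField("age", IntegerType(), True)])
>>> typedPeople = parts.map(lambda p: Row(p[0], int(p[1].strip())))
>>> spark.createDataFrame(typedPeople, typedSchema).printSchema()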
Reading and Writing Databases with Spark SQL
- Install the database (MySQL)
- Log in and create a test table
$ sudo service mysql start #this is for Ubuntu; not tested by the author of these notes
$ mysql -u root -p #you will be prompted for the password
mysql> create database spark;
mysql> use spark;
mysql> create table student (id int(4), name char(20), gender char(4), age int(4));
mysql> insert into student values(1,'Xueqian','F',23);
mysql> insert into student values(2,'Weiliang','M',24);
mysql> select * from student;
- Reading data from MySQL (requires the MySQL JDBC driver JAR; see the launch sketch after the example below)
#Options: driver = the JDBC driver class, url = the database connection URL,
#dbtable = the student table created above, user/password = the MySQL credentials
>>> jdbcDF = spark.read \
    .format("jdbc") \
    .option("driver", "com.mysql.jdbc.Driver") \
    .option("url", "jdbc:mysql://localhost:3306/spark") \
    .option("dbtable", "student") \
    .option("user", "root") \
    .option("password", "123456") \
    .load()
>>> jdbcDF.show()
+---+--------+------+---+
| id| name|gender|age|
+---+--------+------+---+
| 1| Xueqian| F| 23|
| 2|Weiliang| M| 24|
+---+--------+------+---+
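#The read above needs the MySQL JDBC driver on Spark's classpath; a minimal sketch of launching pyspark with it (the JAR path and version are assumptions)
$ pyspark --jars /usr/local/spark/jars/mysql-connector-java-5.1.40-bin.jar \
          --driver-class-path /usr/local/spark/jars/mysql-connector-java-5.1.40-bin.jar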
- Writing data to MySQL
from pyspark.sql import Row
from pyspark.sql.types import *
from pyspark import SparkContext,SparkConf
from pyspark.sql import SparkSession
spark = SparkSession.builder.config(conf = SparkConf()).getOrCreate()
#Define the schema
schema = StructType([StructField("id", IntegerType(), True), \
StructField("name", StringType(), True), \
StructField("gender", StringType(), True), \
StructField("age", IntegerType(), True)])
#Create two records, representing two students
studentRDD = spark \
.sparkContext \
.parallelize(["3 Rongcheng M 26","4 Guanhua M 27"]) \
.map(lambda x:x.split(" "))
#Create the Row objects; each Row is one row of rowRDD
rowRDD = studentRDD.map(lambda p:Row(int(p[0].strip()), p[1].strip(), p[2].strip(), int(p[3].strip())))
#Bind the Row objects to the schema, i.e. pair the data with the schema
studentDF = spark.createDataFrame(rowRDD, schema)
#Write to the database
prop = {}
prop['user'] = 'root'
prop['password'] = '123456'
prop['driver'] = "com.mysql.jdbc.Driver"
studentDF.write.jdbc("jdbc:mysql://localhost:3306/spark",'student','append', prop)
#After it runs, query the student table in MySQL
mysql> select * from student;
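#To run the write example above as a standalone script, submit it with the same JDBC driver settings (a sketch; the script name is hypothetical)
$ spark-submit --jars /usr/local/spark/jars/mysql-connector-java-5.1.40-bin.jar \
               --driver-class-path /usr/local/spark/jars/mysql-connector-java-5.1.40-bin.jar \
               insert_student.py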
Exercises
- Describe how a SQL query in Hive is translated into MapReduce jobs
- Describe the relationship between Shark and Hive, and the shortcomings of Shark
- Describe the relationship between Shark and Spark SQL
- Explain why Spark SQL was created
- What are the differences between an RDD and a DataFrame?
- What types of data can Spark SQL read and write?
- What are the two ways to convert an RDD into a DataFrame?
- What are the basic steps for defining an RDD schema programmatically?
- What preparation is needed before Spark SQL can access a MySQL database?