Preface
Previously, we covered the basics of Scala and Spark. Starting today, we move on to Spark SQL.
What Is Spark SQL
Spark SQL is Apache Spark's module for working with structured data. It provides a programming abstraction called the DataFrame and also acts as a distributed SQL query engine. Spark SQL is typically used for offline (batch) data processing.
DataFrame
DataFrames and Datasets are the core of Spark SQL; nearly all data manipulation in Spark SQL goes through the DataFrame and Dataset APIs.
Conceptually, a DataFrame can be thought of as a table in a relational database.
Basic DataFrame Operations
There are several ways to create a DataFrame. The simplest is to build one from a data source with a known format; for example, reading a JSON file directly produces a DataFrame.
# Read a JSON file; the result is a DataFrame
scala> val df3 = spark.read.json("/root/emp.json")
df3: org.apache.spark.sql.DataFrame = [comm: string, deptno: bigint ... 6 more fields]
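Besides JSON, the spark.read interface can load other common structured formats such as CSV and Parquet. A minimal sketch, assuming files like /root/emp.csv and /root/emp.parquet exist (hypothetical paths, not used elsewhere in this article):

# Read a CSV file, treating the first line as a header and inferring column types (hypothetical path)
scala> val dfCsv = spark.read.option("header", "true").option("inferSchema", "true").csv("/root/emp.csv")
# Read a Parquet file (hypothetical path)
scala> val dfParquet = spark.read.parquet("/root/emp.parquet")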
# View the data
scala> df3.show
+----+------+-----+------+----------+---------+----+----+
|comm|deptno|empno| ename| hiredate| job| mgr| sal|
+----+------+-----+------+----------+---------+----+----+
| | 20| 7369| SMITH|1980/12/17| CLERK|7902| 800|
| 300| 30| 7499| ALLEN| 1981/2/20| SALESMAN|7698|1600|
| 500| 30| 7521| WARD| 1981/2/22| SALESMAN|7698|1250|
| | 20| 7566| JONES| 1981/4/2| MANAGER|7839|2975|
|1400| 30| 7654|MARTIN| 1981/9/28| SALESMAN|7698|1250|
| | 30| 7698| BLAKE| 1981/5/1| MANAGER|7839|2850|
| | 10| 7782| CLARK| 1981/6/9| MANAGER|7839|2450|
| | 20| 7788| SCOTT| 1987/4/19| ANALYST|7566|3000|
| | 10| 7839| KING|1981/11/17|PRESIDENT| |5000|
| 0| 30| 7844|TURNER| 1981/9/8| SALESMAN|7698|1500|
| | 20| 7876| ADAMS| 1987/5/23| CLERK|7788|1100|
| | 30| 7900| JAMES| 1981/12/3| CLERK|7698| 950|
| | 20| 7902| FORD| 1981/12/3| ANALYST|7566|3000|
| | 10| 7934|MILLER| 1982/1/23| CLERK|7782|1300|
+----+------+-----+------+----------+---------+----+----+
scala>
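By default, show prints at most 20 rows and truncates long values. The Dataset API provides overloads to adjust this; a quick sketch (not run in this session):

# Show only the first 5 rows
scala> df3.show(5)
# Show rows without truncating long column values
scala> df3.show(false)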
# View the schema
scala> df3.printSchema
root
|-- comm: string (nullable = true)
|-- deptno: long (nullable = true)
|-- empno: long (nullable = true)
|-- ename: string (nullable = true)
|-- hiredate: string (nullable = true)
|-- job: string (nullable = true)
|-- mgr: string (nullable = true)
|-- sal: long (nullable = true)
scala>
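Besides printSchema, the column names and types can be inspected programmatically, which is handy in scripts; a quick sketch (not run in this session):

# Column names as an Array[String]
scala> df3.columns
# (name, type) pairs as an Array[(String, String)]
scala> df3.dtypes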
# Select employee names and salaries
scala> df3.select("ename","sal").show
+------+----+
| ename| sal|
+------+----+
| SMITH| 800|
| ALLEN|1600|
| WARD|1250|
| JONES|2975|
|MARTIN|1250|
| BLAKE|2850|
| CLARK|2450|
| SCOTT|3000|
| KING|5000|
|TURNER|1500|
| ADAMS|1100|
| JAMES| 950|
| FORD|3000|
|MILLER|1300|
+------+----+
# Show employee salaries with a flat 500 raise for everyone
scala> df3.select($"ename",$"sal"+500).show
+------+-----------+
| ename|(sal + 500)|
+------+-----------+
| SMITH| 1300|
| ALLEN| 2100|
| WARD| 1750|
| JONES| 3475|
|MARTIN| 1750|
| BLAKE| 3350|
| CLARK| 2950|
| SCOTT| 3500|
| KING| 5500|
|TURNER| 2000|
| ADAMS| 1600|
| JAMES| 1450|
| FORD| 3500|
|MILLER| 1800|
+------+-----------+
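Note that the computed column is named "(sal + 500)" by default. To give it a friendlier name, alias the expression; a quick sketch (not run in this session):

# Give the computed column an explicit name
scala> df3.select($"ename", ($"sal" + 500).alias("new_sal")).show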
# Filter for employees whose salary is greater than 2000
scala> df3.filter($"sal">2000).show
+----+------+-----+-----+----------+---------+----+----+
|comm|deptno|empno|ename| hiredate| job| mgr| sal|
+----+------+-----+-----+----------+---------+----+----+
| | 20| 7566|JONES| 1981/4/2| MANAGER|7839|2975|
| | 30| 7698|BLAKE| 1981/5/1| MANAGER|7839|2850|
| | 10| 7782|CLARK| 1981/6/9| MANAGER|7839|2450|
| | 20| 7788|SCOTT| 1987/4/19| ANALYST|7566|3000|
| | 10| 7839| KING|1981/11/17|PRESIDENT| |5000|
| | 20| 7902| FORD| 1981/12/3| ANALYST|7566|3000|
+----+------+-----+-----+----------+---------+----+----+
scala>
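filter also accepts combined Column conditions or a SQL-style expression string; the two lines below express the same compound condition (a sketch, not run in this session):

# Combine conditions on Column objects with &&
scala> df3.filter($"sal" > 2000 && $"deptno" === 20).show
# Or pass a SQL expression string
scala> df3.filter("sal > 2000 AND deptno = 20").show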
# Group and count: the number of employees in each department
scala> df3.groupBy($"deptno").count.show
+------+-----+
|deptno|count|
+------+-----+
| 10| 3|
| 30| 6|
| 20| 5|
+------+-----+
scala>
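groupBy can also be combined with the aggregate functions in org.apache.spark.sql.functions to compute more than a simple count, for example the average and maximum salary per department; a quick sketch (not run in this session):

scala> import org.apache.spark.sql.functions._
# Average and maximum salary for each department
scala> df3.groupBy($"deptno").agg(avg($"sal"), max($"sal")).show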
For more examples, see the API documentation:
http://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.sql.Dataset
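As noted at the beginning, Spark SQL also works as a distributed SQL query engine: a DataFrame can be registered as a temporary view and queried with plain SQL. A minimal sketch (the view name emp is just an example, not something created earlier in this article):

# Register the DataFrame as a temporary view, then query it with SQL
scala> df3.createOrReplaceTempView("emp")
scala> spark.sql("SELECT deptno, COUNT(*) AS cnt FROM emp GROUP BY deptno").show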