SparkSQL entry points
● Before Spark 2.0
SQLContext was the entry point for creating DataFrames and executing SQL. HiveContext, which extends SQLContext, operated on Hive table data with Hive SQL and provided Hive compatibility.
● Since Spark 2.0
Both are unified in SparkSession, which wraps SQLContext and HiveContext and implements all of their functionality. A SparkSession also exposes the underlying SparkContext. Launching the spark-shell command-line client shows a ready-made SparkSession bound to the variable spark:
/export/servers/spark/bin/spark-shell
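Outside the shell, a SparkSession has to be created explicitly. A minimal sketch using the builder API (the app name and local master URL are illustrative placeholders, not from the original notes):

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: create a SparkSession in a standalone application.
// App name and master URL are illustrative placeholders.
val spark = SparkSession.builder()
  .appName("SparkSQLDemo")
  .master("local[*]")
  // .enableHiveSupport() // uncomment for HiveContext-style functionality (needs spark-hive)
  .getOrCreate()

// The underlying SparkContext is reachable through the session
val sc = spark.sparkContext
```

In spark-shell this is done for you: the variables spark and sc are pre-created.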
Getting/creating a DataFrame/Dataset
1. Prepare a sample data file:
vim /root/person.txt
1 zhangsan 20
2 lisi 29
3 wangwu 25
4 zhaoliu 30
5 tianqi 35
6 kobe 40
hadoop fs -put /root/person.txt /
2. In spark-shell, read the data with:
val lineRDD = sc.textFile("hdfs://node01:8020/person.txt").map(_.split(" ")) //RDD[Array[String]]
3. Define a case class (it serves as the table schema) and convert the RDD:
case class Person(id:Int, name:String, age:Int)
val personRDD = lineRDD.map(x => Person(x(0).toInt, x(1), x(2).toInt)) //RDD[Person]
val personDF = personRDD.toDF
//val personDS = personRDD.toDS
personDF.show
personDF.printSchema
personDF.createOrReplaceTempView("t_person")
spark.sql("select id,name from t_person where id > 3").show
Note: a DataFrame can also be built directly through SparkSession, but it then has no complete schema:
val dataFrame = spark.read.text("hdfs://node01:8020/person.txt")
dataFrame.show
dataFrame.printSchema
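The single value column produced by spark.read.text can still be turned into typed columns by splitting and casting. A sketch, substituting an in-memory sequence for the HDFS file:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, split}

val spark = SparkSession.builder().appName("SchemaDemo").master("local[*]").getOrCreate()
import spark.implicits._

// Stand-in for spark.read.text(...): a single string column named "value"
val lines = Seq("1 zhangsan 20", "2 lisi 29").toDF("value")

// Split each line and cast the pieces to rebuild a proper schema
val parts = split(col("value"), " ")
val typedDF = lines.select(
  parts.getItem(0).cast("int").as("id"),
  parts.getItem(1).as("name"),
  parts.getItem(2).cast("int").as("age"))

typedDF.printSchema // id: int, name: string, age: int
```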
DSL - domain-specific query language
1. Select the name column:
personDF.select(personDF.col("name")).show
personDF.select(personDF("name")).show
personDF.select(col("name")).show
personDF.select("name").show
2. Select the name and age columns:
personDF.select(personDF.col("name"), personDF.col("age")).show
personDF.select("name", "age").show
3. Select name and age, adding 1 to age:
personDF.select(personDF.col("name"), personDF.col("age") + 1).show
personDF.select(personDF("name"), personDF("age") + 1).show
personDF.select(col("name"), col("age") + 1).show
// personDF.select("name", "age" + 1).show // string-style select cannot express age + 1
personDF.select($"name", $"age", $"age" + 1).show
4. Filter rows with age >= 25 using the filter method:
personDF.filter(personDF.col("age") >= 25).show
personDF.filter($"age" >= 25).show
5. Count the people older than 30:
personDF.filter(col("age") > 30).count()
personDF.filter($"age" > 30).count()
6. Group by age and count the people of each age:
personDF.groupBy("age").count().show
SQL - structured query language
Register the DataFrame as a temporary view first, then query it with spark.sql:
personDF.createOrReplaceTempView("t_person")
spark.sql("select * from t_person").show
1. Show the table description:
spark.sql("desc t_person").show
2. Find the two oldest people:
spark.sql("select * from t_person order by age desc limit 2").show
3. Find the people older than 30:
spark.sql("select * from t_person where age > 30").show
4. Express the DSL queries above in SQL:
spark.sql("select name, age + 1 from t_person").show
spark.sql("select name, age from t_person where age >= 25").show
spark.sql("select count(age) from t_person where age > 30").show
spark.sql("select age, count(age) from t_person group by age").show