Connecting spark-shell to MySQL
- Copy hive-site.xml from hive/conf to spark/conf/
- Copy mysql-connector-java-5.1.38.jar from hive/lib to spark/jars/
- Restart spark-shell so that it picks up the new jar
- Read the MySQL tables; each read returns a DataFrame
[root@hadoop001 software]# mysql -uroot -pok
mysql> create database school;
Query OK, 1 row affected (0.00 sec)
mysql> source /software/schoolmysql50Bak.sql
[root@hadoop001 sbin]# spark-shell
Read the Student table
scala> val studentDF=spark.read.format("jdbc").options(Map("url"->"jdbc:mysql://hadoop001:3306/school","driver"-> "com.mysql.jdbc.Driver","dbtable"->"school.Student","user"->"root","password"->"ok")).load()
Read the Score table
val scoreDF=spark.read.format("jdbc").options(Map("url"->"jdbc:mysql://hadoop001:3306/school","driver"->"com.mysql.jdbc.Driver","dbtable"->"school.Score","user"->"root","password"->"ok")).load
Read the Teacher table
val teacherDF=spark.read.format("jdbc").options(Map("url"->"jdbc:mysql://hadoop001:3306/school","driver"->"com.mysql.jdbc.Driver","dbtable"->"school.Teacher","user"->"root","password"->"ok")).load
Read the Course table
val courseDF=spark.read.format("jdbc").options(Map("url"->"jdbc:mysql://hadoop001:3306/school","driver"->"com.mysql.jdbc.Driver","dbtable"->"school.Course","user"->"root","password"->"ok")).load
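The four reads above differ only in the table name, so the shared options can be factored out. The following is a minimal sketch, assuming the same host, database, and credentials as above; `jdbcOptions` and `readTable` are hypothetical names introduced here for illustration, not part of the original session.

```scala
import org.apache.spark.sql.DataFrame

// Connection settings shared by every read (values copied from the commands above).
val jdbcOptions = Map(
  "url"      -> "jdbc:mysql://hadoop001:3306/school",
  "driver"   -> "com.mysql.jdbc.Driver",
  "user"     -> "root",
  "password" -> "ok"
)

// Hypothetical helper: load one table of the school database as a DataFrame.
// Relies on the `spark` session that spark-shell provides.
def readTable(table: String): DataFrame =
  spark.read.format("jdbc").options(jdbcOptions + ("dbtable" -> s"school.$table")).load()

// Usage (needs the MySQL server from the steps above to be reachable):
// val studentDF = readTable("Student")
```

Keeping the options in one map means a changed host or password only has to be edited in one place.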
1. Find the students whose score in course "01" is higher than their score in course "02", with their details and both scores:
scala> scoreDF.as("s1").join(scoreDF.as("s2"),"s_id").filter("s1.c_id=1 and s2.c_id=2 and s1.s_score>s2.s_score").join(studentDF,"s_id").show
+----+----+-------+----+-------+------+----------+-----+
|s_id|c_id|s_score|c_id|s_score|s_name| s_birth|s_sex|
+----+----+-------+----+-------+------+----------+-----+
| 02| 01| 70| 02| 60| 钱电|1990-12-21| 男|
| 04| 01| 50| 02| 30| 李云|1990-08-06| 男|
+----+----+-------+----+-------+------+----------+-----+
2. Find the students whose score in course "01" is lower than their score in course "02", with their details and both scores:
scala> scoreDF.as("s1").join(scoreDF.as("s2"),"s_id").filter("s1.c_id=1 and s2.c_id=2 and s1.s_score<s2.s_score").join(studentDF,"s_id").show
+----+----+-------+----+-------+------+----------+-----+
|s_id|c_id|s_score|c_id|s_score|s_name| s_birth|s_sex|
+----+----+-------+----+-------+------+----------+-----+
| 01| 01| 80| 02| 90| 赵雷|1990-01-01| 男|
| 05| 01| 76| 02| 87| 周梅|1991-12-01| 女|
+----+----+-------+----+-------+------+----------+-----+
3. Find the student ID, name, and average score of every student whose average score is at least 60:
scala> scoreDF.as("s1").groupBy("s_id").avg("s_score").join(studentDF.as("s2"),"s_id").filter($"avg(s_score)">=60).show
+----+-----------------+------+----------+-----+
|s_id| avg(s_score)|s_name| s_birth|s_sex|
+----+-----------------+------+----------+-----+
| 07| 93.5| 郑竹|1989-07-01| 女|
| 01|89.66666666666667| 赵雷|1990-01-01| 男|
| 05| 81.5| 周梅|1991-12-01| 女|
| 03| 80.0| 孙风|1990-05-20| 男|
| 02| 70.0| 钱电|1990-12-21| 男|
+----+-----------------+------+----------+-----+
4. Find the student ID, name, and average score of every student whose average score is below 60 (including students with no scores at all):
scala> studentDF.as("s2")
.join((scoreDF.as("s1").groupBy("s_id").avg("s_score"))
.as("s3"),Seq("s_id"),"left_outer").as("s")
.withColumnRenamed("avg(s_score)","A").where((col("A")<60)||(col("A").isNull)).show
+----+------+----------+-----+------------------+
|s_id|s_name| s_birth|s_sex| A|
+----+------+----------+-----+------------------+
| 08| 王菊|1990-01-20| 女| null|
| 06| 吴兰|1992-03-01| 女| 32.5|
| 04| 李云|1990-08-06| 男|33.333333333333336|
+----+------+----------+-----+------------------+
5. Find every student's ID, name, number of courses taken, and total score across all courses:
scala> studentDF.join(scoreDF.groupBy("s_id").count,
Seq("s_id"),"left_outer").join(scoreDF.groupBy("s_id").sum("s_score"),Seq("s_id"),"left_outer").show
+----+------+----------+-----+-----+------------+
|s_id|s_name| s_birth|s_sex|count|sum(s_score)|
+----+------+----------+-----+-----+------------+
| 07| 郑竹|1989-07-01| 女| 2| 187|
| 01| 赵雷|1990-01-01| 男| 3| 269|
| 05| 周梅|1991-12-01| 女| 2| 163|
| 08| 王菊|1990-01-20| 女| null| null|
| 03| 孙风|1990-05-20| 男| 3| 240|
| 02| 钱电|1990-12-21| 男| 3| 210|
| 06| 吴兰|1992-03-01| 女| 2| 65|
| 04| 李云|1990-08-06| 男| 3| 100|
+----+------+----------+-----+-----+------------+
6. Count the teachers whose surname is "李":
scala> teacherDF.where("t_name like '李%'").select("t_id").count
res5: Long = 1
7. Find the students who have taken a course taught by teacher "张三":
scoreDF.join(courseDF,"c_id").join(teacherDF,"t_id").filter("t_name='张三'").join(studentDF,"s_id").show
8. Find the students who have not taken any course taught by teacher "张三":
scala> studentDF.join(scoreDF.join(courseDF,"c_id").join(teacherDF,"t_id").where("t_name='张三'").select("s_id").distinct,Seq("s_id"),"left_anti").orderBy("s_id").show
+----+------+----------+-----+
|s_id|s_name| s_birth|s_sex|
+----+------+----------+-----+
| 06| 吴兰|1992-03-01| 女|
| 08| 王菊|1990-01-20| 女|
+----+------+----------+-----+
9. Find the students who have taken both course "01" and course "02":
scala> studentDF.join(scoreDF.filter("c_id=1"),"s_id").join(scoreDF.filter("c_id=2"),"s_id").show
+----+------+----------+-----+----+-------+----+-------+
|s_id|s_name| s_birth|s_sex|c_id|s_score|c_id|s_score|
+----+------+----------+-----+----+-------+----+-------+
| 01| 赵雷|1990-01-01| 男| 01| 80| 02| 90|
| 05| 周梅|1991-12-01| 女| 01| 76| 02| 87|
| 03| 孙风|1990-05-20| 男| 01| 80| 02| 80|
| 02| 钱电|1990-12-21| 男| 01| 70| 02| 60|
| 04| 李云|1990-08-06| 男| 01| 50| 02| 30|
+----+------+----------+-----+----+-------+----+-------+
10. Find the students who have taken course "01" but not course "02":
scala> studentDF.join(scoreDF.where("c_id=2"),Seq("s_id"),"left_outer").as("s2").where("s2.c_id is null").join(scoreDF.where("c_id=1"),"s_id").show
+----+------+----------+-----+----+-------+----+-------+
|s_id|s_name| s_birth|s_sex|c_id|s_score|c_id|s_score|
+----+------+----------+-----+----+-------+----+-------+
| 06| 吴兰|1992-03-01| 女|null| null| 01| 31|
+----+------+----------+-----+----+-------+----+-------+
11. Find the students who have not taken all of the courses:
scala> studentDF.join(scoreDF.groupBy("s_id").count.as("s1"),Seq("s_id"),"left_outer").where("s1.count<3 or s1.count is null").show
+----+------+----------+-----+-----+
|s_id|s_name| s_birth|s_sex|count|
+----+------+----------+-----+-----+
| 07| 郑竹|1989-07-01| 女| 2|
| 05| 周梅|1991-12-01| 女| 2|
| 08| 王菊|1990-01-20| 女| null|
| 06| 吴兰|1992-03-01| 女| 2|
+----+------+----------+-----+-----+
12. Find the students who share at least one course with student "01":
scala> studentDF.join(scoreDF,"s_id").as("a").join(scoreDF.select("c_id").where("s_id=1").as("b"),"c_id").as("c").select("s_id").distinct.where("s_id!=1").join(studentDF,"s_id").show
+----+------+----------+-----+
|s_id|s_name| s_birth|s_sex|
+----+------+----------+-----+
| 07| 郑竹|1989-07-01| 女|
| 05| 周梅|1991-12-01| 女|
| 03| 孙风|1990-05-20| 男|
| 02| 钱电|1990-12-21| 男|
| 06| 吴兰|1992-03-01| 女|
| 04| 李云|1990-08-06| 男|
+----+------+----------+-----+
13. Find the other students who take exactly the same courses as student "01":
scala> studentDF.join(scoreDF,"s_id").as("a").join(scoreDF.where("s_id=1").as("b"),"c_id").groupBy("a.s_id").count.where(s"count=${scoreDF.where("s_id=1").count} and a.s_id!=1").join(studentDF,"s_id").show
+----+-----+------+----------+-----+
|s_id|count|s_name| s_birth|s_sex|
+----+-----+------+----------+-----+
| 03| 3| 孙风|1990-05-20| 男|
| 02| 3| 钱电|1990-12-21| 男|
| 04| 3| 李云|1990-08-06| 男|
+----+-----+------+----------+-----+
14. Find the names of the students who have not taken any course taught by teacher "张三":
scala> studentDF.join(scoreDF,"s_id").join(courseDF,"c_id").join(teacherDF.where("t_name='张三'"),"t_id").as("a").select("s_id").join(studentDF.as("b"),Seq("s_id"),"right_outer").where("a.s_id is null").select("s_name").show
+------+
|s_name|
+------+
| 王菊|
| 吴兰|
+------+
15. Find the student ID, name, and average score of the students who failed two or more courses:
scala> scoreDF.where("s_score<60").groupBy("s_id").count.where("count>=2").join(scoreDF,"s_id").groupBy("s_id").avg("s_score").join(studentDF,"s_id").show
+----+------------------+------+----------+-----+
|s_id| avg(s_score)|s_name| s_birth|s_sex|
+----+------------------+------+----------+-----+
| 06| 32.5| 吴兰|1992-03-01| 女|
| 04|33.333333333333336| 李云|1990-08-06| 男|
+----+------------------+------+----------+-----+
16. Retrieve the students whose score in course "01" is below 60, sorted by score in descending order:
scala> scoreDF.where("c_id=1").join(studentDF,Seq("s_id"),"right_outer").where("s_score<60 or s_score is null").orderBy($"s_score".desc).show
+----+----+-------+------+----------+-----+
|s_id|c_id|s_score|s_name| s_birth|s_sex|
+----+----+-------+------+----------+-----+
| 04| 01| 50| 李云|1990-08-06| 男|
| 06| 01| 31| 吴兰|1992-03-01| 女|
| 07|null| null| 郑竹|1989-07-01| 女|
| 08|null| null| 王菊|1990-01-20| 女|
+----+----+-------+------+----------+-----+
17. Show every student's score in each course together with the average score, ordered by average score from high to low:
scala> studentDF.join(scoreDF,Seq("s_id"),"left_outer").groupBy("s_id").avg("s_score").join(studentDF.join(scoreDF,"s_id"),Seq("s_id"),"left_outer").orderBy($"avg(s_score)".desc).show
+----+------------------+------+----------+-----+----+-------+
|s_id| avg(s_score)|s_name| s_birth|s_sex|c_id|s_score|
+----+------------------+------+----------+-----+----+-------+
| 07| 93.5| 郑竹|1989-07-01| 女| 02| 89|
| 07| 93.5| 郑竹|1989-07-01| 女| 03| 98|
| 01| 89.66666666666667| 赵雷|1990-01-01| 男| 02| 90|
| 01| 89.66666666666667| 赵雷|1990-01-01| 男| 03| 99|
| 01| 89.66666666666667| 赵雷|1990-01-01| 男| 01| 80|
| 05| 81.5| 周梅|1991-12-01| 女| 01| 76|
| 05| 81.5| 周梅|1991-12-01| 女| 02| 87|
| 03| 80.0| 孙风|1990-05-20| 男| 02| 80|
| 03| 80.0| 孙风|1990-05-20| 男| 01| 80|
| 03| 80.0| 孙风|1990-05-20| 男| 03| 80|
| 02| 70.0| 钱电|1990-12-21| 男| 02| 60|
| 02| 70.0| 钱电|1990-12-21| 男| 01| 70|
| 02| 70.0| 钱电|1990-12-21| 男| 03| 80|
| 04|33.333333333333336| 李云|1990-08-06| 男| 01| 50|
| 04|33.333333333333336| 李云|1990-08-06| 男| 02| 30|
| 04|33.333333333333336| 李云|1990-08-06| 男| 03| 20|
| 06| 32.5| 吴兰|1992-03-01| 女| 01| 31|
| 06| 32.5| 吴兰|1992-03-01| 女| 03| 34|
| 08| null| null| null| null|null| null|
+----+------------------+------+----------+-----+----+-------+
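18. For each course, find the highest, lowest, and average score, displayed as: course ID, course name, highest score, lowest score, average score, pass rate, medium rate, good rate, excellent rate: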
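No query is given for this exercise. One possible approach is conditional aggregation: average a 0/1 flag per course to get each rate. The sketch below is self-contained and uses the score rows that appear in the outputs elsewhere in this post. The pass mark of 60 follows the other exercises; the bands for medium (70 to 79), good (80 to 89), and excellent (90 and above) are assumptions, as are the names `sampleScores`, `rate`, and `courseStats`.

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import spark.implicits._

// Score rows (s_id, c_id, s_score) as they appear in the query outputs of this post.
val sampleScores = Seq(
  ("01", "01", 80), ("01", "02", 90), ("01", "03", 99),
  ("02", "01", 70), ("02", "02", 60), ("02", "03", 80),
  ("03", "01", 80), ("03", "02", 80), ("03", "03", 80),
  ("04", "01", 50), ("04", "02", 30), ("04", "03", 20),
  ("05", "01", 76), ("05", "02", 87),
  ("06", "01", 31), ("06", "03", 34),
  ("07", "02", 89), ("07", "03", 98)
).toDF("s_id", "c_id", "s_score")

// Hypothetical helper: fraction of rows in a group that satisfy a condition.
def rate(cond: Column): Column = avg(when(cond, 1.0).otherwise(0.0))

val courseStats = sampleScores.groupBy("c_id").agg(
  max("s_score").as("max_score"),
  min("s_score").as("min_score"),
  avg("s_score").as("avg_score"),
  rate($"s_score" >= 60).as("pass_rate"),                       // pass mark from the other exercises
  rate($"s_score" >= 70 && $"s_score" < 80).as("medium_rate"),  // assumed band
  rate($"s_score" >= 80 && $"s_score" < 90).as("good_rate"),    // assumed band
  rate($"s_score" >= 90).as("excellent_rate")                   // assumed band
).orderBy("c_id")

courseStats.show
```

With the tables loaded from MySQL, `scoreDF` can take the place of `sampleScores`, and joining the result with `courseDF` on `c_id` adds the course name.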
19. Sort the scores within each course and show the rank:
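No query is given for this exercise. A window partitioned by course is one way to do it. The sketch below is self-contained and uses the score rows that appear in the outputs elsewhere in this post; `sampleScores`, `byCourse`, `withRank`, and `ranked` are names introduced here for illustration. It shows `rank` next to `dense_rank` because several scores are tied, and the two functions number ties differently.

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

// Score rows (s_id, c_id, s_score) as they appear in the query outputs of this post.
val sampleScores = Seq(
  ("01", "01", 80), ("01", "02", 90), ("01", "03", 99),
  ("02", "01", 70), ("02", "02", 60), ("02", "03", 80),
  ("03", "01", 80), ("03", "02", 80), ("03", "03", 80),
  ("04", "01", 50), ("04", "02", 30), ("04", "03", 20),
  ("05", "01", 76), ("05", "02", 87),
  ("06", "01", 31), ("06", "03", 34),
  ("07", "02", 89), ("07", "03", 98)
).toDF("s_id", "c_id", "s_score")

// One window per course, highest score first.
val byCourse = Window.partitionBy("c_id").orderBy(col("s_score").desc)

// rank leaves a gap after ties (1, 1, 3); dense_rank does not (1, 1, 2).
val withRank = sampleScores.withColumn("rank", rank().over(byCourse))
val ranked = withRank.withColumn("dense_rank", dense_rank().over(byCourse)).orderBy("c_id", "rank", "s_id")

ranked.show
```

With the tables loaded from MySQL, `scoreDF` can take the place of `sampleScores`.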
20. Find each student's total score and rank the students by it:
scala> studentDF.join(scoreDF,"s_id").groupBy("s_id").sum("s_score").orderBy($"sum(s_score)".desc).show
+----+------------+
|s_id|sum(s_score)|
+----+------------+
| 01| 269|
| 03| 240|
| 02| 210|
| 07| 187|
| 05| 163|
| 04| 100|
| 06| 65|
+----+------------+
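The query above sorts the totals but does not show a rank column. One way to add it is a window ordered by the total. The sketch below is self-contained: it starts from the totals printed above, and the names `totals`, `byTotal`, and `rankedTotals` are introduced here for illustration.

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

// Totals copied from the output above.
val totals = Seq(
  ("01", 269L), ("03", 240L), ("02", 210L), ("07", 187L),
  ("05", 163L), ("04", 100L), ("06", 65L)
).toDF("s_id", "total_score")

// No partitionBy: all rows are ranked together, so Spark moves them into a single partition.
val byTotal = Window.orderBy(col("total_score").desc)
val rankedTotals = totals.withColumn("rank", rank().over(byTotal))

rankedTotals.show
```

With the tables loaded from MySQL, `scoreDF.groupBy("s_id").agg(sum("s_score").as("total_score"))` produces a DataFrame of the same shape as `totals`.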
21. Show the average score of each course taught by each teacher, from high to low:
scala> scoreDF.join(courseDF,"c_id").join(teacherDF,"t_id").groupBy("t_id","c_id").avg("s_score").orderBy($"avg(s_score)".desc).show
+----+----+-----------------+
|t_id|c_id| avg(s_score)|
+----+----+-----------------+
| 01| 02|72.66666666666667|
| 03| 03| 68.5|
| 02| 01| 64.5|
+----+----+-----------------+
22. Find the students ranked 2nd to 3rd in every course, with their score in that course:
scala> scoreDF.selectExpr("*","row_number() over(partition by c_id order by s_score desc) rank").where("rank between 2 and 3").join(studentDF,"s_id").show
+----+----+-------+----+------+----------+-----+
|s_id|c_id|s_score|rank|s_name| s_birth|s_sex|
+----+----+-------+----+------+----------+-----+
| 07| 03| 98| 2| 郑竹|1989-07-01| 女|
| 07| 02| 89| 2| 郑竹|1989-07-01| 女|
| 05| 01| 76| 3| 周梅|1991-12-01| 女|
| 05| 02| 87| 3| 周梅|1991-12-01| 女|
| 03| 01| 80| 2| 孙风|1990-05-20| 男|
| 02| 03| 80| 3| 钱电|1990-12-21| 男|
+----+----+-------+----+------+----------+-----+
23. Count the students in each score band for every course: course ID, course name, [100-85], [85-70], [70-60], [0-60], and the percentage each band represents:
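No query is given for this exercise. One possible approach counts the rows of each band with a conditional sum and then divides by the course total. The sketch below is self-contained and uses the score rows that appear in the outputs elsewhere in this post. The exercise does not say which band a boundary score belongs to, so the sketch assumes 85 and above, 70 to 84, 60 to 69, and below 60. The names `sampleScores`, `countIf`, `bands`, and `bandsWithPct` are introduced here for illustration.

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import spark.implicits._

// Score rows (s_id, c_id, s_score) as they appear in the query outputs of this post.
val sampleScores = Seq(
  ("01", "01", 80), ("01", "02", 90), ("01", "03", 99),
  ("02", "01", 70), ("02", "02", 60), ("02", "03", 80),
  ("03", "01", 80), ("03", "02", 80), ("03", "03", 80),
  ("04", "01", 50), ("04", "02", 30), ("04", "03", 20),
  ("05", "01", 76), ("05", "02", 87),
  ("06", "01", 31), ("06", "03", 34),
  ("07", "02", 89), ("07", "03", 98)
).toDF("s_id", "c_id", "s_score")

// Hypothetical helper: number of rows in a group that satisfy a condition.
def countIf(cond: Column): Column = sum(when(cond, 1).otherwise(0))

// Band boundaries are an assumption; each score falls into exactly one band.
val bands = sampleScores.groupBy("c_id").agg(
  count(lit(1)).as("total"),
  countIf($"s_score" >= 85).as("n_85_100"),
  countIf($"s_score" >= 70 && $"s_score" < 85).as("n_70_85"),
  countIf($"s_score" >= 60 && $"s_score" < 70).as("n_60_70"),
  countIf($"s_score" < 60).as("n_0_60")
)

// Add one percentage column per band, rounded to two decimals.
val withPct = Seq("n_85_100", "n_70_85", "n_60_70", "n_0_60").foldLeft(bands) { (df, c) =>
  df.withColumn(c.replace("n_", "pct_"), round(col(c) * 100 / col("total"), 2))
}
val bandsWithPct = withPct.orderBy("c_id")

bandsWithPct.show
```

With the tables loaded from MySQL, `scoreDF` can take the place of `sampleScores`, and joining the result with `courseDF` on `c_id` adds the course name.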
24. Find each student's average score and its rank:
scala> scoreDF.groupBy("s_id").avg("s_score").selectExpr("*","row_number() over(order by `avg(s_score)` desc) as rank").show
+----+------------------+----+
|s_id| avg(s_score)|rank|
+----+------------------+----+
| 07| 93.5| 1|
| 01| 89.66666666666667| 2|
| 05| 81.5| 3|
| 03| 80.0| 4|
| 02| 70.0| 5|
| 04|33.333333333333336| 6|
| 06| 32.5| 7|
+----+------------------+----+
25. Find the top three records in each course:
scala> scoreDF.selectExpr("*","row_number() over(partition by c_id order by s_score desc) rank").where("rank<=3").show
+----+----+-------+----+
|s_id|c_id|s_score|rank|
+----+----+-------+----+
| 01| 01| 80| 1|
| 03| 01| 80| 2|
| 05| 01| 76| 3|
| 01| 03| 99| 1|
| 07| 03| 98| 2|
| 02| 03| 80| 3|
| 01| 02| 90| 1|
| 07| 02| 89| 2|
| 05| 02| 87| 3|
+----+----+-------+----+
26. Find the number of students enrolled in each course:
scala> scoreDF.groupBy("c_id").count.show
+----+-----+
|c_id|count|
+----+-----+
| 01| 6|
| 03| 6|
| 02| 6|
+----+-----+
27. Find the student ID and name of every student who takes exactly two courses:
scala> scoreDF.groupBy("s_id").count.where("count=2").join(studentDF,"s_id").show
+----+-----+------+----------+-----+
|s_id|count|s_name| s_birth|s_sex|
+----+-----+------+----------+-----+
| 07| 2| 郑竹|1989-07-01| 女|
| 05| 2| 周梅|1991-12-01| 女|
| 06| 2| 吴兰|1992-03-01| 女|
+----+-----+------+----------+-----+
28. Count the male and female students:
studentDF.groupBy("s_sex").count.show
+-----+-----+
|s_sex|count|
+-----+-----+
| 男| 4|
| 女| 4|
+-----+-----+
29. Find the students whose name contains the character "风":
studentDF.where("s_name like '%风%'").show
+----+------+----------+-----+
|s_id|s_name| s_birth|s_sex|
+----+------+----------+-----+
| 03| 孙风|1990-05-20| 男|
+----+------+----------+-----+
30. List the students who share both name and sex, and count how many share each name:
studentDF.groupBy("s_name","s_sex").count.where("count>1").show
+------+-----+-----+
|s_name|s_sex|count|
+------+-----+-----+
+------+-----+-----+
31. List the students born in 1990:
studentDF.where("year(s_birth)=1990").show
+----+------+----------+-----+
|s_id|s_name| s_birth|s_sex|
+----+------+----------+-----+
| 01| 赵雷|1990-01-01| 男|
| 02| 钱电|1990-12-21| 男|
| 03| 孙风|1990-05-20| 男|
| 04| 李云|1990-08-06| 男|
| 08| 王菊|1990-01-20| 女|
+----+------+----------+-----+
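32. Find the average score of each course, sorted by average score in descending order and, for equal averages, by course ID in ascending order:
scoreDF.groupBy("c_id").avg("s_score").orderBy($"avg(s_score)".desc,$"c_id".asc).show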