MySQL database
SELECT COUNT(*) FROM t_person;
Result: a single aggregated count row.
Hive
Example: student 1001, grade 01, score 99 -- the requirement is to display the average score alongside the original row data.
Hive JDBC access
Start the hiveserver2 service:
[pp@hadoop102 hive]$ bin/hiveserver2
Start beeline:
[pp@hadoop102 hive]$ bin/beeline
Beeline version 1.2.1 by Apache Hive
beeline>
Connect to hiveserver2:
beeline> !connect jdbc:hive2://hadoop102:10000  (press Enter)
Connecting to jdbc:hive2://hadoop102:10000
Enter username for jdbc:hive2://hadoop102:10000: atguigu  (press Enter)
Enter password for jdbc:hive2://hadoop102:10000:  (press Enter, empty password)
Connected to: Apache Hive (version 1.2.1)
Driver: Hive JDBC (version 1.2.1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://hadoop102:10000> show databases;
+----------------+--+
| database_name |
+----------------+--+
| default |
| hive_db2 |
+----------------+--+
Data preparation: name, orderdate, cost
Create a local business.txt and fill in the data:
[pp@hadoop102 datas]$ vi business.txt
Create the Hive table and load the data:
create table business(
name string,
orderdate string,
cost int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
load data local inpath "/opt/module/datas/business.txt" into table business;
[pp@hadoop101 datas]$ cat business.txt
jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94
0: jdbc:hive2://hadoop101:10000> use my_db;
No rows affected (0.036 seconds)
0: jdbc:hive2://hadoop101:10000> show tables;
+---------------+--+
| tab_name |
+---------------+--+
| business |
| dept |
| dept_par |
| dept_par2 |
| emp |
| emp_sex |
| location |
| movie_info |
| person_info |
| score |
| stu_external |
| student |
+---------------+--+
Query the business table (e.g. on which dates jack purchased, and the total amounts):
0: jdbc:hive2://hadoop101:10000> select * from business;
+----------------+---------------------+----------------+--+
| business.name | business.orderdate | business.cost |
+----------------+---------------------+----------------+--+
| jack | 2017-01-01 | 10 |
| tony | 2017-01-02 | 15 |
| jack | 2017-02-03 | 23 |
| tony | 2017-01-04 | 29 |
| jack | 2017-01-05 | 46 |
| jack | 2017-04-06 | 42 |
| tony | 2017-01-07 | 50 |
| jack | 2017-01-08 | 55 |
| mart | 2017-04-08 | 62 |
| mart | 2017-04-09 | 68 |
| neil | 2017-05-10 | 12 |
| mart | 2017-04-11 | 75 |
| neil | 2017-06-12 | 80 |
| mart | 2017-04-13 | 94 |
+----------------+---------------------+----------------+--+
Requirements: query the data as specified below.
(1) Find the customers who made purchases in April 2017, together with the total number of such customers.
Expected result:
jack 2
mart 2
How to implement this? Step 1 -- find the distinct April customers (note that Hive queries run as MR jobs):
select name
from business
where orderdate like '2017-04%'
group by name;
Result:
+-------+--+
| name |
+-------+--+
| jack |
| mart |
+-------+--+
Before using a window function:
select name,count(*)
from business
where orderdate like '2017-04%'
group by name;
This result (the per-customer purchase count) does not meet the requirement; this is where a window function is needed.
+-------+------+--+
| name | _c1 |
+-------+------+--+
| jack | 1 |
| mart | 4 |
+-------+------+--+
over() marks an aggregate as a window function. over() determines the range of rows the aggregate operates on; the default window is the entire result set, and the aggregate -- here count(*) -- is evaluated once for every row.
Final SQL:
select name, count(*) over()
from business
where orderdate like '2017-04%'
group by name;
Result:
+-------+-----------------+--+
| name | count_window_0 |
+-------+-----------------+--+
| mart | 2 |
| jack | 2 |
+-------+-----------------+--+
Execution logic: the rows are first grouped by name; over() then makes the whole grouped result set the aggregation window, and the aggregate is evaluated once for every row of that window.
Summary
over(): used together with an aggregate function; it determines the aggregation range for each row. The default range is the entire result set.
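The behavior above can be reproduced outside Hive. Below is a minimal sketch in Python against an in-memory SQLite database (window functions need SQLite >= 3.25); SQLite is only a stand-in here, and the table and data mirror the business table above.

```python
import sqlite3

rows = [
    ("jack", "2017-01-01", 10), ("tony", "2017-01-02", 15),
    ("jack", "2017-02-03", 23), ("tony", "2017-01-04", 29),
    ("jack", "2017-01-05", 46), ("jack", "2017-04-06", 42),
    ("tony", "2017-01-07", 50), ("jack", "2017-01-08", 55),
    ("mart", "2017-04-08", 62), ("mart", "2017-04-09", 68),
    ("neil", "2017-05-10", 12), ("mart", "2017-04-11", 75),
    ("neil", "2017-06-12", 80), ("mart", "2017-04-13", 94),
]
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE business (name TEXT, orderdate TEXT, cost INT)")
con.executemany("INSERT INTO business VALUES (?, ?, ?)", rows)

# COUNT(*) OVER () is evaluated after GROUP BY, so every surviving
# group row sees the total number of groups (2), not its own row count.
result = con.execute("""
    SELECT name, COUNT(*) OVER ()
    FROM business
    WHERE orderdate LIKE '2017-04%'
    GROUP BY name
""").fetchall()
print(sorted(result))  # [('jack', 2), ('mart', 2)]
```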
(2) Query each customer's purchase details together with the monthly purchase total. Original data:
+----------------+---------------------+----------------+--+
| business.name | business.orderdate | business.cost |
+----------------+---------------------+----------------+--+
| jack | 2017-01-01 | 10 |
| tony | 2017-01-02 | 15 |
| jack | 2017-02-03 | 23 |
| tony | 2017-01-04 | 29 |
| jack | 2017-01-05 | 46 |
| jack | 2017-04-06 | 42 |
| tony | 2017-01-07 | 50 |
| jack | 2017-01-08 | 55 |
| mart | 2017-04-08 | 62 |
| mart | 2017-04-09 | 68 |
| neil | 2017-05-10 | 12 |
| mart | 2017-04-11 | 75 |
| neil | 2017-06-12 | 80 |
| mart | 2017-04-13 | 94 |
+----------------+---------------------+----------------+--+
Requirement analysis: this time the aggregation range has changed -- aggregation is per month. How can this be implemented? An aggregate function is still needed.
over() by itself aggregates over the entire result set; e.g. select *, sum(cost) over() yields the grand total, which is not what we want here.
partition by splits the rows into partitions by the given expression, and aggregation is then performed within each partition.
Check the usage of the month function:
0: jdbc:hive2://hadoop101:10000> desc function extended month;
+----------------------------------------------------------------------------+--+
| tab_name |
+----------------------------------------------------------------------------+--+
| month(param) - Returns the month component of the date/timestamp/interval |
| param can be one of: |
| 1. A string in the format of 'yyyy-MM-dd HH:mm:ss' or 'yyyy-MM-dd'. |
| 2. A date value |
| 3. A timestamp value |
| 4. A year-month interval value                                             |
| Example:                                                                   |
| > SELECT month('2009-07-30') FROM src LIMIT 1; |
| 7 |
+----------------------------------------------------------------------------+--+
eg:
0: jdbc:hive2://hadoop101:10000> select month('2017-01-01');
+------+--+
| _c0 |
+------+--+
| 1 |
+------+--+
Final SQL -- partition by month:
select *, sum(cost) over(partition by month(orderdate))
from business;
Final result:
+----------------+---------------------+----------------+---------------+--+
| business.name | business.orderdate | business.cost | sum_window_0 |
+----------------+---------------------+----------------+---------------+--+
| jack | 2017-01-01 | 10 | 205 |
| jack | 2017-01-08 | 55 | 205 |
| tony | 2017-01-07 | 50 | 205 |
| jack | 2017-01-05 | 46 | 205 |
| tony | 2017-01-04 | 29 | 205 |
| tony | 2017-01-02 | 15 | 205 |
| jack | 2017-02-03 | 23 | 23 |
| mart | 2017-04-13 | 94 | 341 |
| jack | 2017-04-06 | 42 | 341 |
| mart | 2017-04-11 | 75 | 341 |
| mart | 2017-04-09 | 68 | 341 |
| mart | 2017-04-08 | 62 | 341 |
| neil | 2017-05-10 | 12 | 12 |
| neil | 2017-06-12 | 80 | 80 |
+----------------+---------------------+----------------+---------------+--+
SQL without partition by:
select *, sum(cost) over()
from business;
Result without partition by:
+----------------+---------------------+----------------+---------------+--+
| business.name | business.orderdate | business.cost | sum_window_0 |
+----------------+---------------------+----------------+---------------+--+
| mart | 2017-04-13 | 94 | 661 |
| neil | 2017-06-12 | 80 | 661 |
| mart | 2017-04-11 | 75 | 661 |
| neil | 2017-05-10 | 12 | 661 |
| mart | 2017-04-09 | 68 | 661 |
| mart | 2017-04-08 | 62 | 661 |
| jack | 2017-01-08 | 55 | 661 |
| tony | 2017-01-07 | 50 | 661 |
| jack | 2017-04-06 | 42 | 661 |
| jack | 2017-01-05 | 46 | 661 |
| tony | 2017-01-04 | 29 | 661 |
| jack | 2017-02-03 | 23 | 661 |
| tony | 2017-01-02 | 15 | 661 |
| jack | 2017-01-01 | 10 | 661 |
+----------------+---------------------+----------------+---------------+--+
Testing other statements:
0: jdbc:hive2://hadoop101:10000> select *, sum(cost) from business;
Error: Error while compiling statement: FAILED: SemanticException [Error 10025]: Expression not in GROUP BY key name (state=42000,code=10025)
0: jdbc:hive2://hadoop101:10000> select sum(cost) from business;
Result:
+------+--+
| _c0 |
+------+--+
| 661 |
+------+--+
Conclusion:
partition by: splits the data into partitions, then aggregates within each partition.
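The same partitioned aggregation can be sketched in Python with an in-memory SQLite database (SQLite >= 3.25, as a stand-in for Hive). SQLite has no month() function, so substr(orderdate, 6, 2) substitutes for Hive's month(orderdate).

```python
import sqlite3

rows = [
    ("jack", "2017-01-01", 10), ("tony", "2017-01-02", 15),
    ("jack", "2017-02-03", 23), ("tony", "2017-01-04", 29),
    ("jack", "2017-01-05", 46), ("jack", "2017-04-06", 42),
    ("tony", "2017-01-07", 50), ("jack", "2017-01-08", 55),
    ("mart", "2017-04-08", 62), ("mart", "2017-04-09", 68),
    ("neil", "2017-05-10", 12), ("mart", "2017-04-11", 75),
    ("neil", "2017-06-12", 80), ("mart", "2017-04-13", 94),
]
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE business (name TEXT, orderdate TEXT, cost INT)")
con.executemany("INSERT INTO business VALUES (?, ?, ?)", rows)

# PARTITION BY splits the rows into per-month partitions; SUM(cost) is
# computed within each partition and attached to every row of it.
result = con.execute("""
    SELECT name, orderdate, cost,
           SUM(cost) OVER (PARTITION BY substr(orderdate, 6, 2)) AS month_total
    FROM business
""").fetchall()
# every January row carries 205, every April row 341
```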
(3) Same scenario as above, but now accumulate each customer's cost by date (a running total).
Analysis: the aggregation range changes dynamically -- the first row aggregates one row, the second row two rows, the third row three, and so on.
The desired result looks like this (jack and tony rows shown):
+----------------+---------------------+----------------+------+
| business.name  | business.orderdate  | business.cost  | sum  |
+----------------+---------------------+----------------+------+
| jack           | 2017-01-01          | 10             | 10   |
| jack           | 2017-01-05          | 46             | 56   |
| jack           | 2017-01-08          | 55             | 111  |
| jack           | 2017-02-03          | 23             | 134  |
| jack           | 2017-04-06          | 42             | 176  |
| tony           | 2017-01-02          | 15             | 15   |
| tony           | 2017-01-04          | 29             | 44   |
+----------------+---------------------+----------------+------+
order by: sorts the rows within each partition.
SQL: partition by name, order by orderdate, then sum(cost):
select *,sum(cost) over(partition by name order by orderdate)
from business;
Result:
+----------------+---------------------+----------------+---------------+--+
| business.name | business.orderdate | business.cost | sum_window_0 |
+----------------+---------------------+----------------+---------------+--+
| jack | 2017-01-01 | 10 | 10 |
| jack | 2017-01-05 | 46 | 56 |
| jack | 2017-01-08 | 55 | 111 |
| jack | 2017-02-03 | 23 | 134 |
| jack | 2017-04-06 | 42 | 176 |
| mart | 2017-04-08 | 62 | 62 |
| mart | 2017-04-09 | 68 | 130 |
| mart | 2017-04-11 | 75 | 205 |
| mart | 2017-04-13 | 94 | 299 |
| neil | 2017-05-10 | 12 | 12 |
| neil | 2017-06-12 | 80 | 92 |
| tony | 2017-01-02 | 15 | 15 |
| tony | 2017-01-04 | 29 | 44 |
| tony | 2017-01-07 | 50 | 94 |
+----------------+---------------------+----------------+---------------+--+
Conclusion
order by: (1) sorts the rows; (2) changes the aggregation range to run from the first row of the partition up to the current row (this is also its default frame).
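The running-total behavior can be sketched in Python with an in-memory SQLite database (SQLite >= 3.25, as a stand-in for Hive); the outer ORDER BY is only there to make the row order deterministic.

```python
import sqlite3

rows = [
    ("jack", "2017-01-01", 10), ("tony", "2017-01-02", 15),
    ("jack", "2017-02-03", 23), ("tony", "2017-01-04", 29),
    ("jack", "2017-01-05", 46), ("jack", "2017-04-06", 42),
    ("tony", "2017-01-07", 50), ("jack", "2017-01-08", 55),
    ("mart", "2017-04-08", 62), ("mart", "2017-04-09", 68),
    ("neil", "2017-05-10", 12), ("mart", "2017-04-11", 75),
    ("neil", "2017-06-12", 80), ("mart", "2017-04-13", 94),
]
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE business (name TEXT, orderdate TEXT, cost INT)")
con.executemany("INSERT INTO business VALUES (?, ?, ?)", rows)

# With ORDER BY inside OVER, the default frame runs from the first row
# of the partition up to the current row, giving a running total.
result = con.execute("""
    SELECT name, orderdate, cost,
           SUM(cost) OVER (PARTITION BY name ORDER BY orderdate) AS running
    FROM business
    ORDER BY name, orderdate
""").fetchall()
jack = [r[3] for r in result if r[0] == "jack"]
print(jack)  # [10, 56, 111, 134, 176]
```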
The same kind of requirement can also be implemented with a window frame clause:
CURRENT ROW: the current row
n PRECEDING: the n rows before the current row
n FOLLOWING: the n rows after the current row
UNBOUNDED: the boundary -- UNBOUNDED PRECEDING means from the start of the partition, UNBOUNDED FOLLOWING means to the end of the partition
SQL with a window frame clause:
select *,sum(cost) over(partition by name order by orderdate
rows between 1 PRECEDING and 1 FOLLOWING)
from business;
Result:
+----------------+---------------------+----------------+---------------+--+
| business.name | business.orderdate | business.cost | sum_window_0 |
+----------------+---------------------+----------------+---------------+--+
| jack | 2017-01-01 | 10 | 56 |
| jack | 2017-01-05 | 46 | 111 |
| jack | 2017-01-08 | 55 | 124 |
| jack | 2017-02-03 | 23 | 120 |
| jack | 2017-04-06 | 42 | 65 |
| mart | 2017-04-08 | 62 | 130 |
| mart | 2017-04-09 | 68 | 205 |
| mart | 2017-04-11 | 75 | 237 |
| mart | 2017-04-13 | 94 | 169 |
| neil | 2017-05-10 | 12 | 92 |
| neil | 2017-06-12 | 80 | 92 |
| tony | 2017-01-02 | 15 | 44 |
| tony | 2017-01-04 | 29 | 94 |
| tony | 2017-01-07 | 50 | 79 |
+----------------+---------------------+----------------+---------------+--+
If the requirement changes -- accumulating from the current row to the last row instead of from the first row to the current one -- the frame becomes ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING.
Conclusion
Window frame clause: further, dynamic subdivision of the rows an aggregate operates on.
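The ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING query above can be sketched in Python with an in-memory SQLite database (SQLite >= 3.25, as a stand-in for Hive):

```python
import sqlite3

rows = [
    ("jack", "2017-01-01", 10), ("tony", "2017-01-02", 15),
    ("jack", "2017-02-03", 23), ("tony", "2017-01-04", 29),
    ("jack", "2017-01-05", 46), ("jack", "2017-04-06", 42),
    ("tony", "2017-01-07", 50), ("jack", "2017-01-08", 55),
    ("mart", "2017-04-08", 62), ("mart", "2017-04-09", 68),
    ("neil", "2017-05-10", 12), ("mart", "2017-04-11", 75),
    ("neil", "2017-06-12", 80), ("mart", "2017-04-13", 94),
]
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE business (name TEXT, orderdate TEXT, cost INT)")
con.executemany("INSERT INTO business VALUES (?, ?, ?)", rows)

# The frame clause narrows the window to the previous row, the current
# row, and the next row within each name partition.
result = con.execute("""
    SELECT name, orderdate, cost,
           SUM(cost) OVER (PARTITION BY name ORDER BY orderdate
                           ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS win_sum
    FROM business
    ORDER BY name, orderdate
""").fetchall()
jack = [r[3] for r in result if r[0] == "jack"]
print(jack)  # [56, 111, 124, 120, 65]
```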
Spark SQL
Hive's HQL is essentially SQL that simplifies MapReduce development.
In the early days of databases, the DBA (database administrator) was a highly valued role: data had to be organized and managed by category, but the learning cost was high.
For developers who could not write SQL, Hibernate appeared: Hibernate = HQL (Hibernate Query Language, not Hive's) + Java + XML could replace raw SQL entirely, but the database could no longer optimize the queries, because it knew nothing about the business logic.
Hibernate was later largely displaced by MyBatis.
Then big data arrived. Unlike relational databases such as MySQL, which handle comparatively small data volumes, big data required writing large amounts of MR code, and the learning cost was again high. Could this be simplified? Hive was the answer: Hive => HQL => MR -- write SQL, and Hive turns it into MR jobs.
Spark replaced MR, but writing every job directly against RDDs is cumbersome.
So Shark appeared (Shark => Hive): Shark was quite powerful -- it translated HQL into RDDs. Around 2013 the Shark framework was handed over to the Hive project, and Spark SQL went on to build its own engine. Its ideas come from Hive, so the two integrate seamlessly at the bottom layer; Spark SQL statements are automatically translated into RDD operations.
Spark SQL took Hive as a reference and then developed its own syntax. Hive runs on MR, which is slow, whereas Spark SQL translates queries into RDDs and submits them to the cluster, so execution is much faster.
A DataFrame additionally records the structure of the data, i.e. its schema.