1. What is Hive
1. Hive is a tool that translates SQL statements into MapReduce programs, so that data stored in HDFS can be queried, modified, and computed over, with the results written back into HDFS.
2. How Hive works
Core ideas:
Hive uses HDFS to store its files;
uses MapReduce to run the data analysis;
uses MySQL (the metastore) to store table metadata.
3. Hive DDL (Data Definition Language)
1. Create a database
create database db_name;
2. Create a table
Select a database first: use db_name;
a. Internal (managed) table
The table directory is created in HDFS at /user/hive/warehouse/table_name
create table table_name(...) row format delimited fields terminated by ',';
b. External table
The table points at an existing HDFS directory such as /xx/yy/
create external table table_name(...) row format delimited fields terminated by ',' location '/xx/yy/';
Note:
When an internal table is dropped, the table directory (the folder named after the table) is deleted, and so is the table's metadata.
When an external table is dropped, only the table's metadata is deleted; the data directory remains (no table-named folder was created in the first place).
3. Alter table definitions
Rename a table: alter table old_name rename to new_name
Drop a table: drop table table_name
4. Partitioned tables
The essence of a partitioned table: subdirectories can be created under the table directory, so a query can target specific subdirectories, which improves query efficiency.
Syntax for creating a partitioned internal table:
create table table_name(...) partitioned by (...) row format delimited fields terminated by ',';
Load data into a specific partition:
load data local inpath 'local_path' into table t_2 partition(day='2018-04-15',city='beijing');
4. Hive DML (Data Manipulation Language)
SQL computation model 1: row-by-row (per-row expressions, per-row filtering)
where filters row by row
SQL computation model 2: grouped (per-group expressions, per-group filtering)
group by splits the rows into groups
having filters the groups
SQL join mechanics:
The essence of a join is to stitch rows from multiple tables into one combined row set, which then serves as the query's input data set!
Table a:
+---------+-------+--+
| a.name  | a.id  |
+---------+-------+--+
| a       | 1     |
| b       | 2     |
| c       | 3     |
| d       | 4     |
+---------+-------+--+
Table b:
+---------+-------------+--+
| b.name  | b.nickname  |
+---------+-------------+--+
| a       | xx          |
| b       | yy          |
| d       | zz          |
| e       | pp          |
+---------+-------------+--+
1. Cartesian product join: from table1, table2
select * from a, b;
Result:
+---------+-------+---------+-------------+--+
| a.name  | a.id  | b.name  | b.nickname  |
+---------+-------+---------+-------------+--+
| a       | 1     | a       | xx          |
| b       | 2     | a       | xx          |
| c       | 3     | a       | xx          |
| d       | 4     | a       | xx          |
| a       | 1     | b       | yy          |
| b       | 2     | b       | yy          |
| c       | 3     | b       | yy          |
| d       | 4     | b       | yy          |
| a       | 1     | d       | zz          |
| b       | 2     | d       | zz          |
| c       | 3     | d       | zz          |
| d       | 4     | d       | zz          |
| a       | 1     | e       | pp          |
| b       | 2     | e       | pp          |
| c       | 3     | e       | pp          |
| d       | 4     | e       | pp          |
+---------+-------+---------+-------------+--+
2. Inner join:
select * from a join b on a.name=b.name;
Result:
+---------+-------+---------+-------------+--+
| a.name  | a.id  | b.name  | b.nickname  |
+---------+-------+---------+-------------+--+
| a       | 1     | a       | xx          |
| b       | 2     | b       | yy          |
| d       | 4     | d       | zz          |
+---------+-------+---------+-------------+--+
3. Left outer join:
select *
from
a left outer join b
on a.name=b.name;
Result:
+---------+-------+---------+-------------+--+
| a.name  | a.id  | b.name  | b.nickname  |
+---------+-------+---------+-------------+--+
| a       | 1     | a       | xx          |
| b       | 2     | b       | yy          |
| c       | 3     | NULL    | NULL        |
| d       | 4     | d       | zz          |
+---------+-------+---------+-------------+--+
4. Right outer join:
select *
from
a right outer join b
on a.name=b.name;
Result:
+---------+-------+---------+-------------+--+
| a.name  | a.id  | b.name  | b.nickname  |
+---------+-------+---------+-------------+--+
| a       | 1     | a       | xx          |
| b       | 2     | b       | yy          |
| d       | 4     | d       | zz          |
| NULL    | NULL  | e       | pp          |
+---------+-------+---------+-------------+--+
5. Full outer join: table1 full join table2
select *
from
a full outer join b
on a.name=b.name;
Result:
+---------+-------+---------+-------------+--+
| a.name  | a.id  | b.name  | b.nickname  |
+---------+-------+---------+-------------+--+
| a       | 1     | a       | xx          |
| b       | 2     | b       | yy          |
| c       | 3     | NULL    | NULL        |
| d       | 4     | d       | zz          |
| NULL    | NULL  | e       | pp          |
+---------+-------+---------+-------------+--+
6. Left semi join: table1 left semi join table2
A left semi join returns no columns from the right table.
select *
from
a left semi join b
on a.name=b.name;
Result:
+---------+-------+--+
| a.name  | a.id  |
+---------+-------+--+
| a       | 1     |
| b       | 2     |
| d       | 4     |
+---------+-------+--+
Note: in a left semi join, the select clause must not reference any column from the right table; otherwise Hive raises an error such as:
Error: Error while compiling statement: FAILED: SemanticException [Error 10004]: Line 1:15 Invalid table alias or column reference 'nickname': (possible column names are: name, id) (state=42000,code=10004)
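The join behaviors above can be sketched in plain Python as a toy model (illustrative only; Hive actually executes joins as MapReduce jobs), reusing the sample tables a and b from this section:

```python
# Toy model of the join semantics shown above; a and b mirror the sample data.
a = [{"name": "a", "id": 1}, {"name": "b", "id": 2},
     {"name": "c", "id": 3}, {"name": "d", "id": 4}]
b = [{"name": "a", "nickname": "xx"}, {"name": "b", "nickname": "yy"},
     {"name": "d", "nickname": "zz"}, {"name": "e", "nickname": "pp"}]

def cartesian(left, right):
    # select * from a, b: every left row paired with every right row
    return [(l, r) for l in left for r in right]

def inner_join(left, right, key):
    # a join b on a.name=b.name: keep only pairs whose keys match
    return [(l, r) for l in left for r in right if l[key] == r[key]]

def left_outer_join(left, right, key):
    # every left row survives; unmatched rows get None (Hive's NULL) on the right
    out = []
    for l in left:
        matches = [r for r in right if r[key] == l[key]]
        if matches:
            out.extend((l, r) for r in matches)
        else:
            out.append((l, None))
    return out

def left_semi_join(left, right, key):
    # only left-table columns, only rows with at least one match on the right
    right_keys = {r[key] for r in right}
    return [l for l in left if l[key] in right_keys]

print(len(cartesian(a, b)))           # 16 rows, as in the Cartesian example
print(len(inner_join(a, b, "name")))  # 3 rows: a, b, d
```

A right outer join is just left_outer_join with the arguments swapped, and a full outer join is the union of both directions.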
Map collections
Create the table:
create table t_map(id int, name string, family map<string,string>, age int)
row format delimited fields terminated by ','
collection items terminated by '#'
map keys terminated by ':';
## Get all keys of the map column
select id,name,map_keys(family) as fa,age from t_map;
## Get all values of the map column
select id,name,map_values(family) as mv,age from t_map;
## Combined: find users who have a brother (where filters row by row)
select id,name,family,age from t_map
where array_contains(map_keys(family),'brother');
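What map_keys, map_values, and array_contains compute can be pictured with a plain Python dict standing in for one row's map<string,string> column (the member names below are made-up sample values, not data from the tables above):

```python
# One row's family column, modeled as a dict; "Tom"/"Jerry" are invented
# sample values used only to illustrate the three map functions.
family = {"father": "Tom", "brother": "Jerry"}

keys = list(family.keys())        # map_keys(family)
values = list(family.values())    # map_values(family)
has_brother = "brother" in keys   # array_contains(map_keys(family),'brother')
```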
Array collections
create table t_array(id int, name string, num array<int>)
row format delimited fields terminated by ','
collection items terminated by ':';
select * from t_array where array_contains(num,14);
select name,size(num) as length from t_array;
select name,num[0] as first_num from t_array;
Struct objects
create table t_struct(id int, name string, bean struct<age:int,name:string,addr:string>)
row format delimited fields terminated by ','
collection items terminated by ':';
Queries:
select id,name,bean.age from t_struct;
select id,name,bean.name as bean_name,bean.addr from t_struct;
5. Hive functions and expressions
1. Math functions
select round(5.4); ## 5, round half up
select round(5.1345,3); ## 5.135, round to 3 decimal places
select ceil(5.4); -- or select ceiling(5.4); ## 6, round up
select floor(5.4); ## 5, round down
select abs(-5.4); ## 5.4, absolute value
Row-wise maximum (across the arguments in one row):
select greatest(3,5,6); ## 6
Row-wise minimum:
select least(3,5,6); ## the smallest of the input arguments
Group-wise maximum (aggregate):
select max(age) from t_person group by ..;
Group-wise minimum (aggregate):
select min(age) from t_person group by ...;
2. String functions
substr(string str, int start) ## take a substring
substring(string str, int start)
Example: select substr("abcdefg",2);
substr(string, int start, int len)
substring(string, int start, int len)
Example: select substr("abcdefg",2,3); ## bcd
concat(string A, string B...) ## concatenate strings
concat_ws(string SEP, string A, string B...)
Example: select concat("ab","xy"); ## abxy
select concat_ws('.','192','168','137');
+--------------+--+
|     _c0      |
+--------------+--+
| 192.168.137  |
+--------------+--+
length(string A)
Example: select length("192.168.33.44"); ## 13
split(string str, string pat) ## split a string
Example: select split("192.168.33.44","\\.");
upper(string str) ## to upper case
lower(string str) ## to lower case
3. Date/time functions
select current_timestamp; ## returns a timestamp: the current date and time
select current_date; ## returns a date: the current date
## unix timestamp -> formatted string: from_unixtime
from_unixtime(bigint unixtime[, string format])
Example: select from_unixtime(unix_timestamp());
select from_unixtime(unix_timestamp(),"yyyy/MM/dd HH:mm:ss");
## formatted string -> unix timestamp: unix_timestamp, returns a bigint
## with no arguments it returns the current time as seconds since the Unix epoch (1970-01-01 00:00:00 UTC)
select unix_timestamp();
unix_timestamp(string date, string pattern)
Example: select unix_timestamp("2017-08-10 17:50:30");
select unix_timestamp("2017-08-10 17:50:30","yyyy-MM-dd HH:mm:ss");
## convert a string to a date
select to_date("2017-09-17 16:58:32");
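The unix_timestamp/from_unixtime round trip can be reproduced with Python's datetime as a sketch (Hive actually interprets strings in the session's timezone; UTC is pinned here so the numbers are deterministic). Hive's pattern "yyyy-MM-dd HH:mm:ss" corresponds to Python's "%Y-%m-%d %H:%M:%S":

```python
from datetime import datetime, timezone

s, fmt = "2017-08-10 17:50:30", "%Y-%m-%d %H:%M:%S"

# unix_timestamp("2017-08-10 17:50:30"): formatted string -> epoch seconds
ts = int(datetime.strptime(s, fmt).replace(tzinfo=timezone.utc).timestamp())

# from_unixtime(ts): epoch seconds -> formatted string (round trip)
back = datetime.fromtimestamp(ts, tz=timezone.utc).strftime(fmt)
```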
4. if()
Source table:
+---------------------+----------------------------------------------+-----------------+--+
| t_movie.movie_name  | t_movie.starring                             | t_movie.m_date  |
+---------------------+----------------------------------------------+-----------------+--+
| 战狼2               | ["吴京","吴刚","卢靖姗"]                      | 2017-08-16      |
| 三生三世十里桃花    | ["刘亦菲","杨洋"]                             | 2017-08-20      |
| 羞羞的铁拳          | ["沈腾","玛丽","艾伦"]                        | 2017-12-20      |
| 人民的名义          | ["陆毅","张丰毅","吴刚","许亚军","张志坚"]    | 2017-06-30      |
+---------------------+----------------------------------------------+-----------------+--+
select movie_name,if(array_contains(starring,'吴刚'),'好片','烂片'),m_date from t_movie;
Result:
+-------------------+-------+-------------+--+
| movie_name        | _c1   | m_date      |
+-------------------+-------+-------------+--+
| 战狼2             | 好片  | 2017-08-16  |
| 三生三世十里桃花  | 烂片  | 2017-08-20  |
| 羞羞的铁拳        | 烂片  | 2017-12-20  |
| 人民的名义        | 好片  | 2017-06-30  |
+-------------------+-------+-------------+--+
5. case when
select movie_name,m_date,
case when m_date>'2017-08-08' then '新片'
     when m_date<='2017-08-08' then '老片'
end as movie
from t_movie;
Source table: the same t_movie table as above.
Result:
+-------------------+-------------+--------+--+
| movie_name        | m_date      | movie  |
+-------------------+-------------+--------+--+
| 战狼2             | 2017-08-16  | 新片   |
| 三生三世十里桃花  | 2017-08-20  | 新片   |
| 羞羞的铁拳        | 2017-12-20  | 新片   |
| 人民的名义        | 2017-06-30  | 老片   |
+-------------------+-------------+--------+--+
6. Common aggregate functions
sum(col): sum of the column's values within a group
avg(col): average of the column's values within a group
max(col): maximum of the column's values within a group
min(col): minimum of the column's values within a group
7. Table-generating functions
1. explode()
Use explode() to "blow up" an array column into rows:
select explode(starring) as actors from t_movie;
+----------+--+
| actors   |
+----------+--+
| 吴京     |
| 吴刚     |
| 卢靖姗   |
| 刘亦菲   |
| 杨洋     |
| 沈腾     |
| 玛丽     |
| 艾伦     |
| 陆毅     |
| 张丰毅   |
| 吴刚     |
| 许亚军   |
| 张志坚   |
+----------+--+
2. lateral view
A query producing one row per (movie, actor) pair, e.g. (sketch):
select movie_name, tmp.actors, m_date
from t_movie lateral view explode(starring) tmp as actors;
Result:
+-------------------+-------------+-------------+--+
| movie_name        | tmp.actors  | m_date      |
+-------------------+-------------+-------------+--+
| 战狼2             | 吴京        | 2017-08-16  |
| 战狼2             | 吴刚        | 2017-08-16  |
| 战狼2             | 卢靖姗      | 2017-08-16  |
| 三生三世十里桃花  | 刘亦菲      | 2017-08-20  |
| 三生三世十里桃花  | 杨洋        | 2017-08-20  |
| 羞羞的铁拳        | 沈腾        | 2017-12-20  |
| 羞羞的铁拳        | 玛丽        | 2017-12-20  |
| 羞羞的铁拳        | 艾伦        | 2017-12-20  |
| 人民的名义        | 陆毅        | 2017-06-30  |
| 人民的名义        | 张丰毅      | 2017-06-30  |
| 人民的名义        | 吴刚        | 2017-06-30  |
| 人民的名义        | 许亚军      | 2017-06-30  |
| 人民的名义        | 张志坚      | 2017-06-30  |
+-------------------+-------------+-------------+--+
How to think about it: lateral view is like a join between two tables:
left table: the original table
right table: the table produced by explode(collection_column)
with the twist that the join happens only between a row and its own exploded rows.
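That explode-then-join-per-row behavior can be sketched in Python, using the 战狼2 row of t_movie:

```python
# lateral view explode(starring): each source row pairs only with the rows
# produced by exploding its own starring array.
movies = [{"movie_name": "战狼2",
           "starring": ["吴京", "吴刚", "卢靖姗"],
           "m_date": "2017-08-16"}]

flattened = [
    {"movie_name": m["movie_name"], "actors": actor, "m_date": m["m_date"]}
    for m in movies
    for actor in m["starring"]     # explode: one output row per array element
]
```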
2. Collapse the flattened table into one row per name
select name, concat_ws('_',collect_set(project)) as project from project_promotion group by name;
Source table:
+-------------------------+----------------------------+--+
| project_promotion.name  | project_promotion.project  |
+-------------------------+----------------------------+--+
| 小明                    | 语文                       |
| 小明                    | 数学                       |
| 小明                    | 英语                       |
| 小明                    | 物理                       |
| 小明                    | 化学                       |
| 小红                    | 生物                       |
| 小红                    | 历史                       |
| 小红                    | 生化                       |
| 小红                    | 政治                       |
+-------------------------+----------------------------+--+
Result:
+-------+---------------------------+--+
| name  | project                   |
+-------+---------------------------+--+
| 小明  | 语文_数学_英语_物理_化学  |
| 小红  | 生物_历史_生化_政治       |
+-------+---------------------------+--+
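The group by + collect_set + concat_ws pattern amounts to the following sketch (note: Hive's collect_set de-duplicates but does not guarantee element order; the sketch keeps first-seen order):

```python
# Collapse (name, project) rows to one '_'-joined string per name,
# mimicking concat_ws('_', collect_set(project)) ... group by name.
rows = [("小明", "语文"), ("小明", "数学"), ("小明", "英语"),
        ("小红", "生物"), ("小红", "历史")]

grouped = {}
for name, project in rows:
    bucket = grouped.setdefault(name, [])
    if project not in bucket:      # collect_set keeps only distinct values
        bucket.append(project)

result = {name: "_".join(projects) for name, projects in grouped.items()}
```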
3. JSON parsing functions (table-generating)
1. Parse JSON and export to a local directory:
insert overwrite local directory '/root/hivedata/'
row format delimited fields terminated by '\t'
select json_tuple(json,"movie","rate","timeStamp","uid") as (movie,rate,ts,uid1)
from t_rate;
2. Parse JSON and export to HDFS:
insert overwrite directory '/root/hivedata/'
select json_tuple(json,"movie","rate","timeStamp","uid") as (movie,rate,ts,uid1)
from t_rate;
+----------------------------------------------------------+--+
| t_json.json                                              |
+----------------------------------------------------------+--+
| {"id":1,"name":"zhangsan1","age":18,"city":"beijing"}    |
| {"id":2,"name":"zhangsan2","age":28,"city":"shanghai"}   |
| {"id":3,"name":"zhangsan3","age":38,"city":"shanghai"}   |
| {"id":4,"name":"zhangsan4","age":17,"city":"beijing"}    |
| {"id":5,"name":"zhangsan5","age":19,"city":"beijing"}    |
| {"id":6,"name":"zhangsan6","age":26,"city":"guangzhou"}  |
| {"id":7,"name":"zhangsan7","age":17,"city":"beijing"}    |
| {"id":8,"name":"zhangsan8","age":14,"city":"guangzhou"}  |
| {"id":9,"name":"zhangsan9","age":48,"city":"beijing"}    |
+----------------------------------------------------------+--+
To run statistics over this data, querying the raw JSON strings directly is inconvenient; the JSON needs to be parsed into an ordinary structured table first, which Hive's built-in json_tuple() function can do.
Steps:
1. Create a raw table to hold the JSON lines:
create table t_json(json string);
load data local inpath '/root/rating.json' into table t_json;
2. Parse it:
select json_tuple(json,"id","name","age","city") as (id,name,age,city) from t_json;
Result:
+-----+------------+------+------------+--+
| id  | name       | age  | city       |
+-----+------------+------+------------+--+
| 1   | zhangsan1  | 18   | beijing    |
| 2   | zhangsan2  | 28   | shanghai   |
| 3   | zhangsan3  | 38   | shanghai   |
| 4   | zhangsan4  | 17   | beijing    |
| 5   | zhangsan5  | 19   | beijing    |
| 6   | zhangsan6  | 26   | guangzhou  |
| 7   | zhangsan7  | 17   | beijing    |
| 8   | zhangsan8  | 14   | guangzhou  |
| 9   | zhangsan9  | 48   | beijing    |
+-----+------------+------+------------+--+
8. Window (analytic) functions
1. row_number() over(): top-N per group
Source table (rown shows row_number within each addr):
+-----+------+-------+-------+-------+--+
| id  | age  | name  | addr  | rown  |
+-----+------+-------+-------+-------+--+
| 4   | 16   | d     | 上海  | 1     |
| 4   | 16   | d     | 上海  | 2     |
| 3   | 22   | c     | 上海  | 3     |
| 3   | 22   | c     | 上海  | 4     |
| 9   | 26   | f     | 上海  | 5     |
| 9   | 26   | f     | 上海  | 6     |
| 6   | 26   | f     | 上海  | 7     |
| 6   | 26   | f     | 上海  | 8     |
| 10  | 14   | b     | 北京  | 1     |
| 10  | 14   | b     | 北京  | 2     |
| 1   | 18   | a     | 北京  | 3     |
| 1   | 18   | a     | 北京  | 4     |
| 2   | 19   | b     | 北京  | 5     |
| 2   | 19   | b     | 北京  | 6     |
| 7   | 16   | d     | 深圳  | 1     |
| 7   | 16   | d     | 深圳  | 2     |
| 8   | 30   | e     | 深圳  | 3     |
| 5   | 30   | e     | 深圳  | 4     |
| 5   | 30   | e     | 深圳  | 5     |
| 8   | 30   | e     | 深圳  | 6     |
+-----+------+-------+-------+-------+--+
-- query the two rows with the greatest age in each region
select id,age,name,addr
from (select id,age,name,addr,
             row_number() over(partition by addr order by age desc) as rown
      from t_rown) o1
where o1.rown<=2;
Result:
+-----+------+-------+-------+--+
| id  | age  | name  | addr  |
+-----+------+-------+-------+--+
| 6   | 26   | f     | 上海  |
| 9   | 26   | f     | 上海  |
| 2   | 19   | b     | 北京  |
| 2   | 19   | b     | 北京  |
| 8   | 30   | e     | 深圳  |
| 5   | 30   | e     | 深圳  |
+-----+------+-------+-------+--+
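The subquery-plus-filter pattern can be sketched in Python on a de-duplicated subset of the source rows:

```python
# Top-2-by-age per addr: emulate row_number() over(partition by addr
# order by age desc) followed by the outer filter rown <= 2.
from itertools import groupby

rows = [(4, 16, "d", "上海"), (3, 22, "c", "上海"), (9, 26, "f", "上海"),
        (10, 14, "b", "北京"), (1, 18, "a", "北京"), (2, 19, "b", "北京")]

# sort by partition key, then by age descending inside each partition
rows.sort(key=lambda r: (r[3], -r[1]))

top2 = []
for addr, part in groupby(rows, key=lambda r: r[3]):
    for rown, row in enumerate(part, start=1):   # row_number starts at 1
        if rown <= 2:
            top2.append(row)
```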
2. sum() over(): running totals
Important: frame specification syntax for sum() over():
sum() over(partition by x order by y rows between 8 preceding and current row)
sum() over(partition by x order by y rows between 8 preceding and 5 following)
sum() over(partition by x order by y rows between unbounded preceding and 5 following)
sum() over(partition by x order by y rows between unbounded preceding and unbounded following)
select name,month,amt,
sum(amt) over(partition by name order by month rows between unbounded preceding and current row) as accumulate
from t_tmp;
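What "rows between unbounded preceding and current row" does (a running total within each partition) can be sketched in Python, using the id-A monthly amounts that appear in the cumulative-report example later in this section:

```python
# Running total per partition: each row's frame is all earlier rows in its
# partition (ordered by month) plus the current row itself.
amounts = [("A", "2015-01", 33), ("A", "2015-02", 10), ("A", "2015-03", 20)]

running, accumulate = 0, []
for name, month, amt in amounts:   # already sorted by month, one partition
    running += amt                 # the current row enters the frame
    accumulate.append((name, month, amt, running))
```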
More window function examples, using table t_access:
select * from t_access;
+----------------+---------------------------------+-----------------------+--------------+--+
| t_access.ip | t_access.url | t_access.access_time | t_access.dt |
+----------------+---------------------------------+-----------------------+--------------+--+
| 192.168.33.3 | http://www.edu360.cn/stu | 2017-08-04 15:30:20 | 20170804 |
| 192.168.33.3 | http://www.edu360.cn/teach | 2017-08-04 15:35:20 | 20170804 |
| 192.168.33.4 | http://www.edu360.cn/stu | 2017-08-04 15:30:20 | 20170804 |
| 192.168.33.4 | http://www.edu360.cn/job | 2017-08-04 16:30:20 | 20170804 |
| 192.168.33.5 | http://www.edu360.cn/job | 2017-08-04 15:40:20 | 20170804 |
| 192.168.33.3 | http://www.edu360.cn/stu | 2017-08-05 15:30:20 | 20170805 |
| 192.168.44.3 | http://www.edu360.cn/teach | 2017-08-05 15:35:20 | 20170805 |
| 192.168.33.44 | http://www.edu360.cn/stu | 2017-08-05 15:30:20 | 20170805 |
| 192.168.33.46 | http://www.edu360.cn/job | 2017-08-05 16:30:20 | 20170805 |
| 192.168.33.55 | http://www.edu360.cn/job | 2017-08-05 15:40:20 | 20170805 |
| 192.168.133.3 | http://www.edu360.cn/register | 2017-08-06 15:30:20 | 20170806 |
| 192.168.111.3 | http://www.edu360.cn/register | 2017-08-06 15:35:20 | 20170806 |
| 192.168.34.44 | http://www.edu360.cn/pay | 2017-08-06 15:30:20 | 20170806 |
| 192.168.33.46 | http://www.edu360.cn/excersize | 2017-08-06 16:30:20 | 20170806 |
| 192.168.33.55 | http://www.edu360.cn/job | 2017-08-06 15:40:20 | 20170806 |
| 192.168.33.46 | http://www.edu360.cn/excersize | 2017-08-06 16:30:20 | 20170806 |
| 192.168.33.25 | http://www.edu360.cn/job | 2017-08-06 15:40:20 | 20170806 |
| 192.168.33.36 | http://www.edu360.cn/excersize | 2017-08-06 16:30:20 | 20170806 |
| 192.168.33.55 | http://www.edu360.cn/job | 2017-08-06 15:40:20 | 20170806 |
+----------------+---------------------------------+-----------------------+--------------+--+
## lag() over()
For each visit record, also show the same user's previous access time:
select ip,url,access_time,
row_number() over(partition by ip order by access_time) as rn,
lag(access_time,1,0) over(partition by ip order by access_time)as last_access_time
from t_access;
+----------------+---------------------------------+----------------------+-----+----------------------+--+
| ip | url | access_time | rn | last_access_time |
+----------------+---------------------------------+----------------------+-----+----------------------+--+
| 192.168.111.3 | http://www.edu360.cn/register | 2017-08-06 15:35:20 | 1 | 0 |
| 192.168.133.3 | http://www.edu360.cn/register | 2017-08-06 15:30:20 | 1 | 0 |
| 192.168.33.25 | http://www.edu360.cn/job | 2017-08-06 15:40:20 | 1 | 0 |
| 192.168.33.3 | http://www.edu360.cn/stu | 2017-08-04 15:30:20 | 1 | 0 |
| 192.168.33.3 | http://www.edu360.cn/teach | 2017-08-04 15:35:20 | 2 | 2017-08-04 15:30:20 |
| 192.168.33.3 | http://www.edu360.cn/stu | 2017-08-05 15:30:20 | 3 | 2017-08-04 15:35:20 |
| 192.168.33.36 | http://www.edu360.cn/excersize | 2017-08-06 16:30:20 | 1 | 0 |
| 192.168.33.4 | http://www.edu360.cn/stu | 2017-08-04 15:30:20 | 1 | 0 |
| 192.168.33.4 | http://www.edu360.cn/job | 2017-08-04 16:30:20 | 2 | 2017-08-04 15:30:20 |
| 192.168.33.44 | http://www.edu360.cn/stu | 2017-08-05 15:30:20 | 1 | 0 |
| 192.168.33.46 | http://www.edu360.cn/job | 2017-08-05 16:30:20 | 1 | 0 |
| 192.168.33.46 | http://www.edu360.cn/excersize | 2017-08-06 16:30:20 | 2 | 2017-08-05 16:30:20 |
| 192.168.33.46 | http://www.edu360.cn/excersize | 2017-08-06 16:30:20 | 3 | 2017-08-06 16:30:20 |
| 192.168.33.5 | http://www.edu360.cn/job | 2017-08-04 15:40:20 | 1 | 0 |
| 192.168.33.55 | http://www.edu360.cn/job | 2017-08-05 15:40:20 | 1 | 0 |
| 192.168.33.55 | http://www.edu360.cn/job | 2017-08-06 15:40:20 | 2 | 2017-08-05 15:40:20 |
| 192.168.33.55 | http://www.edu360.cn/job | 2017-08-06 15:40:20 | 3 | 2017-08-06 15:40:20 |
| 192.168.34.44 | http://www.edu360.cn/pay | 2017-08-06 15:30:20 | 1 | 0 |
| 192.168.44.3 | http://www.edu360.cn/teach | 2017-08-05 15:35:20 | 1 | 0 |
+----------------+---------------------------------+----------------------+-----+----------------------+--+
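lag()'s per-partition shift can be sketched in Python with the three visits of ip 192.168.33.3 from the table above:

```python
# lag(access_time, 1, 0): each row sees the previous row's value within its
# partition (ordered by access_time); the first row gets the default, 0.
times = ["2017-08-04 15:30:20", "2017-08-04 15:35:20", "2017-08-05 15:30:20"]
times.sort()                        # order by access_time

lagged = [(t, times[i - 1] if i > 0 else 0) for i, t in enumerate(times)]
```

lead() is the mirror image: each row sees the next row's value instead of the previous one.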
## lead() over()
For each visit record, also show the same user's next access time:
select ip,url,access_time,
row_number() over(partition by ip order by access_time) as rn,
lead(access_time,1,0) over(partition by ip order by access_time)as pre_access_time
from t_access;
+----------------+---------------------------------+----------------------+-----+----------------------+--+
| ip | url | access_time | rn | pre_access_time |
+----------------+---------------------------------+----------------------+-----+----------------------+--+
| 192.168.111.3 | http://www.edu360.cn/register | 2017-08-06 15:35:20 | 1 | 0 |
| 192.168.133.3 | http://www.edu360.cn/register | 2017-08-06 15:30:20 | 1 | 0 |
| 192.168.33.25 | http://www.edu360.cn/job | 2017-08-06 15:40:20 | 1 | 0 |
| 192.168.33.3 | http://www.edu360.cn/stu | 2017-08-04 15:30:20 | 1 | 2017-08-04 15:35:20 |
| 192.168.33.3 | http://www.edu360.cn/teach | 2017-08-04 15:35:20 | 2 | 2017-08-05 15:30:20 |
| 192.168.33.3 | http://www.edu360.cn/stu | 2017-08-05 15:30:20 | 3 | 0 |
| 192.168.33.36 | http://www.edu360.cn/excersize | 2017-08-06 16:30:20 | 1 | 0 |
| 192.168.33.4 | http://www.edu360.cn/stu | 2017-08-04 15:30:20 | 1 | 2017-08-04 16:30:20 |
| 192.168.33.4 | http://www.edu360.cn/job | 2017-08-04 16:30:20 | 2 | 0 |
| 192.168.33.44 | http://www.edu360.cn/stu | 2017-08-05 15:30:20 | 1 | 0 |
| 192.168.33.46 | http://www.edu360.cn/job | 2017-08-05 16:30:20 | 1 | 2017-08-06 16:30:20 |
| 192.168.33.46 | http://www.edu360.cn/excersize | 2017-08-06 16:30:20 | 2 | 2017-08-06 16:30:20 |
| 192.168.33.46 | http://www.edu360.cn/excersize | 2017-08-06 16:30:20 | 3 | 0 |
| 192.168.33.5 | http://www.edu360.cn/job | 2017-08-04 15:40:20 | 1 | 0 |
| 192.168.33.55 | http://www.edu360.cn/job | 2017-08-05 15:40:20 | 1 | 2017-08-06 15:40:20 |
| 192.168.33.55 | http://www.edu360.cn/job | 2017-08-06 15:40:20 | 2 | 2017-08-06 15:40:20 |
| 192.168.33.55 | http://www.edu360.cn/job | 2017-08-06 15:40:20 | 3 | 0 |
| 192.168.34.44 | http://www.edu360.cn/pay | 2017-08-06 15:30:20 | 1 | 0 |
| 192.168.44.3 | http://www.edu360.cn/teach | 2017-08-05 15:35:20 | 1 | 0 |
+----------------+---------------------------------+----------------------+-----+----------------------+--+
## first_value() over()
Example: the first page each user visited
select ip,url,access_time,
row_number() over(partition by ip order by access_time) as rn,
first_value(url) over(partition by ip order by access_time rows between unbounded preceding and unbounded following) as first_url
from t_access;
+----------------+---------------------------------+----------------------+-----+---------------------------------+--+
| ip             | url                             | access_time          | rn  | first_url                       |
+----------------+---------------------------------+----------------------+-----+---------------------------------+--+
| 192.168.111.3 | http://www.edu360.cn/register | 2017-08-06 15:35:20 | 1 | http://www.edu360.cn/register |
| 192.168.133.3 | http://www.edu360.cn/register | 2017-08-06 15:30:20 | 1 | http://www.edu360.cn/register |
| 192.168.33.25 | http://www.edu360.cn/job | 2017-08-06 15:40:20 | 1 | http://www.edu360.cn/job |
| 192.168.33.3 | http://www.edu360.cn/stu | 2017-08-04 15:30:20 | 1 | http://www.edu360.cn/stu |
| 192.168.33.3 | http://www.edu360.cn/teach | 2017-08-04 15:35:20 | 2 | http://www.edu360.cn/stu |
| 192.168.33.3 | http://www.edu360.cn/stu | 2017-08-05 15:30:20 | 3 | http://www.edu360.cn/stu |
| 192.168.33.36 | http://www.edu360.cn/excersize | 2017-08-06 16:30:20 | 1 | http://www.edu360.cn/excersize |
| 192.168.33.4 | http://www.edu360.cn/stu | 2017-08-04 15:30:20 | 1 | http://www.edu360.cn/stu |
| 192.168.33.4 | http://www.edu360.cn/job | 2017-08-04 16:30:20 | 2 | http://www.edu360.cn/stu |
| 192.168.33.44 | http://www.edu360.cn/stu | 2017-08-05 15:30:20 | 1 | http://www.edu360.cn/stu |
| 192.168.33.46 | http://www.edu360.cn/job | 2017-08-05 16:30:20 | 1 | http://www.edu360.cn/job |
| 192.168.33.46 | http://www.edu360.cn/excersize | 2017-08-06 16:30:20 | 2 | http://www.edu360.cn/job |
| 192.168.33.46 | http://www.edu360.cn/excersize | 2017-08-06 16:30:20 | 3 | http://www.edu360.cn/job |
| 192.168.33.5 | http://www.edu360.cn/job | 2017-08-04 15:40:20 | 1 | http://www.edu360.cn/job |
| 192.168.33.55 | http://www.edu360.cn/job | 2017-08-05 15:40:20 | 1 | http://www.edu360.cn/job |
| 192.168.33.55 | http://www.edu360.cn/job | 2017-08-06 15:40:20 | 2 | http://www.edu360.cn/job |
| 192.168.33.55 | http://www.edu360.cn/job | 2017-08-06 15:40:20 | 3 | http://www.edu360.cn/job |
| 192.168.34.44 | http://www.edu360.cn/pay | 2017-08-06 15:30:20 | 1 | http://www.edu360.cn/pay |
| 192.168.44.3 | http://www.edu360.cn/teach | 2017-08-05 15:35:20 | 1 | http://www.edu360.cn/teach |
+----------------+---------------------------------+----------------------+-----+---------------------------------+--+
## last_value() over()
Example: the last page each user visited
select ip,url,access_time,
row_number() over(partition by ip order by access_time) as rn,
last_value(url) over(partition by ip order by access_time rows between unbounded preceding and unbounded following) as last_url
from t_access;
+----------------+---------------------------------+----------------------+-----+---------------------------------+--+
| ip             | url                             | access_time          | rn  | last_url                        |
+----------------+---------------------------------+----------------------+-----+---------------------------------+--+
| 192.168.111.3 | http://www.edu360.cn/register | 2017-08-06 15:35:20 | 1 | http://www.edu360.cn/register |
| 192.168.133.3 | http://www.edu360.cn/register | 2017-08-06 15:30:20 | 1 | http://www.edu360.cn/register |
| 192.168.33.25 | http://www.edu360.cn/job | 2017-08-06 15:40:20 | 1 | http://www.edu360.cn/job |
| 192.168.33.3 | http://www.edu360.cn/stu | 2017-08-04 15:30:20 | 1 | http://www.edu360.cn/stu |
| 192.168.33.3 | http://www.edu360.cn/teach | 2017-08-04 15:35:20 | 2 | http://www.edu360.cn/stu |
| 192.168.33.3 | http://www.edu360.cn/stu | 2017-08-05 15:30:20 | 3 | http://www.edu360.cn/stu |
| 192.168.33.36 | http://www.edu360.cn/excersize | 2017-08-06 16:30:20 | 1 | http://www.edu360.cn/excersize |
| 192.168.33.4   | http://www.edu360.cn/stu        | 2017-08-04 15:30:20  | 1   | http://www.edu360.cn/job        |
| 192.168.33.4   | http://www.edu360.cn/job        | 2017-08-04 16:30:20  | 2   | http://www.edu360.cn/job        |
| 192.168.33.44  | http://www.edu360.cn/stu        | 2017-08-05 15:30:20  | 1   | http://www.edu360.cn/stu        |
| 192.168.33.46  | http://www.edu360.cn/job        | 2017-08-05 16:30:20  | 1   | http://www.edu360.cn/excersize  |
| 192.168.33.46  | http://www.edu360.cn/excersize  | 2017-08-06 16:30:20  | 2   | http://www.edu360.cn/excersize  |
| 192.168.33.46  | http://www.edu360.cn/excersize  | 2017-08-06 16:30:20  | 3   | http://www.edu360.cn/excersize  |
| 192.168.33.5 | http://www.edu360.cn/job | 2017-08-04 15:40:20 | 1 | http://www.edu360.cn/job |
| 192.168.33.55 | http://www.edu360.cn/job | 2017-08-05 15:40:20 | 1 | http://www.edu360.cn/job |
| 192.168.33.55 | http://www.edu360.cn/job | 2017-08-06 15:40:20 | 2 | http://www.edu360.cn/job |
| 192.168.33.55 | http://www.edu360.cn/job | 2017-08-06 15:40:20 | 3 | http://www.edu360.cn/job |
| 192.168.34.44 | http://www.edu360.cn/pay | 2017-08-06 15:30:20 | 1 | http://www.edu360.cn/pay |
| 192.168.44.3 | http://www.edu360.cn/teach | 2017-08-05 15:35:20 | 1 | http://www.edu360.cn/teach |
+----------------+---------------------------------+----------------------+-----+---------------------------------+--+
/*
Cumulative report -- window function version
*/
First compute each id's total amount per month:
select id,month,
sum(fee) as amount
from t_test
group by id,month;
+-----+----------+-------+--+
| id | month | amount|
+-----+----------+-------+--+
| A | 2015-01 | 33 |
| A | 2015-02 | 10 |
| A | 2015-03 | 20 |
| B | 2015-01 | 30 |
| B | 2015-02 | 15 |
| B | 2015-03 | 45 |
| C | 2015-01 | 30 |
| C | 2015-02 | 40 |
| C | 2015-03 | 30 |
+-----+----------+-------+--+
-- sum() over(): running total per id
select id
,month
,cnts
,sum(cnts) over(partition by id order by month rows between unbounded preceding and current row) as accumulate_cnts
from tmp;
-- where tmp is the per-month aggregation:
select id,month,sum(cnt) as cnts
from t_access_time
group by id,month;
tmp:
+-----+----------+-------+--+
| id | month | cnts |
+-----+----------+-------+--+
| A   | 2015-01  | 33    |
| A   | 2015-02  | 10    |
| A | 2015-03 | 20 |
| B | 2015-01 | 30 |
| B | 2015-02 | 15 |
| B | 2015-03 | 45 |
| C | 2015-01 | 30 |
| C | 2015-02 | 40 |
| C | 2015-03 | 30 |
+-----+----------+-------+--+
select id,month,cnts,
sum(cnts) over(partition by id order by month rows between unbounded preceding and current row ) as accumulate_cnts
from (select id,month,sum(cnt) as cnts
from t_access_time
group by id,month) tmp;