现有有用户每月销售额的表
用户 月份(日期给省了) 销售金额
A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
A,2015-02,4
A,2015-02,6
B,2015-02,10
B,2015-02,5
现在需要输出报表,各用户每月销售金额与总金额(比如2月份总金额应该是1月和2月销售额之和)
+-------------+----------+-------+---------+
| username | month | sal | totalc |
+-------------+----------+-------+---------+
| A | 2015-01 | 33.0 | 33.0 |
| A | 2015-02 | 10.0 | 43.0 |
| B | 2015-01 | 30.0 | 30.0 |
| B | 2015-02 | 15.0 | 45.0 |
+-------------+----------+-------+---------+
思路
可以先求出各个用户每个月销售额
+-----------+----------+---------+
| username | month | salary |
+-----------+----------+---------+
| A | 2015-01 | 33 |
| A | 2015-02 | 10 |
| B | 2015-01 | 30 |
| B | 2015-02 | 15 |
+-----------+----------+---------+
然后将这张表自己跟自己内连接,条件为A.username=B.username
+-------------+----------+-----------+-------------+----------+-----------+--+
| a.username | a.month | a.salary | b.username | b.month | b.salary |
+-------------+----------+-----------+-------------+----------+-----------+--+
| A | 2015-01 | 33 | A | 2015-01 | 33 |
| A | 2015-01 | 33 | A | 2015-02 | 10 |
| A | 2015-02 | 10 | A | 2015-01 | 33 |
| A | 2015-02 | 10 | A | 2015-02 | 10 |
| B | 2015-01 | 30 | B | 2015-01 | 30 |
| B | 2015-01 | 30 | B | 2015-02 | 15 |
| B | 2015-02 | 15 | B | 2015-01 | 30 |
| B | 2015-02 | 15 | B | 2015-02 | 15 |
+-------------+----------+-----------+-------------+----------+-----------+--+
左边为A表字段,右边为B表字段。接着根据A表的username和month进行分组,而对B表的salary进行求和,比如A用户的1月份为一组的话,那么求和会是B表的A用户的1月份和2月份销售额之和,显然不行,实际需要的是小于等于1月份的销售额之和,所以需要加判断条件B.month<=A.month。
hive上进行操作
准备工作
将数据写入文件
创建表time_money
将文件数据导入表
[root@mini1 ~]# vi time_money.data;
A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
A,2015-02,4
A,2015-02,6
B,2015-02,10
B,2015-02,5
0: jdbc:hive2://localhost:10000> create table time_money(username string,month string, sal string)
0: jdbc:hive2://localhost:10000> row format delimited fields terminated by ',';
No rows affected (0.095 seconds)
0: jdbc:hive2://localhost:10000> load data local inpath '/root/time_money.data' into table time_money;
0: jdbc:hive2://localhost:10000> select * from time_money;
+----------------------+-------------------+-----------------+--+
| time_money.username | time_money.month | time_money.sal |
+----------------------+-------------------+-----------------+--+
| A | 2015-01 | 5 |
| A | 2015-01 | 15 |
| B | 2015-01 | 5 |
| A | 2015-01 | 8 |
| B | 2015-01 | 25 |
| A | 2015-01 | 5 |
| A | 2015-02 | 4 |
| A | 2015-02 | 6 |
| B | 2015-02 | 10 |
| B | 2015-02 | 5 |
+----------------------+-------------------+-----------------+--+
第一步:
求每个用户的各个月份销售额
0: jdbc:hive2://localhost:10000> select username,month,sum(sal) sal from time_money group by username,month;
+-----------+----------+-------+--+
| username | month | sal |
+-----------+----------+-------+--+
| A | 2015-01 | 33.0 |
| A | 2015-02 | 10.0 |
| B | 2015-01 | 30.0 |
| B | 2015-02 | 15.0 |
+-----------+----------+-------+--+
第二步:
将上面的表自己跟自己内连接,条件为A.username=B.usernmae
0: jdbc:hive2://localhost:10000> select *
0: jdbc:hive2://localhost:10000> from
0: jdbc:hive2://localhost:10000> (select username,month,sum(sal)as sal from time_money group by username,month) A
0: jdbc:hive2://localhost:10000> inner join
0: jdbc:hive2://localhost:10000> (select username,month,sum(sal)as sal from time_money group by username,month) B
0: jdbc:hive2://localhost:10000> on
0: jdbc:hive2://localhost:10000> A.username=B.username;
+-------------+----------+--------+-------------+----------+--------+--+
| a.username | a.month | a.sal | b.username | b.month | b.sal |
+-------------+----------+--------+-------------+----------+--------+--+
| A | 2015-01 | 33.0 | A | 2015-01 | 33.0 |
| A | 2015-01 | 33.0 | A | 2015-02 | 10.0 |
| A | 2015-02 | 10.0 | A | 2015-01 | 33.0 |
| A | 2015-02 | 10.0 | A | 2015-02 | 10.0 |
| B | 2015-01 | 30.0 | B | 2015-01 | 30.0 |
| B | 2015-01 | 30.0 | B | 2015-02 | 15.0 |
| B | 2015-02 | 15.0 | B | 2015-01 | 30.0 |
| B | 2015-02 | 15.0 | B | 2015-02 | 15.0 |
+-------------+----------+--------+-------------+----------+--------+--+
第三步:
从上一步的结果中
进行分组查询,分组的字段是a.username a.month
求月累计值: 将b.month <= a.month的所有b.salary求和即可
0: jdbc:hive2://localhost:10000> select A.username, A.month,max(A.sal) sal,sum(B.sal) totalc
0: jdbc:hive2://localhost:10000> from
0: jdbc:hive2://localhost:10000> (select username,month,sum(sal)as sal from time_money group by username,month) A
0: jdbc:hive2://localhost:10000> inner join
0: jdbc:hive2://localhost:10000> (select username,month,sum(sal)as sal from time_money group by username,month) B
0: jdbc:hive2://localhost:10000> on
0: jdbc:hive2://localhost:10000> A.username = B.username
0: jdbc:hive2://localhost:10000> where B.month <= A.month
0: jdbc:hive2://localhost:10000> group by A.username,A.month
0: jdbc:hive2://localhost:10000> order by A.username,A.month;
+-------------+----------+-------+---------+--+
| a.username | a.month | sal | totalc |
+-------------+----------+-------+---------+--+
| A | 2015-01 | 33.0 | 33.0 |
| A | 2015-02 | 10.0 | 43.0 |
| B | 2015-01 | 30.0 | 30.0 |
| B | 2015-02 | 15.0 | 45.0 |
+-------------+----------+-------+---------+--+
注:建表的时候应该讲sal设置为int类型而不是string,那样就不会出现小数。