Mobile Traffic Statistics with Hive

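In recent job interviews I found that many companies use Hive to process their data.
Hive is a member of the Hadoop family: a framework that parses SQL-like statements. It wraps common MapReduce jobs so that you can work with tables stored on HDFS as if you were running SQL.
Hive has two kinds of tables: internal (managed) tables and external tables.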
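When data is loaded into an internal table, Hive moves it into the directory that the data warehouse points to; an external table only records the path where the data lives and does not change its location.
When a table is dropped, an internal table's metadata and data are deleted together, whereas for an external table only the metadata is deleted and the data stays. This makes external tables somewhat safer, more flexible in how data is organized, and convenient for sharing source data.
Besides internal and external tables, Hive also has partitions, which are used to avoid full table scans and speed up lookups. A later post will cover them.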

Now for the actual task: mobile traffic statistics with Hive.
Raw data:

1363157985066   13726230503     00-FD-07-A4-72-B8:CMCC  120.196.100.82  i02.c.aliimg.com                24      27      2481    24681   200
1363157995052   13826544101     5C-0E-8B-C7-F1-E0:CMCC  120.197.40.4                    4       0       264     0       200
1363157991076   13926435656     20-10-7A-28-CC-0A:CMCC  120.196.100.99                  2       4       132     1512    200
1363154400022   13926251106     5C-0E-8B-8B-B1-50:CMCC  120.197.40.4                    4       0       240     0       200
1363157993044   18211575961     94-71-AC-CD-E6-18:CMCC-EASY     120.196.100.99  iface.qiyi.com  视频网站        15   2      1527    2106    200
1363157995074   84138413        5C-0E-8B-8C-E8-20:7DaysInn      120.197.40.4    122.72.52.12            20      16      4116    1432    200
1363157993055   13560439658     C4-17-FE-BA-DE-D9:CMCC  120.196.100.99                  18      15      1116    954     200
1363157995033   15920133257     5C-0E-8B-C7-BA-20:CMCC  120.197.40.4    sug.so.360.cn   信息安全        20      20   156    2936    200
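
The listing above has lost its tab characters, so it is hard to see which value belongs to which column. As a rough sketch, here is how one record could be split in Python. It assumes the file is tab-delimited with 11 fields per line and that a blank host or siteType is an empty field; the column names are taken from the table definition later in this post, and `parse_line` is a hypothetical helper, not part of Hive.

```python
# Column names follow the Hive table definition used later in this post.
COLUMNS = [
    "reportTime", "msisdn", "apmac", "acmac", "host", "siteType",
    "upPackNum", "downPackNum", "upPayLoad", "downPayLoad", "httpStatus",
]


def parse_line(line):
    """Split one tab-delimited record into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    return dict(zip(COLUMNS, fields))


# Second sample record; host and siteType are assumed to be empty fields.
sample = "\t".join([
    "1363157995052", "13826544101", "5C-0E-8B-C7-F1-E0:CMCC", "120.197.40.4",
    "", "", "4", "0", "264", "0", "200",
])
record = parse_line(sample)
print(record["msisdn"], repr(record["host"]), record["upPayLoad"])
```

Every value stays a string here; Hive casts the four counter columns to bigint when it reads the table.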

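Steps:

# With Hive configured, start the Hive CLI with the hive command. Hive initializes lazily, so startup is rather slow.
hive
# Use the show databases command to list the existing databases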
hive> show databases;
OK
default
hive 
Time taken: 3.389 seconds
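# Use the use hive command to switch to a specific database. The hive database is one I created earlier.
use hive;
# Create the table; this one is an internal table. Loading data from HDFS into an internal table moves (cuts) the loaded file away from its original path.
# An external table that points at the data's existing location does not have this problem, so external tables are recommended in real production environments.
create table ll(reportTime string,msisdn string,apmac string,acmac string,host string,siteType string,upPackNum bigint,downPackNum bigint,upPayLoad bigint,downPayLoad bigint,httpStatus string)row format delimited fields terminated by '\t';
# Load the data. Here it is loaded from HDFS; to load from the local Linux filesystem instead, add the local keyword.
load data inpath '/HTTP_20130313143750.dat' into table ll;
# After the load completes, the file is no longer at its original HDFS path.
# Run Hive's SQL-like statement to compute the statistics
select msisdn,sum(uppacknum),sum(downpacknum),sum(uppayload),sum(downpayload) from ll group by msisdn;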

The result is as follows:

hive> select msisdn,sum(uppacknum),sum(downpacknum),sum(uppayload),sum(downpayload) from ll group by msisdn;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201307160252_0006, Tracking URL = http://hadoop0:50030/jobdetails.jsp?jobid=job_201307160252_0006
Kill Command = /usr/local/hadoop/libexec/../bin/hadoop job  -Dmapred.job.tracker=hadoop0:9001 -kill job_201307160252_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2013-07-17 19:51:42,599 Stage-1 map = 0%,  reduce = 0%
2013-07-17 19:52:40,474 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 48.5 sec
2013-07-17 19:52:41,690 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 48.5 sec
2013-07-17 19:52:42,693 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 48.5 sec
2013-07-17 19:52:43,698 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 48.5 sec
2013-07-17 19:52:44,702 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 48.5 sec
2013-07-17 19:52:45,707 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 48.5 sec
2013-07-17 19:52:46,712 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 48.5 sec
2013-07-17 19:52:47,715 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 48.5 sec
2013-07-17 19:52:48,721 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 48.5 sec
2013-07-17 19:52:49,758 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 48.5 sec
2013-07-17 19:52:50,763 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 48.5 sec
2013-07-17 19:52:51,772 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 50.0 sec
2013-07-17 19:52:52,775 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 50.0 sec
2013-07-17 19:52:53,779 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 50.0 sec
MapReduce Total cumulative CPU time: 50 seconds 0 msec
Ended Job = job_201307160252_0006
MapReduce Jobs Launched: 
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 50.0 sec   HDFS Read: 2787075 HDFS Write: 16518 SUCCESS
Total MapReduce CPU Time Spent: 50 seconds 0 msec
OK
13402169727	171	108	11286	130230
13415807477	2067	1683	169668	1994181
13416127574	1501	1094	161963	802756
13416171820	113	99	10630	32120
13417106524	160	128	18688	13088
13418002498	240	256	22136	86896
13418090588	456	351	98934	67470
13418117364	264	152	29436	49966
13418173218	37680	48348	2261286	73159722
13418666750	22432	26482	1395648	39735552
13420637670	20	20	1480	1480
......
Time taken: 75.24 seconds
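
To make it concrete what the group by query computes, here is a minimal plain-Python sketch of the same aggregation, run over four of the sample records from the raw data section. It is only an illustration, not how Hive executes the query. It assumes tab-delimited lines with 11 fields, where fields 7 to 10 are upPackNum, downPackNum, upPayLoad and downPayLoad; `traffic_by_msisdn` is a hypothetical name.

```python
from collections import defaultdict

# Four sample records from the raw data, rebuilt as tab-delimited lines.
# Empty strings stand for host/siteType fields that are blank in the sample.
LINES = ["\t".join(fields) for fields in [
    ["1363157985066", "13726230503", "00-FD-07-A4-72-B8:CMCC", "120.196.100.82",
     "i02.c.aliimg.com", "", "24", "27", "2481", "24681", "200"],
    ["1363157995052", "13826544101", "5C-0E-8B-C7-F1-E0:CMCC", "120.197.40.4",
     "", "", "4", "0", "264", "0", "200"],
    ["1363157991076", "13926435656", "20-10-7A-28-CC-0A:CMCC", "120.196.100.99",
     "", "", "2", "4", "132", "1512", "200"],
    ["1363154400022", "13926251106", "5C-0E-8B-8B-B1-50:CMCC", "120.197.40.4",
     "", "", "4", "0", "240", "0", "200"],
]]


def traffic_by_msisdn(lines):
    """Sum the four traffic counters per msisdn, like the group by query."""
    totals = defaultdict(lambda: [0, 0, 0, 0])
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        msisdn = fields[1]
        # fields[6:10] = upPackNum, downPackNum, upPayLoad, downPayLoad
        for i, value in enumerate(fields[6:10]):
            totals[msisdn][i] += int(value)
    # Sort by msisdn so the output order resembles the Hive result listing.
    return dict(sorted(totals.items()))


for msisdn, sums in traffic_by_msisdn(LINES).items():
    print(msisdn, *sums, sep="\t")
```

Each number appears only once in this small sample, so the sums equal the single-row values; on the full file the same loop accumulates every record for a given number.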

Mobile traffic statistics with MapReduce: http://www.maoxiangyi.cn/index.php/archives/256
