最近面试,发现很多公司在使用hive对数据进行处理。
hive是hadoop家族成员,是一种解析like sql语句的框架。它封装了常用MapReduce任务,让你像执行sql一样操作存储在HDFS的表。
hive的表分为两种,内表和外表。
Hive 创建内部表时,会将数据移动到数据仓库指向的路径;若创建外部表,仅记录数据所在的路径,不对数据的位置做任何改变。
在删除表的时候,内部表的元数据和数据会被一起删除, 而外部表只删除元数据,不删除数据。这样外部表相对来说更加安全些,数据组织也更加灵活,方便共享源数据。
Hive的内外表,还有一个Partition的分区的知识点,用于避免全表扫描,快速检索。后期的文章会提到。
接下来开始正式开始《Hive统计手机流量》
原始数据:
1363157985066 13726230503 00-FD-07-A4-72-B8:CMCC 120.196.100.82 i02.c.aliimg.com 24 27 2481 24681 200 1363157995052 13826544101 5C-0E-8B-C7-F1-E0:CMCC 120.197.40.4 4 0 264 0 200 1363157991076 13926435656 20-10-7A-28-CC-0A:CMCC 120.196.100.99 2 4 132 1512 200 1363154400022 13926251106 5C-0E-8B-8B-B1-50:CMCC 120.197.40.4 4 0 240 0 200 1363157993044 18211575961 94-71-AC-CD-E6-18:CMCC-EASY 120.196.100.99 iface.qiyi.com 瑙.?缃.. 15 2 1527 2106 200 1363157995074 84138413 5C-0E-8B-8C-E8-20:7DaysInn 120.197.40.4 122.72.52.12 20 16 4116 1432 200 1363157993055 13560439658 C4-17-FE-BA-DE-D9:CMCC 120.196.100.99 18 15 1116 954 200 1363157995033 15920133257 5C-0E-8B-C7-BA-20:CMCC 120.197.40.4 sug.so.360.cn 淇℃.瀹.. 20 20 156 2936 200
操作步骤:
#配置好Hive之后,使用hive命令启动hive框架。hive启动属于懒加载模式,会比较慢 hive; #使用show databases命令查看当前数据库信息 hive> show databases; OK default hive Time taken: 3.389 seconds #使用 use hive命令,使用指定的数据库 hive数据库是我之前创建的 use hive; #创建表,这里是创建内表。内表加载hdfs上的数据,会将被加载文件中的内容剪切走。 #外表没有这个问题,所以在实际的生产环境中,建议使用外表。 create table ll(reportTime string,msisdn string,apmac string,acmac string,host string,siteType string,upPackNum bigint,downPackNum bigint,upPayLoad bigint,downPayLoad bigint,httpStatus string)row format delimited fields terminated by '\t'; #加载数据,这里是从hdfs加载数据,也可用linux下加载数据 需要local关键字 load data inpath'/HTTP_20130313143750.dat' into table ll; #数据加载完毕之后,hdfs的 #执行hive 的like sql语句,对数据进行统计 select msisdn,sum(uppacknum),sum(downpacknum),sum(uppayload),sum(downpayload) from ll group by msisdn;
执行结果如下:
hive> select msisdn,sum(uppacknum),sum(downpacknum),sum(uppayload),sum(downpayload) from ll group by msisdn; Total MapReduce jobs = 1 Launching Job 1 out of 1 Number of reduce tasks not specified. Estimated from input data size: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapred.reduce.tasks=<number> Starting Job = job_201307160252_0006, Tracking URL = http://hadoop0:50030/jobdetails.jsp?jobid=job_201307160252_0006 Kill Command = /usr/local/hadoop/libexec/../bin/hadoop job -Dmapred.job.tracker=hadoop0:9001 -kill job_201307160252_0006 Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1 2013-07-17 19:51:42,599 Stage-1 map = 0%, reduce = 0% 2013-07-17 19:52:40,474 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec 2013-07-17 19:52:41,690 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec 2013-07-17 19:52:42,693 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec 2013-07-17 19:52:43,698 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec 2013-07-17 19:52:44,702 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec 2013-07-17 19:52:45,707 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec 2013-07-17 19:52:46,712 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec 2013-07-17 19:52:47,715 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec 2013-07-17 19:52:48,721 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec 2013-07-17 19:52:49,758 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec 2013-07-17 19:52:50,763 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec 2013-07-17 19:52:51,772 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 50.0 sec 2013-07-17 19:52:52,775 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 50.0 sec 2013-07-17 19:52:53,779 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 50.0 sec MapReduce Total cumulative CPU time: 50 seconds 0 msec Ended Job = job_201307160252_0006 MapReduce Jobs Launched: Job 0: Map: 1 Reduce: 1 Cumulative CPU: 50.0 sec HDFS Read: 2787075 HDFS Write: 16518 SUCCESS Total MapReduce CPU Time Spent: 50 seconds 0 msec OK 13402169727 171 108 11286 130230 13415807477 2067 1683 169668 1994181 13416127574 1501 1094 161963 802756 13416171820 113 99 10630 32120 13417106524 160 128 18688 13088 13418002498 240 256 22136 86896 13418090588 456 351 98934 67470 13418117364 264 152 29436 49966 13418173218 37680 48348 2261286 73159722 13418666750 22432 26482 1395648 39735552 13420637670 20 20 1480 1480 ...... Time taken: 75.24 seconds
使用MapReduce进行手机流量统计:http://www.maoxiangyi.cn/index.php/archives/256