某网站使用Apache服务器搭建公司的网站,Apache每小时产生一个日志,每天产生24个gz的包,gz的包总大小为200M
- 创建日志表格:
hive>
CREATE TABLE log
(ipaddress STRING,identity STRING,user STRING,time STRING,method STRING,request STRING,protocol STRING,status STRING,size STRING,referer STRING,agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES
("input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) (\"[^ ]*) ([^ ]*) ([^ ]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s %10$s %11$s")STORED AS TEXTFILE;
- 手动导入1月份Apache日志:
> load data local inpath '/home/hduser/log/access_log-20140223' overwrite into table apachelog;
Copying data from file:/home/hduser/log/access_log-20140223
Copying file: file:/home/hduser/log/access_log-20140223
Loading data to table default.apachelog
rmr: DEPRECATED: Please use 'rm -r' instead.
Deleted /user/hive/warehouse/apachelog
Table default.apachelog stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 10588368, raw_data_size: 0]
OK
Time taken: 3.057 seconds
- 使用Python脚本批量导入1月份Apache日志:
[hduser@master ~]$ vim loadlog.sh
#!/usr/local/env python
# -*- coding:utf-8 -*-
import os
hour1=[ "%02d"%(i) for i in range(23)]
for i in hour1:
print "正在导入1月份%s点的数据"%i
cmd = "hive -S -e \"load data local inpath '/home/hduser/cachelog/cache.dayoo.com_1_20140101%s_log.gz' into table log\"" %i
print cmd
os.system(cmd)
[hduser@master ~]$ python loadlog.sh
- 查看导入的记录行数:
[hduser@master ~]$ hive
hive> add jar /home/hduser/hive-0.11.0/lib/hive-contrib-0.11.0.jar;
Added /home/hduser/hive-0.11.0/lib/hive-contrib-0.11.0.jar to class path
Added resource: /home/hduser/hive-0.11.0/lib/hive-contrib-0.11.0.jar
hive> select count(*) from log;
........................省略运行过程
5313762 #总共有5百多万行的数据
Time taken: 292.139 seconds, Fetched: 1 row(s)
- 查看访问量排行:
可以根据执行的结果来对访问量大的页面进而优化处理。
hive> SELECT request, COUNT(1) AS numrequest FROM log GROUP BY request SORT BY numrequest DESC LIMIT 10;
..........................省略运行过程
http://img0.dayoo.com/header.css 82538
http://life.dayoo.com/health/ 76785
http://img0.dayoo.com/l/100901/dylist.css 66463
http://club.dayoo.com/topic_list_www.dy?b=cityonline&page=5&n=30 63705
http://club.dayoo.com/topic_list_www.dy?b=strangers&page=5&n=30 59187
http://www.dayoo.com/public/footer/copyright.html 42094
http://club.dayoo.com/topic_list_www.dy?b=cityonline&page=4&n=30 42012
http://img0.dayoo.com/dylc/ad/files/2010/zhangjiurong/shengyayiyuan/20131008/dyw.html 41677
http://img0.dayoo.com/dylc/ad/files/2010/linshitupianku/aiguangzhou/aiguangzhou.html 41482
http://rnd.dayoo.com/javascript/jquery/jquery-1.4.2.js 40061
Time taken: 1139.001 seconds, Fetched: 20 row(s)
- 查看IP访问排行:
可以根据IP地址的分布情况,分析出网站的用户类型等数据
hive> SELECT ipaddress, COUNT(1) AS numrequest FROM log GROUP BY ipaddress SORT BY numrequest DESC LIMIT 10;
..........................省略运行过程
123.151.47.114 214071
116.236.228.19 194238
59.42.241.199 92858
59.42.241.198 92639
123.108.110.196 75841
123.126.105.65 49279
123.151.43.97 31796
42.156.139.46 30928
42.120.160.46 30918
42.120.161.46 30913
Time taken: 443.762 seconds, Fetched: 10 row(s)