A Practical Example of Log Analysis with Hive

The log format is as follows:
36.248.169.9 - - [22/Sep/2013:01:21:45 +0800] "GET /mapreduce/hadoop-terasort-analyse/ HTTP/1.1" 200 22166 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;)"
36.248.169.9 - - [22/Sep/2013:01:21:48 +0800] "POST /wp-comments-post.php HTTP/1.1" 302 520 "http://dongxicheng.org/mapreduce/hadoop-terasort-analyse/" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"
36.248.169.9 - - [22/Sep/2013:01:21:51 +0800] "GET /mapreduce/hadoop-terasort-analyse/ HTTP/1.1" 200 22172 "http://dongxicheng.org/mapreduce/hadoop-terasort-analyse/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;)"
36.248.169.9 - - [22/Sep/2013:01:22:34 +0800] "GET /mapreduce/hadoop-terasort-analyse HTTP/1.1" 301 425 "http://dongxicheng.org/mapreduce/hadoop-terasort-analyse/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;)"
36.248.169.9 - - [22/Sep/2013:01:22:34 +0800] "GET /mapreduce/hadoop-terasort-analyse HTTP/1.1" 301 425 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;)"
220.181.89.164 - - [22/Sep/2013:01:22:31 +0800] "GET /category/data-mining/ HTTP/1.1" 200 15675 "-" "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"
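For reference, each line follows the Apache combined log format. A minimal Python sketch of field-by-field parsing (the regex and field names here are illustrative, not part of the original job):

```python
import re

# Combined Log Format: ip ident user [time] "request" status bytes "referer" "agent"
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\S+)'
)

def parse(line):
    """Return a dict of named fields, or None if the line does not match."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None

sample = ('36.248.169.9 - - [22/Sep/2013:01:21:45 +0800] '
          '"GET /mapreduce/hadoop-terasort-analyse/ HTTP/1.1" 200 22166 '
          '"-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;)"')
rec = parse(sample)
```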

In a pseudo-distributed environment:
(1)
a) mapper_hive.py is as follows:
#!/usr/bin/env python
import sys

# Input comes from STDIN, one access-log line per iteration.
for line in sys.stdin:
    data = line.split()
    if len(data) < 7:
        continue  # skip malformed lines
    # data[0] = client IP; data[3] = "[dd/Mon/yyyy:HH:MM:SS" (strip the
    # leading '[' with [1:]); data[6] = requested path
    print("ip=%s|time=%s|path=%s" % (data[0], data[3][1:], data[6]))

b) Upload the local logs to HDFS
[user@fakeDistnode hadoop-1.0.4]$  bin/hadoop fs -put ~/Sep-2013 /user/user/hive_hwork/Sep-2013
c) Run a Hadoop Streaming job to produce the required output
[user@fakeDistnode hadoop-1.0.4]$  bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar -mapper ~/Desktop/mapper_hive.py -input /user/user/hive_hwork/Sep-2013/*/* -output /user/user/hive_hwork/Sep_2013_filter

(2)
a) Create the table s_access_log
create table if not exists s_access_log(
log map<string,string>)
row format delimited
collection items terminated by '|'
map keys terminated by '='
stored as textfile;
hive> show tables;
OK
s_access_log
Time taken: 4.09 seconds
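With this DDL, Hive splits each text row on '|' into collection items and each item on '=' into a key/value pair. A hypothetical Python equivalent of that parsing:

```python
def parse_hive_map(line, item_sep="|", kv_sep="="):
    """Mimic Hive's map<string,string> parsing of one delimited text row."""
    entries = (item.split(kv_sep, 1) for item in line.rstrip("\n").split(item_sep))
    return {k: v for k, v in entries}

row = parse_hive_map("ip=1.10.213.225|time=25/Sep/2013:21:09:38|path=/wp-login.php")
```

Splitting only on the first '=' (maxsplit=1) keeps values such as "/?author=1" intact.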

(3) Load the data
a) Note: the trailing \t in the path values below is the key/value separator that Hadoop Streaming's output format appends when the mapper emits a key with no value.
hive>  load data inpath '/user/user/hive_hwork/Sep_2013_filter/part-00000' into table s_access_log;                                           
Loading data to table default.s_access_log
OK
Time taken: 0.605 seconds
hive> select * from s_access_log limit 5;
OK
{"ip":"1.10.213.225","time":"25/Sep/2013:21:09:38","path":"/wp-login.php\t"}
{"ip":"1.10.213.225","time":"25/Sep/2013:21:29:55","path":"/wp-login.php\t"}
{"ip":"1.162.173.111","time":"28/Sep/2013:14:37:17","path":"/?author=1\t"}
{"ip":"1.162.173.111","time":"28/Sep/2013:14:37:18","path":"/?feed=rss2\t"}
{"ip":"1.162.173.111","time":"28/Sep/2013:14:37:19","path":"/feed/\t"}
Time taken: 0.28 seconds

(4) Statistics
1) Number of unique IPs
hive>  select count(distinct log['ip']) from s_access_log;
OK
13349

2) Daily PV (duplicates are not filtered out)
hive> select substring(log["time"],0,11),count(*) from s_access_log group by substring(log["time"],0,11);  
20/Sep/2013 78975
21/Sep/2013 81408
22/Sep/2013 97257
23/Sep/2013 121201
24/Sep/2013 115658
25/Sep/2013 129944
26/Sep/2013 149390
27/Sep/2013 123892
28/Sep/2013 103551
29/Sep/2013 150813
30/Sep/2013 777
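The daily PV aggregation can be sanity-checked outside Hive. A minimal sketch over a few synthetic records, mirroring substring(log["time"], 0, 11), which takes the first 11 characters (e.g. "22/Sep/2013"):

```python
from collections import Counter

# Synthetic sample records in the same shape as the filtered data.
records = [
    {"ip": "1.1.1.1", "time": "22/Sep/2013:01:21:45", "path": "/a"},
    {"ip": "2.2.2.2", "time": "22/Sep/2013:01:21:48", "path": "/b"},
    {"ip": "1.1.1.1", "time": "23/Sep/2013:09:00:00", "path": "/a"},
]

# PV: one count per record, grouped by the date prefix.
pv = Counter(r["time"][:11] for r in records)
```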

3) Daily UV
hive> select  count(distinct(log["ip"])),substring(log["time"],0,11) from s_access_log group by substring(log["time"],0,11);
1776 20/Sep/2013
2301 21/Sep/2013
2420 22/Sep/2013
2855 23/Sep/2013
2755 24/Sep/2013
3246 25/Sep/2013
2416 26/Sep/2013
2514 27/Sep/2013
1837 28/Sep/2013
2278 29/Sep/2013
52 30/Sep/2013
Time taken: 42.539 seconds
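UV differs only in counting distinct IPs per day rather than rows; a sketch with the same synthetic shape as above:

```python
from collections import defaultdict

records = [
    {"ip": "1.1.1.1", "time": "22/Sep/2013:01:21:45", "path": "/a"},
    {"ip": "2.2.2.2", "time": "22/Sep/2013:01:21:48", "path": "/b"},
    {"ip": "1.1.1.1", "time": "23/Sep/2013:09:00:00", "path": "/a"},
]

# UV: distinct IPs per date prefix.
ips_by_day = defaultdict(set)
for r in records:
    ips_by_day[r["time"][:11]].add(r["ip"])
uv = {day: len(ips) for day, ips in ips_by_day.items()}
```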

4) Average number of distinct URLs visited per day by the user at IP 175.44.30.93
First, store the per-day distinct URL counts in t_url_perday:
create table t_url_perday as select  count(distinct(log["path"])) as url,substring(log["time"],0,11) as time from s_access_log where log["ip"]="175.44.30.93"        
group by substring(log["time"],0,11);   
Then query that table to compute the average:
select avg(url) from t_url_perday;
45.0
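The two-step approach (store a per-day distinct count, then AVG over it) corresponds to this sketch (synthetic data; the variable names are my own):

```python
from collections import defaultdict

visits = [
    ("175.44.30.93", "29/Sep/2013:00:10:16", "/a"),
    ("175.44.30.93", "29/Sep/2013:00:11:00", "/a"),  # duplicate URL, same day
    ("175.44.30.93", "29/Sep/2013:00:12:00", "/b"),
    ("175.44.30.93", "30/Sep/2013:01:00:00", "/a"),
]

# Step 1: distinct URLs per day for the target IP (like t_url_perday).
urls_per_day = defaultdict(set)
for ip, time, path in visits:
    if ip == "175.44.30.93":
        urls_per_day[time[:11]].add(path)
counts = {day: len(paths) for day, paths in urls_per_day.items()}

# Step 2: average the per-day counts (like avg(url)).
avg = sum(counts.values()) / float(len(counts))
```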

5) Distinct paths visited by IP 175.44.30.93 between 29/Sep/2013:00:10:16 and 29/Sep/2013:09:10:16
select distinct(log["path"]) from s_access_log where log["ip"]="175.44.30.93" and log["time"]>="29/Sep/2013:00:10:16" and log["time"]<="29/Sep/2013:09:10:16"
/apache-mesos/meso-architecture
/apache-mesos/meso-architecture/
/cluster-managemant/puppet
/cluster-managemant/puppet/
/hadoop-corona/corona-clustermanager-details
/hadoop-corona/corona-clustermanager-details/
/mapreduce-nextgen/apache-hadoop-2-0-3-published
/mapreduce-nextgen/apache-hadoop-2-0-3-published/
/mapreduce-nextgen/client-codes
/mapreduce-nextgen/client-codes/
/mapreduce-nextgen/hadoop-2-0-terms-explained
/mapreduce-nextgen/hadoop-2-0-terms-explained/
/mapreduce-nextgen/hadoop-internals-mapreduce-reference
/mapreduce-nextgen/hadoop-internals-mapreduce-reference/
/mapreduce-nextgen/hadoop-jira-yarn-3
/mapreduce-nextgen/hadoop-jira-yarn-3/
/mapreduce-nextgen/hadoop-jira-yarn-392
/mapreduce-nextgen/hadoop-jira-yarn-392/
/mapreduce-nextgen/use-hadoop-yarn-to-build-a-cloud-system
/mapreduce-nextgen/use-hadoop-yarn-to-build-a-cloud-system/
/mapreduce-nextgen/yarnmrv2-mrappmaster-containerallocator
/mapreduce-nextgen/yarnmrv2-mrappmaster-containerallocator/
/mapreduce-nextgen/yarnmrv2-resource-manager-nm-manager
/mapreduce-nextgen/yarnmrv2-resource-manager-nm-manager/
/mapreduce/hadoop-mapreduce-failure-tolerance
/mapreduce/hadoop-mapreduce-failure-tolerance/
/mapreduce/hadoop-rumen-introduction
/mapreduce/hadoop-rumen-introduction/
/mapreduce/hadoop-streaming-programming
/mapreduce/hadoop-streaming-programming/
/mapreduce/hadoop-terasort-analyse
/mapreduce/hadoop-terasort-analyse/
/mapreduce/hdfs-small-files-solution
/mapreduce/hdfs-small-files-solution/
/search-engine/kafka
/search-engine/kafka/
/search-engine/scribe-intro
/search-engine/scribe-intro/
/structure/heap
/structure/heap/
/structure/segment-tree
/structure/segment-tree/
/structure/suffix-array
/structure/suffix-array/
/wp-comments-post.php
Time taken: 34.821 seconds
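Note that the WHERE clause above compares the time strings lexicographically, not as timestamps. That happens to be correct here: both bounds share the same day prefix and the day and clock fields are zero-padded, so string order matches chronological order within September; it would break across a month boundary (e.g. "01/Oct" sorts before "29/Sep"). A quick check:

```python
lo = "29/Sep/2013:00:10:16"
hi = "29/Sep/2013:09:10:16"

# Zero-padded HH:MM:SS means string order equals time order within one day.
inside = lo <= "29/Sep/2013:03:00:00" <= hi
outside = lo <= "29/Sep/2013:10:00:00" <= hi
```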