The log format is illustrated by the sample entries below:
36.248.169.9 - - [22/Sep/2013:01:21:45 +0800] "GET /mapreduce/hadoop-terasort-analyse/ HTTP/1.1" 200 22166 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;)"
36.248.169.9 - - [22/Sep/2013:01:21:48 +0800] "POST /wp-comments-post.php HTTP/1.1" 302 520 "http://dongxicheng.org/mapreduce/hadoop-terasort-analyse/" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"
36.248.169.9 - - [22/Sep/2013:01:21:51 +0800] "GET /mapreduce/hadoop-terasort-analyse/ HTTP/1.1" 200 22172 "http://dongxicheng.org/mapreduce/hadoop-terasort-analyse/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;)"
36.248.169.9 - - [22/Sep/2013:01:22:34 +0800] "GET /mapreduce/hadoop-terasort-analyse HTTP/1.1" 301 425 "http://dongxicheng.org/mapreduce/hadoop-terasort-analyse/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;)"
36.248.169.9 - - [22/Sep/2013:01:22:34 +0800] "GET /mapreduce/hadoop-terasort-analyse HTTP/1.1" 301 425 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;)"
220.181.89.164 - - [22/Sep/2013:01:22:31 +0800] "GET /category/data-mining/ HTTP/1.1" 200 15675 "-" "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"
In a pseudo-distributed environment:
(1)
a)
mapper_hive.py is as follows:
#!/usr/bin/env python
import sys

# Input comes from STDIN; emit ip, timestamp, and request path as
# '='-keyed fields separated by '|', matching the Hive table delimiters.
for line in sys.stdin:
    data = line.split()
    if len(data) < 7:
        continue  # skip malformed lines
    # data[3] holds "[22/Sep/2013:01:21:45"; [1:] strips the leading '['
    print("ip=%s|time=%s|path=%s" % (data[0], data[3][1:], data[6]))
b) Upload the local logs to HDFS
[user@fakeDistnode hadoop-1.0.4]$
bin/hadoop fs -put ~/Sep-2013 /user/user/hive_hwork/Sep-2013
c) Run the streaming job to produce the filtered output (the command runs from the Hadoop install directory, which contains bin/hadoop and the streaming jar)
[user@fakeDistnode hadoop-1.0.4]$
bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar -mapper ~/Desktop/mapper_hive.py -input /user/user/hive_hwork/Sep-2013/*/* -output /user/user/hive_hwork/Sep_2013_filter
(2)
a) Create the table s_access_log
create table if not exists s_access_log(
log map<string,string>)
row format delimited
collection items terminated by '|'
map keys terminated by '='
stored as textfile;
hive> show tables;
OK
s_access_log
Time taken: 4.09 seconds
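The DDL above tells Hive to split each stored row into map entries on '|' and each entry into key and value on '='. A small Python sketch of that parsing rule (my own illustration, not Hive internals):

```python
# One row as stored by the mapper, in the table's delimiter format.
row = "ip=36.248.169.9|time=22/Sep/2013:01:21:45|path=/mapreduce/hadoop-terasort-analyse/"

def parse_map(row):
    # Each '|'-separated item is one key=value entry of the map column.
    # maxsplit=1 keeps values containing '=' (e.g. "/?author=1") intact.
    return dict(item.split("=", 1) for item in row.split("|"))

log = parse_map(row)
print(log["ip"])  # → 36.248.169.9
```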
(3) Load the data
a)
hive>
load data inpath '/user/user/hive_hwork/Sep_2013_filter/part-00000' into table s_access_log;
Loading data to table default.s_access_log
OK
Time taken: 0.605 seconds
hive> select * from s_access_log limit 5;
OK
{"ip":"1.10.213.225","time":"25/Sep/2013:21:09:38","path":"/wp-login.php\t"}
{"ip":"1.10.213.225","time":"25/Sep/2013:21:29:55","path":"/wp-login.php\t"}
{"ip":"1.162.173.111","time":"28/Sep/2013:14:37:17","path":"/?author=1\t"}
{"ip":"1.162.173.111","time":"28/Sep/2013:14:37:18","path":"/?feed=rss2\t"}
{"ip":"1.162.173.111","time":"28/Sep/2013:14:37:19","path":"/feed/\t"}
Time taken: 0.28 seconds
(The trailing \t in each path value is the tab separator that Hadoop Streaming's output format appends after the key of each output line.)
(4) Statistics
1) Number of distinct IPs
hive>
select count(distinct log['ip']) from s_access_log;
OK
13349
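The distinct-IP count is equivalent to collecting log['ip'] into a set and taking its size; a toy sketch over a few parsed rows (hypothetical sample data, not the real table):

```python
# A handful of parsed rows standing in for the s_access_log table.
rows = [
    {"ip": "36.248.169.9",   "time": "22/Sep/2013:01:21:45"},
    {"ip": "36.248.169.9",   "time": "22/Sep/2013:01:21:51"},
    {"ip": "220.181.89.164", "time": "22/Sep/2013:01:22:31"},
]
# count(distinct log['ip']) as a set cardinality
unique_ips = len({r["ip"] for r in rows})
print(unique_ips)  # → 2
```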
2) Daily PV (duplicates are not filtered out)
hive> select substring(log["time"],0,11),count(*) from s_access_log group by substring(log["time"],0,11);
20/Sep/2013	78975
21/Sep/2013	81408
22/Sep/2013	97257
23/Sep/2013	121201
24/Sep/2013	115658
25/Sep/2013	129944
26/Sep/2013	149390
27/Sep/2013	123892
28/Sep/2013	103551
29/Sep/2013	150813
30/Sep/2013	777
3) Daily UV
hive> select count(distinct(log["ip"])),substring(log["time"],0,11) from s_access_log group by substring(log["time"],0,11);
1776	20/Sep/2013
2301	21/Sep/2013
2420	22/Sep/2013
2855	23/Sep/2013
2755	24/Sep/2013
3246	25/Sep/2013
2416	26/Sep/2013
2514	27/Sep/2013
1837	28/Sep/2013
2278	29/Sep/2013
52	30/Sep/2013
Time taken: 42.539 seconds
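The PV and UV queries above group on the first 11 characters of log["time"] (in this output, substring(...,0,11) evidently returns the day prefix such as "22/Sep/2013"). The same grouping can be sketched in Python over toy rows (hypothetical data):

```python
from collections import defaultdict

# PV counts every row per day; UV counts distinct IPs per day.
rows = [
    {"ip": "36.248.169.9",   "time": "22/Sep/2013:01:21:45"},
    {"ip": "36.248.169.9",   "time": "22/Sep/2013:01:21:51"},
    {"ip": "220.181.89.164", "time": "22/Sep/2013:01:22:31"},
    {"ip": "220.181.89.164", "time": "23/Sep/2013:09:00:00"},
]
pv = defaultdict(int)
uv = defaultdict(set)
for r in rows:
    day = r["time"][:11]   # mirrors substring(log["time"], 0, 11)
    pv[day] += 1
    uv[day].add(r["ip"])

print(pv["22/Sep/2013"], len(uv["22/Sep/2013"]))  # → 3 2
```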
4) Average number of distinct URLs visited per day by the user at IP 175.44.30.93
First store each day's distinct-URL count into t_url_perday (only the tail of that query was recorded):
group by substring(log["time"],0,11);
Then average over that table:
select avg(url) from t_url_perday;
45.0
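The two-step computation in 4) can be sketched in Python: build per-day distinct-URL sets for the one IP, then average their sizes (toy rows, hypothetical data):

```python
from collections import defaultdict

# Per-day distinct URLs for one IP, then the average count across days.
rows = [
    {"ip": "175.44.30.93", "time": "28/Sep/2013:10:00:00", "path": "/a"},
    {"ip": "175.44.30.93", "time": "28/Sep/2013:11:00:00", "path": "/a"},
    {"ip": "175.44.30.93", "time": "28/Sep/2013:12:00:00", "path": "/b"},
    {"ip": "175.44.30.93", "time": "29/Sep/2013:10:00:00", "path": "/c"},
]
per_day = defaultdict(set)
for r in rows:
    if r["ip"] == "175.44.30.93":
        per_day[r["time"][:11]].add(r["path"])   # distinct per day

counts = [len(urls) for urls in per_day.values()]
print(sum(counts) / len(counts))  # → 1.5  (days: {/a,/b} and {/c})
```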
5) Distinct paths visited by 175.44.30.93 between 29/Sep/2013:00:10:16 and 29/Sep/2013:09:10:16 (the lexicographic comparison on log["time"] is safe here because both bounds fall on the same day)
hive> select distinct(log["path"]) from s_access_log where log["ip"]="175.44.30.93" and log["time"]>="29/Sep/2013:00:10:16" and log["time"]<="29/Sep/2013:09:10:16";
/apache-mesos/meso-architecture
/apache-mesos/meso-architecture/
/cluster-managemant/puppet
/cluster-managemant/puppet/
/hadoop-corona/corona-clustermanager-details
/hadoop-corona/corona-clustermanager-details/
/mapreduce-nextgen/apache-hadoop-2-0-3-published
/mapreduce-nextgen/apache-hadoop-2-0-3-published/
/mapreduce-nextgen/client-codes
/mapreduce-nextgen/client-codes/
/mapreduce-nextgen/hadoop-2-0-terms-explained
/mapreduce-nextgen/hadoop-2-0-terms-explained/
/mapreduce-nextgen/hadoop-internals-mapreduce-reference
/mapreduce-nextgen/hadoop-internals-mapreduce-reference/
/mapreduce-nextgen/hadoop-jira-yarn-3
/mapreduce-nextgen/hadoop-jira-yarn-3/
/mapreduce-nextgen/hadoop-jira-yarn-392
/mapreduce-nextgen/hadoop-jira-yarn-392/
/mapreduce-nextgen/use-hadoop-yarn-to-build-a-cloud-system
/mapreduce-nextgen/use-hadoop-yarn-to-build-a-cloud-system/
/mapreduce-nextgen/yarnmrv2-mrappmaster-containerallocator
/mapreduce-nextgen/yarnmrv2-mrappmaster-containerallocator/
/mapreduce-nextgen/yarnmrv2-resource-manager-nm-manager
/mapreduce-nextgen/yarnmrv2-resource-manager-nm-manager/
/mapreduce/hadoop-mapreduce-failure-tolerance
/mapreduce/hadoop-mapreduce-failure-tolerance/
/mapreduce/hadoop-rumen-introduction
/mapreduce/hadoop-rumen-introduction/
/mapreduce/hadoop-streaming-programming
/mapreduce/hadoop-streaming-programming/
/mapreduce/hadoop-terasort-analyse
/mapreduce/hadoop-terasort-analyse/
/mapreduce/hdfs-small-files-solution
/mapreduce/hdfs-small-files-solution/
/search-engine/kafka
/search-engine/kafka/
/search-engine/scribe-intro
/search-engine/scribe-intro/
/structure/heap
/structure/heap/
/structure/segment-tree
/structure/segment-tree/
/structure/suffix-array
/structure/suffix-array/
/wp-comments-post.php
Time taken: 34.821 seconds