The log format is illustrated by the sample entries below:
36.248.169.9 - - [22/Sep/2013:01:21:45 +0800] "GET /mapreduce/hadoop-terasort-analyse/ HTTP/1.1" 200 22166 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;)"
36.248.169.9 - - [22/Sep/2013:01:21:48 +0800] "POST /wp-comments-post.php HTTP/1.1" 302 520 "http://dongxicheng.org/mapreduce/hadoop-terasort-analyse/" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"
36.248.169.9 - - [22/Sep/2013:01:21:51 +0800] "GET /mapreduce/hadoop-terasort-analyse/ HTTP/1.1" 200 22172 "http://dongxicheng.org/mapreduce/hadoop-terasort-analyse/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;)"
36.248.169.9 - - [22/Sep/2013:01:22:34 +0800] "GET /mapreduce/hadoop-terasort-analyse HTTP/1.1" 301 425 "http://dongxicheng.org/mapreduce/hadoop-terasort-analyse/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;)"
36.248.169.9 - - [22/Sep/2013:01:22:34 +0800] "GET /mapreduce/hadoop-terasort-analyse HTTP/1.1" 301 425 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;)"
220.181.89.164 - - [22/Sep/2013:01:22:31 +0800] "GET /category/data-mining/ HTTP/1.1" 200 15675 "-" "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"
In a pseudo-distributed environment:
(1)
a)
mapper_hive.py is as follows:
#!/usr/bin/env python
import sys

# Input comes from STDIN; emit ip, timestamp, and request path as
# '='-keyed fields separated by '|', matching the Hive table delimiters.
for line in sys.stdin:
    data = line.split()
    if len(data) < 7:
        continue  # skip malformed lines
    # data[3] holds "[22/Sep/2013:01:21:45"; [1:] strips the leading '['
    print("ip=%s|time=%s|path=%s" % (data[0], data[3][1:], data[6]))
b) Upload the local logs to HDFS
[user@fakeDistnode hadoop-1.0.4]$
bin/hadoop fs -put ~/Sep-2013 /user/user/hive_hwork/Sep-2013
c) Run the streaming job to produce the filtered output (the command runs from the Hadoop install directory, which contains bin/hadoop and the streaming jar)
[user@fakeDistnode hadoop-1.0.4]$
bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar -mapper ~/Desktop/mapper_hive.py -input /user/user/hive_hwork/Sep-2013/*/* -output /user/user/hive_hwork/Sep_2013_filter
(2)
a) Create the table s_access_log
create table if not exists s_access_log(
log map<string,string>)
row format delimited
collection items terminated by '|'
map keys terminated by '='
stored as textfile;
hive> show tables;
OK
s_access_log
Time taken: 4.09 seconds
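The DDL above tells Hive to split each stored row into map entries on '|' and each entry into key and value on '='. A small Python sketch of that parsing rule (my own illustration, not Hive internals):

```python
# One row as stored by the mapper, in the table's delimiter format.
row = "ip=36.248.169.9|time=22/Sep/2013:01:21:45|path=/mapreduce/hadoop-terasort-analyse/"

def parse_map(row):
    # Each '|'-separated item is one key=value entry of the map column.
    # maxsplit=1 keeps values containing '=' (e.g. "/?author=1") intact.
    return dict(item.split("=", 1) for item in row.split("|"))

log = parse_map(row)
print(log["ip"])  # → 36.248.169.9
```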
(3) Load the data
a)
hive>
load data inpath '/user/user/hive_hwork/Sep_2013_filter/part-00000' into table s_access_log;
Loading data to table default.s_access_log
OK
Time taken: 0.605 seconds
hive> select * from s_access_log limit 5;
OK
{"ip":"1.10.213.225","time":"25/Sep/2013:21:09:38","path":"/wp-login.php\t"}
{"ip":"1.10.213.225","time":"25/Sep/2013:21:29:55","path":"/wp-login.php\t"}
{"ip":"1.162.173.111","time":"28/Sep/2013:14:37:17","path":"/?author=1\t"}
{"ip":"1.162.173.111","time":"28/Sep/2013:14:37:18","path":"/?feed=rss2\t"}
{"ip":"1.162.173.111","time":"28/Sep/2013:14:37:19","path":"/feed/\t"}
Time taken: 0.28 seconds
(The trailing \t in each path value is the tab separator that Hadoop Streaming's output format appends after the key of each output line.)
(4) Statistics
1) Number of distinct IPs
hive>
select count(distinct log['ip']) from s_access_log;
OK
13349
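The distinct-IP count is equivalent to collecting log['ip'] into a set and taking its size; a toy sketch over a few parsed rows (hypothetical sample data, not the real table):

```python
# A handful of parsed rows standing in for the s_access_log table.
rows = [
    {"ip": "36.248.169.9",   "time": "22/Sep/2013:01:21:45"},
    {"ip": "36.248.169.9",   "time": "22/Sep/2013:01:21:51"},
    {"ip": "220.181.89.164", "time": "22/Sep/2013:01:22:31"},
]
# count(distinct log['ip']) as a set cardinality
unique_ips = len({r["ip"] for r in rows})
print(unique_ips)  # → 2
```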
2) Daily PV (duplicates are not filtered out)
hive> select substring(log["time"],0,11),count(*) from s_access_log group by substring(log["time"],0,11);
20/Sep/2013	78975
21/Sep/2013	81408
22/Sep/2013	97257
23/Sep/2013	121201
24/Sep/2013	115658
25/Sep/2013	129944
26/Sep/2013	149390
27/Sep/2013	123892
28/Sep/2013	103551
29/Sep/2013	150813
30/Sep/2013	777
3) Daily UV
hive> select count(distinct(log["ip"])),substring(log["time"],0,11) from s_access_log group by substring(log["time"],0,11);
1776	20/Sep/2013
2301	21/Sep/2013
2420	22/Sep/2013
2855	23/Sep/2013
2755	24/Sep/2013
3246	25/Sep/2013
2416	26/Sep/2013
2514	27/Sep/2013
1837	28/Sep/2013
2278	29/Sep/2013
52	30/Sep/2013
Time taken: 42.539 seconds
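The PV and UV queries above group on the first 11 characters of log["time"] (in this output, substring(...,0,11) evidently returns the day prefix such as "22/Sep/2013"). The same grouping can be sketched in Python over toy rows (hypothetical data):

```python
from collections import defaultdict

# PV counts every row per day; UV counts distinct IPs per day.
rows = [
    {"ip": "36.248.169.9",   "time": "22/Sep/2013:01:21:45"},
    {"ip": "36.248.169.9",   "time": "22/Sep/2013:01:21:51"},
    {"ip": "220.181.89.164", "time": "22/Sep/2013:01:22:31"},
    {"ip": "220.181.89.164", "time": "23/Sep/2013:09:00:00"},
]
pv = defaultdict(int)
uv = defaultdict(set)
for r in rows:
    day = r["time"][:11]   # mirrors substring(log["time"], 0, 11)
    pv[day] += 1
    uv[day].add(r["ip"])

print(pv["22/Sep/2013"], len(uv["22/Sep/2013"]))  # → 3 2
```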
4) Average number of distinct URLs visited per day by the user at IP 175.44.30.93
First store each day's distinct-URL count into t_url_perday (only the tail of that query was recorded):
group by substring(log["time"],0,11);
Then average over that table:
select avg(url) from t_url_perday;
45.0
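The two-step computation in 4) can be sketched in Python: build per-day distinct-URL sets for the one IP, then average their sizes (toy rows, hypothetical data):

```python
from collections import defaultdict

# Per-day distinct URLs for one IP, then the average count across days.
rows = [
    {"ip": "175.44.30.93", "time": "28/Sep/2013:10:00:00", "path": "/a"},
    {"ip": "175.44.30.93", "time": "28/Sep/2013:11:00:00", "path": "/a"},
    {"ip": "175.44.30.93", "time": "28/Sep/2013:12:00:00", "path": "/b"},
    {"ip": "175.44.30.93", "time": "29/Sep/2013:10:00:00", "path": "/c"},
]
per_day = defaultdict(set)
for r in rows:
    if r["ip"] == "175.44.30.93":
        per_day[r["time"][:11]].add(r["path"])   # distinct per day

counts = [len(urls) for urls in per_day.values()]
print(sum(counts) / len(counts))  # → 1.5  (days: {/a,/b} and {/c})
```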
5) Distinct paths visited by 175.44.30.93 between 29/Sep/2013:00:10:16 and 29/Sep/2013:09:10:16 (the lexicographic comparison on log["time"] is safe here because both bounds fall on the same day)
hive> select distinct(log["path"]) from s_access_log where log["ip"]="175.44.30.93" and log["time"]>="29/Sep/2013:00:10:16" and log["time"]<="29/Sep/2013:09:10:16";
/apache-mesos/meso-architecture
/apache-mesos/meso-architecture/
/cluster-managemant/puppet
/cluster-managemant/puppet/
/hadoop-corona/corona-clustermanager-details
/hadoop-corona/corona-clustermanager-details/
/mapreduce-nextgen/apache-hadoop-2-0-3-published
/mapreduce-nextgen/apache-hadoop-2-0-3-published/
/mapreduce-nextgen/client-codes
/mapreduce-nextgen/client-codes/
/mapreduce-nextgen/hadoop-2-0-terms-explained
/mapreduce-nextgen/hadoop-2-0-terms-explained/
/mapreduce-nextgen/hadoop-internals-mapreduce-reference
/mapreduce-nextgen/hadoop-internals-mapreduce-reference/
/mapreduce-nextgen/hadoop-jira-yarn-3
/mapreduce-nextgen/hadoop-jira-yarn-3/
/mapreduce-nextgen/hadoop-jira-yarn-392
/mapreduce-nextgen/hadoop-jira-yarn-392/
/mapreduce-nextgen/use-hadoop-yarn-to-build-a-cloud-system
/mapreduce-nextgen/use-hadoop-yarn-to-build-a-cloud-system/
/mapreduce-nextgen/yarnmrv2-mrappmaster-containerallocator
/mapreduce-nextgen/yarnmrv2-mrappmaster-containerallocator/
/mapreduce-nextgen/yarnmrv2-resource-manager-nm-manager
/mapreduce-nextgen/yarnmrv2-resource-manager-nm-manager/
/mapreduce/hadoop-mapreduce-failure-tolerance
/mapreduce/hadoop-mapreduce-failure-tolerance/
/mapreduce/hadoop-rumen-introduction
/mapreduce/hadoop-rumen-introduction/
/mapreduce/hadoop-streaming-programming
/mapreduce/hadoop-streaming-programming/
/mapreduce/hadoop-terasort-analyse
/mapreduce/hadoop-terasort-analyse/
/mapreduce/hdfs-small-files-solution
/mapreduce/hdfs-small-files-solution/
/search-engine/kafka
/search-engine/kafka/
/search-engine/scribe-intro
/search-engine/scribe-intro/
/structure/heap
/structure/heap/
/structure/segment-tree
/structure/segment-tree/
/structure/suffix-array
/structure/suffix-array/
/wp-comments-post.php
Time taken: 34.821 seconds