HIVE,PV,UV分析

进入hive_home目录下
输入命令:
启动metastore: 改服务是原数据服务
bin/hive --service metastore

后台启动:
bin/hive --service metastore 2>&1 >> /usr/local/devtools/cdhbigdata/cdhhive/hive-0.13.1-cdh5.3.6/logs/hive.log &

后台启动,关闭shell连接依然存在:
nohup bin/hive --service metastore 2>&1 >> /usr/local/devtools/cdhbigdata/cdhhive/hive-0.13.1-cdh5.3.6/logs/hive.log &


启动:hiveserver2        
nohup  hive --service hiveserver2 &

创建分区表
CREATE EXTERNAL TABLE traffic(ip string, time string, url string) PARTITIONED BY (logdate string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/user/root/output/hivetables/traffic';

创建分区
ALTER TABLE traffic ADD PARTITION(logdate='2015_04_25') LOCATION '/user/root/output/hivetables/traffic/2013_05_30';

倒入数据使用
local表示从本地导入,使用的是复制操作,原文件保留,没有local,表示从hdfs文件系统导入,使用的是剪切操作,原目录下的文件将被移除。
load data inpath '/user/root/output/part-r-00000' into table traffic PARTITION (logdate='2013_05_30');


PV量
页面浏览量即为PV(Page View),是指所有用户浏览页面的总和,一个独立用户每打开一个页面就被记录1 次。这里,我们只需要统计日志中的记录个数即可,HQL代码如下:
 SELECT COUNT(1) AS PV FROM traffic WHERE logdate='2013_05_30';
 create table pv2013_05_30 as SELECT COUNT(1) AS PV FROM traffic WHERE logdate='2013_05_30';
pv
930


注册用户数
SELECT * AS REGUSER FROM traffic WHERE logdate='2013_05_30' AND INSTR(url,'/member.php?mod=register')>0;

登陆用户数
SELECT COUNT(1) AS REGUSER FROM traffic WHERE logdate='2013_05_30' AND INSTR(url,'/member.php?mod=logging')>0;
create table registernum2013_05_30 as SELECT COUNT(1) AS REGUSER FROM traffic WHERE logdate='2013_05_30' AND INSTR(url,'/member.php?mod=logging')>0;
18

独立IP数
 一天之内,访问网站的不同独立 IP 个数加和。其中同一IP无论访问了几个页面,独立IP 数均为1。因此,这里我们只需要统计日志中处理的独立IP数即可,在SQL中我们可以通过DISTINCT关键字,在HQL中也是通过这个关键字:
SELECT COUNT(DISTINCT ip) AS IP FROM traffic WHERE logdate='2013_05_30';
create table ip2013_05_30 as SELECT COUNT(DISTINCT ip) AS IP FROM traffic WHERE logdate='2013_05_30';
ip
140

跳出用户数
只浏览了一个页面便离开了网站的访问次数,即只浏览了一个页面便不再访问的访问次数。这里,我们可以通过用户的IP进行分组,如果分组后的记录数只有一条,那么即为跳出用户。将这些用户的数量相加,就得出了跳出用户数
SELECT COUNT(1) AS jump FROM (SELECT COUNT(ip) AS times FROM traffic WHERE logdate='2013_05_30' GROUP BY ip HAVING times=1) e
create table jump2013_05_30 as SELECT COUNT(1) AS jump FROM (SELECT COUNT(ip) AS times FROM traffic WHERE logdate='2013_05_30' GROUP BY ip HAVING times=1) e
OK
jump
61

PS:跳出率是指只浏览了一个页面便离开了网站的访问次数占总的访问次数的百分比,即只浏览了一个页面的访问次数 / 全部的访问次数汇总。这里,我们可以将这里得出的跳出用户数/PV数即可得到跳出率。


将所有关键指标放入一张汇总表中以便于通过Sqoop导出到MySQL
为了方便通过Sqoop统一导出到MySQL,这里我们借助一张汇总表将刚刚统计到的结果整合起来,通过表连接结合,HQL代码如下:
CREATE TABLE trafic2013_05_30 AS SELECT '20130530' as date, a.pv, b.reguser, c.ip, d.jump FROM pv2013_05_30 a  JOIN registernum2013_05_30 b ON 1=1  JOIN ip2013_05_30 c ON 1=1 JOIN jump2013_05_30 d ON 1=1;


create table techbbs_logs_stat(
logdate varchar(10) primary key,
 pv int,
 regiser int,
ip int,
jumper int);


./bin/sqoop export --connect jdbc:mysql://xupan001:3306/traffic --username root --password root --table traffic --fields-terminated-by '\001' --export-dir '/user/hive/warehouse/trafic2013_05_30'

step4~step8为新增内容:
#!/bin/sh

......

#step4.alter hive table and then add partition
hive -e "ALTER TABLE techbbs ADD PARTITION(logdate='${yesterday}') LOCATION '/project/techbbs/cleaned/${yesterday}';"
#step5.create hive table everyday
hive -e "CREATE TABLE hmbbs_pv_${yesterday} AS SELECT COUNT(1) AS PV FROM hmbbs WHERE logdate='${yesterday}';"
hive -e "CREATE TABLE hmbbs_reguser_${yesterday} AS SELECT COUNT(1) AS REGUSER FROM hmbbs WHERE logdate='${yesterday}' AND INSTR(url,'member.php?mod=register')>0;"
hive -e "CREATE TABLE hmbbs_ip_${yesterday} AS SELECT COUNT(DISTINCT ip) AS IP FROM hmbbs WHERE logdate='${yesterday}';"
hive -e "CREATE TABLE hmbbs_jumper_${yesterday} AS SELECT COUNT(1) AS jumper FROM (SELECT COUNT(ip) AS times FROM hmbbs WHERE logdate='${yesterday}' GROUP BY ip HAVING times=1) e;"
hive -e "CREATE TABLE hmbbs_${yesterday} AS SELECT '${yesterday}', a.pv, b.reguser, c.ip, d.jumper FROM hmbbs_pv_${yesterday} a JOIN hmbbs_reguser_${yesterday} b ON 1=1 JOIN hmbbs_ip_${yesterday} c ON 1=1 JOIN hmbbs_jumper_${yesterday} d ON 1=1;"
#step6.delete hive tables
hive -e "drop table hmbbs_pv_${yesterday};"
hive -e "drop table hmbbs_reguser_${yesterday};"
hive -e "drop table hmbbs_ip_${yesterday};"
hive -e "drop table hmbbs_jumper_${yesterday};"
#step7.export to mysql
sqoop export --connect jdbc:mysql://hadoop-master:3306/techbbs --username root --password admin --table techbbs_logs_stat --fields-terminated-by '\001' --export-dir '/hive/hmbbs_${yesterday}'
#step8.delete hive table
hive -e "drop table techbbs_${yesterday};"


这里将日期字符串作为参数传入,将该步骤转移到了其他脚本文件中;
#!/bin/sh
#step1.get yesterday format string
#yesterday=`date --date='1 days ago' +%Y_%m_%d`
yesterday=$1


#!/bin/sh
yesterday=`date --date='1 days ago' +%Y_%m_%d`
hmbbs_core.sh $yesterday

改写crontab定时任务配置:crontab -e
* 1 * * * /usr/local/files/apache_logs/techbbs_daily.sh

初始化任务操作
当一个网站已经生成了很多天的日志,而我们的日志分析系统却一直没上线,一直等到了某天才上线。这时,我们需要写一个初始化脚本任务,来对之前的每天的日志进行统计分析与导出结果。这里,我们新增一个techbbs_init.sh脚本文件,内容如下:
#step1.create external table in hive
hive -e "CREATE EXTERNAL TABLE techbbs(ip string, atime string, url string) PARTITIONED BY (logdate string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/project/techbbs/cleaned';"

#step2.compute the days between start date and end date
s1=`date --date="$1"  +%s`
s2=`date +%s`
s3=$((($s2-$s1)/3600/24))

#step3.excute techbbs_core.sh $3 times
for ((i=$s3; i>0; i--))
do
  logdate=`date --date="$i days ago" +%Y_%m_%d`
  techbbs_core.sh $logdate
done

 

转载于:https://my.oschina.net/u/2253438/blog/1579264

  • 0
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值