Fields to analyze: the resource accessed, and the cookie (the cookie stands in for the user, since the IP is a public address shared by many people).
Steps:
1. Collect the data with Flume; a shell script plus a scheduled task (e.g. cron) can drive the collection.
2. Clean the data.
3. Analyze the data with Hive (for example via external partitioned tables), run automatically.
4. Export the Hive results back to MySQL or HBase with Sqoop.
5. Provide a visualization tool, something like a report.
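The scheduled task in step 1 can be sketched as a small piece of shell glue. The paths below are assumptions, but they show the convention used throughout the detailed steps: one run date stamps both the Flume HDFS directory and the date-suffixed Hive table name.

```shell
#!/bin/sh
# Derive today's date once and reuse it everywhere. A cron entry such as
# "0 1 * * * /opt/scripts/daily.sh" would run this once a day
# (the script path is an assumption).
date=$(/bin/date +%Y%m%d)
hdfs_dir="/flume/ip/${date}"   # where the Flume HDFS sink writes
table="nginx_${date}"          # the matching date-suffixed Hive table
echo "${hdfs_dir} ${table}"
```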
Detailed steps:
1. Start Flume on each node to be collected from:
bin/flume-ng agent -n log -c conf -f conf/log.conf -Dflume.root.logger=INFO,console
2. Copy the logs into the directory Flume watches (it makes no difference whether this happens before or after Flume is started):
cp logtest.txt test/
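The start command above loads conf/log.conf, which the post does not show. A minimal sketch of such an agent config (a spooling-directory source feeding an HDFS sink; the agent name `log` matches `-n log` in the command, and the directory paths are assumptions):

```properties
log.sources = r1
log.channels = c1
log.sinks = k1

# watch a local directory for completed log files (path is an assumption)
log.sources.r1.type = spooldir
log.sources.r1.spoolDir = /home/hadoop/test
log.sources.r1.channels = c1

# buffer events in memory
log.channels.c1.type = memory

# write to the date-stamped HDFS directory the Hive table points at
log.sinks.k1.type = hdfs
log.sinks.k1.hdfs.path = /flume/ip/%Y%m%d
log.sinks.k1.hdfs.fileType = DataStream
log.sinks.k1.hdfs.useLocalTimeStamp = true
log.sinks.k1.channel = c1
```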
3. Create the table in Hive (each table must be given its own location, and the location is a directory; if it points straight at the HDFS path Flume writes to, step 4 is unnecessary). A literal "%Y%m%d" is not valid in a table name; here ${date} is substituted by the calling shell script, as in daily.sh below:
CREATE EXTERNAL TABLE if not exists flume_ip_${date}(
remote_addr STRING,
http_marketChannelName STRING,
http_hwMac STRING,
http_x_forwarded_for STRING
)ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' location '/flume/ip/${date}';
4. Load the HDFS data into Hive:
LOAD DATA INPATH '/flume/20160512' into table log_analyze;
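Before loading, it can help to confirm that a sample line really splits into the four declared columns on '|'. A quick sanity check (the log line below is hypothetical):

```shell
# A made-up line in the '|'-delimited format the external table expects:
# remote_addr|http_marketChannelName|http_hwMac|http_x_forwarded_for
line='10.0.0.1|channelA|AA:BB:CC:DD:EE:FF|203.0.113.9'

# Field 3 is what Hive will load into http_hwMac
mac=$(echo "$line" | cut -d'|' -f3)
echo "$mac"
```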
5. Ranking users by how much they watched (Hive rejects select * together with group by, so the select list holds only the grouped column and the aggregate):
create table UserSort row format delimited fields terminated by '\t' as select http_hwMac,count(http_marketChannelName) as ChannelNum from log_analyze group by http_hwMac having ChannelNum >=2 order by ChannelNum desc limit 10;
A direct CREATE TABLE ... AS SELECT cannot rename the result columns; creating the table first and then inserting into it works around this (and there is no need to specify a field delimiter again):
create table UserSort(http_hwMac string,ChannelNum bigint);
insert into table UserSort select http_hwMac,count(http_marketChannelName) as ChannelNum from flume_ip_${date} group by http_hwMac having ChannelNum >=2 order by ChannelNum desc limit 10;
6. Ranking the different channels each user watched (note that ${date} must be braced, otherwise the shell reads $date_ChannelSort as one variable name):
create table ChannelSort row format delimited fields terminated by '\t' as select http_marketChannelName,http_hwMac,count(http_marketChannelName) as ChannelNum from log_analyze group by http_marketChannelName,http_hwMac having ChannelNum >=2 order by ChannelNum desc limit 10;
create table flume_ip_${date}_ChannelSort(http_marketChannelName string,http_hwMac string,ChannelNum bigint);
insert into table flume_ip_${date}_ChannelSort select http_marketChannelName,http_hwMac,count(http_marketChannelName) as ChannelNum from flume_ip_${date} group by http_marketChannelName,http_hwMac having ChannelNum >=2 order by ChannelNum desc limit 10;
7. Detailed viewing records for a single user:
select * from log_analyze where trim(http_hwMac)='zhangsan';
8. How many users watched each channel:
create table ChannelUserSort row format delimited fields terminated by '\t' as select http_marketChannelName,count(http_hwMac) as UserNum from log_analyze group by http_marketChannelName having UserNum >=2 order by UserNum desc limit 10;
create table ChannelUserSort_ip_${date}(http_marketChannelName string,UserNum bigint);
insert into table ChannelUserSort_ip_${date} select http_marketChannelName,count(http_hwMac) as UserNum from log_analyze group by http_marketChannelName having UserNum >=2 order by UserNum desc limit 10;
Ranking of the different device types within each channel:
create table nginx_20160523_ChannelModelTypeSort(http_marketChannelName string,http_hwModel string,ModelTypeNum bigint);
insert into table nginx_20160523_ChannelModelTypeSort select http_marketChannelName,http_hwModel,count(*) as ModelNum from nginx_20160523 group by http_marketChannelName,http_hwModel having ModelNum>=50 order by ModelNum desc limit 20;
9. Append the same day's logs from the other server to the current records (repeat the steps above; note that the insert targets the table created the second time):
insert into table ChannelSort select http_marketChannelName,http_hwMac,count(http_marketChannelName) as ChannelNum from log_analyze group by http_marketChannelName,http_hwMac having ChannelNum >=2 order by ChannelNum desc limit 10;
10. Sum the data (the aggregation itself runs on the MySQL side; see step 11 (3))
11. Export the Hive data to MySQL
(1) Create the corresponding table in MySQL (the columns mirror the Hive ChannelSort table, with the count as an integer so it can be summed):
CREATE TABLE if not exists ChannelSort(
http_marketChannelName varchar(100),
http_hwMac varchar(100),
num1 int
);
(2) Export the data with Sqoop:
sqoop export --connect jdbc:mysql://ip:3306/test --username root --password 123456 --table ChannelSort --export-dir /user/hive/warehouse/channelsort -m 1 --input-fields-terminated-by '\t'
(3) Sum the data:
select http_marketChannelName,sum(num1) as num from ChannelSort group by http_marketChannelName order by num desc;
12. Create the automation script
vi daily.sh
chmod +x daily.sh
date=`/bin/date +%Y%m%d`
#/letv/data/apache-hive-0.13.1-bin/bin/hive -e "CREATE EXTERNAL TABLE if not exists nginx_${date}(
# host STRING,
# uri STRING,
# request_method STRING,
# remote_addr STRING,
# http_marketChannelName STRING,
# http_hwMac STRING,
# http_ethMac STRING,
# http_hwModel STRING,
# http_appVerName STRING,
# time_local_status STRING,
# body_bytes_sent string,
# bytes_sent string,
# gzip_ratio string,
# http_referer STRING,
# http_cookie STRING,
# http_user_agent STRING,
# http_x_forwarded_for STRING,
# upstream_addr STRING,
# upstream_response_time string,
# upstream_status STRING,
#request_time string
#)ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' location '/nginx/${date}/logs';"
/letv/data/apache-hive-0.13.1-bin/bin/hive -e "create table nginx_${date}_ChannelUserSort(http_marketChannelName string,UserNum bigint);"
/letv/data/apache-hive-0.13.1-bin/bin/hive -e "insert into table nginx_${date}_ChannelUserSort select http_marketChannelName,count(distinct http_hwMac) as UserNum from nginx_${date} group by http_marketChannelName having UserNum >=50 order by UserNum desc limit 20;"
/letv/data/apache-hive-0.13.1-bin/bin/hive -e "create table nginx_${date}_UserCount(UserNum bigint);"
/letv/data/apache-hive-0.13.1-bin/bin/hive -e "insert into table nginx_${date}_UserCount select count(distinct http_hwMac) from nginx_${date};"
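To make daily.sh the scheduled task from step 1 of the outline, it can be registered with cron. A sketch (the script path and log path are assumptions; the crontab line is only printed here, not installed):

```shell
# Run daily.sh at 01:00 every day; append stdout/stderr to a log file
entry='0 1 * * * /opt/scripts/daily.sh >> /var/log/daily_hive.log 2>&1'

# To actually install it (left commented out here):
# ( crontab -l 2>/dev/null; echo "$entry" ) | crontab -
echo "$entry"
```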
On the other machines in the cluster, deploy Flume with scp. Make sure JAVA_HOME=/usr/Java/jdk1.8.0_77 in apache-flume-1.5.0-bin/conf/flume-env.sh matches the value in /etc/profile, and also update /etc/hosts (vim /etc/hosts) and the config files under apache-flume-1.5.0-bin/conf.