Log Analysis

Fields to analyze: the requested resource and the cookie (the cookie stands in for the user, since the IP is a public IP shared by many people).
Steps:
1. Collect the data with Flume; a shell script (or similar) can be set up as a scheduled task.
2. Clean the data (a minimal sketch follows this list).
3. Analyze the data with Hive (e.g., with external partitioned tables), run automatically.
4. Export the Hive results back into MySQL or HBase with Sqoop.
5. Provide a visualization tool, something like reports.
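The original does not show the cleaning step. As a minimal sketch, assuming the four '|'-delimited fields used by the table in step 3 below (file names are placeholders), malformed lines could be dropped like this:

# Hypothetical cleaning pass: keep only non-empty lines that split into
# exactly the four '|'-delimited fields expected by the Hive table in step 3
awk -F'|' 'NF == 4 && $1 != ""' logtest.txt > logtest.cleaned.txt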

Detailed steps:
1. Start Flume on each node to be collected from

bin/flume-ng agent -n log -c conf -f conf/log.conf -Dflume.root.logger=INFO,console

2. Copy the logs into the directory Flume collects from (it makes no difference whether this is done before or after starting Flume)

cp logtest.txt test/
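The command in step 1 references conf/log.conf, which the original does not show. A minimal sketch, assuming a spooling-directory source (matching the copy above) feeding an HDFS sink; the agent name log matches -n log, while spoolDir and channel sizing are assumptions:

# Hypothetical conf/log.conf for agent "log"
log.sources = s1
log.channels = c1
log.sinks = k1

# Spooling-directory source: files copied into spoolDir are ingested,
# whether they arrive before or after the agent starts
log.sources.s1.type = spooldir
log.sources.s1.spoolDir = /home/hadoop/test
log.sources.s1.channels = c1

# In-memory channel
log.channels.c1.type = memory
log.channels.c1.capacity = 10000

# HDFS sink: Flume expands %Y%m%d at write time; useLocalTimeStamp is
# needed because the spooldir source adds no timestamp header itself
log.sinks.k1.type = hdfs
log.sinks.k1.channel = c1
log.sinks.k1.hdfs.path = /flume/ip/%Y%m%d
log.sinks.k1.hdfs.fileType = DataStream
log.sinks.k1.hdfs.useLocalTimeStamp = true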

3. Create the table in Hive (each table must be given its own location, and the location is a directory; if it points directly at the HDFS path Flume writes to, step 4 is unnecessary). Note that Hive does not expand date patterns in table names itself: the ${date} below is substituted by the shell, as in the daily.sh script of step 12.

CREATE EXTERNAL TABLE if not exists flume_ip_${date}(
    remote_addr STRING,
    http_marketChannelName STRING,
    http_hwMac STRING,
    http_x_forwarded_for STRING
)ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' location '/flume/ip/${date}';

4. Load the data from HDFS into Hive (note that LOAD DATA INPATH moves the files rather than copying them)

LOAD DATA INPATH '/flume/20160512' into table log_analyze;
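Alternatively, with the external partitioned table mentioned in the overview, the day's Flume output directory can be registered as a partition and no LOAD is needed. A sketch under that assumption (the partition column dt is hypothetical):

-- Hypothetical partitioned variant of the table from step 3
CREATE EXTERNAL TABLE if not exists flume_ip(
    remote_addr STRING,
    http_marketChannelName STRING,
    http_hwMac STRING,
    http_x_forwarded_for STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';

-- Register the day's directory as a partition; no data is moved
ALTER TABLE flume_ip ADD PARTITION (dt='20160512') LOCATION '/flume/ip/20160512';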

5. Rank users by how much they watch

create table UserSort row format delimited fields terminated by '\t' as select http_hwMac,count(http_marketChannelName) as ChannelNum from log_analyze group by http_hwMac having ChannelNum >=2 order by ChannelNum desc limit 10;

A plain CREATE TABLE ... AS SELECT cannot specify column names; this can be worked around as follows (and there is no need to specify a field delimiter again):

create table UserSort(http_hwMac string,ChannelNum bigint);

insert into table UserSort select http_hwMac,count(http_marketChannelName) as ChannelNum from flume_ip_${date} group by http_hwMac having ChannelNum >=2 order by ChannelNum desc limit 10;

6. Rank the distinct channels each user watches

create table ChannelSort row format delimited fields terminated by '\t' as select http_marketChannelName,http_hwMac,count(http_marketChannelName) as ChannelNum from log_analyze group by http_marketChannelName,http_hwMac having ChannelNum >=2 order by ChannelNum desc limit 10;

create table flume_ip_${date}_ChannelSort(http_marketChannelName string,http_hwMac string,ChannelNum bigint);

insert into table flume_ip_${date}_ChannelSort select http_marketChannelName,http_hwMac,count(http_marketChannelName) as ChannelNum from flume_ip_${date} group by http_marketChannelName,http_hwMac having ChannelNum >=2 order by ChannelNum desc limit 10;

7. Detailed viewing records for a given user

select * from log_analyze where trim(http_hwMac)='zhangsan';

8. How many users watch each channel

create table ChannelUserSort row format delimited fields terminated by '\t' as select http_marketChannelName,count(http_hwMac) as UserNum from log_analyze group by http_marketChannelName having UserNum >=2 order by UserNum desc limit 10;

create table ChannelUserSort_ip_${date}(http_marketChannelName string,UserNum bigint);

insert into table ChannelUserSort_ip_${date} select http_marketChannelName,count(http_hwMac) as UserNum from log_analyze group by http_marketChannelName having UserNum >=2 order by UserNum desc limit 10;

Ranking of device types within each channel

create table nginx_20160523_ChannelModelTypeSort(http_marketChannelName string,http_hwModel string,ModelTypeNum bigint); 

insert into table nginx_20160523_ChannelModelTypeSort select http_marketChannelName,http_hwModel,count(*) as ModelNum from nginx_20160523 group by http_marketChannelName,http_hwModel having ModelNum>=50 order by ModelNum desc limit 20;

9. Append the same day's logs from the other server to the current records (repeat the steps above; note that the target is the table created the second time around)


insert into table ChannelSort select http_marketChannelName,http_hwMac,count(http_marketChannelName) as ChannelNum from log_analyze group by http_marketChannelName,http_hwMac having ChannelNum >=2 order by ChannelNum desc limit 10;

10. Sum the data

The summation itself is done in MySQL after the export, in step 11(3) below.
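A Hive-side alternative, assuming step 9 has appended the second server's rows into the same ChannelSort table, would sum the counts per (channel, MAC) pair; a minimal sketch:

-- Hypothetical: merge the two servers' rows by summing counts per key
select http_marketChannelName,http_hwMac,sum(ChannelNum) as ChannelNum from ChannelSort group by http_marketChannelName,http_hwMac;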

11. Export the Hive data to MySQL
(1) Create the corresponding table in MySQL (its columns mirror the Hive ChannelSort table)

CREATE TABLE if not exists ChannelSort(
    http_marketChannelName varchar(100),
    http_hwMac varchar(100),
    num1 varchar(10)
);

(2) Export the data with Sqoop

sqoop export --connect jdbc:mysql://ip:3306/test --username root --password 123456 --table ChannelSort --export-dir /user/hive/warehouse/channelsort -m 1 --input-fields-terminated-by '\t'

(3) Sum the data

select  http_marketChannelName,sum(num1) as num from ChannelSort group by  http_marketChannelName order by num desc;

12. Create the automation script
vi daily.sh
chmod +x daily.sh

date=`/bin/date +%Y%m%d`

#/letv/data/apache-hive-0.13.1-bin/bin/hive -e "CREATE EXTERNAL TABLE if not exists nginx_${date}(
#    host STRING,
#    uri STRING,
#    request_method STRING,
#    remote_addr STRING,
#    http_marketChannelName STRING,
#    http_hwMac STRING,
#    http_ethMac STRING,
#    http_hwModel STRING,
#    http_appVerName STRING,
#    time_local_status STRING,
#    body_bytes_sent string,
#    bytes_sent string,
#    gzip_ratio string,
#    http_referer STRING,
#    http_cookie STRING,
#    http_user_agent STRING,
#    http_x_forwarded_for STRING,
#    upstream_addr STRING,
#    upstream_response_time string,
#    upstream_status STRING,
#    request_time string
#)ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' location '/nginx/${date}/logs';"

/letv/data/apache-hive-0.13.1-bin/bin/hive -e "create table nginx_${date}_ChannelUserSort(http_marketChannelName string,UserNum bigint);"
/letv/data/apache-hive-0.13.1-bin/bin/hive -e "insert into table nginx_${date}_ChannelUserSort select http_marketChannelName,count(distinct http_hwMac) as UserNum from nginx_${date} group by http_marketChannelName having UserNum >=50 order by UserNum desc limit 20;"
/letv/data/apache-hive-0.13.1-bin/bin/hive -e "create table nginx_${date}_UserCount(UserNum bigint);"
/letv/data/apache-hive-0.13.1-bin/bin/hive -e "insert into table nginx_${date}_UserCount select count(distinct http_hwMac) from nginx_${date};"
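The overview mentions driving collection and analysis with a scheduled task; the original does not show one, but a crontab entry along these lines (time and paths are placeholders) would run the script nightly:

# Hypothetical crontab entry (install with `crontab -e`):
# run daily.sh at 02:00 every day and append its output to a log file
0 2 * * * /letv/data/daily.sh >> /letv/data/daily.log 2>&1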

The other machines in the cluster should have Flume deployed via scp. The JAVA_HOME=/usr/Java/jdk1.8.0_77 setting in apache-flume-1.5.0-bin/conf/flume-env.sh must match the one in /etc/profile, and /etc/hosts and the configuration files under apache-flume-1.5.0-bin/conf must be updated as well.
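A sketch of that deployment, with node2 as a placeholder hostname:

# Hypothetical: copy the Flume installation to another node, then check that
# its JAVA_HOME in flume-env.sh matches the one in /etc/profile
# (flume-env.sh, /etc/hosts, and the agent config still need manual edits there)
scp -r /letv/data/apache-flume-1.5.0-bin node2:/letv/data/
ssh node2 'grep JAVA_HOME /letv/data/apache-flume-1.5.0-bin/conf/flume-env.sh /etc/profile'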
