1、背景
1.1 黑马论坛日志,数据分为两部分组成,原来是一个大文件,是56GB;以后每天生成一个文件,大约是150-200MB之间;
1.2 日志格式是apache common日志格式;
1.3 分析一些核心指标,供运营决策者使用;
1.4 开发该系统的目的是分了获取一些业务相关的指标,这些指标在第三方工具中无法获得的;
2、开发大致流程:
2.1 把日志数据上传到HDFS中进行处理
如果是日志服务器数据较小、压力较小,可以直接使用shell命令把数据上传到HDFS中;
如果是日志服务器数据较大、压力较大,使用NFS在另一台服务器上上传数据;
如果日志服务器非常多、数据量大,使用flume进行数据处理;
2.2 使用MapReduce对HDFS中的原始数据进行清洗;
2.3 使用Hive对清洗后的数据进行统计分析;
2.4 使用Sqoop把Hive产生的统计结果导出到mysql中;
2.5 如果用户需要查看详细数据的话,可以使用HBase进行展现;
3、数据准备:
完整日志测试数据下载地址:https://download.csdn.net/download/MrZhangBaby/14027892
每行记录有5部分组成:
1.访问ip
2.访问时间
3.访问资源【跟着两个访问的Url】
4.访问状态
5.本次流量
截取部分数据如下:
27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/shy.gif HTTP/1.1" 200 2663 8.35.201.163 - - [30/May/2013:17:38:21 +0800] "GET /static/image/common/nv_a.png HTTP/1.1" 200 2076 27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/titter.gif HTTP/1.1" 200 1398 27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/sweat.gif HTTP/1.1" 200 1879 27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/mad.gif HTTP/1.1" 200 2423 27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/hug.gif HTTP/1.1" 200 1054 27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/lol.gif HTTP/1.1" 200 1443 27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/victory.gif HTTP/1.1" 200 1275 27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/time.gif HTTP/1.1" 200 687 27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/kiss.gif HTTP/1.1" 200 987 27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/handshake.gif HTTP/1.1" 200 1322 27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/loveliness.gif HTTP/1.1" 200 1579 27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/call.gif HTTP/1.1" 200 603 27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/funk.gif HTTP/1.1" 200 2928 27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/curse.gif HTTP/1.1" 200 1543 27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/dizzy.gif HTTP/1.1" 200 1859 27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/shutup.gif HTTP/1.1" 200 2500 27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/sleepy.gif HTTP/1 |