学习笔记:从0开始学习大数据-38.综合实训一：nginx日志分析

最新推荐文章于 2022-12-30 11:29:51 发布

领尚

最新推荐文章于 2022-12-30 11:29:51 发布

阅读量454

点赞数 1

分类专栏： hadoop Hadoop 文章标签： spark hadoop hdfs

本文链接：https://blog.csdn.net/oLinBSoft/article/details/104458844

版权

Hadoop 同时被 2 个专栏收录

46 篇文章 7 订阅

订阅专栏

hadoop

45 篇文章 0 订阅

订阅专栏

前面的学习大数据运行环境搭建后，可以开始综合实训了，因为各种事务，大数据学习进程几乎停滞了一年时间，这次受新冠状肺炎病毒疫情影响，呆在家里，才有点时间继续，这次比较完整地测试nginx日志大数据分析处理过程。

处理流程

数据处理流程参考如上，因为笔记本电脑限制，all in one，全部在一个虚拟机内完成。虚拟机ip是 192.168.49.141,计算机名centos7.linbsoft.com .

1. nginx日志通过flume导入到hdfs

（1）在flume的conf目录创建新文件nginx-loger.conf，内容如下

##配置Agent

a1.sources = r1
a1.sinks = k1
a1.channels = c1
 
# # 配置Source
a1.sources.r1.type = exec
a1.sources.r1.channels = c1
a1.sources.r1.deserializer.outputCharset = UTF-8
# # 配置需要监控的日志输出目录
a1.sources.r1.command = tail -F /var/log/nginx/access.log-20200221
# # 配置Sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.path = hdfs://centos7.linbsoft.com:8020/user/flume/nginx_logs/%Y%m%d
a1.sinks.k1.hdfs.filePrefix = %Y-%m-%d-%H
a1.sinks.k1.hdfs.fileSuffix = .log
a1.sinks.k1.hdfs.minBlockReplicas = 1
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.rollInterval = 86400
a1.sinks.k1.hdfs.rollSize = 1000000
a1.sinks.k1.hdfs.rollCount = 10000
a1.sinks.k1.hdfs.idleTimeout = 0
# # 配置Channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# # 将三者连接
a1.sources.r1.channel = c1
a1.sinks.k1.channel = c1

（2）如果不存在hdfs的flume目录，创建

hadoop fs -mkdir /user/flume

(3) 导入数据

cd apache-flume-1.6.0-cdh5.16.1-bin

bin/flume-ng agent --conf conf --conf-file conf/nginx-loger.conf --name a1

如果只导出了最后几条记录，可以把

a1.sources.r1.command = tail -F /var/log/nginx/access.log-20200221

改为

a1.sources.r1.command = cat /var/log/nginx/access.log-20200221

运行一次，因为前面tail -F是持续跟踪日志文件变化，每次只导入后10行

（4）检查导入数据是否成功

(

2. spark 读取hdfs 数据，运算后结果写入web目录文件中

可以有多种方法执行spark，如通过java程序，python程序，scala程序，shell脚本等，以下四种方式任选一个执行即可

（1）通过pyspark 的scala交互脚本

# pyspark

>>> rdd = sc.textFile("hdfs://centos7.linbsoft.com:8020/user/flume/nginx_logs/20200223/*").map(lambda line:line.split(':')[1]).map(lambda line:(line,1)).reduceByKey(lambda a1,a2:a1+a2).sortByKey(lambda a1,a2:a1+a2).map(lambda (key,value):(int(key),value)).repartition(1).saveAsTextFile("file:///usr/local/apache-tomcat-9.0.19/webapps/examples/nginxlog")

rdd的一次次转换可以分步(执行，也可以如上一次连接执行处理整个过程

（2）通过spark-shell执行

#cat examples/test.scala

val rdd = sc.textFile("hdfs://centos7.linbsoft.com:8020/user/flume/nginx_logs/20200223/*").map(lambda line:line.split(':')[1]).map(lambda line:(line,1)).reduceByKey(lambda a1,a2:a1+a2).sortByKey(lambda a1,a2:a1+a2).map(lambda (key,value):(int(key),value)).repartition(1).saveAsTextFile("file:///usr/local/apache-tomcat-9.0.19/webapps/examples/nginxlog")

# ./bin/spark-shell -i <./examples/test.scala

(3) 通过bash shell脚本执行

# cat testspark.sh

#!/bin/bash
exec $SPARK_HOME/bin/spark-shell --name spark-sql-test --executor-cores 8 --executor-memory 8g --num-executors 8 --conf spark.cleaner.ttl=240000 <<!EOF
val rdd = sc.textFile("hdfs://centos7.linbsoft.com:8020/user/flume/nginx_logs/20200223/*").map(lambda line:line.split(':')[1]).map(lambda line:(line,1)).reduceByKey(lambda a1,a2:a1+a2).sortByKey(lambda a1,a2:a1+a2).map(lambda (key,value):(int(key),value)).repartition(1).saveAsTextFile("file:///usr/local/apache-tomcat-9.0.19/webapps/examples/nginxlog")
!EOF

# ./testspark.sh

(4) pyspark scala分步执行

#pyspark

>>>rdd = sc.textFile("hdfs://centos7.linbsoft.com:8020/user/flume/nginx_logs/20200223/*")
>>>rdd.count()
>>>timerdd=rdd.map(lambda line:line.split(":")).map(lambda w:w[1])
>>>timerdd_add=timerdd.map(lambda line:(line,1))
>>>timerdd_add_reduce=timerdd_add.reduceByKey(lambda a1,a2:a1+a2)
>>>timerdd_add_reduce_sort=timerdd_add_reduce.sortByKey(lambda a1,a2:a1+a2)
>>>timerdd_add_reduce_sort.collect()
>>>timerdd_map_add=timerdd_add_reduce_sort.map(lambda (key,value):(int(key),value))
>>>timerdd_map_add.repartition(1).saveAsTextFile("file:///usr/local/apache-tomcat-9.0.19/webapps/examples/nginxlog")

分布执行的好处是每一步可以 rdd.count() rdd.collect() 查看处理结果

（4）执行上面的spark统计处理后，可以看到最后的输出路径里的结果文件

[root@centos7 nginxlog]# cat part-00000
(0, 432)
(1, 208)
(2, 232)
(3, 94)
(4, 968)
(5, 888)
(6, 5)
(7, 50)
(8, 176)
(9, 474)
(10, 476)
(11, 148)
(12, 142)
(13, 313)
(14, 356)
(15, 254)
(16, 493)
(17, 428)
(18, 906)
(19, 1306)
(20, 1335)
(21, 615)
(22, 536)
(23, 591)
[root@centos7 nginxlog]#

就是当天每小时网页的访问量

3. 通过网页展示结果，使用heightcharts 在网页展示图片，ajax读取结果文件

直接上完整文件，不解释了

(1) 网页文件nginxcharts.html

<html>
<head>
<meta charset="UTF-8" />
<title>nginx charts</title>
<script src="js/jquery-2.1.4.min.js"></script>
<script src="js/highcharts.js"></script>
</head>
<body>
<div id="container" style="width: 550px; height: 400px; margin: 0 auto"></div>
<script language="JavaScript">
$(document).ready(function() {
    var title = {
        style: {
            fontSize: '37px'
        },
	text: 'Ngnix服务器天访问量分布'
    };
   var subtitle = {
        text: '来源：随机生成'
    };
    var xAxis = {
        title: {
            text: '访问时间（小时）'
        },
	categories: count2()
    };
    var yAxis = {
        title: {
            text: '访问量（次）'
        },
        plotLines: [{
            value: 0,
            width: 1,
            color: '#808080'
        }]
    };

    var tooltip = {
        valueSuffix: '次'
    }
   var legend = {
        layout: 'vertical',
        align: 'right',
        verticalAlign: 'middle',
        borderWidth: 0
    };

    var series = [{
        name: 'Nginx',
        data: count()
    }];

    var json = {};

    json.title = title;
    json.subtitle = subtitle;
    json.xAxis = xAxis;
    json.yAxis = yAxis;
    json.tooltip = tooltip;
    json.legend = legend;
    json.series = series;


    $('#container').highcharts(json);
});


function count2() {
    var rst;
    $.ajax({
        type: "get",
        async: false,
        url: "./nginxlog/part-00000",
        success: function(data) {
            var a1 = data.split(')');
            var key = "";
            for (i = 0; i < a1.length - 1; i++) {
                if (i == 0) {
                    key = key + (a1[i].split(',')[0]).substr(1) + ",";
                } else {
                    key = key + (a1[i].split(',')[0]).substr(2);
                    if (i < a1.length - 2) {
                        key = key + ",";
                    }
                }
            }
            rst = eval('[' + key + ']');
        }
    });
    return rst;
}

function count() {
    var rst;
    $.ajax({
        type: "get",
        async: false,
        url: "./nginxlog/part-00000",
        success: function(data) {
            var a1 = data.split(')');
            var b1 = "";
            var key = "";
            for (i = 0; i < a1.length - 1; i++) {
                b1 = b1.concat(a1[i].split(',')[1]);
                if (i < a1.length - 2) {
                    b1 = b1 + ",";
                }
            }
            for (i = 0; i < a1.length - 1; i++) {
                if (i == 0) {
                    key = key + (a1[i].split(',')[0]).substr(1) + ",";
                } else {
                    key = key + (a1[i].split(',')[0]).substr(2);
                    if (i < a1.length - 2) {
                        key = key + ",";
                    }
                }
            }
            rst = eval('[' + b1 + ']');
        }
    });
    return rst;
}
</script>
</body>
</html>
~

（2）浏览器·访问

实验结束。

进阶：以上过程1会自动定时把更新数据导入hdfs，把以上过程2，即spark运算可以通过crontab -e 配置定时任务，定期更新数据即可

本次学习参考文章： https://blog.csdn.net/baidu_33329518/article/details/78644356

领尚

关注

1
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
学习笔记:从0开始学习大数据-38.综合实训一：nginx日志分析

前面的学习大数据运行环境搭建后，可以开始综合实训了，这次比较完整地测试nginx日志大数据分析处理过程。数据处理流程参考如上，因为笔记本电脑限制，all in one，全部在一个虚拟机内完成。虚拟机ip是192.168.49.141,计算机名centos7 .1. nginx日志通过flume导入到hdfs（1）在flume的conf目录创建新文件nginx-loger.con...
复制链接

扫一扫

专栏目录