Python: analyzing IIS logs (wwwlogs)

草青工作室

Article link: https://blog.csdn.net/xxj_jing/article/details/103560622

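This article presents a way to parse IIS logs with Python, covering log format parsing, statistical analysis, and result output. A custom script analyzes several key metrics: request volume, directory access, IP distribution, bandwidth usage, and concurrent requests.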

Python: parsing IIS logs

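There are plenty of IIS log analysis tools, but most of them only support Windows (they are not cross-platform), and the statistics they offer are limited. Rather than spend time hunting for a tool, you might as well write your own!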

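Note that IIS logs record timestamps in Greenwich Mean Time (UTC) with no time zone offset applied. For Beijing time (UTC+8) you need to add 8 hours during analysis!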
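
The 8-hour adjustment can be sketched with the standard `datetime` module. This is only an illustration: the helper name `iis_to_local` and the constant `CST` are made up for this example, and UTC+8 is assumed because that is the offset the article uses.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the target zone is UTC+8, the offset used in this article.
CST = timezone(timedelta(hours=8))

def iis_to_local(date_str, time_str, tz=CST):
    """Parse an IIS date/time pair (logged in UTC) and convert it to tz."""
    utc_dt = datetime.strptime(date_str + " " + time_str, "%Y-%m-%d %H:%M:%S")
    # Mark the naive value as UTC, then convert to the target zone.
    return utc_dt.replace(tzinfo=timezone.utc).astimezone(tz)

local = iis_to_local("2019-12-04", "16:00:00")
print(local.strftime("%Y-%m-%d %H:%M:%S"))  # 2019-12-05 00:00:00
```

This is why a log file that starts at 16:00 on 2019-12-04 shows up in the report as hour 00 of 2019-12-05.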

The log is a delimited text file (fields are separated by single spaces). The fields are:

#Fields: date time s-computername s-ip cs-method cs-uri-stem cs-uri-query cs-username c-ip cs-version cs(User-Agent) cs(Referer) cs-host sc-status sc-bytes time-taken

Field meanings and indexes:

'''
0  date            date                2019-12-04
1  time            time (UTC)          16:00:00
2  s-computername  server name         xxxx
3  s-ip            server IP           192.xxx.xxx.xxx
4  cs-method       request method      GET
5  cs-uri-stem     URI path            /qiming/details/y6remloeo6nq.html
6  cs-uri-query    query string        -
7  cs-username     user name           -
8  c-ip            client IP address   66.249.79.103
9  cs-version      protocol version    HTTP/1.1
10 cs(User-Agent)  user agent          xxxx
11 cs(Referer)     referrer            -
12 cs-host         host name           xxx.com
13 sc-status       status code         200
14 sc-bytes        bytes sent          14244
15 time-taken      time taken (ms)     1390
'''
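
To make the index table concrete, here is a minimal sketch that reads the field names from the `#Fields:` header and maps one log line onto them. The helper names `parse_fields` and `parse_row` are hypothetical, and the user agent in the sample line is shortened for readability.

```python
FIELDS_LINE = ("#Fields: date time s-computername s-ip cs-method cs-uri-stem "
               "cs-uri-query cs-username c-ip cs-version cs(User-Agent) "
               "cs(Referer) cs-host sc-status sc-bytes time-taken")

def parse_fields(header):
    """Return the field names listed after '#Fields:'."""
    return header.split(":", 1)[1].split()

def parse_row(row, fields):
    """Map one log line to a dict; return None if the column count is off."""
    values = row.split()
    if len(values) != len(fields):
        return None
    return dict(zip(fields, values))

fields = parse_fields(FIELDS_LINE)
# Sample line modeled on the article's log excerpt (user agent shortened).
sample = ("2019-12-04 16:00:00 byw-474802 123.123.123.123 GET /xxx/xxx/xxx.html "
          "- - 66.249.79.103 HTTP/1.1 Mozilla/5.0+(compatible;+Googlebot/2.1) "
          "- xxx.com 200 14244 1390")
record = parse_row(sample, fields)
print(record["c-ip"], record["sc-status"], record["sc-bytes"], record["time-taken"])
```

Looking fields up by name rather than by position keeps the parser working if the set of logged fields changes, at the cost of one extra dict per line.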

1. Run the script

python wwwlogs_analyze.py --input=/Users/wwwlogs/ex191212.log --output=/Downloads/

2. The output statistics look like this

input file = /Downloads/wwwlogs/ex191205.log
output file = /Downloads/wwwlogs-analize-20191216125217.txt


#summary	2019-12-16 12:52
#total lines = 358087
#total IPs = 1386
#total directories = 228
#total hours = 24
#_isDebug = False


#total hours=24	peak 22021 req/hour	trough 9664 req/hour
20191205-00	009973	---------------------------------------------(45%)
20191205-01	010541	-----------------------------------------------(47%)
20191205-02	010193	----------------------------------------------(46%)
20191205-03	010636	------------------------------------------------(48%)
20191205-04	010360	-----------------------------------------------(47%)
20191205-05	010527	-----------------------------------------------(47%)
20191205-06	012207	-------------------------------------------------------(55%)
20191205-07	022021	----------------------------------------------------------------------------------------------------(100%)
20191205-08	018214	----------------------------------------------------------------------------------(82%)
20191205-09	014415	-----------------------------------------------------------------(65%)
20191205-10	018218	----------------------------------------------------------------------------------(82%)
20191205-11	018027	---------------------------------------------------------------------------------(81%)
20191205-12	021611	--------------------------------------------------------------------------------------------------(98%)
20191205-13	017965	---------------------------------------------------------------------------------(81%)
20191205-14	020541	---------------------------------------------------------------------------------------------(93%)
20191205-15	017777	--------------------------------------------------------------------------------(80%)
20191205-16	014605	------------------------------------------------------------------(66%)
20191205-17	018253	----------------------------------------------------------------------------------(82%)
20191205-18	018113	----------------------------------------------------------------------------------(82%)
20191205-19	017982	---------------------------------------------------------------------------------(81%)
20191205-20	015330	---------------------------------------------------------------------(69%)
20191205-21	010167	----------------------------------------------(46%)
20191205-22	009664	-------------------------------------------(43%)
20191205-23	010730	------------------------------------------------(48%)


#total directories = 228
347612	/x'x'x/details
001713	/xx/xx/2019/12
000871	/x'x/x'x/elmem
000843	/x'x/x'x/zqwwj
...
print top 100 ... 


#total IPs = 1386
66.249.79.103	16859
66.249.79.105	12947
66.249.79.107	12354
66.249.79.109	11170
...
print top 100 ... 


#bandwidth summary = 24 hours	2.9GB
#total hours=24	peak 298.3 MB/hour	trough 3.45 MB/hour
20191205-00	4.47M/min	268.22M/hour
20191205-01	4.86M/min	291.45M/hour
20191205-02	4.87M/min	292.02M/hour
20191205-03	4.76M/min	285.77M/hour
20191205-04	4.76M/min	285.88M/hour
20191205-05	4.97M/min	298.3M/hour
20191205-06	2.97M/min	178.24M/hour
20191205-07	0.25M/min	14.75M/hour
20191205-08	0.19M/min	11.34M/hour
20191205-09	0.08M/min	4.57M/hour
20191205-10	0.16M/min	9.3M/hour
20191205-11	0.24M/min	14.35M/hour
20191205-12	0.43M/min	25.71M/hour
20191205-13	0.17M/min	10.09M/hour
20191205-14	1.36M/min	81.73M/hour
20191205-15	0.07M/min	4.34M/hour
20191205-16	0.06M/min	3.45M/hour
20191205-17	0.12M/min	7.48M/hour
20191205-18	0.26M/min	15.37M/hour
20191205-19	0.22M/min	13.12M/hour
20191205-20	2.28M/min	136.67M/hour
20191205-21	4.18M/min	250.72M/hour
20191205-22	4.33M/min	259.88M/hour
20191205-23	3.38M/min	202.79M/hour


#request analysis - concurrency 65370	n/a req
20191205 12:56:17	2590 req/s	0.09 MB/s	1099097413 ms/s
20191205 10:28:39	2085 req/s	0.05 MB/s	498004266 ms/s
20191205 12:09:24	2017 req/s	0.13 MB/s	446131901 ms/s
20191205 08:39:31	1925 req/s	0.0 MB/s	625276602 ms/s
20191205 07:49:23	1899 req/s	0.09 MB/s	828558176 ms/s
20191205 19:15:54	1850 req/s	0.0 MB/s	916924419 ms/s
20191205 20:26:59	1821 req/s	0.27 MB/s	453436140 ms/s
20191205 07:06:09	1807 req/s	0.06 MB/s	521245021 ms/s
20191205 18:03:08	1795 req/s	0.09 MB/s	564011590 ms/s
20191205 17:03:47	1751 req/s	0.09 MB/s	358890248 ms/s
20191205 07:06:07	1648 req/s	0.0 MB/s	1034090047 ms/s
20191205 11:25:26	1598 req/s	0.09 MB/s	499285907 ms/s
20191205 19:15:55	1442 req/s	0.15 MB/s	232503865 ms/s
20191205 15:57:57	1337 req/s	0.0 MB/s	638168839 ms/s
20191205 12:09:23	1321 req/s	0.0 MB/s	745523925 ms/s
20191205 15:57:58	1273 req/s	0.0 MB/s	266381110 ms/s
20191205 13:52:03	1231 req/s	0.09 MB/s	344168242 ms/s
20191205 17:03:46	1210 req/s	0.0 MB/s	619886590 ms/s
20191205 10:28:38	1142 req/s	0.18 MB/s	651564257 ms/s
20191205 13:52:02	1075 req/s	0.0 MB/s	563274687 ms/s
20191205 07:49:24	1028 req/s	0.19 MB/s	135449650 ms/s
...
print top 100 ... 


#request analysis - time taken 65370	n/a req
20191205 12:56:17	2590 req/s	0.09 MB/s	time taken 1099097413 ms/s
20191205 07:06:07	1648 req/s	0.0 MB/s	time taken 1034090047 ms/s
20191205 19:15:54	1850 req/s	0.0 MB/s	time taken 916924419 ms/s
20191205 07:49:23	1899 req/s	0.09 MB/s	time taken 828558176 ms/s
20191205 12:09:23	1321 req/s	0.0 MB/s	time taken 745523925 ms/s
20191205 10:28:38	1142 req/s	0.18 MB/s	time taken 651564257 ms/s
20191205 15:57:57	1337 req/s	0.0 MB/s	time taken 638168839 ms/s
20191205 08:39:31	1925 req/s	0.0 MB/s	time taken 625276602 ms/s
20191205 17:03:46	1210 req/s	0.0 MB/s	time taken 619886590 ms/s
...
print top 100 ... 


#request analysis - bandwidth 65370	n/a req
20191205 12:09:54	16 req/s	1.08 MB/s	time taken 598701 ms/s
20191205 10:29:08	16 req/s	1.07 MB/s	time taken 522417 ms/s
20191205 19:16:48	15 req/s	1.0 MB/s	time taken 687462 ms/s
20191205 22:31:33	18 req/s	0.95 MB/s	time taken 667591 ms/s
20191205 23:55:41	12 req/s	0.92 MB/s	time taken 406580 ms/s
20191205 11:26:26	22 req/s	0.91 MB/s	time taken 1260798 ms/s
...
print top 100 ...
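
The hourly chart near the top of the report scales each hour's request count against the busiest hour and draws one `-` per percentage point. A minimal sketch of that rendering follows; the function name `render_hour_bars` is made up for this example, and the three counts are taken from the sample report above.

```python
def render_hour_bars(hour_counts):
    """Return one text line per hour: key, zero-padded count, bar, percent."""
    peak = max(hour_counts.values())
    lines = []
    for key in sorted(hour_counts):
        count = hour_counts[key]
        pct = int(count / peak * 100)  # truncated share of the peak hour
        lines.append("{}\t{}\t{}({}%)".format(
            key, str(count).zfill(6), "-" * pct, pct))
    return lines

# Three hours copied from the sample report.
counts = {"20191205-07": 22021, "20191205-00": 9973, "20191205-22": 9664}
for line in render_hour_bars(counts):
    print(line)
```

Sorting the keys makes the chart chronological even if the log lines were not read in time order.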

3. Source code of wwwlogs_analyze.py

#!/usr/bin/env python3
# coding=utf-8
import os
import sys
import argparse
import codecs
import time,datetime

'''
#Software: Microsoft Internet Information Services 7.5
#Version: 1.0
#Date: 2019-12-04 16:00:01
#Fields: date time s-computername s-ip cs-method cs-uri-stem cs-uri-query cs-username c-ip cs-version cs(User-Agent) cs(Referer) cs-host sc-status sc-bytes time-taken
2019-12-04 16:00:00 byw-474802 123.123.123.123 GET /xxx/xxx/xxx.html - - 66.249.79.103 HTTP/1.1 Mozilla/5.0+(Linux;+Android+6.0.1;+Nexus+5X+Build/MMB29P)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/41.0.2272.96+Mobile+Safari/537.36+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - xxx.com 200 14244 1390

'''
_isDebug = False
_outDir = None
_ipDic = dict()
_dirDic = dict()
_hourDic = dict()
_scBytesByHour = dict() # bytes sent by the server (per hour)
_reqTimes = dict() # concurrency statistics (per second)

class reqTimesModel:
    times = 0
    scBytes = 0
    timeTaken = 0

class reqInfo:
    dateTime = None
    path = ""
    ip = ""
    refererUrl = ""
    userAgent = ""
    scStatus = 0
    scBytes = 0
    timeTaken = 0
'''
0  date            date                2019-12-04
1  time            time (UTC)          16:00:00
2  s-computername  server name         xxxx
3  s-ip            server IP           192.xxx.xxx.xxx
4  cs-method       request method      GET
5  cs-uri-stem     URI path            /qiming/details/y6remloeo6nq.html
6  cs-uri-query    query string        -
7  cs-username     user name           -
8  c-ip            client IP address   66.249.79.103
9  cs-version      protocol version    HTTP/1.1
10 cs(User-Agent)  user agent          xxxx
11 cs(Referer)     referrer            -
12 cs-host         host name           xxx.com
13 sc-status       status code         200
14 sc-bytes        bytes sent          14244
15 time-taken      time taken (ms)     1390
'''
def getInfo(row):
    arr = row.split(' ')
    if len(arr)<16:
        print("parse failed\t",len(arr),row)
        return
    # Time offset: IIS logs timestamps in UTC (zone 0), shift them to UTC+8
    hourOffset = datetime.timedelta(hours=8)
    # Parse "date time" straight into a naive datetime, then add 8 hours
    utcTime = datetime.datetime.strptime('{} {}'.format(arr[0],arr[1]), "%Y-%m-%d %H:%M:%S")
    localDateTime = utcTime + hourOffset
    info = reqInfo()
    info.dateTime = localDateTime
    info.path = arr[5]
    info.ip = arr[8]
    info.refererUrl = arr[11]
    info.userAgent = arr[10]
    info.scStatus = int(arr[13])
    info.scBytes = int(arr[14])
    info.timeTaken = int(arr[15])
    return info

def ipTotal(reqInfo):
    if reqInfo.ip not in _ipDic.keys():
        _ipDic[reqInfo.ip] = 0

    _ipDic[reqInfo.ip] += 1

def dirTotal(reqInfo):
    path = reqInfo.path
    if not path or len(path) == 0:
        path = '/'
    lastIndex = path.rfind('/')
    dir = path[0:lastIndex]

    if dir not in _dirDic.keys():
        _dirDic[dir] = 0

    _dirDic[dir] += 1


def hourTotal(reqInfo):
    # %Y--%m--%d %H:%M:%S, formatting a time tuple
    # key =  time.strftime('%Y%m%d-%H',reqInfo.dateTime) time tuple to string
    key =  reqInfo.dateTime.strftime('%Y%m%d-%H') # datetime to string
    if key not in _hourDic.keys():
        _hourDic[key] = 0
    _hourDic[key] += 1

def scBytesTotal(reqInfo):
    key =  reqInfo.dateTime.strftime('%Y%m%d-%H') # datetime to string
    if key not in _scBytesByHour:
        # Start at 0 so the first request of the hour is not counted twice
        _scBytesByHour[key] = 0
    _scBytesByHour[key] += reqInfo.scBytes

def reqTimesTotal(reqInfo):
    key =  reqInfo.dateTime.strftime('%Y%m%d %H:%M:%S') # datetime to string
    if key not in _reqTimes:
        # New second: start from the zeroed defaults so the first request
        # is not counted twice by the accumulation below
        _reqTimes[key] = reqTimesModel()
    req = _reqTimes[key]
    req.scBytes += reqInfo.scBytes
    req.times += 1
    req.timeTaken += reqInfo.timeTaken


def analize(input,output,rows):
    # Sorting ---------------------------------------------
    # Sort by IP
    print("\n\nIP analysis total = ",len(_ipDic))
    # The lambda is an anonymous key function whose argument x is a (k,v) tuple: x[1] is the value, x[0] is the key. reverse takes False or True to choose descending order
    ipList = sorted(_ipDic.items(),key=lambda x:x[1],reverse=True)
    print('sort\tipList.size={}\t{}\n'.format(len(ipList),type(ipList)))
    # for (k,v) in ipList:
    #     print('{}\t{}'.format(k,v))

    # Sort by directory
    print("\n\ndirectory analysis total = ",len(_dirDic))
    dirList = sorted(_dirDic.items(),key=lambda x:x[1],reverse=True)
    print('sort\tdirList.size={}\t{}\n'.format(len(dirList),type(dirList)))
    # for (k,v) in dirList:
    #     print('{}\t{}'.format(k,v))
    # Sorting ---------------------------------------------

    # Hourly counts
    print("\n\nhourly analysis total = ",len(_hourDic))

    # Write the report file
    curTime = datetime.datetime.now()
    outFileName = 'wwwlogs-analize-{}.txt'.format(curTime.strftime('%Y%m%d%H%M%S'))
    outFile = os.path.join(output,outFileName)
    f = codecs.open(outFile,'w+','utf-8')
    #
    f.writelines("input file = {}\n".format(input))
    f.writelines("output file = {}\n".format(outFile))
    f.writelines("\n\n")
    #---------------------------------------------------------
    f.writelines("#summary\t{}\n".format(curTime.strftime('%Y-%m-%d %H:%M')))
    f.writelines("#total lines = {}\n".format(rows))
    f.writelines("#total IPs = {}\n".format(len(_ipDic)))
    f.writelines("#total directories = {}\n".format(len(_dirDic)))
    f.writelines("#total hours = {}\n".format(len(_hourDic)))
    f.writelines("#_isDebug = {}\n".format(_isDebug))
    #---------------------------------------------------------
    # Hourly output
    f.writelines("\n\n")
    maxTimes = max(_hourDic.values())
    minTimes = min(_hourDic.values())
    f.writelines("#total hours={}\tpeak {} req/hour\ttrough {} req/hour\n"
                 .format(len(_hourDic)
                         ,maxTimes
                         ,minTimes))
    for (k,v) in _hourDic.items():
        rate = int(v/maxTimes*100)
        f.writelines('{}\t{}\t{}({}%)\n'.format(k
                                                ,str(v).zfill(6)
                                                ,'-'*rate
                                                ,rate))
    #---------------------------------------------------------
    # Directory output
    f.writelines("\n\n")
    f.writelines("#total directories = {}\n".format(len(dirList)))
    top = 0
    for (k,v) in dirList:
        top += 1
        if top > 100: # stop once 100 entries have been written
            f.writelines("print top 100 ... \n")
            break
        f.writelines('{}\t{}\n'.format(str(v).zfill(6)
                                       ,k))
    #---------------------------------------------------------
    # IP output
    f.writelines("\n\n")
    f.writelines("#total IPs = {}\n"
                 .format(len(ipList)))
    top = 0
    for (k,v) in ipList:
        top += 1
        if top > 100: # stop once 100 entries have been written
            f.writelines("print top 100 ... \n")
            break
        f.writelines('{}\t{}\n'.format(k,v))

    #---------------------------------------------------------
    # Bandwidth output
    f.writelines("\n\n")
    dBytes = round(sum(_scBytesByHour.values())/1024/1024/1024,2) # total traffic, GB
    f.writelines("#bandwidth summary = {} hours\t{}GB\n".format(len(_scBytesByHour)
                                               ,dBytes))
    maxBytes = round(max(_scBytesByHour.values())/1024/1024,2) # peak, MB
    minBytes = round(min(_scBytesByHour.values())/1024/1024,2) # trough, MB
    f.writelines("#total hours={}\tpeak {} MB/hour\ttrough {} MB/hour\n"
                 .format(len(_scBytesByHour)
                         ,maxBytes
                         ,minBytes))
    for (k,v) in _scBytesByHour.items():
        mBtys = round(v/60/1024/1024,2) # MB per minute
        hBtys = round(v/1024/1024,2) # MB per hour
        f.writelines('{}\t{}M/min\t{}M/hour\n'.format(k
                                                 ,mBtys
                                                 ,hBtys))

    #---------------------------------------------------------
    # Request analysis: request count per second
    f.writelines("\n\n")
    reqSortByTimes = sorted(_reqTimes.items(),key=lambda x:x[1].times,reverse=True)
    print('sort\treqSort.size={}\t{}\n'.format(len(reqSortByTimes),type(reqSortByTimes)))
    top = 100
    index = 0
    f.writelines("#request analysis - concurrency {}\t{} req\n".format(len(reqSortByTimes),'n/a'))
    for (k,v) in reqSortByTimes:
        #print('t={}\t{}'.format(type(v),v))
        index += 1
        if index>top: # stop once `top` entries have been written
            f.writelines("print top {} ... \n".format(top))
            break
        sBytes = round(v.scBytes/1024/1024,2) # MB
        f.writelines('{}\t{} req/s\t{} MB/s\t{} ms/s\n'.format(k
                                                               ,v.times
                                                               ,sBytes
                                                               ,v.timeTaken))


    # Request analysis: total time taken per second
    f.writelines("\n\n")
    reqSortByTimetaken = sorted(_reqTimes.items(),key=lambda x:x[1].timeTaken,reverse=True)
    print('sort\treqSort.size={}\t{}\n'.format(len(reqSortByTimetaken),type(reqSortByTimetaken)))
    top = 100
    index = 0
    f.writelines("#request analysis - time taken {}\t{} req\n".format(len(reqSortByTimetaken),'n/a'))
    for (k,v) in reqSortByTimetaken:
        #print('t={}\t{}'.format(type(v),v))
        index += 1
        if index>top: # stop once `top` entries have been written
            f.writelines("print top {} ... \n".format(top))
            break
        sBytes = round(v.scBytes/1024/1024,2) # MB
        f.writelines('{}\t{} req/s\t{} MB/s\ttime taken {} ms/s\n'.format(k
                                                              ,v.times
                                                              ,sBytes
                                                              ,v.timeTaken))

    # Request analysis: bytes sent per second
    f.writelines("\n\n")
    reqSortByBytes = sorted(_reqTimes.items(),key=lambda x:x[1].scBytes,reverse=True)
    print('sort\treqSort.size={}\t{}\n'.format(len(reqSortByBytes),type(reqSortByBytes)))
    top = 100
    index = 0
    f.writelines("#request analysis - bandwidth {}\t{} req\n".format(len(reqSortByBytes),'n/a'))
    for (k,v) in reqSortByBytes:
        #print('t={}\t{}'.format(type(v),v))
        index += 1
        if index>top: # stop once `top` entries have been written
            f.writelines("print top {} ... \n".format(top))
            break
        sBytes = round(v.scBytes/1024/1024,2) # MB
        f.writelines('{}\t{} req/s\t{} MB/s\ttime taken {} ms/s\n'.format(k
                                                           ,v.times
                                                           ,sBytes
                                                           ,v.timeTaken))

    #---------------------------------------------------------
    f.close()
    print('total records\t {}\n'.format(rows))
    print('analysis finished\tout = {}\n'.format(outFile))

def main(input,output):
    if not input or not output:
        print("empty argument: --input={} --output={}".format(input,output))
        return
    if not os.path.exists(input):
        print("file not found: --input={} ".format(input))
        return
    outDir = os.path.dirname(output)
    if not os.path.exists(outDir):
        print("directory not found: --output={} directory {} does not exist".format(output,outDir))
        return
    f = codecs.open(input,'r','utf-8')
    # total = len(f.readlines())
    # print('total rows = ',total)
    rows = 0
    # Process the file line by line
    while True:
        # ------
        if _isDebug and rows>=20000:
            print('_isDebug = ',_isDebug)
            break
        # ------
        line = f.readline()
        if not line:      # same as: if line == ""
            break
        rows += 1 # count only lines actually read, not the EOF read
        if line.startswith('#'):
            print("skipping comment =>",line)
            continue
        reqInfo = getInfo(line)
        if not reqInfo:
            print("parse failed row = {}".format(line))
            continue
        # txt = line.replace('\r\n','').encode('utf8')
        # IP statistics: request count per client IP
        ipTotal(reqInfo)
        # Directory statistics: request count per directory
        dirTotal(reqInfo)
        # Hourly statistics: request count per hour
        hourTotal(reqInfo)
        # Traffic statistics: bytes sent per hour
        scBytesTotal(reqInfo)
        # Concurrency statistics: per second
        reqTimesTotal(reqInfo)
        # Status-code statistics (not implemented)
    # Close the file
    f.close()
    # Run the analysis
    analize(input,output,rows)


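'''
>>> f = open('test.txt', 'w') # 'wb' would write a binary file
>>> f.write('Hello, world!')
>>> f.close()
Python file objects provide two "write" methods: write() and writelines().
write() is the counterpart of read() and readline(): it writes a string to the file.
writelines() is the counterpart of readlines() and works on lists: it takes a list of strings and writes them to the file. No newline is added automatically, so newlines must be added explicitly.
About the mode argument of open():
'r': read
'w': write
'a': append
'r+' == r+w (read and write; raises an error (IOError) if the file does not exist)
'w+' == w+r (read and write; creates the file if it does not exist, truncates it if it does)
'a+' == a+r (append and read; creates the file if it does not exist)
For binary files, just add a b to each mode:
'rb'  'wb'  'ab'  'rb+'  'wb+'  'ab+'
'''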

'''
python wwwlogs_analyze.py --input=/Users/wwwlogs/ex191212.log --output=/Downloads/
'''

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='manual to this script')
    parser.add_argument('--input', type=str, default = None)
    parser.add_argument('--output', type=str, default= None)
    args = parser.parse_args()
    #test()
    sys.exit(main(args.input,args.output))
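
The script tallies everything with plain dicts and leaves status-code statistics as a placeholder comment. As a possible variation (not what the script above does), the same kind of tally can be sketched with `collections.Counter`, which starts missing keys at 0 and has a built-in top-N query. The helper `top_entries` and the sample rows are hypothetical.

```python
from collections import Counter

def top_entries(counter, limit=100):
    """Return the `limit` most frequent (key, count) pairs, highest first."""
    return counter.most_common(limit)

ip_counts = Counter()
status_counts = Counter()

# Hypothetical (c-ip, sc-status) pairs standing in for parsed log rows.
rows = [("66.249.79.103", "200"), ("66.249.79.105", "200"),
        ("66.249.79.103", "404"), ("66.249.79.103", "200")]
for ip, status in rows:
    ip_counts[ip] += 1          # missing keys start at 0
    status_counts[status] += 1

print(top_entries(ip_counts, 1))        # [('66.249.79.103', 3)]
print(sorted(status_counts.items()))    # [('200', 3), ('404', 1)]
```

With `Counter` there is no separate "first time seen" branch, so the initial value cannot accidentally be added twice, and `most_common(100)` returns exactly 100 entries when that many exist.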