In Jiangbei, Collecting a Scanner Dictionary with Multithreaded Python

ouyangbro

Article link: https://blog.csdn.net/emaste_r/article/details/8128054

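Hu Ge gave me a task: slim down the scanner dictionary he handed me.

My plan goes like this:

1. From a big pile of log files, filter out the URLs that were constructed by scanners.

2. Count and sort those URLs, compare them with Hu Ge's dictionary, and keep the entries that match well.

3. Pull URLs from third-party web applications into the dictionary as well. Lots of sites run third-party applications such as DedeCMS (织梦CMS) or WordPress these days, so scanning for their paths is very accurate.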

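Problem 1: What if the files Hu Ge gave me contain too little information, so that everything I filter out matches the dictionary poorly? That would be a disaster. (Solved)

Good question, but at a glance there are about 60,000 records, so too little data should not be an issue.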

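Problem 2: How do I decide that a URL came from a scanner rather than from a normal visitor? (Solved)

The idea has a few parts:

1) Any request containing a keyword such as fuck, sql, or webshell (there are many more, of course) is treated as a scanner probe, since normal users never visit such links. Record those IPs, then collect every URL each of them requested.

2) Compute the TOP 10 requesting IPs in the file (the n in TOP n can be set as needed) and add the URLs those IPs requested to the dictionary, since a normal user is unlikely to hit the site that often.

3) Treat any IP that probes sensitive directories such as the admin backend as malicious, and record the URLs it requested as dictionary entries. If the IP is a normal user, it will have made only a handful of requests, so the little redundancy it adds is acceptable; if it is a scanner, we harvest its dictionary and merge it into ours.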
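Heuristics 1) and 3) come down to the same two-pass idea: first mark the IPs that requested something suspicious, then keep everything those IPs requested. Here is a minimal sketch of that idea in Python 3 (the post's own code is Python 2). It assumes the log has already been parsed into (ip, url) pairs; the function name `collect_scanner_urls`, the sample records, and the keyword list are made up for illustration.

```python
def collect_scanner_urls(records, keywords):
    """Return the URLs requested by any IP that hit a suspicious keyword.

    records  -- list of (ip, url) pairs, assumed to be parsed from the log already
    keywords -- lowercase substrings that a normal visitor would not request
    """
    # Pass 1: mark every IP that requested at least one suspicious URL.
    dirty_ips = set()
    for ip, url in records:
        lowered = url.lower()
        if any(word in lowered for word in keywords):
            dirty_ips.add(ip)

    # Pass 2: keep every URL those IPs asked for, suspicious or not,
    # once each and in first-seen order.
    urls = []
    seen = set()
    for ip, url in records:
        if ip in dirty_ips and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Hypothetical sample data, only to show the call.
sample_records = [
    ("10.0.0.1", "/index.html"),
    ("10.0.0.2", "/admin/login.asp"),
    ("10.0.0.2", "/webshell.asp"),
    ("10.0.0.2", "/backup.zip"),
    ("10.0.0.1", "/about.html"),
]
scanner_urls = collect_scanner_urls(sample_records, ["sql", "webshell"])
```

Two passes are needed because an IP may request harmless-looking paths before the request that gives it away.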

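Problem 3: What if the scanner's own dictionary is bloated? Then I fail to slim the dictionary down at all! (Not solvable)

Hu Ge says to leave this one open for now. Personally I think it is already beyond what automation can do.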

=======================================================================

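Step 1: system architecture. Okay, fine, this hardly counts as architecture.

My task:

Read keyword.txt into keywords[]

Thread 1
    If the Top N file already exists, stop the thread
    Otherwise compute the Top 10 IPs

Thread 2
    If the Top N file does not exist yet, wait() for thread 1 to finish
    Load the Top N IPs into TopN[], then read in.txt
    while not EOF
        If the record matches an entry in TopN, it is a hit: append it to out.txt

Thread 3
    Read in.txt and analyze every record.
    while not EOF
        If the record matches an entry in keywords, it is a hit:
            If the record's IP is already in DirtyIP[]
                pass
            else
                DirtyIP.append(IP)
        If the record's IP is in DirtyIP
            write it to out.txt
        else
            pass

Once all threads have finished
    Deduplicate out.txt (this has to be the last step; it cannot be done while reading the input)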
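The "thread 2 waits for thread 1" step in the outline can be expressed with `threading.Event`. This is only a sketch in Python 3 under stated assumptions: the names `top_n_ready`, `compute_top_n`, and `collect_from_top_n` are hypothetical, records are assumed to be pre-parsed (ip, url) pairs, and a plain list stands in for out.txt.

```python
import threading

top_n_ready = threading.Event()   # set once thread 1 has produced the Top N list
top_n = []                        # filled by thread 1, read by thread 2
results = []                      # stands in for out.txt in this sketch
results_lock = threading.Lock()


def compute_top_n(ips, n):
    """Thread 1: count requests per IP and publish the n busiest IPs."""
    counts = {}
    for ip in ips:
        counts[ip] = counts.get(ip, 0) + 1
    ranked = sorted(counts, key=counts.get, reverse=True)
    top_n.extend(ranked[:n])
    top_n_ready.set()             # wake up anyone waiting for the list


def collect_from_top_n(records):
    """Thread 2: wait for the Top N list, then keep URLs requested by those IPs."""
    top_n_ready.wait()            # blocks until thread 1 calls set()
    for ip, url in records:
        if ip in top_n:
            with results_lock:
                results.append(url)


# Hypothetical sample data.
records = [("1.1.1.1", "/a"), ("2.2.2.2", "/b"), ("1.1.1.1", "/c")]
th1 = threading.Thread(target=compute_top_n, args=([ip for ip, _ in records], 1))
th2 = threading.Thread(target=collect_from_top_n, args=(records,))
th2.start()   # started first on purpose: it simply waits for the event
th1.start()
th1.join()
th2.join()
```

With an event, the start order of the two threads no longer matters, which is the property the outline's wait() is after.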

=======================================================================

My task (refined):

flag = whether the Top N file exists

Threads 2 and 3 can share one helper function:

bool isHitTarget(array[], string record)
    hit = false
    for (element in array)
        if element is a substring of record
            hit = true
    if hit == true
        append the record to out.txt
    return hit
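In Python the substring test in this helper is just the `in` operator, so the whole loop collapses into one `any()` call. A minimal Python 3 sketch; the name `is_hit_target` mirrors the pseudocode, and writing to out.txt is left to the caller here.

```python
def is_hit_target(patterns, record):
    """Return True if any entry of patterns occurs as a substring of record.

    Empty patterns are skipped, because "" is a substring of every string
    and would turn every record into a hit.
    """
    return any(p in record for p in patterns if p)
```

For example, `is_hit_target(["sql", "webshell"], "GET /webshell.asp")` is a hit, while a keyword list that contains only a blank line from keyword.txt matches nothing.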

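How do I deduplicate a large amount of data? I wrote the routine below myself. Can it withstand a big data set? Is it efficient? Try it and find out...

void getSingleRecord(filename)
    while not EOF
        record = read from out.txt
        if record in new_records
            pass
        else
            new_records.append(record)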
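On the efficiency question: `record in new_records` on a list scans the whole list, so the routine above is quadratic in the number of records. Tracking what has been seen in a `set` makes each lookup constant time on average. A minimal Python 3 sketch, with the hypothetical name `unique_lines`; it also keeps first-seen order.

```python
def unique_lines(lines):
    """Yield each non-empty line once, in first-seen order.

    lines can be any iterable of strings, including an open file object.
    """
    seen = set()
    for line in lines:
        line = line.strip()
        if line and line not in seen:
            seen.add(line)
            yield line
```

Because it is a generator, it can be fed an open file directly, and only the distinct lines are held in memory.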

Because every thread writes to the same out.txt, a mutex is needed. How is that written?

Create the lock: g_mutex = threading.Lock()

Acquire the lock: g_mutex.acquire() ...

Release the lock: g_mutex.release()
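The acquire/release pair can also be written as a `with` block, which releases the lock even if the write raises. A small Python 3 sketch: `write_urls` is a hypothetical worker, and an `io.StringIO` stands in for out.txt so the example needs no real file.

```python
import io
import threading

g_mutex = threading.Lock()
out = io.StringIO()          # stands in for out.txt so the sketch needs no real file


def write_urls(urls):
    for url in urls:
        # "with" acquires the lock and releases it even if write() raises
        with g_mutex:
            out.write(url + "\n")


workers = [
    threading.Thread(target=write_urls, args=(["/a", "/b"],)),
    threading.Thread(target=write_urls, args=(["/c", "/d"],)),
]
for w in workers:
    w.start()
for w in workers:
    w.join()

# The order of lines depends on thread scheduling, so sort before comparing.
written = sorted(out.getvalue().split())
```

Note that the lock only helps when all threads write through the same file object, as they do here.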

The three threads are handled by three functions:

def getTopN_IP(n, filename), paired with th1 = threading.Thread(target = getTopN_IP, args = (n, filename))

def getURLFromTopN_IP(topN, filename), paired with th2 = threading.Thread(target = getURLFromTopN_IP, args = (topN, filename))

def getURLFromDirty_IP(filename), paired with th3 = threading.Thread(target = getURLFromDirty_IP, args = (filename,)) (a one-element args needs the trailing comma to be a tuple)

Wait for the threads to finish: th1.join() th2.join() th3.join()

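How do I pull the IP and the key part of the URL out of a record?

My first thought was a regular expression. Here is one for IP addresses (I will take it on trust that it is correct):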

((?:(?:25[0-5]|2[0-4]\d|((1\d{2})|([1-9]?\d)))\.){3}(?:25[0-5]|2[0-4]\d|((1\d{2})|([1-9]?\d))))

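But I have a better way~~

The data Hu Ge gave me is all IIS server logs, so every line follows a fixed format.

We can exploit that format: split the line, and the i-th and j-th elements of the resulting list are the IP address and the URL we want. Not a bad approach, right?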
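The split idea can be sketched as follows, in Python 3. The default indices 8 (client IP) and 4 (URL path) are the positions the script later in this post uses for its logs; other IIS configurations may order the fields differently, so treat them as assumptions. `parse_iis_line` is a hypothetical name, and the sample line is the one quoted in the script's comments with the user agent shortened.

```python
def parse_iis_line(line, ip_index=8, url_index=4):
    """Return (ip, url) from one space-separated IIS log line, or None.

    None is returned for "#" header lines and for lines that are too short
    to contain both fields.
    """
    if line.startswith("#"):      # IIS writes "#..." header lines into the log
        return None
    fields = line.split()
    if len(fields) <= max(ip_index, url_index):
        return None
    return fields[ip_index], fields[url_index]


sample = ("2012-03-17 07:21:50 192.168.100.20 GET / - 80 - 49.94.46.156 "
          "Mozilla/5.0+(Macintosh) 200 0 0 0")
parsed = parse_iis_line(sample)
```

Returning None for lines that do not fit keeps the caller's loop simple: it can skip anything that is not a request record.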

==========================================================================================

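Step 2: test each piece

1. Start with small pieces such as the mutex

Python's threading module really is interesting. When you pass args=(a,b) and there is only one argument, you still have to add a trailing ',', otherwise you keep getting this error:

TypeError: getURLFromDirty_IP() takes exactly 1 argument (6 given)

The Python developers sure know their grammar: args is plural, so it just cannot be singular... Without the comma, ("in.txt") is only a string in parentheses, and Thread unpacks its six characters as six separate arguments. It took me ages to track this error down; even comparing against other people's code, it is hard to spot.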

import threading

def getURLFromDirty_IP(filename):
    print("3")

if __name__ == "__main__":
    infile  = "in.txt"
    # Without the ',' inside args this raises the TypeError above
    th3 = threading.Thread(target = getURLFromDirty_IP,args = (infile,))

    th3.start()
    th3.join()

    print("Hello World")
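Why "6 given"? The thread calls its target as `target(*args)`, and a six-character string unpacks into six one-character arguments. The following Python 3 sketch shows the difference without starting a thread; `show` is a hypothetical helper that just returns what it receives.

```python
not_a_tuple = ("in.txt")     # parentheses only group; this is still a str
one_tuple = ("in.txt",)      # the trailing comma is what makes a tuple


def show(*args):
    """Return the positional arguments exactly as the target would receive them."""
    return args


# A bare string is unpacked character by character.
unpacked_from_str = show(*not_a_tuple)
# A one-element tuple is unpacked into the single intended argument.
unpacked_from_tuple = show(*one_tuple)
```

A list works as well, so `args=[infile]` is another way to avoid the easy-to-miss comma.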

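Along the way I hit a problem: the IIS log format is configurable, which makes my program far too limited. It needs some changes, and the way to change it is to match with regular expressions!!

Good grief, a day later I am back to regex after all! Luckily Python has solid regex support!!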
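One alternative to a hand-written regex, offered here only as a sketch and not as what this post ends up doing: IIS logs in W3C extended format normally carry a `#Fields:` header line that names the columns, so the positions of `c-ip` (client IP) and `cs-uri-stem` (requested path) can be looked up instead of hard-coded. This assumes the logs are in that format and still contain the header; the function name `iter_ip_and_url` and the sample log lines are made up. Written for Python 3.

```python
def iter_ip_and_url(lines):
    """Yield (client_ip, url) pairs, locating columns via the "#Fields:" header."""
    ip_idx = url_idx = None
    for line in lines:
        if line.startswith("#Fields:"):
            names = line[len("#Fields:"):].split()
            # c-ip is the client address, cs-uri-stem the requested path
            ip_idx = names.index("c-ip") if "c-ip" in names else None
            url_idx = names.index("cs-uri-stem") if "cs-uri-stem" in names else None
            continue
        # Skip other header lines, and data lines seen before a usable header.
        if line.startswith("#") or ip_idx is None or url_idx is None:
            continue
        fields = line.split()
        if len(fields) > max(ip_idx, url_idx):
            yield fields[ip_idx], fields[url_idx]


# Hypothetical log with two different field layouts.
log = [
    "#Software: Microsoft Internet Information Services 7.5",
    "#Fields: date time cs-uri-stem c-ip sc-status",
    "2012-03-17 07:21:50 /admin/ 49.94.46.156 404",
    "#Fields: date time c-ip cs-method cs-uri-stem",
    "2012-03-17 07:21:51 49.94.46.156 GET /index.asp",
]
pairs = list(iter_ip_and_url(log))
```

The header is re-read every time it appears, so a log whose layout changes partway through is still parsed correctly.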

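2. Python thread locks:

mutex = threading.Lock() # create the thread lock, since the threads compete to write the same file
mutex.acquire() # take the mutex; blocks until it is free
out.write(Path+"\n")
mutex.release() # release the lock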

3. Checking in Python whether a file or folder exists:

import os
os.path.isfile(infile) # True if infile is an existing regular file, otherwise False
os.path.exists(directory) # False if the directory does not exist

Step 3: v1.0 is finally done (still missing the regular expressions!), with more testing to come:

The comments are decent enough~~

# Python 2 script (it relies on raw_input)

__author__="Administrator"
__date__ ="$2012-10-30 17:13:46$"

import threading
import os

topN_IP = []
n = 10 # the n in TOP N~~ defaults to 10
threads = []
keywords = []
infile  = "../infile/"
outfile = "../outfile/"
topNFile = "../topNFile/"
dirtyFile = "../dirtyFile/dirtywords.txt"
mutex = threading.Lock() # create the thread lock, since the threads compete to write the same file


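def getTopN_IP(n,infile,outfile):
    # Fills the global topN_IP list with the n most frequent client IPs in infile.
    # outfile is not used here; it is kept so the Thread(args=...) call stays the same.
    del topN_IP[:]  # drop leftovers from a previously processed file

    # If the Top N file already exists, assume we computed it before and reuse it.
    # Note: topNFile has to name a file, not just a directory, for this branch to run.
    if os.path.isfile(topNFile):
        print(topNFile + " already exists, great~")
        f = open(topNFile,"r")
        for tmpLine in f:
            if tmpLine.strip():
                topN_IP.append(tmpLine.strip())
        f.close()
        if topN_IP:
            return
        print("The file exists but is empty, recomputing the TOP_N_IP")

    IPs = []
    f = open(infile,"r")
    for tmpLine in f:
        tmpList = tmpLine.split(' ')
        # The files we parse are IIS logs, for example:
        #2012-03-17 07:21:50 192.168.100.20 GET / - 80 - 49.94.46.156 Mozilla/5.0+(Macintosh;+Intel+Mac+OS+X+10_7_3)+AppleWebKit/534.53.11+(KHTML,+like+Gecko)+Version/5.1.3+Safari/534.53.10 200 0 0 0
        # The structure is clear and regular, so no regex is needed here.
        # The ninth field is the client IP we want: tmpList[8]
        # Skip "#" header lines and lines too short to carry an IP.
        if tmpLine.startswith("#") or len(tmpList) < 9:
            continue
        IPs.append(tmpList[8])
    f.close()
    # Deduplicating in one line: simple and stylish~
    singleIP  = {}.fromkeys(IPs).keys()
    IPDict = {}
    for tmp in singleIP:
        IPDict[tmp] = 0

    for tmp in IPs:
        IPDict[tmp] += 1

    # Sort the dict: key=lambda e:e[1] sorts by value, key=lambda e:e[0] would sort by key.
    # IPDict.items() turns the dict into a collection of (key, value) tuples.
    # lambda is an anonymous function: parameters go before the colon (comma-separated),
    # and the expression after the colon is the return value.
    sortIP=sorted(IPDict.items(),key=lambda e:e[1],reverse=True)
    # Keep only the first n entries; slicing also copes with fewer than n distinct IPs.
    for tmp in sortIP[:n]:
        # each tuple is (IP, count), so tmp[0] is the IP
        topN_IP.append(tmp[0])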


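def getURLFromTopN_IP(topN_IP,infile,outfile):
    if 0 == len(topN_IP):
        print("The Top N list is empty")
    else:
        f = open(infile,"r")
        # Append mode: another thread writes to the same file, and "w+" would
        # truncate it and let the two file handles overwrite each other's data.
        out = open(outfile,"a")
        # Again rely on the IIS log layout to get the path; regex is too much trouble.
        # The path is tmpList[4].
        for tmpLine in f:
            tmpList = tmpLine.split(' ')
            # Skip "#" header lines and lines too short to carry an IP.
            if tmpLine.startswith("#") or len(tmpList) < 9:
                continue
            IP = tmpList[8]
            Path = tmpList[4]
            if IP in topN_IP:
                # Hold the mutex while writing, and flush so a whole line lands at once
                mutex.acquire()
                out.write(Path+"\n")
                out.flush()
                mutex.release()
        out.close()
        f.close()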

def getURLFromDirty_IP(infile,outfile,dirtyFile):
    f = open(infile,"r")
    # Append mode, so this thread does not truncate what the other thread wrote
    out = open(outfile,"a")
    dfile = open(dirtyFile,"r")

    # Load the dirty keywords, dropping the trailing newline and blank lines
    dirtywords = []
    for tmpLine in dfile:
        word = tmpLine.strip()
        if word:
            dirtywords.append(word)
    dfile.close()

    # Substring matching
    for tmpLine in f:
        flag = False
        for word in dirtywords:
            # str.find() returns an index (-1 when absent), so test with "in" instead
            if word in tmpLine:
                flag = True
                break
        if flag:
            tmpList = tmpLine.split(' ')
            if len(tmpList) < 5:
                continue
            # Hold the mutex while writing, and flush so a whole line lands at once
            mutex.acquire()
            out.write(tmpList[4]+"\n")
            out.flush()
            mutex.release()

    f.close()
    out.close()


def getSingleRecord(outfile):
    f = open(outfile,"r")
    allList = []
    for tmpLine in f:
        # Strip the line ending so that only the URL itself is compared
        tmpLine = tmpLine.strip()
        if tmpLine:
            allList.append(tmpLine)
    f.close()
    singleList  = {}.fromkeys(allList).keys()
    f2 = open(outfile,"w")
    for word in singleList:
        f2.write(word+"\n")
    f2.close()

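if __name__ == "__main__":

    while True:
        infile  = "../infile/"
        outfile = "../outfile/"
        tmpfile = raw_input('Enter a file name (type "hehe" to quit): ')
        # Compare the raw input, not the path that already has "../infile/" prepended
        if tmpfile == "hehe":
            print('See you next time~~ PS: "hehe" yourself!')
            break
        infile += tmpfile
        if False == os.path.isfile(infile):
            print("The file you entered does not exist~~")
            continue

        # Drop the extension (for example ".txt") before adding the output suffix
        outfile += os.path.splitext(tmpfile)[0]
        outfile += "_out.txt"
        print("Output file: "+outfile)

        # Start from an empty output file
        open(outfile,"w").close()

        th1 = threading.Thread(target = getTopN_IP,args = (n,infile,outfile))
        th2 = threading.Thread(target = getURLFromTopN_IP,args = (topN_IP,infile,outfile))
        th3 = threading.Thread(target = getURLFromDirty_IP,args = (infile,outfile,dirtyFile))

        threads.append(th1);threads.append(th2);threads.append(th3)

        # th2 needs th1's result, so th1 must finish before th2 starts
        th1.start()
        th1.join()
        th2.start()
        th3.start()
        th2.join()
        th3.join()

        # Deduplicate the final result
        getSingleRecord(outfile)

        print("Done. Check the output folder for the result: "+outfile)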
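The counting and sorting that getTopN_IP does by hand in the script above can also be done with `collections.Counter` from the standard library (available since Python 2.7). A minimal Python 3 sketch; `top_n_ips` and the sample addresses are hypothetical.

```python
from collections import Counter


def top_n_ips(ips, n=10):
    """Return the n most frequent IPs, most frequent first."""
    return [ip for ip, _count in Counter(ips).most_common(n)]


# Hypothetical sample: 1.1.1.1 appears three times, 2.2.2.2 twice, 3.3.3.3 once.
sample_ips = ["1.1.1.1", "2.2.2.2", "1.1.1.1", "3.3.3.3", "1.1.1.1", "2.2.2.2"]
busiest = top_n_ips(sample_ips, 2)
```

`most_common(n)` returns at most n entries, so no special case is needed when the log has fewer than n distinct IPs.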
