Using Python to Crawl Information About Internet-Connected Devices

weixin_30455365

Tags: crawler, python, ops

Original link: http://www.cnblogs.com/Motorola/p/7830267.html

Widening the search scope

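With the DNS server set up, the next step is to find routers whose settings can be modified. The target set should be neither too large nor too small, so I started with the address range around my own router: 65,536 IP addresses (that is, a /16 block rather than a 256-address "C segment"). I scanned the range with nmap, picked out every host with port 80 open, and then analyzed the results.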

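Some hosts do not respond to ping probes, so I added the -Pn option to avoid missing them. --host-timeout caps how long nmap spends on any single host; without it, nmap slows down drastically on unresponsive hosts and wastes a lot of time. After a while, the results came in.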

The output contains entries marked as timed out. These need to be removed, which I did with the regular-expression search and replace in Notepad++.

Clicking "Replace All" removes the timed-out entries; 9,817 matches were replaced.
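The same cleanup could also be scripted instead of done in an editor. The sketch below is only an illustration: the pattern actually used in Notepad++ is not shown, so it assumes that every unwanted line contains the word "timeout". The function name `drop_timeout_lines` and the sample lines (which use documentation addresses) are hypothetical.

```python
import re

# Assumption: timed-out hosts are reported on lines containing "timeout"
# (case-insensitive). The real pattern used in Notepad++ is not shown.
TIMEOUT_RE = re.compile(r"timeout", re.IGNORECASE)


def drop_timeout_lines(lines):
    """Return only the lines that do not mention a timeout."""
    return [line for line in lines if not TIMEOUT_RE.search(line)]


# Invented sample lines, assumed to resemble the scan output.
sample = [
    "Host: 192.0.2.10 ()\tStatus: Up",
    "Host: 192.0.2.10 ()\tPorts: 80/open/tcp//http///",
    "Host: 192.0.2.11 ()\tStatus: Timeout",
]

for line in drop_timeout_lines(sample):
    print(line)
```

Filtering line by line keeps memory use small even for a large scan result, and the pattern can be tightened once the real output format is known.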

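What remains is the part we need. Each IP appears twice, so the addresses have to be pulled out individually. Doing that by hand is slow and tedious, so I wrote a small extraction tool to automate the step.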

The script is as follows:

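@echo off
setlocal EnableDelayedExpansion
color 0a

rem Counters: num = lines read so far, str = num mod 2, total = IPs written
set /a num=0, str=0, total=0
echo.

set /p p=input file name :

rem Take the 2nd whitespace-separated token of each line; skip lines starting with #
for /f "eol=# tokens=2" %%i in ( %p% ) do (
    set /a str = num %% 2
    set /a num = num + 1
    rem Each IP appears twice in a row: save the first occurrence, only echo the second
    if !str! equ 0 (
        echo %%i >> %p:~0,-4%_ok.txt
        set /a total = total + 1
    ) else (
        echo %%i
    )
)
echo.
echo total !total! IP
echo any key to close ...
pause > nul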

After running the batch script, the IPs of the hosts with port 80 open are extracted into a separate file: 1,749 hosts in total.
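For readers who are not on Windows, the extraction step could be approximated in Python. This is a sketch under stated assumptions, not the author's tool: it takes the second whitespace-separated field of each line, as the batch script does, but it removes duplicates with a set instead of keeping every other line. The function name `extract_ips` and the sample lines are hypothetical.

```python
def extract_ips(lines):
    """Return the 2nd field of each non-comment line, once per value, in order."""
    seen = set()
    ips = []
    for line in lines:
        if line.startswith("#"):
            continue  # comment lines are skipped, like eol=# in the batch script
        fields = line.split()
        if len(fields) < 2:
            continue  # no second field to take
        ip = fields[1]
        if ip not in seen:
            seen.add(ip)
            ips.append(ip)
    return ips


# Invented sample lines, assumed to resemble the filtered scan output.
sample = [
    "# scan header",
    "Host: 192.0.2.10 ()\tStatus: Up",
    "Host: 192.0.2.10 ()\tPorts: 80/open/tcp//http///",
    "Host: 192.0.2.20 ()\tStatus: Up",
    "Host: 192.0.2.20 ()\tPorts: 80/open/tcp//http///",
]

print("\n".join(extract_ips(sample)))
```

Deduplicating by value does not rely on each IP appearing exactly twice in a row, so it still works if a stray line is left in the file.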

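What kind of devices are they? Checking each one in a browser would take far too long, and the page title alone is often enough to tell roughly what type of device it is. So I wrote a crawler in Python that fetches the title of each IP's web page, which cuts the workload considerably: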

Below is the single-threaded version of the crawler:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# code by skq9@qq.com

import datetime
import sys
import time

import requests
from lxml.html import fromstring

# Request headers: present a Baiduspider User-Agent.
headers = {
    'User-Agent': 'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)',
}

OUT_FILE = './title_info.txt'


def log_result(url, text):
    """Append one 'url<TAB>text' line to the result file."""
    with open(OUT_FILE, 'a', encoding='utf-8') as out:
        out.write(url + '\t' + text + '\n')


if len(sys.argv) < 2:
    print("Usage: get_title.py [filename]")
    sys.exit()

ip_file = sys.argv[1]
with open(ip_file, 'r') as f:
    # One IP per line; blank lines are ignored
    ips = [line.strip() for line in f if line.strip()]

num = len(ips)
starttime = datetime.datetime.now()

for count, ip in enumerate(ips, 1):
    url = 'http://' + ip + '/'
    now = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
    print('[%d/%d] %s --> %s -->' % (count, num, now, ip), end=' ')
    try:
        # timeout=(connect, read), in seconds
        r = requests.get(url=url, headers=headers, timeout=(4, 6))
        r.encoding = r.apparent_encoding
        tree = fromstring(r.text)
        # findtext() returns None when the page has no <title> element
        title = (tree.findtext('.//title') or '').strip()
        print("'" + title + "'", end=' ')

        if title:
            log_result(url, title)
        elif '/doc/page/login' in r.text:
            # No title, but the page references /doc/page/login: possibly Hikvision
            log_result(url, '-- May be HIKVISION --')
        else:
            log_result(url, '-- Null string --')
        print('ok!')

    except requests.exceptions.ConnectTimeout:
        print('ConnectTimeout')
        log_result(url, 'ERROR:ConnectTimeout')
    except requests.exceptions.ConnectionError:
        print('ConnectionError')
        log_result(url, 'ERROR:ConnectionError')
    except requests.exceptions.ReadTimeout:
        print('ReadTimeout')
        log_result(url, 'ERROR:ReadTimeout')
    except Exception as e:
        print('Error:%s' % e)
        log_result(url, 'ERROR:' + str(e))

endtime = datetime.datetime.now()

print('--> Done at', time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()),
      'in', (endtime - starttime).seconds, 'seconds')
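
The labeling logic can be tried offline, without touching the network. The sketch below is an illustration rather than part of the crawler: it swaps lxml for the standard library's `html.parser` so that it runs without third-party packages, and it treats a missing title the same as an empty one. The names `TitleParser` and `describe_page` and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True  # ignore any later <title> elements

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def describe_page(html):
    """Return the page title, or a fallback label when the title is empty."""
    parser = TitleParser()
    parser.feed(html)
    title = "".join(parser.parts).strip()
    if title:
        return title
    if "/doc/page/login" in html:
        return "-- May be HIKVISION --"
    return "-- Null string --"


# Invented sample pages.
print(describe_page("<html><head><title> Router Login </title></head></html>"))
print(describe_page("<html><head><title></title></head>"
                    "<body><a href='/doc/page/login.asp'>login</a></body></html>"))
```

Keeping the labeling in a pure function makes it easy to add new device fingerprints and check them against saved pages before running a full crawl.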