fofa crawler
I recently went looking for fofa crawler code and found an article online, but it contained errors; after fixing them, the code ran successfully. The code is below.
Prerequisite: log in to fofa.info with your account first, obtain the cookie value, and paste it into this line of the code:
cookie = '''paste cookie here'''
How to get the cookie value
Using Firefox as an example, right-click the page and choose Inspect (or press F12).
Open the Network tab, refresh the page, and click the first request in the list.
In that request, find the request headers, locate the Cookie header, then right-click it and choose Copy Value.
Paste the copied value into the code.
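As a minimal sketch (the cookie string below is a placeholder, not a real fofa session), the copied value simply becomes a request header that is sent along with every request:

```python
# Placeholder only: replace with the Cookie value copied from the browser.
cookie = "user_session=EXAMPLE; refresh_token=EXAMPLE"

headers = {
    "Connection": "keep-alive",
    # HTTP header values must be latin-1; re-decoding the UTF-8 bytes as
    # latin-1 makes any non-ASCII characters in the cookie header-safe.
    "cookie": cookie.encode("utf-8").decode("latin1"),
}

# These headers can then be passed to every request, e.g.:
# requests.get("https://fofa.info/result?qbase64=...", headers=headers)
```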
Code
# -*- coding: utf-8 -*-
import requests
from lxml import etree
import base64
import time
from urllib.parse import quote

TimeSleep = 5  # seconds to sleep between pages, to avoid an IP ban
SearchKEY = 'port="873"'  # search expression; port 873 is used as an example
StartPage = 1  # first page to crawl
StopPage = 5  # last page to crawl
cookie = '''paste cookie here'''

headers = {
    "Connection": "keep-alive",
    # HTTP header values must be latin-1; re-decoding the UTF-8 bytes as
    # latin-1 keeps requests happy if the cookie contains non-ASCII characters.
    "cookie": cookie.encode("utf-8").decode("latin1")
}

# fofa passes the search expression base64-encoded in the qbase64 URL parameter
searchbs64 = quote(str(base64.b64encode(SearchKEY.encode()), encoding='utf-8'))
print("Crawling: https://fofa.info/result?qbase64=" + searchbs64)

html = requests.get(url="https://fofa.info/result?qbase64=" + searchbs64, headers=headers).text
# with open('1.html', 'w') as f:
#     f.write(html)

# etree.HTML() parses an HTML string into an _Element tree, which supports
# methods such as getparent(), remove() and xpath().
tree = etree.HTML(html)
try:
    pagenum = tree.xpath('//li[@class="number"]/text()')[-1]  # last page number in the pagination bar
except Exception as e:
    print(e)
    pagenum = '0'
    print("fofa reports 0 pages")
print("Pages available for this query: " + pagenum)

doc = open("result.txt", "w", encoding='UTF-8')
for i in range(int(StartPage), int(pagenum) + 1):  # + 1 so the last page is included
    print("Now writing page " + str(i))
    rep = requests.get('https://fofa.info/result?qbase64=' + searchbs64 + "&page=" + str(i) + "&page_size=10",
                       headers=headers)
    tree = etree.HTML(rep.text)
    urllist = tree.xpath('//span[@class="hsxa-host"]')
    for item in urllist:
        item_string = etree.tostring(item, encoding='utf-8', method='text').decode()
        print(item_string)
        doc.write(item_string + "\n")
    if i == int(StopPage):
        break
    time.sleep(TimeSleep)
doc.close()
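The qbase64 parameter the script builds is nothing more than the URL-quoted Base64 of the search expression. A small sketch (the helper name here is just for illustration) to check the encoding in isolation:

```python
import base64
from urllib.parse import quote

def fofa_qbase64(query: str) -> str:
    """Base64-encode a fofa search expression, then URL-quote it for the query string."""
    b64 = base64.b64encode(query.encode("utf-8")).decode("ascii")
    return quote(b64)

print(fofa_qbase64('port="873"'))  # cG9ydD0iODczIg%3D%3D
```

Decoding the value back with `base64.b64decode(unquote(...))` recovers the original expression, which is a handy sanity check before firing real requests.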