Dynamic scraping of the exploit-db website with Selenium

Latest recommended article published on 2024-06-11 13:56:06

codecode入川

Tags: selenium, python, testing tools

Article link: https://blog.csdn.net/m0_67274307/article/details/137115602

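Environment setup

Make sure the Selenium library and a matching WebDriver are installed. This example uses Chrome, so it needs chromedriver. If they are not installed yet, the following command installs Selenium together with webdriver-manager, which downloads and manages the matching WebDriver version automatically.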

pip install selenium webdriver-manager
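To confirm that the install step worked, a quick check along these lines may help. This is only a sketch: `installed_version` is a hypothetical helper name, and it relies on the standard-library `importlib.metadata` module rather than anything from Selenium itself.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the two packages installed by the pip command above
for name in ("selenium", "webdriver-manager"):
    print(name, installed_version(name) or "not installed")
```

If either line prints "not installed", rerun the pip command in the same Python environment that will run the scraper.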

Implementation

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.common.exceptions import NoSuchElementException
import time

def setup_driver():
    """Set up the Selenium WebDriver."""
    service = Service(ChromeDriverManager().install())
    options = webdriver.ChromeOptions()
    # options.add_argument('--headless')  # headless mode
    driver = webdriver.Chrome(service=service, options=options)
    return driver

def crawl_exploits(driver, url, pages=5):
    """Scrape exploit titles and links from up to `pages` pages."""
    driver.get(url)
    time.sleep(5)  # wait for the page to load

    exploits = []

    for page in range(pages):
        # Collect the exploit titles and links on the current page
        exploit_elements = driver.find_elements(By.CSS_SELECTOR, 'h2 a.text-dark')
        for element in exploit_elements:
            title = element.text
            link = element.get_attribute('href')
            exploits.append({'title': title, 'link': link})

        # Stop once the requested number of pages has been collected
        if page == pages - 1:
            break

        # Try to click through to the next page
        try:
            next_button = driver.find_element(By.LINK_TEXT, 'Next')
            next_button.click()
            time.sleep(5)  # wait for the next page to load
        except NoSuchElementException:
            print("No more pages or cannot find the Next button.")
            break

    return exploits

def main():
    driver = setup_driver()
    try:
        url = 'https://www.exploit-db.com/'
        exploits = crawl_exploits(driver, url, pages=2)  # scrape only 2 pages as a demo
        for exploit in exploits:
            print(exploit)
    finally:
        driver.quit()

if __name__ == '__main__':
    main()

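How it works

The code first sets up the Chrome WebDriver; headless mode can optionally be enabled to reduce resource usage.
In the crawl_exploits function, the code opens https://www.exploit-db.com/ and scrapes exploit entries for the requested number of pages, collecting each title and detail-page link.
After each page loads, it looks for the "Next" button and clicks it, until the requested number of pages is reached or the button can no longer be found.
Finally, it prints the title and link of each exploit.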
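Printing is enough for a demo, but the collected list of `{'title', 'link'}` dictionaries could also be written to a file. The following is a minimal sketch using the standard-library `csv` module; `save_exploits_csv` is a hypothetical helper that is not part of the original script.

```python
import csv


def save_exploits_csv(exploits, path):
    """Write a list of {'title', 'link'} dicts to a CSV file with a header row."""
    # newline='' lets the csv module control line endings on every platform
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['title', 'link'])
        writer.writeheader()
        writer.writerows(exploits)
```

Calling `save_exploits_csv(exploits, 'exploits.csv')` at the end of `main()` would keep the results after the browser closes; the file name here is only an example.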

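Notes

In practice, adjust the time.sleep wait times to the target site's loading speed so that each page has fully loaded before it is parsed.
Out of respect for the target site, and to avoid placing unnecessary load on it, do not run this script frequently.
Make sure your usage complies with the target site's terms of service and its robots.txt rules.
Before deploying or using a crawler, make sure you understand the relevant legal and ethical guidelines.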
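The robots.txt check mentioned above can be automated with the standard-library `urllib.robotparser` module. The sketch below parses rules from a string so that it runs offline. `is_allowed` is a hypothetical helper, and `SAMPLE_ROBOTS` is made-up text for illustration, not the actual rules published by exploit-db.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if `robots_txt` permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only; not taken from any real site
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS, "https://example.com/private/page"))
print(is_allowed(SAMPLE_ROBOTS, "https://example.com/public"))
```

Against a live site, the parser can instead load the real file with `parser.set_url(...)` followed by `parser.read()`, and the scraper can skip any URL for which `can_fetch` returns False.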
