Analyzing Java web pages with Python: [Help] How can Java or Python capture every request a website makes?

Latest recommended article published on 2024-03-05 11:42:01

胖博士

Article tags: analyzing Java web pages with Python

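Copyright notice: This is an original article by the author, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.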

Article link: https://blog.csdn.net/weixin_30854435/article/details/113506924


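This article uses Python's selenium library to drive headless Chrome, simulate a user visiting a website, and capture the browser's performance log, which records every request the page makes. Parsing that log and filtering out base64-encoded references that start with data:, as well as the Document (page) links, leaves the collection of static resource links.

Summary generated by CSDN using AI.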
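To make the parsing step concrete, the sketch below shows roughly what one performance log entry looks like and how its nested JSON is decoded. The entry is hand-written for illustration: the URL, the `webview` value, and the other field values are assumptions, not output captured from a real page, and a real entry carries many more fields.

```python
import json

# Hand-written sample shaped like one item from browser.get_log('performance').
# The URL and field values are illustrative assumptions, not real captured data.
sample_entry = {
    "level": "INFO",
    "timestamp": 0,
    "message": json.dumps({
        "webview": "hypothetical-target-id",
        "message": {
            "method": "Network.requestWillBeSent",
            "params": {
                "type": "Script",
                "request": {
                    "url": "https://www.example.com/static/app.js",
                    "method": "GET",
                },
            },
        },
    }),
}

# The outer 'message' value is itself a JSON string, so decode it first.
decoded = json.loads(sample_entry["message"])
event = decoded["message"]

method = event["method"]                  # DevTools event name
resource_type = event["params"]["type"]   # e.g. Document, Script, Image
url = event["params"]["request"]["url"]   # the requested URL

print(method, resource_type, url)
```

The double decoding is why the script calls `json.loads` on `log['message']` before reaching into `['message']['params']`.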

[Python]
import json

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

chrome_options = Options()
# Run Chrome headless
chrome_options.add_argument('--headless')
chrome_options.add_argument('--user-agent=Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36')
# Start the browser maximized
chrome_options.add_argument('--start-maximized')
# Performance logging must be enabled explicitly for get_log('performance') to work
chrome_options.set_capability('goog:loggingPrefs', {'performance': 'ALL'})

# Replace this with the path to your own chromedriver
# (Selenium 4 takes the driver path through Service)
service = Service('D://googleDever//chromedriver.exe')
browser = webdriver.Chrome(service=service, options=chrome_options)
browser.set_page_load_timeout(150)
browser.get('https://www.xxx.com')

# Collected static resource links
urls = []

# Walk the performance log and keep the valid static resource links
for log in browser.get_log('performance'):
    if 'message' not in log:
        continue
    log_entry = json.loads(log['message'])
    try:
        params = log_entry['message']['params']
        url = params['request']['url']
        # Skip base64-encoded references that start with data: and Document (page) links
        if not url.startswith('data:') and 'Document' not in params['type']:
            urls.append(url)
    except (KeyError, TypeError):
        # Events that carry no request URL or resource type are not requests; skip them
        pass

print(urls)
browser.quit()
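The filtering step can also be pulled out into a plain function and tried without launching a browser. The following is a minimal sketch, not the author's code: the names `extract_resource_urls` and `make_entry` and all sample URLs are hypothetical, and two behaviors are additions that the script above does not have, namely keeping only `Network.requestWillBeSent` events and dropping duplicate URLs.

```python
import json


def extract_resource_urls(log_entries):
    """Return de-duplicated, non-document request URLs in first-seen order."""
    seen = set()
    urls = []
    for entry in log_entries:
        try:
            event = json.loads(entry["message"])["message"]
        except (KeyError, TypeError, ValueError):
            continue  # not a decodable DevTools record
        # Assumption: only requestWillBeSent events are treated as requests.
        if event.get("method") != "Network.requestWillBeSent":
            continue
        params = event.get("params", {})
        url = params.get("request", {}).get("url", "")
        if not url or url.startswith("data:"):
            continue  # skip inline base64 references
        if params.get("type") == "Document":
            continue  # skip the page itself
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def make_entry(method, url, resource_type):
    """Build a fake log entry shaped like browser.get_log('performance') items."""
    inner = {"method": method,
             "params": {"type": resource_type, "request": {"url": url}}}
    return {"message": json.dumps({"message": inner})}


# Illustrative entries only; none of these come from a real capture.
sample_log = [
    make_entry("Network.requestWillBeSent", "https://www.example.com/", "Document"),
    make_entry("Network.requestWillBeSent", "https://www.example.com/static/app.js", "Script"),
    make_entry("Network.requestWillBeSent", "https://www.example.com/static/app.js", "Script"),
    make_entry("Network.requestWillBeSent", "data:image/png;base64,AAAA", "Image"),
    make_entry("Network.requestWillBeSent", "https://www.example.com/static/logo.png", "Image"),
    {"message": json.dumps({"message": {"method": "Network.responseReceived", "params": {}}})},
]

result = extract_resource_urls(sample_log)
print(result)
```

With a live session, the same function could be fed `browser.get_log('performance')` directly, since it only relies on each entry having a JSON string under `message`.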
