Rendering HTML pages in Python: how can I extract the list of URLs fetched while an HTML page is rendered?

Latest recommended article published 2021-08-13 17:35:50

weixin_39920415

Tags: rendering HTML pages in Python

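To get every URL a browser issues a GET request for when it opens a web page, you need to simulate rendering the page. The PyQt and QtWebKit libraries can help with this. Ghost.py is a good starting point: when it opens a page, it returns a tuple of the page and the resources that were loaded. With QWebView and QNetworkAccessManager you can track every loaded HTTP resource in detail.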

Summary generated automatically by CSDN.

I want to get the list of all URLs that a browser issues a GET request for when it opens a page. For example, if we open cnn.com, the first HTTP response references multiple URLs, which the browser then requests recursively.

I'm not trying to render a page; I'm trying to obtain a list of all the URLs that are requested when a page is rendered. A simple scan of the HTTP response content wouldn't be sufficient, because images referenced from the CSS could also be downloaded. Is there any way I can do this in Python?

Solution

You will most likely have to render the page (though not necessarily display it) to be sure you get a complete list of all resources. I've used PyQt and QtWebKit in similar situations. Once you start counting resources that JavaScript includes dynamically, parsing and loading pages recursively with BeautifulSoup just isn't going to work, because those URLs never appear in the markup.
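To illustrate why a static scan falls short, here is a minimal sketch that uses the standard library's html.parser as a stand-in for BeautifulSoup. The sample markup, the StaticURLCollector class, and the paths in it are all made up for illustration. The scan only finds URLs written literally in src attributes and in link href attributes; the image URL that the script assembles at runtime never shows up, and neither would anything referenced from inside site.css.

```python
from html.parser import HTMLParser


class StaticURLCollector(HTMLParser):
    """Collect URLs that appear literally in src attributes and <link href>."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not value:
                continue
            # Only markup attributes are visible here; script bodies and
            # external stylesheets are never executed or fetched.
            if name == "src" or (tag == "link" and name == "href"):
                self.urls.append(value)


# Hypothetical page: one stylesheet, one image in the markup, and one
# image whose URL is built by JavaScript at runtime.
SAMPLE_HTML = """
<html><head>
<link rel="stylesheet" href="/static/site.css">
<script>
  var img = new Image();
  img.src = "/img/" + "tracker.gif";
</script>
</head><body><img src="/img/logo.png"></body></html>
"""

collector = StaticURLCollector()
collector.feed(SAMPLE_HTML)
static_urls = collector.urls
print(static_urls)
```

A rendering engine would also request /img/tracker.gif and whatever site.css pulls in, which is why the answer recommends rendering the page instead.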

Ghost.py is an excellent client to get you started with PyQt. Also check the QWebView and QNetworkAccessManager documentation.

Ghost.py returns a tuple of (page, resources) when opening a page:

from ghost import Ghost

ghost = Ghost()
# open() renders the page and returns it together with the resources it loaded
page, resources = ghost.open('http://my.web.page')

resources includes all of the resources loaded by the original URL as HttpResource objects. You can retrieve the URL for a loaded resource with resource.url.
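Turning that resource list into a plain list of URLs could look roughly like the sketch below. The collect_urls helper is hypothetical, not part of Ghost.py, and it assumes only what is stated above: that each resource exposes a url attribute. The stand-in objects let the sketch run without PyQt installed; with Ghost.py you would pass the resources value returned by open() instead.

```python
from types import SimpleNamespace


def collect_urls(resources):
    """Return the URL of each loaded resource, in load order, without duplicates."""
    seen = set()
    urls = []
    for resource in resources:
        # str() is a precaution in case the attribute is not already a plain string.
        url = str(resource.url)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Stand-ins for HttpResource objects; these URLs are made up for illustration.
fake_resources = [
    SimpleNamespace(url="http://my.web.page/"),
    SimpleNamespace(url="http://my.web.page/static/site.css"),
    SimpleNamespace(url="http://my.web.page/img/logo.png"),
    SimpleNamespace(url="http://my.web.page/static/site.css"),
]

requested_urls = collect_urls(fake_resources)
for url in requested_urls:
    print(url)
```

De-duplicating is a design choice here: a page can request the same resource more than once, so drop the seen check if you need a count of every request.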