Crawling dynamic pages with PhantomJS + Selenium

weixin_30412167

Tags: python, web crawler

Original link: http://www.cnblogs.com/bencakes/p/5971859.html


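I used to crawl dynamic pages by driving a real browser with Selenium + Firefox. Firefox updates frequently, though, and after an update the webdriver often fails to start, so I switched to PhantomJS + Selenium instead.
Driving PhantomJS is not much different from driving a regular browser.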

1. Download PhantomJS

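PhantomJS supports all the major platforms; get it from the download page. Pick a directory to unpack it into, for example D:\phantomjs.
The phantomjs executable is in the bin directory, so you can add D:\phantomjs\bin to the PATH environment variable. If you do not add it to PATH, Selenium has to be given the path explicitly when it launches PhantomJS.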
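
The PATH-or-explicit-path choice described above can be checked from Python before starting the driver. The following is a minimal sketch using only the standard library; `resolve_executable` is a hypothetical helper name, and the fallback path is simply the example directory used in this article.

```python
import shutil


def resolve_executable(name, fallback):
    """Return the full path of `name` if it is found on PATH, otherwise `fallback`.

    Hypothetical helper for illustration; it does not check that `fallback` exists.
    """
    found = shutil.which(name)
    return found if found is not None else fallback


# Raw string so that "\b" in the Windows path is not read as an escape sequence.
phantomjs_path = resolve_executable("phantomjs", r"D:\phantomjs\bin\phantomjs.exe")
print(phantomjs_path)
```

The resulting string is what you would pass as `executable_path` when PhantomJS is not on PATH.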

2. Driving PhantomJS from Selenium

from selenium import webdriver
from selenium.common.exceptions import TimeoutException

## Optional: configure PhantomJS before starting it
#cap = webdriver.DesiredCapabilities.PHANTOMJS.copy()    # copy webdriver's default PhantomJS capabilities
#cap["phantomjs.page.settings.resourceTimeout"] = 5000    # resource load timeout, in milliseconds
#cap["phantomjs.page.settings.loadImages"] = False    # whether to load images
#driver = webdriver.PhantomJS(desired_capabilities=cap)

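# If PhantomJS is not on PATH, pass its location explicitly (raw string, so "\b" is not read as an escape)
#driver = webdriver.PhantomJS(executable_path=r"D:\phantomjs\bin\phantomjs.exe")
driver = webdriver.PhantomJS()
driver.set_page_load_timeout(5)    # page load timeout, in seconds
#driver.set_script_timeout(5)    # timeout for asynchronous scripts, in seconds; either timeout raises TimeoutException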

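## Stop loading the page once the timeout is hit.
## Some pages keep fetching other resources long after the data you want has loaded.
try:
    driver.get("http://www.tvmao.com")    # the URL needs a scheme
except TimeoutException:
    driver.execute_script("window.stop()")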

## With the page source in hand, you can save it and then parse the data
page_source = driver.page_source    # page_source is a property, not a method

############
#
# Data parsing goes here
#
############
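
The article leaves the parsing step open. As one possible illustration, here is a small sketch that pulls every link out of an HTML string with the standard library's `html.parser`. The class and function names are made up for this example, and `sample` is a hypothetical stand-in for the `page_source` string obtained above; a real crawler would more likely extract whatever fields the target page actually contains.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently being read, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(page_source):
    """Return a list of (href, text) tuples found in the HTML string."""
    parser = LinkCollector()
    parser.feed(page_source)
    parser.close()
    return parser.links


# Hypothetical HTML standing in for driver.page_source.
sample = '<html><body><a href="/a">First</a><p>x</p><a href="/b"> Second </a></body></html>'
print(extract_links(sample))
```

Because PhantomJS has already executed the page's JavaScript, the string being parsed contains the rendered markup, which is the point of using a browser driver here.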

For the full list of options PhantomJS can be configured with, see the official documentation.
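
The page settings used in the commented-out configuration all share the `phantomjs.page.settings.` key prefix, so they can be assembled with a small helper. This is a sketch only: `with_page_settings` is a hypothetical name, and the `base` dictionary is a simplified stand-in for `webdriver.DesiredCapabilities.PHANTOMJS` so that the example runs without Selenium installed.

```python
PAGE_SETTINGS_PREFIX = "phantomjs.page.settings."


def with_page_settings(base_caps, **settings):
    """Return a copy of `base_caps` with PhantomJS page settings added.

    Working on a copy leaves the original (possibly shared) dictionary unchanged.
    """
    caps = dict(base_caps)
    for name, value in settings.items():
        caps[PAGE_SETTINGS_PREFIX + name] = value
    return caps


# Simplified stand-in for webdriver.DesiredCapabilities.PHANTOMJS (assumption for illustration).
base = {"browserName": "phantomjs"}
caps = with_page_settings(base, resourceTimeout=5000, loadImages=False)
print(caps)
```

The resulting dictionary is what would be passed as `desired_capabilities` when creating the driver.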

Reposted from: https://www.cnblogs.com/bencakes/p/5971859.html
