Scraping Dynamically Loaded Page Data with selenium and PhantomJS

阿柯柯

Categories: Notes, Python, Crawlers. Tags: selenium, PhantomJS, crawler, python, handling dynamically loaded data

Article link: https://blog.csdn.net/weixin_44761016/article/details/88980751


I. selenium

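- 1. selenium: a third-party library that drives a real browser and automates its operations.

- 2. Environment setup

2.1 Install: pip install selenium
2.2 Get the driver program for your browser
- Download: http://chromedriver.storage.googleapis.com/index.html
- Mapping table of browser versions to driver versions: https://blog.csdn.net/huilan_same/article/details/51896672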
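# Use the methods below to locate the element you want, then operate on it.
# (This is the Selenium 3 API; Selenium 4.3 removed these helpers in favor of find_element(By.ID, ...).)
find_element_by_id find a node by id
find_elements_by_name find by name
find_elements_by_xpath find by XPath
find_elements_by_tag_name find by tag name
find_elements_by_class_name find by class name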
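
For readers on a newer selenium release, here is a minimal sketch of how the legacy method names above map onto the `find_element(by, value)` / `find_elements(by, value)` style. The helper names `parse_legacy_name` and `find` are hypothetical and only for illustration. The strategy strings are the values of selenium's `By` constants (for example `By.ID` is `"id"`).

```python
# Locator strategy strings; these match the values of selenium's By constants
# (By.ID == "id", By.TAG_NAME == "tag name", By.CLASS_NAME == "class name").
STRATEGIES = {
    "id": "id",
    "name": "name",
    "xpath": "xpath",
    "tag_name": "tag name",
    "class_name": "class name",
}


def parse_legacy_name(method_name):
    """Split e.g. 'find_elements_by_class_name' into (plural, strategy)."""
    for prefix, plural in (("find_elements_by_", True), ("find_element_by_", False)):
        if method_name.startswith(prefix):
            key = method_name[len(prefix):]
            return plural, STRATEGIES[key]
    raise ValueError(f"not a legacy locator method: {method_name}")


def find(driver, method_name, value):
    """Hypothetical adapter: run a legacy-style lookup through the newer API."""
    plural, strategy = parse_legacy_name(method_name)
    if plural:
        return driver.find_elements(strategy, value)  # list, possibly empty
    return driver.find_element(strategy, value)       # single element
```

With a real driver, `find(bro, "find_element_by_id", "kw")` makes the same call as `bro.find_element(By.ID, "kw")`.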

# Coding flow:
from selenium import webdriver
from time import sleep
# Create a browser object; executable_path is the path to the driver
bro = webdriver.Chrome(executable_path='./chromedriver')
# get() takes a URL and makes the browser request it
bro.get('https://www.baidu.com')
sleep(1)
# Have Baidu search for a given keyword
text = bro.find_element_by_id('kw')  # locate the search text box
text.send_keys('人民币')  # send_keys types the given content into the text box
sleep(1)
button = bro.find_element_by_id('su')
button.click()  # click performs a mouse click
sleep(3)
bro.quit()  # close the browser
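
The fixed `sleep(...)` calls above are a guess at how long the page needs. A polling wait is one alternative: check a condition repeatedly and continue as soon as it holds. selenium ships `WebDriverWait` for this purpose. The sketch below is a library-free illustration of the same idea, and `wait_until` is a hypothetical helper name, not part of selenium.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)
```

For example, `wait_until(lambda: bro.find_elements_by_id('kw'))` returns the non-empty list of matches as soon as the search box exists, instead of always pausing for a full second.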

- PhantomJS: a headless browser (no visible window). Its automation flow is the same as the Chrome automation flow above.

from selenium import webdriver

bro = webdriver.PhantomJS(executable_path='/Users/bobo/Desktop/路飞爬虫授课/动态数据加载爬取/phantomjs-2.1.1-macosx/bin/phantomjs')

# Open the page in the browser
bro.get('https://www.baidu.com')

# Take a screenshot
bro.save_screenshot('./1.png')

text = bro.find_element_by_id('kw')  # locate the search text box
text.send_keys('人民币')  # send_keys types the given content into the text box

bro.save_screenshot('./2.png')

bro.quit()
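
PhantomJS is no longer developed, and Selenium 4 dropped `webdriver.PhantomJS`, so on a current install the usual substitute is Chrome in headless mode. Below is a minimal sketch, assuming selenium and a matching chromedriver are available. The function names and the default window size are my own choices, not something from the original notes.

```python
def headless_chrome_args(width=1280, height=800):
    """Command-line switches for running Chrome without a visible window."""
    return ["--headless", f"--window-size={width},{height}"]


def make_headless_chrome():
    """Start headless Chrome; needs selenium and a matching chromedriver."""
    # Imported here so the helper above can be used without selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in headless_chrome_args():
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

After `bro = make_headless_chrome()`, the rest of the flow (`bro.get(...)`, `bro.save_screenshot(...)`, `bro.quit()`) stays the same as in the PhantomJS example.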


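Key point: selenium + PhantomJS is the catch-all solution for crawlers. On some sites the content is generated by dynamically loaded JS, so an ordinary crawler, which only downloads the initial HTML, cannot obtain it. For example, Douban Movies loads more movie entries dynamically as you scroll down.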
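
To make the point concrete, here is a small illustration using made-up HTML (both strings are hypothetical, not real Douban markup). The first string stands for what a plain HTTP fetch receives, the second for the DOM after the browser has run the page's script.

```python
from html.parser import HTMLParser

# Hypothetical initial HTML as a plain HTTP client would receive it: the list
# is empty because a script is expected to fill it in after the page loads.
STATIC_HTML = """
<html><body>
  <ul id="movies"></ul>
  <script>/* would fetch movie data and append list items */</script>
</body></html>
"""

# Hypothetical DOM after the browser has executed that script.
RENDERED_HTML = """
<html><body>
  <ul id="movies"><li>Movie A</li><li>Movie B</li></ul>
</body></html>
"""


class ItemCounter(HTMLParser):
    """Count the <li> start tags in a document."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.count += 1


def count_items(html):
    parser = ItemCounter()
    parser.feed(html)
    return parser.count


print(count_items(STATIC_HTML))    # 0: what a plain HTTP fetch can see
print(count_items(RENDERED_HTML))  # 2: what page_source holds after the JS ran
```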

Using selenium + PhantomJS to scrape dynamically loaded page data

Putting it together: the goal is to scrape as much movie information from Douban as possible.

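from selenium import webdriver
from time import sleep
# Raw string so the backslashes in the Windows path are not treated as escapes
bro = webdriver.PhantomJS(executable_path=r'D:\Python3.7.1\Scripts\phantomjs')
url = 'https://movie.douban.com/typerank?type_name=%E5%96%9C%E5%89%A7&type=24&interval_id=100:90&action='
bro.get(url)
sleep(1)
# Take a screenshot
bro.save_screenshot('./1.png')
# JS snippet that scrolls the page down to the bottom
js = 'window.scrollTo(0,document.body.scrollHeight)'
# execute_script makes the browser object run the JS code
bro.execute_script(js)
sleep(1)
# Take a screenshot
bro.save_screenshot('./2.png')
bro.execute_script(js)
sleep(1)  # give the newly requested data time to load before the screenshot
bro.save_screenshot('./3.png')
# Get the page after the data has loaded: page_source holds the browser's current page data
page_text = bro.page_source
print(page_text)
bro.quit()  # close the browser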
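
The script above scrolls exactly twice. To load "as much as possible", one option is to keep scrolling until the page height stops growing. The sketch below assumes that each scroll to the bottom makes the page append more entries and grow taller. `scroll_until_stable` and its parameters are hypothetical names, and the one-second pause is a guess, not a measured value.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=20):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls that made the page grow.
    """
    get_height = "return document.body.scrollHeight"
    scroll = "window.scrollTo(0, document.body.scrollHeight)"
    grown = 0
    last_height = driver.execute_script(get_height)
    for _ in range(max_rounds):
        driver.execute_script(scroll)
        time.sleep(pause)  # crude wait for the newly requested data to render
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            break  # nothing new was appended, so assume we reached the end
        grown += 1
        last_height = new_height
    return grown
```

Calling `scroll_until_stable(bro)` before reading `bro.page_source` would replace the two hand-written `execute_script(js)` calls. `max_rounds` caps the loop in case the page never stops growing.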
