[Web Scraping Tutorial] Dynamic Page Scraping 04

Latest recommended article published 2024-07-16 14:25:25

AI study

Category: Web Crawlers. Tags: web crawler, selenium, dynamic page scraping, charles

Article link: https://blog.csdn.net/weixin_43797885/article/details/104385347


4.1 Charles data capture tool

Charles user guide

4.2 Selenium automation tool

4.2.1 Getting started

pip3.6 install selenium

4.2.2 Using Selenium

4.2.2.1 Browser object operations

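import time
from selenium import webdriver
url = 'http://www.qq.com/'
chrome = webdriver.Chrome()  # create the browser object
chrome.maximize_window()  # maximize the browser window
chrome.get(url)  # open the page at this URL
time.sleep(3)  # wait 3 s for the page to load
html = chrome.page_source  # source of the currently loaded page
chrome.save_screenshot('./qq.png')  # take a screenshot of the page and save it
chrome.back()  # go back one step in the browser history
chrome.forward()  # go forward one step in the browser history
chrome.close()  # close the current window
chrome.quit()  # quit the browser and end the driver session (call this last)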
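
The listing above leaves the browser running if any step raises before `quit()` is reached. Below is a minimal sketch of one way to guarantee cleanup. `managed_browser` and `fetch_html` are hypothetical helper names, not part of Selenium, and `factory` is assumed to be any zero-argument callable that returns a driver, such as `webdriver.Chrome`.

```python
from contextlib import contextmanager


@contextmanager
def managed_browser(factory):
    """Create a driver with factory() and always quit it afterwards."""
    driver = factory()
    try:
        yield driver
    finally:
        driver.quit()  # ends the session and closes every window


def fetch_html(factory, url, screenshot_path=None):
    """Open url, optionally save a screenshot, and return the page source."""
    with managed_browser(factory) as driver:
        driver.get(url)
        if screenshot_path is not None:
            driver.save_screenshot(screenshot_path)
        return driver.page_source
```

With Selenium installed, `fetch_html(webdriver.Chrome, 'http://www.qq.com/', './qq.png')` would cover the same steps as the listing, apart from the fixed `time.sleep(3)`.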

4.2.2.2 Element object operations

Imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

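Locating elements

# The find_element_by_* helpers were removed in Selenium 4.3; find_element(By.<strategy>, value) works in Selenium 3 and 4.
chrome.find_element(By.ID, 'su').click()  # locate by id
chrome.find_element(By.CSS_SELECTOR, '.bg.s_btn').click()  # By.CLASS_NAME takes a single class, so the compound class 'bg s_btn' is written as a CSS selector
chrome.find_element(By.XPATH, '')  # locate by XPath
chrome.find_elements(By.TAG_NAME, '')  # locate by tag name; returns a list of elements
chrome.find_element(By.CSS_SELECTOR, '')  # locate by CSS selector
chrome.find_element(By.LINK_TEXT, '')  # locate a link by its exact text
chrome.find_element(By.ID, '')  # general form: any By strategy plus its value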
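
A class-name lookup expects a single class name, so a compound value such as `'bg s_btn'` does not match as intended. One workaround is to turn the class attribute into a CSS selector. The helper below is a small sketch of that conversion; `compound_class_to_css` is a hypothetical name, and it assumes plain class names that need no CSS escaping.

```python
def compound_class_to_css(class_attr):
    """Turn a class attribute such as 'bg s_btn' into the selector '.bg.s_btn'."""
    names = class_attr.split()
    if not names:
        raise ValueError("expected at least one class name")
    # Chaining .a.b matches elements that carry every listed class.
    return "".join("." + name for name in names)


print(compound_class_to_css("bg s_btn"))  # .bg.s_btn
```

The result can then be passed to a CSS lookup, for example `chrome.find_element(By.CSS_SELECTOR, compound_class_to_css('bg s_btn'))`.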

Click events

# Option 1: send the Enter key to the element
from selenium.webdriver.common.keys import Keys
chrome.find_element(By.ID, 'su').send_keys(Keys.ENTER)

# Option 2: click the element
chrome.find_element(By.ID, 'su').click()  # locate by id, then click

Sending text input

from selenium.webdriver.common.by import By
chrome.find_element(By.ID,'kw').send_keys('google')
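
Typing and submitting are often combined into one step. The sketch below shows one possible shape for that; `fill_and_submit` is a hypothetical helper, and the locator and the Enter key are passed in by the caller so that the function itself makes no assumption about the page.

```python
def fill_and_submit(driver, by, value, text, enter_key):
    """Type text into the element found by (by, value), then press enter_key."""
    box = driver.find_element(by, value)
    box.clear()               # drop any text already in the field
    box.send_keys(text)       # type the query
    box.send_keys(enter_key)  # submit, as pressing Enter would
    return box
```

For the search box used above, a call could look like `fill_and_submit(chrome, By.ID, 'kw', 'google', Keys.ENTER)`.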

4.2.2.3 Window object operations

Switching between windows

windows = chrome.window_handles  # list of handles for all open windows
current_window = chrome.current_window_handle  # handle of the current window
chrome.switch_to.window(windows[1])  # switch to the window windows[1]
iframe = chrome.find_element(By.XPATH, '//div[@class="login"]/iframe')  # get the iframe element
chrome.switch_to.frame(iframe)  # switch into that iframe
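
Indexing `windows[1]` assumes the new window is second in the list, and after entering an iframe the driver stays inside it until told otherwise. The sketch below shows one way to handle both; `switch_to_new_window` and `inside_frame` are hypothetical helpers built on the driver's `window_handles` and `switch_to` attributes.

```python
from contextlib import contextmanager


def switch_to_new_window(driver, handles_before):
    """Switch to the first window handle that was not in handles_before."""
    for handle in driver.window_handles:
        if handle not in handles_before:
            driver.switch_to.window(handle)
            return handle
    raise RuntimeError("no new window was opened")


@contextmanager
def inside_frame(driver, frame_element):
    """Enter an iframe and return to the top-level document on exit."""
    driver.switch_to.frame(frame_element)
    try:
        yield driver
    finally:
        driver.switch_to.default_content()
```

A typical use is to record `chrome.window_handles` before clicking a link that opens a new window, then pass that list to `switch_to_new_window`.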

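Window settings

from selenium import webdriver
from selenium.webdriver import ChromeOptions
from selenium.webdriver.chrome.service import Service
url = 'http://news.mtime.com/2019/10/31/1598542.html'
options = ChromeOptions()  # options object
options.add_argument('--mute-audio')  # mute media audio
# Selenium 4 takes the driver path through Service; Selenium 3 accepted executable_path='path/to/driver' directly.
chrome = webdriver.Chrome(service=Service(executable_path='path/to/driver'), options=options)  # create the browser with the given driver and options
chrome.set_window_size(500, 500)  # set the window size (width, height)
chrome.set_window_position(0, 0)  # set the window position (x, y); the optional third argument is a window handle and defaults to 'current'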
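
When several options are needed, it can help to build the argument list in one place. The sketch below is only an illustration: `build_chrome_arguments` is a hypothetical helper, `--mute-audio` comes from the listing above, and `--headless=new` and `--window-size` are Chrome command-line switches assumed to be supported by the installed Chrome (the `new` headless mode needs a recent version).

```python
def build_chrome_arguments(mute_audio=True, headless=False, window_size=None):
    """Return a list of Chrome command-line switches for ChromeOptions."""
    args = []
    if mute_audio:
        args.append("--mute-audio")
    if headless:
        args.append("--headless=new")  # assumption: recent Chrome version
    if window_size is not None:
        width, height = window_size
        args.append(f"--window-size={width},{height}")
    return args


print(build_chrome_arguments(window_size=(500, 500)))
# ['--mute-audio', '--window-size=500,500']
```

Each entry would then be passed to `options.add_argument(...)` before the browser object is created.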

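4.2.2.4 JavaScript operations

Scrollbar operations

js1 = 'document.documentElement.scrollTop=10000'  # set the vertical scroll offset to 10000 px
js2 = 'window.scrollTo(0,document.body.scrollHeight)'  # scroll to the bottom of the page
js3 = 'window.scrollTo(0,200)'  # scroll to 200 px from the top
chrome.execute_script(js1)  # run one of the scripts in the page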

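img = chrome.find_element(By.CLASS_NAME, 'index-logo-src')  # get the element
chrome.execute_script('$(arguments[0]).fadeOut()', img)  # hide the element with a fade-out effect (script first, then the element passed as arguments[0]); requires jQuery on the page
chrome.execute_script('var q = document.getElementById("kw"); q.style.border = "2px solid red";')  # draw a red border around the element with id "kw"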
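
On pages that load more content as the user scrolls, a single `scrollTo` call is usually not enough. The sketch below repeats the scroll until the page height stops growing. `scroll_to_bottom` is a hypothetical helper, and the `pause` and `max_rounds` defaults are arbitrary assumptions to tune per site.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll down until the page height stops growing; return rounds used."""
    get_height = "return document.body.scrollHeight"
    last_height = driver.execute_script(get_height)
    for round_no in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            return round_no  # nothing new was loaded in this round
        last_height = new_height
    return max_rounds
```

After it returns, `chrome.page_source` should contain whatever content was loaded during the scrolling.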

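4.2.2.5 Three ways to wait

time.sleep(3)  # fixed wait: always pause 3 s for the page to load
# Implicit wait: set once on the driver. If an element is not found immediately, the driver keeps polling the DOM for up to 10 s before raising NoSuchElementException. The page is not reloaded during this time.
chrome.implicitly_wait(10)
# Explicit wait: wait for a specific condition, such as an element being present or clickable. The condition is checked every 0.5 s; if it is still not met after 10 s, TimeoutException is raised.
kw = WebDriverWait(chrome, 10, 0.5).until(EC.presence_of_element_located((By.ID, 'su')))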
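
To make the idea behind an explicit wait concrete, here is a simplified polling loop in plain Python. It is not Selenium's implementation: `wait_until` is a hypothetical function, it raises the built-in `TimeoutError` rather than Selenium's `TimeoutException`, and it does not swallow any exceptions raised by the condition.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() every poll seconds until it returns a truthy value."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # condition met: hand back whatever it produced
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)
```

The arguments mirror `WebDriverWait(chrome, 10, 0.5)`: a total timeout of 10 s and a check every 0.5 s.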
