Web scraping with Python and Selenium

shlhhy

Article link: https://blog.csdn.net/shlhhy/article/details/106273174

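1. Installation

Target page, the news section of the General Administration of Customs site: http://www.customs.gov.cn/customs/xwfb34/302425/3049105/index.html
Fetching it with Python's requests library returned HTTP 412, and none of the workarounds I tried fixed that, so I switched to selenium to see whether the page could be scraped that way.

Chrome browser: open chrome://version to check the browser version.
chromedriver: the driver; its version must match the browser version.
The driver can be downloaded from: http://chromedriver.storage.googleapis.com/index.html
After downloading, either put the driver in the same directory as python.exe or specify its path in code: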
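
Before launching the browser it can help to confirm that the two versions really do line up. The following is only a minimal sketch, assuming that agreement on the major version number is enough for your setup. The helper names and the example version strings are illustrative and do not come from any library. Older 2.x chromedriver releases used a separate numbering scheme, so this check does not apply to them.

```python
def major_version(version: str) -> int:
    """Return the leading (major) component of a dotted version string."""
    return int(version.strip().split(".")[0])


def versions_match(chrome_version: str, driver_version: str) -> bool:
    """Assumption: Chrome and chromedriver are compatible when their major versions agree."""
    return major_version(chrome_version) == major_version(driver_version)


# Illustrative values: the first is what chrome://version might show,
# the second is the version of the downloaded driver.
print(versions_match("83.0.4103.61", "83.0.4103.39"))
```

If the check fails, download the driver build whose major version matches the number shown by chrome://version.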

browser = webdriver.Chrome(executable_path="XXXXXXXXXX", chrome_options=chrome_options)

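2. Problems encountered

With Chrome 83, the customs page opened by selenium was completely blank, although every other site I tried opened fine.
After switching to Chrome 57, the page loaded correctly.

3. Solution

If you would rather not downgrade from the latest Chrome, you can wait for the page's redirect to finish before reading it. I do not know why this works, but it is worth a try. Reference:
http://mogicwula.com/2020/03/23/%E7%88%AC%E5%8F%96%E6%AD%A6%E6%B1%89%E7%96%AB%E6%83%85%E6%95%B0%E6%8D%AE412%E7%8A%B6%E6%80%81%E7%A0%81%E8%A7%A3%E5%86%B3/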

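from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
import selenium.webdriver.support.ui as ui

def is_visible(locator, timeout=10):
    # Wait up to `timeout` seconds for the element at XPath `locator` to become visible.
    try:
        ui.WebDriverWait(browser, timeout).until(EC.visibility_of_element_located((By.XPATH, locator)))
        return True
    except TimeoutException:
        # The element did not become visible in time.
        return False

# `browser` is the webdriver.Chrome instance created in section 1.
browser.get("XXXXXXXXXXXXXXXXXXXX")
# Block until the target element has rendered (or the timeout expires).
is_visible('/html/body/div[2]/div[2]/div[1]')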
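
An explicit wait of this kind is a polling loop: a condition is checked repeatedly until it holds or the time budget runs out. The sketch below illustrates only that general idea in plain Python. It is not Selenium's implementation, and the name wait_until and its parameters are made up for illustration.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out.

    Returns the truthy result on success and False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


# Hypothetical condition that only succeeds on the third check,
# standing in for "the element has finally been rendered".
attempts = {"count": 0}

def ready():
    attempts["count"] += 1
    return attempts["count"] >= 3

print(wait_until(ready, timeout=2.0, poll_interval=0.01))
```

This may be why waiting helps here: if the page only fills in its content after a redirect or script step, reading it immediately after browser.get() can be too early.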

4. Parsing page elements with XPath

After browser.get() has loaded the page, its elements can be extracted with the following methods:

find_element_by_xpath("//ul[@class='conList_ull']"): returns the ul element whose class is conList_ull
find_elements_by_xpath('li'): called on that ul, returns all li elements under it as a list of WebElement objects
li_content[i].find_element_by_xpath('a').get_attribute('href')
get_attribute() is required to read the attribute values of the a element
find_element_by_name('ArticleTitle').get_attribute('content')
find_element_by_id('easysiteText').text
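
To see what these expressions select without launching a browser, here is a rough offline sketch using the standard library's xml.etree.ElementTree instead of Selenium. The HTML snippet is a made-up, well-formed stand-in and the real page's markup may differ. ElementTree supports only a subset of XPath, so the paths are written relative to the root with a leading ".//".

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page; not the real site's markup.
SAMPLE = """
<html>
  <head><meta name="ArticleTitle" content="Sample title"/></head>
  <body>
    <ul class="conList_ull">
      <li><a href="/news/1.html">First item</a></li>
      <li><a href="/news/2.html">Second item</a></li>
    </ul>
    <div id="easysiteText"><p>Body text.</p></div>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE.strip())

# The ul whose class attribute is conList_ull.
ul = root.find(".//ul[@class='conList_ull']")
# All li elements directly under that ul.
li_content = ul.findall("li")
# The href attribute of the a element inside each li.
hrefs = [li.find("a").get("href") for li in li_content]
# The content attribute of the element named ArticleTitle.
title = root.find(".//meta[@name='ArticleTitle']").get("content")
# The visible text inside the element whose id is easysiteText.
text = "".join(root.find(".//*[@id='easysiteText']").itertext()).strip()

print(hrefs, title, text)
```

The Selenium calls listed above follow the same pattern: locate the ul, iterate over its li children, then read attributes or text. Selenium 4.3 and later removed the find_element_by_* helpers, and the same lookups are written as find_element(By.XPATH, ...), find_element(By.NAME, ...) and find_element(By.ID, ...).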