Scraping Trending Searches with Selenium: Problems Encountered and Practice

有梦想的大薯条

Last modified 2023-02-06 09:21:07


Category: Python. Tags: selenium, python, chrome

First published 2023-02-05 22:51:20

Article link: https://blog.csdn.net/weixin_45293882/article/details/128891555


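This article describes problems you may run into when using selenium in PyCharm to scrape trending searches: matching the browser driver version, errors caused by a selenium version that is too new, and how to adjust the code to resolve errors from the find_element family of methods. It gives detailed steps and code examples to help readers understand and solve these problems.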


Trending-search practice

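This article uses PyCharm, and the problems it covers are ones you may run into when working in PyCharm. I hope it helps **(scraping trending searches with Google Chrome)**.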

Tool preparation

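1. Install the selenium package
To scrape trending searches, first install the selenium package. I prefer to install it from Settings inside PyCharm, mainly because I do not want it placed on the C: drive.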

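2. Download the browser driver
Once selenium is installed, pick a browser driver according to whether your browser is Chrome or Firefox. I use Chrome, so this article covers the Chrome driver.
Chrome driver download: http://chromedriver.storage.googleapis.com/index.html
Firefox driver download: https://github.com/mozilla/geckodriver/releases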
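
Picking a driver comes down to comparing version numbers. The sketch below assumes that matching the leading (major) number of the browser and driver versions is the check you want. The helper names and the version strings are made up for illustration and are not part of selenium.

```python
def major_version(version: str) -> int:
    """Return the leading (major) number of a dotted version string."""
    return int(version.strip().split(".")[0])


def driver_matches_browser(browser_version: str, driver_version: str) -> bool:
    """True when the browser and the driver share the same major version."""
    return major_version(browser_version) == major_version(driver_version)


# Hypothetical version strings, used only to show the comparison.
print(driver_matches_browser("109.0.5414.120", "109.0.5414.74"))  # True
print(driver_matches_browser("109.0.5414.120", "110.0.5481.30"))  # False
```

Copy the browser version from Chrome's settings page and the driver version from the download listing, then compare the two before downloading.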

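3. Where to put the browser driver
Copy the downloaded chromedriver.exe into the Chrome installation directory, which is usually on the C: drive.
Chrome path: C:\Program Files\Google\Chrome\Application (as shown in the figure)

Put the driver in the Python installation directory **(this directory is for reference only)**
Python path: D:\Python_release\python3.10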
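
If you are not sure which copy of the driver will be picked up, one option is to locate it yourself and pass the path explicitly. This is a minimal sketch, assuming Selenium 4, where `Service(executable_path=...)` is available. `locate_chromedriver` and `start_chrome` are hypothetical helper names.

```python
import shutil
from pathlib import Path
from typing import Iterable, Optional


def locate_chromedriver(extra_dirs: Iterable[str] = (),
                        name: str = "chromedriver.exe") -> Optional[str]:
    """Return the first driver found on PATH or in extra_dirs, else None."""
    found = shutil.which(name)  # search the directories listed in PATH
    if found:
        return found
    for directory in extra_dirs:
        candidate = Path(directory) / name
        if candidate.is_file():
            return str(candidate)
    return None


def start_chrome(driver_path: str):
    """Start Chrome with an explicit driver path (Selenium 4 style)."""
    # Imported lazily so the lookup helper also works without selenium.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    return webdriver.Chrome(service=Service(executable_path=driver_path))
```

With the directories from this article, the call would look like `locate_chromedriver([r"D:\Python_release\python3.10"])`. If it returns a path, pass that path to `start_chrome`.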

4. Test that it works
Run the following code in PyCharm. It should open Chrome automatically and load the Baidu home page.

from selenium import webdriver

driver = webdriver.Chrome()  # start Chrome through chromedriver
url = 'https://www.baidu.com'
driver.get(url)  # open the Baidu home page
driver.maximize_window()
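
The same check can be wrapped in a small function that always shuts the browser down, so a failed page load does not leave a stray Chrome or chromedriver process behind. This is only a sketch. `smoke_test` is a hypothetical name, and the function takes any object that offers `get`, `maximize_window`, `title` and `quit`, which a selenium WebDriver does.

```python
def smoke_test(driver, url: str) -> str:
    """Open url, return the page title, and always shut the browser down."""
    try:
        driver.get(url)
        driver.maximize_window()
        return driver.title
    finally:
        driver.quit()  # ends the browser session and the driver process
```

With selenium installed, `smoke_test(webdriver.Chrome(), 'https://www.baidu.com')` would return the page title if the driver and browser are set up correctly.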

5. Scraping the trending searches

import time
from selenium.webdriver import Chrome, ChromeOptions
from selenium.webdriver.common.by import By

option = ChromeOptions()
option.add_argument("--headless")  # run without a visible browser window
option.add_argument("--no-sandbox")  # disable the sandbox (needed when running as root)
url = 'https://top.baidu.com/board?tab=realtime'
browser = Chrome(options=option)  # create the browser object, which launches Chrome
browser.get(url)
# Use selenium to automate a click
button = browser.find_element(By.CSS_SELECTOR, '#sanRoot > main > div.container.right-container_2EFJr > div > div:nth-child(2) > div:nth-child(31) > div.content_1YWBm > div.hot-desc_1m_jR.small_Uvkd3 > a')
button.click()  # click "more"
time.sleep(2)  # wait 2 seconds
# print(browser.page_source)  # get the page source
# content = browser.find_elements_by_xpath('/html/body/div/div/main/div[2]/div/div[2]/div/div[2]/a/div[1]')  # syntax used by older selenium versions
content = browser.find_elements(By.XPATH, '/html/body/div/div/main/div[2]/div/div[2]/div/div[2]/a/div[1]')  # select by the page's XPath to get the trending terms
number = browser.find_elements(By.XPATH, '//*[@id="sanRoot"]/main/div[2]/div/div[2]/div/div[1]/div[2]')  # get the heat value of each trending term
for i in content:  # both results are lists, so iterate with a for loop
    print(i.text)
browser.close()  # close the browser
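
The script fetches the heat values into `number` but only prints the terms. One possible way to print both side by side is sketched below. `pair_hot_searches` is a hypothetical helper that accepts any objects with a `.text` attribute, and the demo objects and their values are made-up placeholders rather than real scraped data.

```python
from types import SimpleNamespace
from typing import Iterable, List, Tuple


def pair_hot_searches(titles: Iterable, heats: Iterable) -> List[Tuple[str, str]]:
    """Pair each title element with its heat element, using their .text."""
    # zip stops at the shorter sequence, so unmatched trailing items are dropped.
    return [(t.text.strip(), h.text.strip()) for t, h in zip(titles, heats)]


# Stand-ins for WebElement objects, with placeholder text.
demo_titles = [SimpleNamespace(text=" topic A "), SimpleNamespace(text="topic B")]
demo_heats = [SimpleNamespace(text="4900000"), SimpleNamespace(text="4800000")]
for title, heat in pair_hot_searches(demo_titles, demo_heats):
    print(title, heat)
```

In the script above, the call would be `pair_hot_searches(content, number)`, made before the browser is closed, because element text can no longer be read afterwards.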

Problems encountered

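1. The Chrome driver starts and then exits on its own without reporting any error (that is, the browser flashes open and closes during the usability test above).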

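Problem	Solution
1. chromedriver does not match the browser version	Check the Chrome version number in Settings and download the driver with the closest version number, making sure it is not newer than the browser
2. selenium version is too high	Check whether selenium is 4.8.0; if so, uninstall it and switch to version 4.4.3 or lower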

2. To downgrade selenium, open PyCharm's Settings, go to Project: -> Python Interpreter, find selenium and double-click it, then switch the version as shown in the figure.
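
Before switching versions, it can help to confirm which selenium version the current interpreter actually has. The sketch below uses the standard-library `importlib.metadata`. The helper names are hypothetical, and the 4.8.0 threshold is simply the version this article flags, not an official cut-off.

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '4.8.0' into (4, 8, 0); parsing stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def installed_selenium_version() -> Optional[str]:
    """Return the installed selenium version, or None if it is not installed."""
    try:
        return metadata.version("selenium")
    except metadata.PackageNotFoundError:
        return None


version = installed_selenium_version()
if version is None:
    print("selenium is not installed in this interpreter")
elif parse_version(version) >= (4, 8, 0):
    print(f"selenium {version}: at or above 4.8.0, the version this article flags")
else:
    print(f"selenium {version}: below 4.8.0")
```

If you would rather use the terminal than the PyCharm dialog, `pip install selenium==4.1.0` pins one of the versions mentioned in this article.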

3. The run fails with "'WebDriver' object has no attribute 'find_element_by_xpath'"
Solution:
1. Downgrade selenium to 3.141.0 (try one of the 3.x versions) or to 4.1.0.
2. Change the call to find_element(By.XPATH, "XPath expression").

4. The run fails with "'WebDriver' object has no attribute 'find_element_by_css_selector'"
This is similar to the previous problem.
1. Change the call to find_element(By.CSS_SELECTOR, "CSS selector").
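
Both errors come from the same change: newer Selenium 4 releases no longer provide the `find_element_by_*` methods, so each old call has to become `find_element(By.<STRATEGY>, value)`. The sketch below maps old method names to the new form. The helper names are hypothetical, and the strings in the table are the values of the `By` constants written out by hand (for example `By.CSS_SELECTOR` is `"css selector"`), so the sketch runs without selenium installed.

```python
# String values of selenium.webdriver.common.by.By, spelled out by hand.
LEGACY_SUFFIX_TO_BY = {
    "id": "id",
    "name": "name",
    "xpath": "xpath",
    "css_selector": "css selector",
    "class_name": "class name",
    "tag_name": "tag name",
    "link_text": "link text",
    "partial_link_text": "partial link text",
}


def modern_call(legacy_name: str):
    """Map e.g. 'find_element_by_xpath' to ('find_element', 'xpath')."""
    for method in ("find_elements", "find_element"):
        prefix = method + "_by_"
        if legacy_name.startswith(prefix):
            return method, LEGACY_SUFFIX_TO_BY[legacy_name[len(prefix):]]
    raise ValueError(f"not a legacy locator method: {legacy_name}")


def find_compat(driver, legacy_name: str, value: str):
    """Use the legacy method if the driver still has it, else the new form."""
    if hasattr(driver, legacy_name):
        return getattr(driver, legacy_name)(value)
    method, by = modern_call(legacy_name)
    return getattr(driver, method)(by, value)
```

In real code it is usually cleaner to rewrite the calls once using `By` than to keep a shim like `find_compat`. The mapping is mainly meant as a lookup table for that rewrite.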

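5. Pay attention to whether you want a single element (find_element) or several (find_elements)

When looking for only one element, you can use either find_element() or find_elements().
When looking for several elements, you can only use find_elements(), which returns a list whose items are all WebElement node objects.
If the target appears only once on the page, find_element() is perfectly fine. If several nodes match, find_element() returns only the first one (without raising an error), so find_elements() is the better choice when you need several nodes.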
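
One more difference matters in practice: when nothing matches, find_element() raises NoSuchElementException, while find_elements() returns an empty list. The sketch below builds two small helpers on top of find_elements(). Both names are hypothetical, and `driver` can be any object with a selenium-style `find_elements(by, value)` method.

```python
def first_or_none(driver, by: str, value: str):
    """Return the first matching element, or None when nothing matches."""
    matches = driver.find_elements(by, value)  # always a list, possibly empty
    return matches[0] if matches else None


def all_texts(driver, by: str, value: str):
    """Return the .text of every matching element as a list of strings."""
    return [element.text for element in driver.find_elements(by, value)]
```

For example, `first_or_none(browser, By.CSS_SELECTOR, 'a')` lets the script check for None when an optional button is absent, instead of wrapping find_element() in a try/except.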