Web Scraping for Beginners: Series Notes

hushi19910329

Last modified 2024-07-06 23:16:30

Article tags: web scraping notes, python, selenium

First published 2024-07-06 22:13:57

Article link: https://blog.csdn.net/hushi19910329/article/details/140236508

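I have recently been running some introductory training sessions, so I am recording here some of the resources, materials, and takeaways from them.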

Contents

- Tutorials and related reading
- Using ChatGPT to help write scraper code

Tutorials and related reading

A fairly simple, easy-to-follow tutorial on bilibili: https://www.bilibili.com/video/BV1Z4411o7TA/?spm_id_from=333.337.search-card.all.click&vd_source=6f87e8e80d1bc75a823b552179074fe3
A decent tutorial on CSS selectors: https://www.w3cschool.cn/css/css-selector.html
A tutorial on XPath selectors: http://v3.byhy.net/tut/auto/selenium/xpath_1/
Element-locating strategies used in selenium, with examples: https://blog.csdn.net/hushi19910329/article/details/140236764?spm=1001.2014.3001.5501
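
The XPath syntax these tutorials cover can be tried out without launching a browser. The sketch below is only an illustration: it uses Python's standard-library `xml.etree.ElementTree`, which supports a limited subset of XPath and is not what selenium uses to query a page, and the markup in `PAGE` is made up rather than taken from any real site.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up page fragment (hypothetical markup, not a real site's HTML).
PAGE = """
<html>
  <body>
    <div id="app">
      <form>
        <span><input name="word"/></span>
        <span><input type="submit" value="search"/></span>
      </form>
    </div>
    <ul>
      <li>cat-1</li>
      <li>cat-2</li>
      <li>cat-3</li>
    </ul>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# Attribute predicate: any element whose id is "app", at any depth.
app = root.find(".//*[@id='app']")

# Positional index: XPath positions start at 1, so span[2] is the second span.
submit = root.find("./body/div/form/span[2]/input")

# With an index only one <li> matches; without it, every <li> under the <ul> matches.
first_item = root.findall("./body/ul/li[1]")
all_items = root.findall("./body/ul/li")

print(app.tag, submit.get("type"), [li.text for li in all_items])
```

The last two queries show the difference between addressing one list item and addressing all of them, which is the usual way to turn a copied single-element XPath into one that returns a whole list.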

Using ChatGPT to help write scraper code

For example, I wrote the following prompt:

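I need to use selenium to scrape a website, locating every element by xpath, and batch-download images.
The chrome executable is at ./driver/chrome.exe
The driver is at ./driver/chromedriver.exe
Start selenium and open the browser
Open the page:
https://image.baidu.com/
Locate: //*[@id="app"]/div/div[1]/div/div[3]/div[2]/div[1]/form/span[1]/input
Type: 猫咪
Locate and click to start the search: /html/body/div/div/div[1]/div/div[3]/div[2]/div[1]/form/span[3]/input
On the new page, locate all elements of this kind:
/html/body/div[5]/div[2]/div[4]/div[1]/ul/li[1]
/html/body/div[5]/div[2]/div[4]/div[1]/ul/li[2]
...
For each element: click them one by one, which opens the image in a new page; then, on that new page, find the element: /html/body/div[1]/div[2]/div/div[2]/div/div[1]/span[7]
This is a download button; clicking it brings up a download dialog. I need to get the link produced by the click. Give me example code: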

**ChatGPT's response:**

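To batch-download images with Selenium, you first open the Chrome browser and then carry out a series of actions: type the search term, click the search button, click each image, and download it. The example code below demonstrates these steps.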

First, make sure the path to your Chrome driver is correct and that the driver version matches your Chrome browser version.

Example code

import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Configure the ChromeDriver service and the Chrome binary location
chrome_service = Service('./driver/chromedriver.exe')
options = webdriver.ChromeOptions()
options.binary_location = './driver/chrome.exe'

# Launch Chrome
driver = webdriver.Chrome(service=chrome_service, options=options)

# Open Baidu Images
driver.get("https://image.baidu.com/")
time.sleep(2)

# Locate the search box and type the keyword "猫咪" (cats)
search_box = driver.find_element(By.XPATH, '//*[@id="app"]/div/div[1]/div/div[3]/div[2]/div[1]/form/span[1]/input')
search_box.send_keys("猫咪")
time.sleep(1)

# Locate and click the search button
search_button = driver.find_element(By.XPATH, '/html/body/div/div/div[1]/div/div[3]/div[2]/div[1]/form/span[3]/input')
search_button.click()
time.sleep(5)  # wait for the search results to load

# Locate every result item: omitting the [n] index on li matches all of them
image_elements = driver.find_elements(By.XPATH, '/html/body/div[5]/div[2]/div[4]/div[1]/ul/li')

# Remember the results window so we can always return to it
results_window = driver.current_window_handle

# Visit each result item in turn
for image_element in image_elements:
    # Click the image to open its detail page
    image_element.click()
    time.sleep(3)  # wait for the detail page to load

    # If no new window opened, skip this item rather than close the results page
    if len(driver.window_handles) < 2:
        continue

    # Switch to the newly opened window
    driver.switch_to.window(driver.window_handles[-1])

    try:
        # Locate the download button and click it
        download_button = driver.find_element(By.XPATH, '/html/body/div[1]/div[2]/div/div[2]/div/div[1]/span[7]')
        download_button.click()
        time.sleep(2)  # wait for the download dialog to appear

        # Read the download link; the button is a span, so href may be None
        download_link = download_button.get_attribute('href')
        print(f"Download link: {download_link}")

        # The file itself could be fetched here, for example with the requests library
        # import requests
        # response = requests.get(download_link)
        # with open('path_to_save_image.jpg', 'wb') as file:
        #     file.write(response.content)
    except Exception as e:
        print(f"Error: {e}")

    # Close the detail window and return to the search results
    driver.close()
    driver.switch_to.window(results_window)

# Quit the browser
driver.quit()
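
The commented-out download step in the loop writes every image to the same path, so each file would overwrite the previous one. Below is a minimal sketch of one way to fill that step in, assuming the link obtained from the page is a direct image URL. It swaps the `requests` call for the standard-library `urllib` so that it runs without extra packages; `filename_for`, `save_image`, the `image_0001`-style naming scheme, and the `User-Agent` header are all hypothetical choices made for illustration, not part of the generated script.

```python
import os
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import Request, urlopen


def filename_for(url: str, index: int, default_ext: str = ".jpg") -> str:
    """Build a numbered file name, keeping the URL's extension when it is a known image type."""
    ext = os.path.splitext(urlparse(url).path)[1].lower()
    if ext not in {".jpg", ".jpeg", ".png", ".gif", ".webp"}:
        ext = default_ext
    return f"image_{index:04d}{ext}"


def save_image(url: str, dest_dir: str, index: int, timeout: float = 10.0) -> Path:
    """Download one image and return the path it was written to."""
    target = Path(dest_dir) / filename_for(url, index)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Assumption: some image hosts reject requests that lack a browser-like User-Agent.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        target.write_bytes(response.read())
    return target
```

Inside the loop this could be called as `save_image(download_link, "./images", i)`, with `i` coming from `enumerate(image_elements, start=1)` and only when `download_link` is not `None`.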