When training a deep-learning visual recognition model, we need to feed the computer a large dataset of photos, so we often crawl images from search sites such as Baidu and Bing.
Here I provide a Python crawler that scrapes Baidu Images.
First, you must install the dependencies the crawler needs, such as selenium, beautifulsoup4, and requests. You also need to download a ChromeDriver build that matches your Chrome version. Download link:
CNPM Binaries Mirror (npmmirror.com)
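Before running the crawler, it helps to confirm the three packages are actually importable. The sketch below (the helper name `missing_modules` is my own, not part of the original script) checks availability without importing anything:

```python
import importlib.util

def missing_modules(names):
    """Return the subset of top-level module names that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# An empty list means every dependency is installed.
print(missing_modules(["selenium", "bs4", "requests"]))
```

Note that beautifulsoup4 is imported as `bs4`, which is why that name appears in the check.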
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
from bs4 import BeautifulSoup
# Create a Chrome browser instance; Selenium 4.6+ locates ChromeDriver automatically
driver = webdriver.Chrome()
# Alternatively, point at an explicit ChromeDriver path:
# driver_path = 'path/to/chromedriver'
# driver = webdriver.Chrome(executable_path=driver_path)
# Open the Baidu Images page
driver.get('https://image.baidu.com/')
# Locate the search box and type the keyword
search_box = driver.find_element(By.XPATH, "//input[@id='kw']")
search_box.send_keys("小麦")  # replace with the keyword you want to search
# Locate the search button and click it
search_button = driver.find_element(By.XPATH, "//input[@class='s_newBtn']")
search_button.click()
# Scroll repeatedly so that more images lazy-load
for i in range(40):  # adjust the number of scrolls as needed
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # wait for the page to load
page = driver.page_source
soup = BeautifulSoup(page, 'html.parser')
# Find every element that carries a src attribute
elements_with_src = soup.find_all(src=True)
# Collect the src values
urls = []
for element in elements_with_src:
    urls.append(element['src'])
# Keep only links with the expected prefix
filtered_urls = []
prefix = "https://img"
for url in urls:
    if url.startswith(prefix):
        filtered_urls.append(url)
# Print the collected links
for url in filtered_urls:
    print(url)
driver.quit()
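The collect-and-filter loops above can be condensed into a single helper that also drops duplicate links, since lazy-loaded pages often repeat the same thumbnails. This is a sketch; the function name `filter_image_urls` is my own, and the `https://img` prefix is the same heuristic used above:

```python
def filter_image_urls(urls, prefix="https://img"):
    """Keep only unique URLs starting with prefix, preserving first-seen order."""
    seen = set()
    out = []
    for u in urls:
        if u.startswith(prefix) and u not in seen:
            seen.add(u)
            out.append(u)
    return out

# Placeholder URLs, for illustration only: the data: URI and the duplicate are dropped.
sample = ["https://img.example.com/1.jpg", "data:image/png;base64,AAAA",
          "https://img.example.com/1.jpg"]
print(filter_image_urls(sample))
```

Deduplicating here also saves bandwidth in the download step, since each link is fetched only once.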
import requests
import os
# Create the folder that will hold the downloaded images
folder_path = r'E:\meeee\data111'
if not os.path.exists(folder_path):
    os.makedirs(folder_path)
# Download each image in turn
for i, url in enumerate(filtered_urls):
    image_name = f'image_{i}.jpg'  # adjust the naming scheme as needed
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # check that the request succeeded
        image_path = os.path.join(folder_path, image_name)
        with open(image_path, 'wb') as file:
            file.write(response.content)
        print(f'Downloaded {image_name}')
    except Exception as e:
        print(f'Failed to download {image_name}: {e}')
        continue
print('All images downloaded')
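The loop above names every file `.jpg` even when the source is a PNG or WebP. A small helper can pick the extension from the URL path instead; this is a sketch (the name `image_filename` and the extension whitelist are my own choices, and the example URLs are placeholders):

```python
import os
from urllib.parse import urlparse

def image_filename(index, url, default_ext=".jpg"):
    """Build image_<index><ext>, taking <ext> from the URL path when recognizable."""
    ext = os.path.splitext(urlparse(url).path)[1].lower()
    if ext not in {".jpg", ".jpeg", ".png", ".gif", ".webp"}:
        ext = default_ext  # fall back when the URL has no usable extension
    return f"image_{index}{ext}"

print(image_filename(0, "https://img.example.com/a.png"))  # image_0.png
print(image_filename(1, "https://img.example.com/pic"))    # image_1.jpg
```

Keeping the real extension matters later: some training pipelines infer the image decoder from the file suffix.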
If this worked for you, please give it a little like!