When training a deep-learning visual recognition model, we need to feed the computer a large dataset of photos, so we often crawl images from search sites such as Baidu and Bing.
Here I provide a Python crawler that scrapes Baidu Images.
First, you must install the dependencies the crawler needs, such as selenium, beautifulsoup4, and requests. You also need to download a ChromeDriver build that matches your Chrome version. Download link:
CNPM Binaries Mirror (npmmirror.com)
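Before running the crawler, it helps to confirm the three packages are actually importable. The sketch below (the helper name `missing_modules` is my own, not part of the original script) checks availability without importing anything:

```python
import importlib.util

def missing_modules(names):
    """Return the subset of top-level module names that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# An empty list means every dependency is installed.
print(missing_modules(["selenium", "bs4", "requests"]))
```

Note that beautifulsoup4 is imported as `bs4`, which is why that name appears in the check.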
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
from bs4 import BeautifulSoup
# Create a Chrome browser instance; Selenium 4.6+ locates ChromeDriver automatically
driver = webdriver.Chrome()
# Alternatively, point at an explicit ChromeDriver path:
# driver_path = 'path/to/chromedriver'
# driver = webdriver.Chrome(executable_path=driver_path)
# Open the Baidu Images page
driver.get('https://image.baidu.com/')
# Locate the search box and type the keyword
search_box = driver.find_element(By.XPATH, "//input[@id='kw']")
search_box.send_keys("小麦")  # replace with the keyword you want to search
# Locate the search button and click it
search_button = driver.find_element(By.XPATH, "//input[@class='s_newBtn']")
search_button.click()
# Scroll repeatedly so that more images lazy-load
for i in range(40):  # adjust the number of scrolls as needed
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # wait for the page to load
page = driver.page_source
soup = BeautifulSoup(page, 'html.parser')
# Find every element that carries a src attribute
elements_with_src = soup.find_all(src=True)
# Collect the src values
urls = []
for element in elements_with_src:
    urls.append(element['src'])
# Keep only links with the expected prefix
filtered_urls = []
prefix = "https://img"
for url in urls:
    if url.startswith(prefix):
        filtered_urls.append(url)
# Print the collected links
for url in filtered_urls:
    print(url)
driver.quit()
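The collect-and-filter loops above can be condensed into a single helper that also drops duplicate links, since lazy-loaded pages often repeat the same thumbnails. This is a sketch; the function name `filter_image_urls` is my own, and the `https://img` prefix is the same heuristic used above:

```python
def filter_image_urls(urls, prefix="https://img"):
    """Keep only unique URLs starting with prefix, preserving first-seen order."""
    seen = set()
    out = []
    for u in urls:
        if u.startswith(prefix) and u not in seen:
            seen.add(u)
            out.append(u)
    return out

# Placeholder URLs, for illustration only: the data: URI and the duplicate are dropped.
sample = ["https://img.example.com/1.jpg", "data:image/png;base64,AAAA",
          "https://img.example.com/1.jpg"]
print(filter_image_urls(sample))
```

Deduplicating here also saves bandwidth in the download step, since each link is fetched only once.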
import requests
import os
# Create the folder that will hold the downloaded images
folder_path = r'E:\meeee\data111'
if not os.path.exists(folder_path):
    os.makedirs(folder_path)
# Download each image in turn
for i, url in enumerate(filtered_urls):
    image_name = f'image_{i}.jpg'  # adjust the naming scheme as needed
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # check that the request succeeded
        image_path = os.path.join(folder_path, image_name)
        with open(image_path, 'wb') as file:
            file.write(response.content)
        print(f'Downloaded {image_name}')
    except Exception as e:
        print(f'Failed to download {image_name}: {e}')
        continue
print('All images downloaded')
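The loop above names every file `.jpg` even when the source is a PNG or WebP. A small helper can pick the extension from the URL path instead; this is a sketch (the name `image_filename` and the extension whitelist are my own choices, and the example URLs are placeholders):

```python
import os
from urllib.parse import urlparse

def image_filename(index, url, default_ext=".jpg"):
    """Build image_<index><ext>, taking <ext> from the URL path when recognizable."""
    ext = os.path.splitext(urlparse(url).path)[1].lower()
    if ext not in {".jpg", ".jpeg", ".png", ".gif", ".webp"}:
        ext = default_ext  # fall back when the URL has no usable extension
    return f"image_{index}{ext}"

print(image_filename(0, "https://img.example.com/a.png"))  # image_0.png
print(image_filename(1, "https://img.example.com/pic"))    # image_1.jpg
```

Keeping the real extension matters later: some training pipelines infer the image decoder from the file suffix.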
If this worked for you, please give it a little like!