Object detection with the YOLOv3 framework: crawling Baidu and Google Images to build a VOC-format dataset

Latest recommended article published on 2023-07-11 14:35:05

shine stone

Categories: Deep Learning, YOLO object detection. Tags: VOC dataset, COCO, deep learning datasets, YOLOv3

Article link: https://blog.csdn.net/hu_helloworld/article/details/103044737


The image data comes from Baidu Images and Google Images.

Partly based on this article: https://blog.csdn.net/wobeatit/article/details/79559314

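Because Google Images tends to return higher-quality results, method 1 is recommended:
use the googleimagesdownload tool to crawl Google Images.
This requires a proxy that can reach Google Images; a browser extension or a self-hosted server on Amazon AWS can provide one.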

The methods below have only been tested on Ubuntu.

1. Crawling Google Images on Ubuntu with the google-images-download tool

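If you have a proxy and can reach Google Images, this approach is stable and highly recommended.
The official documentation is as follows:
Project page: googleimagesdownload
Installation: installing googleimagesdownload
Usage: usage examples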

The tool can be installed directly with pip:

pip install google_images_download

googleimagesdownload -k "灭火器箱" --size medium -l 1000 --chromedriver ./chromedriver

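Command-line arguments:
-k "keywords to search for"
--size the image size to request, e.g. medium
-l the maximum number of images to download
--chromedriver the path to the Chrome driver
There are many tutorials on downloading and installing the Chrome driver, for example: https://blog.csdn.net/qq_41188944/article/details/79039690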
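
When several search terms are needed, the same command can be assembled and launched from Python. The following is a minimal sketch, assuming the googleimagesdownload executable is on the PATH; it uses only the flags described above, and the helper names build_command and download_all are hypothetical, introduced here for illustration.

```python
import subprocess


def build_command(keyword, limit=1000, size="medium", chromedriver="./chromedriver"):
    """Build the googleimagesdownload command line for one search keyword."""
    return [
        "googleimagesdownload",
        "-k", keyword,
        "--size", size,
        "-l", str(limit),
        "--chromedriver", chromedriver,
    ]


def download_all(keywords, **kwargs):
    """Run the tool once per keyword and return each keyword's exit code."""
    results = {}
    for keyword in keywords:
        # check=False: a failed keyword should not abort the remaining ones.
        completed = subprocess.run(build_command(keyword, **kwargs), check=False)
        results[keyword] = completed.returncode
    return results


# Example call (not executed here; the second keyword is only an illustration):
# download_all(["灭火器箱", "fire extinguisher cabinet"], limit=500)
```

Passing the arguments as a list avoids shell quoting problems with non-ASCII or multi-word keywords.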

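With -l 1000 as above, the download is capped at 1000 images; because of image copyright restrictions, only 409 were actually downloaded. You can change the search term and download again.
Google Images results are of fairly high quality and can largely be treated as already matching the search term.
For example: [image]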

The tool creates a downloads folder in the directory the terminal is running in and stores the crawled images there.
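
Downloading again with a different search term may fetch some of the same files a second time. The sketch below is one possible way to clean this up, assuming the images sit somewhere under a single root folder; it removes exact byte-for-byte duplicates by comparing SHA-256 digests. The function names are hypothetical, and resized or re-encoded copies of the same picture are not detected.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def remove_duplicates(root):
    """Delete files under root whose bytes match an earlier file.

    Files are visited in sorted path order, so the first path of each
    group of identical files is kept. Returns the list of removed paths.
    """
    seen = {}
    removed = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        digest = file_digest(path)
        if digest in seen:
            path.unlink()
            removed.append(path)
        else:
            seen[digest] = path
    return removed


# Example call (not executed here): remove_duplicates("downloads")
```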

2. Crawling Baidu Images with a Python script

(1) Install the Chrome browser and the Chrome driver
Chrome driver installation: https://blog.csdn.net/qq_41188944/article/details/79039690

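(2) Install the selenium library with pip install selenium. The script also imports bs4, so beautifulsoup4 needs to be installed as well.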

# ******* This script needs a local Chrome browser and the matching Chrome driver, plus the selenium and beautifulsoup4 libraries *******
import os
import re
import time
import urllib.request

from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
# ****************************************************
# Baidu Images search URL (left commented out, as in the original script):
#base_url_part1 = 'https://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&fm=index&fr=&hs=0&xthttps=111111&sf=1&fmq=&pv=&ic=0&nc=1&z=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&word='
#base_url_part2 = '&oq=bagua&rsp=0'
# Note: the active URL below points at Shutterstock search, not Baidu Images.
base_url_part1 = 'https://www.shutterstock.com/zh/search/'
base_url_part2 = ''  # base_url_part1 and base_url_part2 are fixed and need no changes
search_query = '灭火器'  # search keyword ("fire extinguisher"); change as needed
location_driver = '/usr/bin/chromedriver'  # path to the Chrome driver on this machine


class Crawler:
    def __init__(self):
        self.url = base_url_part1 + search_query + base_url_part2

    # Start the Chrome browser driver
    def start_browser(self):
        chrome_options = Options()
        chrome_options.add_argument("--disable-infobars")
        # Selenium 4 style; on Selenium 3 pass executable_path= and chrome_options= instead
        driver = webdriver.Chrome(service=Service(location_driver), options=chrome_options)
        # Maximize the window, because each pass only sees images inside the viewport
        driver.maximize_window()
        # Open the page to crawl
        driver.get(self.url)
        return driver

    def downloadImg(self, driver):
        t = time.localtime()
        # Folder named after today's date, e.g. 2019-11-13
        foldername = "{}-{}-{}".format(t.tm_year, t.tm_mon, t.tm_mday)
        picpath = '/home/hujinlei/dev/DataSet/BaiduImage/%s' % foldername  # local download directory
        # Create the directory if it does not exist yet
        os.makedirs(picpath, exist_ok=True)
        # Image URLs already handled, to avoid downloading the same URL twice
        seen_urls = set()
        x = 0  # running index, used as the file name
        pos = 0  # current scroll offset in pixels
        for i in range(80):  # number of scroll steps; adjust as needed (a value of 1 means almost no scrolling)
            pos += 500  # scroll down by 500 pixels each step
            js = "document.documentElement.scrollTop=%d" % pos
            driver.execute_script(js)
            time.sleep(2)  # give newly revealed images time to load
            # Fetch the current page source
            html_page = driver.page_source
            # Parse the page with BeautifulSoup
            soup = bs(html_page, "html.parser")
            # Collect <img> tags whose src is an https URL containing .jpg or .png
            imglist = soup.find_all('img', {'src': re.compile(r'https:.*\.(jpg|png)')})

            for imgurl in imglist:
                src = imgurl['src']
                if src in seen_urls:
                    continue
                seen_urls.add(src)
                target = '{}/{}.jpg'.format(picpath, x)
                try:
                    urllib.request.urlretrieve(src, target)
                except OSError as err:  # URLError and HTTPError are OSError subclasses
                    print("Skipping {}: {}".format(src, err))
                    continue
                x += 1

    def run(self):
        print('\t\t\t**************************************\n\t\t\t**\t\tWelcome to Use Spider\t\t**\n\t\t\t**************************************')
        driver = self.start_browser()
        try:
            self.downloadImg(driver)
        finally:
            # quit() closes the browser and ends the driver process even if a download step fails
            driver.quit()
        print("Download has finished.")


if __name__ == '__main__':
    craw = Crawler()
    craw.run()
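
The script above builds each file name as '{}/{}.jpg'.format(picpath, x), so an image whose URL matched the .png branch of the regular expression is still saved with a .jpg name. A small helper along the following lines could derive the extension from the URL instead. This is a sketch under assumptions: target_path and its default_ext parameter are hypothetical names, and the URLs in the comments are placeholders.

```python
import os
from urllib.parse import urlparse


def target_path(picpath, index, img_url, default_ext=".jpg"):
    """Build the local file path for an image, keeping a .jpg/.jpeg/.png extension from its URL."""
    # Look only at the path component, so query strings such as ?w=300 are ignored.
    ext = os.path.splitext(urlparse(img_url).path)[1].lower()
    if ext not in (".jpg", ".jpeg", ".png"):
        ext = default_ext
    return "{}/{}{}".format(picpath, index, ext)


# Inside the download loop, the target line could then read:
# target = target_path(picpath, x, imgurl['src'])
```

The file extension taken from a URL is only a hint; the server may still return a different format than the name suggests.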

3. How to batch-rename the downloaded images and build VOC, COCO and similar datasets
To be added later.
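
Until that part is written, the following is one possible sketch of the renaming step, not the author's method. It assumes the usual PASCAL VOC folder layout (JPEGImages, Annotations, ImageSets/Main) and six-digit image ids. The function name build_voc_images and the example paths are hypothetical. Images are copied rather than moved, no format conversion is done (PNG files keep their extension), and the Annotations folder is only created, since the XML labels still have to be produced with an annotation tool.

```python
import shutil
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png"}


def build_voc_images(src_dir, voc_root, start=1):
    """Copy images from src_dir into voc_root/JPEGImages under six-digit names.

    Also creates empty Annotations and ImageSets/Main folders and writes the
    image ids to ImageSets/Main/trainval.txt. Returns the list of ids.
    """
    voc_root = Path(voc_root)
    jpeg_dir = voc_root / "JPEGImages"
    sets_dir = voc_root / "ImageSets" / "Main"
    for folder in (jpeg_dir, sets_dir, voc_root / "Annotations"):
        folder.mkdir(parents=True, exist_ok=True)

    # Sort the sources so the numbering is reproducible between runs.
    sources = sorted(
        p for p in Path(src_dir).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTS
    )
    ids = []
    for number, src in enumerate(sources, start=start):
        image_id = "{:06d}".format(number)
        shutil.copy2(src, jpeg_dir / (image_id + src.suffix.lower()))
        ids.append(image_id)

    (sets_dir / "trainval.txt").write_text("".join(i + "\n" for i in ids))
    return ids


# Example call (not executed here; both paths are placeholders):
# build_voc_images("downloads/灭火器箱", "VOCdevkit/VOC2007")
```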
