Here is a very handy open-source project for crawling image data for engineering projects: https://github.com/sczhengyabin/Image-Downloader
The project has one shortcoming: to collect more images for a project, you often have to rerun the crawl with different keywords, and the results then contain many duplicate images. A deduplication strategy is therefore needed.
One simple approach is to record the URL of every image that has already been crawled, and skip any URL that has been seen before.
Modify the `download_images` function in `downloader.py` so that it records the URLs it has crawled:
```python
import os
import concurrent.futures
# downloader.py already imports these, plus the download_image helper


def download_images(image_urls, dst_dir, file_prefix="img", concurrency=50, timeout=20, proxy_type=None, proxy=None):
    """
    Download image according to given urls and automatically rename them in order.
    :param image_urls: list of image urls
    :param dst_dir: output the downloaded images to dst_dir
    :param file_prefix: if set to "img", files will be in format "img_xxx.jpg"
    :param concurrency: number of requests processed simultaneously
    :param timeout: per-request timeout in seconds
    :param proxy_type: proxy type
    :param proxy: proxy address
    :return: none
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as executor:
        future_list = list()
        count = 0
        if not os.path.exists(dst_dir):
            os.makedirs(dst_dir)
        for image_url in image_urls:
            # Deduplicate: skip any URL already recorded in url.txt
            url_file = './url.txt'
            if not os.path.exists(url_file):
                with open(url_file, 'w') as f:
                    f.write(image_url + '\n')
            else:
                with open(url_file, 'r') as f:
                    url_list = f.readlines()
                if image_url + '\n' in url_list:
                    print('## already downloaded!! ' + image_url)
                    continue
                else:
                    with open(url_file, 'a') as f:
                        f.write(image_url + '\n')
            file_name = file_prefix + "_" + "%04d" % count
            future_list.append(executor.submit(
                download_image, image_url, dst_dir, file_name, timeout, proxy_type, proxy))
            count += 1
        concurrent.futures.wait(future_list, timeout=180)
```
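Note that the snippet above reopens and rereads `url.txt` for every URL, which gets slow as the file grows. A possible refinement (a sketch of my own, not code from the project; the helper names `load_seen_urls` and `mark_seen` are hypothetical) is to load the seen URLs into a set once and only append to the file afterwards:

```python
import os


def load_seen_urls(url_file='./url.txt'):
    """Load previously crawled URLs into a set for O(1) membership tests."""
    if not os.path.exists(url_file):
        return set()
    with open(url_file, 'r') as f:
        return {line.strip() for line in f}


def mark_seen(image_url, seen, url_file='./url.txt'):
    """Record a newly crawled URL both in memory and on disk."""
    seen.add(image_url)
    with open(url_file, 'a') as f:
        f.write(image_url + '\n')
```

Inside the download loop you would then write `if image_url in seen: continue`, and otherwise call `mark_seen(image_url, seen)` before submitting the download task.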
The code above uses several different file-open modes; for the differences between them, see: https://www.runoob.com/python/python-func-open.html
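As a quick illustration of the three modes used above ('w' creates or truncates, 'a' appends, 'r' reads and fails if the file is missing):

```python
import os
import tempfile

# A throwaway file just for this demo
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

with open(path, 'w') as f:  # 'w': create the file, truncating it if it exists
    f.write('first\n')
with open(path, 'a') as f:  # 'a': append at the end without truncating
    f.write('second\n')
with open(path, 'r') as f:  # 'r': read-only; raises FileNotFoundError if missing
    lines = f.readlines()

print(lines)  # ['first\n', 'second\n']
```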