Here is a very handy open-source project for crawling image data for engineering projects: https://github.com/sczhengyabin/Image-Downloader
The project has one shortcoming: to collect more images for a project, you often have to rerun the crawl with different keywords, and the results then contain many duplicate images. A deduplication strategy is therefore needed.
One simple approach is to record the URL of every image that has already been crawled, and skip any URL that has been seen before.
Modify the `download_images` function in `downloader.py` so that it records the URLs it has crawled:
```python
import os
import concurrent.futures
# downloader.py already imports these, plus the download_image helper


def download_images(image_urls, dst_dir, file_prefix="img", concurrency=50, timeout=20, proxy_type=None, proxy=None):
    """
    Download image according to given urls and automatically rename them in order.
    :param image_urls: list of image urls
    :param dst_dir: output the downloaded images to dst_dir
    :param file_prefix: if set to "img", files will be in format "img_xxx.jpg"
    :param concurrency: number of requests processed simultaneously
    :param timeout: per-request timeout in seconds
    :param proxy_type: proxy type
    :param proxy: proxy address
    :return: none
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as executor:
        future_list = list()
        count = 0
        if not os.path.exists(dst_dir):
            os.makedirs(dst_dir)
        for image_url in image_urls:
            # Deduplicate: skip any URL already recorded in url.txt
            url_file = './url.txt'
            if not os.path.exists(url_file):
                with open(url_file, 'w') as f:
                    f.write(image_url + '\n')
            else:
                with open(url_file, 'r') as f:
                    url_list = f.readlines()
                if image_url + '\n' in url_list:
                    print('## already downloaded!! ' + image_url)
                    continue
                else:
                    with open(url_file, 'a') as f:
                        f.write(image_url + '\n')
            file_name = file_prefix + "_" + "%04d" % count
            future_list.append(executor.submit(
                download_image, image_url, dst_dir, file_name, timeout, proxy_type, proxy))
            count += 1
        concurrent.futures.wait(future_list, timeout=180)
```
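Note that the snippet above reopens and rereads `url.txt` for every URL, which gets slow as the file grows. A possible refinement (a sketch of my own, not code from the project; the helper names `load_seen_urls` and `mark_seen` are hypothetical) is to load the seen URLs into a set once and only append to the file afterwards:

```python
import os


def load_seen_urls(url_file='./url.txt'):
    """Load previously crawled URLs into a set for O(1) membership tests."""
    if not os.path.exists(url_file):
        return set()
    with open(url_file, 'r') as f:
        return {line.strip() for line in f}


def mark_seen(image_url, seen, url_file='./url.txt'):
    """Record a newly crawled URL both in memory and on disk."""
    seen.add(image_url)
    with open(url_file, 'a') as f:
        f.write(image_url + '\n')
```

Inside the download loop you would then write `if image_url in seen: continue`, and otherwise call `mark_seen(image_url, seen)` before submitting the download task.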
The code above uses several different file-open modes; for the differences between them, see: https://www.runoob.com/python/python-func-open.html
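As a quick illustration of the three modes used above ('w' creates or truncates, 'a' appends, 'r' reads and fails if the file is missing):

```python
import os
import tempfile

# A throwaway file just for this demo
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

with open(path, 'w') as f:  # 'w': create the file, truncating it if it exists
    f.write('first\n')
with open(path, 'a') as f:  # 'a': append at the end without truncating
    f.write('second\n')
with open(path, 'r') as f:  # 'r': read-only; raises FileNotFoundError if missing
    lines = f.readlines()

print(lines)  # ['first\n', 'second\n']
```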