Tutorial: Using the Open-Source Project `images-web-crawler`

束斯畅Sharon

Published on 2024-09-02 08:09:26

Article link: https://blog.csdn.net/gitblog_00652/article/details/141800215

images-web-crawler: This package is a complete tool for creating a large dataset of images (designed especially, but not only, for machine learning enthusiasts). It can crawl the web, download images, rename / resize / convert the images, and merge folders. Project URL: https://gitcode.com/gh_mirrors/im/images-web-crawler

1. Project directory structure and overview

images-web-crawler/
├── LICENSE
├── README.md
├── dataset_builder.py
├── images_downloader.py
├── sample.py
└── web_crawler.py

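LICENSE: The project's license file; the project is released under GPL-3.0.
README.md: The project's documentation, covering a basic introduction and usage instructions.
dataset_builder.py: Script for building the dataset.
images_downloader.py: Script for downloading images.
sample.py: Example script showing how to use the project's features.
web_crawler.py: Core script that crawls the web and collects image links.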
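
The project summary mentions renaming images as one of the dataset-building steps. As a rough illustration of what such a step can look like, here is a minimal standard-library sketch. It does not use or reproduce the code in dataset_builder.py; the function name `copy_with_sequential_names`, the `prefix` parameter, and the list of image extensions are all hypothetical choices made for this example.

```python
import shutil
from pathlib import Path

# Hypothetical set of extensions treated as images in this sketch.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp"}


def copy_with_sequential_names(src_folder, dst_folder, prefix="img"):
    """Copy every image in src_folder to dst_folder as prefix_0001.ext, prefix_0002.ext, ...

    Files are processed in sorted order so the numbering is reproducible.
    Existing files in dst_folder that share a generated name are overwritten.
    """
    src = Path(src_folder)
    dst = Path(dst_folder)
    dst.mkdir(parents=True, exist_ok=True)

    # Keep only regular files whose extension looks like an image.
    images = sorted(
        p for p in src.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )

    copied = []
    for index, path in enumerate(images, start=1):
        target = dst / f"{prefix}_{index:04d}{path.suffix.lower()}"
        shutil.copy2(path, target)  # copy2 also preserves file timestamps
        copied.append(target)
    return copied
```

Copying into a separate folder, rather than renaming in place, avoids accidentally overwriting a source file whose name already matches the target pattern.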

2. Project entry-point file

The project's entry-point file is sample.py, which shows how to use web_crawler.py and images_downloader.py to crawl for and download images.

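# sample.py
from web_crawler import WebCrawler

# Placeholder inputs for illustration; the api_keys layout is an assumption, check README.md
api_keys = {"google": ["<API_KEY>", "<SEARCH_ENGINE_ID>"]}
keywords = ["cats", "dogs"]  # search terms to crawl for
images_nbr = 20              # number of images to fetch
download_folder = "./data"   # folder where links and images are stored

crawler = WebCrawler(api_keys)
# Collect image URLs for the keywords, discarding duplicate links
crawler.collect_links_from_web(keywords, images_nbr, remove_duplicated_links=True)
crawler.save_urls(download_folder + "/links.txt")  # save the collected URLs
crawler.download_images(keywords, target_folder=download_folder)  # download the images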
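
To make the two middle steps of the sample more concrete (dropping duplicate links and saving them to links.txt), the following is a minimal standalone sketch of the general idea. It is not the implementation inside web_crawler.py; the helper names `dedupe_links` and `save_links` and the one-URL-per-line file layout are assumptions made for illustration only.

```python
from pathlib import Path


def dedupe_links(links):
    """Return the links with duplicates removed, keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(links))


def save_links(links, path):
    """Write one URL per line to a text file, creating the parent folder if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(links) + "\n", encoding="utf-8")
    return path
```

With these helpers, a call such as `save_links(dedupe_links(urls), "./data/links.txt")` would store each distinct URL once, so the list can be reloaded later without crawling again.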