Dataset Collection and Preliminary Processing: Practical, Code-Only Tips

Ysn0719

Last modified 2024-08-21 17:43:19

Tags: python, data collection, AI engineering applications, big data

First published 2024-08-21 08:56:26

Article link: https://blog.csdn.net/yangsn0719/article/details/141377095

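Step 1: Suppose you want to download some images from the web as the initial material for a business dataset. Use the code below (I recommend Baidu as the source, since it returns fewer duplicates).
For details on scraping images from the web in this step, see the original author's post: https://blog.csdn.net/libaiup/article/details/134028174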

# -*- coding:utf8 -*-
import requests
import json
from urllib import parse
import os
import time

class BaiduImageSpider(object):
    def __init__(self):
        self.json_count = 0  # number of JSON pages requested (each page lists 30 images)
        self.url = 'https://image.baidu.com/search/acjson?tn=resultjson_com&logid=5179920884740494226&ipn=rj&ct' \
                   '=201326592&is=&fp=result&queryWord={' \
                   '}&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=0&hd=&latest=&copyright=&word={' \
                   '}&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&fr=&expermode=&nojc=&pn={' \
                   '}&rn=30&gsm=1e&1635054081427= '
        self.directory = "E:/work/ToolCode/网络下载图片/{}"  # storage directory; change it to your own, but keep the {}
        self.header = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/95.0.4638.54 Safari/537.36 Edg/95.0.1020.30 '
        }

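    # Create the storage folder
    def create_directory(self, name):
        self.directory = self.directory.format(name)
        # Create the directory if it does not exist yet
        if not os.path.exists(self.directory):
            os.makedirs(self.directory)
        # Append a file-name placeholder using the platform's path separator
        self.directory = os.path.join(self.directory, '{}')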

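    # Collect the image links from one JSON page
    def get_image_link(self, url):
        list_image_link = []
        strhtml = requests.get(url, headers=self.header)  # fetch the page with a GET request
        jsonInfo = json.loads(strhtml.text)
        # A page may hold fewer than 30 usable entries, so skip any item without a thumbURL
        for item in jsonInfo.get('data', []):
            if item.get('thumbURL'):
                list_image_link.append(item['thumbURL'])
        return list_image_link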

    # Download an image
    def save_image(self, img_link
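
The `url` template in the class has three `{}` placeholders: `queryWord`, `word`, and `pn`. The sketch below shows one plausible way to fill them, assuming the keyword is percent-encoded with `parse.quote` and `pn` is an offset that advances by 30 per page (matching `rn=30`). The helper name `build_page_urls`, the constant `PAGE_SIZE`, and the shortened `URL_TEMPLATE` are illustrative assumptions, not part of the original listing.

```python
from urllib import parse

# Shortened stand-in for the long acjson URL in the listing; only the three
# placeholders (queryWord, word, pn) matter for this sketch.
URL_TEMPLATE = (
    "https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj"
    "&queryWord={}&word={}&pn={}&rn=30"
)
PAGE_SIZE = 30  # assumed to match rn=30: one JSON page lists up to 30 images


def build_page_urls(keyword, num_pages):
    """Return one request URL per JSON page for the given search keyword."""
    quoted = parse.quote(keyword)  # percent-encode spaces and non-ASCII text
    return [
        URL_TEMPLATE.format(quoted, quoted, page * PAGE_SIZE)
        for page in range(num_pages)
    ]


if __name__ == "__main__":
    for url in build_page_urls("猫", 2):
        print(url)
```

Each resulting URL could then be passed to `get_image_link` to collect that page's thumbnail links.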

Step 3: Once the download and the renaming are both done, it is best to deduplicate the data. Otherwise, a large number of duplicate images will create extra work during the later annotation stage.
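
A minimal sketch of one way to deduplicate a folder, assuming that "duplicate" means byte-identical files. It hashes each file with SHA-256 and keeps the first file seen per digest. The function names `file_digest` and `remove_exact_duplicates` and the `dry_run` flag are illustrative assumptions, not the author's code.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def remove_exact_duplicates(folder, dry_run=True):
    """Keep the first file for each digest and return the duplicate paths.

    With dry_run=True nothing is deleted, so the list can be reviewed first.
    """
    seen = {}
    duplicates = []
    # Sorting makes the choice of which copy to keep deterministic.
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        key = file_digest(path)
        if key in seen:
            duplicates.append(path)
            if not dry_run:
                path.unlink()
        else:
            seen[key] = path
    return duplicates
```

This only catches exact copies. Images that were resized or re-compressed have different bytes and would need a similarity-based comparison instead.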