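Continuing from the previous post, this post covers downloading the images and preparing for data annotation.
I. Data Preparation
The previous post wrote the URLs of all images to be downloaded into a txt file. We now process those URLs and download every image.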
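The only third-party dependency is requests, so install it first. If the download fails, switch to the Tsinghua mirror (see "pypi | Mirror Usage Help | Tsinghua Open Source Mirror"), for example by appending -i https://pypi.tuna.tsinghua.edu.cn/simple to the command below.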
pip install requests
(1) A function that returns the image file suffix. It processes the given URL and returns the extension of the image file name.
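def get_suffix(url):
    # File name: everything from the last "/" onward
    file_name = url[url.rfind("/"):].strip()
    # Position of the last "." in the file name
    suffix_index = file_name.rfind(".")
    if suffix_index != -1:
        file_suffix = file_name[suffix_index:]
        # Drop a trailing query string such as "?size=large"
        if file_suffix.find("?") > 0:
            file_suffix = file_suffix[:file_suffix.find("?")]
    else:
        # No extension found; the caller skips such URLs
        file_suffix = 'undefined'
    return file_suffix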
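One edge case worth noting: get_suffix looks for the last "." in everything after the final "/", so a dot inside the query string (for example "?v=1.2") would be taken as the start of the suffix. The following is a minimal alternative sketch, not the code used in this series, that relies only on the standard library. The name get_suffix_from_url and the default argument are hypothetical and chosen just for illustration.

```python
import os
from urllib.parse import urlparse


def get_suffix_from_url(url, default="undefined"):
    """Return the file extension (with its leading dot) of a URL's path."""
    # urlparse separates the path from the query string and fragment,
    # so dots inside "?v=1.2" cannot be mistaken for an extension.
    path = urlparse(url.strip()).path
    # splitext splits "name.ext" into ("name", ".ext"); ext is "" if absent.
    _, ext = os.path.splitext(os.path.basename(path))
    return ext if ext else default


print(get_suffix_from_url("https://example.com/pic/001.jpg?size=1.5x"))  # .jpg
```

It returns the same 'undefined' marker when no extension is present, so it could be swapped in without changing the download loop.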
(2) Sending the requests. Configure requests (request headers, connect timeout, read timeout, number of retries, and so on) and handle exceptions.
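import os
import time
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Browser-like request headers
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8,zh-TW;q=0.7",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36",
}
# Retry a failed request up to 2 times, including on the listed 5xx responses
retries = Retry(total=2, backoff_factor=0.1, status_forcelist=[500, 502, 503, 504])
session = requests.Session()
session.mount('http://', HTTPAdapter(max_retries=retries))
session.mount('https://', HTTPAdapter(max_retries=retries))
sleep_time = 3  # seconds; used as connect timeout, read timeout, and pause between downloads
txt_file_name = "url2.txt"  # one image URL per line
output_file_name_pattern = "yuanshen"  # prefix of the saved file names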
with open(txt_file_name, 'r', encoding='utf-8') as f:
    lines = f.readlines()
os.makedirs("img", exist_ok=True)  # create the output directory once
for index, line in enumerate(lines):
    request_url = line.strip()
    suffix = get_suffix(request_url)
    if suffix == 'undefined':
        continue
    try:
        # timeout=(connect timeout, read timeout)
        resp = session.get(request_url, timeout=(sleep_time, sleep_time), headers=headers)
        resp.raise_for_status()  # treat 4xx/5xx responses as failures
    except Exception as e:
        print(f"Could not access {request_url}: {e}")
        continue
    img = resp.content
    print(f"Request succeeded: {request_url}")
    with open("img/" + output_file_name_pattern + str(index) + suffix, 'wb') as out:
        out.write(img)
    print(f"Downloaded image {index}")
    time.sleep(sleep_time)  # pause between downloads
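Even when a request succeeds, some servers return an HTML page (a login wall or an anti-hotlinking notice) instead of image data, and the loop above would save it under an image file name. The following is a minimal sketch, not part of the original script, of a check based on the well-known leading signature bytes of JPEG, PNG, GIF and WebP files. The helper name looks_like_image is hypothetical, and other formats would need their own signatures.

```python
def looks_like_image(data):
    """Return True if the bytes begin with a known image file signature."""
    if data.startswith(b"\xff\xd8\xff"):  # JPEG
        return True
    if data.startswith(b"\x89PNG\r\n\x1a\n"):  # PNG
        return True
    if data.startswith((b"GIF87a", b"GIF89a")):  # GIF
        return True
    # WebP: a RIFF container whose form type at bytes 8..11 is "WEBP"
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return True
    return False
```

In the download loop it could be called on resp.content just before the file is written, skipping the URL with continue when it returns False.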
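II. Downloading the Annotation Tool
This project uses the CPU build of X-AnyLabeling, which can be found by searching on GitHub. X-AnyLabeling can be loosely thought of as an upgraded AnyLabeling: besides an official Chinese-language interface, it can use pretrained AI models to assist with image annotation. The interface is shown below.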