This article practices data preprocessing on a single dataset that contains a folder of face images, a CSV file, and a txt file.
The dataset is for predicting age and gender from face photos; what follows is some simple data processing done before feeding the data into a model.
1. Operations on the face-image folder and the txt file
1.1. Dataset layout
├── faces
├── 00000A02.jpg
├── 00002A02.jpg
├── 00004A02.jpg
├── 00006A02.jpg
├── ......
├── face_gender_label.txt
- In the txt file, each jpg file is annotated with a class label: 0 or 1. The first column is the file name and the second column is the class the file belongs to. The concrete meaning of each class is also given: 0 is female, 1 is male.
The goal is to create one folder per gender, sort the photos into the corresponding folders according to their labels, and save them there.
The desired result looks like this:
├── Female
├── 00000A02.jpg
├── 00002A02.jpg
├── 00004A02.jpg
├── 00006A02.jpg
├── ......
├── Male
├── 07974A11.jpg
├── 07976A12.jpg
├── 07978A12.jpg
├── 07980A12.jpg
├── ......
1.2. Extracting the labels from the txt file and saving the face images by label
(To adapt this to your own dataset, you only need to change the image directory path and the txt file's path and name.)
import os
import shutil

label_file_path = r"D:\Pro\stfa1227\STFA\face_gender_label.txt"
input_path = r"D:\Pro\stfa1227\STFA\data_set\face_data\faces_JinChungChen"
output_path = r"D:\Pro\stfa1227\STFA\data_set\faces_split"
labels = ["Female", "Male"]  # label 0 -> Female, label 1 -> Male

with open(label_file_path, 'r') as label_file:
    data = label_file.readlines()

for i, line in enumerate(data, start=1):
    str1 = line.split(" ")
    file_name = str1[0]             # first column: file name
    file_label = str1[1].strip()    # second column: class label ("0" or "1")
    old_file_path = os.path.join(input_path, file_name)
    new_file_path = os.path.join(output_path, labels[int(file_label)])
    if not os.path.exists(new_file_path):
        print("Path " + new_file_path + " does not exist, creating it......")
        os.makedirs(new_file_path)
    new_file_path = os.path.join(new_file_path, file_name)
    print(str(i) + "\tcopy file : " + old_file_path + " to " + new_file_path)
    shutil.copyfile(old_file_path, new_file_path)

print("Data Processing Done")
The difference between str1[1] and str1[1].strip()
strip() removes leading and trailing whitespace characters, including \n, \r, \t and ' ' (newline, carriage return, tab, and space).
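A quick sketch makes the difference visible on a sample line (the file name here is hypothetical):

```python
# Each line read from the label file ends with a newline; split(" ")
# leaves it attached to the last field until strip() removes it.
line = "00000A02.jpg 0\n"       # a typical line from face_gender_label.txt
str1 = line.split(" ")
print(repr(str1[1]))            # '0\n'  (newline still attached)
print(repr(str1[1].strip()))    # '0'    (whitespace removed)
```

Without strip(), int(str1[1]) would still work, but building a folder name from "0\n" would not.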
Result:
1.3. Splitting the files extracted above into a training set and a test set by a given ratio
(To adapt this to your own dataset, you only need to change the paths.)
Code:
import os
from shutil import copy, rmtree
import random

def mk_file(file_path: str):
    # recreate the folder from scratch
    if os.path.exists(file_path):
        rmtree(file_path)
    os.makedirs(file_path)

def main():
    random.seed(0)
    split_rate = 0.1  # fraction of images that goes to the validation set
    cwd = os.getcwd()
    data_root = os.path.join(cwd, "data_set/face_data")
    origin_photos_path = os.path.join(data_root, "faces_split")
    assert os.path.exists(origin_photos_path), "path '{}' does not exist.".format(origin_photos_path)
    photo_class = [cla for cla in os.listdir(origin_photos_path)
                   if os.path.isdir(os.path.join(origin_photos_path, cla))]
    train_root = os.path.join(data_root, "train")
    mk_file(train_root)
    for cla in photo_class:
        mk_file(os.path.join(train_root, cla))
    val_root = os.path.join(data_root, "val")
    mk_file(val_root)
    for cla in photo_class:
        mk_file(os.path.join(val_root, cla))
    for cla in photo_class:
        cla_path = os.path.join(origin_photos_path, cla)
        images = os.listdir(cla_path)
        num = len(images)
        eval_index = random.sample(images, k=int(num * split_rate))
        for index, image in enumerate(images):
            image_path = os.path.join(cla_path, image)
            if image in eval_index:
                new_path = os.path.join(val_root, cla)
            else:
                new_path = os.path.join(train_root, cla)
            copy(image_path, new_path)
            print("\r[{}] processing [{}/{}]".format(cla, index + 1, num), end="")  # processing bar
        print()
    print("processing done!")

if __name__ == '__main__':
    main()
Result
(For easier viewing, I renamed the faces_split folder to faces_gender_photos.)
The original photos are unchanged; they are simply re-divided into a training set and a validation set, which makes it convenient to train a deep network on them.
2. Operations on the face-image folder and the CSV file
(Python deep-learning image processing: sorting labeled images into folders according to a CSV file)
2.1. Dataset layout
├── faces
├── 00000A02.jpg
├── 00002A02.jpg
├── 00004A02.jpg
├── 00006A02.jpg
├── ......
├── faces.csv
Details of faces.csv:
import os
import shutil
import pandas as pd

# Read the table -- fill in the location of your own csv file
df = pd.read_csv(r"D:\Pro\stfa1227\STFA\data_set\face_data\train_age_JCC.csv")

# Sort into folders -- iterate over every distinct value of the label column
# ("age" is the name of the column you want to classify by; change it to
# match your own csv)
for i in df["age"].unique():
    folder = str(i)
    if not os.path.exists(folder):
        os.mkdir(folder)
    sub = df[df["age"] == i]      # the rows belonging to this label
    names = sub["id"].tolist()    # "id" is the column holding the file names
    for each in names:
        # the location where your data files are stored
        shutil.copy(r"D:\Pro\stfa1227\STFA\data_set\csv_datapro" + "\\" + each, folder)
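As a sketch of an alternative, pandas' groupby can derive one folder per distinct label value instead of filtering per label; the column names "age" and "id" follow the CSV above, and the inline DataFrame stands in for pd.read_csv(...):

```python
import pandas as pd

# Group rows by their label value; each group's key becomes a folder name
# and each group's "id" column lists the files to copy into that folder.
df = pd.DataFrame({"id": ["a.jpg", "b.jpg", "c.jpg"],
                   "age": [20, 30, 20]})   # stands in for pd.read_csv(...)

plan = {str(label): group["id"].tolist()
        for label, group in df.groupby("age")}
print(plan)   # {'20': ['a.jpg', 'c.jpg'], '30': ['b.jpg']}
```

shutil.copy can then be applied per (folder, file list) pair exactly as in the script above.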
3. YOLO label normalization: converting your own dataset's json files to txt files
3.1. Label conversion
Bounding box explanation
In object detection, an object's position is represented by a bounding box: the rectangle that just encloses the object, shown as the framed rectangle in the figure below.
Bounding-box formats:
-
xyxy format: the box is described by its top-left corner (x1, y1) and bottom-right corner (x2, y2)
-
xywh format: the box is described by its center (x, y) and its width and height (w, h); this is the format YOLO mainly uses
Given these conventions, before converting you need to know which of the two forms the bbox in your json file is stored in.
Suppose my data format is as follows:
[
{
"Code Name": "A020118XX_10921.jpg",
"Name": "ppang",
"W": "0.6440329218107",
"H": "0.933250927070457",
"File Format": "jpg",
"Cat 1": "02",
"Cat 2": "01",
"Cat 3": "18",
"Cat 4": "xx",
"Annotation Type": "binding",
"Point(x,y)": "0.67798353909465,0.516069221260816",
"Label": "0",
"Serving Size": "xx",
"Camera Angle": "xx",
"Cardinal Angle": "xx",
"Color of Container": "xx",
"Material of Container": "xx",
"Illuminance": "xx"
}
]
The first field, "Code Name", is the photo's file name.
For your own dataset, be clear about exactly which information you need to extract!
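As a sketch, assuming Point(x,y) is the normalized box center and W/H are the normalized width and height (as the values in the sample suggest), the fields used later can be pulled out of one record and joined into a YOLO label line; the class id "0" is a placeholder for whatever mapping you define:

```python
import json

# One record shaped like the sample above, trimmed to the fields we need.
record = json.loads("""
{
  "Code Name": "A020118XX_10921.jpg",
  "Name": "ppang",
  "W": "0.6440329218107",
  "H": "0.933250927070457",
  "Point(x,y)": "0.67798353909465,0.516069221260816"
}
""")

x, y = record["Point(x,y)"].split(",")            # normalized center
line = " ".join(["0", x, y, record["W"], record["H"]])
print(line)   # 0 0.67798353909465 0.516069221260816 0.6440329218107 0.933250927070457
```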
You can also use [1] to build and train a dataset!
3.2. Counting the labels in a folder
3.2.1. Data format
Here everything is a json file.
The document name on the left is the label; the content of the json file on the right is as follows:
[
{
"Code Name": "A270332XX_00871.jpg",
"Name": "galibi",
"W": "0.564815",
"H": "0.587961",
"File Format": "jpg",
"Cat 1": "27",
"Cat 2": "03",
"Cat 3": "32",
"Cat 4": "xx",
"Annotation Type": "binding",
"Point(x,y)": "0.587963,0.522375",
"Label": "0",
"Serving Size": "xx",
"Camera Angle": "xx",
"Cardinal Angle": "xx",
"Color of Container": "xx",
"Material of Container": "xx",
"Illuminance": "xx"
}
]
The important fields here are "Name" (the label name) together with "Point(x,y)", "W" and "H" (the box information).
The following conversion is needed: the json on the left may be annotations supplied by another source or task, while the txt on the right is the annotation format yolov5 requires.
The code differs with the format.
- bbox (x1, y1, x2, y2), i.e. the xyxy format:
size is the image size, usually available in the json file, stored as a list such as [1920, 1080];
box is the json's bounding box bbox, likewise given as a list.
def convert(size, box):
    dw = 1. / size[0]
    dh = 1. / size[1]
    x = (box[0] + box[2]) / 2.0
    y = (box[1] + box[3]) / 2.0
    w = box[2] - box[0]
    h = box[3] - box[1]
    x = x * dw
    w = w * dw
    y = y * dh
    h = h * dh
    return (x, y, w, h)
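A quick sanity check of this conversion, using a hypothetical 1920x1080 image and a 200x200 box with corners (100, 200) and (300, 400); the function is repeated here so the snippet runs on its own:

```python
def convert(size, box):
    # same xyxy -> normalized xywh conversion as above
    dw, dh = 1. / size[0], 1. / size[1]
    x = (box[0] + box[2]) / 2.0 * dw
    y = (box[1] + box[3]) / 2.0 * dh
    w = (box[2] - box[0]) * dw
    h = (box[3] - box[1]) * dh
    return (x, y, w, h)

# center (200, 300), size 200x200, normalized by 1920x1080
x, y, w, h = convert([1920, 1080], [100, 200, 300, 400])
print(round(x, 4), round(y, 4), round(w, 4), round(h, 4))   # 0.1042 0.2778 0.1042 0.1852
```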
- bbox (x, y, w, h), i.e. the xywh format
def convert(size, box):
    x, y, w, h = box
    dw = 1. / size[0]
    dh = 1. / size[1]
    x = x * dw
    w = w * dw
    y = y * dh
    h = h * dh
    return (x, y, w, h)
In this article's data, however, there is no 'bbox' field written directly; instead there are Point(x,y) plus H and W, so we just extract those values from the json.
3.3. Converting json files to txt files
3.3.1. The case with few labels (only 8 classes)
import os
import glob
import json

# v2 nc:8 -- map each class name to its YOLO class id
name2id = {
    'jamong': '0',
    'jamongjuseu': '1',
    'jang-eogu-i': '2',
    'jang-eochobab': '3',
    'jeog-yangbaechu': '4',
    'jeonbog': '5',
    'jelli': '6',
    'jomigim': '7',
}

def json2txt(json_data):
    # take the first (and only) record of the json list
    item = next(iter(json_data), False)
    ID = name2id[item['Name']]
    x, y = item['Point(x,y)'].split(',')
    W = item['W']
    H = item['H']
    return ID, x, y, W, H

rootdir = './food-sp/*json'   # the folders whose names end in "json"
folderlist = glob.glob(rootdir)
print(folderlist)
foldercount = len(folderlist)

for folder in folderlist:
    path = folder + "/"
    file_list = os.listdir(path)
    file_list_py = [file for file in file_list if file.endswith('.json')]
    for jsonfile in file_list_py:
        with open(path + jsonfile, 'r') as f:
            filename = jsonfile.split(".")[0]
            json_data = json.load(f)
        info_line = json2txt(json_data)
        line = ' '.join(info_line)
        with open(path + filename + ".txt", 'w') as out:
            print(line, file=out)
        if os.path.exists(path + jsonfile):
            os.remove(path + jsonfile)
            print("remove : " + jsonfile)
3.3.2. The case with many labels
Code to count the number of labels, label-num.py:
# Count the files in one or more folders
import os

path = './food-sp'           # the input folder
files = os.listdir(path)     # read the folder
num_png = len(files)         # count the files it contains
print(num_png)               # print the count
# print all the file names
print("All Folder Name:")
for file in files:
    print(file)
print("All Folder Number:", num_png)
After running it:
This shows there are 39 classes.
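With this many classes, a hard-coded if-chain like the one used earlier does not scale. One sketch, assuming the label names can be listed (for example from the folder contents printed above), is to build the name-to-id mapping automatically; the names here are placeholders for the real labels:

```python
# Build the class-name -> id mapping from a sorted list of label names
# instead of writing one if-statement per class. The names below are
# placeholders for the real label folders (e.g. os.listdir('./food-sp')).
labels = ["jamong", "jelli", "galibi"]
id_map = {name: i for i, name in enumerate(sorted(labels))}
print(id_map)   # {'galibi': 0, 'jamong': 1, 'jelli': 2}
```

Sorting first keeps the ids stable across runs even if the directory listing order changes.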
Converting your own dataset's .json labels into normalized .txt files for YOLO training.
If the whole dataset is one single json file (COCO style):
import os
import json
import argparse
from tqdm import tqdm

parser = argparse.ArgumentParser()
parser.add_argument('--json_path', default='./food-sp/*', type=str, help="input: coco format(json)")
parser.add_argument('--save_path', default='./foodyolo/labels', type=str, help="specify where to save the output dir of labels")
arg = parser.parse_args()

def convert(size, box):
    # COCO bbox (x_top_left, y_top_left, w, h) -> normalized YOLO xywh
    dw = 1. / (size[0])
    dh = 1. / (size[1])
    x = box[0] + box[2] / 2.0
    y = box[1] + box[3] / 2.0
    w = box[2]
    h = box[3]
    x = x * dw
    w = w * dw
    y = y * dh
    h = h * dh
    return (x, y, w, h)

if __name__ == '__main__':
    json_file = arg.json_path          # the dataset's annotation file
    ana_txt_save_path = arg.save_path  # where to save the labels
    data = json.load(open(json_file, 'r'))
    if not os.path.exists(ana_txt_save_path):
        os.makedirs(ana_txt_save_path)
    id_map = {}  # the ids in a coco dataset are not contiguous -- remap them before writing
    with open(os.path.join(ana_txt_save_path, 'classes.txt'), 'w') as f:
        # write classes.txt
        for i, category in enumerate(data['categories']):
            f.write(f"{category['name']}\n")
            id_map[category['id']] = i
    for img in tqdm(data['images']):
        filename = img["file_name"]
        img_width = img["width"]
        img_height = img["height"]
        img_id = img["id"]
        head, tail = os.path.splitext(filename)
        ana_txt_name = head + ".txt"  # txt name matching the jpg
        f_txt = open(os.path.join(ana_txt_save_path, ana_txt_name), 'w')
        for ann in data['annotations']:
            if ann['image_id'] == img_id:
                box = convert((img_width, img_height), ann["bbox"])
                f_txt.write("%s %s %s %s %s\n" % (id_map[ann["category_id"]], box[0], box[1], box[2], box[3]))
        f_txt.close()
For a case like the one in this article, with multiple json files under one folder, here is a batch-rename script:
# _*_coding:utf-8 _*_
import os

files = os.listdir("/home/appleyuchi/PycharmProjects/2017-9-orgin")
for filename in files:
    portion = os.path.splitext(filename)  # separate the base name and the extension
    print("filename=", filename)
    if portion[1] == ".txt":
        print(portion[0])
        newname = portion[0] + ".json"
        print("newname=", newname)
        os.rename("/home/appleyuchi/PycharmProjects/2017-9-orgin/" + filename,
                  "/home/appleyuchi/PycharmProjects/2017-9-orgin/" + newname)
Saving the paths of all the dataset's photos and labels to train_list.txt / val_list.txt
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on 2023.01.19
@author: Elena
V1 √
"""
##label list
import os
import pandas as pd
from sklearn.model_selection import train_test_split
# train_list
DATAPATH='train/'
lab_train_lists=os.listdir(DATAPATH+'labels')
lab_train_lists.sort()   # sort so labels line up with the sorted images below
print(lab_train_lists[:2])
img_train_lists=os.listdir(DATAPATH+'images')
img_train_lists.sort()
# contents generated for train_list.txt: "<image path> <label path>" per line
total=[DATAPATH+'images/'+ img_train_list+' '+ DATAPATH+'labels/'+lab_train_list for img_train_list,lab_train_list in zip(img_train_lists,lab_train_lists)]
df=pd.DataFrame(total)
train_df, val_df = train_test_split(df, test_size=0.2, random_state=1000)
val_df.to_csv(DATAPATH+'kf_val_list.txt',index=0,header=0)
train_df.to_csv(DATAPATH+'kf_train_list.txt',index=0,header=0)
# valid_list.txt
DATAPATH1='valid/'
test_lists=os.listdir(DATAPATH1+'images')
test_lists.sort()
test_total=[DATAPATH1+''+test_list for test_list in test_lists]
test_df=pd.DataFrame(test_total)
test_df.to_csv(DATAPATH1+'kf_test_list.txt',index=0,header=0)
'''
# test_list.txt
DATAPATH2='test/'
test_lists=os.listdir(DATAPATH2+'images')
test_lists.sort()
test_total=[DATAPATH2+''+test_list for test_list in test_lists]
test_df=pd.DataFrame(test_total)
test_df.to_csv(DATAPATH2+'test_list.txt',index=0,header=0)
'''
print("------------------------------Done----------------------------")
A script that writes the images/labels paths into txt list files
## make the label list
import os
import pandas as pd
from sklearn.model_selection import train_test_split
DATAPATH='/home/aistudio/work/dataset/'
lab_train_lists=os.listdir(DATAPATH+'lab_train')
lab_train_lists.sort()
print(lab_train_lists[:2])
img_train_lists=os.listdir(DATAPATH+'img_train')
img_train_lists.sort()
total=[DATAPATH+'img_train/'+img_train_list+' '+\
DATAPATH+'lab_train/'+lab_train_list for img_train_list,lab_train_list in zip(img_train_lists,lab_train_lists)]
df=pd.DataFrame(total)
train_df, val_df = train_test_split(df, test_size=0.2, random_state=1000)
val_df.to_csv(DATAPATH+'val_list.txt',index=0,header=0)
train_df.to_csv(DATAPATH+'train_list.txt',index=0,header=0)
test_lists=os.listdir(DATAPATH+'img_testA/')
test_lists.sort()
test_total=[DATAPATH+'img_testA/'+test_list for test_list in test_lists]
test_df=pd.DataFrame(test_total)
test_df.to_csv(DATAPATH+'test_list.txt',index=0,header=0)