Converting dataset class labels to numbers
(1) The class label is in the last column (convert the label in place)
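# Text labels and the numeric codes that replace them
labels = ['didntLike', 'smallDoses', 'largeDoses']
new_labels = ['1', '2', '3']

# Open the original file for reading and the new file for writing
with open('dataset/datingTestSet.txt', 'r', encoding='utf-8') as f, \
        open('dataset/datingTestSet0.txt', 'w', encoding='utf-8') as f_new:
    # Read the original file line by line
    for line in f:
        for label, new_label in zip(labels, new_labels):
            # Check whether the line contains this label
            if label in line:
                print(new_label)
                line = line.replace(label, new_label)
                print(line)
                break
        # Write the line to the new file; a line with no matching label is copied unchanged
        f_new.write(line)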
Note: this dataset comes from datingTestSet.txt
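The substring check above works because the three label names never appear elsewhere in a record. A slightly stricter variant is sketched below: it splits each record into fields and converts only the last one. The function name `digitize_last_column`, the tab delimiter, and the sample records are assumptions made for illustration; the source does not state the delimiter, so adjust `sep` to match the actual file.

```python
def digitize_last_column(line, label_map, sep='\t'):
    """Replace the last field of one record with its numeric code.

    `sep` is an assumption: the delimiter is not stated in the source.
    A line whose last field is not in `label_map` is returned unchanged.
    """
    stripped = line.rstrip('\r\n')
    if not stripped:
        return line
    fields = stripped.split(sep)
    code = label_map.get(fields[-1].strip())
    if code is None:
        return line
    fields[-1] = code
    return sep.join(fields) + '\n'


# Hypothetical sample records, made up for illustration only
label_map = {'didntLike': '1', 'smallDoses': '2', 'largeDoses': '3'}
sample = ['100\t1.5\t0.2\tlargeDoses\n', '200\t3.0\t0.7\tdidntLike\n']
converted = [digitize_last_column(row, label_map) for row in sample]
print(converted)
```

To process the real file, the same function could be called on each line inside the read loop before writing to the new file.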
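(2) The class label is not in the last column
In this case, to keep the dataset format uniform for later processing, convert the class label to a number, move it to the last column, and write the result to a new file.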
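# Letter labels and the numeric codes that replace them
labels = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z']
new_labels = ['1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','23','24','25','26']

# Open the original file for reading and the new file for writing
with open('dataset/letter-recognition.data', 'r', encoding='utf-8') as f, \
        open('dataset/letter-recognition0.data', 'w', encoding='utf-8') as f_new:
    # Read the original file line by line
    for line in f:
        for label, new_label in zip(labels, new_labels):
            # Check whether the line contains this label
            if label in line:
                # Remove the label and the comma that follows it
                line = line.replace(label + ',', '')
                # Strip leading and trailing whitespace ('\n', '\r', '\t', ' ')
                line = line.strip()
                # Append the numeric label as the last column
                line = line + ',' + new_label + '\n'
                break
        # Write the line to the new file; a line with no matching label is copied unchanged
        f_new.write(line)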
Note: this dataset comes from letter-recognition.data
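The same idea can be written against columns rather than substrings, which avoids touching any other field that happens to contain a label character. The sketch below is one possible way to do it, not the method used above: `move_label_to_end` and its `label_index` parameter are hypothetical names, the comma delimiter is inferred from the `label + ','` replacement in the code above, and the sample records are made up.

```python
import string

# 'A' -> '1', 'B' -> '2', ..., 'Z' -> '26' (the same A-to-Z, 1-to-26 mapping used above)
label_map = {letter: str(i)
             for i, letter in enumerate(string.ascii_uppercase, start=1)}


def move_label_to_end(line, label_map, label_index=0, sep=','):
    """Digitize the field at `label_index` and move it to the last column.

    `label_index` and `sep` are assumptions about the file layout.
    A line whose label field is not in `label_map` is returned unchanged.
    """
    fields = line.strip().split(sep)
    if label_index >= len(fields):
        return line
    code = label_map.get(fields[label_index])
    if code is None:
        return line
    del fields[label_index]   # drop the text label from its old position
    fields.append(code)       # append the numeric code as the last column
    return sep.join(fields) + '\n'


# Hypothetical sample records, made up for illustration only
print(move_label_to_end('T,2,8,3\n', label_map))
print(move_label_to_end('2,8,T,3\n', label_map, label_index=2))
```

Passing `label_index` explicitly keeps the function usable for other files where the label sits in a middle column.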