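A note up front: while processing data recently, I found that some files were CSV and others were TSV. A quick search showed that TSV uses the tab character ('\t') as the field separator, while CSV uses the half-width (ASCII) comma (',') as the field separator. https://www.jianshu.com/p/6e1c3e9f5e42
So I needed to unify the format by converting the TSV files to CSV, and also to append the label as the last column.
The code was pieced together from various sources; if you spot any mistakes, please point them out. Thanks, everyone~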
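To make the difference between the two formats concrete, here is a minimal sketch using Python's standard `csv` module. The sample rows and the helper name `parse_rows` are made up for illustration and are not part of my actual data.

```python
import csv
import io


def parse_rows(text, delimiter):
    """Parse delimited text into a list of rows (hypothetical helper, for illustration only)."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


# The same made-up records, once tab-separated and once comma-separated.
tsv_text = "id\tname\n1\tAlice\n2\tBob\n"
csv_text = "id,name\n1,Alice\n2,Bob\n"

tsv_rows = parse_rows(tsv_text, "\t")
csv_rows = parse_rows(csv_text, ",")

# Both texts parse to the same rows; only the separator differs.
print(tsv_rows == csv_rows)
```

So converting between the two is mostly a matter of reading with one separator and writing with the other, which is what the script does with pandas.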
import pandas as pd
import os
# Location of the original TSV files
source_path = "./tsv_data/"
# Location where the CSV files are saved
save_path = "./csv_data/"
if not os.path.exists(save_path):
    os.mkdir(save_path)
pathDir = os.listdir(source_path)
Name = []
End = []
# Get each file's base name and extension
def getName(workdir):
    for filename in os.listdir(workdir):
        split_file = os.path.splitext(filename)
        # print(split_file[0])
        Name.append(split_file[0])
        End.append(split_file[1])
    return Name, End
name, end = getName(source_path)
# print(Name, End)
TsvFile = os.listdir(source_path)
# print(len(TsvFile))
# print(TsvFile)
# Loop over the files and convert each TSV file to CSV
for long in range(len(TsvFile)):
    # Skip anything that is not a .tsv file before trying to open it
    if end[long] == '.tsv':
        with open(source_path + TsvFile[long], 'r', encoding='utf-8') as tsv_file:
            # print(tsv_file)
            pd_all = pd.read_csv(tsv_file, sep='\t')
        pd_all.to_csv(save_path + name[long] + '.csv', index=False, sep=',')
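CsvFile = os.listdir(save_path)
# print(len(CsvFile))
# print(CsvFile)
# Loop over the CSV files and merge each one with the labels
# f2 is my label file; read it once, outside the loop
f2 = pd.read_csv('60+60_label.csv')
for long in range(len(CsvFile)):
    with open(save_path + CsvFile[long], 'r', encoding='utf-8') as csv_file:
        # print(csv_file)
        f1 = pd.read_csv(csv_file)
    file = [f1, f2]
    # print(file)
    # axis=1 concatenates column-wise (side by side)
    # axis=0 concatenates row-wise (the default)
    train = pd.concat(file, axis=1)
    # Write back under the file's own name: name[long] follows the listing of
    # source_path and may not line up with the listing of save_path
    train.to_csv(save_path + CsvFile[long], index=False, sep=',')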
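The two loops above can also be folded into a single pass, so that each TSV file is read once and written once with the label column already attached. What follows is only a sketch under assumptions: the function name `convert_with_labels`, its parameters, and the row-count check are my own illustrative additions, and it assumes the label file has one row per data row, in the same order.

```python
from pathlib import Path

import pandas as pd


def convert_with_labels(source_dir, save_dir, label_file):
    """Convert every .tsv in source_dir to a .csv in save_dir with the label columns appended.

    Hypothetical helper: the name, parameters, and row-count check are assumptions.
    Returns the list of CSV paths that were written.
    """
    source_dir, save_dir = Path(source_dir), Path(save_dir)
    save_dir.mkdir(parents=True, exist_ok=True)
    labels = pd.read_csv(label_file)  # read the labels once, not once per file

    written = []
    for tsv_path in sorted(source_dir.glob("*.tsv")):
        data = pd.read_csv(tsv_path, sep="\t")
        # concat(axis=1) aligns on the row index, so a length mismatch would
        # silently pad with NaN; fail early instead.
        if len(data) != len(labels):
            raise ValueError(f"{tsv_path.name}: {len(data)} rows, {len(labels)} labels")
        merged = pd.concat([data, labels], axis=1)
        out_path = save_dir / (tsv_path.stem + ".csv")
        merged.to_csv(out_path, index=False)
        written.append(out_path)
    return written
```

With the same directory layout as in the script, it could be called as `convert_with_labels("./tsv_data/", "./csv_data/", "60+60_label.csv")`. Globbing for `*.tsv` means other files in the source folder are simply ignored, and the output name is taken from each input file directly, so nothing depends on two directory listings coming back in the same order.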
Reference:
- https://blog.csdn.net/qq_40303258/article/details/107326287
- https://blog.csdn.net/weixin_45750972/article/details/121358100
- https://blog.csdn.net/qq_42041134/article/details/118934868