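Reading matching data from different files and replacing it
I work on deep learning, so most of my time goes into data processing, and it is really hard.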
def change_data(path):
    name = "format.data"
    scp_dir = "/data01/Violin/CRTubeGet_Downloaded_file/"
    # The order is relied on below: token ids, tokens, shapes.
    scp_paths = [
        scp_dir + "tokenid_name_label.scp",
        scp_dir + "token_name_label.scp",
        scp_dir + "shape_name_label.scp",
    ]
    scp_files = [open(p) for p in scp_paths]
    tmp = []
    with open(path + name) as file:
        for data in file:
            if not data.strip():
                continue
            fields = data.rstrip("\n").split('\t')
            name_file = fields[0].split(':')[1]
            # Fast path: read the next line of each .scp file and hope the names line up.
            lines = [f.readline() for f in scp_files]
            for i, scp_path in enumerate(scp_paths):
                if lines[i].split(' ')[0] == name_file:
                    continue
                # Slow path: this file is out of step, so reopen it and scan from the top.
                scp_files[i].close()
                scp_files[i] = open(scp_path)
                lines[i] = scp_files[i].readline()
                while lines[i] and lines[i].split(' ')[0] != name_file:
                    lines[i] = scp_files[i].readline()
            if not all(lines):
                # The name is missing from at least one .scp file: skip this record.
                continue
            assert all(line.split(' ')[0] == name_file for line in lines)
            new_id = " ".join(lines[0].split(' ')[1:]).replace("\n", "")
            new_token = "".join(lines[1].split(' ')[1:]).replace("\n", "")
            new_shape = lines[2].split(' ')[1].replace("\n", "")
            # Overwrite fields 4-6 (token, tokenid, token_shape) by position, keeping each key.
            for idx, new_value in ((4, new_token), (5, new_id), (6, new_shape)):
                key = fields[idx].split(':')[0]
                fields[idx] = key + ":" + new_value
            tmp.append("\t".join(fields) + "\n")
    for f in scp_files:
        f.close()
    with open(path + "1111_" + name, "w") as f:
        f.writelines(tmp)
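The logic of the first part of the code: pick out the entries that share the same name across the different files.
For example:
Contents of file_id: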
umIBbT6uwZI_30_2500 809 809 2587 2 3797 3704 4528 4454 3771 459
umIBbT6uwZI_2510_6079 4457 2257 4454 2883 4459 4987 4995 4901
umIBbT6uwZI_6089_7610 125 1190 1 4995 2564 4454 1636
umIBbT6uwZI_7620_8058 4528 2930 1
umIBbT6uwZI_8069_11390 2655 3317 4995 2923 407 2767 4868 2212
umIBbT6uwZI_11400_13160 2389 2 3797 4602 2188 2 2634 379 1 1286 2 4381 3166 2196
umIBbT6uwZI_13170_17300 1286 2 4381 2623 3957 2 3797 1 4222 2197 4995 1887
umIBbT6uwZI_17310_26359 3126 2188 2 4741 155 5 1869 1 2457 2256 2457 2256
umIBbT6uwZI_26369_33229 2417 2930 2931 4241 4478 4241 2417 2930 2976 1887 3126 3065
Contents of file_token:
utt:umIBbT6uwZI_30_2500 feat:/data01/WuYiHao/Violin/CRTubeGet_Downloaded_cat/umIBbT6uwZI_30_2500.wav feat_shape:2.47 text:COME COME LET’S RETURN TO THE ROSE BLOOM token:▁COME ▁COME ▁LET ’ S ▁RETURN ▁TO ▁THE ▁ROSE ▁BLOOM tokenid:820 820 2560 2 3806 3708 4526 4443 3775 472 token_shape:10,5001
feat:/data01/WuYiHao/Violin/CRTubeGet_Downloaded_cat/umIBbT6uwZI_2510_6079.wav feat_shape:3.57 text:THEM IN THE MORNING THEN YES YOU WILL token:▁THEM ▁IN ▁THE ▁MORNING ▁THEN ▁YES ▁YOU ▁WILL tokenid:4446 2222 4443 2864 4448 4985 4992 4895 token_shape:8,5001
utt:umIBbT6uwZI_6089_7610 feat:/data01/WuYiHao/Violin/CRTubeGet_Downloaded_cat/umIBbT6uwZI_6089_7610.wav feat_shape:1.52 text:ALL DIE UNLESS YOU LEAVE THE FARM token:▁ALL ▁DIE ▁UNLESS ▁YOU ▁LEAVE ▁THE ▁FARM tokenid:134 1181 4677 4992 2536 4443 1608 token_shape:7,5001
utt:umIBbT6uwZI_8069_11390 feat:/data01/WuYiHao/Violin/CRTubeGet_Downloaded_cat/umIBbT6uwZI_8069_11390.wav feat_shape:3.32 text:LOOK PLEASE YOU MUST BELIEVE ME WHAT IF token:▁LOOK ▁PLEASE ▁YOU ▁MUST ▁BELIEVE ▁ME ▁WHAT ▁IF tokenid:2633 3310 4992 2902 422 2745 4856 2182 token_shape:8,5001
utt:umIBbT6uwZI_11400_13160 feat:/data01/WuYiHao/Violin/CRTubeGet_Downloaded_cat/umIBbT6uwZI_11400_13160.wav feat_shape:1.76 text:IT’S TRUE I’LL BE KILLED DON’T PANIC token:▁IT ’ S ▁TRUE ▁I ’ LL ▁BE ▁KILLED ▁DON ’ T ▁PAN IC tokenid:2353 2 3806 4600 2157 2 2615 388 2448 1268 2 4373 3162 2164 token_shape:14,5001
utt:umIBbT6uwZI_13170_17300 feat:/data01/WuYiHao/Violin/CRTubeGet_Downloaded_cat/umIBbT6uwZI_13170_17300.wav feat_shape:4.13 text:DON’T LISTEN SHE’S HYSTERICAL YOU GET token:▁DON ’ T ▁LISTEN ▁SHE ’ S ▁HY STER ICAL ▁YOU ▁GET tokenid:1268 2 4373 2604 3954 2 3806 2153 4215 2165 4992 1860 token_shape:12,5001
utt:umIBbT6uwZI_17310_26359 feat:/data01/WuYiHao/Violin/CRTubeGet_Downloaded_cat/umIBbT6uwZI_17310_26359.wav feat_shape:9.05 text:OUT I’VE AN AGENDA JUSTIN JUSTIN token:▁OUT ▁I ’ VE ▁AN ▁A GEN DA ▁JUST IN ▁JUST IN tokenid:3128 2157 2 4732 166 5 1842 1046 2423 2221 2423 2221 token_shape:12,5001
utt:umIBbT6uwZI_26369_33229 feat:/data01/WuYiHao/Violin/CRTubeGet_Downloaded_cat/umIBbT6uwZI_26369_33229.wav feat_shape:6.86 text:JENNA STOP THIS STOP JENNER GET OUT OF token:▁JE N NA ▁STOP ▁THIS ▁STOP ▁JE N NER ▁GET ▁OUT ▁OF tokenid:2381 2910 2911 4231 4468 4231 2381 2910 2961 1860 3128 3054 token_shape:12,5001
utt:umIBbT6uwZI_33239_38480 feat:/data01/WuYiHao/Violin/CRTubeGet_Downloaded_cat/umIBbT6uwZI_33239_38480.wav feat_shape:5.24 text:MY WAY token:▁MY ▁WAY tokenid:2906 4828 token_shape:2,5001
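As you can see, I want to line up the entries that share a name across the files. The files inevitably have gaps, though, so whenever the names stop lining up I scan each file again from the beginning. It is slow and tedious, but with my limited skill it was the best I could come up with.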
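The repeated full scans could probably be avoided by reading each .scp file once into a dictionary keyed by the entry name, and then keeping only the names present in every file. The following is a minimal sketch under assumptions, not the code used above: the helper names `load_scp` and `common_names` are hypothetical, each line is assumed to be `name value value ...` separated by spaces, and the sample lines are made up for illustration.

```python
def load_scp(lines):
    """Map each entry name to the rest of its line (assumed space-separated)."""
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split once: the first field is the name, the rest is the payload.
        name, _, rest = line.partition(" ")
        table[name] = rest
    return table


def common_names(*tables):
    """Return the set of names that appear in every table."""
    names = set(tables[0])
    for table in tables[1:]:
        names &= set(table)
    return names


# Made-up stand-ins for two .scp files; each has an entry the other lacks.
id_lines = ["a_1 809 809 2587\n", "a_2 4457 2257\n"]
shape_lines = ["a_2 2,5001\n", "a_3 7,5001\n"]

ids = load_scp(id_lines)
shapes = load_scp(shape_lines)
shared = common_names(ids, shapes)
```

Because `load_scp` only iterates over lines, an open file object can be passed in place of the list. Each lookup is then a dictionary access rather than a pass over the whole file.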
The second step, replacing the data in the old file, is fairly simple, so I won't go into detail.
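For completeness, the replacement step can be sketched roughly as follows, assuming each record in format.data is one line of tab-separated `key:value` fields (which is how the code splits it). The function name `replace_fields` is hypothetical and the sample record is shortened and made up. Replacing by field, rather than by substring, avoids touching the same digits elsewhere in the line and keeps the trailing newline.

```python
def replace_fields(record, new_values):
    """Return `record` with the values of the given keys replaced.

    `record` is one tab-separated line of key:value fields; `new_values`
    maps a key such as "tokenid" to its new value.
    """
    out = []
    for field in record.rstrip("\n").split("\t"):
        # Split on the first colon only, so values may contain colons.
        key, sep, value = field.partition(":")
        if sep and key in new_values:
            value = new_values[key]
        out.append(key + sep + value)
    return "\t".join(out) + "\n"


# Shortened, made-up record for illustration.
record = "utt:umIBbT6uwZI_7620_8058\ttokenid:1 2 3\ttoken_shape:3,5001\n"
updated = replace_fields(record, {"tokenid": "4528 2930 1"})
```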