1. Extract the rows of a csv/xlsx (spreadsheet) file whose entry is two Chinese characters, and save them to a new file
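import pandas as pd
import numpy as np

# Read the CSV with every column as a string; the first row is the header.
lc = pd.read_csv('cnword.csv', header=0, dtype=str)
# Keep only the rows whose "词语" (word) entry is exactly two characters long.
df = lc[lc["词语"].str.len() == 2]
# Write the filtered rows to an Excel file.
df.to_excel('newWord.xlsx')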
Original file: cnword.csv
Extracted file: newWord.xlsx
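The heading of this section mentions a txt file, while the snippet writes an xlsx file. If a plain-text file with one word per line is what is wanted, a minimal sketch could look like the following. The sample words and the output name newWord.txt are made up for illustration; only the column name "词语" comes from the code above. Note that str.len() counts characters of any kind, so this keeps every two-character entry, not only Chinese ones.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data standing in for cnword.csv.
words = pd.DataFrame({"词语": ["你好", "计算机", "学习", "人", "神经网络"]}, dtype=str)

# Keep only the rows whose entry is exactly two characters long.
two_char = words[words["词语"].str.len() == 2]

# Write one word per line, without the index or the header row.
# newWord.txt is an assumed file name, placed in a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "newWord.txt")
two_char["词语"].to_csv(out_path, index=False, header=False, encoding="utf-8")

# Read the file back to check what was written.
with open(out_path, encoding="utf-8") as f:
    saved = f.read().split()
print(saved)
```

Writing through to_csv keeps the example short; opening the file and writing each word by hand would work just as well.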
2. Save NumPy arrays to a txt file
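# Append each encoded vector to the file as one row.
with open("zi_vecters_3908_dim_100.txt", "a") as f:
    for i in train_set_0:
        z = encode(i)  # encode() and train_set_0 are not defined in this snippet
        # z[None] adds a leading axis, so savetxt writes a 1-D z as a single row.
        np.savetxt(f, z[None])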
Original: a multi-dimensional array
embeddings = [[0.002345 0.16347 0.1267 -0.64878],
 [0.002345 0.16347 0.1267 -0.64878],
 [0.002345 0.16347 0.1267 -0.64878]]
After saving: each vector is written on its own line
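To see why z[None] is needed, the same loop can be run on a small made-up array and the file read back with np.loadtxt. This is only a sketch: the array values and the file name vectors.txt are assumptions, and the file goes into a temporary directory so that repeated runs do not keep appending to it.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for the encoded vectors: three vectors of dimension 4.
vectors = np.array([
    [0.002345, 0.16347, 0.1267, -0.64878],
    [0.002345, 0.16347, 0.1267, -0.64878],
    [0.002345, 0.16347, 0.1267, -0.64878],
])

path = os.path.join(tempfile.mkdtemp(), "vectors.txt")
with open(path, "a") as f:
    for v in vectors:
        # v has shape (4,); v[None] has shape (1, 4), so savetxt writes one row.
        # Passing v directly would write one value per line instead.
        np.savetxt(f, v[None])

# Count the lines in the file and load it back as a 2-D array.
with open(path) as f:
    n_lines = len(f.readlines())
loaded = np.loadtxt(path)
print(n_lines, loaded.shape)
```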
3. Save a dictionary of nested lists to a txt file
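np.set_printoptions(linewidth=np.inf)  # no limit on characters per line, so each array prints on one line
np.set_printoptions(suppress=True)  # do not use scientific notation
with open('hanzi_embeddings.txt', 'w', encoding='utf-8') as f:
    for key in embeddings:
        # Convert the array to a string first, then strip the surrounding brackets.
        # ':' separates key and value, matching the key:value form described below.
        f.write(str(key) + ':' + str(embeddings[key]).lstrip('[').rstrip(']'))
        f.write('\n')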
Original data: embeddings = {'同': [0.002345 0.16347 0.1267 -0.64878],
 '名': [0.002345 0.16347 0.1267 -0.64878]
}
Saved txt: each vector is written on one line, in key:value form
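Stripping brackets from str() output depends on NumPy's print options. An alternative, shown here only as a sketch, is to format each number explicitly and then read the file back into a dictionary. The sample values, the six-decimal format, and the ':' separator are assumptions for illustration.

```python
import os
import tempfile

import numpy as np

# Hypothetical embeddings: each key maps to a 1-D NumPy array.
embeddings = {
    "同": np.array([0.002345, 0.16347, 0.1267, -0.64878]),
    "名": np.array([0.002345, 0.16347, 0.1267, -0.64878]),
}

path = os.path.join(tempfile.mkdtemp(), "hanzi_embeddings.txt")
with open(path, "w", encoding="utf-8") as f:
    for key, vec in embeddings.items():
        # Format every number with six decimals and join them with spaces.
        f.write(key + ":" + " ".join(f"{x:.6f}" for x in vec) + "\n")

# Read the file back: split each line at the first ':' into key and values.
restored = {}
with open(path, encoding="utf-8") as f:
    for line in f:
        key, values = line.rstrip("\n").split(":", 1)
        restored[key] = np.array(values.split(), dtype=float)
print(restored["同"])
```

With explicit formatting the output does not change when the print options do, at the cost of fixing the number of decimals.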
4. Save an array to a txt file
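print(g.edges)  # each edge is a pair of nodes
with open('hanzi_edgeList.txt', 'w', encoding='utf-8') as fw:
    for line in edge:  # edge is the edge list (presumably g.edges; it is not defined in this snippet)
        for a in line:
            fw.write(str(a))  # str() so that non-string node labels can be written too
            fw.write('\t')
        fw.write('\n')
# No explicit fw.close() is needed: the with block closes the file.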
Original data:
Saved txt:
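The edge-list loop in this section can be tried without a graph object. The sketch below uses a small hand-made list of node pairs (the node labels are invented) and writes one edge per line with a tab between the two nodes. Unlike the loop above, it uses join(), so no trailing tab is written.

```python
import os
import tempfile

# Hypothetical edge list: each edge is a pair of node labels.
edges = [("同", "名"), ("名", "字"), ("字", "同")]

path = os.path.join(tempfile.mkdtemp(), "hanzi_edgeList.txt")
with open(path, "w", encoding="utf-8") as fw:
    for u, v in edges:
        # join() puts a tab between the two nodes and leaves no trailing tab.
        fw.write("\t".join([str(u), str(v)]) + "\n")

# Read the file back: one edge per line, nodes separated by a tab.
with open(path, encoding="utf-8") as fr:
    restored = [tuple(line.rstrip("\n").split("\t")) for line in fr]
print(restored)
```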