Python: What is the most efficient way to read a file without a default delimiter (containing millions of records) and load it into a pandas DataFrame? The file is "file_sd.txt":
A123456MESTUDIANTE 000-12
A123457MPROFESOR   003103
I128734MPROGRAMADOR00-111
A129863FARQUITECTO 00-456
# Fields and position:
# - Activity Indicator : indAct -> 01 Character
# - Person Code : codPer -> 06 Characters
# - Gender (M / F) : sex -> 01 Character
# - Occupation : occupation -> 11 Characters
# - Amount(User format): amount -> 06 Characters (Convert to Number)
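To make the layout above concrete, here is a minimal sketch that turns the listed widths into 0-based, end-exclusive character offsets. The names `FIELD_WIDTHS` and `field_slices` are hypothetical helpers introduced only for illustration.

```python
from itertools import accumulate

# Widths copied from the field list above (1 + 6 + 1 + 11 + 6 = 25 characters).
FIELD_WIDTHS = {"indAct": 1, "codPer": 6, "sex": 1, "occupation": 11, "amount": 6}

def field_slices(widths):
    """Return {field: (start, stop)} as 0-based, end-exclusive offsets."""
    stops = list(accumulate(widths.values()))
    starts = [0] + stops[:-1]
    return {name: (a, b) for name, a, b in zip(widths, starts, stops)}

SLICES = field_slices(FIELD_WIDTHS)

# Split one sample record by position.
line = "I128734MPROGRAMADOR00-111"
record = {name: line[a:b] for name, (a, b) in SLICES.items()}
print(SLICES)
print(record)
```

The resulting offsets, (0, 1), (1, 7), (7, 8), (8, 19) and (19, 25), are the positions any slicing-based reader of this file has to use.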
I'm not sure whether the following is the best option:
import pandas as pd
import numpy as np

def stoI(cad):
    # Convert a user-format amount to int; everything up to a "-" is padding
    pos = cad.find("-")
    if pos < 0:
        return int(cad)
    return int(cad[pos+1:]) * -1

# Read the text file: each line becomes one string in column 0
data = pd.read_csv(r'D:\file_sd.txt', header=None)

# Slice the fixed-width fields out of column 0
data_sep = pd.DataFrame(
    {
        'indAct': data[0].str.slice(0, 1),
        'codPer': data[0].str.slice(1, 7),
        'sexo':   data[0].str.slice(7, 8),
        'ocupac': data[0].str.slice(8, 19),
        'monto':  np.vectorize(stoI)(data[0].str.slice(19, 25))
    })
print(data_sep)
  indAct  codPer  sexo       ocupac  monto
0      A  123456     M  ESTUDIANTE     -12
1      A  123457     M  PROFESOR      3103
2      I  128734     M  PROGRAMADOR   -111
3      A  129863     F  ARQUITECTO    -456
On a file with 7 million rows, this solution gives:
%timeit df_slice()
11.1 s ± 166 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
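One alternative worth benchmarking against the timing above is `pandas.read_fwf`, which takes the field widths directly. The sketch below assumes the widths listed in the question (1, 6, 1, 11, 6) and that every amount field is non-empty. `parse_amount` and `read_fixed_width` are hypothetical helper names, and an in-memory sample stands in for the real file. No speed claim is made here: whether this beats the slicing approach has to be measured on the actual data.

```python
import io
import pandas as pd

# Assumed layout, taken from the field list in the question.
WIDTHS = [1, 6, 1, 11, 6]
NAMES = ["indAct", "codPer", "sexo", "ocupac", "monto"]

def parse_amount(raw: pd.Series) -> pd.Series:
    """Convert amounts such as '000-12' or '003103' to integers."""
    negative = raw.str.contains("-", regex=False)
    # Keep the digits after the last '-' (or the whole string if there is none).
    digits = raw.str.rsplit("-", n=1).str[-1].astype("int64")
    return digits.where(~negative, -digits)

def read_fixed_width(source) -> pd.DataFrame:
    # dtype=str keeps codes such as codPer as text instead of numbers.
    df = pd.read_fwf(source, widths=WIDTHS, names=NAMES, header=None, dtype=str)
    df["monto"] = parse_amount(df["monto"])
    return df

# In-memory sample; pass a file path instead to read the real file.
sample = io.StringIO(
    "A123456MESTUDIANTE 000-12\n"
    "A123457MPROFESOR   003103\n"
    "I128734MPROGRAMADOR00-111\n"
    "A129863FARQUITECTO 00-456\n"
)
df = read_fixed_width(sample)
print(df)
```

Note that `read_fwf` strips the padding spaces from each field by default, so `ocupac` comes back without trailing blanks, unlike the `str.slice` version.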
Source: StackOverflow, /questions/59383835/python-efficiency-when-reading-a-file-without-a-default-delimiter-with-millions