I have a large dataset stored as a 17 GB csv file (fileData), which contains a variable number of records per customer (up to 30,000). I am trying to search for specific customers (1,500 of the 90,000 customers in total, listed in fileSelection) and copy each customer's records into a separate csv file (fileOutput).
I am very new to Python, but am using it because VBA and MATLAB (which I know better) can't handle the file size. (I write the code in Aptana Studio, but run python directly from the cmd line for speed. Running 64-bit Windows 7.)
The code I have written extracts some of the customers, but has two problems:
1) It fails to find most of the customers in the large dataset. (I believe they are all in the dataset, but cannot be completely sure.)
2) It is VERY slow. Any way to make the core of the code work more efficiently would be welcome.
Here is the code:

```python
def main():
    # Initialisation:
    # - identify columns in selection file
    #
    fS = open(fileSelection, "r")
    if fS.mode == "r":
        header = fS.readline()
        selheaderlist = header.split(",")
        custkey = selheaderlist.index('CUSTOMER_KEY')
    #
    # Identify columns in dataset file
    fileData = path2 + file_data
    fD = open(fileData, "r")
    if fD.mode == "r":
        header = fD.readline()
        dataheaderlist = header.split(",")
        custID = dataheaderlist.index('CUSTOMER_ID')
    fD.close()
    # For each customer in the selection file
    customercount = 1
    for sr in fS:
        # Find customer key and locate it in customer ID field in dataset
        selrecord = sr.split(",")
        requiredcustomer = selrecord[custkey]
        # Look for required customer in dataset
        found = 0
        fD = open(fileData, "r")
        if fD.mode == "r":
            while found == 0:
                dr = fD.readline()
                if not dr:
                    break
                datrecord = dr.split(",")
                if datrecord[custID] == requiredcustomer:
                    found = 1
                    # Open output file
                    fileOutput = path3 + file_out_root + str(requiredcustomer) + ".csv"
                    fO = open(fileOutput, "w+")
                    fO.write(str(header))
                    # Copy all records for required customer number
                    while datrecord[custID] == requiredcustomer:
                        fO.write(str(dr))
                        dr = fD.readline()
                        datrecord = dr.split(",")
                    # Close output file
                    fO.close()
        if found == 1:
            print("Customer Count " + str(customercount) + " Customer ID " + str(requiredcustomer) + " copied.")
            customercount = customercount + 1
        else:
            print("Customer ID " + str(requiredcustomer) + " not found in dataset")
            fL.write(str(requiredcustomer) + "," + "NOT FOUND")
        fD.close()
    fS.close()
```
It took days and found a few hundred customers, but never found any more.
Thanks @Paul Cornelius. That's much more efficient. I adopted your approach, and also used the csv handling suggested by @Bernardo: