I have a large dataset stored as a 17 GB csv file (fileData), which contains a variable number of records per customer (up to 30,000). I am trying to search for specific customers (1,500 of the 90,000 customers in total, listed in fileSelection) and copy each customer's records into a separate csv file (fileOutput).
I am very new to Python, but am using it because VBA and MATLAB (which I know better) can't handle the file size. (I write the code in Aptana Studio, but run python directly from the cmd line for speed. Running 64-bit Windows 7.)
The code I have written extracts some of the customers, but has two problems:
1) It fails to find most of the customers in the large dataset. (I believe they are all in the dataset, but cannot be completely sure.)
2) It is VERY slow. Any way to make the core of the code work more efficiently would be welcome.
Here is the code:

```python
def main():
    # Initialisation:
    # - identify columns in selection file
    #
    fS = open(fileSelection, "r")
    if fS.mode == "r":
        header = fS.readline()
        selheaderlist = header.split(",")
        custkey = selheaderlist.index('CUSTOMER_KEY')
    #
    # Identify columns in dataset file
    fileData = path2 + file_data
    fD = open(fileData, "r")
    if fD.mode == "r":
        header = fD.readline()
        dataheaderlist = header.split(",")
        custID = dataheaderlist.index('CUSTOMER_ID')
    fD.close()
    # For each customer in the selection file
    customercount = 1
    for sr in fS:
        # Find customer key and locate it in customer ID field in dataset
        selrecord = sr.split(",")
        requiredcustomer = selrecord[custkey]
        # Look for required customer in dataset
        found = 0
        fD = open(fileData, "r")
        if fD.mode == "r":
            while found == 0:
                dr = fD.readline()
                if not dr:
                    break
                datrecord = dr.split(",")
                if datrecord[custID] == requiredcustomer:
                    found = 1
                    # Open output file
                    fileOutput = path3 + file_out_root + str(requiredcustomer) + ".csv"
                    fO = open(fileOutput, "w+")
                    fO.write(str(header))
                    # Copy all records for required customer number
                    while datrecord[custID] == requiredcustomer:
                        fO.write(str(dr))
                        dr = fD.readline()
                        datrecord = dr.split(",")
                    # Close output file
                    fO.close()
        if found == 1:
            print("Customer Count " + str(customercount) + " Customer ID " + str(requiredcustomer) + " copied.")
            customercount = customercount + 1
        else:
            print("Customer ID " + str(requiredcustomer) + " not found in dataset")
            fL.write(str(requiredcustomer) + "," + "NOT FOUND")
        fD.close()
    fS.close()
```
It took days and found a few hundred customers, but never found any more.
Thanks @Paul Cornelius. That's much more efficient. I adopted your approach, and also used the csv handling suggested by @Bernardo: