Fastest way to do data type conversion using csv.DictReader in Python


I'm working with a CSV file in Python, which will have ~100,000 rows when in use. Each row has a set of dimensions (as strings) and a single metric (a float).

Since csv.DictReader and csv.reader return every value as a string, I'm currently iterating over all rows and converting the one numeric value to a float.

    for i in csvDict:
        i[col] = float(i[col])

Is there a better way that anyone could suggest to do this? I've been playing around with various combinations of map, izip, itertools and have searched extensively for some samples of doing it more efficiently, but unfortunately haven't had much success.

In case it helps:

I'm doing this on App Engine. I believe that what I'm doing may be causing this error:

Exceeded soft process size limit with 267.789 MB after servicing 11 requests total

I only get it when the CSV is quite large.

Edit: My Goal

I'm parsing this CSV so that I can use it as a data source for the Google Visualizations API. The final data set will be loaded into a gviz DataTable for querying. Types must be specified when this table is constructed. My problem could also be solved if anyone knew of a good gviz csv->datatable converter in Python!

Edit2: My Code

I believe that my issue has to do with the way I attempt to fixCsvTypes(). Also, data_table.LoadData() expects an iterable object.

    import csv
    import datetime
    import StringIO  # Python 2 module, as used in the original code

    import gviz_api


    class GvizFromCsv(object):
        """Convert CSV to Gviz ready objects."""

        def __init__(self, csvFile, dateTimeFormat=None):
            self.fileObj = StringIO.StringIO(csvFile)
            # list() materializes every row as a dict in memory at once
            self.csvDict = list(csv.DictReader(self.fileObj))
            self.dateTimeFormat = dateTimeFormat
            self.headers = {}
            self.ParseHeaders()
            self.fixCsvTypes()

        def IsNumber(self, st):
            try:
                float(st)
                return True
            except ValueError:
                return False

        def IsDate(self, st):
            try:
                datetime.datetime.strptime(st, self.dateTimeFormat)
                # the original fell through here and returned None (falsy)
                return True
            except ValueError:
                return False

        def ParseHeaders(self):
            """Attempts to figure out header types for gviz, based on first row"""
            for k, v in self.csvDict[0].items():
                if self.IsNumber(v):
                    self.headers[k] = 'number'
                elif self.dateTimeFormat and self.IsDate(v):
                    self.headers[k] = 'date'
                else:
                    self.headers[k] = 'string'

        def fixCsvTypes(self):
            """Only fixes numbers."""
            update_to_numbers = []
            for k, v in self.headers.items():
                if v == 'number':
                    update_to_numbers.append(k)
            for i in self.csvDict:
                for col in update_to_numbers:
                    i[col] = float(i[col])

        def CreateDataTable(self):
            """creates a gviz data table"""
            data_table = gviz_api.DataTable(self.headers)
            data_table.LoadData(self.csvDict)
            return data_table

Solution

I first parsed the CSV file with a regex, but since the data in the file is laid out very strictly in each row, we can simply use the split() function:

    import gviz_api

    scheme = [('col1','string','SURNAME'),('col2','number','ONE'),('col3','number','TWO')]
    data_table = gviz_api.DataTable(scheme)

    # --- lines in surnames.csv are : ---
    # surname,percent,cumulative percent,rank\n
    # SMITH,1.006,1.006,1,\n
    # JOHNSON,0.810,1.816,2,\n
    # WILLIAMS,0.699,2.515,3,\n

    def transf(surname,x,y):
        return (surname,float(x),float(y))

    with open('surnames.csv') as f:
        # skip the first line: surname,percent,cumulative percent,rank\n
        f.readline()
        # populate the data table by iterating over the CSV file
        data_table.LoadData( transf(*line.split(',')[0:3]) for line in f )

Or, without defining a helper function:

    import gviz_api

    scheme = [('col1','string','SURNAME'),('col2','number','ONE'),('col3','number','TWO')]
    data_table = gviz_api.DataTable(scheme)

    # --- lines in surnames.csv are : ---
    # surname,percent,cumulative percent,rank\n
    # SMITH,1.006,1.006,1,\n
    # JOHNSON,0.810,1.816,2,\n
    # WILLIAMS,0.699,2.515,3,\n

    with open('surnames.csv') as f:
        # skip the first line: surname,percent,cumulative percent,rank\n
        f.readline()
        # populate the data table by iterating over the CSV file
        data_table.LoadData( [el if n==0 else float(el) for n,el in enumerate(line.split(',')[0:3])] for line in f )

At one point I thought I had to populate the data table one row at a time, because I was using a regex and needed to extract the match groups before converting the numeric strings to floats. With split(), everything can be done in a single LoadData() call.

Hence, your code can be shortened. I also don't see why it needs to define a class; a function seems sufficient to me:

    def GvizFromCsv(filename):
        """ creates a gviz data table from a CSV file """
        data_table = gviz_api.DataTable([('col1','string','SURNAME'),
                                         ('col2','number','ONE'    ),
                                         ('col3','number','TWO'    ) ])

        # --- with such a table schema, lines in the file must look like this: ---
        # blah, number, number, ...anything else...\n
        # SMITH,1.006,1.006, ...anything else...\n
        # JOHNSON,0.810,1.816, ...anything else...\n
        # WILLIAMS,0.699,2.515, ...anything else...\n

        with open(filename) as f:
            data_table.LoadData( [el if n==0 else float(el) for n,el in enumerate(line.split(',')[0:3])]
                                 for line in f )
        return data_table
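One caveat with line.split(','): it breaks on quoted fields that themselves contain a comma (for example "SMITH, JR"). The following is a minimal sketch, assuming Python 3 and the same three-column layout as above, that swaps in csv.reader for split() while keeping the lazy, row-by-row iteration. iter_typed_rows is a hypothetical helper name, not part of gviz_api; the generator it returns could presumably be passed to data_table.LoadData() in place of the generator expression.

```python
import csv
import io


def iter_typed_rows(lines, skip_header=True):
    """Lazily yield [surname, float, float] rows from an iterable of CSV lines.

    Hypothetical helper for illustration; not part of gviz_api.
    """
    reader = csv.reader(lines)
    if skip_header:
        next(reader, None)  # discard the header row, if there is one
    for row in reader:
        if not row:
            continue  # skip blank lines
        # Column 0 stays a string; columns 1 and 2 are converted to float.
        yield [row[0], float(row[1]), float(row[2])]


# Small in-memory sample standing in for surnames.csv.
sample = io.StringIO(
    'surname,percent,cumulative percent,rank\n'
    '"SMITH, JR",1.006,1.006,1,\n'
    'JOHNSON,0.810,1.816,2,\n'
)
rows = list(iter_typed_rows(sample))
```

With a real file, open it inside a with block and pass the file object to iter_typed_rows, so that only one row is converted at a time.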

Now you need to check whether the way the CSV data is read from the other API can be plugged into this code, so that the data table is still populated by iteration.
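For the case in the question, where the CSV content arrives as a string rather than as a file on disk, the same iterating principle can be kept with csv.DictReader. This is a minimal sketch, assuming Python 3 (io.StringIO instead of the Python 2 StringIO module) and assuming the numeric column names are already known; typed_dict_rows and numeric_cols are hypothetical names. It avoids building the intermediate list(csv.DictReader(...)), although the data table itself presumably still holds every row once loaded.

```python
import csv
import io


def typed_dict_rows(csv_text, numeric_cols):
    """Lazily yield one dict per CSV row, converting numeric_cols to float.

    Hypothetical helper for illustration; not part of gviz_api.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        for col in numeric_cols:
            row[col] = float(row[col])
        yield row


# Tiny sample standing in for the uploaded CSV string.
csv_text = "surname,percent\nSMITH,1.006\nJOHNSON,0.810\n"
rows = typed_dict_rows(csv_text, ["percent"])
first = next(rows)    # only the first data row has been parsed so far
remaining = list(rows)
```

Because each row is converted as it is read, the conversion loop and the load step become a single pass over the data.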
