python csv.reader: The fastest way to do data type conversion with csv.DictReader in Python

Latest recommended article published on 2023-09-01 14:00:43

weixin_39938312

Article tags: python csv.reader

I'm working with a CSV file in Python that will have ~100,000 rows when in use. Each row has a set of dimensions (as strings) and a single metric (a float).

Because csv.DictReader and csv.reader return every value as a string, I'm currently iterating over all rows and converting the one numeric value to a float:

    for i in csvDict:
        i[col] = float(i[col])

Can anyone suggest a better way to do this? I've been experimenting with various combinations of map, izip, and itertools, and have searched extensively for examples of doing it more efficiently, but haven't had much success.

In case it helps:

I'm doing this on App Engine. I believe that what I'm doing may be causing the following error, which I only get when the CSV is quite large:

    Exceeded soft process size limit with 267.789 MB after servicing 11 requests total

Edit: My Goal

I'm parsing this CSV so that I can use it as a data source for the Google Visualization API. The final data set will be loaded into a gviz DataTable for querying, and the column types must be specified when this table is constructed. My problem could also be solved if anyone knows of a good gviz CSV-to-DataTable converter in Python!

Edit2: My Code

I believe that my issue has to do with the way I attempt to fixCsvTypes(). Also, data_table.LoadData() expects an iterable object.

    import csv
    import datetime
    import StringIO  # Python 2 module, as used by the original code

    import gviz_api


    class GvizFromCsv(object):
        """Convert CSV to Gviz ready objects."""

        def __init__(self, csvFile, dateTimeFormat=None):
            self.fileObj = StringIO.StringIO(csvFile)
            # Materializes every row in memory at once
            self.csvDict = list(csv.DictReader(self.fileObj))
            self.dateTimeFormat = dateTimeFormat
            self.headers = {}
            self.ParseHeaders()
            self.fixCsvTypes()

        def IsNumber(self, st):
            try:
                float(st)
                return True
            except ValueError:
                return False

        def IsDate(self, st):
            try:
                datetime.datetime.strptime(st, self.dateTimeFormat)
                return True  # added: the method otherwise returned None on success
            except ValueError:
                return False

        def ParseHeaders(self):
            """Attempts to figure out header types for gviz, based on first row"""
            for k, v in self.csvDict[0].items():
                if self.IsNumber(v):
                    self.headers[k] = 'number'
                elif self.dateTimeFormat and self.IsDate(v):
                    self.headers[k] = 'date'
                else:
                    self.headers[k] = 'string'

        def fixCsvTypes(self):
            """Only fixes numbers."""
            update_to_numbers = []
            for k, v in self.headers.items():
                if v == 'number':
                    update_to_numbers.append(k)
            for i in self.csvDict:
                for col in update_to_numbers:
                    i[col] = float(i[col])

        def CreateDataTable(self):
            """creates a gviz data table"""
            data_table = gviz_api.DataTable(self.headers)
            data_table.LoadData(self.csvDict)
            return data_table

Solution

I first parsed the CSV file with a regex, but since the data in each row of the file is very strictly arranged, we can simply use the split() function:

    import gviz_api

    scheme = [('col1', 'string', 'SURNAME'), ('col2', 'number', 'ONE'), ('col3', 'number', 'TWO')]
    data_table = gviz_api.DataTable(scheme)

    # --- lines in surnames.csv are: ---
    # surname,percent,cumulative percent,rank\n
    # SMITH,1.006,1.006,1,\n
    # JOHNSON,0.810,1.816,2,\n
    # WILLIAMS,0.699,2.515,3,\n

    def transf(surname, x, y):
        return (surname, float(x), float(y))

    with open('surnames.csv') as f:
        # skip the first line: surname,percent,cumulative percent,rank\n
        f.readline()
        # populate the data table by iterating over the CSV file
        data_table.LoadData(transf(*line.split(',')[0:3]) for line in f)

Or, without defining a function:

    import gviz_api

    scheme = [('col1', 'string', 'SURNAME'), ('col2', 'number', 'ONE'), ('col3', 'number', 'TWO')]
    data_table = gviz_api.DataTable(scheme)

    # --- lines in surnames.csv are: ---
    # surname,percent,cumulative percent,rank\n
    # SMITH,1.006,1.006,1,\n
    # JOHNSON,0.810,1.816,2,\n
    # WILLIAMS,0.699,2.515,3,\n

    with open('surnames.csv') as f:
        # skip the first line: surname,percent,cumulative percent,rank\n
        f.readline()
        # populate the data table by iterating over the CSV file;
        # the first field stays a string, the next two become floats
        data_table.LoadData([el if n == 0 else float(el)
                             for n, el in enumerate(line.split(',')[0:3])]
                            for line in f)

At one point I thought I had to populate the data table one row at a time, because I was using a regex and needed to extract the match groups before converting the numeric strings to floats. With split(), everything can be done in a single LoadData() call.

Hence, your code can be shortened. By the way, I don't see why it still needs to be a class; a function seems sufficient to me:

    def GvizFromCsv(filename):
        """ creates a gviz data table from a CSV file """
        data_table = gviz_api.DataTable([('col1', 'string', 'SURNAME'),
                                         ('col2', 'number', 'ONE'),
                                         ('col3', 'number', 'TWO')])

        # --- with such a table schema, lines in the file must look like this: ---
        # blah, number, number, ...anything else...\n
        # SMITH,1.006,1.006, ...anything else...\n
        # JOHNSON,0.810,1.816, ...anything else...\n
        # WILLIAMS,0.699,2.515, ...anything else...\n

        with open(filename) as f:
            data_table.LoadData([el if n == 0 else float(el)
                                 for n, el in enumerate(line.split(',')[0:3])]
                                for line in f)
        return data_table
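As a hedged variation on the split() approach, the same row-by-row conversion can be sketched with csv.reader, which also copes with quoted fields that contain commas. The helper name typed_rows and its parameters are illustrative assumptions and not part of gviz_api. The sketch is written for Python 3 and uses in-memory sample data so that it runs on its own.

```python
import csv
import io


def typed_rows(lines, numeric_idx=(1, 2), width=3, skip_header=True):
    """Lazily yield rows with the chosen columns converted to float.

    `lines` is any iterable of CSV text lines (an open file, for example).
    This is a hypothetical helper for illustration only.
    """
    reader = csv.reader(lines)
    if skip_header:
        next(reader, None)  # drop the header row if there is one
    numeric = set(numeric_idx)
    for row in reader:
        if not row:
            continue  # ignore blank lines
        # keep only the first `width` fields and convert the numeric ones
        yield [float(v) if i in numeric else v
               for i, v in enumerate(row[:width])]


# Sample data with the same layout as surnames.csv, plus one quoted field
sample = io.StringIO(
    "surname,percent,cumulative percent,rank\n"
    "SMITH,1.006,1.006,1,\n"
    '"SMITH, JR",0.810,1.816,2,\n'
)
rows = list(typed_rows(sample))
```

Because typed_rows is a generator, it could presumably be passed to data_table.LoadData() in place of the generator expression, in the same way as the split() version.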

Now you need to check whether the way the CSV data is read from another API can be fitted into this code, so that the data table is still populated by iterating.
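When the CSV arrives as a string rather than as a file, as in the question's class, one possible shape for this is sketched below. Instead of building list(csv.DictReader(...)) and then mutating every row, a generator converts each row as it is read, so this step holds only one row at a time. The function name iter_typed_dicts and the sample column names are assumptions for illustration. Whether LoadData() accepts such a generator of dicts for a given table description should be checked against gviz_api. Written for Python 3.

```python
import csv
import io


def iter_typed_dicts(csv_text, numeric_columns):
    """Yield one dict per CSV row, converting the named columns to float.

    Hypothetical helper: `csv_text` is the whole CSV as a string and
    `numeric_columns` lists the header names whose values are numbers.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        for col in numeric_columns:
            row[col] = float(row[col])
        yield row


# Made-up sample: two string dimensions and one numeric metric per row
csv_text = "country,browser,visits\nUS,Chrome,120.5\nDE,Firefox,80\n"
converted = list(iter_typed_dicts(csv_text, ["visits"]))
```

The list() call here is only for demonstration. In real use the generator would be handed straight to the consumer, so the intermediate list of string-valued dicts is never built. The DataTable itself will still store every row it is given.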
