I'm working with a CSV file in Python, which will have ~100,000 rows when in use. Each row has a set of dimensions (as strings) and a single metric (float).
As csv.DictReader or csv.reader return values as string only, I'm currently iterating over all rows and converting the one numeric value to a float.
for i in csvDict:
    i[col] = float(i[col])
Is there a better way that anyone could suggest to do this? I've been playing around with various combinations of map, izip, itertools and have searched extensively for some samples of doing it more efficiently, but unfortunately haven't had much success.
In case it helps:
I'm doing this on App Engine. I believe that what I'm doing may be causing this error, which I only get when the CSV is quite large:
Exceeded soft process size limit with 267.789 MB after servicing 11 requests total
Edit: My Goal
I'm parsing this CSV so that I can use it as a data source for the Google Visualizations API. The final data set will be loaded into a gviz DataTable for querying. Types must be specified when the table is constructed. My problem could also be solved if anyone knew of a good gviz csv->datatable converter in Python!
Edit2: My Code
I believe that my issue has to do with the way I attempt to fixCsvTypes(). Also, data_table.LoadData() expects an iterable object.
import csv
import datetime
import StringIO

import gviz_api


class GvizFromCsv(object):
    """Convert CSV to Gviz ready objects."""

    def __init__(self, csvFile, dateTimeFormat=None):
        self.fileObj = StringIO.StringIO(csvFile)
        # The whole file is materialised as a list of dicts here.
        self.csvDict = list(csv.DictReader(self.fileObj))
        self.dateTimeFormat = dateTimeFormat
        self.headers = {}
        self.ParseHeaders()
        self.fixCsvTypes()

    def IsNumber(self, st):
        try:
            float(st)
            return True
        except ValueError:
            return False

    def IsDate(self, st):
        try:
            datetime.datetime.strptime(st, self.dateTimeFormat)
            # Without this return the method yields None on success,
            # so no column would ever be detected as a date.
            return True
        except ValueError:
            return False

    def ParseHeaders(self):
        """Attempts to figure out header types for gviz, based on first row"""
        for k, v in self.csvDict[0].items():
            if self.IsNumber(v):
                self.headers[k] = 'number'
            elif self.dateTimeFormat and self.IsDate(v):
                self.headers[k] = 'date'
            else:
                self.headers[k] = 'string'

    def fixCsvTypes(self):
        """Only fixes numbers."""
        update_to_numbers = []
        for k, v in self.headers.items():
            if v == 'number':
                update_to_numbers.append(k)
        for i in self.csvDict:
            for col in update_to_numbers:
                i[col] = float(i[col])

    def CreateDataTable(self):
        """creates a gviz data table"""
        data_table = gviz_api.DataTable(self.headers)
        data_table.LoadData(self.csvDict)
        return data_table
Solution
I first parsed the CSV file with a regex, but since every row in the file follows the same strict layout, we can simply use the split() function:
import gviz_api

scheme = [('col1', 'string', 'SURNAME'), ('col2', 'number', 'ONE'), ('col3', 'number', 'TWO')]
data_table = gviz_api.DataTable(scheme)

# --- lines in surnames.csv are : ---
# surname,percent,cumulative percent,rank\n
# SMITH,1.006,1.006,1,\n
# JOHNSON,0.810,1.816,2,\n
# WILLIAMS,0.699,2.515,3,\n

with open('surnames.csv') as f:

    def transf(surname, x, y):
        return (surname, float(x), float(y))

    # to skip the first line surname,percent,cumulative percent,rank\n
    f.readline()
    # to populate the data table by iterating in the CSV file
    data_table.LoadData(transf(*line.split(',')[0:3]) for line in f)
Or, without defining a function:
import gviz_api

scheme = [('col1', 'string', 'SURNAME'), ('col2', 'number', 'ONE'), ('col3', 'number', 'TWO')]
data_table = gviz_api.DataTable(scheme)

# --- lines in surnames.csv are : ---
# surname,percent,cumulative percent,rank\n
# SMITH,1.006,1.006,1,\n
# JOHNSON,0.810,1.816,2,\n
# WILLIAMS,0.699,2.515,3,\n

with open('surnames.csv') as f:
    # to skip the first line surname,percent,cumulative percent,rank\n
    f.readline()
    # to populate the data table by iterating in the CSV file
    data_table.LoadData([el if n == 0 else float(el)
                         for n, el in enumerate(line.split(',')[0:3])]
                        for line in f)
At one point I thought I had to populate the data table one row at a time, because I was using a regex and needed to extract the match groups before converting the numeric strings to floats. With split(), everything can be done in a single LoadData() call.
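To show why feeding LoadData() a generator helps with memory, here is a minimal sketch that can be tried without gviz_api. It is written for Python 3 and uses the standard csv module instead of split(), which is my own substitution so that quoted fields would also be handled. The names typed_rows, numeric_columns and width are hypothetical, chosen only for this illustration.

```python
import csv
import io


def typed_rows(lines, numeric_columns, width=3):
    """Yield one converted row at a time instead of building a full list.

    `lines` is any iterable of text lines (an open file, a StringIO, ...).
    `numeric_columns` is a set of column indexes to convert to float.
    `width` is how many leading columns to keep from each row.
    All three names are hypothetical, for this sketch only.
    """
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row
    for row in reader:
        yield [float(cell) if i in numeric_columns else cell
               for i, cell in enumerate(row[:width])]


# Sample data in the same layout as surnames.csv above.
sample = io.StringIO(
    "surname,percent,cumulative percent,rank\n"
    "SMITH,1.006,1.006,1,\n"
    "JOHNSON,0.810,1.816,2,\n"
)

rows = typed_rows(sample, numeric_columns={1, 2})
first = next(rows)   # only the first data row has been parsed at this point
second = next(rows)
```

Because rows is a generator, each row is converted only when the consumer asks for it, so no intermediate list of 100,000 converted rows is built on the Python side before the table is populated.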
Hence, your code can be shortened. By the way, I don't see why it needs to be a class at all; a function seems enough to me:
def GvizFromCsv(filename):
    """ creates a gviz data table from a CSV file """

    data_table = gviz_api.DataTable([('col1', 'string', 'SURNAME'),
                                     ('col2', 'number', 'ONE'),
                                     ('col3', 'number', 'TWO')])

    # --- with such a table schema , lines in the file must be like that: ---
    # blah, number, number, ...anything else...\n
    # SMITH,1.006,1.006, ...anything else...\n
    # JOHNSON,0.810,1.816, ...anything else...\n
    # WILLIAMS,0.699,2.515, ...anything else...\n

    with open(filename) as f:
        data_table.LoadData([el if n == 0 else float(el)
                             for n, el in enumerate(line.split(',')[0:3])]
                            for line in f)
    return data_table
Now you need to check whether the way the CSV data is read from the other API can be fitted into this code, so that the data table is still populated by iterating over the rows rather than by building a full list first.
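If the CSV arrives as a single string, as it does in the question's constructor, the same iterating principle might look like the sketch below. It is written for Python 3 and assumes column types can be inferred from the first data row, as the question's ParseHeaders() does. The names is_number and schema_and_rows are hypothetical, and the schema tuples simply reuse the column name as both id and label.

```python
import csv
import io
import itertools


def is_number(text):
    """Return True if `text` parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def schema_and_rows(csv_text):
    """Return (schema, row generator) for CSV content held in a string.

    Hypothetical helper: types are guessed from the first data row only,
    and only 'number' and 'string' are distinguished.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    first = next(reader)
    numeric = [is_number(cell) for cell in first]
    schema = [(name, 'number' if num else 'string', name)
              for name, num in zip(header, numeric)]

    def rows():
        # Put the already-consumed first row back in front of the rest.
        for row in itertools.chain([first], reader):
            yield [float(cell) if num else cell
                   for cell, num in zip(row, numeric)]

    return schema, rows()


schema, rows = schema_and_rows("surname,percent\nSMITH,1.006\nJOHNSON,0.810\n")
converted = list(rows)  # materialised here only to inspect the result
```

In real use the generator would be passed straight to the table, along the lines of gviz_api.DataTable(schema) followed by LoadData(rows), without the list() call. The original string still sits in memory, but the list of per-row dicts from the question is avoided.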