- How to load a CSV file
- How to convert strings from a file to floating point numbers.
- How to convert class values from a file to integers.
1.2 Tutorial
- Load a file.
- Load a file and convert strings to floats.
- Load a file and convert strings to integers.
# Function for loading a CSV
# load a CSV file
from csv import reader
def load_csv(filename):
    file = open(filename, "r")
    lines = reader(file)
    dataset = list(lines)
    return dataset
load_csv('pima-indians-diabetes.data.csv')
# Example of Loading the Pima Indians Diabetes Dataset CSV File
# Example of loading Pima Indians CSV dataset
from csv import reader
# Load a csv file
def load_csv(filename):
    file = open(filename, "r")
    lines = reader(file)
    dataset = list(lines)
    return dataset
# Load dataset
filename = 'pima-indians-diabetes.data.csv'
dataset = load_csv(filename)
print('Loaded data file {0} with {1} rows and {2} columns'.format(filename,len(dataset),len(dataset[0])))
Sample output from loading the Pima Indians Diabetes dataset CSV file.
A limitation of this function is that it will load empty lines from data files and add them to our list of rows. Below is the updated example with a new, improved version of the load_csv() function that skips empty rows.
# Improved Example of Loading the Pima Indians Diabetes Dataset CSV File
# Example of loading Pima Indians CSV dataset
from csv import reader
# Load a CSV file
def load_csv(filename):
    dataset = list()
    with open(filename, 'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset
# Load dataset
filename = 'pima-indians-diabetes.data.csv'
dataset = load_csv(filename)
print('Loaded data file {0} with {1} rows and {2} columns'.format(filename,len(dataset),len(dataset[0])))
Sample Output From Loading the Pima Indians Diabetes Dataset CSV File
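To see the effect of the empty-row check without needing the data file, the following minimal sketch runs the same loop over an in-memory string. The helper name load_csv_rows and the sample text are made up for illustration, and io.StringIO stands in for an open file.

```python
from csv import reader
from io import StringIO

# Hypothetical helper: same loop as load_csv(), but reads from any file-like object
def load_csv_rows(file):
    dataset = list()
    for row in reader(file):
        # csv.reader yields an empty list for a blank line
        if not row:
            continue
        dataset.append(row)
    return dataset

# Made-up sample with a blank line in the middle and another at the end
sample = "6,148,72\n\n1,85,66\n\n"
rows = load_csv_rows(StringIO(sample))
print(rows)  # [['6', '148', '72'], ['1', '85', '66']]
```

Only the two non-empty rows are kept. Without the check, the list would also contain two empty lists.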
1.2 Convert String to Floats
Most, if not all, machine learning algorithms prefer to work with numbers. Specifically, floating point numbers are preferred. Our code for loading a CSV file returns a dataset as a list of lists, but each value is a string. We can see this if we print out one record from the dataset:
print(dataset[0])
We can write a small function to convert specific columns of our loaded dataset to floating point values. Below is this function, called str_column_to_float(). It converts a given column in the dataset to floating point values in place, taking care to strip any whitespace from each value before making the conversion.
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())
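As a quick check that needs no data file, here is a minimal sketch that applies the function to a small hand-made dataset. The rows are made up for illustration, and the function is repeated only so the snippet runs on its own.

```python
# Same function as above, repeated so the sketch is self-contained
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

# Made-up rows; note the stray spaces around some values
rows = [[' 6 ', '148'], ['1', ' 85.5']]
str_column_to_float(rows, 0)
print(rows)  # [[6.0, '148'], [1.0, ' 85.5']]
```

Only column 0 is converted, and the rows are modified in place, so the function does not need to return anything.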
We can test this function by combining it with our load_csv() function above and converting all of the numeric data in the Pima Indians dataset to floating point values. The complete example is below.
# Example of converting string variables to float
from csv import reader
# Load a CSV file
def load_csv(filename):
    dataset = list()
    with open(filename, 'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset
# Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())
# Load pima-indians-diabetes dataset
filename = 'pima-indians-diabetes.data.csv'
dataset = load_csv(filename)
print('Loaded data file {0} with {1} rows and {2} columns'.format(filename,len(dataset),len(dataset[0])))
print(dataset[0])
# convert string columns to float
for i in range(len(dataset[0])):
    str_column_to_float(dataset, i)
print(dataset[0])
Running this example we see the first row of the dataset printed both before and after the conversion. We can see that the values in each column have been converted from strings to numbers.
Some machine learning algorithms prefer all values to be numeric, including the outcome or predicted value. We can convert the class value in the iris flowers dataset to an integer by creating a map.
- First, we locate all of the unique class values, which happen to be: Iris-setosa, Iris-versicolor and Iris-virginica.
- Next, we assign an integer value to each, such as: 0, 1 and 2.
- Finally, we replace all occurrences of class string values with their corresponding integer values.
Below is a function to do just that called str_column_to_int(). Like the previously introduced str_column_to_float() it operates on a single column in the dataset.
# Example of integer encoding string class values
from csv import reader
# Load a CSV file
def load_csv(filename):
    dataset = list()
    with open(filename, 'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset
# Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())
# Convert string column to integer
def str_column_to_int(dataset, column):
    class_values = [row[column] for row in dataset]
    unique = set(class_values)
    lookup = dict()
    for i, value in enumerate(unique):
        lookup[value] = i
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup
# Load iris dataset
filename = 'iris.csv'
dataset = load_csv(filename)
print('Loaded data file {0} with {1} rows and {2} columns'.format(filename,len(dataset),len(dataset[0])))
print(dataset[0])
# convert string columns to float
for i in range(4):
    str_column_to_float(dataset, i)
# convert class column to int
lookup = str_column_to_int(dataset, 4)
print(dataset[0])
print(lookup)
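Because str_column_to_int() enumerates a set, the integer assigned to each class can differ from one run to the next. If a stable mapping is wanted, one option is to sort the unique values before numbering them. The sketch below is an assumption for illustration, not part of the function above; the name str_column_to_int_sorted and the sample rows are hypothetical.

```python
# Hypothetical variant: sort the unique values so the mapping is the same on every run
def str_column_to_int_sorted(dataset, column):
    unique = sorted(set(row[column] for row in dataset))
    lookup = {value: i for i, value in enumerate(unique)}
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup

# Made-up rows in the shape of the iris data (one measurement, then the class)
rows = [
    ['5.1', 'Iris-virginica'],
    ['4.9', 'Iris-setosa'],
    ['6.3', 'Iris-versicolor'],
    ['5.0', 'Iris-setosa'],
]
lookup = str_column_to_int_sorted(rows, 1)
print(lookup)  # {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
print(rows)    # [['5.1', 2], ['4.9', 0], ['6.3', 1], ['5.0', 0]]
```

Sorting costs little for a handful of class labels and makes the returned lookup reproducible, which helps when the integers need to be mapped back to class names later.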