After reading this tutorial you will know:
- How to normalize your data from scratch.
- How to standardize your data from scratch.
- When to normalize and when to standardize your data.
This tutorial is divided into three parts:
1. Normalize Data
2. Standardize Data
3. When to Normalize and Standardize
1.1 Normalize Data
Normalization can refer to different techniques depending on context. Here, we use normalization to refer to rescaling an input variable to the range between 0 and 1. Normalization requires that you know the minimum and maximum values for each attribute. The snippet of code below defines the dataset_minmax() function that calculates the min and max value for each attribute in a dataset, then returns an array of these minimum and maximum values.
# Example of calculating the min and max values of a contrived dataset

# Find the min and max values for each column
def dataset_minmax(dataset):
    minmax = list()
    for i in range(len(dataset[0])):
        col_values = [row[i] for row in dataset]
        value_min = min(col_values)
        value_max = max(col_values)
        minmax.append([value_min, value_max])
    return minmax

# Contrive small dataset
dataset = [[50, 30], [20, 90]]
print(dataset)
# Calculate min and max for each column
minmax = dataset_minmax(dataset)
print(minmax)
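Running this example prints the contrived dataset followed by the min and max values for each column:

[[50, 30], [20, 90]]
[[20, 50], [30, 90]]

With estimates of the minimum and maximum values for each column in hand, we can now rescale the raw data to the range 0 to 1. The calculation to normalize a single value for a column is:

scaled_value = (value - min) / (max - min)

The normalize_dataset() function below applies this calculation to every value in every column of a dataset.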
# Function to normalize a dataset

# Rescale dataset columns to the range 0-1
def normalize_dataset(dataset, minmax):
    for row in dataset:
        for i in range(len(row)):
            row[i] = (row[i] - minmax[i][0]) / (minmax[i][1] - minmax[i][0])
We can tie this function together with the dataset_minmax() function and normalize the contrived dataset.
# Find the min and max values for each column
def dataset_minmax(dataset):
    minmax = list()
    for i in range(len(dataset[0])):
        col_values = [row[i] for row in dataset]
        value_min = min(col_values)
        value_max = max(col_values)
        minmax.append([value_min, value_max])
    return minmax

# Rescale dataset columns to the range 0-1
def normalize_dataset(dataset, minmax):
    for row in dataset:
        for i in range(len(row)):
            row[i] = (row[i] - minmax[i][0]) / (minmax[i][1] - minmax[i][0])

# Contrive small dataset
dataset = [[50, 30], [20, 90]]
print(dataset)
# Calculate min and max for each column
minmax = dataset_minmax(dataset)
print(minmax)
# Normalize columns
normalize_dataset(dataset, minmax)
print(dataset)
Running this example prints the output below, including the normalized dataset.
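[[50, 30], [20, 90]]
[[20, 50], [30, 90]]
[[1.0, 0.0], [0.0, 1.0]]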
We can combine this code with the code for loading a CSV dataset to load and normalize the Pima Indians Diabetes dataset. The example first loads the dataset and converts the values for each column from strings to floating point values. The minimum and maximum values for each column are estimated from the dataset, and finally, the values in the dataset are normalized.
# Example of normalizing the diabetes dataset
from csv import reader

# Load a CSV file
def load_csv(filename):
    dataset = list()
    with open(filename, 'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset

# Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

# Find the min and max values for each column
def dataset_minmax(dataset):
    minmax = list()
    for i in range(len(dataset[0])):
        col_values = [row[i] for row in dataset]
        value_min = min(col_values)
        value_max = max(col_values)
        minmax.append([value_min, value_max])
    return minmax

# Rescale dataset columns to the range 0-1
def normalize_dataset(dataset, minmax):
    for row in dataset:
        for i in range(len(row)):
            row[i] = (row[i] - minmax[i][0]) / (minmax[i][1] - minmax[i][0])

# Load pima-indians-diabetes dataset
filename = 'pima-indians-diabetes.data.csv'
dataset = load_csv(filename)
print('Loaded data file {0} with {1} rows and {2} columns'.format(filename, len(dataset), len(dataset[0])))
# Convert string columns to float
for i in range(len(dataset[0])):
    str_column_to_float(dataset, i)
print(dataset[0])
# Calculate min and max for each column
minmax = dataset_minmax(dataset)
# Normalize columns
normalize_dataset(dataset, minmax)
print(dataset[0])
Running the example prints the first record from the dataset before and after normalization, showing the effect of the scaling.
1.2 Standardize Data
Standardization is a rescaling technique that centers the distribution of the data on the value 0 and scales the standard deviation to the value 1. Together, the mean and the standard deviation can be used to summarize a normal distribution, also called the Gaussian distribution or bell curve. Let's start by creating functions to estimate the mean and standard deviation statistics for each column of a dataset. The mean describes the middle or central tendency of a collection of numbers. The mean for a column is calculated as the sum of all values for the column divided by the total number of values.
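The mean can be written as:

mean = sum(values) / count(values)

The column_means() function below calculates the mean for each column in a dataset.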
# Function to calculate means for each column in a dataset

# Calculate column means
def column_means(dataset):
    means = [0 for i in range(len(dataset[0]))]
    for i in range(len(dataset[0])):
        col_values = [row[i] for row in dataset]
        means[i] = sum(col_values) / float(len(dataset))
    return means
The function below named column_stdevs() calculates the standard deviation of values for each column in the dataset and assumes the means have already been calculated.
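The standard deviation describes the average spread of values from the mean. It can be calculated as:

standard_deviation = sqrt( sum((value_i - mean)^2) / (count(values) - 1) )

Note that this uses the sample standard deviation, dividing by count - 1 rather than count, which matches the code below.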
# Function to calculate standard deviations for each column in a dataset
from math import sqrt

# Calculate column standard deviations
def column_stdevs(dataset, means):
    stdevs = [0 for i in range(len(dataset[0]))]
    for i in range(len(dataset[0])):
        variance = [pow(row[i] - means[i], 2) for row in dataset]
        stdevs[i] = sum(variance)
    stdevs = [sqrt(x / (float(len(dataset) - 1))) for x in stdevs]
    return stdevs
Using the contrived dataset, we can estimate the summary statistics.
# Example of calculating stats on a contrived dataset
from math import sqrt

# Calculate column means
def column_means(dataset):
    means = [0 for i in range(len(dataset[0]))]
    for i in range(len(dataset[0])):
        col_values = [row[i] for row in dataset]
        means[i] = sum(col_values) / float(len(dataset))
    return means

# Calculate column standard deviations
def column_stdevs(dataset, means):
    stdevs = [0 for i in range(len(dataset[0]))]
    for i in range(len(dataset[0])):
        variance = [pow(row[i] - means[i], 2) for row in dataset]
        stdevs[i] = sum(variance)
    stdevs = [sqrt(x / (float(len(dataset) - 1))) for x in stdevs]
    return stdevs

# Contrive small dataset
dataset = [[50, 30], [20, 90], [30, 50]]
print(dataset)
# Estimate mean and standard deviation
means = column_means(dataset)
stdevs = column_stdevs(dataset, means)
print(means)
print(stdevs)
Executing the example prints the summary statistics for each column, which you can verify by hand.
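The column means are approximately 33.33 and 56.67, and the sample standard deviations (computed with n - 1 in the denominator) are approximately 15.28 and 30.55.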
Once the summary statistics are calculated, we can easily standardize the values in each column. The calculation to standardize a given value is as follows:
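standardized_value = (value - mean) / standard_deviation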
Below is a function named standardize_dataset() that implements this equation.
# Standardize dataset columns
def standardize_dataset(dataset, means, stdevs):
    for row in dataset:
        for i in range(len(row)):
            row[i] = (row[i] - means[i]) / stdevs[i]
Combining this with the functions to estimate the mean and standard deviation summary statistics, we can standardize our contrived dataset.
# Example of standardizing a contrived dataset
from math import sqrt

# Calculate column means
def column_means(dataset):
    means = [0 for i in range(len(dataset[0]))]
    for i in range(len(dataset[0])):
        col_values = [row[i] for row in dataset]
        means[i] = sum(col_values) / float(len(dataset))
    return means

# Calculate column standard deviations
def column_stdevs(dataset, means):
    stdevs = [0 for i in range(len(dataset[0]))]
    for i in range(len(dataset[0])):
        variance = [pow(row[i] - means[i], 2) for row in dataset]
        stdevs[i] = sum(variance)
    stdevs = [sqrt(x / (float(len(dataset) - 1))) for x in stdevs]
    return stdevs

# Standardize dataset columns
def standardize_dataset(dataset, means, stdevs):
    for row in dataset:
        for i in range(len(row)):
            row[i] = (row[i] - means[i]) / stdevs[i]

# Contrive small dataset
dataset = [[50, 30], [20, 90], [30, 50]]
print(dataset)
# Estimate mean and standard deviation
means = column_means(dataset)
stdevs = column_stdevs(dataset, means)
print(means)
print(stdevs)
# Standardize columns
standardize_dataset(dataset, means, stdevs)
print(dataset)
Executing this example prints the summary statistics and then the standardized values for the contrived dataset.
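The first column is rescaled to approximately [1.09, -0.87, -0.22] and the second to approximately [-0.87, 1.09, -0.22], so each column now has a mean of 0 and a standard deviation of 1.

As before, we can combine this code with the code for loading a CSV dataset and standardize the Pima Indians Diabetes dataset.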
# Example of standardizing the diabetes dataset
from csv import reader
from math import sqrt

# Load a CSV file
def load_csv(filename):
    dataset = list()
    with open(filename, 'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset

# Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

# Calculate column means
def column_means(dataset):
    means = [0 for i in range(len(dataset[0]))]
    for i in range(len(dataset[0])):
        col_values = [row[i] for row in dataset]
        means[i] = sum(col_values) / float(len(dataset))
    return means

# Calculate column standard deviations
def column_stdevs(dataset, means):
    stdevs = [0 for i in range(len(dataset[0]))]
    for i in range(len(dataset[0])):
        variance = [pow(row[i] - means[i], 2) for row in dataset]
        stdevs[i] = sum(variance)
    stdevs = [sqrt(x / (float(len(dataset) - 1))) for x in stdevs]
    return stdevs

# Standardize dataset columns
def standardize_dataset(dataset, means, stdevs):
    for row in dataset:
        for i in range(len(row)):
            row[i] = (row[i] - means[i]) / stdevs[i]

# Load pima-indians-diabetes dataset
filename = 'pima-indians-diabetes.data.csv'
dataset = load_csv(filename)
print('Loaded data file {0} with {1} rows and {2} columns'.format(filename, len(dataset), len(dataset[0])))
# Convert string columns to float
for i in range(len(dataset[0])):
    str_column_to_float(dataset, i)
print(dataset[0])
# Estimate mean and standard deviation
means = column_means(dataset)
stdevs = column_stdevs(dataset, means)
# Standardize dataset
standardize_dataset(dataset, means, stdevs)
print(dataset[0])
Running the example prints the first row of the dataset, first in the raw format as loaded and then standardized, which allows us to see the difference for comparison.
1.3 When to Normalize and Standardize
Standardization is a scaling technique that assumes your data conforms to a normal distribution. Normalization is a scaling technique that does not assume any specific distribution. If an input variable has a Gaussian or Gaussian-like distribution, standardization is a reasonable default; otherwise, normalizing to the range 0-1 is a sensible starting point. When in doubt, you can prepare the data both ways and compare the performance of your model on each version.
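As a minimal sketch of that last suggestion, assuming the functions defined earlier in this tutorial are available and that dataset has already been loaded and converted to floating point values, you could prepare both scaled versions like this:

from copy import deepcopy

# Prepare a normalized copy of the data (rescaled to the range 0-1)
normalized = deepcopy(dataset)
minmax = dataset_minmax(normalized)
normalize_dataset(normalized, minmax)

# Prepare a standardized copy of the data (mean 0, standard deviation 1)
standardized = deepcopy(dataset)
means = column_means(standardized)
stdevs = column_stdevs(standardized, means)
standardize_dataset(standardized, means, stdevs)

# Evaluate your model on each copy and keep whichever scaling performs better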