Simple Linear Regression

Linear regression is a prediction method that is more than 200 years old. Simple linear regression is a great first machine learning algorithm to implement, as it requires you to estimate properties from your training dataset but is simple enough for beginners to understand.

After completing this tutorial you will know:

  • How to estimate statistical quantities from training data.
  • How to estimate linear regression coefficients from data.
  • How to make predictions using linear regression for new data.

1.1 Description

This section is divided into two parts: a description of the simple linear regression technique and a description of the dataset to which we will later apply it.

1.1.1 Simple Linear Regression

Linear regression assumes a linear, or straight-line, relationship between the input variable (X) and the single output variable (y). More specifically, it assumes that the output (y) can be calculated from a linear combination of the input variables (X). When there is a single input variable, the method is referred to as simple linear regression.

In simple linear regression we can use statistics on the training data to estimate the coefficients required by the model to make predictions on new data. The line for a simple linear regression model can be written as:

                                                        y = b_{0} + b_{1} \times x

Where b0 and b1 are the coefficients we must estimate from the training data. Once the coefficients are known, we can use this equation to estimate output values for y given new input examples of x. Estimating the coefficients requires that you calculate statistical properties from the data such as the mean, variance and covariance.
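
As a minimal sketch of how the equation is used once the coefficients are known, the snippet below evaluates the line for a few inputs. The helper name predict() and the coefficient values 1.0 and 2.0 are assumptions chosen purely for illustration; they are not estimated from any dataset.

```python
# Hypothetical helper: evaluate y = b0 + b1 * x for one input value
def predict(b0, b1, x):
    return b0 + b1 * x

# Illustrative coefficients (assumed, not estimated from data)
b0, b1 = 1.0, 2.0
for x in [0, 1, 3]:
    print('x=%d -> y=%.1f' % (x, predict(b0, b1, x)))
```

Here b0 is the value of y where the line crosses x = 0, and b1 is how much y changes for each unit increase in x.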

1.2 Tutorial 

This tutorial is broken down into five parts:

  1. Calculate Mean and Variance
  2. Calculate Covariance
  3. Estimate Coefficients
  4. Make Predictions
  5. Swedish Auto Insurance Case Study

1.2.1 Calculate Mean and Variance

The first step is to estimate the mean and the variance of both the input and output variables from the training data. The mean of a list of numbers can be calculated as:

                                mean(x) = \frac{\sum_{i=1}^{n} x_{i}}{count(x)}

Below is a function named mean() that implements this behavior for a list of numbers.

# Calculate the mean value of a list of numbers
def mean(values):
    return sum(values) / float(len(values))

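The variance is calculated here as the sum of the squared differences of each value from the mean. Note that this is the unnormalized sum: it is not divided by the number of values, which is fine for our purposes because the same factor cancels when the coefficients are estimated. Variance for a list of numbers can be calculated as: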

                                        variance = \sum_{i=1}^{n}(x_{i} - mean(x))^{2}

Below is a function named variance() that calculates the variance of a list of numbers. It requires the mean of the list to be provided as an argument, just so we don't have to calculate it more than once.

# Calculate the variance of a list of numbers
def variance(values, mean):
    return sum((x-mean)**2 for x in values)

We can put these two functions together and test them on a small, contrived dataset. Below is a small dataset of x and y values.

x, y
1, 1
2, 3
4, 3
3, 2
5, 5

# Example of Estimating Mean and Variance

# Calculate the mean value of a list of numbers
def mean(values):
    return sum(values)/float(len(values))

# Calculate the variance of a list of numbers
def variance(values,mean):
    return sum([(x-mean)**2 for x in values])

# calculate mean and variance
dataset = [[1, 1],[2, 3], [4, 3], [3, 2], [5, 5]]
x = [row[0] for row in dataset]
y = [row[1] for row in dataset]
mean_x, mean_y = mean(x),mean(y)
var_x, var_y = variance(x,mean_x),variance(y,mean_y)
print('x stats: mean=%.3f variance=%.3f' %(mean_x,var_x))
print('y stats: mean=%.3f variance=%.3f' %(mean_y,var_y))
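
As a rough cross-check, the same quantities can be compared against Python's standard statistics module. This is only a sketch: the helper name sum_squared_diff() is an assumption introduced here, and because the tutorial's variance() is a sum of squares rather than an average, the library's pvariance() is multiplied by the number of values before comparing.

```python
from statistics import mean as stats_mean, pvariance

x = [1, 2, 4, 3, 5]
y = [1, 3, 3, 2, 5]

# Hypothetical helper: sum of squared differences from the mean,
# i.e. the same quantity the tutorial's variance() returns
def sum_squared_diff(values):
    m = sum(values) / float(len(values))
    return sum((v - m) ** 2 for v in values)

for name, values in (('x', x), ('y', y)):
    n = len(values)
    # pvariance() divides by n, so multiply back to compare
    print('%s: mean=%.3f sum_sq=%.3f pvariance*n=%.3f' % (
        name, stats_mean(values), sum_squared_diff(values),
        pvariance(values) * n))
```

For this dataset the mean of x is 3 with a sum of squares of 10, and the mean of y is 2.8 with a sum of squares of 8.8.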

1.2.2 Calculate Covariance

The covariance of two groups of numbers describes how those numbers change together. Covariance is a generalization of correlation. Correlation describes the relationship between two groups of numbers, whereas covariance can describe the relationship between two or more groups of numbers. Additionally, covariance can be normalized to produce a correlation value. Nevertheless, we can calculate the covariance between two variables as follows:

                           covariance = \sum_{i=1}^{n}((x_{i}-mean(x))\times (y_{i}-mean(y)))

Below is a function named covariance() that implements this statistic. It builds upon the previous step and takes the lists of x and y values as well as the mean of these values as arguments.

# Calculate covariance between x and y
def covariance(x, mean_x, y, mean_y):
    covar = 0
    for i in range(len(x)):
        covar += (x[i] - mean_x) * (y[i]-mean_y)
    return covar

We can test the calculation of the covariance on the same small contrived dataset as in the previous section. Putting it all together we get the example below:

# Example of Calculating Covariance

# Calculate the mean value of a list of numbers
def mean(values):
    return sum(values) / float(len(values))

# Calculate covariance between x and y
def covariance(x, mean_x, y, mean_y):
    covar = 0.0
    for i in range(len(x)):
        covar += (x[i] - mean_x) * (y[i] - mean_y)
    return covar

# calculate covariance
dataset = [[1, 1], [2, 3], [4, 3], [3, 2], [5, 5]]
x = [row[0] for row in dataset]
y = [row[1] for row in dataset]
mean_x, mean_y = mean(x), mean(y)
covar = covariance(x, mean_x, y, mean_y)
print('Covariance: %.3f' % (covar))
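
The remark that covariance can be normalized into a correlation value can be sketched as follows. Because both the covariance and variance used in this tutorial are unnormalized sums, dividing the covariance by the square root of the product of the two variances gives the Pearson correlation (the count factors cancel). The helper name correlation() is an assumption introduced for illustration.

```python
from math import sqrt

def mean(values):
    return sum(values) / float(len(values))

def variance(values, mean):
    return sum((v - mean) ** 2 for v in values)

def covariance(x, mean_x, y, mean_y):
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Hypothetical helper: Pearson correlation from the unnormalized sums
def correlation(x, y):
    mean_x, mean_y = mean(x), mean(y)
    covar = covariance(x, mean_x, y, mean_y)
    return covar / sqrt(variance(x, mean_x) * variance(y, mean_y))

x = [1, 2, 4, 3, 5]
y = [1, 3, 3, 2, 5]
print('Covariance: %.3f' % covariance(x, mean(x), y, mean(y)))
print('Correlation: %.3f' % correlation(x, y))
```

A correlation close to 1 suggests a strong positive linear relationship, which is the situation in which a straight-line model is a reasonable fit.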

1.2.3 Estimate Coefficients

Simple linear regression has two coefficients to estimate. Using the quantities from the previous steps, the slope b1 and the intercept b0 can be calculated as:

                                        b_{1} = \frac{covariance(x, y)}{variance(x)}

                                        b_{0} = mean(y) - b_{1} \times mean(x)

Below is a function named coefficients() that implements these two calculations for a dataset of x and y pairs.

# Calculate coefficients
def coefficients(dataset):
    x = [row[0] for row in dataset]
    y = [row[1] for row in dataset]
    x_mean,y_mean = mean(x),mean(y)
    b1 = covariance(x, x_mean, y, y_mean)/ variance(x, x_mean)
    b0 = y_mean - b1 * x_mean
    return [b0, b1]

We can put this together with all of the functions from the previous two steps and test out the calculation of coefficients.

# Example of Calculating Coefficients

# Calculate the mean value of a list of numbers.
def mean(values):
    return sum(values) / float(len(values))

# Calculate covariance between x and y
def covariance(x, mean_x, y, mean_y):
    covar = 0.0
    for i in range(len(x)):
        covar += (x[i] - mean_x) * (y[i] - mean_y)
    return covar

# Calculate the variance of a list of numbers
def variance(values, mean):
    return sum([(x-mean)**2 for x in values])

# Calculate coefficients
def coefficients(dataset):
    x = [row[0] for row in dataset]
    y = [row[1] for row in dataset]
    x_mean,y_mean = mean(x), mean(y)
    b1 = covariance(x, x_mean,y,y_mean)/ variance(x, x_mean)
    b0 = y_mean - b1 * x_mean
    return [b0, b1]

# calculate coefficients
dataset = [[1, 1],[2, 3], [4, 3], [3, 2], [5, 5]]
b0,b1 = coefficients(dataset)
print('Coefficients: B0=%.3f, B1=%.3f' %(b0, b1))

Running this example calculates and prints the coefficients.
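
As a worked check on the contrived dataset, the arithmetic behind the two coefficients can be written out by hand. The values below are hand calculations from the five (x, y) pairs, so they are worth confirming by running the example.

```latex
\begin{aligned}
mean(x) &= 3, \quad mean(y) = 2.8 \\
covariance &= (-2)(-1.8) + (-1)(0.2) + (1)(0.2) + (0)(-0.8) + (2)(2.2) = 8 \\
variance(x) &= 4 + 1 + 1 + 0 + 4 = 10 \\
b_{1} &= \frac{8}{10} = 0.8 \\
b_{0} &= 2.8 - 0.8 \times 3 = 0.4
\end{aligned}
```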

Now that we know how to estimate the coefficients, the next step is to use them.

1.2.4 Make Predictions

The simple linear regression model is a line defined by coefficients estimated from training data. Once the coefficients are estimated, we can use them to make predictions. The equation to make predictions with a simple linear regression model is as follows:

                                        y = b_{0} + b_{1} \times x

Below is a function named simple_linear_regression() that implements the prediction equation to make predictions on a test dataset. It also ties together the estimation of the coefficients on training data from the steps above. The coefficients prepared from the training data are used to make predictions on the test data, which are then returned.

# Function To Run Simple Linear Regression
def simple_linear_regression(train, test):
    predictions = list()
    b0, b1 = coefficients(train)
    for row in test:
        yhat = b0 + b1 * row[0]
        predictions.append(yhat)
    return predictions
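
A minimal usage sketch, assuming we want predictions for inputs that were not in the training data. The coefficients() function is restated compactly so the snippet runs on its own, and the new_rows values are hypothetical inputs invented for illustration.

```python
# Compact restatement of the coefficient estimate from the previous step
def coefficients(dataset):
    x = [row[0] for row in dataset]
    y = [row[1] for row in dataset]
    x_mean, y_mean = sum(x) / float(len(x)), sum(y) / float(len(y))
    covar = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    var_x = sum((xi - x_mean) ** 2 for xi in x)
    b1 = covar / var_x
    return [y_mean - b1 * x_mean, b1]

def simple_linear_regression(train, test):
    b0, b1 = coefficients(train)
    return [b0 + b1 * row[0] for row in test]

train = [[1, 1], [2, 3], [4, 3], [3, 2], [5, 5]]
# Hypothetical new inputs; the output column is unknown, so it is None
new_rows = [[6, None], [0, None]]
for row, yhat in zip(new_rows, simple_linear_regression(train, new_rows)):
    print('x=%d -> predicted y=%.3f' % (row[0], yhat))
```

Only the first column of each test row is read, so the output column of a test row can safely hold a placeholder such as None.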

Let’s pull together everything we have learned and make predictions for our simple contrived dataset. As part of this example, we will also add in a function to manage the evaluation of the predictions called evaluate_algorithm() and another function to estimate the Root Mean Squared Error of the predictions called rmse_metric(). The full example is listed below.

# Example of Standalone Simple Linear Regression
from math import sqrt

# Calculate root mean squared error
def rmse_metric(actual, predicted):
    sum_error = 0.0
    for i in range(len(actual)):
        prediction_error = predicted[i] - actual[i]
        sum_error += (prediction_error ** 2)
    mean_error = sum_error / float(len(actual))
    return sqrt(mean_error)

# Evaluate regression algorithm on training dataset
def evaluate_algorithm(dataset, algorithm):
    test_set = list()
    for row in dataset:
        row_copy = list(row)
        row_copy[-1] = None
        test_set.append(row_copy)
    predicted = algorithm(dataset, test_set)
    print(predicted)
    actual = [row[-1] for row in dataset]
    rmse = rmse_metric(actual, predicted)
    return rmse

# Calculate the mean value of a list of numbers
def mean(values):
    return sum(values) / float(len(values))

# Calculate covariance between x and y
def covariance(x, mean_x, y, mean_y):
    covar = 0.0
    for i in range(len(x)):
        covar += (x[i] - mean_x) * (y[i] - mean_y)
    return covar

# Calculate the variance of a list of numbers
def variance(values, mean):
    return sum([(x-mean)**2 for x in values])

# Calculate coefficients
def coefficients(dataset):
    x = [row[0] for row in dataset]
    y = [row[1] for row in dataset]
    x_mean, y_mean = mean(x), mean(y)
    b1 = covariance(x, x_mean, y, y_mean) / variance(x, x_mean)
    b0 = y_mean - b1 * x_mean
    return [b0, b1]

# Simple linear regression algorithm
def simple_linear_regression(train, test):
    predictions = list()
    b0, b1 = coefficients(train)
    for row in test:
        yhat = b0 + b1 * row[0]
        predictions.append(yhat)
    return predictions

# Test simple linear regression
dataset = [[1, 1], [2, 3], [4, 3], [3, 2], [5, 5]]
rmse = evaluate_algorithm(dataset, simple_linear_regression)
print('RMSE: %.3f' % (rmse))

Running this example first prints the list of predictions and then the RMSE of these predictions.
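
For reference, the quantity computed by rmse_metric() is the standard root mean squared error. It is written here in the same notation as the earlier formulas, where predicted_i and actual_i are simply the i-th elements of the two lists passed to the function.

```latex
RMSE = \sqrt{\frac{\sum_{i=1}^{n}(predicted_{i} - actual_{i})^{2}}{n}}
```

On the contrived dataset, with the coefficients estimated from it (b0 = 0.4 and b1 = 0.8 by hand arithmetic), the squared errors sum to 2.4, so the RMSE works out to roughly the square root of 2.4 / 5, or about 0.693. Because the model is evaluated on the same rows it was trained on, this figure is likely to be optimistic.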

Finally, we can plot the predictions as a line and compare it to the original dataset.

1.2.5 Swedish Auto Insurance Case Study

We now know how to implement a simple linear regression model. Let's apply it to the Swedish insurance dataset. This section assumes that you have downloaded the dataset to the file insurance.csv and it is available in the current working directory. We will add some convenience functions to the simple linear regression from the previous steps.

Specifically: a function to load the CSV file called load_csv(), a function to convert a loaded dataset to numbers called str_column_to_float(), a function to split a dataset into train and test sets called train_test_split(), a function to calculate RMSE called rmse_metric(), and a function to evaluate an algorithm called evaluate_algorithm().

The complete example is listed below. A training dataset of 60% of the data is used to prepare the model and predictions are made on the remaining 40%.

# Example of Simple Linear Regression on the Swedish Insurance Dataset
from random import seed
from random import randrange
from csv import reader
from math import sqrt

# Load a CSV file
def load_csv(filename):
    dataset = list()
    with open(filename, 'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset

# Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())
        
# Split a dataset into a train and test set
def train_test_split(dataset, split):
    train = list()
    train_size = split * len(dataset)
    dataset_copy = list(dataset)
    while len(train) < train_size:
        index = randrange(len(dataset_copy))
        train.append(dataset_copy.pop(index))
    return train, dataset_copy

# Calculate root mean squared error
def rmse_metric(actual, predicted):
    sum_error = 0.0
    for i in range(len(actual)):
        prediction_error = predicted[i] - actual[i]
        sum_error  += (prediction_error ** 2)
    mean_error = sum_error / float(len(actual))
    return sqrt(mean_error)

# Evaluate an algorithm using a train/test split 
def evaluate_algorithm(dataset, algorithm, split, *args):
    train, test = train_test_split(dataset, split)
    test_set = list()
    for row in test:
        row_copy = list(row)
        row_copy[-1] = None
        test_set.append(row_copy)
    predicted = algorithm(train, test_set, *args)
    actual = [row[-1] for row in test]
    rmse = rmse_metric(actual, predicted)
    return rmse

# Calculate the mean value of a list of numbers
def mean(values):
    return sum(values) / float(len(values))

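# Calculate the variance of a list of numbers
def variance(values, mean):
    return sum([(x-mean)**2 for x in values])

# Calculate covariance between x and y
def covariance(x, mean_x, y, mean_y):
    covar = 0.0
    for i in range(len(x)):
        covar += (x[i] - mean_x) * (y[i] - mean_y)
    return covar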

# Calculate coefficients
def coefficients(dataset):
    x = [row[0] for row in dataset]
    y = [row[1] for row in dataset]
    x_mean, y_mean = mean(x), mean(y)
    b1 = covariance(x, x_mean, y, y_mean) / variance(x, x_mean)
    b0 = y_mean - b1 * x_mean
    return [b0, b1]

# Simple linear regression algorithm
def simple_linear_regression(train, test):
    predictions = list()
    b0,b1 = coefficients(train)
    for row in test:
        yhat = b0 + b1 * row[0]
        predictions.append(yhat)
    return predictions

# Simple linear regression on insurance dataset
seed(1)

# load and prepare data
filename = 'insurance.csv'
dataset = load_csv(filename)
for i in range(len(dataset[0])):
    str_column_to_float(dataset,i)
    
# evaluate algorithm
split = 0.6
rmse = evaluate_algorithm(dataset, simple_linear_regression, split)
print('RMSE: %.3f' % (rmse))
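
To see what the 60/40 split does in isolation, the sketch below runs train_test_split() on a small made-up dataset. The toy rows are a hypothetical stand-in for the loaded CSV data and are not taken from insurance.csv.

```python
from random import seed, randrange

# Same splitting logic as in the case study above
def train_test_split(dataset, split):
    train = list()
    train_size = split * len(dataset)
    dataset_copy = list(dataset)
    while len(train) < train_size:
        index = randrange(len(dataset_copy))
        train.append(dataset_copy.pop(index))
    return train, dataset_copy

seed(1)
# Hypothetical stand-in for the loaded CSV rows: 10 rows of [x, y]
toy = [[float(i), 2.0 * i] for i in range(10)]
train, test = train_test_split(toy, 0.6)
print('train rows: %d, test rows: %d' % (len(train), len(test)))
```

Rows are drawn at random without replacement, so the training and test sets never share a row, and fixing the seed makes the split repeatable between runs.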
