1. Introduction to Regression
1.1 Predicting a Variable
Let’s imagine a scenario in which we would like to predict one variable using another variable, or a set of other variables. An example is predicting the number of views a TikTok video will get next week based on video length, the date it was posted, the previous number of views, etc. Or predicting which movies a Netflix user will rate highly based on their previous movie ratings, demographic data, etc.
Working Example
Throughout the course, we will try to use working examples - sets of real-world data that we can use as a class. The advertising data set (below) consists of the sales of a particular product in 200 different markets, along with advertising budgets for the product in each of those markets across three different media: TV, radio, and newspaper. We have 4 columns and 200 rows, where budgets are given in units of $1,000 and sales in thousands of units.
| TV | radio | newspaper | sales |
|---|---|---|---|
| 230.1 | 37.8 | 69.2 | 22.1 |
| 44.5 | 39.3 | 45.1 | 10.4 |
| 17.2 | 45.9 | 69.3 | 9.3 |
| 151.5 | 41.3 | 58.5 | 18.5 |
| 180.8 | 10.8 | 58.4 | 12.9 |
Response vs. Predictor Variables
There is an asymmetry in many of these problems: the variable we would like to predict may be more difficult to measure, may be more important than the other(s), or may be directly or indirectly influenced by the other variable(s). Thus, we like to define two categories of variables:
- Variables whose values we use to make our prediction. These are known as predictors, features, or covariates.
- Variables whose values we want to predict. These are known as outcomes, response variables, or dependent variables.
In this course, the terms predictors and response variables will be used. The first three columns in the advertising data (TV, radio, newspaper) are the predictors, and their number is denoted by $p$. Each predictor on its own is a vector; when put together, they are denoted by $X$, also known as the data matrix or design matrix. These variables are used to predict sales, the response variable, which is denoted by $y$.
Every row in the data denotes one observation, and in general we have $n$ observations and $p$ predictors in a data set.
In general, the nomenclature that we will use throughout the course is as follows:
- $n$: the number of observations
- $p$: the number of predictors
- $X_j$: each individual predictor, for $j = 1, \dots, p$
- $y$: the response variable
1.2 Connecting with Pandas
If you're practicing on your home computer, make sure you have pandas installed on your machine. It's already available in our programming environment within the course, so if you are practicing here you won't need to install anything.
Let's connect with pandas by taking the data matrix $X$ and checking the shape attribute, which gives the dimensions of the data matrix.
X.shape
results in $(n, p)$, where $n$ is the number of observations and $p$ is the number of predictors.
For the response variable,
y.shape
results in $(n,)$, or $(200,)$.
For simplicity, for the rest of this section we will consider one predictor (TV budget), where the shape of the data matrix would be $(n, 1)$, or $(200, 1)$, and the shape of the response variable sales is the same as before, $(n,)$, or $(200,)$. In later sections we will consider more than one predictor.
Connecting to pandas, $X$ is defined as a pandas DataFrame with one column, and $y$ is defined as a pandas Series. When we later use sklearn to do modeling, sklearn expects the predictors to be a two-dimensional array with at least one column, not a Series.
When we read our data into a data frame df with a column named 'x', there is an important difference between the two operations df[['x']] and df['x'].
df[['x']] returns a pandas DataFrame object, a two-dimensional array we can use in sklearn.
df['x'] returns a pandas Series, which will give an error when used as the predictors in sklearn.
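A minimal sketch of the difference (the column name 'x' and the toy values are just illustrative):

```python
import pandas as pd

# Hypothetical toy data frame with a single predictor column 'x'
df = pd.DataFrame({'x': [230.1, 44.5, 17.2]})

X = df[['x']]  # DataFrame: 2-D, shape (3, 1) -- what sklearn expects
s = df['x']    # Series:    1-D, shape (3,)  -- causes an error in sklearn

print(X.shape)  # (3, 1)
print(s.shape)  # (3,)
```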
1.3 True vs. Statistical Model
Now that we have decided on names for our data, let us move on to how we are going to model it. We will assume that the response variable, $y$, relates to the predictors, $X$, through some unknown function, generally expressed as:

$$y = f(X) + \epsilon$$
Here, $f$ is the unknown function expressing an underlying rule relating $y$ to $X$, and $\epsilon$ is a random amount (not related to $X$) that describes the difference of $y$ from the rule $f(X)$.
A statistical model is any algorithm that estimates $f$. We denote the estimated function as $\hat{f}$.
1.4 Prediction vs. Estimation
For some problems, what is important is obtaining $\hat{f}$, our estimate of $f$. These are called inference problems.
When we use a set of measurements $(x_1, \dots, x_p)$ to predict a value for the response variable, we denote the predicted value by:

$$\hat{y} = \hat{f}(x_1, \dots, x_p)$$
In other cases, we do not care about the specific form of $\hat{f}$; we just want to make our predictions as close to the observed values $y$ as possible. These are called prediction problems.
1.5 Example: Predicting Sales
Going back to our example data, what we show here is a plot of the sales (in thousands of units) versus the TV budget (in thousands of dollars). The motivation is to predict the response variable, sales $y$, given the predictor $x$, the TV budget. So we want to build a model to predict sales based on TV budget.
Here is a plot of sales (in thousands of units) for different TV budgets (in thousands of dollars):
Statistical Model
How do we predict $\hat{y}$ for some $x$? The goal here is to predict the sales given some TV budget; for example, we want to find the value of the sales for a TV budget of about $75,000, or the sales for a TV budget of about $160,000.
To simplify, we will start by looking at eight points on the plot.
We can first come up with a very simple model, called a naïve model, by taking the mean of all the sales values over all our observations:

$$\hat{y} = \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$
For all TV budgets, the naïve model would predict the average sales and can be used as a baseline later.
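As a sketch, the naïve baseline for the five rows shown in the table above would be:

```python
import numpy as np

# Sales values from the five advertising rows shown earlier
y = np.array([22.1, 10.4, 9.3, 18.5, 12.9])

# The naive model predicts the mean of y for every TV budget
y_naive = np.full(len(y), y.mean())
print(y_naive)  # every prediction equals the mean of y
```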
Simple Prediction Model
We can do better than the naïve model. Let's think of a different type of simple prediction model. We motivate ourselves with an example from everyday life.
Suppose you go to a doctor with some symptoms, for example your tummy is hurting. The doctor will think about other patients they have seen with similar symptoms and give you the same treatment. Thus, one type of simple prediction model finds the most similar predictor values in the data and predicts the corresponding $y$.
How do we find $\hat{y}$ at some $x$?
- Find the distances from $x$ to all other points $x_i$
- Find the nearest neighbor, the $x_i$ with the smallest distance
- Predict $\hat{y} = y_i$. In other words, what we predict for $y$ is the same as the $y$ of the nearest neighbor.
Let's apply this type of model to our data. If we want to know the sales given, say, a $175,000 budget for TV advertising, we look at similar examples: those that are the nearest neighbors to our TV budget. We find the nearest neighbor by finding the smallest distance. In this way we find the most similar example we have in our data. Once we have that, our prediction will be identical to the nearest neighbor's sales value $y$.
We can then do the same for all points $x$ to get a simple prediction model using the nearest neighbor.
Extend the Prediction Model
We can extend the model from one neighbor to any number $k$ of nearest neighbors. So in the example, we can take the 2 nearest neighbors ($k=2$), which are circled in red, average their $y$ values, and that is our sales prediction.
Generalizing, we measure the distance from $x$ to all other points, find the $k$ nearest neighbors, and take the average of their $y$ values.
What is $\hat{y}$ at some $x$?
- Find the distances from $x$ to all other points
- Find the $k$ nearest neighbors, $x_{i_1}, \dots, x_{i_k}$
- Predict $\hat{y} = \frac{1}{k}\sum_{j=1}^{k} y_{i_j}$
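The steps above can be sketched for a single predictor (the function name and toy data are illustrative, not from a library):

```python
import numpy as np

def knn_predict(x_train, y_train, x0, k):
    # Distances from the query point x0 to every training point
    dist = np.abs(x_train - x0)
    # Indices of the k nearest neighbors
    nearest = np.argsort(dist)[:k]
    # Predict the average of their y values
    return y_train[nearest].mean()

x_train = np.array([10.0, 20.0, 30.0, 40.0])
y_train = np.array([1.0, 2.0, 3.0, 4.0])
print(knn_predict(x_train, y_train, 21.0, k=2))  # 2.5 (neighbors at 20 and 30)
```

With k=1 this reduces to the nearest-neighbor model above; with k equal to the number of training points it reduces to the naïve (mean) model.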
Simple Prediction Models
We can do the modeling for any number of nearest neighbors; $k$ equal to 1, 3, and 8 are shown in the plot below. The horizontal segments in the plot are the regions where the point at the center of each segment is the nearest neighbor of every $x$ in that region. When $k$ is equal to 1, the model goes through every point. When $k$ is equal to 8, since we only have 8 points, we are back to the naïve model where all the points are averaged.
Simple Prediction Models with All Data
Using more data, we can try different $k$ models. The plot below shows $k=1$, 10, and 70. When $k$ is equal to 1, the model goes through every point, as expected. When $k$ is equal to 10, the line becomes a little bit more descriptive, and for $k$ equal to 70 it gets closer and closer to the average, or naïve, model.
k-Nearest Neighbors - kNN
kNN is a non-parametric learning algorithm, meaning that it makes no assumptions about the underlying data distribution. The k-Nearest Neighbors algorithm can be described more formally. Given a dataset $\{(x_1, y_1), \dots, (x_n, y_n)\}$, for every new $x$:
1. Find the $k$ observations in the dataset most similar to $x$. These are called the $k$ nearest neighbors of $x$.
2. Average the outputs of the $k$ nearest neighbors of $x$ to obtain the prediction $\hat{y}$.
1.6 Exercise: Simple Data Plotting
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# "Advertising.csv" contains the data set used in this exercise
data_filename = 'Advertising.csv'
# Read the file "Advertising.csv" file using the pandas library
df = pd.read_csv(data_filename)
# Get a quick look of the data
df.info()
# Create a new dataframe by selecting the first 7 rows of
# the current dataframe
df_new = df.head(7)
# Print your new dataframe to see if you have selected 7 rows correctly
print(df_new)
# Use a scatter plot for plotting a graph of TV vs Sales
plt.scatter(df_new['TV'],df_new['Sales'])
# Add axis labels for clarity (x : TV budget, y : Sales)
plt.xlabel('TV budget')
plt.ylabel('Sales')
# Add plot title
plt.title('Sales vs TV budget')
# Plot the full dataset for comparison
plt.scatter(df['TV'],df['Sales'])
plt.xlabel('TV budget')
plt.ylabel('Sales')
plt.title('Sales vs TV budget')
2. Error Evaluation and Model Comparison
2.1 Error Evaluation
We created several different models in the example of different k-nearest neighbors, so we need a way to decide which model is best. We evaluate the error of our model by looking at how well the model is doing outside the data that is used to make the prediction.
We start with a set of data and randomly hide some of that data from our model. This is called a train-validation split. We use the visible part of the data (the training set) to estimate , and the hidden part (the validation set) to evaluate the model.
The one-neighbor model (k=1), fit using the training set, is shown on the plot. Now, we look at the data we have not used to build the model: the validation data, shown as purple crosses.
The difference between the true value $y_i$ and the prediction $\hat{y}_i$ is called the residual.
Observation Errors
For each observation $(x_i, y_i)$, the absolute value of the residual, $|y_i - \hat{y}_i|$, quantifies the error at that observation.
In order to quantify how well a model performs, we aggregate the errors across the data, and we call that the loss, error, or cost function. Cost usually refers to the total loss, while loss refers to a single training point.
A common loss function for quantitative outcomes is the Mean Squared Error (MSE):

$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
The MSE is by no means the only valid loss function, and it's not always the best one to use! Other choices for loss function include:
- Max Absolute Error
- Mean Absolute Error
- Mean Squared Error
The square root of the mean of the squared errors, known as the Root Mean Squared Error (RMSE), is also commonly used.
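These loss functions can be sketched in a few lines (the function names are ours, not from a library):

```python
import numpy as np

def max_abs_error(y, y_hat):
    return np.max(np.abs(y - y_hat))

def mean_abs_error(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def mean_sq_error(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def root_mean_sq_error(y, y_hat):
    # RMSE brings the error back to the units of y
    return np.sqrt(mean_sq_error(y, y_hat))

y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.0, 2.0, 5.0])
print(mean_sq_error(y, y_hat))  # 4/3, since the residuals are (0, 0, 2)
```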
2.2 Model Comparison
Now that we have a measure of how well our model performs, we have a way to do model comparison: we compute the MSE for every value of $k$ and compare.
In the plot we compare the MSE for different k-nearest neighbor models on the validation data. Three neighbors seems to be the best model, since it has the lowest MSE. However, we should be a bit careful, since it is close to k=4: the reason that k=3 is lower than k=4 may come down to how we originally split the data between train and validation.
Which model is the best? k=3 seems to be the best model.
Model Fitness
We now have a way to compare models. But just because a model is the best does not mean that the model is good. For a subset of the data, with our best model of k=3, we calculate the MSE to be 5.0. Is that good enough? What if instead we measure sales in individual units rather than thousands of units? For k=3 the MSE is now about 5,004,930, since the squared errors scale by $1000^2$. Is the MSE now good enough?
In order to answer this question we create a scale to compare our model to a very bad model and a very good model.
We will use the simplest model, the average or naïve model, for comparison: $\hat{y} = \bar{y}$ is the worst possible model we can build that still makes sense.
We will say that $\hat{y} = y$ is the best possible model, when the prediction is identical to the true value.
R-Squared
We put that onto a scale from 0 to 1 by creating a new quantity, $R$-squared:

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$
- If our model is as good as the mean value, $\hat{y} = \bar{y}$, then $R$-squared $= 0$
- If our model is perfect, $\hat{y} = y$, then $R$-squared $= 1$
- $R$-squared can be negative if the model is worse than the average. This can happen when we evaluate the model on the validation set.
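A sketch of $R$-squared computed directly from this definition (it matches what sklearn's r2_score computes):

```python
import numpy as np

def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)     # error of our model
    ss_tot = np.sum((y - y.mean()) ** 2)  # error of the naive (mean) model
    return 1 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])
print(r_squared(y, y))                          # 1.0: perfect predictions
print(r_squared(y, np.full(len(y), y.mean())))  # 0.0: the naive model
```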
2.3 Exercise: Simple kNN Regression
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
%matplotlib inline
# Read the data from the file "Advertising.csv"
filename = 'Advertising.csv'
df_adv = pd.read_csv(filename)
# Take a quick look of the dataset
df_adv.head()
Part 1: KNN by hand for k=1
# Get a subset of the data i.e. rows 5 to 13
# Use the TV column as the predictor
x_true = df_adv.TV.iloc[5:13]
# Use the Sales column as the response
y_true = df_adv.Sales.iloc[5:13]
# Sort the data to get indices ordered from lowest to highest TV values
idx = np.argsort(x_true).values
# Get the predictor data in the order given by idx above
x_true = x_true.iloc[idx].values
# Get the response data in the order given by idx above
y_true = y_true.iloc[idx].values
# Define a function that finds the index of the nearest neighbor
# and returns the value of the nearest neighbor.
# Note that this is just for k = 1 where the distance function is
# simply the absolute value.
def find_nearest(array,value):
    # Hint: To find idx, use the .idxmin() function on the series
    idx = pd.Series(np.abs(array-value)).idxmin()
    # Return the nearest neighbor index and value
    return idx, array[idx]
# Create some synthetic x-values (might not be in the actual dataset)
x = np.linspace(np.min(x_true), np.max(x_true))
# Initialize the y-values for the length of the synthetic x-values to zero
y = np.zeros((len(x)))
# Apply the KNN algorithm to predict the y-value for the given x value
for i, xi in enumerate(x):
    # Get the Sales value of the nearest neighbor to the given x value
    y[i] = y_true[find_nearest(x_true, xi)[0]]
# Plot the synthetic data along with the predictions
plt.plot(x, y, '-.')
# Plot the original data using black x's.
plt.plot(x_true, y_true, 'kx')
# Set the title and axis labels
plt.title('TV vs Sales')
plt.xlabel('TV budget in $1000')
plt.ylabel('Sales in $1000')
Part 2: KNN for k≥1 using sklearn
# Read the data from the file "Advertising.csv"
data_filename = 'Advertising.csv'
df = pd.read_csv(data_filename)
# Set 'TV' as the 'predictor variable'
x = df[['TV']]
# Set 'Sales' as the response variable 'y'
y = df['Sales']
# Split the dataset in training and testing with 60% training set
# and 40% testing set with random state = 42
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.6,random_state=42)
# Choose the minimum k value based on the instructions given on the left
k_value_min = 1
# Choose the maximum k value based on the instructions given on the left
k_value_max = 70
# Create a list of integer k values between k_value_min and k_value_max using linspace
k_list = np.linspace(k_value_min, k_value_max, num=70, dtype=int)
# Set the grid to plot the values
fig, ax = plt.subplots(figsize=(10,6))
# Variable used to alter the linewidth of each plot
j=0
# Loop over all the k values
for k_value in k_list:
    # Create a kNN regression model for the current k
    model = KNeighborsRegressor(n_neighbors=int(k_value))
    # Fit the regression model on the training data
    model.fit(x_train,y_train)
    # Use the trained model to predict on the test data
    y_pred = model.predict(x_test)
    # Helper code to plot the data along with the model predictions
    colors = ['grey','r','b']
    if k_value in [1,10,70]:
        xvals = np.linspace(x.min(),x.max(),100)
        ypreds = model.predict(xvals)
        ax.plot(xvals, ypreds,'-',label = f'k = {int(k_value)}',linewidth=j+2,color = colors[j])
        j+=1
ax.legend(loc='lower right',fontsize=20)
ax.plot(x_train, y_train,'x',label='train',color='k')
ax.set_xlabel('TV budget in $1000',fontsize=20)
ax.set_ylabel('Sales in $1000',fontsize=20)
plt.tight_layout()
2.4 Exercise: Finding the Best k in kNN Regression
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
%matplotlib inline
# Read the file 'Advertising.csv' into a Pandas dataset
df = pd.read_csv('Advertising.csv')
# Take a quick look at the data
df.head()
# Set the 'TV' column as predictor variable
x = df[['TV']]
# Set the 'Sales' column as response variable
y = df['Sales']
# Split the dataset in training and testing with 60% training set and
# 40% testing set
x_train, x_test, y_train, y_test = train_test_split(x,y,train_size=0.6,random_state=66)
# Choosing k range from 1 to 70
k_value_min = 1
k_value_max = 70
# Create a list of integer k values between k_value_min and
# k_value_max using linspace
k_list = np.linspace(k_value_min,k_value_max,num=70,dtype=int)
# Setup a grid for plotting the data and predictions
fig, ax = plt.subplots(figsize=(10,6))
# Create a dictionary to store the k value against MSE fit {k: MSE@k}
knn_dict = {}
# Variable used for altering the linewidth of values kNN models
j=0
# Loop over all k values
for k_value in k_list:
    # Create a kNN regression model for the current k
    model = KNeighborsRegressor(n_neighbors=int(k_value))
    # Fit the model on the train data
    model.fit(x_train,y_train)
    # Use the trained model to predict on the test data
    y_pred = model.predict(x_test)
    # Calculate the MSE of the test data predictions
    MSE = mean_squared_error(y_test, y_pred)
    # Store the MSE value of each k in the dictionary
    knn_dict[k_value] = MSE
    # Helper code to plot the data and various kNN model predictions
    colors = ['grey','r','b']
    if k_value in [1,10,70]:
        xvals = np.linspace(x.min(),x.max(),100)
        ypreds = model.predict(xvals)
        ax.plot(xvals, ypreds,'-',label = f'k = {int(k_value)}',linewidth=j+2,color = colors[j])
        j+=1
ax.legend(loc='lower right',fontsize=20)
ax.plot(x_train, y_train,'x',label='train',color='k')
ax.set_xlabel('TV budget in $1000',fontsize=20)
ax.set_ylabel('Sales in $1000',fontsize=20)
plt.tight_layout()
# Plot a graph which depicts the relation between the k values and MSE
plt.figure(figsize=(8,6))
plt.plot(list(knn_dict.keys()),list(knn_dict.values()),'k.-',alpha=0.5,linewidth=2)
# Set the title and axis labels
plt.xlabel('k',fontsize=20)
plt.ylabel('MSE',fontsize = 20)
plt.title('Test $MSE$ values for different k values - KNN regression',fontsize=20)
plt.tight_layout()
# Find the lowest MSE among all the kNN models
min_mse = min(knn_dict.values())
# Use list comprehensions to find the k value associated with the lowest MSE
best_model = [key for (key, value) in knn_dict.items() if value == min_mse]
# Print the best k-value
print("The best k value is", best_model, "with a MSE of", min_mse)
# Helper code to compute the R2_score of your best model
model = KNeighborsRegressor(n_neighbors=best_model[0])
model.fit(x_train,y_train)
y_pred_test = model.predict(x_test)
# Print the R2 score of the model
print(f"The R2 score for your model is {r2_score(y_test, y_pred_test)}")
3. Linear Regression
Note that in building our kNN model for prediction, we did not compute a closed form for $\hat{f}$; kNN is a non-parametric model.
What if we ask the question, "how much more sales do we expect if we double the TV advertising budget?"
We can build a model by first assuming a simple form of $f$:

$$y = \beta_1 x + \beta_0 + \epsilon$$

where $\beta_1$ represents the slope and $\beta_0$ represents the intercept. It then follows that our estimate is:

$$\hat{y} = \hat{\beta}_1 x + \hat{\beta}_0$$

where $\hat{\beta}_1$ and $\hat{\beta}_0$ are estimates of $\beta_1$ and $\beta_0$, respectively, which we compute using the observations.
3.1 Estimate of the regression coefficients
For a given data set, we can draw many different lines through the data; we need to determine which line is the best fit.
Is this it?
Or this one?
Question: Which line is the best?
To decide which is the best line, we can do the same as we did with the kNN model. We estimate the error of every candidate line by looking at the residuals, the differences between the observed values of $y$ and the predicted values $\hat{y}$. As shown in the plot below, the regression or model line (predicted sales) is the slanted green line, and the residuals from the observed sales values are the vertical red lines.
As before, for each observation $(x_i, y_i)$, the absolute residual, $|y_i - \hat{y}_i|$, quantifies the error at that observation.
Again, we aggregate and use the Mean Squared Error (MSE) as our loss function:

$$L(\beta_0, \beta_1) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - (\beta_1 x_i + \beta_0)\right)^2$$
The difference from the kNN model is that there is now a specific formula for $\hat{f}$: the linear model.
We choose $\hat{\beta}_0$ and $\hat{\beta}_1$ to minimize the predictive errors made by our model, that is, to minimize our loss function.
Then the optimal values should be:

$$(\hat{\beta}_0, \hat{\beta}_1) = \underset{\beta_0, \beta_1}{\operatorname{argmin}}\; L(\beta_0, \beta_1)$$
We call this "fitting" or "training" the model.
3.2 Introducing Scikit-Learn
We can do the model training using the scikit-learn library. These details are also a recap from the exercise "Linear Regression Using Sklearn".
First we import the linear regression module
from sklearn.linear_model import LinearRegression
We use pandas and the method read_csv
to read the advertising data and assign it to a data frame which we call df
df = pd.read_csv('Advertising.csv')
We take the TV column and the sales column and assign them to $X$ and $y$
X = df[['TV']].values
y = df['Sales'].values
We instantiate the model by using linear regression and call it reg
reg = LinearRegression()
We now use that model to fit the data by passing $X$ and $y$
reg.fit(X,y)
Once we have fitted the model, we can use the same variable, reg, to access the fitted attributes and see the slope and intercept. The trailing underscore is scikit-learn's convention for attributes that are set during fitting.
reg.coef_
>>> array([0.04665056])
reg.intercept_
>>> 7.08543108
We can use the method predict to predict, at any given $x$, what the value of $\hat{y}$ would be
reg.predict(np.array([[100]]))
>>> array([11.75048733])
What is happening behind the scenes? This code is hiding a lot of things, and we should investigate what's happening in the .fit
method in order to understand how we find the $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize the loss function.
3.3 Calculus Review
Derivative Definition
First we have to review some concepts. A derivative is the instantaneous rate of change, or slope, of a single-valued function. Given a function $f(x)$, the derivative can be defined as the slope between two points on the curve as the two points come infinitely close to each other. As shown in the figure, the derivative is the slope between the blue point and the red point as the two points approach each other.
Partial Derivatives
Another concept is the partial derivative, which applies to a function that depends on two or more variables. The rate of change of the function with respect to one variable, while the others are held fixed, is called a partial derivative.
For a function $f(x, y)$, the partial derivative with respect to $x$ is written as $\frac{\partial f}{\partial x}$.
So, we have a function that depends on $x$ and $y$: the partial derivative of $f$ with respect to $x$ is the slope when $y$ is kept fixed, and the partial derivative of $f$ with respect to $y$ is the slope when $x$ is kept fixed.
Optimization
In our simple linear regression, the loss function depends on $\beta_0$ and $\beta_1$. To minimize a loss function, we need to determine the rate of change of the function with respect to one variable while the other is held fixed.
An interior minimum or maximum of the loss function must occur at a point where the gradient (slope) is equal to zero, so the partial derivatives there must be equal to zero.
There are three ways we can do this:
- Brute force: try every combination of $\beta_0$ and $\beta_1$ and find where both partial derivatives are equal to 0
- Exact: solving the equations analytically only works in rare cases; since there are 2 equations and 2 unknowns in our example, it is solvable
- Greedy algorithm: gradient descent, which will be discussed later
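As an illustration of the brute-force option, we can scan a grid of $(\beta_0, \beta_1)$ values on toy data (the data and the grid range are made up for the example) and keep the pair with the lowest MSE:

```python
import numpy as np

# Toy data generated roughly as y = 1 + 2x (illustrative, not the advertising data)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

best_b0, best_b1, best_mse = 0.0, 0.0, np.inf
for b0 in np.linspace(-5, 5, 101):      # candidate intercepts
    for b1 in np.linspace(-5, 5, 101):  # candidate slopes
        mse = np.mean((y - (b0 + b1 * x)) ** 2)
        if mse < best_mse:
            best_b0, best_b1, best_mse = b0, b1, mse

print(best_b0, best_b1)  # close to 1 and 2
```

Brute force only works here because there are two parameters and a coarse grid; the analytic solution below avoids the search entirely.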
Partial Derivative Example
For our linear regression, we are going to solve the problem analytically and exactly. We start with the loss function, which depends on $\beta_0$ and $\beta_1$, and want the partial derivatives with respect to $\beta_0$ and $\beta_1$.
If $L = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - (\beta_1 x_i + \beta_0)\right)^2$, then what are $\frac{\partial L}{\partial \beta_0}$ and $\frac{\partial L}{\partial \beta_1}$?
We need the chain rule to solve this. The chain rule says that if a function depends on an intermediate function, which in turn depends on a variable, then the derivative of the original function with respect to the variable is the product of two derivatives: the derivative of the outer function with respect to the intermediate function, multiplied by the derivative of the intermediate function with respect to the variable.
We can use the chain rule to derive the partial derivatives of the loss function with respect to $\beta_0$ and with respect to $\beta_1$.
Partial Derivative with Respect to $\beta_0$
If $L = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - (\beta_1 x_i + \beta_0)\right)^2$, then what is $\frac{\partial L}{\partial \beta_0}$? Using the chain rule and some algebra:

$$\frac{\partial L}{\partial \beta_0} = -\frac{2}{n}\sum_{i=1}^{n}\left(y_i - (\beta_1 x_i + \beta_0)\right)$$

Partial Derivative with Respect to $\beta_1$
We can use the same algebra used for $\frac{\partial L}{\partial \beta_0}$ to find $\frac{\partial L}{\partial \beta_1}$:

$$\frac{\partial L}{\partial \beta_1} = -\frac{2}{n}\sum_{i=1}^{n} x_i\left(y_i - (\beta_1 x_i + \beta_0)\right)$$
Optimization
In order to minimize the loss function, we end up with two equations: one setting the partial derivative with respect to $\beta_0$ to zero, and the other setting the partial derivative with respect to $\beta_1$ to zero. The point where both partial derivatives are zero minimizes the loss function.
Putting it through the algebraic machine, we get the values of $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize the loss function.
3.4 Summary: Estimate of the Regression Coefficients
We use MSE as our loss function:

$$L(\beta_0, \beta_1) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - (\beta_1 x_i + \beta_0)\right)^2$$
We choose $\hat{\beta}_0$ and $\hat{\beta}_1$ to minimize the predictive errors made by our model, i.e., to minimize our loss function.
Then the optimal values for $\hat{\beta}_0$ and $\hat{\beta}_1$ should be:

$$(\hat{\beta}_0, \hat{\beta}_1) = \underset{\beta_0, \beta_1}{\operatorname{argmin}}\; L(\beta_0, \beta_1)$$
Remember, we call this fitting or training the model.
3.5 Estimate of the Regression Coefficients: Analytical Solution
We calculate the partial derivatives and set them equal to zero. Using algebra, we find the values of $\hat{\beta}_0$ and $\hat{\beta}_1$.
Finding the exact solution only works in rare cases. Linear regression is one of these rare cases.
Take the gradient of the loss function and find the values of $\hat{\beta}_0$ and $\hat{\beta}_1$ where the gradient is zero:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

where $\bar{x}$ and $\bar{y}$ are the sample means.
The line which uses $\hat{\beta}_0$ and $\hat{\beta}_1$ is called the regression line.
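These formulas translate directly to code. A minimal sketch (the helper name is ours), checked on data generated from a known line:

```python
import numpy as np

def fit_simple_ols(x, y):
    # Closed-form least-squares estimates for y = beta0 + beta1 * x
    x_bar, y_bar = x.mean(), y.mean()
    beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 + 3.0 * x  # exact line, no noise
b0, b1 = fit_simple_ols(x, y)
print(b0, b1)  # 2.0 3.0
```

This is what sklearn's LinearRegression.fit computes for a single predictor, without the search over candidate lines.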