1. Introduction to Regression
1.1 Predicting a Variable
Let’s imagine a scenario in which we would like to predict one variable using another variable, or a set of other variables. An example is predicting the number of views a TikTok video will get next week based on video length, the date it was posted, the previous number of views, etc. Or predicting which movies a Netflix user will rate highly based on their previous movie ratings, demographic data, etc.
Working Example
Throughout the course, we will try to use working examples - sets of real-world data that we can use as a class. The advertising data set (below) consists of the sales of a particular product in 200 different markets, along with advertising budgets for the product in each of those markets across three different media: TV, radio, and newspaper. We have 4 columns and 200 rows, where budgets are given in units of $1,000 and sales in thousands of units.
| TV | radio | newspaper | sales |
|---|---|---|---|
| 230.1 | 37.8 | 69.2 | 22.1 |
| 44.5 | 39.3 | 45.1 | 10.4 |
| 17.2 | 45.9 | 69.3 | 9.3 |
| 151.5 | 41.3 | 58.5 | 18.5 |
| 180.8 | 10.8 | 58.4 | 12.9 |
Response vs. Predictor Variables
There is an asymmetry in many of these problems: the variable we would like to predict may be more difficult to measure, may be more important than the other(s), or may be directly or indirectly influenced by the other variable(s). Thus, we like to define two categories of variables:
- Variables whose values we use to make our prediction. These are known as predictors, features, or covariates.
- Variables whose values we want to predict. These are known as outcomes, response variables, or dependent variables.
In this course, the terms predictors and response variables will be used. The first three columns in the advertising data (TV, radio, newspaper) are the predictors, and their number is denoted by $p$. Each predictor on its own is a vector; when put together, they are denoted by $X$, also known as the data matrix or design matrix. These variables are used to predict sales, the response variable, which is denoted by $y$.
Every row in the data denotes one observation, and in general we have $n$ observations and $p$ predictors in a data set.
In general, the nomenclature that we will use throughout the course is as follows:
- $n$: the number of observations
- $p$: the number of predictors
- $X_j$: each individual predictor, for $j = 1, \dots, p$
- $y$: the response variable
1.2 Connecting with Pandas
If you're practicing on your home computer, make sure you have pandas installed on your machine. It's already available in our programming environment within the course, so if you are practicing here you won't need to install anything.
Let's connect with pandas by taking the data matrix $X$ and checking the shape attribute, which gives the dimensions of the data matrix.
X.shape
results in $(n, p)$, where $n$ is the number of observations and $p$ is the number of predictors.
For the response variable,
y.shape
results in $(n,)$, or $(200,)$.
For simplicity, for the rest of this section we will consider one predictor (TV budget), where the shape of the data matrix would be $(n, 1)$, or $(200, 1)$, and the shape of the response variable sales is the same as before, $(n,)$, or $(200,)$. In later sections we will consider more than one predictor.
Connecting to pandas, $X$ is defined as a pandas DataFrame with one column, and $y$ is defined as a pandas Series. When we later use sklearn to do modeling, sklearn expects the predictors to be a two-dimensional array with at least one column, not a Series.
When we read our data into a data frame df with a column named 'x', there is an important difference between the two operations df[['x']] and df['x'].
df[['x']] returns a pandas DataFrame object, a two-dimensional array we can use in sklearn.
df['x'] returns a pandas Series, which will give an error when used as the predictors in sklearn.
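A minimal sketch of the difference (the column name 'x' and the toy values are just illustrative):

```python
import pandas as pd

# Hypothetical toy data frame with a single predictor column 'x'
df = pd.DataFrame({'x': [230.1, 44.5, 17.2]})

X = df[['x']]  # DataFrame: 2-D, shape (3, 1) -- what sklearn expects
s = df['x']    # Series:    1-D, shape (3,)  -- causes an error in sklearn

print(X.shape)  # (3, 1)
print(s.shape)  # (3,)
```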
1.3 True vs. Statistical Model
Now that we have decided on names for our data, let us move on to how we are going to model it. We will assume that the response variable, $y$, relates to the predictors, $X$, through some unknown function, generally expressed as:

$$y = f(X) + \epsilon$$
Here, $f$ is the unknown function expressing an underlying rule relating $y$ to $X$, and $\epsilon$ is a random amount (not related to $X$) that describes the difference of $y$ from the rule $f(X)$.
A statistical model is any algorithm that estimates $f$. We denote the estimated function as $\hat{f}$.
1.4 Prediction vs. Estimation
For some problems, what is important is obtaining $\hat{f}$, our estimate of $f$. These are called inference problems.
When we use a set of measurements $(x_1, \dots, x_p)$ to predict a value for the response variable, we denote the predicted value by:

$$\hat{y} = \hat{f}(x_1, \dots, x_p)$$
In other cases, we do not care about the specific form of $\hat{f}$; we just want to make our predictions as close to the observed values $y$ as possible. These are called prediction problems.
1.5 Example: Predicting Sales
Going back to our example data, what we show here is a plot of the sales (in thousands of units) versus the TV budget (in thousands of dollars). The motivation is to predict the response variable, sales $y$, given the predictor $x$, the TV budget. So we want to build a model to predict sales based on TV budget.
Here is a plot of sales (in thousands of units) for different TV budgets (in thousands of dollars):
Statistical Model
How do we predict $\hat{y}$ for some $x$? The goal here is to predict the sales given some TV budget; for example, we want to find the value of the sales for a TV budget of about $75,000, or the sales for a TV budget of about $160,000.
To simplify, we will start by looking at eight points on the plot.
We can first come up with a very simple model, called a naïve model, by taking the mean of all the sales values over all our observations:

$$\hat{y} = \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$
For all TV budgets, the naïve model would predict the average sales and can be used as a baseline later.
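As a sketch, the naïve baseline for the five rows shown in the table above would be:

```python
import numpy as np

# Sales values from the five advertising rows shown earlier
y = np.array([22.1, 10.4, 9.3, 18.5, 12.9])

# The naive model predicts the mean of y for every TV budget
y_naive = np.full(len(y), y.mean())
print(y_naive)  # every prediction equals the mean of y
```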
Simple Prediction Model
We can do better than the naïve model. Let's think of a different type of simple prediction model. We motivate ourselves with an example from everyday life.
Suppose you go to a doctor with some symptoms, for example your tummy is hurting. The doctor will think about other patients they have seen with similar symptoms and give you the same treatment. Thus, one type of simple prediction model finds the most similar predictor values in the data and predicts the corresponding $y$.
How do we find $\hat{y}$ at some $x$?
- Find the distances from $x$ to all other points $x_i$
- Find the nearest neighbor, the $x_i$ with the smallest distance
- Predict $\hat{y} = y_i$. In other words, what we predict for $y$ is the same as the $y$ of the nearest neighbor.
Let's apply this type of model to our data. If we want to know the sales given, say, a $175,000 budget for TV advertising, we look at similar examples: those that are the nearest neighbors to our TV budget. We find the nearest neighbor by finding the smallest distance. In this way we find the most similar example we have in our data. Once we have that, our prediction will be identical to the nearest neighbor's sales value $y$.
We can then do the same for all points $x$ to get a simple prediction model using the nearest neighbor.
Extend the Prediction Model
We can extend the model from one neighbor to any number $k$ of nearest neighbors. So in the example, we can take the 2 nearest neighbors ($k=2$), which are circled in red, average their $y$ values, and that is our sales prediction.
Generalizing, we measure the distance from $x$ to all other points, find the $k$ nearest neighbors, and take the average of their $y$ values.
What is $\hat{y}$ at some $x$?
- Find the distances from $x$ to all other points
- Find the $k$ nearest neighbors, $x_{i_1}, \dots, x_{i_k}$
- Predict $\hat{y} = \frac{1}{k}\sum_{j=1}^{k} y_{i_j}$
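The steps above can be sketched for a single predictor (the function name and toy data are illustrative, not from a library):

```python
import numpy as np

def knn_predict(x_train, y_train, x0, k):
    # Distances from the query point x0 to every training point
    dist = np.abs(x_train - x0)
    # Indices of the k nearest neighbors
    nearest = np.argsort(dist)[:k]
    # Predict the average of their y values
    return y_train[nearest].mean()

x_train = np.array([10.0, 20.0, 30.0, 40.0])
y_train = np.array([1.0, 2.0, 3.0, 4.0])
print(knn_predict(x_train, y_train, 21.0, k=2))  # 2.5 (neighbors at 20 and 30)
```

With k=1 this reduces to the nearest-neighbor model above; with k equal to the number of training points it reduces to the naïve (mean) model.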
Simple Prediction Models
We can do the modeling for any number of nearest neighbors; $k$ equal to 1, 3, and 8 are shown in the plot below. The horizontal segments in the plot are the regions where the point at the center of each segment is the nearest neighbor of every $x$ in that region. When $k$ is equal to 1, the model goes through every point. When $k$ is equal to 8, since we only have 8 points, we are back to the naïve model where all the points are averaged.
Simple Prediction Models with All Data
Using more data, we can try different $k$ models. The plot below shows $k=1$, 10, and 70. When $k$ is equal to 1, the model goes through every point, as expected. When $k$ is equal to 10, the line becomes a little bit more descriptive, and for $k$ equal to 70 it gets closer and closer to the average, or naïve, model.
k-Nearest Neighbors - kNN
kNN is a non-parametric learning algorithm, meaning that it makes no assumptions about the underlying data distribution. The k-Nearest Neighbors algorithm can be described more formally. Given a dataset $\{(x_1, y_1), \dots, (x_n, y_n)\}$, for every new $x$:
1. Find the $k$ observations in the dataset most similar to $x$. These are called the $k$ nearest neighbors of $x$.
2. Average the outputs of the $k$ nearest neighbors of $x$ to obtain the prediction $\hat{y}$.
1.6 Exercise: Simple Data Plotting
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# "Advertising.csv" contains the data set used in this exercise
data_filename = 'Advertising.csv'
# Read the file "Advertising.csv" file using the pandas library
df = pd.read_csv(data_filename)
# Get a quick look of the data
df.info()
# Create a new dataframe by selecting the first 7 rows of
# the current dataframe
df_new = df.head(7)
# Print your new dataframe to see if you have selected 7 rows correctly
print(df_new)
# Use a scatter plot for plotting a graph of TV vs Sales
plt.scatter(df_new['TV'],df_new['Sales'])
# Add axis labels for clarity (x : TV budget, y : Sales)
plt.xlabel('TV budget')
plt.ylabel('Sales')
# Add plot title
plt.title('Sales vs TV budget')
# Plot the full dataset for comparison
plt.scatter(df['TV'],df['Sales'])
plt.xlabel('TV budget')
plt.ylabel('Sales')
plt.title('Sales vs TV budget')
2. Error Evaluation and Model Comparison
2.1 Error Evaluation
We created several different models in the example of different k-nearest neighbors, so we need a way to decide which model is best. We evaluate the error of our model by looking at how well the model is doing outside the data that is used to make the prediction.
We start with a set of data and randomly hide some of that data from our model. This is called a train-validation split. We use the visible part of the data (the training set) to estimate , and the hidden part (the validation set) to evaluate the model.
The one-neighbor model (k=1), fit using the training set, is shown on the plot. Now, we look at the data we have not used to build the model: the validation data, shown as purple crosses.
The difference between the true value $y_i$ and the prediction $\hat{y}_i$ is called the residual.
Observation Errors
For each observation $(x_i, y_i)$, the absolute value of the residual, $|y_i - \hat{y}_i|$, quantifies the error at that observation.
In order to quantify how well a model performs, we aggregate the errors across the data, and we call that the loss, error, or cost function. Cost usually refers to the total loss, while loss refers to a single training point.
A common loss function for quantitative outcomes is the Mean Squared Error (MSE):

$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
The MSE is by no means the only valid loss function, and it's not always the best one to use! Other choices for loss function include:
- Max Absolute Error
- Mean Absolute Error
- Mean Squared Error
The square root of the mean of the squared errors, known as the Root Mean Squared Error (RMSE), is also commonly used.
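These loss functions can be sketched in a few lines (the function names are ours, not from a library):

```python
import numpy as np

def max_abs_error(y, y_hat):
    return np.max(np.abs(y - y_hat))

def mean_abs_error(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def mean_sq_error(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def root_mean_sq_error(y, y_hat):
    # RMSE brings the error back to the units of y
    return np.sqrt(mean_sq_error(y, y_hat))

y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.0, 2.0, 5.0])
print(mean_sq_error(y, y_hat))  # 4/3, since the residuals are (0, 0, 2)
```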
2.2 Model Comparison
Now that we have a measure of how well our model performs, we have a way to do model comparison: we compute the MSE for every value of $k$ and compare.
In the plot we compare the MSE for different k-nearest neighbor models on the validation data. Three neighbors seems to be the best model, since it has the lowest MSE. However, we should be a bit careful, since it is close to k=4: the reason that k=3 is lower than k=4 may come down to how we originally split the data between train and validation.
Which model is the best? k=3 seems to be the best model.
Model Fitness
We now have a way to compare models. But just because a model is the best does not mean that the model is good. For a subset of the data, with our best model of k=3, we calculate the MSE to be 5.0. Is that good enough? What if instead we measure sales in individual units rather than thousands of units? For k=3 the MSE is now about 5,004,930, since the squared errors scale by $1000^2$. Is the MSE now good enough?
In order to answer this question we create a scale to compare our model to a very bad model and a very good model.
We will use the simplest model, the average or naïve model, for comparison: $\hat{y} = \bar{y}$ is the worst possible model we can build that still makes sense.
We will say that $\hat{y} = y$ is the best possible model, when the prediction is identical to the true value.
R-Squared
We put that onto a scale from 0 to 1 by creating a new quantity, $R$-squared:

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$
- If our model is as good as the mean value, $\hat{y} = \bar{y}$, then $R$-squared $= 0$
- If our model is perfect, $\hat{y} = y$, then $R$-squared $= 1$
- $R$-squared can be negative if the model is worse than the average. This can happen when we evaluate the model on the validation set.
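A sketch of $R$-squared computed directly from this definition (it matches what sklearn's r2_score computes):

```python
import numpy as np

def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)     # error of our model
    ss_tot = np.sum((y - y.mean()) ** 2)  # error of the naive (mean) model
    return 1 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])
print(r_squared(y, y))                          # 1.0: perfect predictions
print(r_squared(y, np.full(len(y), y.mean())))  # 0.0: the naive model
```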
2.3 Exercise: Simple kNN Regression
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
%matplotlib inline
# Read the data from the file "Advertising.csv"
filename = 'Advertising.csv'
df_adv = pd.read_csv(filename)
# Take a quick look of the dataset
df_adv.head()
Part 1: KNN by hand for k=1
# Get a subset of the data i.e. rows 5 to 13
# Use the TV column as the predictor
x_true = df_adv.TV.iloc[5:13]
# Use the Sales column as the response
y_true = df_adv.Sales.iloc[5:13]
# Sort the data to get indices ordered from lowest to highest TV values
idx = np.argsort(x_true).values
# Get the predictor data in the order given by idx above
x_true = x_true.iloc[idx].values
# Get the response data in the order given by idx above
y_true = y_true.iloc[idx].values
# Define a function that finds the index of the nearest neighbor
# and returns the value of the nearest neighbor.
# Note that this is just for k = 1 where the distance function is
# simply the absolute value.
def find_nearest(array,value):
    # Hint: To find idx, use the .idxmin() function on the series
    idx = pd.Series(np.abs(array-value)).idxmin()
    # Return the nearest neighbor index and value
    return idx, array[idx]
# Create some synthetic x-values (might not be in the actual dataset)
x = np.linspace(np.min(x_true), np.max(x_true))
# Initialize the y-values for the length of the synthetic x-values to zero
y = np.zeros((len(x)))
# Apply the KNN algorithm to predict the y-value for the given x value
for i, xi in enumerate(x):
    # Get the Sales value of the nearest neighbor to the given x value
    y[i] = y_true[find_nearest(x_true, xi)[0]]
# Plot the synthetic data along with the predictions
plt.plot(x, y, '-.')
# Plot the original data using black x's.
plt.plot(x_true, y_true, 'kx')
# Set the title and axis labels
plt.title('TV vs Sales')
plt.xlabel('TV budget in $1000')
plt.ylabel('Sales in $1000')
Part 2: KNN for k≥1 using sklearn
# Read the data from the file "Advertising.csv"
data_filename = 'Advertising.csv'
df = pd.read_csv(data_filename)
# Set 'TV' as the 'predictor variable'
x = df[['TV']]
# Set 'Sales' as the response variable 'y'
y = df['Sales']
# Split the dataset in training and testing with 60% training set
# and 40% testing set with random state = 42
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.6,random_state=42)
# Choose the minimum k value based on the instructions given on the left
k_value_min = 1
# Choose the maximum k value based on the instructions given on the left
k_value_max = 70
# Create a list of integer k values between k_value_min and k_value_max using linspace
k_list = np.linspace(k_value_min, k_value_max, num=70, dtype=int)
# Set the grid to plot the values
fig, ax = plt.subplots(figsize=(10,6))
# Variable used to alter the linewidth of each plot
j=0
# Loop over all the k values
for k_value in k_list:
    # Create a kNN regression model for the current k
    model = KNeighborsRegressor(n_neighbors=int(k_value))
    # Fit the regression model on the training data
    model.fit(x_train,y_train)
    # Use the trained model to predict on the test data
    y_pred = model.predict(x_test)
    # Helper code to plot the data along with the model predictions
    colors = ['grey','r','b']
    if k_value in [1,10,70]:
        xvals = np.linspace(x.min(),x.max(),100)
        ypreds = model.predict(xvals)
        ax.plot(xvals, ypreds,'-',label = f'k = {int(k_value)}',linewidth=j+2,color = colors[j])
        j+=1
ax.legend(loc='lower right',fontsize=20)
ax.plot(x_train, y_train,'x',label='train',color='k')
ax.set_xlabel('TV budget in $1000',fontsize=20)
ax.set_ylabel('Sales in $1000',fontsize=20)
plt.tight_layout()
2.4 Exercise: Finding the Best k in kNN Regression
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
%matplotlib inline
# Read the file 'Advertising.csv' into a Pandas dataset
df = pd.read_csv('Advertising.csv')
# Take a quick look at the data
df.head()
# Set the 'TV' column as predictor variable
x = df[['TV']]
# Set the 'Sales' column as response variable
y = df['Sales']
# Split the dataset in training and testing with 60% training set and
# 40% testing set
x_train, x_test, y_train, y_test = train_test_split(x,y,train_size=0.6,random_state=66)
# Choosing k range from 1 to 70
k_value_min = 1
k_value_max = 70
# Create a list of integer k values between k_value_min and
# k_value_max using linspace
k_list = np.linspace(k_value_min,k_value_max,num=70,dtype=int)
# Setup a grid for plotting the data and predictions
fig, ax = plt.subplots(figsize=(10,6))
# Create a dictionary to store the k value against MSE fit {k: MSE@k}
knn_dict = {}
# Variable used for altering the linewidth of values kNN models
j=0
# Loop over all k values
for k_value in k_list:
    # Create a kNN regression model for the current k
    model = KNeighborsRegressor(n_neighbors=int(k_value))
    # Fit the model on the train data
    model.fit(x_train,y_train)
    # Use the trained model to predict on the test data
    y_pred = model.predict(x_test)
    # Calculate the MSE of the test data predictions
    MSE = mean_squared_error(y_test, y_pred)
    # Store the MSE value of each k in the dictionary
    knn_dict[k_value] = MSE
    # Helper code to plot the data and various kNN model predictions
    colors = ['grey','r','b']
    if k_value in [1,10,70]:
        xvals = np.linspace(x.min(),x.max(),100)
        ypreds = model.predict(xvals)
        ax.plot(xvals, ypreds,'-',label = f'k = {int(k_value)}',linewidth=j+2,color = colors[j])
        j+=1
ax.legend(loc='lower right',fontsize=20)
ax.plot(x_train, y_train,'x',label='train',color='k')
ax.set_xlabel('TV budget in $1000',fontsize=20)
ax.set_ylabel('Sales in $1000',fontsize=20)
plt.tight_layout()
# Plot a graph which depicts the relation between the k values and MSE
plt.figure(figsize=(8,6))
plt.plot(list(knn_dict.keys()),list(knn_dict.values()),'k.-',alpha=0.5,linewidth=2)
# Set the title and axis labels
plt.xlabel('k',fontsize=20)
plt.ylabel('MSE',fontsize = 20)
plt.title('Test $MSE$ values for different k values - KNN regression',fontsize=20)
plt.tight_layout()
# Find the lowest MSE among all the kNN models
min_mse = min(knn_dict.values())
# Use list comprehensions to find the k value associated with the lowest MSE
best_model = [key for (key, value) in knn_dict.items() if value == min_mse]
# Print the best k-value
print("The best k value is", best_model, "with a MSE of", min_mse)
# Helper code to compute the R2_score of your best model
model = KNeighborsRegressor(n_neighbors=best_model[0])
model.fit(x_train,y_train)
y_pred_test = model.predict(x_test)
# Print the R2 score of the model
print(f"The R2 score for your model is {r2_score(y_test, y_pred_test)}")
3. Linear Regression
Note that in building our kNN model for prediction, we did not compute a closed form for $\hat{f}$; kNN is a non-parametric model.
What if we ask the question, "how much more sales do we expect if we double the TV advertising budget?"
We can build a model by first assuming a simple form of $f$:

$$y = \beta_1 x + \beta_0 + \epsilon$$

where $\beta_1$ represents the slope and $\beta_0$ represents the intercept. It then follows that our estimate is:

$$\hat{y} = \hat{\beta}_1 x + \hat{\beta}_0$$

where $\hat{\beta}_1$ and $\hat{\beta}_0$ are estimates of $\beta_1$ and $\beta_0$, respectively, which we compute using the observations.
3.1 Estimate of the regression coefficients
For a given data set, we can draw many different lines through the data; we need to determine which line is the best fit.
Is this it?
Or this one?
Question: Which line is the best?
To decide which is the best line, we can do the same as we did with the kNN model. We estimate the error of every candidate line by looking at the residuals, the differences between the observed values of $y$ and the predicted values $\hat{y}$. As shown in the plot below, the regression or model line (predicted sales) is the slanted green line, and the residuals from the observed sales values are the vertical red lines.
As before, for each observation $(x_i, y_i)$, the absolute residual, $|y_i - \hat{y}_i|$, quantifies the error at that observation.
Again, we aggregate and use the Mean Squared Error (MSE) as our loss function:

$$L(\beta_0, \beta_1) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - (\beta_1 x_i + \beta_0)\right)^2$$
The difference from the kNN model is that there is now a specific formula for $\hat{f}$: the linear model.
We choose $\hat{\beta}_0$ and $\hat{\beta}_1$ to minimize the predictive errors made by our model, that is, to minimize our loss function.
Then the optimal values should be:

$$(\hat{\beta}_0, \hat{\beta}_1) = \underset{\beta_0, \beta_1}{\operatorname{argmin}}\; L(\beta_0, \beta_1)$$
We call this "fitting" or "training" the model.
3.2 Introducing Scikit-Learn
We can do the model training using the scikit-learn library. These details are also a recap from the exercise "Linear Regression Using Sklearn".
First we import the linear regression module
from sklearn.linear_model import LinearRegression
We use pandas and the method read_csv
to read the advertising data and assign it to a data frame which we call df
df = pd.read_csv('Advertising.csv')
We take the TV column and the sales column and assign them to $X$ and $y$
X = df[['TV']].values
y = df['Sales'].values
We instantiate the model by using linear regression and call it reg
reg = LinearRegression()
We now use that model to fit the data by passing $X$ and $y$
reg.fit(X,y)
Once we have fitted the model, we can use the same variable, reg, to access the fitted attributes and see the slope and intercept. The trailing underscore is scikit-learn's convention for attributes that are set during fitting.
reg.coef_
>>> array([0.04665056])
reg.intercept_
>>> 7.08543108
We can use the method predict to predict, at any given $x$, what the value of $\hat{y}$ would be
reg.predict(np.array([[100]]))
>>> array([11.75048733])
What is happening behind the scenes? This code is hiding a lot of things, and we should investigate what's happening in the .fit
method in order to understand how we find the $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize the loss function.
3.3 Calculus Review
Derivative Definition
First we have to review some concepts. A derivative is the instantaneous rate of change, or slope, of a single-valued function. Given a function $f(x)$, the derivative can be defined as the slope between two points on the curve as the two points come infinitely close to each other. As shown in the figure, the derivative is the slope between the blue point and the red point as the two points approach each other.
Partial Derivatives
Another concept is the partial derivative, which applies to a function that depends on two or more variables. The rate of change of the function with respect to one variable, while the others are held fixed, is called a partial derivative.
For a function $f(x, y)$, the partial derivative with respect to $x$ is written as $\frac{\partial f}{\partial x}$.
So, we have a function that depends on $x$ and $y$: the partial derivative of $f$ with respect to $x$ is the slope when $y$ is kept fixed, and the partial derivative of $f$ with respect to $y$ is the slope when $x$ is kept fixed.
Optimization
In our simple linear regression, the loss function depends on $\beta_0$ and $\beta_1$. To minimize a loss function, we need to determine the rate of change of the function with respect to one variable while the other is held fixed.
An interior minimum or maximum of the loss function must occur at a point where the gradient (slope) is equal to zero, so the partial derivatives there must be equal to zero.
There are three ways we can do this:
- Brute force: try every combination of $\beta_0$ and $\beta_1$ and find where both partial derivatives are equal to 0
- Exact: solving the equations analytically only works in rare cases; since there are 2 equations and 2 unknowns in our example, it is solvable
- Greedy algorithm: gradient descent, which will be discussed later
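As an illustration of the brute-force option, we can scan a grid of $(\beta_0, \beta_1)$ values on toy data (the data and the grid range are made up for the example) and keep the pair with the lowest MSE:

```python
import numpy as np

# Toy data generated roughly as y = 1 + 2x (illustrative, not the advertising data)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

best_b0, best_b1, best_mse = 0.0, 0.0, np.inf
for b0 in np.linspace(-5, 5, 101):      # candidate intercepts
    for b1 in np.linspace(-5, 5, 101):  # candidate slopes
        mse = np.mean((y - (b0 + b1 * x)) ** 2)
        if mse < best_mse:
            best_b0, best_b1, best_mse = b0, b1, mse

print(best_b0, best_b1)  # close to 1 and 2
```

Brute force only works here because there are two parameters and a coarse grid; the analytic solution below avoids the search entirely.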
Partial Derivative Example
For our linear regression, we are going to solve the problem analytically and exactly. We start with the loss function, which depends on $\beta_0$ and $\beta_1$, and want the partial derivatives with respect to $\beta_0$ and $\beta_1$.
If $L = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - (\beta_1 x_i + \beta_0)\right)^2$, then what are $\frac{\partial L}{\partial \beta_0}$ and $\frac{\partial L}{\partial \beta_1}$?
We need the chain rule to solve this. The chain rule says that if a function depends on an intermediate function, which in turn depends on a variable, then the derivative of the original function with respect to the variable is the product of two derivatives: the derivative of the outer function with respect to the intermediate function, multiplied by the derivative of the intermediate function with respect to the variable.
We can use the chain rule to derive the partial derivatives of the loss function with respect to $\beta_0$ and with respect to $\beta_1$.
Partial Derivative with Respect to $\beta_0$
If $L = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - (\beta_1 x_i + \beta_0)\right)^2$, then what is $\frac{\partial L}{\partial \beta_0}$? Using the chain rule and some algebra:

$$\frac{\partial L}{\partial \beta_0} = -\frac{2}{n}\sum_{i=1}^{n}\left(y_i - (\beta_1 x_i + \beta_0)\right)$$

Partial Derivative with Respect to $\beta_1$
We can use the same algebra used for $\frac{\partial L}{\partial \beta_0}$ to find $\frac{\partial L}{\partial \beta_1}$:

$$\frac{\partial L}{\partial \beta_1} = -\frac{2}{n}\sum_{i=1}^{n} x_i\left(y_i - (\beta_1 x_i + \beta_0)\right)$$
Optimization
In order to minimize the loss function, we end up with two equations: one setting the partial derivative with respect to $\beta_0$ to zero, and the other setting the partial derivative with respect to $\beta_1$ to zero. The point where both partial derivatives are zero minimizes the loss function.
Putting it through the algebraic machine, we get the values of $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize the loss function.
3.4 Summary: Estimate of the Regression Coefficients
We use MSE as our loss function:

$$L(\beta_0, \beta_1) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - (\beta_1 x_i + \beta_0)\right)^2$$
We choose $\hat{\beta}_0$ and $\hat{\beta}_1$ to minimize the predictive errors made by our model, i.e., to minimize our loss function.
Then the optimal values for $\hat{\beta}_0$ and $\hat{\beta}_1$ should be:

$$(\hat{\beta}_0, \hat{\beta}_1) = \underset{\beta_0, \beta_1}{\operatorname{argmin}}\; L(\beta_0, \beta_1)$$
Remember, we call this fitting or training the model.
3.5 Estimate of the Regression Coefficients: Analytical Solution
We calculate the partial derivatives and set them equal to zero. Using algebra, we find the values of $\hat{\beta}_0$ and $\hat{\beta}_1$.
Finding the exact solution only works in rare cases. Linear regression is one of these rare cases.
Take the gradient of the loss function and find the values of $\hat{\beta}_0$ and $\hat{\beta}_1$ where the gradient is zero:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

where $\bar{x}$ and $\bar{y}$ are the sample means.
The line which uses $\hat{\beta}_0$ and $\hat{\beta}_1$ is called the regression line.
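These formulas translate directly to code. A minimal sketch (the helper name is ours), checked on data generated from a known line:

```python
import numpy as np

def fit_simple_ols(x, y):
    # Closed-form least-squares estimates for y = beta0 + beta1 * x
    x_bar, y_bar = x.mean(), y.mean()
    beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 + 3.0 * x  # exact line, no noise
b0, b1 = fit_simple_ols(x, y)
print(b0, b1)  # 2.0 3.0
```

This is what sklearn's LinearRegression.fit computes for a single predictor, without the search over candidate lines.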