A prediction from a machine learning perspective is a single point that hides the uncertainty of that prediction. Prediction intervals provide a way to quantify and communicate that uncertainty. A prediction interval describes the uncertainty for a single specific outcome.
After completing this tutorial, you will know:
- That a prediction interval quantifies the uncertainty of a single point prediction.
- That prediction intervals can be estimated analytically for simple models, but are more challenging for nonlinear machine learning models.
- How to calculate the prediction interval for a simple linear regression model.
1.1 Tutorial Overview
1. Why Calculate a Prediction Interval?
2. What Is a Prediction Interval?
3. How to Calculate a Prediction Interval
4. Prediction Interval for Linear Regression
5. Worked Example
1.2 Why Calculate a Prediction Interval?
In predictive modeling, a prediction or a forecast is a single outcome value given some input variables.
# make a prediction with model
yhat = model.predict(X)
- yhat is the estimated outcome or prediction made by the trained model for the given input data X.
The model is an approximation of the relationship between the input variables and the output variables.
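To see what a single point can hide, the sketch below fits a straight line to synthetic data and compares the one number returned by the model with the range of outcomes actually observed for similar inputs. The dataset, the use of NumPy's polyfit(), and the query value x_new are assumptions made for illustration only.

```python
# a point prediction is one number, but outcomes near the same input vary
from numpy import absolute
from numpy import polyfit
from numpy.random import randn
from numpy.random import seed

# seed random number generator
seed(1)
# synthetic data (an assumption for illustration): linear signal plus Gaussian noise
X = 20 * randn(1000) + 100
y = X + (10 * randn(1000) + 50)
# fit a straight line; polyfit returns the highest-degree coefficient first
b1, b0 = polyfit(X, y, 1)
# single point prediction for one hypothetical input
x_new = 100.0
yhat = b0 + b1 * x_new
# outcomes observed for inputs within 2 units of x_new
nearby = y[absolute(X - x_new) < 2.0]
print('prediction: %.3f' % yhat)
print('observed nearby: min=%.3f max=%.3f' % (nearby.min(), nearby.max()))
```

The prediction sits inside a fairly wide band of observed outcomes, and that band is what a prediction interval tries to describe.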
1.3 What Is a Prediction Interval?
A prediction interval is a quantification of the uncertainty on a prediction. It provides probabilistic upper and lower bounds on the estimate of an outcome variable.
Prediction intervals are most commonly used when making predictions or forecasts with a regression model, where a quantity is being predicted.
The prediction interval surrounds the prediction made by the model and hopefully covers the range of the true outcome. The diagram below helps to visually understand the relationship between the prediction, prediction interval, and the actual outcome.
A prediction interval is different from a confidence interval. A confidence interval quantifies the uncertainty on an estimated population variable, such as the mean or standard deviation, whereas a prediction interval quantifies the uncertainty on a single observation estimated from the population.
In predictive modeling, a confidence interval can be used to quantify the uncertainty of the estimated skill of a model, whereas a prediction interval can be used to quantify the uncertainty of a single forecast.
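The difference between the two intervals can be made concrete with a small sketch. It assumes a Gaussian sample and uses the standard large-sample formulas: the confidence interval for the mean shrinks with the sample size, while the prediction interval for one new observation does not. The sample itself and the variable names are made up for illustration.

```python
# compare a confidence interval for the mean with a prediction interval for one value
from numpy import mean
from numpy import sqrt
from numpy import std
from numpy.random import randn
from numpy.random import seed

# seed random number generator
seed(1)
# synthetic Gaussian sample (an assumption for illustration)
data = 5 * randn(200) + 50
n = len(data)
m = mean(data)
s = std(data, ddof=1)  # sample standard deviation
z = 1.96  # Gaussian critical value for a 95% interval
# confidence interval: uncertainty about the population mean
ci_half = z * s / sqrt(n)
# prediction interval: uncertainty about a single new observation
pi_half = z * s * sqrt(1 + 1 / n)
print('95%% confidence interval for the mean: %.3f +/- %.3f' % (m, ci_half))
print('95%% prediction interval for one value: %.3f +/- %.3f' % (m, pi_half))
```

The prediction interval is much wider because it must cover the spread of individual observations, not just the uncertainty in the estimated mean.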
1.4 How to Calculate a Prediction Interval
A prediction interval is calculated as some combination of the estimated variance of the model and the variance of the outcome variable. Prediction intervals are easy to describe, but difficult to calculate in practice. In simple cases like linear regression, we can estimate the prediction interval directly.
The following list summarizes some methods that can be used to estimate prediction uncertainty for nonlinear machine learning models:
- The Delta Method, from the field of nonlinear regression.
- The Bayesian Method, from Bayesian modeling and statistics.
- The Mean-Variance Estimation Method, using estimated statistics.
- The Bootstrap Method, using data resampling and developing an ensemble of models.
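As a rough illustration of the bootstrap idea only, the sketch below refits a model on resampled copies of the data and adds a randomly drawn residual to each prediction to represent outcome noise. A straight line fit with polyfit() stands in for a nonlinear model, and the dataset, the query input x_new, and the number of resamples are all assumptions; published bootstrap procedures differ in their details.

```python
# rough bootstrap prediction interval for a single input
from numpy import empty
from numpy import percentile
from numpy import polyfit
from numpy.random import choice
from numpy.random import randint
from numpy.random import randn
from numpy.random import seed

# seed random number generator
seed(1)
# synthetic data (an assumption for illustration)
x = 20 * randn(500) + 100
y = x + (10 * randn(500) + 50)
x_new = 120.0  # hypothetical input to predict
n_boot = 500   # number of bootstrap resamples
# residuals from a fit on the full dataset represent the outcome noise
b1, b0 = polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)
samples = empty(n_boot)
for i in range(n_boot):
    # resample rows with replacement and refit the model
    idx = randint(0, len(x), len(x))
    s1, s0 = polyfit(x[idx], y[idx], 1)
    # prediction from the refit model plus one randomly drawn residual
    samples[i] = s0 + s1 * x_new + choice(residuals)
# the 2.5th and 97.5th percentiles bound a 95% interval
lower, upper = percentile(samples, [2.5, 97.5])
print('95%% bootstrap interval at x=%.0f: %.3f to %.3f' % (x_new, lower, upper))
```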
1.5 Prediction Interval for Linear Regression
A linear regression is a model that describes the output variable as a linear combination of the inputs. For a single input variable, the model can be written as:
$$\hat{y} = b_0 + b_1 x$$
Where $\hat{y}$ (or yhat) is the prediction, $b_0$ and $b_1$ are coefficients of the model estimated from training data, and $x$ is the input variable.
The prediction interval around $\hat{y}$ can be calculated as follows:
$$\hat{y} \pm z \cdot \sigma$$
Where $\hat{y}$ is the predicted value, $z$ is the critical value from the Gaussian distribution (e.g. 1.96 for a 95% interval), and $\sigma$ is the standard deviation of the predicted distribution.
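The critical value does not need to be memorized. As a small sketch, it can be looked up with the norm.ppf() function from SciPy, which returns the Gaussian quantile for a given cumulative probability. The list of coverage levels here is chosen only for illustration.

```python
# look up Gaussian critical values for common interval widths
from scipy.stats import norm

for coverage in [0.90, 0.95, 0.99]:
    # two-sided interval: split the remaining probability between both tails
    alpha = 1.0 - coverage
    z = norm.ppf(1.0 - alpha / 2.0)
    print('%.0f%% interval: z=%.3f' % (coverage * 100, z))
```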
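We can calculate an unbiased estimate of the predicted standard deviation as follows:
$$\sigma = \sqrt{\frac{1}{N - 2} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2}$$
Where $N$ is the number of examples used to fit the model, $\hat{y}_i$ is the prediction for the i-th example, and $y_i$ is its observed outcome.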
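This estimate is based only on the residuals and treats the fitted coefficients as exact. Statistics texts commonly give a fuller form for simple linear regression that also widens the interval for inputs far from the mean of x. The sketch below compares the two on synthetic data; the fuller form is a textbook refinement that is not used elsewhere in this tutorial, and the dataset and the query input x_new are assumptions.

```python
# compare the residual-only standard deviation with the fuller textbook form
from numpy import mean
from numpy import sqrt
from numpy import sum as arraysum
from numpy.random import randn
from numpy.random import seed
from scipy.stats import linregress

# seed random number generator
seed(1)
# synthetic data (an assumption for illustration)
x = 20 * randn(1000) + 100
y = x + (10 * randn(1000) + 50)
# fit the model and make predictions
b1, b0, r_value, p_value, std_err = linregress(x, y)
yhat = b0 + b1 * x
n = len(y)
# residual-only estimate
stdev = sqrt(1 / (n - 2) * arraysum((y - yhat)**2))
# fuller form: adds terms for uncertainty in the fitted coefficients
x_new = 160.0  # hypothetical input far from the mean of x
sxx = arraysum((x - mean(x))**2)
stdev_full = stdev * sqrt(1 + 1 / n + (x_new - mean(x))**2 / sxx)
print('residual-only: %.3f, fuller form: %.3f' % (stdev, stdev_full))
```

With a large training set the two values are nearly identical, which is one reason the simpler estimate is often an acceptable approximation.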
1.6 Worked Example
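Let's make the case of linear regression prediction intervals concrete with a worked example. First, we define a simple two-variable dataset in which the output variable y depends on the input variable x with some Gaussian noise.
# generate related variables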
from numpy import mean
from numpy import std
from numpy.random import randn
from numpy.random import seed
from matplotlib import pyplot
# seed random number generator
seed(1)
# prepare data
x = 20 * randn(1000) + 100
y = x + (10 * randn(1000) + 50)
# summarize
print('x: mean=%.3f stdv=%.3f' % (mean(x), std(x)))
print('y: mean=%.3f stdv=%.3f' % (mean(y), std(y)))
# plot
pyplot.scatter(x, y)
pyplot.show()
Running the example first prints the mean and standard deviation of the two variables.
A plot of the dataset is then created. We can see the clear linear relationship between the variables, with the spread of the points highlighting the noise or random error in the relationship.
Next, we can develop a simple linear regression that given the input variable x, will predict the y variable. We can use the linregress() SciPy function to fit the model and return the b0 and b1 coefficients for the model.
# fit linear regression model
b1, b0, r_value, p_value, std_err = linregress(x, y)
We can use the coefficients to calculate the predicted y values, called yhat, for each of the input variables. The resulting points will form a line that represents the learned relationship.
# make prediction
yhat = b0 + b1 * x
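Because linregress() returns the slope before the intercept, it is easy to mix up the two when unpacking by position. As a small sketch, the result object also exposes named attributes (slope and intercept), which avoids relying on the order. The dataset is the same synthetic one used throughout this example.

```python
# access the linregress results by name rather than by position
from numpy.random import randn
from numpy.random import seed
from scipy.stats import linregress

# seed random number generator
seed(1)
# prepare data
x = 20 * randn(1000) + 100
y = x + (10 * randn(1000) + 50)
# fit linear regression model
result = linregress(x, y)
b0, b1 = result.intercept, result.slope
# make prediction
yhat = b0 + b1 * x
print('b0=%.3f, b1=%.3f' % (b0, b1))
```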
The complete example is listed below.
# simple linear regression model
from numpy.random import randn
from numpy.random import seed
from scipy.stats import linregress
from matplotlib import pyplot
#seed random number generator
seed(1)
# prepare data
x = 20 * randn(1000) + 100
y = x + (10 * randn(1000) + 50)
# fit linear regression model
b1, b0, r_value, p_value, std_err = linregress(x, y)
print('b0=%.3f, b1=%.3f' % (b0, b1))
# make prediction
yhat = b0 + b1 * x
# plot data and predictions
pyplot.scatter(x, y)
pyplot.plot(x, yhat, color='r')
pyplot.show()
Running the example fits the model and prints the coefficients.
The coefficients are then used with the inputs from the dataset to make a prediction. The resulting inputs and predicted y-values are plotted as a line on top of the scatter plot for the dataset. We can clearly see that the model has learned the underlying relationship in the dataset.
We are now ready to make a prediction with our simple linear regression model and add a prediction interval. We will fit the model as before. This time we will take one sample from the dataset to demonstrate the prediction interval. We will use the input to make a prediction, calculate the prediction interval for the prediction, and compare the prediction and interval to the known expected value. First, let’s define the input, prediction, and expected values.
# define the prediction
x_in = x[0]
y_out = y[0]
yhat_out = yhat[0]
Next, we can estimate the standard deviation in the prediction direction.
We can calculate this directly using the NumPy arrays as follows:
# estimate stdev of yhat
sum_errs = arraysum((y - yhat)**2)
stdev = sqrt(1/(len(y)-2) * sum_errs)
Next, we can calculate the prediction interval for our chosen input:
We will use a 95% interval (a significance level of 5%), which corresponds to the Gaussian critical value of 1.96. Once the interval is calculated, we can summarize the bounds on the prediction to the user.
# calculate prediction interval
interval = 1.96 * stdev
lower, upper = yhat_out - interval, yhat_out + interval
We can tie all of this together. The complete example is listed below.
# linear regression prediction with prediction interval
from numpy.random import randn
from numpy.random import seed
from numpy import sqrt
from numpy import sum as arraysum
from scipy.stats import linregress
from matplotlib import pyplot
# seed random number generator
seed(1)
# prepare data
x = 20 * randn(1000) + 100
y = x + (10 * randn(1000) + 50)
# fit linear regression model
b1, b0, r_value, p_value, std_err = linregress(x, y)
# make predictions
yhat = b0 + b1 * x
# define new input, expected value and prediction
x_in = x[0]
y_out = y[0]
yhat_out = yhat[0]
# estimate stdev of yhat
sum_errs = arraysum((y - yhat)**2)
stdev = sqrt(1/(len(y)-2) * sum_errs)
# calculate prediction interval
interval = 1.96 * stdev
print('Prediction Interval: %.3f' % interval)
# the interval is centered on the prediction, not on the known outcome
lower, upper = yhat_out - interval, yhat_out + interval
print('95%% likelihood that true value is between %.3f and %.3f' % (lower, upper))
print('True value: %.3f' % y_out)
# plot dataset and prediction with interval
pyplot.scatter(x, y)
pyplot.plot(x, yhat, color='red')
pyplot.errorbar(x_in, yhat_out, yerr=interval, color='black', fmt='o')
pyplot.show()
Running the example estimates the yhat standard deviation and then calculates the prediction interval. Once calculated, the prediction interval is presented to the user for the given input variable. Because we contrived this example, we know the true outcome, which we also display. We can see that in this case, the 95% prediction interval does cover the true expected value.
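A single covered point does not say much about whether the interval is well calibrated. As a quick sanity check, the sketch below counts how often the same interval covers the observed outcome across the whole dataset. This is an in-sample check on the data used to fit the model, so it is only a rough indication; the variable names covered and coverage are introduced here for illustration.

```python
# check how often the 95% interval covers the observed outcomes
from numpy import mean
from numpy import sqrt
from numpy import sum as arraysum
from numpy.random import randn
from numpy.random import seed
from scipy.stats import linregress

# seed random number generator
seed(1)
# prepare data
x = 20 * randn(1000) + 100
y = x + (10 * randn(1000) + 50)
# fit linear regression model and make predictions
b1, b0, r_value, p_value, std_err = linregress(x, y)
yhat = b0 + b1 * x
# estimate stdev of yhat and the interval half-width
stdev = sqrt(1 / (len(y) - 2) * arraysum((y - yhat)**2))
interval = 1.96 * stdev
# fraction of observed outcomes that fall inside their interval
covered = (y >= yhat - interval) & (y <= yhat + interval)
coverage = mean(covered)
print('Empirical coverage: %.3f' % coverage)
```

A coverage close to 0.95 suggests the Gaussian assumption behind the interval is reasonable for this dataset.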