About Prediction Intervals

From a machine learning perspective, a prediction is a single point estimate that hides the uncertainty of that prediction. Prediction intervals provide a way to quantify and communicate that uncertainty: they describe the uncertainty for a single specific outcome.

After completing this tutorial, you will know:

  • That a prediction interval quantifies the uncertainty of a single point prediction.
  • That prediction intervals can be estimated analytically for simple models, but are more challenging for nonlinear machine learning models.
  • How to calculate the prediction interval for a simple linear regression model.

1.1 Tutorial Overview

1. Why Calculate a Prediction Interval?

2. What Is a Prediction Interval?

3. How to Calculate a Prediction Interval

4. Prediction Interval for Linear Regression

5. Worked Example

1.2 Why Calculate a Prediction Interval?

In predictive modeling, a prediction or a forecast is a single outcome value given some input variables.

# make a prediction with model
yhat = model.predict(X)

  • yhat is the estimated outcome or prediction made by the trained model for the given input data X.

The model is an approximation of the relationship between the input variables and the output variables.

1.3 What Is a Prediction Interval?

A prediction interval is a quantification of the uncertainty on a prediction. It provides probabilistic upper and lower bounds on the estimate of an outcome variable.

Prediction intervals are most commonly used when making predictions or forecasts with a regression model, where a quantity is being predicted.

The prediction interval surrounds the prediction made by the model and hopefully covers the range of the true outcome. The diagram below helps to visually understand the relationship between the prediction, prediction interval, and the actual outcome.

A prediction interval is different from a confidence interval. A confidence interval quantifies the uncertainty on an estimated population variable, such as the mean or standard deviation, whereas a prediction interval quantifies the uncertainty on a single observation estimated from the population.

In predictive modeling, a confidence interval can be used to quantify the uncertainty of the estimated skill of a model, whereas a prediction interval can be used to quantify the uncertainty of a single forecast.
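To make the distinction concrete, the following is a minimal sketch that computes both kinds of interval for the same sample. The sample itself (1,000 Gaussian values with a mean of 50 and a standard deviation of 10) is an assumption made up for illustration, and both intervals use the Gaussian critical value of 1.96 rather than an exact Student's t value.

```python
# contrast a confidence interval for the mean with a prediction interval
# for a single new observation (Gaussian approximation)
from numpy import mean
from numpy import std
from numpy import sqrt
from numpy.random import randn
from numpy.random import seed
# seed random number generator
seed(1)
# hypothetical sample: Gaussian with mean 50 and standard deviation 10
data = 10 * randn(1000) + 50
n = len(data)
m = mean(data)
# sample standard deviation (ddof=1 gives the unbiased variance estimate)
s = std(data, ddof=1)
# Gaussian critical value for a 95% interval
z = 1.96
# confidence interval half-width: uncertainty of the estimated mean
ci_half = z * s / sqrt(n)
# prediction interval half-width: uncertainty of one new observation
pi_half = z * s * sqrt(1 + 1 / n)
print('95%% confidence interval for the mean: %.3f +/- %.3f' % (m, ci_half))
print('95%% prediction interval for one value: %.3f +/- %.3f' % (m, pi_half))
```

The confidence interval shrinks as the sample grows, because the mean is estimated more precisely. The prediction interval stays roughly as wide as the spread of the data, because a single new observation remains noisy no matter how well the mean is known.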

1.4 How to Calculate a Prediction Interval

A prediction interval is calculated as some combination of the estimated variance of the model and the variance of the outcome variable. Prediction intervals are easy to describe, but difficult to calculate in practice. In simple cases like linear regression, we can estimate the prediction interval directly.

The following list summarizes some methods that can be used to estimate prediction uncertainty for nonlinear machine learning models:

  • The Delta Method, from the field of nonlinear regression.
  • The Bayesian Method, from Bayesian modeling and statistics.
  • The Mean-Variance Estimation Method, using estimated statistics.
  • The Bootstrap Method, using data resampling and developing an ensemble of models.
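As an illustration of the last item, the following is a rough sketch of one possible bootstrap approach, not a definitive implementation. It assumes a synthetic dataset, a straight-line fit via numpy.polyfit() standing in for an arbitrary model, a hypothetical new input x_new, and 1,000 bootstrap rounds. Each round refits the model on a resample of the data and adds one randomly chosen residual to the prediction, so the collected samples reflect both model variance and outcome noise.

```python
# sketch of a bootstrap prediction interval using an ensemble of refitted models
from numpy import empty
from numpy import percentile
from numpy import polyfit
from numpy import polyval
from numpy.random import randint
from numpy.random import randn
from numpy.random import seed
# seed random number generator
seed(1)
# hypothetical training data: linear relationship plus Gaussian noise
x = 20 * randn(1000) + 100
y = x + (10 * randn(1000) + 50)
# hypothetical new input to predict for
x_new = 120.0
# number of bootstrap rounds (an arbitrary choice for this sketch)
n_boot = 1000
samples = empty(n_boot)
for i in range(n_boot):
    # resample the training data with replacement
    idx = randint(0, len(x), len(x))
    # fit one ensemble member (a degree-1 polynomial stands in for any model)
    coefs = polyfit(x[idx], y[idx], 1)
    # residuals of this member on its own resample
    residuals = y[idx] - polyval(coefs, x[idx])
    # member prediction plus one randomly drawn residual as outcome noise
    samples[i] = polyval(coefs, x_new) + residuals[randint(0, len(residuals))]
# the 2.5th and 97.5th percentiles bound an approximate 95% prediction interval
lower, upper = percentile(samples, [2.5, 97.5])
print('Approximate 95%% prediction interval: %.3f to %.3f' % (lower, upper))
```

Because the interval is read from percentiles of the simulated outcomes, this approach does not assume that the errors are Gaussian, at the cost of refitting the model many times.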

1.5 Prediction Interval for Linear Regression

A linear regression is a model that calculates the output variable as a linear combination of the inputs.

\widehat{y} = b_{0} + b_{1} \times x

Where \widehat{y} (or yhat) is the prediction, b_{0} and b_{1} are coefficients of the model estimated from training data, and x is the input variable.

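The prediction interval around \widehat{y} can be calculated as follows:

\widehat{y} \pm z \times \sigma

Where \widehat{y} is the predicted value, z is the critical value from the Gaussian distribution (e.g. 1.96 for a 95% interval) and \sigma is the standard deviation of the predicted distribution.

We can calculate an unbiased estimate of the predicted standard deviation as follows:

\sigma = \sqrt{\frac{1}{N - 2} \times \sum_{i=1}^{N} (y_{i} - \widehat{y}_{i})^{2}}

Where N is the number of observations, y_{i} is the i-th observed value, and \widehat{y}_{i} is the model's prediction for that observation.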

1.6 Worked Example 

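Let's make the case of linear regression prediction intervals concrete with a worked example. First, we define a simple two-variable dataset in which the output variable (y) depends on the input variable (x) with some Gaussian noise.

# generate related variables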
from numpy import mean
from numpy import std
from numpy.random import randn
from numpy.random import seed
from matplotlib import pyplot
# seed random number generator
seed(1)
# prepare data
x = 20 * randn(1000) + 100
y = x + (10 * randn(1000) + 50)

# summarize
print('x: mean=%.3f stdv=%.3f' %(mean(x),std(x)))
print('y: mean=%.3f stdv=%.3f' % (mean(y),std(y)))

# plot
pyplot.scatter(x, y)
pyplot.show()

Running the example first prints the mean and standard deviations of the two variables.

A plot of the dataset is then created. We can see the clear linear relationship between the variables, with the spread of the points highlighting the noise or random error in the relationship.

Next, we can develop a simple linear regression that, given the input variable x, will predict the y variable. We can use the linregress() SciPy function to fit the model and return the b0 and b1 coefficients for the model.

# fit linear regression model
b1, b0, r_value, p_value, std_err = linregress(x, y)

We can use the coefficients to calculate the predicted y values, called yhat, for each of the input variables. The resulting points will form a line that represents the learned relationship.

# make prediction
yhat = b0 + b1 * x

The complete example is listed below.

# simple linear regression model
from numpy.random import randn
from numpy.random import seed
from scipy.stats import linregress
from matplotlib import pyplot
#seed random number generator
seed(1)
# prepare data
x = 20 * randn(1000) + 100
y = x + (10 * randn(1000) + 50)

# fit linear regression model
b1,b0,r_value,p_value,std_err = linregress(x,y)
print('b0=%.3f, b1=%.3f' % (b0, b1))
# make prediction
yhat = b0 + b1 * x
# plot data and predictions
pyplot.scatter(x, y)
pyplot.plot(x,yhat, color='r')
pyplot.show()

Running the example fits the model and prints the coefficients.

The coefficients are then used with the inputs from the dataset to make a prediction. The resulting inputs and predicted y-values are plotted as a line on top of the scatter plot for the dataset. We can clearly see that the model has learned the underlying relationship in the dataset.

We are now ready to make a prediction with our simple linear regression model and add a prediction interval. We will fit the model as before. This time we will take one sample from the dataset to demonstrate the prediction interval. We will use the input to make a prediction, calculate the prediction interval for the prediction, and compare the prediction and interval to the known expected value. First, let's define the input, prediction, and expected values.

# define the prediction
x_in = x[0]
y_out = y[0]
yhat_out = yhat[0]

Next, we can estimate the standard deviation in the prediction direction. We can calculate this directly using the NumPy arrays as follows:

# estimate stdev of yhat
sum_errs = arraysum((y - yhat)**2)
stdev = sqrt(1/(len(y)-2) * sum_errs)

Next, we can calculate the prediction interval for our chosen input.

We will use a 95% interval, which corresponds to the Gaussian critical value of 1.96. Once the interval is calculated, we can summarize the bounds on the prediction to the user.

# calculate prediction interval
interval = 1.96 * stdev
lower,upper = yhat_out - interval, yhat_out + interval
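The critical value of 1.96 does not have to be hard-coded. As a small sketch, it can be looked up with the percent point function of the standard Gaussian, norm.ppf(), from SciPy. The levels looped over and the critical_values dictionary are illustrative choices, not part of the example above.

```python
# look up two-sided Gaussian critical values for several interval levels
from scipy.stats import norm
critical_values = dict()
for level in [0.90, 0.95, 0.99]:
    # split the excluded probability evenly between the two tails
    alpha = 1.0 - level
    z = norm.ppf(1.0 - alpha / 2.0)
    critical_values[level] = z
    print('%.0f%% interval: z=%.3f' % (level * 100, z))
```

A wider level such as 99% gives a larger critical value and therefore a wider prediction interval for the same estimated standard deviation.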

We can tie all of this together. The complete example is listed below.

# linear regression prediction with prediction interval
from numpy.random import randn
from numpy.random import seed
from numpy import sqrt
from numpy import sum as arraysum
from scipy.stats import linregress
from matplotlib import pyplot
# seed random number generator
seed(1)
#prepare data
x = 20 * randn(1000) + 100
y = x + (10 * randn(1000) + 50)
# fit linear regression model
b1,b0,r_value,p_value,std_err = linregress(x,y)
# make predictions
yhat = b0 + b1 * x
# define new input,expected value and prediction
x_in = x[0]
y_out = y[0]
yhat_out = yhat[0]
#estimate stdev of yhat
sum_errs = arraysum((y - yhat)**2)
stdev = sqrt(1/(len(y)-2) * sum_errs)
# calculate prediction interval
interval = 1.96 * stdev
print('Prediction Interval: %.3f' % interval)
lower, upper = yhat_out - interval, yhat_out + interval
print('95%% likelihood that true value is between %.3f and %.3f' % (lower, upper))
print('True value: %.3f' % y_out)
# plot dataset and prediction with interval
pyplot.scatter(x, y)
pyplot.plot(x, yhat, color='red')
pyplot.errorbar(x_in,yhat_out, yerr=interval,color='black',fmt='o')
pyplot.show()

Running the example estimates the yhat standard deviation and then calculates the prediction interval. Once calculated, the prediction interval is presented to the user for the given input variable. Because we contrived this example, we know the true outcome, which we also display. We can see that in this case, the 95% prediction interval does cover the true expected value.
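A single covered point says little about whether the interval is well calibrated. As a rough check, the sketch below applies the same interval to every point in the dataset and reports the fraction of true values that fall inside it. The names covered and coverage are introduced here for illustration, and the check is made on the training data itself; a stricter test would use held-out data.

```python
# check how often the 95% prediction interval covers the true values
from numpy import mean
from numpy import sqrt
from numpy import sum as arraysum
from numpy.random import randn
from numpy.random import seed
from scipy.stats import linregress
# seed random number generator
seed(1)
# prepare data
x = 20 * randn(1000) + 100
y = x + (10 * randn(1000) + 50)
# fit linear regression model
b1, b0, r_value, p_value, std_err = linregress(x, y)
# make predictions
yhat = b0 + b1 * x
# estimate stdev of yhat
sum_errs = arraysum((y - yhat)**2)
stdev = sqrt(1 / (len(y) - 2) * sum_errs)
# interval half-width for a 95% interval
interval = 1.96 * stdev
# flag each observation that falls inside its own prediction interval
covered = (y >= yhat - interval) & (y <= yhat + interval)
# fraction of observations covered
coverage = mean(covered)
print('Empirical coverage: %.3f' % coverage)
```

If the Gaussian assumption is reasonable for the errors, the reported fraction should be close to 0.95.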
