Statistical Learning Notes (1): Introduction to Supervised Learning (1)

Source texts: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, by Trevor Hastie, Robert Tibshirani and Jerome Friedman

An Introduction to Statistical Learning, by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani

Brief Introduction to Statistical Learning

Suppose that we are statistical consultants hired by a client to provide advice on how to improve sales of a particular product. The Advertising data set consists of the sales of that product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper. We want to increase the sales of the product by controlling the advertising expenditure in each of the three media. So we want to determine the association between each medium and sales.

In this setting, the advertising budgets are input variables, while sales is an output variable. The input variables are typically denoted using the symbol X, with a subscript to distinguish them. So X1 might be the TV budget, X2 the radio budget, and X3 the newspaper budget. The inputs go by different names, such as predictors, independent variables, features, or sometimes just variables. The output variable, in this case sales, is often called the response or dependent variable, and is typically denoted using the symbol Y.

Reducible and irreducible error

For a quantitative response Y and p different predictors X1, ..., Xp, we assume that there is some relationship between Y and X = (X1, X2, ..., Xp), and make predictions based on the input vector X to get the result Y:

Y = f(X) + ε

The second term ε is a random error with mean zero and is independent of X.

Interpretations on f:

Sales is a response or target that we wish to predict. We generically refer to the response as Y . TV is a feature, or input, or predictor; we name it X1. Likewise name Radio as X2, and so on.

There is an ideal f(X). In particular, what is a good value for f(X) at any selected value of X, say X = 4? There can be many Y values at X = 4. A good value is f(4) = E(Y | X = 4), where E(Y | X = 4) means the expected value (average) of Y given X = 4. This ideal f(x) = E(Y | X = x) is called the regression function.


We can understand this as follows: for the same input x, the observed responses y1(x), y2(x), ..., ym(x) still differ, even though the input is unchanged, because of the random error. The statistical average of y1(x), y2(x), ..., ym(x) approaches the known correct value f(x), which is also the expected value of Y at x. So we make the prediction

f^(x) = E(Y | X = x)

(Different components of X may have different importance.) E(Y | X = x) means the expected value (average) of Y given X = x. This ideal f(x) = E(Y | X = x) is called the regression function.
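As a concrete illustration, the regression function can be approximated from data by averaging the responses of observations near the point of interest. In this toy sketch the true model Y = 2X + ε and the window width are assumptions, not something from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate from an assumed model Y = f(X) + eps with f(x) = 2x.
x = rng.uniform(0, 8, size=5000)
y = 2 * x + rng.normal(0, 1, size=5000)

def cond_mean(x_data, y_data, x0, width=0.25):
    """Estimate E(Y | X = x0) by averaging y over observations with x near x0."""
    near = np.abs(x_data - x0) < width
    return y_data[near].mean()

est = cond_mean(x, y, 4.0)  # should land near f(4) = 8
```

With enough observations, the local average converges to the regression function value at x0.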

We can certainly make errors in prediction. The error divides into two parts: one is reducible and the other is irreducible. The reducible part comes from how well f^ estimates f and can be driven down by a better method; the irreducible part comes from the error term ε and cannot be removed, no matter how well we estimate f.

E(Y − Ŷ)² = E[f(X) + ε − f^(X)]² = [f(X) − f^(X)]² + Var(ε)

The first term is the reducible error and Var(ε) is the irreducible error.

This error is reducible because we can potentially improve the accuracy of f^ by using the most appropriate statistical learning technique to estimate f. The quantity ε may contain unmeasured variables that are useful in predicting Y: since we don't measure them, f cannot use them for its prediction.
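A small simulation (with an invented f and noise level) shows why Var(ε) is a hard floor: even an oracle predictor that uses the true f still incurs roughly Var(ε) of test MSE:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed model: Y = x**2 + eps, eps ~ N(0, 1), so Var(eps) = 1.
x = rng.uniform(-1, 1, size=100_000)
y = x**2 + rng.normal(0, 1, size=100_000)

# The "oracle" predictor knows f exactly; its MSE is all irreducible error.
oracle_mse = np.mean((y - x**2) ** 2)  # approaches Var(eps) = 1
```

Any real f^ can only do worse than this floor, and the gap is exactly the reducible part.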

Parametric and Non-parametric methods

We can apply parametric models, for example the linear model

f(X) = β0 + β1 X1 + β2 X2 + ... + βp Xp

whose unknown coefficients β0, β1, ..., βp are estimated from the training data.

But any parametric approach brings with it the possibility that the functional form used to estimate f is very different from the true f.
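For instance, a linear parametric form can be fitted by ordinary least squares. The coefficients and noise level in this sketch are made up:

```python
import numpy as np

rng = np.random.default_rng(2)

# Data from an assumed linear truth: f(X) = 1 + 3*X1 - 2*X2.
n = 500
X = rng.normal(size=(n, 2))
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, size=n)

A = np.column_stack([np.ones(n), X])          # design matrix with intercept
beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # estimates (beta0, beta1, beta2)
```

Because the assumed form matches the truth here, the estimates recover the coefficients closely; when the form is wrong, no amount of data fixes the mismatch.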

Alternatively, we can use non-parametric methods based on interpolation, such as the thin-plate spline. In order to fit a thin-plate spline, the data analyst must select a level of smoothness.


But overfitting might occur. To avoid overfitting, refer to the Resampling Methods chapter of "An Introduction to Statistical Learning". (For an introduction to thin-plate splines, see the paper "Thin-Plate Spline Interpolation" on SpringerLink and the code at http://elonen.iki.fi/code/tpsdemo/.)

Trade-offs between Prediction Accuracy and Model Interpretability in curve fitting

Several considerations when choosing models:

1. Linear models are easy to interpret, while thin-plate splines are not.

2. Good fit versus overfitting.

3. We often prefer a simpler model involving fewer variables over a black-box predictor involving them all.

Following is a representation of the tradeoff between flexibility and interpretability, using different statistical learning methods. In general, as the flexibility of a method increases, its interpretability decreases.


Methods for assessing model accuracy:

MSE:

MSE = (1/n) Σᵢ (yᵢ − f^(xᵢ))²

where i is the index of the observation. This is the training MSE, computed on the data used to fit the model; what we really care about is accuracy on the test set. In practice, we use

Ave(y0 − f^(x0))²

the average squared prediction error for the test observations (x0, y0).
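The two quantities can be computed side by side. The data-generating function sin(3x), the cubic fit, and the noise level here are assumptions of the sketch:

```python
import numpy as np

rng = np.random.default_rng(3)

def mse(y_true, y_pred):
    # (1/n) * sum_i (y_i - f_hat(x_i))**2
    return np.mean((y_true - y_pred) ** 2)

# Training data and fresh test observations (x0, y0) from the same model.
x_tr = rng.uniform(-1, 1, 50); y_tr = np.sin(3 * x_tr) + rng.normal(0, 0.2, 50)
x_te = rng.uniform(-1, 1, 50); y_te = np.sin(3 * x_te) + rng.normal(0, 0.2, 50)

coef = np.polyfit(x_tr, y_tr, deg=3)            # fit on training data only
train_mse = mse(y_tr, np.polyval(coef, x_tr))
test_mse = mse(y_te, np.polyval(coef, x_te))
```

The training MSE is optimistically biased because the same data chose the coefficients; the test MSE is the honest number.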

Choosing test data:

How can we go about trying to select a method that minimizes the test MSE? In some settings, we may have a test data set available—that is, we may have access to a set of observations that were not used to train the statistical learning method.

Evaluate model:

We choose 3 examples:

The first figure represents a comprehensive example, with high fluctuation and high noise level. The second figure has low fluctuation but high noise level, the third has high fluctuation and low noise level.

The horizontal dashed line indicates Var(ε), which corresponds to the lowest achievable test MSE among all methods. As is shown in the following graphs, the training error can be lower than this bound, but the test error cannot.


When we evaluate on fresh test data to detect overfitting, we find that the fluctuation of the true function influences performance when the fitting dimension (flexibility) is low. Comparing the low-flexibility parts of the second and third figures: as the fluctuation increases, a model with low flexibility cannot cope, and in the third figure, as the flexibility increases from low values, the mean squared error decreases; this is the advantage of high flexibility. From a second point of view, the second figure suffers from strong noise while the third suffers from weak noise. When the flexibility (dimension of the model) is very high, the second figure shows the fit following the noise closely, which is useless: it is overfitting. And when the noise is weak, as in the third figure, very high flexibility is almost useless as well. So a moderate flexibility in the middle is enough. The red curve represents the error evaluated on the fresh test set; the gray curve represents the error evaluated on the training set.
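The U-shape of the test error can be reproduced with a polynomial family of increasing flexibility. The target function sin(4x), the chosen degrees, and the noise level are all assumptions of this sketch:

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(4)

f = lambda x: np.sin(4 * x)                  # a "high-fluctuation" truth
x_tr = np.linspace(-1, 1, 40)
y_tr = f(x_tr) + rng.normal(0, 0.3, 40)      # noisy training set
x_te = rng.uniform(-1, 1, 200)
y_te = f(x_te) + rng.normal(0, 0.3, 200)     # fresh test set

test_mse = {}
for deg in (1, 5, 15):                       # low / moderate / high flexibility
    p = Polynomial.fit(x_tr, y_tr, deg)
    test_mse[deg] = np.mean((y_te - p(x_te)) ** 2)
# The rigid degree-1 fit should do worst on this wiggly truth.
```

The moderate degree typically sits near the bottom of the U, while the high degree starts paying for variance.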

Bias-Variance Trade-off

Suppose we have fit a model f^(x) to the training set. The true model is

Y = f(X) + ε

with f(x) = E(Y | X = x), and the expected test MSE at a point x0 decomposes as

E(y0 − f^(x0))² = Var(f^(x0)) + [Bias(f^(x0))]² + Var(ε)

The left side of the equality is the expected test MSE: the average test MSE that we would obtain if we repeatedly estimated f using a large number of training sets and averaged the results.

Variance: Variance refers to the amount by which f^ would change if we estimated it using a different training data set. If a method has high variance, then small changes in the training set can result in large changes in f^.

Bias:

Bias(f^(x0)) = E[f^(x0)] − f(x0)
Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model. For example, linear regression assumes that there is a linear relationship between Y and X1, X2, . . . , Xp. It is unlikely that any real-life problem truly has such a simple linear relationship, and so performing linear regression will undoubtedly result in some bias in the estimate of f.

As the flexibility (complexity of the model) increases, the variance of the estimate increases and the bias decreases. So choosing the flexibility based on average test error amounts to a bias-variance trade-off.
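This trade-off can be measured empirically by refitting the same, deliberately too simple, model on many independent training sets. The true function x², the evaluation point x0, and the sample sizes are invented for the sketch:

```python
import numpy as np

rng = np.random.default_rng(5)

f = lambda x: x**2          # assumed truth
x0, sigma = 0.8, 0.3        # evaluation point and noise level

preds = []
for _ in range(2000):       # many independent training sets
    x = rng.uniform(-1, 1, 30)
    y = f(x) + rng.normal(0, sigma, 30)
    c = np.polyfit(x, y, 1)              # low-flexibility (linear) model
    preds.append(np.polyval(c, x0))
preds = np.array(preds)

variance = preds.var()                   # Var(f_hat(x0))
bias_sq = (preds.mean() - f(x0)) ** 2    # Bias(f_hat(x0))**2
# Expected test MSE at x0 = variance + bias_sq + sigma**2
```

For this rigid linear fit the squared bias dominates the variance; a more flexible model would reverse the balance.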

Corresponding to the above 3 graphs: blue is the squared bias, orange the variance, red the test MSE, and the dashed line Var(ε).

To help us evaluate the test MSE with only training data, we can refer to cross-validation in the "Resampling Methods" chapter.

Classification

Suppose that we seek to estimate f on the basis of training observations {(x1, y1), ..., (xn, yn)}, where now y1, ..., yn are qualitative. Training error rate:

(1/n) Σᵢ I(yᵢ ≠ ŷᵢ)

the fraction of training observations that are misclassified, where ŷᵢ is the predicted class label for the ith observation and I is the indicator function.
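The training error rate is one line of code; the labels below are made up:

```python
import numpy as np

# I(y_i != y_hat_i) averaged over the training set.
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])   # observed class labels
y_hat  = np.array([0, 1, 0, 0, 1, 1, 0, 1])   # predicted class labels
error_rate = np.mean(y_true != y_hat)         # 2 of 8 wrong -> 0.25
```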


Bayes Classifier:

It is possible to show that the test error rate is minimized, on average, by a very simple classifier that assigns each observation to the most likely class, given its predictor values.

As to classification problems, consider a classifier C(X) that assigns a class label to observation X. Denote

pk(x) = Pr(Y = k | X = x), k = 1, 2, ..., K

the conditional class probabilities at x. The Bayes optimal classifier is

C(x) = j if pj(x) = max{p1(x), p2(x), ..., pK(x)}
Typically we measure the estimation performance using the test error rate

Err_Te = Ave_{i ∈ Te} I[yi ≠ Ĉ(xi)]

the fraction of misclassified observations in the test set Te.
The Bayes classifier has smallest error.

Some methods, such as SVMs, build structured models for C(x) directly; others build structured models for the conditional probabilities pk(x), e.g. logistic regression, generalized additive models, and k-nearest neighbors.
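A minimal Bayes-classifier sketch for two classes with known one-dimensional Gaussian densities and equal priors (all parameters invented); with equal variances, assigning the most likely class reduces to picking the closer class mean:

```python
import numpy as np

rng = np.random.default_rng(6)

mu0, mu1, sigma = -1.0, 1.0, 1.0   # assumed class-conditional densities

def bayes_classify(x):
    # Equal priors and equal variances: pick the closer class mean.
    return (np.abs(x - mu1) < np.abs(x - mu0)).astype(int)

# Empirical error of the Bayes rule on data drawn from the true model.
labels = rng.integers(0, 2, 100_000)
x = np.where(labels == 1, mu1, mu0) + rng.normal(0, sigma, labels.size)
bayes_error = np.mean(bayes_classify(x) != labels)
# bayes_error approaches P(N(0,1) > 1), about 0.159; no classifier can do better.
```

Real methods only estimate the pk(x), so their error rate sits at or above this Bayes error.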







