An Introduction to Statistical Learning, Chapter 2: What Is Statistical Learning?

Book: An Introduction to Statistical Learning
with Applications in R
http://www-bcf.usc.edu/~gareth/ISL/

This post covers Chapter 2, which briefly introduces some basic concepts of statistical learning.

2.1 What Is Statistical Learning?

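Suppose we observe a quantitative response Y and p different predictors X_1, X_2, …, X_p. We assume there is some relationship between Y and X = (X_1, X_2, …, X_p), which can be written as

$$Y = f(X) + \epsilon$$

where f is a fixed but unknown function of X_1, X_2, …, X_p, and $\epsilon$ is a random error term that is independent of X and has mean zero.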
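The model of a response as f(X) plus a mean-zero error can be illustrated with a small simulation. This is only a sketch: the function `f`, the noise level `noise_sd`, and the helper `simulate` are hypothetical choices made for illustration and do not come from the book.

```python
import random


def f(x):
    # Hypothetical "true" function; in practice f is unknown.
    return 2.0 + 3.0 * x


def simulate(n, noise_sd=1.0, seed=0):
    """Draw n observations from Y = f(X) + eps, with eps independent of X and mean 0."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    eps = [rng.gauss(0.0, noise_sd) for _ in range(n)]
    ys = [f(x) + e for x, e in zip(xs, eps)]
    return xs, ys, eps


xs, ys, eps = simulate(10_000)
mean_eps = sum(eps) / len(eps)
print(f"sample mean of the error term: {mean_eps:.4f}")
```

With many observations the sample mean of the error term should be close to zero, which is what makes f(X) a sensible target to estimate.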

In essence, statistical learning refers to a set of approaches for estimating f.

2.1.1 Why Estimate f?
There are two main reasons we may wish to estimate f: prediction and inference. Each is discussed below.

Prediction
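In many situations a set of inputs X is readily available, but the output Y cannot be easily obtained. Because the error term averages to zero, we can predict Y using

$$\hat{Y} = \hat{f}(X)$$

where $\hat{f}$ is our estimate of f and $\hat{Y}$ is the resulting prediction for Y. Here $\hat{f}$ is often treated as a black box: we are not concerned with its exact form, provided that it yields accurate predictions for Y.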

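The accuracy of $\hat{Y}$ as a prediction for Y depends on two quantities, called the reducible error and the irreducible error. In general $\hat{f}$ is not a perfect estimate of f, and the error this introduces is reducible, because we can potentially estimate f more accurately by using a more appropriate statistical learning technique. Even if $\hat{f}$ were a perfect estimate of f, the prediction would still have some error, because Y also contains the error term $\epsilon$. This is the irreducible error: it does not depend on f, so no estimate of f can reduce it.
Why is the irreducible error larger than zero? The quantity $\epsilon$ may contain unmeasured variables that are useful in predicting Y, and it may also contain variation that cannot be measured at all.
For a fixed estimate $\hat{f}$ and a fixed set of predictors X, the two error terms can be written as

$$E(Y - \hat{Y})^2 = E[f(X) + \epsilon - \hat{f}(X)]^2 = [f(X) - \hat{f}(X)]^2 + \mathrm{Var}(\epsilon)$$

where $[f(X) - \hat{f}(X)]^2$ is the reducible error and $\mathrm{Var}(\epsilon)$ is the irreducible error.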
The focus of this book is on techniques for estimating f with the aim of minimizing the reducible error. It is important to keep in mind that the irreducible error will always provide an upper bound on the accuracy of our prediction for Y . This bound is almost always unknown in practice.
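The split of the expected squared prediction error into a reducible part and an irreducible part can be checked numerically at a single fixed input. The sketch below assumes a hypothetical true function `f`, a hypothetical imperfect estimate `f_hat`, and Gaussian noise; none of these come from the book.

```python
import random


def f(x):
    # Hypothetical true function (unknown in practice).
    return 2.0 + 3.0 * x


def f_hat(x):
    # Hypothetical imperfect estimate of f.
    return 3.0 + 2.5 * x


x0 = 4.0          # fixed input at which we predict
noise_sd = 1.5    # standard deviation of the error term (assumed)
n = 200_000       # number of simulated responses

rng = random.Random(1)
# Simulate Y = f(x0) + eps repeatedly and average the squared prediction error.
sq_errors = [(f(x0) + rng.gauss(0.0, noise_sd) - f_hat(x0)) ** 2 for _ in range(n)]
mse = sum(sq_errors) / n

reducible = (f(x0) - f_hat(x0)) ** 2   # [f(X) - f_hat(X)]^2
irreducible = noise_sd ** 2            # Var(eps)

print(f"simulated E(Y - Y_hat)^2 : {mse:.3f}")
print(f"reducible + irreducible  : {reducible + irreducible:.3f}")
```

Improving `f_hat` can shrink only the first term; the second term stays at `noise_sd ** 2` however good the estimate becomes.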

Inference
In inference, we are interested in how X and Y are related, that is, how Y changes as the predictors change. Here $\hat{f}$ cannot be treated as a black box, because we need to know its exact form.
We may want to answer the following questions:
1) Which predictors are associated with the response? Many variables may affect the output, but identifying the few important predictors among a large set of possible variables can be extremely useful, depending on the application.
2) What is the relationship between the response and each predictor? That is, how exactly does each predictor affect the output?
3) Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated, so that a more complex model is needed?

2.1.2 How Do We Estimate f?
Broadly speaking, methods for estimating f fall into two categories: parametric and non-parametric.

Parametric methods
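Parametric methods usually involve two steps:
1) First, we make an assumption about the functional form, or shape, of f. For example, one very simple assumption is that f is linear in the inputs:

$$f(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p$$

2) Once a model has been selected, we need a procedure that uses the training data to fit, or train, the model. For the linear model, we can use (ordinary) least squares to estimate the parameters.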

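The model-based approach just described is referred to as parametric: it reduces the problem of estimating f to one of estimating a set of parameters. When the chosen model matches the data well, a parametric method is simple and effective. When the chosen model does not match the data, the resulting estimate will be poor.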
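As a minimal sketch of the two-step parametric approach, the code below assumes a linear form with a single predictor and fits it by ordinary least squares using the closed-form solution. The helper name `fit_simple_ols` and the simulated data (true intercept 1.0, true slope 2.0) are illustrative assumptions.

```python
import random


def fit_simple_ols(xs, ys):
    """Least-squares estimates (beta0, beta1) for the model y = beta0 + beta1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1


# Step 1: assume f is linear. Step 2: fit it to training data.
rng = random.Random(2)
xs = [rng.uniform(0.0, 10.0) for _ in range(500)]
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 1.0) for x in xs]  # simulated training data

beta0_hat, beta1_hat = fit_simple_ols(xs, ys)
print(f"estimated intercept: {beta0_hat:.3f}, estimated slope: {beta1_hat:.3f}")
```

Estimating f has been reduced to estimating two numbers. If the data had instead come from a strongly non-linear f, the same two numbers would describe it poorly.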

Non-parametric methods
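Non-parametric methods do not make explicit assumptions about the functional form of f. Instead they seek an estimate of f that gets as close to the data points as possible without being too rough or wiggly.
Their major advantage over parametric approaches is that, by avoiding the assumption of a particular functional form for f, they have the potential to accurately fit a wider range of possible shapes for f.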

But non-parametric approaches do suffer from a major disadvantage: since they do not reduce the problem of estimating f to a small number of parameters, a very large number of observations (far more than is typically needed for a parametric approach) is required in order to obtain an accurate estimate for f.
[Figure: an example of a non-parametric method.]
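One simple non-parametric estimator is K-nearest-neighbours averaging, used here purely as an illustration; it is not necessarily the method shown in the original figure. The sketch assumes a hypothetical sine-shaped true function and a hypothetical helper `knn_predict`.

```python
import math
import random


def knn_predict(x0, xs, ys, k=15):
    """Predict at x0 by averaging the responses of the k training points nearest to x0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k


# Simulated training data from a non-linear f (assumed here to be sin).
rng = random.Random(3)
xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(2000)]
ys = [math.sin(x) + rng.gauss(0.0, 0.2) for x in xs]

pred = knn_predict(math.pi / 2, xs, ys, k=25)
print(f"prediction at pi/2: {pred:.3f} (true f value is 1.0)")
```

No functional form is assumed for f, so the estimate can follow the curve. The cost is that it relies on having many observations near every point where a prediction is needed.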

2.1.3 The Trade-off Between Prediction Accuracy and Model Interpretability
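Model interpretability and prediction accuracy are, to some extent, negatively related: more flexible methods can capture more complex shapes for f but are generally harder to interpret, while more restrictive methods such as linear regression are easier to interpret.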

2.1.4 Supervised versus Unsupervised Learning
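In supervised learning, each observation of the inputs in the training data has a corresponding output. In unsupervised learning we have only the input data, with no corresponding output. We can still analyze the inputs themselves, for example with cluster analysis.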
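Cluster analysis can be sketched with a very small one-dimensional k-means routine. K-means is chosen here only as a familiar example of clustering; the text does not name a specific algorithm, and the function `kmeans_1d` and the simulated data are hypothetical.

```python
import random


def kmeans_1d(points, k=2, iters=20, seed=0):
    """Basic k-means (Lloyd's algorithm) on 1-D data; returns the sorted cluster centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen data points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest center.
            j = min(range(k), key=lambda c: abs(p - centers[c]))
            clusters[j].append(p)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[j] for j, c in enumerate(clusters)]
    return sorted(centers)


# Only inputs are available: two groups, with no response variable or labels.
rng = random.Random(4)
points = [rng.gauss(0.0, 1.0) for _ in range(100)] + [rng.gauss(10.0, 1.0) for _ in range(100)]

centers = kmeans_1d(points, k=2)
print(f"estimated cluster centers: {centers[0]:.2f}, {centers[1]:.2f}")
```

The routine never sees a response Y; it groups observations using the inputs alone, which is the defining feature of the unsupervised setting.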

2.1.5 Regression versus Classification Problems
Quantitative variables take on numerical values, so a quantitative response varies over a continuous range. Problems with a quantitative response are referred to as regression problems.

Qualitative variables take on values in one of K different classes, or categories. Problems involving a qualitative response are often referred to as classification problems.

