An Introduction to Statistical Learning, Chapter 2: What Is Statistical Learning?

O天涯海阁O

Article link: https://blog.csdn.net/zhangjunhit/article/details/78588900

Book: An Introduction to Statistical Learning with Applications in R
http://www-bcf.usc.edu/~gareth/ISL/

This is Chapter 2, which briefly introduces some basic concepts in statistical learning.

2.1 What Is Statistical Learning?

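Suppose we observe a quantitative response Y and p different predictors X_1, X_2, …, X_p. We assume there is some relationship between Y and X = (X_1, X_2, …, X_p), which can be written in the very general form
$$Y = f(X) + \epsilon$$
Here f is some fixed but unknown function of X_1, X_2, …, X_p, and ε is a random error term that is independent of X and has mean zero.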
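
As a minimal sketch of this setup, the Python snippet below simulates data from Y = f(X) + ε. The function `true_f`, the noise level, and the sample size are hypothetical choices made only for illustration; in a real problem f is unknown.

```python
import random


def true_f(x):
    """Hypothetical 'true' f, chosen only for illustration."""
    return 2.0 + 3.0 * x


def simulate(n, noise_sd=1.0, seed=0):
    """Draw n observations with y = f(x) + eps, where eps ~ N(0, noise_sd^2)."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    # eps is drawn independently of x and has mean zero
    eps = [rng.gauss(0.0, noise_sd) for _ in range(n)]
    ys = [true_f(x) + e for x, e in zip(xs, eps)]
    return xs, ys, eps


xs, ys, eps = simulate(5000)
mean_eps = sum(eps) / len(eps)
print("sample mean of eps:", round(mean_eps, 3))
```

The sample mean of the simulated errors should be close to zero, in line with the assumption that ε has mean zero.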

In essence, statistical learning refers to a set of approaches for estimating f.

2.1.1 Why Estimate f?
There are two main reasons we may wish to estimate f: prediction and inference. Each is discussed in turn below.

Prediction
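In many situations a set of inputs X is readily available, but the output Y cannot be easily obtained. Because the error term averages to zero, we can predict Y using
$$\hat{Y} = \hat{f}(X)$$
where f^ (written $\hat{f}$ in the formula) is our estimate of f and Y^ ($\hat{Y}$) is the resulting prediction for Y. In this setting f^ is often treated as a black box: we are not concerned with the exact form of f^, provided that it yields accurate predictions for Y.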

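The accuracy of Y^ as a prediction for Y depends on two quantities, called the reducible error and the irreducible error. In general f^ is not a perfect estimate of f, and this inaccuracy introduces some error. That error is reducible, because we can potentially improve the accuracy of f^ by using a more appropriate statistical learning technique. Even if f^ were a perfect estimate of f, so that Y^ = f(X), the prediction would still contain some error, because Y is also a function of ε, which by definition cannot be predicted using X. This is the irreducible error: no matter how well we estimate f, we cannot reduce the error introduced by ε.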
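Why is the irreducible error larger than zero? The quantity ε may contain unmeasured variables that are useful in predicting Y, and it may also contain variation that cannot be measured at all.
For a fixed estimate f^ and a fixed set of predictors X, the two error terms can be written as
$$E(Y - \hat{Y})^2 = E[f(X) + \epsilon - \hat{f}(X)]^2 = \underbrace{[f(X) - \hat{f}(X)]^2}_{\text{reducible}} + \underbrace{\mathrm{Var}(\epsilon)}_{\text{irreducible}}$$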
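
The decomposition can be checked numerically. The sketch below fixes a single input `x0`, an assumed true function `f`, and a deliberately imperfect estimate `f_hat` (all hypothetical, chosen for illustration), then compares a Monte Carlo estimate of E(Y - Y^)^2 with the sum of the reducible and irreducible terms.

```python
import random


def f(x):
    """Assumed true function (unknown in practice)."""
    return 2.0 + 3.0 * x


def f_hat(x):
    """A deliberately imperfect, fixed estimate of f."""
    return 2.5 + 2.8 * x


def expected_sq_error(x0, noise_sd, n_draws=100_000, seed=1):
    """Monte Carlo estimate of E(Y - Y_hat)^2 at a fixed input x0."""
    rng = random.Random(seed)
    y_hat = f_hat(x0)
    total = 0.0
    for _ in range(n_draws):
        y = f(x0) + rng.gauss(0.0, noise_sd)  # Y = f(X) + eps
        total += (y - y_hat) ** 2
    return total / n_draws


x0, noise_sd = 5.0, 1.0
reducible = (f(x0) - f_hat(x0)) ** 2  # [f(x0) - f_hat(x0)]^2
irreducible = noise_sd ** 2           # Var(eps)
mse = expected_sq_error(x0, noise_sd)
print("simulated E(Y - Y_hat)^2:", round(mse, 3))
print("reducible + irreducible: ", round(reducible + irreducible, 3))
```

Improving `f_hat` shrinks only the first term; the simulated error cannot fall below Var(ε), however good the estimate is.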

The focus of this book is on techniques for estimating f with the aim of minimizing the reducible error. It is important to keep in mind that the irreducible error will always provide an upper bound on the accuracy of our prediction for Y . This bound is almost always unknown in practice.

Inference
In inference we want to understand how Y is affected as X_1, …, X_p change, that is, the specific relationship between X and Y. Now f^ cannot be treated as a black box, because we need to know its exact form.
We may be interested in answering the following questions:
1) Which predictors are associated with the response? Many variables may influence the output. Identifying the few important predictors among a large set of possible variables can be extremely useful, depending on the application.
2) What is the relationship between the response and each predictor? That is, how exactly does each predictor affect the output?
3) Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated? In other words, is a linear model sufficient, or is a more complex model needed?

2.1.2 How Do We Estimate f?
Broadly speaking, methods for estimating f fall into two categories: parametric and non-parametric.

Parametric methods
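Parametric methods involve a two-step, model-based approach:
1) First, we make an assumption about the functional form, or shape, of f. For example, one very simple assumption is that f is linear in the inputs:
$$f(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p$$
2) After a model has been selected, we need a procedure that uses the training data to fit, or train, the model. For the linear model, the most common approach is (ordinary) least squares, which estimates the parameters β_0, β_1, …, β_p.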
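
For a single predictor (p = 1) the least squares step has a closed form. The following is a minimal sketch rather than a general implementation; the helper name `fit_simple_ols` and the simulated training data are assumptions made for illustration.

```python
import random


def fit_simple_ols(xs, ys):
    """Least squares estimates (beta0, beta1) for the model y = beta0 + beta1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = s_xy / s_xx              # slope
    beta0 = y_bar - beta1 * x_bar    # intercept
    return beta0, beta1


# Hypothetical training data generated from a linear f plus noise
rng = random.Random(42)
xs = [rng.uniform(0.0, 10.0) for _ in range(500)]
ys = [2.0 + 3.0 * x + rng.gauss(0.0, 1.0) for x in xs]

beta0, beta1 = fit_simple_ols(xs, ys)
print("estimated intercept and slope:", round(beta0, 2), round(beta1, 2))
```

Estimating f has been reduced to estimating two numbers. With more predictors the same idea applies, but the normal equations are usually solved with a linear algebra library.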

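The model-based approach just described is called parametric: it reduces the problem of estimating f to one of estimating a set of parameters. When the assumed form is close to the true f, parametric methods are simple and effective. When the chosen model is far from the true f, the resulting estimate will be poor.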

Non-parametric methods
Non-parametric methods do not make explicit assumptions about the functional form of f. Instead they seek an estimate of f that gets as close to the data points as possible without being too rough or wiggly.
Their major advantage over parametric methods is that, by avoiding the assumption of a particular functional form for f, they have the potential to accurately fit a wider range of possible shapes for f.

But non-parametric approaches do suffer from a major disadvantage: since they do not reduce the problem of estimating f to a small number of parameters, a very large number of observations (far more than is typically needed for a parametric approach) is required in order to obtain an accurate estimate for f
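In short, non-parametric methods need far more training data than parametric methods do; only with enough observations can they produce an accurate estimate of f.
[Figure: an example of a non-parametric method]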
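
One simple non-parametric estimator is k-nearest-neighbours averaging, sketched below. It is not necessarily the method illustrated in the original figure; it is used here only as a compact stand-in, and the function name `knn_regress` and the toy data are assumptions.

```python
def knn_regress(x0, xs, ys, k=3):
    """Predict at x0 by averaging the responses of the k nearest training points."""
    # Sort training pairs by distance from x0 and keep the k closest
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x0))[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Toy training data from a curved relationship (no functional form is assumed by the method)
xs = [float(i) for i in range(10)]
ys = [x ** 2 for x in xs]

print("prediction at 4.4:", knn_regress(4.4, xs, ys, k=3))
```

No parameters of a fixed functional form are estimated here: the prediction is driven entirely by nearby observations, which is why such methods need many observations to be accurate.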

2.1.3 The Trade-off Between Prediction Accuracy and Model Interpretability
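Model interpretability and prediction accuracy tend to pull in opposite directions: flexible methods that can fit f more closely are generally harder to interpret, while restrictive methods such as linear models are easier to interpret.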

2.1.4 Supervised versus Unsupervised Learning
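In supervised learning, each observation of the predictors has an associated response measurement, so inputs and outputs in the training data come in pairs. In unsupervised learning we observe only the inputs, with no associated response, and can only analyze the structure of the inputs themselves, for example with cluster analysis.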
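
To make the unsupervised case concrete, here is a minimal one-dimensional k-means sketch (Lloyd's algorithm), one common form of cluster analysis. The initialisation rule, the function name `kmeans_1d`, and the toy data are assumptions for illustration. Note that no response values are used anywhere.

```python
def kmeans_1d(points, k=2, n_iter=20):
    """Cluster 1-D points into k groups; returns (centres, labels)."""
    pts = sorted(points)
    # Assumed initialisation: evenly spaced order statistics of the data
    centres = [pts[(2 * i + 1) * len(pts) // (2 * k)] for i in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centre
            j = min(range(k), key=lambda c: abs(p - centres[c]))
            clusters[j].append(p)
        # Move each centre to the mean of its cluster (keep it if the cluster is empty)
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    labels = [min(range(k), key=lambda c: abs(p - centres[c])) for p in points]
    return centres, labels


points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]  # inputs only, no responses
centres, labels = kmeans_1d(points, k=2)
print("centres:", centres)
print("labels: ", labels)
```

A supervised method would also be given a response for each point; here the grouping is inferred from the inputs alone.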

2.1.5 Regression versus Classification Problems
Quantitative variables take on numerical values, typically on a continuous range. Problems with a quantitative response are referred to as regression problems.

Qualitative variables take on values in one of K different classes, or categories. Problems involving a qualitative response are often referred to as classification problems.
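
The distinction shows up directly in how a prediction is formed. As a hedged illustration, the sketch below uses the same nearest-neighbour idea for both kinds of response: it averages neighbours for a quantitative response and takes a majority vote for a qualitative one. The function names and the toy data are hypothetical.

```python
from collections import Counter


def k_nearest(x0, xs, ys, k):
    """Responses of the k training points closest to x0."""
    ranked = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x0))
    return [y for _, y in ranked[:k]]


def knn_regress(x0, xs, ys, k=3):
    """Quantitative response: predict with the average of the neighbours."""
    neighbours = k_nearest(x0, xs, ys, k)
    return sum(neighbours) / len(neighbours)


def knn_classify(x0, xs, labels, k=3):
    """Qualitative response: predict the most common class among the neighbours."""
    neighbours = k_nearest(x0, xs, labels, k)
    return Counter(neighbours).most_common(1)[0][0]


xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
amounts = [1.5, 2.5, 3.5, 10.5, 11.5, 12.5]            # quantitative response
classes = ["low", "low", "low", "high", "high", "high"]  # qualitative response

print("regression prediction at 2.0:     ", knn_regress(2.0, xs, amounts))
print("classification prediction at 11.0:", knn_classify(11.0, xs, classes))
```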
