Parameter Estimation in Machine Learning

Xurtle

Category: Mathematics. Tags: machine learning, statistics

Article link: https://blog.csdn.net/xlinsist/article/details/53147199

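This article discusses parameter estimation in machine learning, covering point estimation and interval estimation. The point estimation part introduces maximum likelihood estimation and the method of moments. The interval estimation part explains how to compute several kinds of confidence intervals, such as z-intervals and t-intervals, and confidence intervals for a population mean, variance, and proportion. The article also covers the basic concepts of hypothesis testing and the possible types of error.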

Probability theory focuses on computing the probability of data arising from a parametric model with known parameters. Statistical inference flips this on its head: we will estimate the probability of parameters given a parametric model and observed data drawn from it.

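For example, suppose I have some sample data and know that the underlying distribution is exponential, but I do not know which exponential distribution. "The exponential distribution" is not one fixed distribution but a one-parameter family of distributions: each value of the parameter $\lambda$ gives a different exponential distribution. The same is true of the normal and binomial distributions, where different parameters give different distributions. Distributions like these are usually called parametric distributions or parametric models.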
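To make the idea of a one-parameter family concrete, here is a minimal Python sketch. The helper name `exponential_pdf` and the three rates are illustrative choices, not part of the original article. It evaluates the exponential density at the same point for different values of $\lambda$ and shows that each rate gives a different distribution.

```python
import math

def exponential_pdf(x: float, lam: float) -> float:
    """Density of the exponential distribution with rate lam, evaluated at x."""
    if lam <= 0:
        raise ValueError("the rate lam must be positive")
    # The density is lam * exp(-lam * x) for x >= 0 and zero elsewhere.
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

# Same point x = 1.0, three different members of the family (illustrative rates).
densities = {lam: exponential_pdf(1.0, lam) for lam in (0.5, 1.0, 2.0)}
for lam, density in densities.items():
    print(f"lambda={lam}: f(1.0)={density:.4f}")
```

Estimating $\lambda$ from data amounts to deciding which member of this family the sample most plausibly came from.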

In this article I introduce some methods that use the given data and a parametric model to estimate the following unknown population parameters:

a population mean $\mu$
the difference in two population means $\mu_1-\mu_2$
a population variance $\sigma^2$
the ratio of two population variances $\sigma_1^2/\sigma_2^2$

Point Estimation VS Interval Estimation

Wikipedia defines point estimation as follows:

In statistics, point estimation involves the use of sample data to calculate a single value which is to serve as a “best guess” or “best estimate” of an unknown population parameter. More formally, it is the application of a point estimator to the data.

Wikipedia defines interval estimation as follows:

In statistics, interval estimation is the use of sample data to calculate an interval of plausible values of an unknown population parameter; this is in contrast to point estimation, which gives a single value.

Wikipedia defines a confidence interval as follows:

In statistics, a confidence interval is a type of interval estimate (of a population parameter) that is computed from the observed data. The confidence level is the frequency (i.e., the proportion) of possible confidence intervals that contain the true value of their corresponding parameter. In other words, if confidence intervals are constructed using a given confidence level in an infinite number of independent experiments, the proportion of those intervals that contain the true value of the parameter will match the confidence level.
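The "proportion of intervals that contain the true value" reading can be checked with a small simulation. The sketch below makes several assumptions that are not in the original text: the data are normal with a known standard deviation, the interval is the standard z-interval $\bar{x} \pm 1.96\,\sigma/\sqrt{n}$ (1.96 is the approximate 97.5% standard normal quantile), and the true mean, sample size, and number of experiments are arbitrary illustrative values.

```python
import random
import statistics

def z_interval(sample, sigma, z=1.96):
    """Approximate 95% confidence interval for the mean when sigma is known."""
    center = statistics.mean(sample)
    half_width = z * sigma / len(sample) ** 0.5
    return center - half_width, center + half_width

def coverage(true_mu=10.0, sigma=2.0, n=30, experiments=2000, seed=0):
    """Fraction of simulated intervals that contain the true mean."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = 0
    for _ in range(experiments):
        sample = [rng.gauss(true_mu, sigma) for _ in range(n)]
        low, high = z_interval(sample, sigma)
        hits += int(low <= true_mu <= high)
    return hits / experiments

observed_coverage = coverage()
print(f"fraction of intervals containing the true mean: {observed_coverage:.3f}")
```

With these settings the printed fraction should land close to 0.95. Any single interval either contains the true mean or does not; the 95% describes the long-run behavior of the procedure.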

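If the definition of a confidence interval above is not entirely clear, don't worry. It will make more sense once I explain how to interpret a confidence interval. Interval estimation actually covers many methods, but in this article I only cover confidence intervals.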

Point Estimation

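Suppose we want to know the average time $\mu$ that people in China spend reading each day. We cannot ask every person in China how much time they set aside for reading, so we randomly sample some people, record their reading times, and use those data to estimate the average daily reading time of the whole population.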

There are two methods for this kind of estimation: maximum likelihood estimation and the method of moments. In this section I also introduce a way to judge whether a point estimate is a "good" one.

Before introducing these point estimation methods, I first explain what a point estimator and a point estimate are.

point estimator VS point estimate

We denote the $n$ random variables arising from a random sample as subscripted uppercase letters:

$X_1, X_2, \cdots, X_n$

The corresponding observed values of a specific random sample are then denoted as subscripted lowercase letters:

$x_1, x_2, \cdots, x_n$

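In the reading-time example above, suppose we ask 100 people. That gives 100 random variables $X_1, X_2, \cdots, X_{100}$, and the reading times they report are the observed values $x_1, x_2, \cdots, x_{100}$. You can think of this process as running 100 experiments.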

Here is the definition of a point estimator:

The function of $X_1, X_2, \cdots, X_n$ used to estimate $\theta$ is called a point estimator of $\theta$ . For example, the function: $\bar{X}=\dfrac{1}{n}\sum\limits_{i=1}^n X_i$ is a point estimator of the population mean $\mu$ ; The function: $S^2=\dfrac{1}{n-1}\sum\limits_{i=1}^n (X_i-\bar{X})^2$ is a point estimator of the population variance $\sigma^2$ .

Here is the definition of a point estimate:

The function computed from a set of data is an observed point estimate of $\theta$ . For example, if $x_i$ are the observed grade point averages of a sample of 88 students, then: $\bar{x}=\dfrac{1}{88}\sum\limits_{i=1}^{88} x_i=3.12$ is a point estimate of $\mu$ .
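The distinction can be made concrete in code: a function plays the role of the estimator, and the number it returns for one particular data set is the estimate. This is a minimal sketch; the function names and the five reading times are made-up values for illustration, not data from the article.

```python
import statistics

def sample_mean(xs):
    """Point estimator of the population mean: (1/n) * sum of the observations."""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Point estimator of the population variance, using the n - 1 denominator."""
    center = sample_mean(xs)
    return sum((x - center) ** 2 for x in xs) / (len(xs) - 1)

# Hypothetical observed reading times in hours (illustrative values only).
observed = [1.5, 0.0, 2.0, 0.5, 1.0]

x_bar = sample_mean(observed)            # a point estimate of mu
s_squared = sample_variance(observed)    # a point estimate of sigma^2
print(x_bar, s_squared)

# The standard library uses the same n - 1 definition of the sample variance.
assert abs(s_squared - statistics.variance(observed)) < 1e-12
```

Running the same functions on a different sample gives different estimates, which is why the estimator itself is treated as a random variable.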

Maximum Likelihood Estimates

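There are many ways to estimate unknown population parameters from known data. This section introduces maximum likelihood estimation, a point estimation method that answers the following question: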

For which parameter value does the observed data have the biggest probability?

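Next I apply maximum likelihood estimation to one discrete and one continuous example to make it easier to understand. Suppose I toss a coin 100 times and get 55 heads. The number of heads clearly follows a binomial distribution with parameters n and p. Since n = 100, the only unknown parameter is p. The natural question is: which value of p maximizes the probability of the observed data? We can write that probability as a function of the parameter p: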

$P(55\ \text{heads} \mid p)=\binom{100}{55}p^{55}(1-p)^{45}$

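This function is called the likelihood function. It reads as "the probability of 55 heads given p". The remaining task is to find the value of p that maximizes this probability; that is a calculus exercise, so I will not dwell on it here. This example leads to the definition of the maximum likelihood estimate: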

Given data the maximum likelihood estimate (MLE) for the parameter p is the value of p that maximizes the likelihood P(data | p). That is, the MLE is the value of p for which the data is most likely.
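The definition can be checked numerically for the coin example without any calculus. The sketch below is a brute-force grid search, which is an illustrative choice rather than the article's method: it evaluates the likelihood of 55 heads in 100 tosses on a grid of candidate values of p and keeps the candidate with the largest likelihood.

```python
import math

def likelihood(p, heads=55, tosses=100):
    """Binomial probability of observing `heads` heads in `tosses` tosses given p."""
    return math.comb(tosses, heads) * p**heads * (1 - p) ** (tosses - heads)

# Candidate values 0.001, 0.002, ..., 0.999 (the grid resolution is arbitrary).
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=likelihood)
print(f"p maximizing the likelihood on the grid: {p_hat}")
```

The grid maximizer coincides with the observed proportion of heads, 55/100, which is what the calculus argument gives.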

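Sometimes we take the logarithm of the likelihood function, which simplifies the calculation. Because the log function is monotonically increasing, the likelihood function and the log-likelihood function are maximized at the same parameter value.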
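For the coin example with 55 heads in 100 tosses, the calculus step can be written out using the log-likelihood. This is a worked sketch of the standard derivation. Taking logarithms turns the product into a sum:

$$\ln P(55\ \text{heads} \mid p)=\ln\binom{100}{55}+55\ln p+45\ln(1-p)$$

Setting the derivative with respect to p to zero gives:

$$\frac{d}{dp}\ln P(55\ \text{heads} \mid p)=\frac{55}{p}-\frac{45}{1-p}=0 \;\Longrightarrow\; 55(1-p)=45p \;\Longrightarrow\; \hat{p}=\frac{55}{100}=0.55$$

The second derivative, $-\dfrac{55}{p^2}-\dfrac{45}{(1-p)^2}$, is negative for every p in (0, 1), so this critical point is a maximum.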

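Next, a continuous example. Suppose the lifetime of a certain brand of light bulb follows an exponential distribution whose parameter $\lambda$ is unknown, so we can only estimate it from data. Suppose we test 5 bulbs of this brand and their lifetimes are 2, 3, 1, 3, 4. With both the data and the model known, we can use maximum likelihood estimation to estimate the unknown parameter $\lambda$.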
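As a quick numerical preview of where this example ends up, the sketch below maximizes the exponential log-likelihood of the five lifetimes over a grid of candidate rates. It assumes the bulbs are independent and uses the rate parameterization $f(x)=\lambda e^{-\lambda x}$. The grid search is an illustrative choice, and the closed form $n/\sum x_i$ used for comparison is the standard textbook result for this model.

```python
import math

lifetimes = [2, 3, 1, 3, 4]  # the five observed bulb lifetimes from the text

def log_likelihood(lam, data=lifetimes):
    """Log-likelihood of independent exponential observations with rate lam."""
    # The sum over i of log(lam * exp(-lam * x_i)) simplifies to
    # n * log(lam) - lam * sum(x_i).
    return len(data) * math.log(lam) - lam * sum(data)

# Candidate rates 0.0001, 0.0002, ..., 2.0 (range and step are arbitrary).
grid = [i / 10000 for i in range(1, 20001)]
lam_grid = max(grid, key=log_likelihood)

# Standard closed-form MLE for the exponential rate: n / sum(x_i) = 1 / mean.
lam_closed_form = len(lifetimes) / sum(lifetimes)

print(f"grid search: {lam_grid:.4f}, closed form: {lam_closed_form:.4f}")
```

Both values are about 0.385, that is, 5/13, the reciprocal of the sample mean lifetime 2.6.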

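Let $X_i$ denote the lifetime of the i-th bulb, and let $x_i$ be the value taken by the random variable $X_i$. Then each $X_i$ has PDF: $f_{X_i}(x_i)=\lambda e^{-\lambda x_i}$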