Essential Statistics for Learning Machine Learning (Part 2)

Xurtle

Category: Machine Learning. Tags: machine learning, statistics, data

Article link: https://blog.csdn.net/xlinsist/article/details/52193402

Introduction

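In machine learning applications, we cannot do without data. Without data, a machine learning algorithm has no soul. The better we understand our data, the better we can apply it in machine learning. In this article, I introduce some important statistical concepts for understanding data, so that you can work with data more accurately and with more confidence.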

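Note: this article involves many terms and definitions. I give them in their original English form, because they are easier to understand that way and translating them tends to cause confusion.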

Populations and Parameters

A population is any large collection of objects or individuals, such as Americans, students, or trees about which information is desired.

A parameter is any summary number, like an average or percentage, that describes the entire population.

Here are two examples to illustrate populations and parameters.

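We want to know the mean weight ( $\mu$ ) of all men in China. Here the population is all men in China, and the parameter is the mean weight.
We want to know the proportion ( $p$ ) of all university students in China who smoke. Here the population is all university students in China, and the parameter is the proportion who smoke.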

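Unfortunately, we can almost never know a population parameter exactly. In the first example above, we cannot measure the weight of every man in China and then take the average. We can therefore only estimate the population parameter.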

Samples and Statistics

A sample is a representative group drawn from the population.

A statistic is any summary number, like an average or percentage, that describes the sample.

We continue with the examples above.

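This time we select only 100 representative men in China and compute their mean weight $\bar{x}$ , which we use to estimate $\mu$ .
This time we select only 100 representative university students and compute the proportion of them who smoke, $\hat{p}$ , which we use to estimate $p$ .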

The 100 university students above form a sample, and the computed $\hat{p}$ is a statistic of that sample.

Because the size of a sample is under our control, we can compute any statistic of it. We then use this sample statistic to estimate the unknown population parameter.
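The relationship between a parameter and a statistic can be seen in a small simulation. The following is only a sketch under made-up assumptions: the populations are synthetic, and the values 70 kg, 10 kg and 0.25 are invented for illustration, not real figures. In practice $\mu$ and $p$ are unknown; here they can be computed only because the whole population is simulated.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical populations; 70, 10 and 0.25 are made-up values for illustration.
weights = [rng.gauss(70.0, 10.0) for _ in range(100_000)]  # body weights in kg
smokes = [rng.random() < 0.25 for _ in range(100_000)]     # True if the student smokes

# Population parameters: computable here only because the population is simulated.
mu = statistics.mean(weights)
p = sum(smokes) / len(smokes)

# Sample statistics, each from a simple random sample of size 100.
weight_sample = rng.sample(weights, 100)
smoke_sample = rng.sample(smokes, 100)
x_bar = statistics.mean(weight_sample)
p_hat = sum(smoke_sample) / len(smoke_sample)

print(f"mu = {mu:.2f}   x_bar = {x_bar:.2f}")
print(f"p  = {p:.3f}   p_hat = {p_hat:.3f}")
```

The sample statistics should land near the parameters without matching them exactly, which is why an estimate is usually reported together with a measure of its uncertainty.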

There are two ways to make inferences about a population parameter: confidence intervals and hypothesis tests. I introduce each of them below.

t-based Confidence Interval for the Mean

We can use a t-interval to estimate the population mean $\mu$ . Its definition is as follows:

When the population standard deviation $\sigma$ is not known, an interval estimate for the population mean $\mu$ with confidence level $1 - \alpha$ is given by:

$\bar{x}\pm t_{\alpha/2, n-1}\left(\frac{s}{\sqrt{n}}\right)$
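As a rough illustration of the formula, the sketch below computes a 95% t-interval for ten weights. The data are made up, and the helper name `t_interval` is hypothetical. To stay within the standard library, the critical value $t_{\alpha/2, n-1}$ is passed in by hand: 2.262 is the usual t-table value for $\alpha = 0.05$ and $n - 1 = 9$ degrees of freedom. If SciPy is available, `scipy.stats.t.ppf(1 - alpha / 2, n - 1)` should give the same number.

```python
import math
import statistics

def t_interval(sample, t_crit):
    """Return (low, high) for x_bar +/- t_crit * s / sqrt(n)."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 in the denominator)
    margin = t_crit * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin

# Made-up weights in kg for a sample of n = 10.
weights = [68.2, 71.5, 65.0, 80.3, 74.1, 69.8, 77.6, 62.4, 70.9, 73.2]

# t-table value for alpha = 0.05 and 9 degrees of freedom (an input, not computed here).
T_CRIT_95_DF9 = 2.262

low, high = t_interval(weights, T_CRIT_95_DF9)
print(f"95% t-interval for mu: ({low:.2f}, {high:.2f})")
```

Passing the critical value as an argument keeps the function a direct transcription of the formula; a different sample size or confidence level needs a different value of `t_crit`.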