GitModel: Hands-On Mathematical Statistics_01 (Python)

瘦弱书虫

Last modified: 2022-06-27 08:22:41

Category: Mathematical Modeling. Tags: python, probability theory, machine learning

First published: 2022-06-26 14:28:58

Article link: https://blog.csdn.net/qq_44341554/article/details/125469713


1 Hands-On Mathematical Statistics_01

PDF and ipynb versions on GitHub: https://github.com/cx-333/Math-Modeling

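1.1 Population and Sample

Population: the set of all possible observed values of an experiment is called the population. The number of such values may be finite or infinite, giving a finite population or an infinite population respectively. Each possible observed value is called an individual.

Since every individual in the population is an observed value of a random experiment, it is a value of some random variable $X$. A population therefore corresponds to a random variable $X$, and studying the population amounts to studying $X$: the random variable $X$ and the population share the same distribution function and numerical characteristics.

Sample: let $X$ be a random variable with distribution function $F$. If $X_{1}, X_{2}, \cdots, X_{n}$ are mutually independent random variables that all have distribution function $F$, they are called a simple random sample of size $n$ from the distribution function $F$ (or from the population $F$, or from the population $X$), or simply a sample. Their observed values $x_{1}, x_{2}, \cdots, x_{n}$ are called the sample values, or $n$ independent observations of $X$.

From the definition of a sample (its $n$ random variables are mutually independent) it follows that:

The joint distribution function of the sample $(X_{1}, X_{2}, \cdots, X_{n})$ is $F^{*}(x_{1}, x_{2}, \cdots, x_{n})=\prod_{i=1}^{n}F(x_{i})$

If $X$ has probability density $f$, the joint probability density of the sample $(X_{1}, X_{2}, \cdots, X_{n})$ is $f^{*}(x_{1}, x_{2}, \cdots, x_{n})=\prod_{i=1}^{n}f(x_{i})$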
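The product formula for the joint density can be checked numerically. The snippet below is a minimal sketch that assumes a hypothetical sample of size 5 from $N(0, 1)$; the seed, the sample size, and the variable names are illustrative choices, not part of the original notes.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical setup: a sample of size n = 5 from the population N(0, 1).
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=5)

# Independence means the joint density is the product of the marginal densities.
joint_density = np.prod(norm.pdf(sample))

# The same value computed on the log scale, which avoids underflow when n is large.
joint_density_from_logs = np.exp(np.sum(norm.logpdf(sample)))

print(joint_density, joint_density_from_logs)
```

Working with the sum of log densities is the usual choice in practice, because a product of many small densities quickly underflows.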

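1.2 Empirical Distribution Function, Histogram, and Box Plot

Empirical distribution function: let $x_{1}, x_{2}, \cdots, x_{n}$ be a sample from a population with distribution function $F (x)$. Arranging the observed values in increasing order gives $x_{(1)}, x_{(2)}, \cdots, x_{(n)}$, which is called the ordered sample. Define the following function from the ordered sample:
$F_{n}(x)=\left\{\begin{array}{ll} 0, & \text{if } x<x_{(1)}, \\ k / n, & \text{if } x_{(k)} \leqslant x<x_{(k+1)},\ k=1,2, \cdots, n-1, \\ 1, & \text{if } x \geqslant x_{(n)}. \end{array}\right.$
Then $F_{n}(x)$ is a nondecreasing, right-continuous function that satisfies
$F_{n}(-\infty)=0 \text{ and } F_{n}(\infty)=1 .$

$F_{n}(x)$ is called the empirical distribution function of the sample.

The empirical distribution function $F_{n}(x)$ is a good approximation of the population distribution function $F (x)$: by the Glivenko-Cantelli theorem, $F_{n}$ converges to $F$ uniformly with probability 1 as $n \to \infty$.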

🔥Example: observing the population $X$ at random gives a sample of size 10:
$3.2, \quad 2.5, \quad -2, \quad 2.5, \quad 0, \quad 3, \quad 2, \quad 2.5, \quad 2, \quad 4$
Find the empirical distribution function of $X$.

🦊Solution:

Sort: $-2, \quad 0, \quad 2, \quad 2, \quad 2.5, \quad 2.5, \quad 2.5, \quad 3, \quad 3.2, \quad 4$
Apply the definition of $F_{n}(x)$ with $n = 10$. At a repeated value the function jumps by the number of repeats divided by $n$, which gives:
$F_{10}(x)=\left\{\begin{array}{cc} 0, & x<-2 \\ 1 / 10, & -2 \leq x<0 \\ 2 / 10, & 0 \leq x<2 \\ 4 / 10, & 2 \leq x<2.5 \\ 7 / 10, & 2.5 \leq x<3 \\ 8 / 10, & 3 \leq x<3.2 \\ 9 / 10, & 3.2 \leq x<4 \\ 1, & x \geq 4 \end{array}\right.$
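The result can be reproduced in a few lines. This is a minimal sketch rather than a library routine: the helper name `ecdf` is hypothetical, and the ten observations are the ones implied by the jumps of $F_{10}$ above.

```python
from bisect import bisect_right

def ecdf(sample):
    """Return F_n, the empirical distribution function of the sample."""
    ordered = sorted(sample)
    n = len(ordered)

    def F_n(x):
        # bisect_right counts how many observations are <= x.
        return bisect_right(ordered, x) / n

    return F_n

sample = [3.2, 2.5, -2, 2.5, 0, 3, 2, 2.5, 2, 4]
F10 = ecdf(sample)
for point in [-3, -2, 0, 2, 2.5, 3, 3.2, 4]:
    print(point, F10(point))
```

Counting the observations that are less than or equal to $x$ handles repeated values automatically, so no special case is needed for ties.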

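Histogram: to study the distribution of a population, observed values $x_{1}, x_{2}, \cdots, x_{n}$ of a sample are obtained through independent repeated trials, then organized and presented as a table or a figure, from which the population distribution is inferred. Because the sample follows the same distribution as the population, a histogram of the sample approximates the probability density of the population. There are two kinds: the frequency (count) histogram and the relative-frequency histogram.

Steps for drawing a histogram, for a sample with $n$ values $(x_{1}, x_{2}, \cdots, x_{n})$:

1. Choose an interval $[a, b]$, with $a$ smaller than the smallest sample value and $b$ larger than the largest sample value.

2. Split the interval into $k$ subintervals of equal length $\Delta = \frac{b-a}{k}$. 💡Tip: when $n < 50$ take $k \approx 6$; when $n$ is large take $k \approx 20$. If $k$ is too large, some subintervals end up with a count of $0$, which should be avoided where possible.

3. Count how many sample values fall in each subinterval $[a+i\Delta , a+(i+1)\Delta ),\ i = 0, 1, \cdots,k-1$, which gives the counts $\{f_{j}, j = 1, 2, \cdots, k \}$ or the relative frequencies $\{ f_{j}/n, j = 1, 2, \cdots, k \}$. Half-open subintervals ensure that a value on a boundary is counted only once.

4. Use the interval $[a, b]$ as the horizontal axis and the counts $\{ f_{j}, j = 1, 2, \cdots, k \}$ or the relative frequencies $\{ f_{j}/n, j = 1, 2, \cdots, k \}$ as the vertical axis.

5. Draw a bar over each subinterval whose height is the corresponding count (or relative frequency). The result is the histogram.

With the counts $\{ f_{j}, j = 1, 2, \cdots, k\}$ on the vertical axis the figure is a frequency (count) histogram; with the relative frequencies $\{f_{j}/n, j = 1, 2, \cdots, k\}$ on the vertical axis it is a relative-frequency histogram.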

🔥Example: draw the histogram of the following sample
$\begin{aligned} &138, \quad 142, \quad 148, \quad 145, \quad 140, \quad 141 \\ &138, \quad 139, \quad 144, \quad 138, \quad 139, \quad 136 \\ &138, \quad 137, \quad 137, \quad 133, \quad 140, \quad 130\\ &145, \quad 141, \quad 135, \quad 131, \quad 136, \quad 131\\ &134, \quad 132, \quad 135, \quad 134, \quad 132, \quad 134\\ &130, \quad 135, \quad 135, \quad 134, \quad 136, \quad 131\\ &139, \quad 140, \quad 141, \quad 138, \quad 137, \quad 137\\ &131, \quad 127, \quad 136, \quad 128, \quad 138, \quad 132\\ &134, \quad 136, \quad 137, \quad 133, \quad 121, \quad 129\\ &137, \quad 132, \quad 131, \quad 139, \quad 136, \quad 135\\ \end{aligned}$

Python code (solving the example)

# Method 1: build the histogram step by step
import matplotlib.pyplot as plt
# Render figures inline (Jupyter only)
%matplotlib inline  
plt.rcParams['font.sans-serif']=['SimHei','Songti SC','STFangsong']
plt.rcParams['axes.unicode_minus'] = False  # render minus signs correctly
import numpy as np
import warnings
warnings.filterwarnings("ignore")

# Sample values
x = [138, 142, 148, 145, 140, 141,
    138, 139, 144, 138, 139, 136,
    138, 137, 137, 133, 140, 130,
    145, 141, 135, 131, 136, 131,
    134, 132, 135, 134, 132, 134,
    130, 135, 135, 134, 136, 131,
    139, 140, 141, 138, 137, 137,
    131, 127, 136, 128, 138, 132,
    134, 136, 137, 133, 121, 129,
    137, 132, 131, 139, 136, 135]

# 1. Choose the interval [a, b]
a = np.min(x) - 1
b = np.max(x) + 1

# 2. Split [a, b] into k subintervals
n = len(x)
if n < 50:
    k = 6
elif n < 100:
    k = 8
else:
    k = 15

delta = (b - a) / k

# 3. Count the sample values in each subinterval
region_ab = np.zeros(k)   # midpoint of each subinterval of [a, b]
fi = np.zeros(k)          # count of sample values in each subinterval
for i in range(k):
    region_ab[i] = a + i*delta + (delta / 2)

for idx, cen in enumerate(region_ab):
    for data in x:
        # Half-open subinterval [left, right), so a boundary value is counted only once
        if (cen - delta/2) <= data < (cen + delta/2):
            fi[idx] += 1

fi_n = fi / n     # relative frequencies
# 4. Plot

# plt.figure(figsize=(10, 8))
plt.bar(region_ab, fi, width=delta)   # frequency (count) histogram
plt.title('Frequency (count) histogram')
plt.xlabel('x')
plt.ylabel('fi')
plt.show()
# plt.figure(figsize=(10, 8))
plt.bar(region_ab, fi_n, width=delta)  # relative-frequency histogram
plt.title('Relative-frequency histogram')
plt.xlabel('x')
plt.ylabel('fi/n')
plt.show()

[Figure: frequency (count) histogram of the sample]

[Figure: relative-frequency histogram of the sample]

# Method 2: draw the histogram directly with matplotlib.pyplot's hist
import matplotlib.pyplot as plt
# Render figures inline (Jupyter only)
%matplotlib inline  
plt.rcParams['font.sans-serif']=['SimHei','Songti SC','STFangsong']
plt.rcParams['axes.unicode_minus'] = False  # render minus signs correctly
import numpy as np
import warnings
warnings.filterwarnings("ignore")
# Sample values
x = [138, 142, 148, 145, 140, 141,
    138, 139, 144, 138, 139, 136,
    138, 137, 137, 133, 140, 130,
    145, 141, 135, 131, 136, 131,
    134, 132, 135, 134, 132, 134,
    130, 135, 135, 134, 136, 131,
    139, 140, 141, 138, 137, 137,
    131, 127, 136, 128, 138, 132,
    134, 136, 137, 133, 121, 129,
    137, 132, 131, 139, 136, 135]
    
a = np.min(x) - 1
b = np.max(x) + 1
k = 8
# plt.figure(figsize=(10, 8))
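plt.hist(x, bins=k, alpha=0.8, range=(a, b))  # default: bar heights are the counts fi
plt.title('Frequency (count) histogram')
plt.xlabel('x')
plt.ylabel('fi')
plt.show()
# plt.figure(figsize=(10, 8))
# Weight every observation by 1/n so that bar heights are the relative frequencies fi/n.
# density=True would instead give fi/(n*delta), which makes the total bar area equal to 1.
plt.hist(x, bins=k, alpha=0.8, range=(a, b), weights=np.ones(len(x)) / len(x))
plt.title('Relative-frequency histogram')
plt.xlabel('x')
plt.ylabel('fi/n')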
plt.show()

[Figure: frequency (count) histogram drawn with hist]

[Figure: relative-frequency histogram drawn with hist]
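The bar heights can also be computed without plotting, which makes the difference between relative frequency and density explicit. The sketch below uses `np.histogram` on the same 60 values with the same $[a, b]$ and $k = 8$. Note that `density=True` returns $f_{j}/(n\Delta)$, so that the bar areas sum to 1, rather than the relative frequency $f_{j}/n$.

```python
import numpy as np

x = [138, 142, 148, 145, 140, 141, 138, 139, 144, 138, 139, 136,
     138, 137, 137, 133, 140, 130, 145, 141, 135, 131, 136, 131,
     134, 132, 135, 134, 132, 134, 130, 135, 135, 134, 136, 131,
     139, 140, 141, 138, 137, 137, 131, 127, 136, 128, 138, 132,
     134, 136, 137, 133, 121, 129, 137, 132, 131, 139, 136, 135]

a, b, k = min(x) - 1, max(x) + 1, 8
delta = (b - a) / k

# Counts f_j per subinterval; every bin is half-open except the last one.
counts, edges = np.histogram(x, bins=k, range=(a, b))

rel_freq = counts / len(x)   # heights of the relative-frequency histogram
density = rel_freq / delta   # heights produced by density=True

density_np, _ = np.histogram(x, bins=k, range=(a, b), density=True)
print(counts)
print(rel_freq.sum(), (density * delta).sum())
```

Both sums printed on the last line equal 1 up to rounding: the relative frequencies add up to 1, and so do the areas of the density bars.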

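Box plot

First, sample quantiles. Given the observed values $x_{1}, x_{2}, \cdots, x_{n}$ of a sample of size $n$, the sample $p$-quantile $(0 < p < 1)$, written $x_{p}$, has the following properties: (1) at least $n p$ observations are less than or equal to $x_{p}$; (2) at least $n (1 - p)$ observations are greater than or equal to $x_{p}$.

Steps for computing a sample quantile:

1. Arrange $x_{1}, x_{2}, \cdots, x_{n}$ in increasing order as $x_{(1)}\le x_{(2)}\le \cdots\le x_{(n)}$.

2. Compute $x_{p}$ from the formula $x_{p}=\left \{ \begin{aligned} &x_{([np]+1)}, &\text{if } np \text{ is not an integer}\\&\frac{1}{2}[x_{(np)}+x_{(np+1)}], &\text{if } np \text{ is an integer} \end{aligned}\right.$ where $[\cdot]$ denotes the integer part (rounding down).

In particular, for $p = 0.25$ the quantile $x_{0.25}$ is also written $Q_{1}$ and called the first quartile; for $p = 0.5$ the quantile $x_{0.5}$ is also written $Q_{2}$ or $M$ and called the sample median; for $p = 0.75$ the quantile $x_{0.75}$ is also written $Q_{3}$ and called the third quartile.

Drawing a box plot: a box plot summarizes the data with $5$ numbers, namely the minimum $Min$, the first quartile $Q_{1}$, the median $M$, the third quartile $Q_{3}$, and the maximum $Max$. A box is drawn from $Q_{1}$ to $Q_{3}$ with a line inside it at $M$, and whiskers extend from the box to $Min$ and $Max$.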

🔥Example: the following are blood-pressure readings (systolic, $mmHg$) of $8$ patients. Draw the box plot.
$110 \quad 102 \quad 117 \quad 122 \quad 118 \quad 150 \quad 132 \quad 123$

🦊Solution:

Sort:
$102 \quad 110 \quad 117 \quad 118 \quad 122 \quad 123 \quad 132 \quad 150$
Compute the quartiles, the minimum, and the maximum:
$\begin{aligned} &\because np=8\times 0.25 = 2, \quad &\therefore Q_{1}=\frac{1}{2}(110+117)=113.5 \\ &\because np=8\times 0.5 = 4, \quad &\therefore Q_{2}=\frac{1}{2}(118+122)=120 \\ &\because np=8\times 0.75 = 6, \quad &\therefore Q_{3}=\frac{1}{2}(123+132)=127.5 \\ & Min = 102, Max = 150. \end{aligned}$
Draw the plot:

Python code (drawing the box plot)

import matplotlib.pyplot as plt 
%matplotlib inline 
plt.rcParams['font.sans-serif']=['SimHei','Songti SC','STFangsong']
plt.rcParams['axes.unicode_minus'] = False  # render minus signs correctly

x = [102, 110, 117, 118, 122, 123, 132, 150]

# boxplot flags outliers automatically: points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR, where IQR = Q3 - Q1
# Set the size on the same figure; a separate plt.figure() call would create an extra empty figure
fig, ax = plt.subplots(figsize=(6, 4))
ax.boxplot(x)
plt.show()

[Figure: box plot of the blood-pressure data]
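The hand calculation of the quartiles can be checked with a direct implementation of the textbook rule. This is a minimal sketch; the function name `sample_quantile` is hypothetical and only the standard library is used.

```python
import math

def sample_quantile(data, p):
    """Sample p-quantile by the textbook rule, for 0 < p < 1."""
    ordered = sorted(data)
    n = len(ordered)
    np_ = n * p
    k = round(np_)
    if math.isclose(np_, k):
        # np is an integer: average x_(np) and x_(np+1) (1-based order statistics).
        return (ordered[k - 1] + ordered[k]) / 2
    # np is not an integer: take x_([np]+1), which is 0-based index floor(np).
    return ordered[math.floor(np_)]

pressure = [102, 110, 117, 118, 122, 123, 132, 150]
q1, med, q3 = (sample_quantile(pressure, p) for p in (0.25, 0.5, 0.75))
print(q1, med, q3)  # 113.5 120.0 127.5
```

Quantile conventions differ between tools. For these data NumPy's default linear interpolation (`np.percentile`) gives 115.25 and 125.25 for the first and third quartiles, so the box edges drawn by a plotting library may not coincide exactly with the values computed by hand.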


1.3 Statistics and the Three Major Sampling Distributions

Statistic: let $X_{1}, X_{2}, \cdots, X_{n}$ be a sample from the population $X$, and let $g(X_{1}, X_{2}, \cdots, X_{n})$ be a function of $X_{1}, X_{2}, \cdots, X_{n}$. If $g$ contains no unknown parameters, then $g(X_{1}, X_{2}, \cdots, X_{n})$ is called a statistic.

Common statistics. Let $X_{1}, X_{2}, \cdots, X_{n}$ be a sample from the population $X$ and let $x_{1}, x_{2}, \cdots, x_{n}$ be the observed values of this sample.

Sample mean: $\overline{X} = \frac{1} {n} \sum_{i=1}^{n}X_{i}$, with observed value $\overline{x} = \frac{1} {n} \sum_{i=1}^{n}x_{i}$

Sample variance: $\begin{aligned} &1)\ S_{n}^{2} = \frac{1} {n} \sum_{i=1}^{n}(X_{i} - \overline{X})^{2} \\ &2)\ S^{2} = \frac{1} {n-1} \sum_{i=1}^{n}(X_{i} - \overline{X})^{2} \text{ (unbiased, used more often)}\end{aligned}$ with observed values $s_{n}^{2} = \frac{1} {n} \sum_{i=1}^{n}(x_{i} - \overline{x})^{2}$ and $s^{2} = \frac{1} {n-1} \sum_{i=1}^{n}(x_{i} - \overline{x})^{2}$

Sample standard deviation: $S = \sqrt{S^{2}} = \sqrt{\frac{1} {n-1} \sum_{i=1}^{n}(X_{i} - \overline{X})^{2}}$, with observed value $s = \sqrt{\frac{1} {n-1} \sum_{i=1}^{n}(x_{i} - \overline{x})^{2}}$

Sample $k$-th (raw) moment: $A_{k} = \frac{1}{n}\sum_{i=1}^{n}X_{i}^{k}, k =1, 2, \cdots$, with observed value $a_{k} = \frac{1}{n}\sum_{i=1}^{n}x_{i}^{k}, k =1, 2, \cdots$

Sample $k$-th central moment: $B_{k} = \frac{1}{n}\sum_{i=1}^{n}(X_{i} - \overline{X})^{k}, k =1, 2, \cdots$, with observed value $b_{k} = \frac{1}{n}\sum_{i=1}^{n}(x_{i} - \overline{x})^{k}, k =1, 2, \cdots$
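These statistics map directly onto NumPy. The sketch below computes each one from its definition for an illustrative data set (the eight blood-pressure readings from the box-plot example); the helper names `raw_moment` and `central_moment` are hypothetical.

```python
import numpy as np

x = np.array([102, 110, 117, 118, 122, 123, 132, 150], dtype=float)
n = x.size

mean = x.sum() / n
var_n = np.sum((x - mean) ** 2) / n               # s_n^2, divides by n
var_unbiased = np.sum((x - mean) ** 2) / (n - 1)  # s^2, divides by n - 1
std = np.sqrt(var_unbiased)                       # sample standard deviation s

def raw_moment(values, k):
    """Observed sample k-th raw moment a_k."""
    return np.mean(values ** k)

def central_moment(values, k):
    """Observed sample k-th central moment b_k."""
    return np.mean((values - values.mean()) ** k)

print(mean, var_n, var_unbiased, std)
print(raw_moment(x, 2), central_moment(x, 2))
```

In NumPy the two variances correspond to `np.var(x)` (the default `ddof=0`) and `np.var(x, ddof=1)`. The second central moment $b_{2}$ equals $s_{n}^{2}$, not $s^{2}$.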

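The three major sampling distributions

(1) $\chi ^{2}$ distribution: let $X_{1}, X_{2}, \cdots, X_{n}$ be a sample from the population $N (0, 1)$. Then the statistic
$\chi ^{2} = X_{1}^{2} + X_{2}^{2} + \cdots + X_{n}^{2}$
is said to follow the $\chi ^{2}$ distribution with $n$ degrees of freedom, written $\chi ^{2} \sim \chi ^{2}(n)$. The degrees of freedom are the number of independent variables on the right-hand side of the expression above.

The probability density function of the $\chi ^{2}(n)$ distribution (no need to memorize it) is
$f(y) = \left \{ \begin{aligned} & \frac{1}{2^{n/2}\Gamma (n/2)}y^{n/2-1}e^{-y/2}, &y>0 \\ & 0, & \text{otherwise} \end{aligned} \right.$

Python code (plot of the $\chi ^{2}$ density)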

import matplotlib.pyplot as plt
%matplotlib inline
from scipy.stats import chi2
import numpy as np

fig, ax = plt.subplots(1, 1)
x = np.linspace(0.01, 30, 10000)
ax.plot(x, chi2.pdf(x, df=2), '-', label='n = 2')
ax.plot(x, chi2.pdf(x, 4), '--', label='n = 4')
ax.plot(x, chi2.pdf(x, df=10), '-.', label='n = 10')
ax.set_ylim([0, 0.5])
ax.set_xlabel("y")
ax.set_ylabel("f(y)")
ax.legend()
plt.show()

[Figure: $\chi ^{2}$ densities for n = 2, 4, 10]

# Plot the chi-square distribution from its definition (a sum of n squared standard normals)
import matplotlib.pyplot as plt
%matplotlib inline
from scipy.stats import norm, chi2
import numpy as np

def demonstate_chi(n):
    x = 0
    for i in range(n):
        x += np.square(norm(loc=0, scale=1).rvs(size=10000))
    
    return x

x = np.linspace(0.01, 30, 10000)

n_2 = demonstate_chi(2)
n_4 = demonstate_chi(4)
n_10 = demonstate_chi(10)

plt.figure(figsize=(10, 5))
plt.subplot(1,3, 1)
plt.plot(x, chi2.pdf(x, 2), '-', label='n = 2', c='blue')
plt.hist(n_2, density=True, histtype='stepfilled', alpha=0.5)
plt.legend()
plt.subplot(1,3, 2)
plt.plot(x, chi2.pdf(x, df = 4), '--', label='n = 4', c='gray')
plt.hist(n_4, density=True, histtype='stepfilled', alpha=0.5)
plt.legend()
plt.subplot(1,3, 3)
plt.plot(x, chi2.pdf(x, 10), '-.', label='n = 10', c='red')
plt.hist(n_10, density=True, histtype='stepfilled', alpha=0.5)
plt.legend()
plt.tight_layout(w_pad=3)
plt.show()

[Figure: simulated histograms against the $\chi ^{2}$ densities for n = 2, 4, 10]

Properties of the $\chi ^{2}$ distribution

Additivity: let $\chi_{1}^{2} \sim \chi ^{2}(n1), \chi_{2}^{2} \sim \chi ^{2}(n2)$, with $\chi_{1}^{2}, \chi_{2}^{2}$ independent. Then $\chi_{1}^{2} + \chi_{2}^{2} \sim \chi ^{2} (n1 + n2)$

Expectation and variance: if $\chi ^{2} \sim \chi ^{2}(n)$, then $E(\chi ^{2}) = n, D(\chi ^{2}) = 2n$. Proof: $\begin{aligned} &\chi ^{2} = X_{1}^{2} + X_{2}^{2} + \cdots + X_{n}^{2}, X_{i} \sim N(0, 1) \\ & \text{so } E(X_{i})=0,\ E(X_{i}^{2}) = D(X_{i}) + E^{2}(X_{i}) = 1 \\ & E(\chi ^{2}) = \sum_{i=1}^{n}E(X_{i}^{2}) = n \\ &D(X_{i}^{2}) = E(X_{i}^{4}) - E^{2}(X_{i}^{2}) = 3 - 1 = 2 \\ & D(\chi ^{2}) = \sum_{i=1}^{n}D(X_{i}^{2}) = 2n \text{ (by independence)}\end{aligned}$

Quantiles of the $\chi ^{2}$ distribution: for a given number $\alpha, 0 <\alpha <1$, the point $\chi_{\alpha} ^{2}(n)$ that satisfies $P\{\chi^{2} > \chi_{\alpha} ^{2}(n)\} = \int_{\chi_{\alpha} ^{2}(n)}^{\infty}f(y)dy = \alpha$ is called the upper $\alpha$ quantile of the $\chi ^{2}(n)$ distribution.
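Upper quantiles no longer need a printed table. The sketch below assumes $n = 10$ and $\alpha = 0.05$ (illustrative values) and uses `scipy.stats.chi2`: `isf` is the inverse of the survival function $P\{\chi^{2} > y\}$, so it returns the upper $\alpha$ quantile directly.

```python
from scipy.stats import chi2

n, alpha = 10, 0.05

# Upper alpha quantile: the point with right-tail probability alpha.
upper_point = chi2.isf(alpha, n)
print(upper_point)               # about 18.307

# The survival function recovers alpha at that point.
print(chi2.sf(upper_point, n))

# Equivalent lower-tail form: the (1 - alpha) quantile of the CDF.
print(chi2.ppf(1 - alpha, n))

# Expectation n and variance 2n.
print(chi2.mean(n), chi2.var(n))
```

Statistical software usually reports lower-tail quantiles, so the upper $\alpha$ quantile used here corresponds to `ppf(1 - alpha, n)`.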

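(2) $t$ distribution: let $X \sim N(0, 1), Y \sim \chi^{2}(n)$, with $X, Y$ independent. Then the random variable
$t = \frac{X}{\sqrt{Y/n}}$
is said to follow the $t$ distribution with $n$ degrees of freedom, written $t \sim t(n)$.

The probability density function of the $t(n)$ distribution is:
$h(t) = \frac{\Gamma [(n+1)/2]}{\sqrt{\pi n} \Gamma (n/2)}(1 + \frac{t^{2}}{n})^{-(n+1)/2}, -\infty < t < \infty$

Python code (plot of the $t$ density)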

import matplotlib.pyplot as plt
%matplotlib inline
from scipy.stats import t
import numpy as np

fig, ax = plt.subplots(1, 1)
x = np.linspace(-10, 10, 10000)
ax.plot(x, t.pdf(x, df=2), '-', label='n = 2', c='blue')
ax.plot(x, t.pdf(x, 9), '--', label='n = 9', c='gray')
ax.plot(x, t.pdf(x, df=10000), '-.', label='n = 10000', c='red')
ax.set_xlabel("t")
ax.set_ylabel("h(t)")
ax.legend()
plt.show()

[Figure: $t$ densities for n = 2, 9, 10000]

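# Plot the t distribution from its definition: t = X / sqrt(Y / n)
import matplotlib.pyplot as plt
%matplotlib inline
from scipy.stats import norm, chi2, t
import numpy as np

def demonstate_t(n):
    # Every simulated value needs its own independent X ~ N(0, 1) and Y ~ chi2(n),
    # so both are drawn with size=10000
    x = norm(loc=0, scale=1).rvs(size=10000)
    y = chi2.rvs(df=n, size=10000)
    samples = x / np.sqrt(y / n)
    
    return samples

x = np.linspace(-10, 10, 10000)

n_2 = demonstate_t(2)
n_9 = demonstate_t(9)
n_10000 = demonstate_t(10000)

plt.figure(figsize=(10, 5))
plt.subplot(1,3, 1)
plt.plot(x, t.pdf(x, 2), '-', label='n = 2', c='blue')
# t(2) has heavy tails, so the histogram is restricted to the plotted range
plt.hist(n_2, bins=50, range=(-10, 10), density=True, histtype='stepfilled', alpha=0.5)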
plt.legend()
plt.subplot(1,3, 2)
plt.plot(x, t.pdf(x, df = 9), '--', label='n = 9', c='gray')
plt.hist(n_9, density=True, histtype='stepfilled', alpha=0.5)
plt.legend()
plt.subplot(1,3, 3)
plt.plot(x, t.pdf(x, 10000), '-.', label='n = 10000', c='red')
plt.hist(n_10000, density=True, histtype='stepfilled', alpha=0.5)
plt.legend(loc="upper right")
plt.tight_layout(w_pad=3)
plt.show()

[Figure: simulated histograms against the $t$ densities for n = 2, 9, 10000]

As $n \rightarrow \infty$, the $t(n)$ distribution approaches the $N (0, 1)$ distribution.

Quantiles of the $t$ distribution: for a given number $\alpha, 0 <\alpha <1$, the point $t_{\alpha}(n)$ that satisfies $P\{t > t_{\alpha}(n)\} = \int_{t_{\alpha}(n)}^{\infty}h(t)dt = \alpha$ is called the upper $\alpha$ quantile of the $t (n)$ distribution.
The graph of $h (t)$ is symmetric about $t = 0$, hence $t_{1 - \alpha}(n) = -t_{\alpha}(n)$
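Both facts, the symmetry of the quantiles and the normal limit, can be checked numerically. The sketch below assumes $\alpha = 0.05$ and the degrees of freedom used in the plots above; `isf` returns the upper quantile.

```python
from scipy.stats import norm, t

n, alpha = 9, 0.05

t_alpha = t.isf(alpha, n)                # upper alpha quantile t_alpha(n)
t_one_minus_alpha = t.isf(1 - alpha, n)  # upper (1 - alpha) quantile

# Symmetry: t_{1 - alpha}(n) = -t_alpha(n)
print(t_alpha, t_one_minus_alpha)

# As n grows, the t quantile approaches the standard normal quantile.
for df in (2, 9, 10000):
    print(df, t.isf(alpha, df), norm.isf(alpha))
```

For small $n$ the $t$ quantile is noticeably larger than the normal one, which is why the normal approximation is usually reserved for large samples.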

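(3) $F$ distribution: let $U \sim \chi ^{2}(n1), V \sim \chi ^{2}(n2)$, with $U, V$ independent. Then the random variable
$F = \frac{U/n1}{V/n2}$
is said to follow the $F$ distribution with $(n1, n2)$ degrees of freedom, written $F \sim F(n1, n2)$.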

The probability density function of the $F(n1, n2)$ distribution is:
$\psi (y) = \left \{ \begin{aligned} & \frac{\Gamma [(n1 +n2)/2](n1/n2)^{n1/2}y^{(n1/2)-1}}{\Gamma (n1/2)\Gamma (n2/2)[1+(n1y/n2)]^{(n1+n2)/2}}, &y>0 \\ & 0, &\text{otherwise} \end{aligned} \right.$

Python code (plot of the $F$ density)

import matplotlib.pyplot as plt
%matplotlib inline
from scipy.stats import f
import numpy as np

fig, ax = plt.subplots(1, 1)
x = np.linspace(0.01, 10, 10000)
ax.plot(x, f.pdf(x, dfn=10, dfd=40), '-', label='F~(10, 40)', c='blue')
ax.plot(x, f.pdf(x, dfn=40, dfd=10), '--', label='F~(40, 10)', c='orange')
ax.plot(x, f.pdf(x, dfn=11, dfd=3), '-.', label='F~(11, 3)', c='red')
ax.set_xlabel("y")
ax.set_ylabel("f(y)")
ax.legend()
plt.show()

[Figure: $F$ densities for (10, 40), (40, 10), and (11, 3) degrees of freedom]

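# Plot the F distribution from its definition: F = (U/n1) / (V/n2)
import matplotlib.pyplot as plt
%matplotlib inline
from scipy.stats import chi2, f   # f is needed for f.pdf in the plots below
import numpy as np

def demonstate_f(n1, n2):
    # Independent U ~ chi2(n1) and V ~ chi2(n2)
    u = chi2.rvs(df=n1, size=10000)
    v = chi2.rvs(df=n2, size=10000)
    samples = (u/n1) / (v/n2)
    
    return samples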

x = np.linspace(0.01, 10, 10000)

n_10_40 = demonstate_f(10, 40)
n_40_10 = demonstate_f(40 ,10)
n_11_3 = demonstate_f(11, 3)

plt.figure(figsize=(10, 5))
plt.subplot(1,3, 1)
plt.plot(x, f.pdf(x, dfn=10, dfd=40), '-', label='F~(10, 40)', c='blue')
plt.hist(n_10_40, bins=300, density=True, histtype='stepfilled', alpha=0.5)
plt.legend()
plt.subplot(1,3, 2)
plt.plot(x, f.pdf(x, dfn=40, dfd=10), '--', label='F~(40, 10)', c='orange')
plt.hist(n_40_10, bins=300, density=True, histtype='stepfilled', alpha=0.5)
plt.legend()
plt.subplot(1,3, 3)
plt.plot(x, f.pdf(x, dfn=11, dfd=3), '-.', label='F~(11, 3)', c='red')
plt.hist(n_11_3,bins=550, density=True, histtype='stepfilled', alpha=0.5)
plt.xlim([0, 10])
plt.legend(loc="upper right")
plt.tight_layout(w_pad=3)
plt.show()

[Figure: simulated histograms against the $F$ densities for (10, 40), (40, 10), and (11, 3) degrees of freedom]

Quantiles of the $F$ distribution: for a given number $\alpha, 0 <\alpha <1$, the point $F_{\alpha}(n1, n2)$ that satisfies $P\{F > F_{\alpha}(n1, n2)\} = \int_{F_{\alpha}(n1, n2)}^{\infty}\psi (y)dy = \alpha$ is called the upper $\alpha$ quantile of the $F (n1, n2)$ distribution.
If $F \sim F(n1, n2)$, then $\frac{1}{F} \sim F(n2, n1)$, and therefore
$F_{1-\alpha}(n1, n2) =\frac{1} {F_{\alpha}(n2, n1)}$.
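The reciprocal relation is what allows a table of upper quantiles to be used for lower ones. A minimal numerical check, assuming $(n1, n2) = (10, 40)$ and $\alpha = 0.05$ as illustrative values:

```python
from scipy.stats import f

n1, n2, alpha = 10, 40, 0.05

# Left-hand side: upper (1 - alpha) quantile of F(n1, n2).
lhs = f.isf(1 - alpha, n1, n2)

# Right-hand side: reciprocal of the upper alpha quantile of F(n2, n1).
rhs = 1 / f.isf(alpha, n2, n1)

print(lhs, rhs)
```

Note that the two degrees of freedom swap places on the right-hand side.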

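📕Important theorems: distributions of the sample mean and sample variance of a normal population

Theorem 1: let $X_{1}, X_{2}, \cdots, X_{n}$ be a sample from the normal population $N(\mu, \sigma^{2})$ and let $\overline{X}$ be the sample mean. Then
$\overline{X} \sim N(\mu, \sigma^{2}/n).$

Theorem 2: let $X_{1}, X_{2}, \cdots, X_{n}$ be a sample from the normal population $N(\mu, \sigma^{2})$, and let $\overline{X}$ and $S^{2}$ be the sample mean and the sample variance. Then
$\begin{aligned} & 1. \frac{(n-1)S^{2}}{\sigma^{2}} \sim \chi^{2}(n-1) \\ & 2. \overline{X} \text{ and } S^{2} \text{ are independent} \end{aligned}$

Theorem 3: let $X_{1}, X_{2}, \cdots, X_{n}$ be a sample from the normal population $N(\mu, \sigma^{2})$, and let $\overline{X}$ and $S^{2}$ be the sample mean and the sample variance. Then
$\frac{\overline{X} - \mu}{S/\sqrt{n}} \sim t(n-1).$

Theorem 4: let $X_{1}, X_{2}, \cdots, X_{n1}$ and $Y_{1}, Y_{2}, \cdots, Y_{n2}$ be samples from the normal populations $N(\mu_1, \sigma_{1}^{2})$ and $N(\mu_2, \sigma_{2}^{2})$ respectively, with the two samples independent of each other. Let $\overline{X}, \overline{Y}$ be the two sample means and $S_{1}^{2}, S_{2}^{2}$ the two sample variances. Then
$\begin{aligned} & 1. \frac{S_{1}^{2}/S_{2}^{2}}{\sigma_{1}^{2}/\sigma_{2}^{2}} \sim F(n1-1, n2-1) \\ & 2. \text{If } \sigma_{1}^{2} = \sigma_{2}^{2} = \sigma^{2}, \text{ then } \frac{(\overline{X} - \overline{Y}) - (\mu_{1} - \mu_{2})}{S_{w}\sqrt{\frac{1}{n1}+\frac{1}{n2}}} \sim t(n1+n2-2) \end{aligned}$
where $S_{w}^{2} = \frac{(n1-1)S_{1}^{2}+(n2-1)S_{2}^{2}}{n1+n2-2}$ and $S_{w} = \sqrt{S_{w}^{2}}$.

Python code (checking the theorems by simulation)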

import matplotlib.pyplot as plt
%matplotlib inline
from scipy.stats import chi2, t, norm, f
import numpy as np

def theory_1(mu, sigma, n):
    x_mean = []
    for i in range(10000):
        x_mean.append(np.sum(norm.rvs(loc=mu, scale=sigma, size=n))/n)
    return x_mean

def theory_2(mu, sigma, n):
    res = []
    for i in range(10000):
        x = norm.rvs(loc=mu, scale=sigma, size=n)
        x_mean = np.mean(x)
        s2 = np.sum(np.square(x - x_mean))/(n-1)
        res.append((n-1)*s2/(sigma**2))
    return res

def theory_3(mu, sigma, n):
    res = []
    for i in range(10000):
        x = norm.rvs(loc=mu, scale=sigma, size=n)
        x_mean = np.mean(x)
        s = np.sqrt(np.sum(np.square(x - x_mean))/(n-1))
        res.append((x_mean-mu)/(s/np.sqrt(n)))
    return res

def theory_4(mu1, mu2, sigma1, sigma2, n1, n2):
    res = []
    for i in range(10000):
        x1 = norm.rvs(loc=mu1, scale=sigma1, size=n1)
        x1_mean = np.mean(x1)
        x2 = norm.rvs(loc=mu2, scale=sigma2, size=n2)
        x2_mean = np.mean(x2)
        s1_2 = np.sum(np.square(x1-x1_mean)) / (n1-1)
        s2_2 = np.sum(np.square(x2-x2_mean)) / (n2-1)
        temp1 = (s1_2/s2_2)
        temp2 = (sigma1**2/sigma2**2)
        res.append(temp1/temp2)
    return res 

mu = 5
sigma = 10
n = 5
mu1, mu2 = 1, 2
sigma1, sigma2 = 3, 4
n1, n2 = 10, 40
x_mean = theory_1(mu, sigma, n)
t2 = theory_2(mu, sigma, n)
t_ = theory_3(mu, sigma, n)
f_ = theory_4(mu1, mu2, sigma1, sigma2, n1, n2)

x1 =np.linspace(-10, 20, 10000)
x2 = np.linspace(0.01, 30, 10000)
x3 = np.linspace(-5, 5, 10000)
x4 = np.linspace(0.01, 10, 10000)

plt.figure(figsize=(10, 8))
plt.subplot(2,2, 1)
plt.plot(x1, norm.pdf(x1,loc=mu, scale=sigma/np.sqrt(n)), '-', label='N({}, {})'.format(mu, sigma**2/n), c='blue')
plt.hist(x_mean,bins=50, density=True, histtype='stepfilled', alpha=0.5)
plt.title("Theory_1")
plt.xlabel("x")
plt.ylabel("p(x)")
plt.legend()
plt.subplot(2,2, 2)
plt.plot(x2, chi2.pdf(x2, df=n-1), '--', label='chi2({})'.format(n-1), c='orange')
plt.hist(t2, bins=50,  density=True, histtype='stepfilled', alpha=0.5)
plt.title("Theory_2")
plt.xlabel("x")
plt.ylabel("p(x)")
# plt.xlim([0, 30])
plt.legend()
plt.subplot(2,2, 3)
plt.plot(x3, t.pdf(x3, df=n-1), '-.', label='t({})'.format(n-1), c='red')
plt.hist(t_,bins=50, density=True, histtype='stepfilled', alpha=0.5)
plt.title("Theory_3")
plt.xlabel("x")
plt.ylabel("p(x)")
plt.legend(loc="upper right")
plt.subplot(2,2, 4)
plt.plot(x4, f.pdf(x4, dfn=n1-1, dfd=n2-1), '--', label='F({}, {})'.format(n1-1, n2-1), c='orange')
plt.hist(f_, bins=50, density=True, histtype='stepfilled', alpha=0.5)
plt.title("Theory_4")
plt.xlabel("x")
plt.ylabel("p(x)")
plt.xlim([0, 10])
plt.legend()

plt.tight_layout(w_pad=3)
plt.show()

[Figure: simulated histograms against the theoretical densities for Theorems 1 to 4]
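The simulation above covers only the first part of Theorem 4. The second part, the pooled two-sample $t$ statistic, can be checked in the same spirit. The sketch below is one possible check, not the author's code: the seed, the common standard deviation `sigma = 3`, and the number of repetitions are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(0)
mu1, mu2, sigma = 1.0, 2.0, 3.0   # equal variances, as part 2 requires
n1, n2, reps = 10, 40, 20000

# One row per repetition of the experiment.
x = rng.normal(mu1, sigma, size=(reps, n1))
y = rng.normal(mu2, sigma, size=(reps, n2))

s1_sq = x.var(axis=1, ddof=1)
s2_sq = y.var(axis=1, ddof=1)
sw = np.sqrt(((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2))

diff = x.mean(axis=1) - y.mean(axis=1)
stat = (diff - (mu1 - mu2)) / (sw * np.sqrt(1 / n1 + 1 / n2))

# Compare empirical quantiles with those of t(n1 + n2 - 2).
df = n1 + n2 - 2
probs = [0.05, 0.5, 0.95]
print(np.quantile(stat, probs))
print(t.ppf(probs, df))
```

The two printed rows should agree to roughly two decimal places. Repeating the run with unequal standard deviations shows how the agreement degrades when the equal-variance assumption fails.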

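1.4 Parameter Estimation: The Concept of Point Estimation

Point estimation: suppose the form of the distribution function $F(x;\theta)$ of the population $X$ is known and $\theta$ is the parameter to be estimated. Let $X_{1}, X_{2}, \cdots, X_{n}$ be a sample of $X$ and $x_{1}, x_{2}, \cdots, x_{n}$ the corresponding sample values. The point estimation problem is to construct a suitable statistic $\hat{\theta}(X_{1}, X_{2}, \cdots, X_{n})$ and use its observed value $\hat{\theta}(x_{1}, x_{2}, \cdots, x_{n})$ as an approximation of the unknown parameter $\theta$. $\hat{\theta}(X_{1}, X_{2}, \cdots, X_{n})$ is called an estimator of $\theta$, and $\hat{\theta}(x_{1}, x_{2}, \cdots, x_{n})$ is called an estimate of $\theta$. Both are referred to as estimates and abbreviated $\hat{\theta}$.

In short, point estimation uses a sample statistic to estimate an unknown parameter of the population distribution. Because an estimator is a function of the sample, different sample values generally give different estimates of $\theta$.

1.5 Parameter Estimation: The Method of Moments

Method of moments: let $X$ be a continuous random variable with probability density $f(x;\theta_{1},\theta_{2}, \cdots, \theta_{k})$, or a discrete random variable with probability mass function $P\{X=x\}=p(x;\theta_{1},\theta_{2}, \cdots, \theta_{k})$, where $\theta_{1},\theta_{2}, \cdots, \theta_{k}$ are the parameters to be estimated and $X_{1}, X_{2}, \cdots, X_{n}$ is a sample from $X$. Suppose the first $k$ moments of the population $X$ exist:
$\mu_{l} = E(X^{l}) = \int_{-\infty}^{\infty}x^{l}f(x;\theta_{1},\theta_{2}, \cdots, \theta_{k})dx$
or
$\mu_{l} = E(X^{l}) = \sum_{x} x^{l}p(x;\theta_{1},\theta_{2}, \cdots, \theta_{k})$
for $l = 1, 2, \cdots, k$. Then set each sample moment equal to the corresponding population moment, $A_{l} = \mu_{l}$ for $l = 1, 2, \cdots, k$, and solve for the parameters. Estimating the population moments by the sample moments, and from them the unknown parameters, is called the method of moments.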

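Sample moment formulas

Sample raw moment: $A_{k} = \frac{1}{n}\sum_{i=1}^{n}X_{i}^{k}, k =1, 2, \cdots$

Sample central moment: $B_{k} = \frac{1}{n}\sum_{i=1}^{n}(X_{i} - \overline{X})^{k}, k =1, 2, \cdots$

Steps of the method of moments:

1. Determine the number $k$ of parameters $\theta_{i}$ to be estimated.

2. Write down the first $k$ population moments $\mu_{1}$ to $\mu_{k}$; each is a function of the parameters $\theta_{i}$.

3. Treat the expressions for $\mu_{1}$ to $\mu_{k}$ as a system of equations and solve it for the parameters $\theta_{i}$ in terms of the moments.

4. Replace each $\mu_{l}$ in the solution by the corresponding sample moment $A_{l}$; this gives the estimators of the parameters.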

🔥Example: let the population $X$ be uniformly distributed on $[a, b]$ with $a, b$ unknown, and let $X_{1}, X_{2}, \cdots, X_{n}$ be a sample from $X$. Find the moment estimators of $a, b$.

🦊Solution:

Count the parameters to estimate: $a, b$, so $k = 2$.
Compute the first $2$ population moments:
$\begin{aligned} &\mu_{1} = E(X) = \frac{a+b}{2} \\ &\mu_{2} = E(X^{2}) = D(X) + E^{2}(X) = \frac{(b-a)^{2}}{12} + \frac{(a+b)^{2}}{4} \\ \end{aligned}$
Set up the system of equations and solve it:
$\left \{ \begin{aligned} &\mu_{1} = \frac{a+b}{2} \\ &\mu_{2} = \frac{(b-a)^{2}}{12} + \frac{(a+b)^{2}}{4} \\ \end{aligned} \right.$
The first equation gives $a + b = 2\mu_{1}$, and substituting into the second gives $b - a = 2\sqrt{3(\mu_{2}-\mu_{1}^{2})}$. Hence
$a = \mu_{1} - \sqrt{3(\mu_{2}-\mu_{1}^{2})}, b = \mu_{1} + \sqrt{3(\mu_{2}-\mu_{1}^{2})}$
Replace each $\mu_{l}$ by the corresponding $A_{l}$:
$\begin{aligned} &\hat{a} = A_{1} - \sqrt{3(A_{2}-A_{1}^{2})} = \frac{1}{n}\sum_{i=1}^{n}X_{i} - \sqrt{3(\frac{1}{n}\sum_{i=1}^{n}X_{i}^{2}-(\frac{1}{n}\sum_{i=1}^{n}X_{i})^{2})} \\ & \hat{b} = A_{1} + \sqrt{3(A_{2}-A_{1}^{2})} = \frac{1}{n}\sum_{i=1}^{n}X_{i} + \sqrt{3(\frac{1}{n}\sum_{i=1}^{n}X_{i}^{2}-(\frac{1}{n}\sum_{i=1}^{n}X_{i})^{2})} \end{aligned}$

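Python code (solving the example above)

import numpy as np
from scipy.stats import uniform

a_real = 1
b_real = 6
n = 1000
# uniform(loc, scale) is the uniform distribution on [loc, loc + scale]
x = uniform.rvs(loc=a_real, scale=b_real - a_real, size=n)

A1 = np.sum(x) / n
A2 = np.sum(np.square(x)) / n

a_estimate = A1 - np.sqrt(3 *(A2-A1**2))
b_estimate = A1 + np.sqrt(3 *(A2-A1**2))
print("True a: {}, true b: {}".format(a_real, b_real))
print("Moment estimate of a: {:.2f}, moment estimate of b: {:.2f}".format(a_estimate, b_estimate))

Output of one run (no random seed is set, so the estimates vary from run to run):
True a: 1, true b: 6
Moment estimate of a: 1.02, moment estimate of b: 6.06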
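The same four steps apply to other two-parameter families. As a further illustration that is not part of the original example, consider a normal population $N(\mu, \sigma^{2})$: here $\mu_{1} = \mu$ and $\mu_{2} = \sigma^{2} + \mu^{2}$, so the moment estimators are $\hat{\mu} = A_{1}$ and $\hat{\sigma}^{2} = A_{2} - A_{1}^{2}$. The true values, the seed, and the sample size below are assumptions chosen for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_real, sigma_real, n = 5.0, 2.0, 1000
x = rng.normal(mu_real, sigma_real, size=n)

# First and second sample raw moments.
A1 = np.mean(x)
A2 = np.mean(x ** 2)

# Moment estimators obtained by solving mu_1 = mu, mu_2 = sigma^2 + mu^2.
mu_hat = A1
sigma2_hat = A2 - A1 ** 2

print("mu estimate:", mu_hat)
print("sigma^2 estimate:", sigma2_hat)
```

The estimator $\hat{\sigma}^{2} = A_{2} - A_{1}^{2}$ is exactly the sample variance $S_{n}^{2}$ that divides by $n$, so the method of moments yields the biased version of the variance rather than $S^{2}$.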
