# Bootstrapping

As $m\to\infty$, the probability that a given example never appears in a bootstrap sample of size $m$ drawn with replacement converges to $1/e$:

$$\lim_{m\to\infty}\left(1-\frac{1}{m}\right)^m=\frac{1}{e}\approx 0.368$$
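A quick Monte Carlo sketch of this limit (the sample size `m` and trial count here are illustrative choices, not from the text):

```python
import numpy as np

# Draw a bootstrap sample of size m (with replacement) and measure the
# fraction of the original indices that never appear in it; its average
# over many trials approximates (1 - 1/m)^m.
rng = np.random.default_rng(0)
m = 10_000
trials = 200

excluded = []
for _ in range(trials):
    sample = rng.integers(0, m, size=m)      # bootstrap indices
    n_missing = m - np.unique(sample).size   # indices never drawn
    excluded.append(n_missing / m)

print(np.mean(excluded))  # close to 1/e ≈ 0.368
```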

# Maximum Likelihood Estimation

$$\begin{aligned} \hat\theta &=\arg\max_{\theta}L(X;\theta)=\arg\max_{\theta}p_\text{model}(X;\theta)\\ &=\arg\max_\theta\prod_{i=1}^m p_\text{model}(\pmb x_i;\theta)\\ &=\arg\max_\theta\sum_{i=1}^m\log p_\text{model}(\pmb x_i;\theta)\\ &=\arg\max_\theta\,\Bbb E_{\pmb x\sim\hat p_\text{data}}\log p_\text{model}(\pmb x;\theta) \end{aligned}$$

MLE can be interpreted as minimizing the KL divergence between the empirical distribution $\hat p_\text{data}$ and the model distribution $p_\text{model}$, i.e., minimizing the cross-entropy between the two distributions.
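This interpretation follows from expanding the divergence; only the cross-entropy term depends on $\theta$:

$$D_\text{KL}(\hat p_\text{data}\,\|\,p_\text{model})=\Bbb E_{\pmb x\sim\hat p_\text{data}}\left[\log\hat p_\text{data}(\pmb x)-\log p_\text{model}(\pmb x;\theta)\right]$$

The first term is independent of $\theta$, so minimizing the KL divergence is equivalent to minimizing the cross-entropy $-\Bbb E_{\pmb x\sim\hat p_\text{data}}\log p_\text{model}(\pmb x;\theta)$, which is exactly maximizing the expected log-likelihood above.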

## MLE and MSE

The conditional negative log-likelihood cost is

$$J(\theta)=-\Bbb E_{x,y\sim \hat p_\text{data}}\log p_\text{model}(y\mid x;\theta)$$

Assuming a Gaussian output distribution $p_\text{model}(y\mid x)=\mathcal N(y;f(x;\theta),I)$, this reduces to mean squared error up to an additive constant that does not depend on $\theta$:

$$J(\theta)=\frac{1}{2}\Bbb E_{x,y\sim\hat p_\text{data}}\|y-f(x;\theta)\|^2+\text{const}$$
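A numerical sketch of this equivalence: minimizing the fixed-variance Gaussian NLL recovers the least-squares solution. The linear model `f(x;w) = Xw` and the synthetic data are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 200
X = np.hstack([rng.normal(size=(n, 1)), np.ones((n, 1))])  # feature + bias
w_true = np.array([3.0, -1.0])
y = X @ w_true + rng.normal(scale=0.5, size=n)

# MSE minimizer: ordinary least squares
w_mse, *_ = np.linalg.lstsq(X, y, rcond=None)

# Gaussian NLL with unit variance: 0.5*||y - Xw||^2 plus a w-independent constant
def nll(w):
    r = y - X @ w
    return 0.5 * r @ r + 0.5 * n * np.log(2 * np.pi)

w_mle = minimize(nll, x0=np.zeros(2)).x
print(np.allclose(w_mse, w_mle, atol=1e-3))  # True
```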

## Calculus of Variations

Calculus of variations gives the optimal predictor under each loss. Minimizing expected squared error yields the conditional mean:

$$f^*=\arg\min_f\Bbb E_{x,y\sim p_\text{data}}\|y-f(x)\|^2\implies f^*(x)=\Bbb E_{y\sim p_\text{data}(y\mid x)}[y]$$

Minimizing expected absolute error instead yields the conditional median:

$$f^*=\arg\min_f\Bbb E_{x,y\sim p_\text{data}}\|y-f(x)\|_1\implies f^*(x)=\operatorname{median}_{y\sim p_\text{data}(y\mid x)}[y]$$
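Both results can be checked empirically with a constant predictor on a skewed distribution, where mean and median differ clearly; the exponential distribution and grid range below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(size=50_000)     # mean ≈ 1, median ≈ ln 2 ≈ 0.693

c = np.linspace(0.0, 3.0, 601)       # candidate constant predictions
sq_loss = np.array([np.mean((y - ci) ** 2) for ci in c])
ab_loss = np.array([np.mean(np.abs(y - ci)) for ci in c])

print(c[sq_loss.argmin()], y.mean())       # squared loss -> sample mean
print(c[ab_loss.argmin()], np.median(y))   # absolute loss -> sample median
```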

# Estimation, Bias and Variance

A point estimate is an estimate of a true distribution parameter computed from a sample set; it can be any function of the data:

$$\hat\theta=g(X_m),\quad \theta=\hat\theta+\epsilon$$

Function estimation is point estimation in function space, e.g. estimating the relationship between $x$ and $y$:

$$y=\hat f(x)+\epsilon$$

$$\text{Bias}(\hat\theta)=\Bbb E(\hat\theta)-\theta$$

- Unbiased: $\text{Bias}(\hat\theta_m)=0$;
- Asymptotically unbiased: $\lim_{m\to\infty}\text{Bias}(\hat\theta_m)=0$.

- Mean $\mu=\Bbb E(X)$, estimated by the sample mean:
  $$\hat\mu=\frac{1}{m}\sum_{i=1}^m x_i$$

- Variance $\sigma^2=\text{Var}(X)=\Bbb E[(X-\Bbb E(X))^2]$:
  - Biased variance estimator $\hat\sigma^2=\dfrac{1}{m}\sum_{i=1}^m(x_i-\hat\mu)^2$, with bias $-\sigma^2/m$;
  - Unbiased variance estimator $\tilde\sigma^2=\dfrac{m}{m-1}\hat\sigma^2$;
- Standard deviation (SD):
  $$\tilde\sigma=\sqrt{\dfrac{1}{m-1}\sum_{i=1}^m(x_i-\hat\mu)^2}$$

Sample values deviate less from the sample mean than from the (unknown) population mean, so the plain average of squared deviations underestimates the variance; dividing by $m-1$ instead of $m$ (Bessel's correction) compensates.
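A Monte Carlo sketch of the bias of the two estimators; the population variance `sigma2`, sample size `m`, and trial count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, m, trials = 4.0, 5, 200_000

x = rng.normal(scale=np.sqrt(sigma2), size=(trials, m))
mu_hat = x.mean(axis=1, keepdims=True)
biased = ((x - mu_hat) ** 2).mean(axis=1)   # divides by m
unbiased = biased * m / (m - 1)             # Bessel's correction

print(biased.mean())    # ≈ sigma^2 * (m-1)/m = 3.2
print(unbiased.mean())  # ≈ sigma^2 = 4.0
```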

# Standard Error and Machine Learning

Root mean squared error (RMSE), also called the standard error (SE), reflects the reliability of a sample set (how far measurements deviate from the truth); the lower the standard error, the better the sample represents the population. It is defined as

$$\text{RMSE}=\text{SE}=\sqrt{\frac{1}{m}\sum_{i=1}^m(x_i-\hat x_i)^2}$$

Mean squared error (MSE) is the square of RMSE; it decomposes into bias and variance:

$$\begin{aligned} \text{MSE} &=\Bbb E[(\hat\theta-\theta)^2]=(\Bbb E(\hat\theta)-\theta)^2+\Bbb E(\hat\theta^2)-\Bbb E(\hat\theta)^2\\ &=\text{Bias}(\hat\theta)^2+\text{Var}(\hat\theta) \end{aligned}$$
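The decomposition can be verified numerically. Here the estimator is the biased variance estimator from above; the target distribution and sample size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, m, trials = 1.0, 10, 500_000

# theta_hat: biased variance estimator of sigma^2, one value per trial
x = rng.normal(size=(trials, m))
theta_hat = ((x - x.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)

mse = np.mean((theta_hat - sigma2) ** 2)
bias2 = (theta_hat.mean() - sigma2) ** 2
var = theta_hat.var()
print(mse, bias2 + var)  # the two quantities agree
```

For sample moments the identity holds exactly, not just in expectation, since both sides expand to $\overline{\hat\theta^2}-2\theta\bar{\hat\theta}+\theta^2$.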

The standard error of the sample mean is

$$\text{Var}(\hat\mu)=\frac{\sigma^2}{m}\implies \text{SE}(\hat\mu)=\frac{\sigma}{\sqrt m}$$

The 95% confidence interval for the mean is

$$\left(\hat\mu-1.96\,\text{SE}(\hat\mu),\ \hat\mu+1.96\,\text{SE}(\hat\mu)\right)$$

- The larger the sample, the narrower the confidence interval and the more representative the sample mean is of the population mean;
- The smaller the mean error, the better the algorithm performs.
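A coverage check of the interval: with known $\sigma$, the interval $\hat\mu\pm 1.96\,\sigma/\sqrt m$ should contain the true mean in roughly 95% of repeated samples. The population parameters below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, m, trials = 10.0, 2.0, 50, 100_000

x = rng.normal(mu, sigma, size=(trials, m))
mu_hat = x.mean(axis=1)
se = sigma / np.sqrt(m)

# Fraction of trials whose interval contains the true mean
covered = (mu_hat - 1.96 * se <= mu) & (mu <= mu_hat + 1.96 * se)
print(covered.mean())  # ≈ 0.95
```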