Data Mining and Analysis Course Notes
- Reference textbook: Data Mining and Analysis, Mohammed J. Zaki and Wagner Meira Jr.
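Chapter 2: Numeric Attributes
This chapter looks at the data from the algebraic, geometric, and statistical points of view.
2.1 Univariate Analysis
Only a single attribute is considered: $\mathbf{D}=\left(\begin{array}{c} X \\ \hline x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{array}\right),\ x_i\in\mathbb{R}$
Statistical view: $X$ can be treated as a random variable (a high-dimensional one in the general case), each $x_i$ is an identity random variable, and $x_1,\cdots,x_n$ is regarded as a random sample of size $n$ drawn from $X$.
Def.1. Empirical cumulative distribution function (CDF)
Def.2. Inverse cumulative distribution function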
Def.3. The empirical probability mass function (PMF) of a random variable $X$ is
$$\hat{f}(x)=\frac{1}{n} \sum_{i=1}^{n} I\left(x_{i} = x\right),\quad \forall x \in \mathbb{R},\qquad I\left(x_{i} = x\right)=\begin{cases} 1, & x_i=x\\ 0, & x_i\ne x \end{cases}$$
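As a small illustration of Def.3, the empirical PMF can be computed by counting how often each value occurs and dividing by $n$. The sketch below is only one possible way to write it; the function name `empirical_pmf` and the sample values are made up for this example.

```python
from collections import Counter
from fractions import Fraction


def empirical_pmf(sample):
    """Return a dict mapping each distinct value x to f_hat(x)."""
    n = len(sample)
    counts = Counter(sample)  # counts[x] = sum over i of I(x_i = x)
    return {x: Fraction(c, n) for x, c in counts.items()}


sample = [5, 3, 5, 7, 3, 5]  # hypothetical data, n = 6
pmf = empirical_pmf(sample)

for x in sorted(pmf):
    print(x, pmf[x])
```

Values that never occur in the sample have empirical probability 0, which corresponds to `pmf.get(x, 0)` here, and the stored probabilities sum to 1.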
2.1.1 Measures of Central Tendency
Def.4. The expectation of a discrete random variable $X$ is $\mu:=E(X) = \sum\limits_{x} x f(x)$, where $f(x)$ is the PMF of $X$.
The expectation of a continuous random variable $X$ is $\mu:=E(X) = \int\limits_{-\infty}^{+\infty} x f(x)\,dx$, where $f(x)$ is the PDF of $X$.
Note: $E(aX+bY)=aE(X)+bE(Y)$
Def.5. The sample mean of $X$ is $\hat{\mu}=\frac{1}{n} \sum\limits_{i=1}^{n}x_i$. Note that $\hat{\mu}$ is an estimator of $\mu$.
Def.6. An estimator (statistic) $\hat{\theta}$ is called an unbiased estimator of the parameter $\theta$ if $E(\hat{\theta})=\theta$.
Exercise: show that the sample mean $\hat{\mu}$ is an unbiased estimator of the expectation $\mu$, using $E(x_i)=\mu \text{ for all } x_i$.
Def.7. An estimator is robust if it is not affected by extreme values in the sample. (The sample mean is not robust.)
Def.8. Median of a random variable $X$
Def.9. Sample median of a random variable $X$
Def.10. Mode of a random variable $X$, and sample mode of a random variable $X$
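To make the robustness remark in Def.7 concrete, the following sketch compares the sample mean and the sample median on a tiny made-up sample, before and after one value is replaced by an extreme one. The data are hypothetical and chosen only for illustration.

```python
import statistics

clean = [2.0, 3.0, 4.0, 5.0, 6.0]           # hypothetical sample
with_outlier = [2.0, 3.0, 4.0, 5.0, 600.0]  # same sample, one extreme value

mean_clean = statistics.mean(clean)
mean_outlier = statistics.mean(with_outlier)
median_clean = statistics.median(clean)
median_outlier = statistics.median(with_outlier)

print("mean:  ", mean_clean, "->", mean_outlier)
print("median:", median_clean, "->", median_outlier)
```

A single extreme value moves the mean from 4.0 to 122.8, while the median stays at 4.0 in this example.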
2.1.2 Measures of Dispersion
Def.11. Range and sample range of a random variable $X$
Def.12. Interquartile range (IQR) of a random variable $X$, and the sample IQR
Def.13. The variance of a random variable $X$ is
$$\sigma^{2}=\operatorname{var}(X)=E\left[(X-\mu)^{2}\right]=\begin{cases} \sum\limits_{x}(x-\mu)^{2} f(x) & \text{if } X \text{ is discrete} \\ \int\limits_{-\infty}^{\infty}(x-\mu)^{2} f(x)\,dx & \text{if } X \text{ is continuous} \end{cases}$$
The standard deviation $\sigma$ is the positive square root of $\sigma^2$.
Note: the variance is the second moment about the mean; the $r$-th moment about the mean is $E[(X-\mu)^r]$.
Properties:
- $\sigma^2=E(X^2)-\mu^2=E(X^2)-[E(X)]^2$
- $\operatorname{var}(X_1+X_2)=\operatorname{var}(X_1)+\operatorname{var}(X_2)$ when $X_1,X_2$ are independent
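Both properties can be checked exactly on a small discrete distribution. The sketch below assumes a fair six-sided die as the example PMF (an assumption made here, not taken from the notes) and uses exact fractions so that no rounding is involved; the helper names `expectation` and `variance` are hypothetical.

```python
from fractions import Fraction
from itertools import product


def expectation(pmf):
    """E(X) = sum over x of x * f(x) for a discrete PMF given as a dict."""
    return sum(x * p for x, p in pmf.items())


def variance(pmf):
    """var(X) = sum over x of (x - mu)^2 * f(x)."""
    mu = expectation(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())


# Assumed example: a fair die, f(x) = 1/6 for x = 1..6.
die = {x: Fraction(1, 6) for x in range(1, 7)}

mu = expectation(die)
second_moment = sum(x ** 2 * p for x, p in die.items())  # E(X^2)
var_by_definition = variance(die)
var_by_shortcut = second_moment - mu ** 2

# PMF of X1 + X2 for two independent dice: joint probability is the product.
two_dice = {}
for (x1, p1), (x2, p2) in product(die.items(), repeat=2):
    two_dice[x1 + x2] = two_dice.get(x1 + x2, 0) + p1 * p2

print(mu, var_by_definition, var_by_shortcut, variance(two_dice))
```

For the die, both routes give a variance of 35/12, and the variance of the sum of two independent dice is twice that.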
Def.14. The sample variance is $\hat{\sigma}^{2}=\frac{1}{n} \sum\limits_{i=1}^{n}\left(x_{i}-\hat{\mu}\right)^{2}$; note that the denominator is $n$, not $n-1$.
Geometric interpretation of the sample variance: consider the centered data matrix
$$C:=\left(\begin{array}{c} x_{1}-\hat{\mu} \\ x_{2}-\hat{\mu} \\ \vdots \\ x_{n}-\hat{\mu} \end{array}\right),\qquad n\cdot \hat{\sigma}^2=\sum\limits_{i=1}^{n}\left(x_{i}-\hat{\mu}\right)^{2}=\|C\|^2$$
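The identity $n\cdot\hat{\sigma}^2=\|C\|^2$ can be checked numerically. The sketch below uses a made-up sample and the standard library only; `statistics.pvariance` is used as a cross-check because it also divides by $n$.

```python
import math
import statistics

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical sample
n = len(x)

mu_hat = sum(x) / n                       # sample mean
C = [xi - mu_hat for xi in x]             # centered data vector
var_hat = sum(c * c for c in C) / n       # sample variance with denominator n
norm_C = math.sqrt(sum(c * c for c in C)) # Euclidean norm of C

print(mu_hat, var_hat, norm_C ** 2)
```

Here the squared length of the centered vector equals $n$ times the sample variance, so the sample standard deviation is the length of $C$ scaled by $1/\sqrt{n}$.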
Question: what are the expectation and the variance of the sample mean of $X$?
$$E(\hat{\mu})=E\left(\frac{1}{n} \sum\limits_{i=1}^{n}x_i\right)=\frac{1}{n} \sum\limits_{i=1}^{n} E(x_i)=\frac{1}{n}\sum\limits_{i=1}^{n}\mu=\mu$$
The variance can be obtained in two ways. The first expands the definition directly; the second uses the fact that $x_1,\cdots,x_n$ are independent and identically distributed:
$$\operatorname{var}\left(\sum\limits_{i=1}^{n}x_i\right)=\sum\limits_{i=1}^{n}\operatorname{var}(x_i)=n\cdot \sigma^2\Longrightarrow \operatorname{var}(\hat{\mu})=\frac{1}{n^2}\operatorname{var}\left(\sum\limits_{i=1}^{n}x_i\right)=\frac{\sigma^2}{n}$$
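The notes only name the first method. One way the direct expansion could go, as a sketch under the same i.i.d. assumption (the cross terms vanish because $E[(x_i-\mu)(x_j-\mu)]=0$ for independent $x_i, x_j$ with $i\ne j$):

$$\begin{aligned} \operatorname{var}(\hat{\mu}) &= E\left[(\hat{\mu}-\mu)^2\right] = E\left[\left(\frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)\right)^{2}\right] \\ &= \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}E\left[(x_i-\mu)(x_j-\mu)\right] \\ &= \frac{1}{n^2}\sum_{i=1}^{n}E\left[(x_i-\mu)^2\right] = \frac{1}{n^2}\cdot n\sigma^2 = \frac{\sigma^2}{n} \end{aligned}$$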
Note: the sample variance is a biased estimator, because
$$E(\hat{\sigma}^2)=\left(\frac{n-1}{n}\right)\sigma^2\xrightarrow{n\to +\infty}\sigma^2$$
so the bias vanishes only as $n$ grows.
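The three results above, $E(\hat{\mu})=\mu$, $\operatorname{var}(\hat{\mu})=\sigma^2/n$ and $E(\hat{\sigma}^2)=\frac{n-1}{n}\sigma^2$, can be verified exactly on a toy case by enumerating every possible sample. The sketch below assumes a fair coin with values 0 and 1 (so $\mu=1/2$, $\sigma^2=1/4$) and $n=3$; both choices are made up for this example.

```python
from fractions import Fraction
from itertools import product

values = [0, 1]          # assumed distribution: fair coin, each value with prob. 1/2
n = 3                    # assumed sample size
mu = Fraction(1, 2)      # true expectation of the coin
sigma2 = Fraction(1, 4)  # true variance of the coin

# All 2**n i.i.d. samples of size n are equally likely.
samples = list(product(values, repeat=n))
p = Fraction(1, len(samples))


def sample_mean(xs):
    return Fraction(sum(xs), len(xs))


def sample_variance(xs):
    """Sample variance with denominator n, as in Def.14."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


e_mean = sum(p * sample_mean(s) for s in samples)
var_mean = sum(p * (sample_mean(s) - e_mean) ** 2 for s in samples)
e_sample_var = sum(p * sample_variance(s) for s in samples)

print(e_mean, var_mean, e_sample_var)
```

Because the enumeration is exact, the results match the formulas with no sampling error: the expected sample variance comes out as 1/6 rather than the true 1/4.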
2.2 Bivariate Analysis
Omitted.
2.3 Multivariate Analysis
$\mathbf{D}=\left(\begin{array}{c|cccc} & X_{1} & X_{2} & \cdots & X_{d} \\ \hline \mathbf{x}_{1} & x_{11} & x_{12} & \cdots & x_{1 d} \\ \mathbf{x}_{2} & x_{21} & x_{22} & \cdots & x_{2 d} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \mathbf{x}_{n} & x_{n 1} & x_{n 2} & \cdots & x_{n d} \end{array}\right)$
The attributes can be viewed together as a random vector: $\mathbf{X}=(X_1,\cdots,X_d)^T$
Def.15. For a random vector $\mathbf{X}$, the expectation vector is: $E[\mathbf{X}]=\left(\begin{array}{c} E\left[X_{1}\right] \\ E\left[X_{2}\right] \\ \vdots \\ E\left[X_{d}\right] \end{array}\right)$
The sample mean is: $\hat{\boldsymbol{\mu}}=\frac{1}{n} \sum\limits_{i=1}^{n} \mathbf{x}_{i}\ (=\operatorname{mean}(\mathbf{D})) \in \mathbb{R}^{d}$
Def.16. For $X_1,X_2$, the covariance is defined as $\sigma_{12}=E[(X_1-E(X_1))(X_2-E(X_2))]=E(X_1X_2)-E(X_1)E(X_2)$
Remark:
- $\sigma_{12}=\sigma_{21}$
- If the two variables are independent, then $\sigma_{12}=0$
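The two expressions in Def.16 have direct sample analogues. The sketch below assumes the same $1/n$ normalisation as the sample variance in Def.14 (an assumption for this example) and checks on made-up data that the centered form and the shortcut form agree, and that the result is symmetric in the two attributes.

```python
from fractions import Fraction

x1 = [1, 2, 3, 4]  # hypothetical values of attribute X1
x2 = [2, 4, 5, 9]  # hypothetical values of attribute X2


def sample_cov(a, b):
    """(1/n) * sum of (a_i - mean(a)) * (b_i - mean(b)), computed exactly."""
    n = len(a)
    ma, mb = Fraction(sum(a), n), Fraction(sum(b), n)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / n


n = len(x1)
cov_centered = sample_cov(x1, x2)
# Shortcut form: mean of products minus product of means.
cov_shortcut = (Fraction(sum(a * b for a, b in zip(x1, x2)), n)
                - Fraction(sum(x1), n) * Fraction(sum(x2), n))

print(cov_centered, cov_shortcut, sample_cov(x2, x1))
```

With identical arguments, `sample_cov(x1, x1)` reduces to the sample variance of `x1`.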
Def.17. For a random vector $\mathbf{X}=(X_1,\cdots,X_d)^T$, the covariance matrix is defined as
$$\boldsymbol{\Sigma}=E\left[(\mathbf{X}-\boldsymbol{\mu})(\mathbf{X}-\boldsymbol{\mu})^{T}\right]=\left(\begin{array}{cccc} \sigma_{1}^{2} & \sigma_{12} & \cdots & \sigma_{1 d} \\ \sigma_{21} & \sigma_{2}^{2} & \cdots & \sigma_{2 d} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{d 1} & \sigma_{d 2} & \cdots & \sigma_{d}^{2} \end{array}\right)_{d\times d}$$
$\boldsymbol{\Sigma}$ is a symmetric matrix. The generalized variance of $\mathbf{X}$ is defined as $\det(\boldsymbol{\Sigma})$.
Note:
- $\boldsymbol{\Sigma}$ is real, symmetric, and positive semidefinite, i.e. all of its eigenvalues are nonnegative: $\lambda_1\ge \lambda_2 \ge \cdots \ge\lambda_d \ge 0$
- The total variance is $\operatorname{var}(\mathbf{D})=\operatorname{tr}(\boldsymbol{\Sigma})=\sigma_1^2+\cdots+\sigma_d^2$
Def.18. For $\mathbf{X}=(X_1,\cdots,X_d)^T$, the sample covariance matrix is defined as
$$\hat{\boldsymbol{\Sigma}}=\frac{1}{n}\left(\mathbf{Z}^{T} \mathbf{Z}\right)=\frac{1}{n}\left(\begin{array}{cccc} Z_{1}^{T} Z_{1} & Z_{1}^{T} Z_{2} & \cdots & Z_{1}^{T} Z_{d} \\ Z_{2}^{T} Z_{1} & Z_{2}^{T} Z_{2} & \cdots & Z_{2}^{T} Z_{d} \\ \vdots & \vdots & \ddots & \vdots \\ Z_{d}^{T} Z_{1} & Z_{d}^{T} Z_{2} & \cdots & Z_{d}^{T} Z_{d} \end{array}\right)_{d\times d}$$
where $\mathbf{Z}$ is the centered data matrix, with rows $\mathbf{z}_i^T$ (the centered points) and columns $Z_j$ (the centered attributes):
$$\mathbf{Z}=\mathbf{D}-\mathbf{1} \cdot \hat{\boldsymbol{\mu}}^{T}=\left(\begin{array}{c} \mathbf{x}_{1}^{T}-\hat{\boldsymbol{\mu}}^{T} \\ \mathbf{x}_{2}^{T}-\hat{\boldsymbol{\mu}}^{T} \\ \vdots \\ \mathbf{x}_{n}^{T}-\hat{\boldsymbol{\mu}}^{T} \end{array}\right)=\left(\begin{array}{ccc} - & \mathbf{z}_{1}^{T} & - \\ - & \mathbf{z}_{2}^{T} & - \\ & \vdots & \\ - & \mathbf{z}_{n}^{T} & - \end{array}\right)=\left(\begin{array}{cccc} \mid & \mid & & \mid \\ Z_{1} & Z_{2} & \cdots & Z_{d} \\ \mid & \mid & & \mid \end{array}\right)$$
The sample total variance is $\operatorname{tr}(\hat{\boldsymbol{\Sigma}})$, and the generalized sample variance is $\det(\hat{\boldsymbol{\Sigma}})\ge0$.
Equivalently, as a sum of outer products of the centered points: $\hat{\boldsymbol{\Sigma}}=\frac{1}{n}\sum\limits_{i=1}^n\mathbf{z}_{i}\mathbf{z}_{i}^T$
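The matrix form and the outer-product form of the sample covariance matrix can be compared numerically. The sketch below assumes NumPy is available and uses a small made-up data matrix with $n=4$ points and $d=2$ attributes; `np.cov(..., bias=True)` is used only as a cross-check because it also normalises by $n$.

```python
import numpy as np

# Hypothetical data matrix D: n = 4 points (rows), d = 2 attributes (columns).
D = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 5.0],
              [4.0, 9.0]])
n, d = D.shape

mu_hat = D.mean(axis=0)                          # sample mean vector in R^d
Z = D - np.ones((n, 1)) @ mu_hat.reshape(1, d)   # Z = D - 1 * mu_hat^T

sigma_matrix = Z.T @ Z / n                       # (1/n) Z^T Z
sigma_outer = sum(np.outer(z, z) for z in Z) / n # (1/n) sum of z_i z_i^T

total_variance = np.trace(sigma_matrix)          # tr(Sigma_hat)
generalized_variance = np.linalg.det(sigma_matrix)
eigenvalues = np.linalg.eigvalsh(sigma_matrix)   # ascending, for symmetric input

print(sigma_matrix)
print(total_variance, generalized_variance, eigenvalues)
```

In practice the explicit product with the ones vector is usually replaced by broadcasting (`D - mu_hat`); it is written out here only to mirror the formula for $\mathbf{Z}$.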