Exploratory Data Analysis (EDA) -- Non-parametric Estimation

1. Histograms

Divide the sample space into a number of bins and approximate the density at the center of each bin by the fraction of points in the training data that fall into the corresponding bin; in other words, the height of each bar records the number of samples inside that bin.

The histogram requires two “parameters” to be defined: bin width and starting position of the first bin.

Outliers are difficult to see in a histogram, except insofar as they cause the x-axis to expand: when the sample size is in the thousands, a bin with a small frequency is essentially invisible. (The bins should therefore be neither too wide nor too narrow.)

[Figure: histograms of the same data under several bin settings; a value of 4 is clearly the most reasonable choice here.]

Drawbacks of the histogram:

  • The density estimate depends on the starting position of the bins.
  • The discontinuities of the estimate are not due to the underlying density; they are only an artifact of the chosen bin locations.
  • The number of bins grows exponentially with the number of dimensions (the curse of dimensionality: in high dimensions points are far apart and most bins end up empty).
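Below is a minimal numpy/matplotlib sketch illustrating the first two drawbacks: moving the starting position of the bins changes the estimate, and a poor bin width hides or exaggerates structure. The toy data, bin width, and bin counts are made up purely for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(4, 0.5, 100)])  # toy bimodal sample

fig, axes = plt.subplots(1, 3, figsize=(12, 3), sharey=True)
# Same bin width, two different starting positions: the shape of the estimate changes.
for ax, start in zip(axes[:2], [x.min(), x.min() - 0.5]):
    edges = np.arange(start, x.max() + 1.0, 1.0)   # bin width fixed at 1.0
    ax.hist(x, bins=edges, density=True)           # density=True: bar areas sum to 1
    ax.set_title(f"width 1.0, start {start:.2f}")
# Far too many bins: the estimate becomes spiky and most bins are nearly empty.
axes[2].hist(x, bins=100, density=True)
axes[2].set_title("100 bins (too narrow)")
plt.show()
```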

2. Non-parametric Density Estimation

The probability that a vector 𝑥, drawn from a distribution 𝑝(𝑥), will fall in a given region ℜ of the sample space is $P=\int_{\Re}p(x)\,dx$ (the area enclosed between the PDF and the axis over the region ℜ).
Suppose now that 𝑁 vectors {𝑥(1), 𝑥(2), … 𝑥(𝑁)} are drawn from the distribution; the probability that 𝑘 of these 𝑁 vectors fall in ℜ is given by the binomial distribution:
$$P(k)=\binom{N}{k}P^{k}(1-P)^{N-k}$$
so that
$$E[k/N]=NP/N=P,\qquad \mathrm{Var}[k/N]=NP(1-P)/N^{2}=P(1-P)/N$$
When N→∞, the distribution becomes sharper (the variance gets smaller), so we can expect that a good estimate of the probability 𝑃 can be obtained from the mean fraction of the points that fall within ℜ.
$$P\cong k/N$$
On the other hand, if we assume that ℜ is so small that 𝑝(𝑥) does not vary appreciably within it, then:
$$P=\int_{\Re}p(x)\,dx\cong p(x)\cdot V$$
where 𝑉 is the volume enclosed by region ℜ (in one dimension 𝑉 is simply the length of the interval, so the integral is approximated by the height 𝑝(𝑥) times that length). Combining the two results,
$$P\cong k/N,\quad P\cong p(x)\cdot V\ \ \Rightarrow\ \ p(x)\cong\frac{k}{NV}$$
This estimate becomes more accurate as we increase the number of sample points 𝑁 and shrink the volume 𝑉 (N↑, V↓); the underlying idea is essentially that of calculus.
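A small simulation of the argument above, with an assumed standard Gaussian for 𝑝(𝑥) and ℜ = [0, 1]: the fraction k/N of samples falling in ℜ approaches P = ∫ℜ p(x) dx as N grows (all values here are illustrative).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a, b = 0.0, 1.0                                   # region R = [a, b]
P_true = stats.norm.cdf(b) - stats.norm.cdf(a)    # P = integral of p(x) over R

for N in [100, 1_000, 100_000]:
    x = rng.standard_normal(N)                    # N samples drawn from p(x) = N(0, 1)
    k = np.sum((x >= a) & (x <= b))               # how many of them fall inside R
    print(f"N={N:>6}: k/N = {k / N:.4f}   (true P = {P_true:.4f})")
```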

BUT:

(1) In practice the total number of examples is fixed.
(2) To improve the accuracy of the estimate 𝑝(𝑥) we could let 𝑉 approach zero, but then ℜ would become so small that it would enclose no examples.

Solution:
We will have to find a compromise for 𝑉:
(1) Large enough to include enough examples within ℜ.
(2) Small enough to support the assumption that 𝑝(𝑥) is constant within ℜ.
- We can fix 𝑉 and determine 𝑘 from the data. This leads to the kernel density estimation (KDE) approach.
- We can fix 𝑘 and determine 𝑉 from the data. This gives rise to the k-nearest-neighbor (KNN) approach.

3. Kernel Density Estimation (KDE)

3.1 Parzen Windows

Assume that the region ℜ that encloses the 𝑘 examples is a hypercube with sides of length ℎ centered at 𝑥.

Then its volume is given by $V=h^{D}$, where 𝐷 is the number of dimensions.
To count the number of examples that fall within this region we define a kernel function 𝐾(𝑢) (the uniform kernel, also known as the Parzen window), which requires every coordinate of 𝑢 to lie within |1/2|:
$$K(u)=\begin{cases}1 & |u_{j}|<1/2,\ \ j=1,\dots,D\\ 0 & \text{otherwise}\end{cases}$$

The quantity 𝐾((𝑥 − 𝑥(𝑛))/ℎ) is then equal to unity if 𝑥(𝑛) lies inside a hypercube of side ℎ centered on 𝑥, and zero otherwise.
The total number of points inside the hypercube is:
$$k(x)=\sum_{n=1}^{N}K\left(\frac{x-x^{(n)}}{h}\right)$$
The density estimate is then:
$$p_{KDE}(x)=\frac{1}{Nh^{D}}\sum_{n=1}^{N}K\left(\frac{x-x^{(n)}}{h}\right)$$
Expectation of the estimate (assuming the vectors 𝑥(𝑛) are drawn independently from the true density 𝑝(𝑥)):
$$E\left[p_{KDE}(x)\right]=\frac{1}{Nh^{D}}E\left[\sum_{n=1}^{N}K\left(\frac{x-x^{(n)}}{h}\right)\right]=\frac{1}{Nh^{D}}\sum_{n=1}^{N}E\left[K\left(\frac{x-x^{(n)}}{h}\right)\right]=\frac{1}{h^{D}}\int K\left(\frac{x-x'}{h}\right)p(x')\,dx'$$
Result:
The kernel width ℎ plays the role of a smoothing parameter: the wider ℎ is, the smoother the estimate $p_{KDE}(x)$.
For ℎ → 0, the kernel approaches a Dirac delta function (a function that is zero everywhere except at the origin, yet integrates to 1 over its whole domain) and $p_{KDE}$ approaches the true density.

However, in practice we have a finite number of points, so ℎ cannot be made arbitrarily small, since the density estimate $p_{KDE}(x)$ would then degenerate to a set of impulses located at the training data points.

Drawbacks:
It yields density estimates that have discontinuities.
It weights all points 𝑥(𝑛) inside the window equally, regardless of their distance to the estimation point 𝑥 (points within the window are treated as if uniformly distributed).
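A direct numpy sketch of the Parzen-window estimator above, using the uniform (hypercube) kernel; the function name, the toy data, and the bandwidth value are my own choices for illustration.

```python
import numpy as np

def parzen_window_kde(x_query, data, h):
    """Uniform-kernel (hypercube) KDE: p(x) = k(x) / (N * h^D).

    x_query: (M, D) points where the density is evaluated; data: (N, D) training points.
    """
    N, D = data.shape
    # u has shape (M, N, D); K(u) = 1 iff every coordinate of u lies within |1/2|
    u = (x_query[:, None, :] - data[None, :, :]) / h
    inside = np.all(np.abs(u) < 0.5, axis=2)       # (M, N) indicator of the hypercube
    k = inside.sum(axis=1)                         # number of points inside each hypercube
    return k / (N * h ** D)

rng = np.random.default_rng(2)
data = rng.standard_normal((500, 1))               # toy 1-D sample
grid = np.linspace(-4, 4, 9).reshape(-1, 1)
print(parzen_window_kde(grid, data, h=0.5))        # piecewise-constant, discontinuous estimate
```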

3.2 Smooth Kernel

The kernel now assigns each point a weight that depends on its distance from the estimation point. A natural choice is a smooth, unimodal PDF, such as the Gaussian:
$$K(x)=(2\pi)^{-D/2}e^{-\frac{1}{2}x^{\prime}x}$$
Just as the Parzen window estimate can be seen as a sum of boxes centered at the data, the smooth kernel estimate is a sum of “bumps” (the individual kernel functions added together). The parameter ℎ, also called the smoothing parameter or bandwidth, determines their width.
Attention (the choice of ℎ is important):
(1) A large ℎ will over-smooth the density estimate and mask the structure of the data.
(2) A small ℎ will yield a density estimate that is spiky and very hard to interpret.
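A sketch of the smooth-kernel estimate with a 1-D Gaussian kernel, evaluated for a too-large, a too-small, and an intermediate bandwidth; the data and the specific h values are made up to show the effect.

```python
import numpy as np

def gaussian_kde_1d(x_query, data, h):
    """p_KDE(x) = (1 / (N h)) * sum_n K((x - x_n) / h) with a Gaussian kernel (D = 1)."""
    u = (x_query[:, None] - data[None, :]) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return K.sum(axis=1) / (len(data) * h)

rng = np.random.default_rng(3)
data = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(2, 1.0, 200)])
grid = np.linspace(-6, 6, 400)

for h in [2.0, 0.05, 0.4]:   # over-smoothed, spiky, reasonable
    p = gaussian_kde_1d(grid, data, h)
    print(f"h={h:<4}: integral ~ {np.trapz(p, grid):.3f}, peak ~ {p.max():.3f}")
```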
Bandwidth (ℎ) selection:
A natural measure is the MSE at the estimation point 𝑥, defined by:
$$E\left[\left(p_{KDE}(x)-p(x)\right)^{2}\right]=\left(E\left[p_{KDE}(x)\right]-p(x)\right)^{2}+\mathrm{Var}\left(p_{KDE}(x)\right)$$
This is a bias-variance trade-off: the (squared) bias grows as ℎ increases, while the variance shrinks as ℎ increases.
We would like to find a value of ℎ that minimizes the error between the estimated density and the true density.

A useful concept here is the mean integrated squared error (MISE). It can be viewed as the accumulation, over every point 𝑥, of the local squared error: the MSE only measures the squared error between the estimate and the true density at a single point, whereas we want a single number that evaluates the whole model. The MISE does exactly that; it is the expectation of the integral of the squared error, and therefore a scalar:
$$MISE=E\left[\int\left(p_{KDE}(x)-p(x)\right)^{2}dx\right]$$
The goal is therefore to choose ℎ so as to minimize the MISE.

How to choose ℎ?

  • Rule of thumb: assume a Gaussian distribution in order to choose ℎ.
    If we assume that the true distribution is Gaussian and we use a Gaussian kernel, it can be shown that the optimal value of ℎ is:
    $$h^{\ast}=1.06\,\sigma N^{-1/5}$$
    where 𝜎 is the sample standard deviation and 𝑁 is the number of training examples.
    Better results can be obtained by:
    (1) Using a robust measure of the spread instead of the sample variance, and
    (2) Reducing the coefficient 1.06 to better cope with multimodal densities.
    (3) The optimal bandwidth then becomes:
    $$h^{\ast}=0.9\,\min\!\left(\sigma,\ \frac{IQR}{1.34}\right)N^{-1/5}$$
    IQR is the interquartile range, a robust estimate of the spread.
    IQR is the difference between the 75th percentile (𝑄3) and the 25th percentile (𝑄1): 𝐼𝑄𝑅 = 𝑄3 − 𝑄1.
  • Maximum-likelihood cross-validation:
    The plain ML estimate of ℎ is degenerate, since it yields $h_{ML}=0$, i.e. a density estimate made of Dirac delta functions at each training data point: (1) ℎ appears in the denominator of the likelihood, so the smaller ℎ is, the larger the likelihood becomes; (2) the likelihood is therefore maximized by shrinking ℎ until each window contains only the data point at its own center.
    A practical alternative is to maximize the “pseudo-likelihood” computed using leave-one-out cross-validation:
    $$h_{MLCV}^{\ast}=\arg\max_{h}\ \frac{1}{N}\sum_{n=1}^{N}\log p_{-n}\!\left(x^{(n)}\right),\qquad p_{-n}(x)=\frac{1}{(N-1)h^{D}}\sum_{m\neq n}K\!\left(\frac{x-x^{(m)}}{h}\right)$$
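A sketch of both bandwidth-selection strategies from the list above: the Gaussian rule of thumb (plus the robust IQR-based variant, as reconstructed above) and leave-one-out maximum-likelihood cross-validation for a 1-D Gaussian-kernel KDE. The candidate grid of h values and the toy data are arbitrary.

```python
import numpy as np

def rule_of_thumb_h(data):
    """Return (h_gaussian, h_robust): 1.06*sigma*N^(-1/5) and 0.9*min(sigma, IQR/1.34)*N^(-1/5)."""
    N = len(data)
    sigma = data.std(ddof=1)
    iqr = np.percentile(data, 75) - np.percentile(data, 25)
    return 1.06 * sigma * N ** (-1 / 5), 0.9 * min(sigma, iqr / 1.34) * N ** (-1 / 5)

def loo_pseudo_log_likelihood(data, h):
    """Leave-one-out log pseudo-likelihood of a 1-D Gaussian-kernel KDE."""
    N = len(data)
    u = (data[:, None] - data[None, :]) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    np.fill_diagonal(K, 0.0)                        # leave each point out of its own estimate
    p_loo = K.sum(axis=1) / ((N - 1) * h)
    return np.log(p_loo).sum()

rng = np.random.default_rng(4)
data = rng.standard_normal(400)
h_gauss, h_robust = rule_of_thumb_h(data)
candidates = np.linspace(0.05, 1.0, 40)
h_mlcv = candidates[np.argmax([loo_pseudo_log_likelihood(data, h) for h in candidates])]
print(f"rule of thumb: {h_gauss:.3f}, robust: {h_robust:.3f}, MLCV: {h_mlcv:.3f}")
```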

4. Multivariate Density Estimation

So far we have assumed a hypercube with the same side length ℎ on every axis, i.e. the bandwidth ℎ is the same for all axes, so the density estimate weights all axes equally.
If one or several of the features has a larger spread than the others, we should use a vector of smoothing parameters or even a full covariance matrix (a different ℎ per feature), which complicates the procedure.
Solutions (to handle features that affect the result to different degrees):

  • Pre-scale each axis (normalize to unit variance, for instance).
  • Pre-whiten the data (linearly transform it, using the eigenvalues and eigenvectors of the covariance matrix Σ, so that Σ = 𝐼), estimate the density, and then transform back. After whitening, (i) the features are uncorrelated and (ii) all features have the same variance.
  • Use a product kernel: a good alternative for multivariate KDE.
    $$p_{KDE}(x)=\frac{1}{N}\sum_{n=1}^{N}K\left(x,x^{(n)},h_{1},\dots,h_{D}\right),\qquad K\left(x,x^{(n)},h_{1},\dots,h_{D}\right)=\frac{1}{h_{1}\cdots h_{D}}\prod_{d=1}^{D}K_{d}\!\left(\frac{x_{d}-x_{d}^{(n)}}{h_{d}}\right)$$
    The product kernel consists of the product of one-dimensional kernels; the only thing that differs across dimensions is the bandwidth ℎ_d, the kernels themselves being identical.
    Note that although 𝐾(𝑥, 𝑥(𝑛), ℎ1, … ℎ𝐷) uses kernel independence, this does not imply we assume the features are independent.
    Note also that the order of the summation and product operations is reversed compared with the earlier expression.
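A sketch of a product-kernel KDE: one-dimensional Gaussian kernels multiplied across dimensions, with a separate bandwidth per feature. The per-dimension bandwidths are set with the rule of thumb here purely for illustration.

```python
import numpy as np

def product_kernel_kde(x_query, data, h):
    """Product-kernel KDE: sum over data points of the product of 1-D Gaussian kernels.

    x_query: (M, D), data: (N, D), h: (D,) vector of per-dimension bandwidths.
    """
    N, D = data.shape
    u = (x_query[:, None, :] - data[None, :, :]) / h           # (M, N, D)
    K1d = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)           # one 1-D kernel per dimension
    K = K1d.prod(axis=2)                                       # product across dimensions
    return K.sum(axis=1) / (N * np.prod(h))                    # then sum across data points

rng = np.random.default_rng(5)
data = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 4.0]], size=500)  # unequal spreads
h = 1.06 * data.std(axis=0, ddof=1) * len(data) ** (-1 / 5)    # one bandwidth per feature
print(product_kernel_kde(np.array([[0.0, 0.0], [2.0, 3.0]]), data, h))
```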

5. Transformation Kernel Density Estimation (TKDE)

Reason: the KDE undersmooths densities with long tails. The idea of the TKDE is to transform the data so that the density of the transformed data is easier to estimate with the KDE, and then to use the change-of-variables formula to transform the estimate back to the original variable.
Data transformation:

  • Many statistical methods work best when the data are
    (1) normally distributed, or
    (2) at least symmetrically distributed with constant variance; the transformed data will often exhibit less skewness and a more constant variance than the original variables.

Transformation function 𝑔 (for positive, right-skewed variables, a concave transformation):

  • Normally distributed data have light tails and are well suited to estimation with the KDE.
  • It is easy to transform data to normality if one knows the CDF 𝐹.
  • The CDF can be estimated by assuming a parametric model (e.g. a t-distribution) and using MLE.

Examples of such transformations:
(1) Log transformation
(2) Square-root transformation
(3) Box-Cox transformation, constructed so that it is continuous in the parameter α at α = 0:
$$g_{\alpha}(y)=\begin{cases}\dfrac{y^{\alpha}-1}{\alpha} & \alpha\neq 0\\ \log y & \alpha=0\end{cases}$$
The TKDE algorithm:
1. Transform the data: 𝑦(𝑛) = 𝑔(𝑥(𝑛)).
2. Estimate the density of the transformed data with an ordinary KDE, giving $p_{Y}(y)$.
3. Transform back with the change-of-variables formula: $p_{X}(x)=p_{Y}\!\left(g(x)\right)\,\left|g'(x)\right|$.
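A minimal sketch of the TKDE idea for positive, right-skewed data, using the log transformation as 𝑔 and the change-of-variables formula p_X(x) = p_Y(log x) / x; the synthetic data and scipy's gaussian_kde are stand-ins for whatever KDE is used in practice.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(6)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)   # positive data with a long right tail

# 1. Transform the data so that its density is easier to estimate: y = g(x) = log(x).
y = np.log(x)

# 2. Estimate the density of the transformed data with an ordinary (Gaussian) KDE.
kde_y = gaussian_kde(y)

# 3. Transform back: p_X(x) = p_Y(g(x)) * |g'(x)| = p_Y(log x) / x.
def tkde(x_query):
    x_query = np.asarray(x_query, dtype=float)
    return kde_y(np.log(x_query)) / x_query

grid = np.array([0.1, 0.5, 1.0, 2.0, 5.0, 20.0])
print(tkde(grid))    # density values on the original scale, smoother in the tail than a direct KDE
```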
