Data Preprocessing: Whitening

Contents

1. The purpose of whitening

2. Whitening steps

    2.1 Zero-centering the data (preprocessing)

        Mean normalization

        Standardization (normalization)

    2.2 Decorrelating the data

        Computing the covariance matrix

        Computing the eigenvectors

    2.3 Rescaling

3. ZCA whitening

4. Differences between PCA whitening and ZCA whitening

1. The purpose of whitening

Typically, the attributes of a dataset are correlated to some degree, which means the data contains redundant information. Whitening is a linear transformation that decorrelates the source signal. It is an important preprocessing step whose goal is to reduce the redundancy of the input data, so that the whitened data has the following properties:

  • the correlation between features is removed
  • every feature has variance 1

2. Whitening steps

Whitening is somewhat more involved than other preprocessing steps. It consists of:

  • zero-centering the data
  • decorrelating the data
  • rescaling the data

2.1 Zero-centering the data (preprocessing)

■ Mean normalization

Mean normalization subtracts the mean from the data in order to center it:

X' = X - \overline{x}

where X' is the normalized dataset, X is the original dataset, and \overline{x} is the mean of X. Mean normalization has the effect of centering the data at 0.

■ Standardization (normalization)

Standardization puts all features on the same scale:

X' = \frac{X - \overline{x}}{\sigma}

where X' is the standardized dataset, X is the original dataset, \overline{x} is the mean, and \sigma is the standard deviation.
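Both steps can be sketched in a few lines of NumPy (the toy dataset and variable names here are made up for illustration):

```python
import numpy as np

# A small made-up dataset: 4 samples (rows) x 2 features (columns)
X = np.array([[1.0, 2.1],
              [2.0, 3.9],
              [3.0, 6.2],
              [4.0, 8.1]])

# Mean normalization: X' = X - mean, centering each feature at 0
X_centered = X - X.mean(axis=0)

# Standardization: X' = (X - mean) / std, putting features on the same scale
X_standardized = X_centered / X.std(axis=0)
```

After this, each column of `X_centered` has mean 0, and each column of `X_standardized` additionally has standard deviation 1.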

2.2 Decorrelating the data

What does decorrelation mean?

The original data is correlated to some degree, and we want to rotate it so that the correlation disappears.

How do we find the right rotation? In fact, the eigenvectors of the covariance matrix point in the directions along which the data spreads the most.

Therefore, we can decorrelate the data by projecting it onto the eigenvectors. This applies the desired rotation and removes the correlation between dimensions. The steps are:

  • compute the covariance matrix
  • compute the eigenvectors of the covariance matrix
  • apply the eigenvector matrix to the data, which applies the rotation
■ Computing the covariance matrix

Recall the covariance formula:

\mathrm{cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \overline{x})(y_i - \overline{y})

For data that has already been zero-centered, the means have been removed, so \overline{x} = \overline{y} = 0.

Hence, for our sample matrix X, the covariance matrix is just the dot product of X with its transpose (scaled by the number of samples): C = \frac{1}{n} X X^{T}.
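As a sketch (variable names are my own), the covariance of zero-centered data reduces to a single matrix product, which we can cross-check against NumPy's built-in estimate. Note that with rows as samples the product is Xᵀ X rather than X Xᵀ:

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated toy data: rows are samples, columns are features
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 1.2],
                                          [0.0, 0.8]])

Xc = X - X.mean(axis=0)        # zero-center first, so the means are 0
n = Xc.shape[0]

# With rows as samples, the covariance matrix is (1/n) * Xc^T Xc
C = (Xc.T @ Xc) / n

# Cross-check against NumPy's covariance (bias=True divides by n)
assert np.allclose(C, np.cov(Xc, rowvar=False, bias=True))
```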

■ Computing the eigenvectors

Let A be an n×n matrix. If a number λ and a nonzero n-dimensional column vector x satisfy Ax = λx, then λ is called an eigenvalue of A and x an eigenvector of A corresponding to λ. The equation Ax = λx can be rewritten as (A − λE)x = 0, and |λE − A| is called the characteristic polynomial of A. Setting the characteristic polynomial to 0 gives the characteristic equation of A, a homogeneous linear system; finding the eigenvalues amounts to solving this characteristic equation.

Performing an eigendecomposition of the covariance matrix C gives

C = U \Lambda U^{T}

Clearly, U is the matrix whose columns are the eigenvectors of C, and \Lambda is a diagonal matrix whose diagonal entries are the eigenvalues.
■ Applying the eigenvector matrix to the data

For any sample x, its decorrelated representation y is

y = U^{T} x
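The decorrelation steps above can be sketched with NumPy's `np.linalg.eigh` (appropriate for the symmetric covariance matrix); the data and variable names are my own:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2)) @ np.array([[2.0, 1.2],
                                           [0.0, 0.8]])
Xc = X - X.mean(axis=0)
C = (Xc.T @ Xc) / Xc.shape[0]

# eigh returns eigenvalues (ascending) and eigenvectors as columns of U
lam, U = np.linalg.eigh(C)

# Apply the rotation y = U^T x to every sample (samples are rows, hence Xc @ U)
Y = Xc @ U

# The covariance of the rotated data is diagonal: the correlations are gone
C_rot = (Y.T @ Y) / Y.shape[0]
```

The diagonal of `C_rot` holds the eigenvalues, i.e. the variance of the data along each eigenvector.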

2.3 Rescaling

We require every input feature to have unit variance. Since the variance of the decorrelated data along the i-th eigenvector is the eigenvalue \lambda_i, we can use it directly as the scaling factor: we scale our decorrelated data by dividing each dimension by the square root of its corresponding eigenvalue, \sqrt{\lambda_i}:

x_{\mathrm{PCAwhite},i} = \frac{y_i}{\sqrt{\lambda_i}}

Sometimes some eigenvalues are numerically close to 0, so the scaling step would divide by a value close to 0, which may blow up the data (produce very large values) or cause numerical instability. In practice we therefore regularize this scaling slightly by adding a small constant \epsilon to the eigenvalues before taking the square root and the reciprocal:

x_{\mathrm{PCAwhite},i} = \frac{y_i}{\sqrt{\lambda_i + \epsilon}}

When x lies in the interval [-1, 1], a typical value is \epsilon \approx 10^{-5}.

The result above is PCA whitening.
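Putting the steps together, PCA whitening with the ε regularizer might look like this sketch (synthetic data and names are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 2)) @ np.array([[2.0, 1.2],
                                           [0.0, 0.8]])
Xc = X - X.mean(axis=0)
C = (Xc.T @ Xc) / Xc.shape[0]
lam, U = np.linalg.eigh(C)

eps = 1e-5  # guards against dividing by near-zero eigenvalues

# Rotate, then divide each dimension by sqrt(lambda_i + eps)
X_pca_white = (Xc @ U) / np.sqrt(lam + eps)

# The covariance of the whitened data is (approximately) the identity
C_white = (X_pca_white.T @ X_pca_white) / X_pca_white.shape[0]
```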

3. ZCA whitening

ZCA whitening stands for Zero-phase Component Analysis Whitening. My understanding of "zero-phase" is that, relative to the original space (coordinate system), the whitened data is not rotated (no change of coordinates).

Algorithm

ZCA whitening builds on PCA whitening: it rotates the PCA-whitened data back into the original feature space, which makes the transformed data closer to the original input. The ZCA whitening formula is:

x_{\mathrm{ZCAwhite}} = U \, x_{\mathrm{PCAwhite}}
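A minimal sketch of this formula (the data, eps value, and variable names are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 2)) @ np.array([[2.0, 1.2],
                                           [0.0, 0.8]])
Xc = X - X.mean(axis=0)
C = (Xc.T @ Xc) / Xc.shape[0]
lam, U = np.linalg.eigh(C)
eps = 1e-5

# PCA whitening first ...
X_pca_white = (Xc @ U) / np.sqrt(lam + eps)

# ... then rotate back into the original space: x_ZCA = U x_PCA
X_zca_white = X_pca_white @ U.T
```

Because the final rotation by U is orthogonal, the ZCA output still has identity covariance; only the orientation changes.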

4. Differences between PCA whitening and ZCA whitening

Both PCA whitening and ZCA whitening reduce the correlation between features and give all features the same variance.

1. PCA whitening requires every dimension of the data to have variance 1; ZCA whitening only requires the variances to be equal.

2. PCA whitening can be used for dimensionality reduction as well as decorrelation, whereas ZCA whitening is mainly used for decorrelation.

3. Compared with PCA whitening, ZCA whitening keeps the processed data closer to the original data.
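A quick numerical check of these differences, as a sketch on synthetic data: both whitened versions have identity covariance, but the ZCA output stays closer to the centered input.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 2)) @ np.array([[2.0, 1.2],
                                           [0.0, 0.8]])
Xc = X - X.mean(axis=0)
C = (Xc.T @ Xc) / Xc.shape[0]
lam, U = np.linalg.eigh(C)
eps = 1e-5

X_pca = (Xc @ U) / np.sqrt(lam + eps)   # PCA whitening
X_zca = X_pca @ U.T                     # ZCA whitening

# Both have ~identity covariance ...
for W in (X_pca, X_zca):
    Cw = (W.T @ W) / W.shape[0]
    assert np.allclose(Cw, np.eye(2), atol=1e-3)

# ... but ZCA stays closer to the original (centered) data
d_pca = np.linalg.norm(X_pca - Xc)
d_zca = np.linalg.norm(X_zca - Xc)
assert d_zca < d_pca
```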

                                

References (blogs):

  • "What is the difference between ZCA whitening and PCA whitening?" - Cross Validated (stackexchange.com)
  • 白化变换:PCA白化、ZCA白化 - 知乎 (zhihu.com)
  • 去相关与白化 (decorrelation and whitening) - 知乎 (zhihu.com)
