# Deep learning: Whitening

## Introduction

We have used PCA to reduce the dimension of the data. There is a closely related preprocessing step called whitening (or, in some of the literature, sphering) which is needed for some algorithms. If we are training on images, the raw input is redundant, since adjacent pixel values are highly correlated. The goal of whitening is to make the input less redundant; more formally, our desiderata are that our learning algorithm sees a training input where (i) the features are less correlated with each other, and (ii) the features all have the same variance.

## 2D example

We will first describe whitening using our previous 2D example. We will then describe how this can be combined with smoothing, and finally how to combine this with PCA.

How can we make our input features uncorrelated with each other? We had already done this when computing $\textstyle x_{\rm rot}^{(i)} = U^Tx^{(i)}$. Repeating our previous figure, our plot for $\textstyle x_{\rm rot}$ was:

The covariance matrix of this data is given by:

\begin{align}\begin{bmatrix}7.29 & 0 \\0 & 0.69\end{bmatrix}.\end{align}

(Note: Technically, many of the statements in this section about the "covariance" will be true only if the data has zero mean. In the rest of this section, we will take this assumption as implicit in our statements. However, even if the data's mean isn't exactly zero, the intuitions we're presenting here still hold true, and so this isn't something that you should worry about.)

It is no accident that the diagonal values are $\textstyle \lambda_1$ and $\textstyle \lambda_2$, the eigenvalues of the covariance matrix of the original data. Further, the off-diagonal entries are zero; thus, $\textstyle x_{{\rm rot},1}$ and $\textstyle x_{{\rm rot},2}$ are uncorrelated, satisfying one of our desiderata for whitened data (that the features be less correlated).
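To see why the rotated data has a diagonal covariance, here is a short derivation. It assumes the notation of the earlier PCA discussion, which is not restated in this section: $\textstyle \Sigma$ is the covariance matrix of the zero-mean data over $\textstyle m$ examples, and the columns of $\textstyle U$ are its eigenvectors.

\begin{align}\frac{1}{m}\sum_{i=1}^m x_{\rm rot}^{(i)} \left(x_{\rm rot}^{(i)}\right)^T = U^T \left(\frac{1}{m}\sum_{i=1}^m x^{(i)} \left(x^{(i)}\right)^T\right) U = U^T \Sigma U = \begin{bmatrix}\lambda_1 & 0 \\0 & \lambda_2\end{bmatrix}.\end{align}

The last step uses $\textstyle \Sigma U = U \Lambda$ together with $\textstyle U^T U = I$.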

To make each of our input features have unit variance, we can simply rescale each feature $\textstyle x_{{\rm rot},i}$ by $\textstyle 1/\sqrt{\lambda_i}$. Concretely, we define our whitened data $\textstyle x_{{\rm PCAwhite}} \in \Re^n$ as follows:

\begin{align}x_{{\rm PCAwhite},i} = \frac{x_{{\rm rot},i} }{\sqrt{\lambda_i}}. \end{align}

Plotting $\textstyle x_{{\rm PCAwhite}}$, we get:

This data now has covariance equal to the identity matrix $\textstyle I$. We say that $\textstyle x_{{\rm PCAwhite}}$ is our PCA whitened version of the data: The different components of $\textstyle x_{{\rm PCAwhite}}$ are uncorrelated and have unit variance.
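The rotate-then-rescale steps can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `pca_whiten`, the layout of `x` (one example per column) and the synthetic 2D data are all assumptions made for this example.

```python
import numpy as np

def pca_whiten(x):
    """PCA-whiten zero-mean data x of shape (n, m): n features, m examples."""
    m = x.shape[1]
    sigma = x @ x.T / m                    # covariance matrix, shape (n, n)
    lam, U = np.linalg.eigh(sigma)         # eigh returns eigenvalues in ascending order
    order = np.argsort(lam)[::-1]          # reorder so lambda_1 >= lambda_2 >= ...
    lam, U = lam[order], U[:, order]
    x_rot = U.T @ x                        # rotate into the eigenbasis
    x_pca_white = x_rot / np.sqrt(lam)[:, None]  # rescale each feature to unit variance
    return x_pca_white, U, lam

# Synthetic correlated 2D data (an assumption, standing in for the 2D example).
rng = np.random.default_rng(0)
x = rng.multivariate_normal([0, 0], [[4.0, 3.0], [3.0, 4.0]], size=1000).T
x = x - x.mean(axis=1, keepdims=True)      # enforce the zero-mean assumption

x_pca_white, U, lam = pca_whiten(x)
cov_white = x_pca_white @ x_pca_white.T / x.shape[1]
print(np.round(cov_white, 6))              # numerically the identity matrix
```

Because the covariance is computed from the same samples that are whitened, the result matches the identity up to floating-point error.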

**Whitening combined with dimensionality reduction.** If you want to have data that is whitened and which is lower dimensional than the original input, you can also optionally keep only the top $\textstyle k$ components of $\textstyle x_{{\rm PCAwhite}}$. When we combine PCA whitening with regularization (described later), the last few components of $\textstyle x_{{\rm PCAwhite}}$ will be nearly zero anyway, and thus can safely be dropped.
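Keeping only the top $\textstyle k$ components might look like the sketch below. The 3D synthetic data, the choice `k = 2` and all variable names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 3D data whose third direction carries very little variance.
x = rng.normal(size=(3, 500)) * np.array([[3.0], [1.0], [0.1]])
x = x - x.mean(axis=1, keepdims=True)      # zero-mean, one example per column

sigma = x @ x.T / x.shape[1]
lam, U = np.linalg.eigh(sigma)
order = np.argsort(lam)[::-1]              # largest eigenvalue first
lam, U = lam[order], U[:, order]

k = 2                                      # number of components to keep
x_rot_k = U[:, :k].T @ x                   # project onto the top-k eigenvectors
x_white_k = x_rot_k / np.sqrt(lam[:k])[:, None]

cov_k = x_white_k @ x_white_k.T / x.shape[1]
print(x_white_k.shape)                     # (2, 500)
```

The reduced data still has identity covariance, now of size $\textstyle k \times k$.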

## ZCA Whitening

Finally, it turns out that this way of getting the data to have covariance identity $\textstyle I$ isn't unique. Concretely, if $\textstyle R$ is any orthogonal matrix, so that it satisfies $\textstyle RR^T = R^TR = I$ (less formally, if $\textstyle R$ is a rotation/reflection matrix), then $\textstyle R \,x_{\rm PCAwhite}$ will also have identity covariance. In ZCA whitening, we choose $\textstyle R = U$. We define

\begin{align}x_{\rm ZCAwhite} = U x_{\rm PCAwhite}\end{align}
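As a brief check of the claim that any orthogonal $\textstyle R$ preserves the identity covariance (assuming zero-mean data and $\textstyle m$ examples, as elsewhere in this section):

\begin{align}\frac{1}{m}\sum_{i=1}^m \left(R\,x_{\rm PCAwhite}^{(i)}\right)\left(R\,x_{\rm PCAwhite}^{(i)}\right)^T = R\left(\frac{1}{m}\sum_{i=1}^m x_{\rm PCAwhite}^{(i)} \left(x_{\rm PCAwhite}^{(i)}\right)^T\right)R^T = R\,I\,R^T = I.\end{align}

Taking $\textstyle R = U$ gives the ZCA case.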

Plotting $\textstyle x_{\rm ZCAwhite}$, we get:

It can be shown that out of all possible choices for $\textstyle R$, this choice of rotation causes $\textstyle x_{\rm ZCAwhite}$ to be as close as possible to the original input data $\textstyle x$.

When using ZCA whitening (unlike PCA whitening), we usually keep all $\textstyle n$ dimensions of the data, and do not try to reduce its dimension.
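A minimal NumPy sketch of ZCA whitening follows. The function name `zca_whiten`, the data layout (one example per column) and the synthetic data are assumptions for this example. It also compares how far the PCA-whitened and ZCA-whitened data are from the original input, as a numerical illustration of the closeness property on this one data set rather than a proof.

```python
import numpy as np

def zca_whiten(x):
    """Return (x_pca_white, x_zca_white) for zero-mean x of shape (n, m)."""
    m = x.shape[1]
    sigma = x @ x.T / m                    # covariance matrix
    lam, U = np.linalg.eigh(sigma)         # eigenvalues and eigenvectors
    x_pca_white = (U.T @ x) / np.sqrt(lam)[:, None]
    x_zca_white = U @ x_pca_white          # rotate back with R = U
    return x_pca_white, x_zca_white

rng = np.random.default_rng(1)
x = rng.multivariate_normal([0, 0], [[4.0, 3.0], [3.0, 4.0]], size=1000).T
x = x - x.mean(axis=1, keepdims=True)

x_pca_white, x_zca_white = zca_whiten(x)
cov_zca = x_zca_white @ x_zca_white.T / x.shape[1]

# Frobenius-norm distance of each whitened version from the original data.
dist_pca = np.linalg.norm(x_pca_white - x)
dist_zca = np.linalg.norm(x_zca_white - x)
print(dist_zca <= dist_pca)
```

Note that the output keeps all $\textstyle n$ dimensions, consistent with the usual practice for ZCA whitening.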

## Regularization

When implementing PCA whitening or ZCA whitening in practice, sometimes some of the eigenvalues $\textstyle \lambda_i$ will be numerically close to 0, and thus the scaling step where we divide by $\sqrt{\lambda_i}$ would involve dividing by a value close to zero; this may cause the data to blow up (take on large values) or otherwise be numerically unstable. In practice, we therefore implement this scaling step using a small amount of regularization, and add a small constant $\textstyle \epsilon$ to the eigenvalues before taking their square root and inverse:

\begin{align}x_{{\rm PCAwhite},i} = \frac{x_{{\rm rot},i} }{\sqrt{\lambda_i + \epsilon}}.\end{align}

When $\textstyle x$ takes values around $\textstyle [-1,1]$, a value of $\textstyle \epsilon \approx 10^{-5}$ might be typical.
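The regularized scaling step can be sketched as below. The function name `whiten_regularized` and the test data are assumptions: the data is deliberately rank-deficient (the third feature duplicates the first) so that one eigenvalue is numerically zero, which is the situation the $\textstyle \epsilon$ term is meant to handle.

```python
import numpy as np

def whiten_regularized(x, epsilon=1e-5):
    """Regularized PCA and ZCA whitening of zero-mean x of shape (n, m)."""
    m = x.shape[1]
    sigma = x @ x.T / m
    lam, U = np.linalg.eigh(sigma)         # eigenvalues in ascending order
    scale = 1.0 / np.sqrt(lam + epsilon)   # epsilon keeps the divisor away from zero
    x_pca_white = scale[:, None] * (U.T @ x)
    x_zca_white = U @ x_pca_white
    return x_pca_white, x_zca_white

rng = np.random.default_rng(2)
a = rng.uniform(-1, 1, size=(2, 400))      # values roughly in [-1, 1]
x = np.vstack([a, a[:1]])                  # third feature is a copy of the first
x = x - x.mean(axis=1, keepdims=True)

x_pca_white, x_zca_white = whiten_regularized(x, epsilon=1e-5)
variances = np.mean(x_pca_white ** 2, axis=1)
print(np.all(np.isfinite(x_zca_white)))
```

In this sketch the component belonging to the near-zero eigenvalue stays close to zero instead of blowing up, while the other components keep a variance very close to 1.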

For the case of images, adding $\textstyle \epsilon$ here also has the effect of slightly smoothing (or low-pass filtering) the input image. This also has a desirable effect of removing aliasing artifacts caused by the way pixels are laid out in an image, and can improve the features learned (details are beyond the scope of these notes).

ZCA whitening is a form of pre-processing of the data that maps it from $\textstyle x$ to $\textstyle x_{\rm ZCAwhite}$. It turns out that this is also a rough model of how the biological eye (the retina) processes images. Specifically, as your eye perceives images, most adjacent "pixels" in your eye will perceive very similar values, since adjacent parts of an image tend to be highly correlated in intensity. It is thus wasteful for your eye to have to transmit every pixel separately (via your optic nerve) to your brain. Instead, your retina performs a decorrelation operation (this is done via retinal neurons that compute a function called "on center, off surround/off center, on surround") which is similar to that performed by ZCA. This results in a less redundant representation of the input image, which is then transmitted to your brain.
