Sparse Autoencoder4-Autoencoders and Sparsity

本文介绍了自编码器在无监督学习中的应用,特别是在压缩输入特征和发现数据结构方面的作用。通过限制隐藏层的神经元数量和引入稀疏性约束,自编码器能够学习到数据的低维表示,类似于PCA。文中详细解释了如何通过增加额外的惩罚项来实现稀疏性约束,并提供了实现该算法所需的数学推导。

原文:http://deeplearning.stanford.edu/wiki/index.php/Autoencoders_and_Sparsity

So far, we have described the application of neural networks to supervised learning, in which we have labeled training examples. Now suppose we have only a set of unlabeled training examples \textstyle \{x^{(1)}, x^{(2)}, x^{(3)}, \ldots\}, where \textstyle x^{(i)} \in \Re^{n}. An autoencoderneural network is an unsupervised learning algorithm that applies backpropagation, setting the target values to be equal to the inputs. I.e., it uses \textstyle y^{(i)} = x^{(i)}.

Here is an autoencoder:

Autoencoder636.png

The autoencoder tries to learn a function \textstyle h_{W,b}(x) \approx x. In other words, it is trying to learn an approximation to the identity function, so as to output \textstyle \hat{x} that is similar to \textstyle x. The identity function seems a particularly trivial function to be trying to learn; but by placing constraints on the network, such as by limiting the number of hidden units, we can discover interesting structure about the data. As a concrete example, suppose the inputs \textstyle x are the pixel intensity values from a \textstyle 10 \times 10 image (100 pixels) so \textstyle n=100, and there are \textstyle s_2=50 hidden units in layer \textstyle L_2. Note that we also have \textstyle y \in \Re^{100}. Since there are only 50 hidden units, the network is forced to learn a compressed representation of the input. I.e., given only the vector of hidden unit activations \textstyle a^{(2)} \in \Re^{50}, it must try to reconstruct the 100-pixel input \textstyle x. If the input were completely random---say, each \textstyle x_i comes from an IID Gaussian independent of the other features---then this compression task would be very difficult. But if there is structure in the data, for example, if some of the input features are correlated, then this algorithm will be able to discover some of those correlations. In fact, this simple autoencoder often ends up learning a low-dimensional representation very similar to PCAs.

Our argument above relied on the number of hidden units \textstyle s_2 being small. But even when the number of hidden units is large (perhaps even greater than the number of input pixels), we can still discover interesting structure, by imposing other constraints on the network. In particular, if we impose a sparsity constraint on the hidden units, then the autoencoder will still discover interesting structure in the data, even if the number of hidden units is large.

Informally, we will think of a neuron as being "active" (or as "firing") if its output value is close to 1, or as being "inactive" if its output value is close to 0. We would like to constrain the neurons to be inactive most of the time. This discussion assumes a sigmoid activation function. If you are using a tanh activation function, then we think of a neuron as being inactive when it outputs values close to -1.

Recall that \textstyle a^{(2)}_j denotes the activation of hidden unit \textstyle j in the autoencoder. However, this notation doesn't make explicit what was the input \textstyle x that led to that activation. Thus, we will write \textstyle a^{(2)}_j(x) to denote the activation of this hidden unit when the network is given a specific input \textstyle x. Further, let

\begin{align}\hat\rho_j = \frac{1}{m} \sum_{i=1}^m \left[ a^{(2)}_j(x^{(i)}) \right]\end{align}

be the average activation of hidden unit \textstyle j (averaged over the training set). We would like to (approximately) enforce the constraint

\begin{align}\hat\rho_j = \rho,\end{align}

where \textstyle \rho is a sparsity parameter, typically a small value close to zero (say \textstyle \rho = 0.05). In other words, we would like the average activation of each hidden neuron \textstyle j to be close to 0.05 (say). To satisfy this constraint, the hidden unit's activations must mostly be near 0.


To achieve this, we will add an extra penalty term to our optimization objective that penalizes \textstyle \hat\rho_j deviating significantly from \textstyle \rho. Many choices of the penalty term will give reasonable results. We will choose the following:

\begin{align}\sum_{j=1}^{s_2} \rho \log \frac{\rho}{\hat\rho_j} + (1-\rho) \log \frac{1-\rho}{1-\hat\rho_j}.\end{align}

Here, \textstyle s_2 is the number of neurons in the hidden layer, and the index \textstyle j is summing over the hidden units in our network. If you are familiar with the concept of KL divergence, this penalty term is based on it, and can also be written

\begin{align}\sum_{j=1}^{s_2} {\rm KL}(\rho || \hat\rho_j),\end{align}

where \textstyle {\rm KL}(\rho || \hat\rho_j) = \rho \log \frac{\rho}{\hat\rho_j} + (1-\rho) \log \frac{1-\rho}{1-\hat\rho_j} is the Kullback-Leibler (KL) divergence between a Bernoulli random variable with mean \textstyle \rho and a Bernoulli random variable with mean \textstyle \hat\rho_j. KL-divergence is a standard function for measuring how different two different distributions are. (If you've not seen KL-divergence before, don't worry about it; everything you need to know about it is contained in these notes.)

This penalty function has the property that \textstyle {\rm KL}(\rho || \hat\rho_j) = 0 if \textstyle \hat\rho_j = \rho, and otherwise it increases monotonically as \textstyle \hat\rho_j diverges from \textstyle \rho. For example, in the figure below, we have set \textstyle \rho = 0.2, and plotted \textstyle {\rm KL}(\rho || \hat\rho_j) for a range of values of \textstyle \hat\rho_j:

KLPenaltyExample.png

We see that the KL-divergence reaches its minimum of 0 at \textstyle \hat\rho_j = \rho, and blows up (it actually approaches \textstyle \infty) as \textstyle \hat\rho_j approaches 0 or 1. Thus, minimizing this penalty term has the effect of causing \textstyle \hat\rho_j to be close to \textstyle \rho.

Our overall cost function is now

\begin{align}J_{\rm sparse}(W,b) = J(W,b) + \beta \sum_{j=1}^{s_2} {\rm KL}(\rho || \hat\rho_j),\end{align}

where \textstyle J(W,b) is as defined previously, and \textstyle \beta controls the weight of the sparsity penalty term. The term \textstyle \hat\rho_j (implicitly) depends on \textstyle W,b also, because it is the average activation of hidden unit \textstyle j, and the activation of a hidden unit depends on the parameters \textstyle W,b.

To incorporate the KL-divergence term into your derivative calculation, there is a simple-to-implement trick involving only a small change to your code. Specifically, where previously for the second layer (\textstyle l=2), during backpropagation you would have computed

\begin{align}\delta^{(2)}_i = \left( \sum_{j=1}^{s_{2}} W^{(2)}_{ji} \delta^{(3)}_j \right) f'(z^{(2)}_i),\end{align}

now instead compute

\begin{align}\delta^{(2)}_i =  \left( \left( \sum_{j=1}^{s_{2}} W^{(2)}_{ji} \delta^{(3)}_j \right)+ \beta \left( - \frac{\rho}{\hat\rho_i} + \frac{1-\rho}{1-\hat\rho_i} \right) \right) f'(z^{(2)}_i) .\end{align}

One subtlety is that you'll need to know \textstyle \hat\rho_i to compute this term. Thus, you'll need to compute a forward pass on all the training examples first to compute the average activations on the training set, before computing backpropagation on any example. If your training set is small enough to fit comfortably in computer memory (this will be the case for the programming assignment), you can compute forward passes on all your examples and keep the resulting activations in memory and compute the \textstyle \hat\rho_is. Then you can use your precomputed activations to perform backpropagation on all your examples. If your data is too large to fit in memory, you may have to scan through your examples computing a forward pass on each to accumulate (sum up) the activations and compute \textstyle \hat\rho_i (discarding the result of each forward pass after you have taken its activations \textstyle a^{(2)}_i into account for computing \textstyle \hat\rho_i). Then after having computed \textstyle \hat\rho_i, you'd have to redo the forward pass for each example so that you can do backpropagation on that example. In this latter case, you would end up computing a forward pass twice on each example in your training set, making it computationally less efficient.


The full derivation showing that the algorithm above results in gradient descent is beyond the scope of these notes. But if you implement the autoencoder using backpropagation modified this way, you will be performing gradient descent exactly on the objective \textstyle J_{\rm sparse}(W,b). Using the derivative checking method, you will be able to verify this for yourself as well.


内容概要:本文围绕“基于阶梯碳交易的含P基于阶梯碳交易的含 P2G-CCS 耦合和燃气掺氢的虚拟电厂优化调度(Matlab代码实现)2G-CCS耦合和燃气掺氢的虚拟电厂优化调度”展开研究,提出了一种结合电力转气体(P2G)、碳捕集与封存(CCS)技术以及燃气掺氢手段的虚拟电厂优化调度模型,旨在降低碳排放并提升能源利用效率。该模型采用Matlab进行代码实现,充分考虑阶梯碳交易机制对调度成本的影响,将碳排放成本内部化,激励低碳运行。通过引入P2G-CCS耦合系统实现二氧化碳的回收与资源化利用,同时利用燃气掺氢提高天然气系统的灵活性和清洁能源占比,从而构建了一个多能互补、低碳高效的虚拟电厂调度框架。研究重点在于优化系统运行成本、碳交易费用与环境效益之间的平衡,为新型电力系统下的碳减排提供了可行的技术路径与决策支持。; 适合人群:具备一定电力系统、能源工程或优化算法背景的研究生、科研人员及从事新能源、智慧能源系统开发的工程师。; 使用场景及目标:①应用于综合能源系统、虚拟电厂、碳交易机制等相关课题的研究与仿真;②为含氢能、碳捕集等新兴技术的电力系统低碳调度提供模型参考与代码实现范例;③支持学术论文复现、毕业设计或科研项目开发。; 阅读建议:建议读者结合Matlab代码与文档内容同步学习,重点关注目标函数构建、约束条件设置及阶梯碳交易机制的数学表达,同时可扩展研究不同碳价情景对调度结果的影响,以深化对模型性能与应用潜力的理解。
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值