Question 1
A Boltzmann Machine is different from a Feed Forward Neural Network in the sense that:
Question 2
Throughout the lecture, when talking about Boltzmann Machines, why do we talk in terms of computing the expected value of $s_i s_j$ and not the value of $s_i s_j$ itself?
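As a concrete illustration of why only the expectation is meaningful, here is a minimal NumPy sketch; the joint distribution over the two binary units is made up purely for illustration. Each individual sample of $s_i s_j$ is a random 0 or 1, and only the average $\langle s_i s_j \rangle$ over many samples is a stable quantity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up joint distribution over two stochastic binary units s_i and s_j,
# standing in for the behaviour of units in a Boltzmann Machine.
states = [(0, 0), (0, 1), (1, 0), (1, 1)]
probs = [0.4, 0.1, 0.1, 0.4]

# Draw many joint samples; each individual product s_i * s_j fluctuates,
# so only the expectation <s_i s_j> is a stable, learnable quantity.
idx = rng.choice(len(states), size=10_000, p=probs)
products = np.array([states[k][0] * states[k][1] for k in idx])

print("individual samples of s_i * s_j:", products[:10])
print("estimated <s_i s_j>:", products.mean())   # close to 0.4
```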
Question 3
When learning an RBM, we decrease the energy of data particles and increase the energy of fantasy particles. Brian insists that the latter is not needed. He claims that it should be sufficient to just decrease the energy of data particles, since the energy of all other regions of state space would then have increased relative to it. This would also save us the trouble of sampling from the model distribution. What is wrong with this intuition?
Question 4
Restricted Boltzmann Machines are easier to learn than Boltzmann Machines with arbitrary connectivity. Which of the following is a contributing factor?
Question 5
PCD is a better algorithm than CD1 when it comes to training a good generative model of the data. This means that samples drawn from a freely running Boltzmann Machine trained with PCD (after enough time) are likely to look more realistic than samples drawn from the same model trained with CD1. Why does this happen?
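A minimal sketch of where the two algorithms differ, assuming a small binary RBM with NumPy arrays and toy random data (all sizes and values here are illustrative, not part of the course material): CD1 restarts its negative-phase (fantasy) chain at the data on every update, while PCD keeps one persistent chain running across updates, so the persistent chain can wander into low-energy regions far from the data.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, a, b, rng):
    """One full Gibbs step v -> h -> v' for a binary RBM."""
    h = (rng.random(b.shape) < sigmoid(v @ W + b)).astype(float)
    v_new = (rng.random(a.shape) < sigmoid(h @ W.T + a)).astype(float)
    return v_new

rng = np.random.default_rng(0)
n_vis, n_hid = 6, 4
W = 0.01 * rng.standard_normal((n_vis, n_hid))
a, b = np.zeros(n_vis), np.zeros(n_hid)
data_v = rng.integers(0, 2, size=n_vis).astype(float)   # toy data vector

# CD1: the fantasy particle is re-initialised at the data on every update,
# so it never gets to explore regions far from the data distribution.
cd1_fantasy = gibbs_step(data_v, W, a, b, rng)

# PCD: the fantasy particle persists across updates and keeps being
# refreshed with Gibbs steps, so it can find (and raise the energy of)
# spurious low-energy regions far from the data.
persistent_v = rng.integers(0, 2, size=n_vis).astype(float)
for _ in range(100):                      # many parameter updates later...
    persistent_v = gibbs_step(persistent_v, W, a, b, rng)
```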
Question 6
It's time for some math now!
In RBMs, the energy of any configuration is a linear function of the state.
$$E(v,h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i h_j W_{ij}$$
and this eventually leads to
$$\Delta W_{ij} \propto \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}$$
If the energy were non-linear, such as
$$E(v,h) = -\sum_i a_i f(v_i) - \sum_j b_j g(h_j) - \sum_{i,j} f(v_i)\, g(h_j)\, W_{ij}$$
for some non-linear functions $f$ and $g$, which of the following would be true?
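For concreteness, here is a hedged sketch of the update that the linear energy leads to, assuming a small binary RBM and using a single placeholder configuration in place of a proper Gibbs sample for the model (negative) phase: because the energy is linear in $v_i h_j$, the sufficient statistic for $W_{ij}$ is just the product $v_i h_j$, averaged under the data and under the model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_vis, n_hid = 6, 4
W = 0.01 * rng.standard_normal((n_vis, n_hid))
b = np.zeros(n_hid)
lr = 0.1

# Positive phase: hidden probabilities given a (toy) data vector.
v_data = rng.integers(0, 2, size=n_vis).astype(float)
h_data = sigmoid(v_data @ W + b)
positive = np.outer(v_data, h_data)      # <v_i h_j>_data

# Negative phase: statistics from a model (fantasy) configuration; here a
# random placeholder stands in for a sample drawn by running the Markov chain.
v_model = rng.integers(0, 2, size=n_vis).astype(float)
h_model = sigmoid(v_model @ W + b)
negative = np.outer(v_model, h_model)    # <v_i h_j>_model

W += lr * (positive - negative)          # Delta W_ij proportional to the difference
```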
Question 7
In RBMs, the energy of any configuration is a linear function of the state.
$$E(v,h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i h_j W_{ij}$$
and this eventually leads to
$$P(h_j = 1 \mid v) = \frac{1}{1 + \exp\!\left(-\sum_i W_{ij} v_i - b_j\right)}$$
If the energy were non-linear, such as
$$E(v,h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} f(v_i, h_j)\, W_{ij}$$
for some non-linear function $f$, which of the following would be true?
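A short sketch of that conditional, assuming a small binary RBM with NumPy arrays (the weights and visible vector are random placeholders): because the energy is linear in $h_j$, the energy gap between $h_j = 1$ and $h_j = 0$ is just $-(\sum_i W_{ij} v_i + b_j)$, so the conditional reduces to a logistic sigmoid of each hidden unit's total input.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid = 6, 4
W = 0.01 * rng.standard_normal((n_vis, n_hid))
b = np.zeros(n_hid)
v = rng.integers(0, 2, size=n_vis).astype(float)

# Linearity in h_j makes the conditional a logistic function of the
# total input sum_i W_ij v_i + b_j to each hidden unit.
total_input = v @ W + b
p_h_given_v = 1.0 / (1.0 + np.exp(-total_input))
print(p_h_given_v)   # P(h_j = 1 | v) for each hidden unit j
```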