Deep Learning Study Notes

link: http://ufldl.stanford.edu/tutorial/

1. linear regression

use MLE to understand the loss function
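
A minimal derivation sketch (the notation is mine, not taken from the tutorial): assume y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)} with Gaussian noise \epsilon^{(i)} \sim N(0, \sigma^2). Then

    \log L(\theta) = \sum_{i=1}^{m} \log \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\Big)
                   = \text{const} - \frac{1}{2\sigma^2} \sum_{i=1}^{m} \big(y^{(i)} - \theta^T x^{(i)}\big)^2

so maximizing the likelihood is the same as minimizing the squared-error loss J(\theta) = \frac{1}{2}\sum_i (h_\theta(x^{(i)}) - y^{(i)})^2.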

2. logistic regression: binary classification

use MLE to understand the loss function
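
A minimal derivation sketch (again my own notation): assume y^{(i)} \in \{0, 1\} and h_\theta(x) = \sigma(\theta^T x), so p(y \mid x; \theta) = h_\theta(x)^{y} (1 - h_\theta(x))^{1-y}. Then

    \log L(\theta) = \sum_{i=1}^{m} \Big[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\big(1 - h_\theta(x^{(i)})\big) \Big]

and maximizing the likelihood is the same as minimizing the cross-entropy (log) loss J(\theta) = -\log L(\theta).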

3. softmax regression: multiclass classification
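
A minimal numpy sketch of the softmax hypothesis and its negative log-likelihood loss (the shapes and names are assumptions, not from the tutorial):

    import numpy as np

    def softmax(z):
        # subtract the row-wise max for numerical stability; rows are samples, columns are classes
        z = z - z.max(axis=1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    def softmax_regression_loss(W, b, X, y):
        # X: (m, n) inputs, y: (m,) integer class labels, W: (n, k), b: (k,)
        probs = softmax(X @ W + b)
        m = X.shape[0]
        # negative log-likelihood of the correct class, averaged over samples
        return -np.log(probs[np.arange(m), y]).mean()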

4. Neural Network

activation functions: sigmoid (0, 1), tanh (-1, 1), rectified linear [0, +inf)
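
A minimal numpy sketch of the three activation functions and their output ranges:

    import numpy as np

    def sigmoid(z):          # output in (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def tanh(z):             # output in (-1, 1)
        return np.tanh(z)

    def relu(z):             # output in [0, +inf)
        return np.maximum(0.0, z)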

forward propagation
The process of taking the input features x and computing the final prediction output by passing them through the activation functions layer by layer.
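
A minimal sketch of forward propagation through one hidden layer with sigmoid activations (the parameter names W1, b1, W2, b2 are assumptions):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(x, W1, b1, W2, b2):
        # z(l+1) = W(l) a(l) + b(l), a(l+1) = f(z(l+1)); a(1) = x
        z2 = W1 @ x + b1
        a2 = sigmoid(z2)
        z3 = W2 @ a2 + b2
        h = sigmoid(z3)      # the final prediction h_{W,b}(x)
        return z2, a2, z3, h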

backpropagation algorithm
The whole NN again just amounts to a loss function, and W and b are computed with batch gradient descent. Only when taking the partial derivatives with respect to W^(l)_ij and b^(l)_j do we need the backpropagation algorithm. The overall idea: first compute the difference between the final prediction and the true value, then work backwards to compute each layer's contribution to this difference; from these contributions every partial derivative can be obtained. Computing the partial derivatives this way is much faster.

Here each partial derivative is the sum of the per-sample partial derivatives, i.e., every time a partial derivative is computed, all training samples must be scanned.
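
A minimal backpropagation sketch for the same one-hidden-layer network with squared-error loss, accumulating the gradients over every training sample as described above (shapes and names are assumptions):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def backprop_batch(X, Y, W1, b1, W2, b2):
        # X: (n, m) columns are samples, Y: (k, m) targets; squared-error loss
        dW1 = np.zeros_like(W1); db1 = np.zeros_like(b1)
        dW2 = np.zeros_like(W2); db2 = np.zeros_like(b2)
        m = X.shape[1]
        for i in range(m):                       # scan every training sample
            x, y = X[:, i], Y[:, i]
            z2 = W1 @ x + b1; a2 = sigmoid(z2)
            z3 = W2 @ a2 + b2; h = sigmoid(z3)
            # delta of the output layer: prediction/true-value difference times f'(z3)
            delta3 = (h - y) * h * (1 - h)
            # propagate the error backwards: each layer's contribution to the difference
            delta2 = (W2.T @ delta3) * a2 * (1 - a2)
            dW2 += np.outer(delta3, a2); db2 += delta3
            dW1 += np.outer(delta2, x);  db1 += delta2
        return dW1, db1, dW2, db2                # sums over all samples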

Supervised CNN

feature extraction by convolution
When the dimensionality of the input samples is extremely large, we can first apply convolution and then pooling to reduce the dimensionality.
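
A naive numpy sketch of convolution with an 8x8 filter followed by non-overlapping mean pooling (the image size, filter size, and pooling size are arbitrary examples; as in most CNN code, the "convolution" here is cross-correlation, i.e., the kernel is not flipped):

    import numpy as np

    def convolve2d_valid(image, kernel):
        # "valid" convolution: slide the kernel over every position where it fully fits
        H, W = image.shape
        kh, kw = kernel.shape
        out = np.zeros((H - kh + 1, W - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
        return out

    def mean_pool(feature_map, p):
        # non-overlapping p x p mean pooling
        H, W = feature_map.shape
        H, W = H - H % p, W - W % p             # drop the ragged border, if any
        fm = feature_map[:H, :W]
        return fm.reshape(H // p, p, W // p, p).mean(axis=(1, 3))

    image = np.random.rand(96, 96)
    kernel = np.random.rand(8, 8)               # the 8x8 patch / filter
    feature_map = convolve2d_valid(image, kernel)   # 89 x 89
    pooled = mean_pool(feature_map, 4)              # 22 x 22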

SGD
Compared with batch GD, SGD uses just a single training example or a small set of examples called a “minibatch”, usually around 256.

Note the terms “minibatch”, “epoch” (one iteration over the whole data set), and “shuffle”.

One final but important point regarding SGD is the order in which we present the data to the algorithm. If the data is given in some meaningful order, this can bias the gradient and lead to poor convergence. Generally a good method to avoid this is to randomly shuffle the data prior to each epoch of training.
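
A minimal sketch of minibatch SGD with a per-epoch shuffle; grad_fn is a hypothetical function that returns the gradient of the loss on the given minibatch:

    import numpy as np

    def sgd(grad_fn, theta, X, y, lr=0.01, batch_size=256, epochs=10):
        m = X.shape[0]
        for epoch in range(epochs):
            perm = np.random.permutation(m)           # shuffle before each epoch
            for start in range(0, m, batch_size):
                idx = perm[start:start + batch_size]  # one minibatch
                theta = theta - lr * grad_fn(theta, X[idx], y[idx])
        return theta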

Momentum
If the objective has the form of a long shallow ravine leading to the optimum and steep walls on the sides, standard SGD will tend to oscillate across the narrow ravine since the negative gradient will point down one of the steep sides rather than along the ravine towards the optimum. The objectives of deep architectures have this form near local optima and thus standard SGD can lead to very slow convergence particularly after the initial steep gains. Momentum is one method for pushing the objective more quickly along the shallow ravine.

In other words, if the objective function forms a steep-walled ravine, each update easily jumps from one side of the ravine to the other, so the iterates oscillate back and forth while slowly descending along the ravine.
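
A minimal sketch of the classical momentum update v = gamma * v + alpha * grad, theta = theta - v; again grad_fn is a hypothetical minibatch-gradient function:

    import numpy as np

    def sgd_momentum(grad_fn, theta, X, y, lr=0.01, gamma=0.9,
                     batch_size=256, epochs=10):
        v = np.zeros_like(theta)                      # velocity accumulates past gradients
        m = X.shape[0]
        for epoch in range(epochs):
            perm = np.random.permutation(m)
            for start in range(0, m, batch_size):
                idx = perm[start:start + batch_size]
                v = gamma * v + lr * grad_fn(theta, X[idx], y[idx])
                theta = theta - v                     # damps oscillation across the ravine
        return theta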

Note

filter / kernel: for example, the 8x8 patch used for convolution

After convolution, we get a feature map.

A CNN consists of three kinds of layers:
normal fully connected NN layers
subsampling layers, i.e., the pooling layers
convolutional layers

Comparison

Compared with an ordinary NN, a CNN can use convolution and pooling to handle high-dimensional input data.

Sparse coding && PCA

ICA && RICA

unsupervised learning

autoencoder

feature extraction for unsupervised learning, when we don't have training labels.
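
A minimal sketch of a one-hidden-layer autoencoder trained with squared reconstruction error; the hidden activations are the learned features (all names, shapes, and hyperparameters here are my own assumptions):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_autoencoder(X, n_hidden=64, lr=0.1, epochs=100):
        # X: (m, n) unlabeled data; learn to reconstruct x from a smaller code
        m, n = X.shape
        rng = np.random.default_rng(0)
        W1 = rng.normal(0, 0.01, (n, n_hidden)); b1 = np.zeros(n_hidden)
        W2 = rng.normal(0, 0.01, (n_hidden, n)); b2 = np.zeros(n)
        for _ in range(epochs):
            H = sigmoid(X @ W1 + b1)          # encoder: the learned features
            R = H @ W2 + b2                   # decoder: linear reconstruction
            err = R - X                       # gradient of 1/2 * squared error w.r.t. R (times m)
            dW2 = H.T @ err / m;  db2 = err.mean(axis=0)
            dH = err @ W2.T * H * (1 - H)
            dW1 = X.T @ dH / m;   db1 = dH.mean(axis=0)
            W1 -= lr * dW1; b1 -= lr * db1
            W2 -= lr * dW2; b2 -= lr * db2
        return W1, b1                         # encode new data with sigmoid(x @ W1 + b1)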

Statistical Language Modeling (SLM)

ref: http://homepages.inf.ed.ac.uk/lzhang10/slm.html
definition
p(w_1, ..., w_T): the probability of a word sequence
probabilistic chain rule
p(w_1, ..., w_T) = p(w_1) \prod_{i=2}^{T} p(w_i | w_1, ..., w_{i-1}) = p(w_1) \prod_{i=2}^{T} p(w_i | h_i), where h_i denotes the history of the i-th word w_i
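
A tiny sketch of the chain rule under a bigram approximation p(w_i | h_i) ≈ p(w_i | w_{i-1}); the probabilities below are made up purely for illustration:

    # made-up conditional probabilities, for illustration only
    p_first = {"the": 0.2}
    p_next = {("the", "cat"): 0.05, ("cat", "sat"): 0.1}

    def sentence_probability(words):
        # p(w_1, ..., w_T) = p(w_1) * prod_{i=2}^{T} p(w_i | w_{i-1})
        prob = p_first[words[0]]
        for prev, cur in zip(words, words[1:]):
            prob *= p_next[(prev, cur)]
        return prob

    print(sentence_probability(["the", "cat", "sat"]))   # 0.2 * 0.05 * 0.1 = 0.001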

RNN

ref: http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
http://karpathy.github.io/2015/05/21/rnn-effectiveness/

definition
The idea behind RNNs is to make use of sequential information. In a traditional neural network we assume that all inputs (and outputs) are independent of each other. But for many tasks that’s a very bad idea.

i.e., an RNN takes the dependency between training samples into account; in other words, the training samples have some dependency or sequential relationship among them.
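
A minimal sketch of one step of a vanilla RNN in the notation the wildml tutorial uses, s_t = tanh(U x_t + W s_{t-1}) and o_t = softmax(V s_t) (the shapes are assumptions):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def rnn_step(x_t, s_prev, U, W, V):
        # s_t depends on the current input and the previous hidden state,
        # which is how the sequential dependency is carried along
        s_t = np.tanh(U @ x_t + W @ s_prev)
        o_t = softmax(V @ s_t)                # distribution over the vocabulary
        return s_t, o_t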

note
the wildml reference (http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-2-implementing-a-language-model-rnn-with-python-numpy-and-theano/) provides a concrete implementation example.
several issues should be noted:

1. In the text-generation code,

next_word_probs = model.forward_propagation(new_sentence)

the forward_propagation used here is actually the Theano version of forward_propagation, not the forward_propagation the author wrote by hand.

while sampled_word == word_to_index[unknown_token]:
    samples = np.random.multinomial(1, next_word_probs[-1])
    sampled_word = np.argmax(samples)

So next_word_probs[-1] here actually represents o[-1], i.e., the output for the last word of the input x; samples is then a one-hot vector, and np.argmax is used to recover its index.

LSTM

a type of RNN that can capture long-range dependencies

It exists purely to overcome the problem that plain RNNs cannot capture long-range dependencies.

Only the way s_t (the hidden state) is computed is different.
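
A minimal sketch of how an LSTM computes its state: besides s_t it keeps a cell state c_t, controlled by input/forget/output gates (this is a common formulation rather than code from the references; all parameter names are assumptions):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, s_prev, c_prev, params):
        # params holds a weight matrix and bias per gate; the names are assumptions
        z = np.concatenate([x_t, s_prev])
        i = sigmoid(params["Wi"] @ z + params["bi"])   # input gate
        f = sigmoid(params["Wf"] @ z + params["bf"])   # forget gate
        o = sigmoid(params["Wo"] @ z + params["bo"])   # output gate
        g = np.tanh(params["Wg"] @ z + params["bg"])   # candidate cell state
        c_t = f * c_prev + i * g                       # keep/forget long-term memory
        s_t = o * np.tanh(c_t)                         # the new hidden state
        return s_t, c_t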

loss function

least mean squares (squared error)
cross-entropy loss, i.e., log loss or logistic loss

Backpropagation Alg.

http://cs231n.github.io/optimization-1/
http://colah.github.io/posts/2015-08-Backprop/
http://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/

The following contains a derivation of the BP formulas:
https://theclevermachine.wordpress.com/2014/09/06/derivation-error-backpropagation-gradient-descent-for-neural-networks/

Note that δ^l_i is the partial derivative of the total error with respect to z^l_i.

computational graph
forward-mode differentiation: it can only compute the partial derivative of the output with respect to one input at a time, so if the computational graph has many inputs, computing all the partial derivatives this way is slow. Starting from one input b, it obtains ∂(node)/∂b for every node by summing over all paths from b to that node.

reverse-mode differentiation is faster: in a single backward pass it computes the partial derivative ∂Z/∂(node) of the output Z with respect to every node, which is exactly backpropagation.
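
A tiny numeric sketch on the graph e = (a + b) * (b + 1), in the spirit of the colah post: forward mode pushes d(node)/db from the input b towards the output, while reverse mode pushes de/d(node) from the output back to every node in one pass.

    # graph: c = a + b, d = b + 1, e = c * d
    a, b = 2.0, 1.0
    c, d = a + b, b + 1.0
    e = c * d                        # e = 3 * 2 = 6

    # forward mode w.r.t. b: push derivatives from the input towards the output
    da_db, db_db = 0.0, 1.0
    dc_db = da_db + db_db            # dc/db = 1
    dd_db = db_db                    # dd/db = 1
    de_db = d * dc_db + c * dd_db    # de/db = d + c = 5

    # reverse mode: push derivatives from the output back to every node at once
    de_de = 1.0
    de_dc = de_de * d                # de/dc = d = 2
    de_dd = de_de * c                # de/dd = c = 3
    de_da = de_dc * 1.0              # de/da = 2
    de_db_rev = de_dc * 1.0 + de_dd * 1.0   # de/db = 5, same answer

    print(de_db, de_da, de_db_rev)   # 5.0 2.0 5.0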
