The Perceptron
- Forward Propagation
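As a quick sketch of the forward pass (notation assumed here: inputs $x_i$, weights $w_i$, bias $w_0$, and non-linear activation $g$), a single perceptron computes:

$$\hat{y} = g\left(w_0 + \sum_{i=1}^{m} x_i w_i\right)$$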
Common Activation Functions
NOTE: All activation functions are non-linear
-
Sigmoid
-
Hyperbolic Tangent
-
Rectified Linear Unit (ReLU)
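A minimal sketch of how these three activations are exposed in TensorFlow (`z` is an arbitrary example tensor of pre-activations):

```python
import tensorflow as tf

z = tf.constant([-1.0, 0.0, 1.0])  # example pre-activations

sigmoid_out = tf.math.sigmoid(z)   # squashes values into (0, 1)
tanh_out    = tf.math.tanh(z)      # squashes values into (-1, 1)
relu_out    = tf.nn.relu(z)        # max(0, z)
```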
Multi-Output Perceptron
Because all inputs are densely connected to all outputs, these layers are called Dense (fully connected) layers.
-
TensorFlow implementation of a Dense layer:
```python
import tensorflow as tf

class MyDenseLayer(tf.keras.layers.Layer):
    def __init__(self, input_dim, output_dim):
        super(MyDenseLayer, self).__init__()
        # Initialize weights and bias
        self.W = self.add_weight(shape=[input_dim, output_dim])
        self.b = self.add_weight(shape=[1, output_dim])

    def call(self, inputs):
        # Forward propagate the inputs
        z = tf.matmul(inputs, self.W) + self.b
        # Feed through a non-linear activation
        output = tf.math.sigmoid(z)
        return output
```
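A quick usage sketch of the custom layer above (batch size and dimensions are arbitrary example values):

```python
import tensorflow as tf

layer = MyDenseLayer(input_dim=3, output_dim=2)
x = tf.random.normal((4, 3))  # batch of 4 examples, 3 features each
y = layer(x)                  # forward pass; output has shape (4, 2)
```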
The corresponding Keras implementation:
```python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

layer = tf.keras.layers.Dense(units=2)  # Example

# As the first layer in a sequential model:
model = Sequential()
model.add(Dense(32, input_shape=(16,)))
# Now the model will take as input arrays of shape (*, 16)
# and output arrays of shape (*, 32).
# After the first layer, you don't need to specify
# the size of the input anymore:
model.add(Dense(32))
```
Applying Neural Networks
-
Quantifying Loss
The loss of our neural network measures the cost incurred from incorrect predictions
-
Empirical Loss
Also known as:
- **Objective function**
- Cost function
- Empirical Risk
The empirical loss measures the total loss over our entire dataset
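As a sketch (with $n$ training examples, per-example loss $\mathcal{L}$, and prediction $f(x^{(i)}; W)$, notation assumed here):

$$J(W) = \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\left(f\left(x^{(i)}; W\right), y^{(i)}\right)$$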
-
Binary Cross Entropy Loss
Cross Entropy Loss can be used with models that output a probability between 0 and 1
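A sketch of the binary cross entropy loss in the same notation (true label $y^{(i)}$, predicted probability $f(x^{(i)}; W)$):

$$J(W) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y^{(i)} \log f\left(x^{(i)}; W\right) + \left(1 - y^{(i)}\right) \log\left(1 - f\left(x^{(i)}; W\right)\right) \right]$$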
```python
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=predicted)
)
```
-
Mean Squared Error Loss
MSE loss can be used with regression models that output continuous real numbers
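In the same notation, a sketch of the MSE loss:

$$J(W) = \frac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - f\left(x^{(i)}; W\right) \right)^2$$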
```python
loss = tf.reduce_mean(
    tf.square(tf.subtract(y, predicted))
)
```
Training Neural Networks
-
Loss Optimization
We want to find the network weights that achieve the lowest loss
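In symbols (a sketch, with $J(W)$ the empirical loss defined above):

$$W^{*} = \underset{W}{\arg\min}\ J(W)$$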
- Gradient Descent
  - Initialize weights randomly from a normal distribution $N(0, \sigma^2)$
  - Loop until convergence:
    - Compute gradient, $\frac{\partial J(W)}{\partial W}$
    - Update weights, $W \Leftarrow W - \eta \frac{\partial J(W)}{\partial W}$
  - Return weights
```python
import tensorflow as tf

lr = 0.01  # learning rate (example value)
weights = tf.Variable(tf.random.normal((1,)))  # initialize weights randomly

while True:  # loop until convergence
    with tf.GradientTape() as g:
        loss = compute_loss(weights)      # compute_loss defined elsewhere
    gradient = g.gradient(loss, weights)  # gradient of the loss w.r.t. the weights
    weights.assign_sub(lr * gradient)     # update weights: W <- W - lr * dJ/dW
```
- Computing Gradients: Backpropagation
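As a sketch of how backpropagation applies the chain rule (for a toy network $x \to z_1 \to \hat{y}$ with weights $w_1$ and $w_2$, notation assumed here):

$$\frac{\partial J(W)}{\partial w_1} = \frac{\partial J(W)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1}$$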
-
Loss Functions Can Be Difficult to Optimize
Setting the Learning Rate
- A small learning rate converges slowly and gets stuck in spurious local minima
- A large learning rate overshoots, becoming unstable and diverging
- A well-chosen, stable learning rate converges smoothly and avoids poor local minima
-
Adaptive Learning Rates
Gradient descent optimization methods: https://ruder.io/optimizing-gradient-descent/
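A minimal sketch of the adaptive optimizers exposed by `tf.keras.optimizers` (hyperparameters are left at their defaults; see the survey linked above for how each algorithm adapts the learning rate):

```python
import tensorflow as tf

# Common gradient descent variants in tf.keras.optimizers
optimizers = {
    "SGD":      tf.keras.optimizers.SGD(),       # plain SGD (optionally with momentum)
    "Adam":     tf.keras.optimizers.Adam(),
    "Adadelta": tf.keras.optimizers.Adadelta(),
    "Adagrad":  tf.keras.optimizers.Adagrad(),
    "RMSprop":  tf.keras.optimizers.RMSprop(),
}
```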
```python
import tensorflow as tf

model = tf.keras.Sequential([...])

# pick your favourite optimizer
optimizer = tf.keras.optimizers.SGD()

while True:  # loop forever
    with tf.GradientTape() as tape:
        # forward pass through the network
        prediction = model(x)
        # compute the loss
        loss = compute_loss(y, prediction)

    # update the weights using the gradient
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
```
-
Overfitting
How to address overfitting: use regularization
-
Regularization
Technique that constrains our optimization problem to discourage complex models.
Common regularization methods:
-
Dropout
During training, randomly set some activations to 0
```python
tf.keras.layers.Dropout(rate=0.5)
```
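A minimal usage sketch (layer sizes are arbitrary example values), placing Dropout between Dense layers so roughly half of the activations are zeroed during training:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(rate=0.5),  # randomly drop 50% of activations during training
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(rate=0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```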
-
Early Stopping
Stop training before we have a chance to overfit
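One way to do this in Keras is the `EarlyStopping` callback; a minimal sketch (the patience value is an arbitrary example), monitoring the validation loss and stopping once it stops improving:

```python
import tensorflow as tf

# Stop training once the validation loss has not improved for 5 epochs,
# and restore the weights from the best epoch seen so far
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=5,
    restore_best_weights=True,
)

# Example usage (model, data, and epochs assumed to be defined elsewhere):
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=[early_stopping])
```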
-