Chapter 10 Introduction to Artificial Neural Networks

Reading notes on O'Reilly's Hands-On Machine Learning with Scikit-Learn and TensorFlow.

10.1 From Biological to Artificial Neurons

The revival of ANNs has been driven by the huge quantity of data now available, increased computing power, improved training algorithms, the fact that the theoretical limitations of ANNs rarely matter in practice, and a virtuous circle of funding and progress.

10.1.1 Biological Neurons

10.1.2 Logical Computations with Neurons

Artificial neuron: it has one or more binary (on/off) inputs and one binary output. The artificial neuron simply activates its output when at least a certain number of its inputs are active ($\ge 2$ in the case of Figure 10-3).
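For instance, a minimal NumPy sketch of such a neuron (my own illustration, not the book's code): with a threshold of 2, as in Figure 10-3, two connected inputs implement a logical AND (with a threshold of 1 they would implement an OR).

import numpy as np

def threshold_neuron(inputs, threshold=2):
    """Fires (outputs 1) when at least `threshold` of its binary inputs are active."""
    return int(np.sum(inputs) >= threshold)

# With a threshold of 2, two connected inputs implement a logical AND:
for a in (0, 1):
    for b in (0, 1):
        print(a, b, threshold_neuron([a, b], threshold=2))  # output is 1 only when a = b = 1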

10.1.3 The Perceptron

The Perceptron is one of the simplest ANN architectures, invented in 1957 by Frank Rosenblatt. It is based on a slightly different artificial neuron called a linear threshold unit (LTU): the inputs and output are now numbers (instead of binary on/off values) and each input connection is associated with a weight. The LTU computes a weighted sum of its inputs ($z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n = \mathbf{w}^T \cdot \mathbf{x}$), then applies a step function to that sum and outputs the result: $h_{\mathbf{w}}(\mathbf{x}) = \mathrm{step}(z) = \mathrm{step}(\mathbf{w}^T \cdot \mathbf{x})$.

Equation 10-1. Common step functions used in Perceptrons
$$\mathrm{heaviside}(z) = \begin{cases} 0 & \text{if } z < 0 \\ 1 & \text{if } z \ge 0 \end{cases} \qquad \mathrm{sgn}(z) = \begin{cases} -1 & \text{if } z < 0 \\ 0 & \text{if } z = 0 \\ 1 & \text{if } z > 0 \end{cases}$$
A single LTU can be used for simple linear binary classification. It computes a linear combination of the inputs and if the result exceeds a threshold, it outputs the positive class or else outputs the negative class (just like a Logistic Regression classifier or a linear SVM).
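A minimal sketch of an LTU (my own illustration), with the weight vector chosen by hand:

import numpy as np

def ltu(x, w):
    """Linear threshold unit: weighted sum followed by the heaviside step."""
    z = np.dot(w, x)           # z = w^T . x
    return 1 if z >= 0 else 0  # heaviside step

# weights chosen by hand for illustration
print(ltu(np.array([1.0, 2.0, 0.5]), np.array([-3.0, 1.0, 2.0])))  # z = 0.0, output 1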

A Perceptron is simply composed of a single layer of LTUs, with each neuron connected to all the inputs. These connections are often represented using special pass-through neurons called input neurons: they just output whatever input they are fed. Moreover, an extra bias feature is generally added ($x_0 = 1$). This bias feature is typically represented using a special type of neuron called a bias neuron, which just outputs 1 all the time.

Equation 10-2. Perceptron learning rule (weight update)
$$w_{i,j}^{\text{next step}} = w_{i,j} + \eta\,(y_j - \hat{y}_j)\,x_i$$

  • $w_{i,j}$ is the connection weight between the $i$th input neuron and the $j$th output neuron.
  • $x_i$ is the $i$th input value of the current training instance.
  • $\hat{y}_j$ is the output of the $j$th output neuron for the current training instance.
  • $y_j$ is the target output of the $j$th output neuron for the current training instance.
  • $\eta$ is the learning rate.
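A short NumPy sketch of this learning rule on a toy linearly separable problem (my own illustration, not the book's code):

import numpy as np

def train_perceptron(X, y, eta=0.1, n_epochs=10):
    X = np.c_[np.ones(len(X)), X]          # add the bias feature x0 = 1
    w = np.zeros(X.shape[1])               # one weight per input (single output neuron)
    for epoch in range(n_epochs):
        for x_i, y_i in zip(X, y):
            y_hat = 1 if np.dot(w, x_i) >= 0 else 0   # LTU output
            w += eta * (y_i - y_hat) * x_i            # Equation 10-2
    return w

# toy data: logical AND of the two inputs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
print(train_perceptron(X, y))   # learned weights (bias weight first)

Scikit-Learn provides a Perceptron class that implements essentially the same algorithm: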
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron
import numpy as np

iris = load_iris()
X = iris.data[:, (2, 3)]              # petal length, petal width
y = (iris.target == 0).astype(int)    # Iris Setosa?
per_clf = Perceptron(random_state=42)
per_clf.fit(X, y)
per_clf.predict([[2, 0.5]])

Or equivalently, using an SGDClassifier configured like a Perceptron:

from sklearn.linear_model import SGDClassifier

# same hyperparameters as a Perceptron: perceptron loss, constant learning rate of 1, no regularization
sgd_clf = SGDClassifier(loss="perceptron", learning_rate="constant",
                        eta0=1, penalty=None)
sgd_clf.fit(X, y)
sgd_clf.predict([[2, 0.5]])

10.1.4 Multi-Layer Perceptron and Backpropagation

An MLP is composed of one (passthrough) input layer, one or more layers of LTUs, called hidden layers, and one final layer of LTUs called the output layer. Every layer except the output layer includes a bias neuron and is fully connected to the next layer. When an ANN has two or more hidden layers, it is called a deep neural network (DNN).

Backpropagation: for each training instance the backpropagation algorithm first makes a prediction (forward pass), measures the error, then goes through each layer in reverse to measure the error contribution from each connection (reverse pass), and finally slightly tweaks the connection weights to reduce the error (Gradient Descent step).
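A bare-bones NumPy sketch of this loop for a tiny two-layer network with sigmoid activations and a squared-error loss (my own illustration; the task, layer sizes, and learning rate are arbitrary choices):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR: a tiny task that a single-layer Perceptron cannot solve
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

rng = np.random.default_rng(42)
W1 = rng.normal(scale=0.5, size=(2, 5)); b1 = np.zeros(5)   # hidden layer: 5 units
W2 = rng.normal(scale=0.5, size=(5, 1)); b2 = np.zeros(1)   # output layer: 1 unit
eta = 0.5

for step in range(5000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)
    # reverse pass: measure each layer's error contribution
    d_out = (y_hat - y) * y_hat * (1 - y_hat)   # dLoss/dz_out for squared error
    d_hid = (d_out @ W2.T) * h * (1 - h)        # dLoss/dz_hidden
    # Gradient Descent step
    W2 -= eta * h.T @ d_out;  b2 -= eta * d_out.sum(axis=0)
    W1 -= eta * X.T @ d_hid;  b1 -= eta * d_hid.sum(axis=0)

print(np.round(y_hat.ravel(), 2))   # typically approaches [0, 1, 1, 0]; backprop can occasionally get stuck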

The step function is replaced with the logistic function, $\sigma(z) = 1 / (1 + \exp(-z))$. This was essential because the step function contains only flat segments, so there is no gradient to work with (Gradient Descent cannot move on a flat surface), while the logistic function has a well-defined nonzero derivative everywhere, allowing Gradient Descent to make some progress at every step.

Two other popular activation functions are:

The hyperbolic tangent function $\tanh(z) = 2\sigma(2z) - 1$.
Just like the logistic function it is S-shaped, continuous, and differentiable, but its output value ranges from –1 to 1 (instead of 0 to 1 in the case of the logistic function), which tends to make each layer’s output more or less normalized (i.e., centered around 0) at the beginning of training. This often helps speed up convergence.

The ReLU function $\mathrm{ReLU}(z) = \max(0, z)$. It is continuous but unfortunately not differentiable at $z = 0$ (the slope changes abruptly, which can make Gradient Descent bounce around). However, in practice it works very well and has the advantage of being fast to compute.
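A quick NumPy summary of these activation functions and their derivatives (my own sketch, not from the book):

import numpy as np

def sigmoid(z):                      # logistic: output in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):                    # derivative: sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1 - s)

def tanh(z):                         # output in (-1, 1); equals 2*sigmoid(2z) - 1
    return np.tanh(z)

def d_tanh(z):
    return 1 - np.tanh(z) ** 2

def relu(z):                         # max(0, z): cheap to compute, gradient is 1 for z > 0
    return np.maximum(0, z)

def d_relu(z):                       # undefined at z = 0; 0 is used by convention here
    return (z > 0).astype(float)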

When the classes are exclusive (e.g., classes 0 through 9 for digit image classification), the output layer is typically modified by replacing the individual activation functions by a shared softmax function. The output of each neuron corresponds to the estimated probability of the corresponding class.
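A small sketch of the softmax function itself (my own illustration): it exponentiates each logit and normalizes, so the outputs are positive and sum to 1 and can be read as class probability estimates.

import numpy as np

def softmax(logits):
    # subtract the max for numerical stability (does not change the result)
    exp = np.exp(logits - np.max(logits, axis=-1, keepdims=True))
    return exp / np.sum(exp, axis=-1, keepdims=True)

print(softmax(np.array([2.0, 1.0, 0.1])))  # roughly [0.66, 0.24, 0.10]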

10.2 Training an MLP with TensorFlow’s High-Level API

# load the MNIST dataset
import numpy as np
import tensorflow as tf
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
X_train = X_train.astype(np.float32).reshape(-1, 28*28) / 255.0
X_test = X_test.astype(np.float32).reshape(-1, 28*28) / 255.0
y_train = y_train.astype(np.int32)
y_test = y_test.astype(np.int32)
X_valid, X_train = X_train[:5000], X_train[5000:]
y_valid, y_train = y_train[:5000], y_train[5000:]
# train a DNN classifier with the TF.Learn (tf.contrib.learn) API
feature_columns = tf.contrib.learn.infer_real_valued_columns_from_input(X_train)
dnn_clf = tf.contrib.learn.DNNClassifier(hidden_units=[300, 100], n_classes=10,
                                         feature_columns=feature_columns)
dnn_clf.fit(x=X_train, y=y_train, batch_size=50, steps=40000)
# prediction and evaluation
from sklearn.metrics import accuracy_score
y_pred = list(dnn_clf.predict(X_test))
accuracy_score(y_test, y_pred)   # 0.9759

# or use the TF.Learn API for evaluation
dnn_clf.evaluate(X_test, y_test)
# {'loss': 0.167284, 'accuracy': 0.9759, 'global_step': 40000}

You can change the activation function by setting the `activation_fn` hyperparameter.
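Note that tf.contrib.learn has been removed from TensorFlow 2. A roughly equivalent model can be built with the tf.keras API (a sketch; the layer sizes mirror the DNNClassifier above, the other hyperparameters are my own choices):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(300, activation="relu", input_shape=(28 * 28,)),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="sgd", metrics=["accuracy"])
model.fit(X_train, y_train, epochs=20, batch_size=50,
          validation_data=(X_valid, y_valid))
model.evaluate(X_test, y_test)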

10.3 Training a DNN Using Plain TensorFlow

TensorFlow’s lower-level Python API provides more control over the architecture of the network.

We will implement Mini-batch Gradient Descent to train a DNN on the MNIST dataset.

The first step is the construction phase, building the TensorFlow graph.

The second step is the execution phase, where you actually run the graph to train the model.

10.3.1 Construction Phase

Specify the number of inputs and outputs, and set the number of hidden neurons in each layer:

import tensorflow as tf

n_inputs = 28 * 28   # MNIST images are 28x28 pixels
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10

Use placeholder nodes to represent the training data and targets

X=tf.placeholder(tf.float32,shape=(None,n_inputs),name="X")
y=tf.placeholder(tf.int64,shape=(None),name="y")

Create a `neuron_layer()` function to create one layer at a time, and specify the inputs, the number of neurons, the activation function, and the name of the layer in the parameter list:

def neuron_layer(X, n_neurons, name, activation=None):
    with tf.name_scope(name):
        n_inputs = int(X.get_shape()[1])
        stddev = 2 / np.sqrt(n_inputs)   # standard deviation of the weight initialization
        init = tf.truncated_normal((n_inputs, n_neurons), stddev=stddev)
        # a truncated normal ensures there are no very large weights
        W = tf.Variable(init, name="weights")
        b = tf.Variable(tf.zeros([n_neurons]), name="biases")
        # b has shape [n_neurons]; it is broadcast across the batch dimension
        z = tf.matmul(X, W) + b
        if activation == "relu":
            return tf.nn.relu(z)
        else:
            return z

Create the deep neural network:

with tf.name_scope("dnn"):
    hidden1=neuron_layer(X,n_hidden1,"hidden1",activation="relu")
    hidden2=neuron_layer(hidden1,n_hidden2,"hidden2",activation="relu")
    logits=neuron_layer(hidden2,n_outputs,"outputs") 

As an alternative, use TensorFlow’s `tf.layers.dense()` function:

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1", activation=tf.nn.relu)
    hidden2 = tf.layers.dense(hidden1, n_hidden2, name="hidden2", activation=tf.nn.relu)
    logits = tf.layers.dense(hidden2, n_outputs, name="outputs")

Use cross entropy as the cost function. The `sparse_softmax_cross_entropy_with_logits()` function is equivalent to applying the softmax activation function and then computing the cross entropy.

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")
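To see what this function computes, here is a NumPy sketch (my own illustration): for each instance it returns the negative log of the softmax probability assigned to the true class.

import numpy as np

def softmax(z):
    exp = np.exp(z - z.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

def sparse_softmax_xentropy(labels, logits):
    probs = softmax(logits)
    return -np.log(probs[np.arange(len(labels)), labels])

logits_ex = np.array([[0.2, 0.3, 0.5], [0.7, 0.1, 0.2]])
labels_ex = np.array([2, 1])
print(sparse_softmax_xentropy(labels_ex, logits_ex))  # one loss value per instance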

Define a `GradientDescentOptimizer` that will tweak the model parameters to minimize the cost function.

learning_rate=0.01
with tf.name_scope("train"):
    optimizer=tf.train.GradientDescentOptimizer(learning_rate)
    training_op=optimizer.minimize(loss)

Evaluate the model.

with tf.name_scope("eval"):
    correct=tf.nn.in_top_k(logits,y,1)
    accuracy=tf.reduce_mean(tf.cast(correct,tf.float32))

`in_top_k(predictions, targets, k, name=None)` says whether the targets are in the top $k$ predictions. For example, assume the `batch_size` is 2, and the number of classes is 3.

logits = [[0.2, 0.3, 0.5], [0.7, 0.1, 0.2]]
y = [2, 1]
# asks whether the element at index 2 of [0.2, 0.3, 0.5] (which is 0.5) is the top-1 value
# asks whether the element at index 1 of [0.7, 0.1, 0.2] (which is 0.1) is the top-1 value
correct = tf.nn.in_top_k(logits, y, 1)
# [True, False]
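For k = 1 this amounts to checking that the target class has the highest logit, which can be reproduced in NumPy (my own illustration):

import numpy as np

logits_np = np.array([[0.2, 0.3, 0.5], [0.7, 0.1, 0.2]])
y_np = np.array([2, 1])
correct_np = np.argmax(logits_np, axis=1) == y_np   # top-1 check
print(correct_np)  # [ True False]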

Create a node to initialize all variables, and a Saver to save the trained model parameters to disk:

init=tf.global_variables_initializer()
saver=tf.train.Saver()

10.3.2 Execution Phase

Load the MNIST dataset with TensorFlow as in Section 10.2. Note that tf.examples.tutorials.mnist (used in the book) is deprecated.

Define the number of epochs and the size of the mini-batches.

n_epochs=400
batch_size=50

Define a generator that shuffles the training set and yields one mini-batch at a time:

def shuffle_batch(X, y, batch_size):
    rnd_idx = np.random.permutation(len(X))
    n_batches = len(X) // batch_size
    for batch_idx in np.array_split(rnd_idx, n_batches):
        X_batch, y_batch = X[batch_idx], y[batch_idx]
        yield X_batch, y_batch

Train the model:

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_batch = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, "Batch accuracy:", acc_batch, "Val accuracy:", acc_val)
    save_path = saver.save(sess, "./my_model_final.ckpt")

10.3.3 Using the Neural Network

Restore the trained model and use it to make predictions:

with tf.Session() as sess:
    saver.restore(sess, save_path)   # the path returned by saver.save(), "./my_model_final.ckpt"
    X_new_scaled = X_test[:20]
    Z = logits.eval(feed_dict={X: X_new_scaled})
    y_pred = np.argmax(Z, axis=1)
y_pred
#array([7, 2, 1, 0, 4, 1, 4, 9, 5, 9, 0, 6, 9, 0, 1, 5, 9, 7, 3, 4],
#      dtype=int64)
y_test[:20]
#array([7, 2, 1, 0, 4, 1, 4, 9, 5, 9, 0, 6, 9, 0, 1, 5, 9, 7, 3, 4])

10.4 Fine-Tuning Neural Network Hyperparameters

The flexibility of neural networks is also one of their main drawbacks: there are many hyperparameters to tweak.

  • any imaginable network topology (how neurons are interconnected)
  • the number of layers (even in a simple MLP)
  • the number of neurons per layer
  • the type of activation function to use in each layer
  • the weight initialization logic

It is better to use randomized search than grid search, since grid search is far more time-consuming when the hyperparameter space is large.
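For example, a sketch of randomized search using Scikit-Learn's RandomizedSearchCV with an MLPClassifier (the estimator and search space are my own choices for illustration; the book tunes TensorFlow models instead):

from scipy.stats import reciprocal
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPClassifier

iris = load_iris()
param_distribs = {
    "hidden_layer_sizes": [(50,), (100,), (100, 50)],   # network topology
    "activation": ["relu", "tanh"],                     # activation function
    "learning_rate_init": reciprocal(1e-4, 1e-1),       # log-uniform learning rate
}
rnd_search = RandomizedSearchCV(MLPClassifier(max_iter=1000), param_distribs,
                                n_iter=10, cv=3, random_state=42)
rnd_search.fit(iris.data, iris.target)
print(rnd_search.best_params_)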

Another option is to use a tool such as Oscar (http://oscar.calldesk.ai/).

10.4.1 Number of Hidden Layers

Deep networks have a much higher parameter efficiency than shallow ones: they can model complex functions using exponentially fewer neurons than shallow nets, making them much faster to train.

10.4.2 Number of Neurons per Hidden Layer

A simpler approach is to pick a model with more layers and neurons than you actually need, then use early stopping to prevent it from overfitting (and other regularization techniques, especially dropout, as we will see in Chapter 11).
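With the tf.keras API, early stopping can be added with an EarlyStopping callback (a sketch, assuming a compiled model and a validation set like those in Section 10.2's Keras example are available):

import tensorflow as tf

# stop training once the validation loss has not improved for 10 epochs,
# and roll back to the best weights seen so far
early_stopping = tf.keras.callbacks.EarlyStopping(patience=10,
                                                  restore_best_weights=True)
history = model.fit(X_train, y_train, epochs=400, batch_size=50,
                    validation_data=(X_valid, y_valid),
                    callbacks=[early_stopping])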

10.4.3 Activation Functions

For the hidden layers, ReLU (or one of its variants) is a good default in most cases.

For the output layer, the softmax activation function is generally a good choice for classification tasks; for regression tasks, use no activation function at all.
