Chapter 10 Introduction to Artificial Neural Networks

Reading notes on O'Reilly's Hands-On Machine Learning with Scikit-Learn and TensorFlow.

10.1 From Biological to Artificial Neurons

The revival of ANNs has been driven by the huge quantity of data now available, increased computing power, improved training algorithms, the fact that the theoretical limitations of ANNs rarely matter in practice, and a virtuous circle of funding and progress.

10.1.1 Biological Neurons

10.1.2 Logical Computations with Neurons

Artificial neuron: it has one or more binary (on/off) inputs and one binary output. The artificial neuron simply activates its output when at least a certain number of its inputs are active ($\ge 2$ in the case of Figure 10-3).
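For instance, a minimal NumPy sketch of such a neuron (my own illustration, not the book's code): with a threshold of 2, as in Figure 10-3, two connected inputs implement a logical AND (with a threshold of 1 they would implement an OR).

import numpy as np

def threshold_neuron(inputs, threshold=2):
    """Fires (outputs 1) when at least `threshold` of its binary inputs are active."""
    return int(np.sum(inputs) >= threshold)

# With a threshold of 2, two connected inputs implement a logical AND:
for a in (0, 1):
    for b in (0, 1):
        print(a, b, threshold_neuron([a, b], threshold=2))  # output is 1 only when a = b = 1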

10.1.3 The Perceptron

The Perceptron is one of the simplest ANN architectures, invented in 1957 by Frank Rosenblatt. It is based on a slightly different artificial neuron called a linear threshold unit (LTU): the inputs and output are now numbers (instead of binary on/off values) and each input connection is associated with a weight. The LTU computes a weighted sum of its inputs ($z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n = \mathbf{w}^T \cdot \mathbf{x}$), then applies a step function to that sum and outputs the result: $h_{\mathbf{w}}(\mathbf{x}) = \mathrm{step}(z) = \mathrm{step}(\mathbf{w}^T \cdot \mathbf{x})$.

Equation 10-1. Common step functions used in Perceptrons
$$\mathrm{heaviside}(z) = \begin{cases} 0 & \text{if } z < 0 \\ 1 & \text{if } z \ge 0 \end{cases} \qquad \mathrm{sgn}(z) = \begin{cases} -1 & \text{if } z < 0 \\ 0 & \text{if } z = 0 \\ 1 & \text{if } z > 0 \end{cases}$$
A single LTU can be used for simple linear binary classification. It computes a linear combination of the inputs and if the result exceeds a threshold, it outputs the positive class or else outputs the negative class (just like a Logistic Regression classifier or a linear SVM).
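A minimal sketch of an LTU (my own illustration), with the weight vector chosen by hand:

import numpy as np

def ltu(x, w):
    """Linear threshold unit: weighted sum followed by the heaviside step."""
    z = np.dot(w, x)           # z = w^T . x
    return 1 if z >= 0 else 0  # heaviside step

# weights chosen by hand for illustration
print(ltu(np.array([1.0, 2.0, 0.5]), np.array([-3.0, 1.0, 2.0])))  # z = 0.0, output 1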

A Perceptron is simply composed of a single layer of LTUs, with each neuron connected to all the inputs. These connections are often represented using special pass-through neurons called input neurons: they just output whatever input they are fed. Moreover, an extra bias feature is generally added ($x_0 = 1$). This bias feature is typically represented using a special type of neuron called a bias neuron, which just outputs 1 all the time.

Equation 10-2. Perceptron learning rule (weight update)
$$w_{i,j}^{\text{next step}} = w_{i,j} + \eta\,(y_j - \hat{y}_j)\,x_i$$

  • $w_{i,j}$ is the connection weight between the $i$th input neuron and the $j$th output neuron.
  • $x_i$ is the $i$th input value of the current training instance.
  • $\hat{y}_j$ is the output of the $j$th output neuron for the current training instance.
  • $y_j$ is the target output of the $j$th output neuron for the current training instance.
  • $\eta$ is the learning rate.
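A short NumPy sketch of this learning rule on a toy linearly separable problem (my own illustration, not the book's code):

import numpy as np

def train_perceptron(X, y, eta=0.1, n_epochs=10):
    X = np.c_[np.ones(len(X)), X]          # add the bias feature x0 = 1
    w = np.zeros(X.shape[1])               # one weight per input (single output neuron)
    for epoch in range(n_epochs):
        for x_i, y_i in zip(X, y):
            y_hat = 1 if np.dot(w, x_i) >= 0 else 0   # LTU output
            w += eta * (y_i - y_hat) * x_i            # Equation 10-2
    return w

# toy data: logical AND of the two inputs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
print(train_perceptron(X, y))   # learned weights (bias weight first)

Scikit-Learn provides a Perceptron class that implements essentially the same algorithm: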
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron
import numpy as np

iris = load_iris()
X = iris.data[:, (2, 3)]              # petal length, petal width
y = (iris.target == 0).astype(int)    # Iris Setosa?
per_clf = Perceptron(random_state=42)
per_clf.fit(X, y)
per_clf.predict([[2, 0.5]])

Or equivalently, using an SGDClassifier configured like a Perceptron:

from sklearn.linear_model import SGDClassifier

# same hyperparameters as a Perceptron: perceptron loss, constant learning rate of 1, no regularization
sgd_clf = SGDClassifier(loss="perceptron", learning_rate="constant",
                        eta0=1, penalty=None)
sgd_clf.fit(X, y)
sgd_clf.predict([[2, 0.5]])

10.1.4 Multi-Layer Perceptron and Backpropagation

An MLP is composed of one (passthrough) input layer, one or more layers of LTUs, called hidden layers, and one final layer of LTUs called the output layer. Every layer except the output layer includes a bias neuron and is fully connected to the next layer. When an ANN has two or more hidden layers, it is called a deep neural network (DNN).

Backpropagation: for each training instance the backpropagation algorithm first makes a prediction (forward pass), measures the error, then goes through each layer in reverse to measure the error contribution from each connection (reverse pass), and finally slightly tweaks the connection weights to reduce the error (Gradient Descent step).
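A bare-bones NumPy sketch of this loop for a tiny two-layer network with sigmoid activations and a squared-error loss (my own illustration; the task, layer sizes, and learning rate are arbitrary choices):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR: a tiny task that a single-layer Perceptron cannot solve
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

rng = np.random.default_rng(42)
W1 = rng.normal(scale=0.5, size=(2, 5)); b1 = np.zeros(5)   # hidden layer: 5 units
W2 = rng.normal(scale=0.5, size=(5, 1)); b2 = np.zeros(1)   # output layer: 1 unit
eta = 0.5

for step in range(5000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)
    # reverse pass: measure each layer's error contribution
    d_out = (y_hat - y) * y_hat * (1 - y_hat)   # dLoss/dz_out for squared error
    d_hid = (d_out @ W2.T) * h * (1 - h)        # dLoss/dz_hidden
    # Gradient Descent step
    W2 -= eta * h.T @ d_out;  b2 -= eta * d_out.sum(axis=0)
    W1 -= eta * X.T @ d_hid;  b1 -= eta * d_hid.sum(axis=0)

print(np.round(y_hat.ravel(), 2))   # typically approaches [0, 1, 1, 0]; backprop can occasionally get stuck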

The step function is replaced with the logistic function, $\sigma(z) = 1 / (1 + \exp(-z))$. This was essential because the step function contains only flat segments, so there is no gradient to work with (Gradient Descent cannot move on a flat surface), while the logistic function has a well-defined nonzero derivative everywhere, allowing Gradient Descent to make some progress at every step.

Two other popular activation functions are:

The hyperbolic tangent function $\tanh(z) = 2\sigma(2z) - 1$.
Just like the logistic function it is S-shaped, continuous, and differentiable, but its output value ranges from –1 to 1 (instead of 0 to 1 in the case of the logistic function), which tends to make each layer’s output more or less normalized (i.e., centered around 0) at the beginning of training. This often helps speed up convergence.

The ReLU function $\mathrm{ReLU}(z) = \max(0, z)$. It is continuous but unfortunately not differentiable at $z = 0$ (the slope changes abruptly, which can make Gradient Descent bounce around). However, in practice it works very well and has the advantage of being fast to compute.
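A quick NumPy summary of these activation functions and their derivatives (my own sketch, not from the book):

import numpy as np

def sigmoid(z):                      # logistic: output in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):                    # derivative: sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1 - s)

def tanh(z):                         # output in (-1, 1); equals 2*sigmoid(2z) - 1
    return np.tanh(z)

def d_tanh(z):
    return 1 - np.tanh(z) ** 2

def relu(z):                         # max(0, z): cheap to compute, gradient is 1 for z > 0
    return np.maximum(0, z)

def d_relu(z):                       # undefined at z = 0; 0 is used by convention here
    return (z > 0).astype(float)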

When the classes are exclusive (e.g., classes 0 through 9 for digit image classification), the output layer is typically modified by replacing the individual activation functions by a shared softmax function. The output of each neuron corresponds to the estimated probability of the corresponding class.
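A small sketch of the softmax function itself (my own illustration): it exponentiates each logit and normalizes, so the outputs are positive and sum to 1 and can be read as class probability estimates.

import numpy as np

def softmax(logits):
    # subtract the max for numerical stability (does not change the result)
    exp = np.exp(logits - np.max(logits, axis=-1, keepdims=True))
    return exp / np.sum(exp, axis=-1, keepdims=True)

print(softmax(np.array([2.0, 1.0, 0.1])))  # roughly [0.66, 0.24, 0.10]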

10.2 Training an MLP with TensorFlow’s High-Level API

# load the MNIST dataset
import numpy as np
import tensorflow as tf
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
X_train = X_train.astype(np.float32).reshape(-1, 28*28) / 255.0
X_test = X_test.astype(np.float32).reshape(-1, 28*28) / 255.0
y_train = y_train.astype(np.int32)
y_test = y_test.astype(np.int32)
X_valid, X_train = X_train[:5000], X_train[5000:]
y_valid, y_train = y_train[:5000], y_train[5000:]
# train a DNN classifier with the TF.Learn (tf.contrib.learn) API
feature_columns = tf.contrib.learn.infer_real_valued_columns_from_input(X_train)
dnn_clf = tf.contrib.learn.DNNClassifier(hidden_units=[300, 100], n_classes=10,
                                         feature_columns=feature_columns)
dnn_clf.fit(x=X_train, y=y_train, batch_size=50, steps=40000)
# prediction and evaluation
from sklearn.metrics import accuracy_score
y_pred = list(dnn_clf.predict(X_test))
accuracy_score(y_test, y_pred)   # 0.9759

# or use the TF.Learn API for evaluation
dnn_clf.evaluate(X_test, y_test)
# {'loss': 0.167284, 'accuracy': 0.9759, 'global_step': 40000}

You can change the activation function by setting the `activation_fn` hyperparameter.
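Note that tf.contrib.learn has been removed from TensorFlow 2. A roughly equivalent model can be built with the tf.keras API (a sketch; the layer sizes mirror the DNNClassifier above, the other hyperparameters are my own choices):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(300, activation="relu", input_shape=(28 * 28,)),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="sgd", metrics=["accuracy"])
model.fit(X_train, y_train, epochs=20, batch_size=50,
          validation_data=(X_valid, y_valid))
model.evaluate(X_test, y_test)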

10.3 Training a DNN Using Plain TensorFlow

TensorFlow’s lower-level Python API provides more control over the architecture of the network.

We will implement Mini-batch Gradient Descent to train a DNN on the MNIST dataset.

The first step is the construction phase, building the TensorFlow graph.

The second step is the execution phase, where you actually run the graph to train the model.

10.3.1 Construction Phase

Specify the number of inputs and outputs, and set the number of hidden neurons in each layer:

import tensorflow as tf

n_inputs = 28 * 28   # MNIST images are 28x28 pixels
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10

Use placeholder nodes to represent the training data and targets

X=tf.placeholder(tf.float32,shape=(None,n_inputs),name="X")
y=tf.placeholder(tf.int64,shape=(None),name="y")

Create a `neuron_layer()` function to create one layer at a time, and specify the inputs, the number of neurons, the activation function, and the name of the layer in the parameter list:

def neuron_layer(X, n_neurons, name, activation=None):
    with tf.name_scope(name):
        n_inputs = int(X.get_shape()[1])
        stddev = 2 / np.sqrt(n_inputs)   # standard deviation of the weight initialization
        init = tf.truncated_normal((n_inputs, n_neurons), stddev=stddev)
        # a truncated normal ensures there are no very large weights
        W = tf.Variable(init, name="weights")
        b = tf.Variable(tf.zeros([n_neurons]), name="biases")
        # b has shape [n_neurons]; it is broadcast across the batch dimension
        z = tf.matmul(X, W) + b
        if activation == "relu":
            return tf.nn.relu(z)
        else:
            return z

Create the deep neural network:

with tf.name_scope("dnn"):
    hidden1=neuron_layer(X,n_hidden1,"hidden1",activation="relu")
    hidden2=neuron_layer(hidden1,n_hidden2,"hidden2",activation="relu")
    logits=neuron_layer(hidden2,n_outputs,"outputs") 

As an alternative, use TensorFlow’s `tf.layers.dense()` function:

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1", activation=tf.nn.relu)
    hidden2 = tf.layers.dense(hidden1, n_hidden2, name="hidden2", activation=tf.nn.relu)
    logits = tf.layers.dense(hidden2, n_outputs, name="outputs")

Use cross entropy as the cost function. The `sparse_softmax_cross_entropy_with_logits()` function is equivalent to applying the softmax activation function and then computing the cross entropy.

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")
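To see what this function computes, here is a NumPy sketch (my own illustration): for each instance it returns the negative log of the softmax probability assigned to the true class.

import numpy as np

def softmax(z):
    exp = np.exp(z - z.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

def sparse_softmax_xentropy(labels, logits):
    probs = softmax(logits)
    return -np.log(probs[np.arange(len(labels)), labels])

logits_ex = np.array([[0.2, 0.3, 0.5], [0.7, 0.1, 0.2]])
labels_ex = np.array([2, 1])
print(sparse_softmax_xentropy(labels_ex, logits_ex))  # one loss value per instance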

Define a `GradientDescentOptimizer` that will tweak the model parameters to minimize the cost function.

learning_rate=0.01
with tf.name_scope("train"):
    optimizer=tf.train.GradientDescentOptimizer(learning_rate)
    training_op=optimizer.minimize(loss)

Evaluate the model.

with tf.name_scope("eval"):
    correct=tf.nn.in_top_k(logits,y,1)
    accuracy=tf.reduce_mean(tf.cast(correct,tf.float32))

`in_top_k(predictions, targets, k, name=None)` says whether the targets are in the top $k$ predictions. For example, assume the `batch_size` is 2, and the number of classes is 3.

logits = [[0.2, 0.3, 0.5], [0.7, 0.1, 0.2]]
y = [2, 1]
# asks whether the element at index 2 of [0.2, 0.3, 0.5] (which is 0.5) is the top-1 value
# asks whether the element at index 1 of [0.7, 0.1, 0.2] (which is 0.1) is the top-1 value
correct = tf.nn.in_top_k(logits, y, 1)
# [True, False]
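For k = 1 this amounts to checking that the target class has the highest logit, which can be reproduced in NumPy (my own illustration):

import numpy as np

logits_np = np.array([[0.2, 0.3, 0.5], [0.7, 0.1, 0.2]])
y_np = np.array([2, 1])
correct_np = np.argmax(logits_np, axis=1) == y_np   # top-1 check
print(correct_np)  # [ True False]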

Create a node to initialize all variables, and a Saver to save the trained model parameters to disk:

init=tf.global_variables_initializer()
saver=tf.train.Saver()

10.3.2 Execution Phase

Load the MNIST dataset with TensorFlow as in Section 10.2. Note that tf.examples.tutorials.mnist (used in the book) is deprecated.

Define the number of epochs and the size of the mini-batches.

n_epochs=400
batch_size=50

Define a generator that shuffles the training set and yields one mini-batch at a time:

def shuffle_batch(X, y, batch_size):
    rnd_idx = np.random.permutation(len(X))
    n_batches = len(X) // batch_size
    for batch_idx in np.array_split(rnd_idx, n_batches):
        X_batch, y_batch = X[batch_idx], y[batch_idx]
        yield X_batch, y_batch

Train the model:

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_batch = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, "Batch accuracy:", acc_batch, "Val accuracy:", acc_val)
    save_path = saver.save(sess, "./my_model_final.ckpt")

10.3.3 Using the Neural Network

Restore the trained model and use it to make predictions:

with tf.Session() as sess:
    saver.restore(sess, save_path)   # the path returned by saver.save(), "./my_model_final.ckpt"
    X_new_scaled = X_test[:20]
    Z = logits.eval(feed_dict={X: X_new_scaled})
    y_pred = np.argmax(Z, axis=1)
y_pred
#array([7, 2, 1, 0, 4, 1, 4, 9, 5, 9, 0, 6, 9, 0, 1, 5, 9, 7, 3, 4],
#      dtype=int64)
y_test[:20]
#array([7, 2, 1, 0, 4, 1, 4, 9, 5, 9, 0, 6, 9, 0, 1, 5, 9, 7, 3, 4])

10.4 Fine-Tuning Neural Network Hyperparameters

The flexibility of neural networks is also one of their main drawbacks: there are many hyperparameters to tweak.

  • any imaginable network topology (how neurons are interconnected)
  • the number of layers (even in a simple MLP)
  • the number of neurons per layer
  • the type of activation function to use in each layer
  • the weight initialization logic

It is better to use randomized search than grid search, since grid search is far more time-consuming when the hyperparameter space is large.
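For example, a sketch of randomized search using Scikit-Learn's RandomizedSearchCV with an MLPClassifier (the estimator and search space are my own choices for illustration; the book tunes TensorFlow models instead):

from scipy.stats import reciprocal
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPClassifier

iris = load_iris()
param_distribs = {
    "hidden_layer_sizes": [(50,), (100,), (100, 50)],   # network topology
    "activation": ["relu", "tanh"],                     # activation function
    "learning_rate_init": reciprocal(1e-4, 1e-1),       # log-uniform learning rate
}
rnd_search = RandomizedSearchCV(MLPClassifier(max_iter=1000), param_distribs,
                                n_iter=10, cv=3, random_state=42)
rnd_search.fit(iris.data, iris.target)
print(rnd_search.best_params_)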

Another option is to use a tool such as Oscar (http://oscar.calldesk.ai/).

10.4.1 Number of Hidden Layers

Deep networks have a much higher parameter efficiency than shallow ones: they can model complex functions using exponentially fewer neurons than shallow nets, making them much faster to train.

10.4.2 Number of Neurons per Hidden Layer

A simpler approach is to pick a model with more layers and neurons than you actually need, then use early stopping to prevent it from overfitting (and other regularization techniques, especially dropout, as we will see in Chapter 11).
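With the tf.keras API, early stopping can be added with an EarlyStopping callback (a sketch, assuming a compiled model and a validation set like those in Section 10.2's Keras example are available):

import tensorflow as tf

# stop training once the validation loss has not improved for 10 epochs,
# and roll back to the best weights seen so far
early_stopping = tf.keras.callbacks.EarlyStopping(patience=10,
                                                  restore_best_weights=True)
history = model.fit(X_train, y_train, epochs=400, batch_size=50,
                    validation_data=(X_valid, y_valid),
                    callbacks=[early_stopping])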

10.4.3 Activation Functions

For the hidden layers, ReLU (or one of its variants) is a good default in most cases.

For the output layer, the softmax activation function is generally a good choice for classification tasks; for regression tasks, use no activation function at all.
