import tensorflow as tf
import numpy as np
import os
import matplotlib.pyplot as plt
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'  # suppress the AVX2 warning
Define common functions
Preparation for some common functions
# to make output stable across runs
def reset_graph(seed=22):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)

# to plot pretty figures
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

# where to save the figures
PROJECT_ROOT_DIR = "."
CHARTER_ID = "rnn"

def save_fig(fig_id, tight_layout=True):
    path = os.path.join(PROJECT_ROOT_DIR, "images", CHARTER_ID)
    if not os.path.exists(path):
        os.makedirs(path)
    fig_path = os.path.join(path, fig_id + ".png")
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(fig_path, format='png')  # no dpi set, so the figure displays normally in the IDE
    # plt.savefig(fig_path, format='png', dpi=300)
Basic RNNs in TensorFlow
Manual RNN
First, let’s implement a very simple RNN model, without using any of TensorFlow’s RNN operations, to better understand what goes on under the hood. We will create an RNN composed of a layer of five recurrent neurons, using the tanh activation function. We will assume that the RNN runs over only two time steps, taking input vectors of size 3 at each time step.
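As a rough illustration of what such a model computes, here is a minimal NumPy sketch of the two-step forward pass (it is not TensorFlow code). Only the sizes (3 inputs, 5 neurons, 2 time steps) and the tanh activation come from the text; the weight names Wx, Wy and b, the random initialization, and the input batches are assumptions made for this example.

```python
import numpy as np

n_inputs = 3   # size of the input vector at each time step
n_neurons = 5  # number of recurrent neurons in the layer

rng = np.random.RandomState(22)
# Hypothetical parameters, shared by both time steps
Wx = rng.normal(size=(n_inputs, n_neurons))   # input -> neurons
Wy = rng.normal(size=(n_neurons, n_neurons))  # previous output -> neurons
b = np.zeros((1, n_neurons))                  # bias

def manual_rnn(X0, X1):
    """Run the layer over two time steps and return both outputs."""
    Y0 = np.tanh(X0 @ Wx + b)            # t = 0: there is no previous output yet
    Y1 = np.tanh(Y0 @ Wy + X1 @ Wx + b)  # t = 1: also uses the output of t = 0
    return Y0, Y1

# A mini-batch of 4 instances, one array per time step
X0_batch = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 0, 1]], dtype=float)
X1_batch = np.array([[9, 8, 7], [0, 0, 0], [6, 5, 4], [3, 2, 1]], dtype=float)

Y0_val, Y1_val = manual_rnn(X0_batch, X1_batch)
print(Y0_val.shape, Y1_val.shape)  # (4, 5) (4, 5)
```

Each time step needs its own input array and its own line of computation, which is why this style does not scale to long sequences.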
That wasn’t too hard, but of course if you want to be able to run an RNN over 100 time steps, the graph is going to be pretty big. Now let’s look at how to create the same model using TensorFlow’s RNN operations.
The static_rnn() function creates an unrolled RNN network by chaining cells. The static_rnn() function returns two objects.
The first is a Python list containing the output tensors for each time step.
The second is a tensor containing the final states of the network.
If there were 50 time steps, it would not be very convenient to have to define 50 input placeholders and 50 output tensors. Moreover, at execution time you would have to feed each of the 50 placeholders and manipulate the 50 outputs. Let’s simplify this.
The simplified version takes a single input placeholder of shape [None, n_steps, n_inputs], where the first dimension is the mini-batch size.
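One way to picture this is the NumPy sketch below: the single [batch_size, n_steps, n_inputs] array is split into one array per time step, the cells are chained, and the per-step outputs are stacked back into one array. The function name unrolled_rnn and the random weights are assumptions for illustration; the sketch mirrors the idea of an unrolled network that returns per-step outputs plus a final state, not TensorFlow's actual implementation.

```python
import numpy as np

n_steps, n_inputs, n_neurons = 2, 3, 5
rng = np.random.RandomState(22)
Wx = rng.normal(size=(n_inputs, n_neurons))   # hypothetical input weights
Wy = rng.normal(size=(n_neurons, n_neurons))  # hypothetical recurrent weights
b = np.zeros(n_neurons)

def unrolled_rnn(X):
    """X has shape [batch_size, n_steps, n_inputs]."""
    # Swap the first two axes, then split into one [batch_size, n_inputs]
    # array per time step.
    X_seqs = list(np.transpose(X, (1, 0, 2)))
    state = np.zeros((X.shape[0], n_neurons))
    output_seqs = []
    for X_t in X_seqs:  # one chained cell per time step
        state = np.tanh(X_t @ Wx + state @ Wy + b)
        output_seqs.append(state)
    # Stack the per-step outputs and swap the axes back to
    # [batch_size, n_steps, n_neurons].
    outputs = np.transpose(np.stack(output_seqs), (1, 0, 2))
    return outputs, state

X_batch = np.array([
    [[0, 1, 2], [9, 8, 7]],
    [[3, 4, 5], [0, 0, 0]],
    [[6, 7, 8], [6, 5, 4]],
    [[9, 0, 1], [3, 2, 1]],
], dtype=float)

outputs_val, final_state = unrolled_rnn(X_batch)
print(outputs_val.shape)  # (4, 2, 5)
```

The final state is simply the output of the last time step, so only one array has to be fed and one array read back.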
However, the above approach still builds a graph containing one cell per time step. If there were 50 time steps, the graph would look pretty ugly.
With such a large graph, you may even get out-of-memory (OOM) errors during backpropagation (especially with the limited memory of GPU cards), since it must store all tensor values during the forward pass so it can use them to compute gradients during the reverse pass.
The dynamic_rnn() function uses a while_loop() operation to run over the cell the appropriate number of times.
So far we have used only fixed-size input sequences (all exactly two steps long). What if the input sequences have variable lengths (e.g., like sentences)?
In this case you should set the sequence_length parameter when calling the dynamic_rnn() (or static_rnn()) function; it must be a 1D tensor indicating the length of the input sequence for each instance.
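The effect of sequence_length on the results can be sketched in NumPy. The sketch assumes the behavior documented for dynamic_rnn: once an instance's sequence has ended, its outputs are zero vectors and its state is carried over unchanged. The function name rnn_with_seq_length and the random weights are hypothetical.

```python
import numpy as np

def rnn_with_seq_length(X, seq_length, Wx, Wy, b):
    """X: [batch_size, n_steps, n_inputs]; seq_length: [batch_size]."""
    batch_size, n_steps, _ = X.shape
    n_neurons = Wy.shape[0]
    state = np.zeros((batch_size, n_neurons))
    outputs = np.zeros((batch_size, n_steps, n_neurons))
    for t in range(n_steps):
        new_state = np.tanh(X[:, t] @ Wx + state @ Wy + b)
        active = (t < seq_length)[:, None]  # True while the sequence has not ended
        outputs[:, t] = np.where(active, new_state, 0.0)  # zero vector past the end
        state = np.where(active, new_state, state)        # last valid state is kept
    return outputs, state

n_inputs, n_neurons = 3, 5
rng = np.random.RandomState(22)
Wx = rng.normal(size=(n_inputs, n_neurons))   # hypothetical weights
Wy = rng.normal(size=(n_neurons, n_neurons))
b = np.zeros(n_neurons)

X_batch = np.array([
    [[0, 1, 2], [9, 8, 7]],
    [[3, 4, 5], [0, 0, 0]],  # padded with a zero vector
    [[6, 7, 8], [6, 5, 4]],
    [[9, 0, 1], [3, 2, 1]],
], dtype=float)
seq_length_batch = np.array([2, 1, 2, 2])

outputs_val, states_val = rnn_with_seq_length(X_batch, seq_length_batch, Wx, Wy, b)
```

For the second instance, whose length is 1, the output at step 1 is a zero vector and the final state equals the output at step 0.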
n_steps = 2
n_inputs = 3
n_neurons = 5

reset_graph()

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
basic_cell = tf.nn.rnn_cell.BasicRNNCell(num_units=n_neurons)
seq_length = tf.placeholder(tf.int32, [None])
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32,
                                    sequence_length=seq_length)

init = tf.global_variables_initializer()

# Mini-batch
X_batch = np.array([
    # step 0     step 1
    [[0, 1, 2], [9, 8, 7]],  # instance 1
    [[3, 4, 5], [0, 0, 0]],  # instance 2 (padded with zero vectors)
    [[6, 7, 8], [6, 5, 4]],  # instance 3
    [[9, 0, 1], [3, 2, 1]],  # instance 4
])
seq_length_batch = np.array([2, 1, 2, 2])

with tf.Session() as sess:
    init.run()
    outputs_val, states_val = sess.run(
        [outputs, states], feed_dict={X: X_batch, seq_length: seq_length_batch})

print(outputs_val)
"""
[[[ 0.68722075 -0.509747 0.70100015 0.89299554 0.15244864]
[ 0.9999998 -0.9999988 -0.9972714 0.99999833 0.9986311 ]] # final state
[[ 0.9982136 -0.9972284 0.35704085 0.999562 0.8413011 ] # final state
[ 0. 0. 0. 0. 0. ]] # zero vector
[[ 0.9999914 -0.99998826 -0.12167802 0.99999833 0.9800005 ]
[ 0.9999898 -0.99922127 -0.9930421 0.9968625 0.99406075]] # final state
[[ 0.9998404 -0.9999913 -0.9925783 0.8807097 -0.57542574]
 [ 0.98173416 -0.96920955 -0.9693667 0.70200425 0.72634375]]] # final state
"""
print(states_val)  # final state of each instance, i.e. its output at its last valid time step
"""
[[ 0.9999998 -0.9999988 -0.9972714 0.99999833 0.9986311 ] # t = 1
[ 0.9982136 -0.9972284 0.35704085 0.999562 0.8413011 ] # t = 0 !!!
[ 0.9999898 -0.99922127 -0.9930421 0.9968625 0.99406075] # t = 1
[ 0.98173416 -0.96920955 -0.9693667 0.70200425 0.72634375]] # t = 1
"""
Hands-on practice [1] | [Classification]
Training a Sequence Classifier
Okay, now you know how to build an RNN network (or more precisely an RNN network unrolled through time). But how do you train it?
Let’s train an RNN to classify MNIST images. We will treat each image as a sequence of 28 rows of 28 pixels each (since each MNIST image is 28 × 28 pixels). We will use cells of 150 recurrent neurons, plus a fully connected layer containing 10 neurons (one per class) connected to the output of the last time step, followed by a softmax layer.
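Treating an image as a sequence amounts to a reshape. The short NumPy sketch below uses random values as a stand-in for a mini-batch of flattened MNIST images (an assumption, so that it runs without downloading the dataset) and checks that time step t of an image is just row t of that image.

```python
import numpy as np

n_steps = 28   # one time step per image row
n_inputs = 28  # one input per pixel in the row

# Stand-in for a mini-batch of flattened MNIST images (random values, not real data)
rng = np.random.RandomState(22)
X_flat = rng.rand(150, 28 * 28).astype(np.float32)

# Each image becomes a sequence of 28 row vectors of 28 pixels each
X_seq = X_flat.reshape((-1, n_steps, n_inputs))
print(X_seq.shape)  # (150, 28, 28)

# Time step t of image i is row t of that image
i, t = 0, 5
row = X_flat[i, t * n_inputs:(t + 1) * n_inputs]
same_row = np.array_equal(X_seq[i, t], row)
```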
Now let’s take a look at how to handle time series, such as stock prices, air temperature, brain wave patterns, and so on.
In this section we will train an RNN to predict the next value in a generated time series.
Each training instance is a randomly selected sequence of 20 consecutive values from the time series, and the target sequence is the same as the input sequence, except it is shifted by one time step into the future.
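A minimal sketch of how such training batches could be generated is shown below. The series itself (time_series), the time range and the resolution are assumptions chosen for illustration; only the sequence length of 20 and the one-step shift between inputs and targets come from the text.

```python
import numpy as np

n_steps = 20  # length of each training sequence

def time_series(t):
    # Hypothetical series, used only for this illustration
    return np.sin(t) + 0.5 * np.sin(5 * t)

def next_batch(batch_size, n_steps, rng, t_min=0.0, t_max=30.0, resolution=0.1):
    # Random start time per instance, leaving room for n_steps + 1 values
    t0 = rng.rand(batch_size, 1) * (t_max - t_min - n_steps * resolution) + t_min
    ts = t0 + np.arange(n_steps + 1) * resolution  # [batch_size, n_steps + 1]
    ys = time_series(ts)
    X = ys[:, :-1].reshape(-1, n_steps, 1)  # inputs: values 0 .. n_steps - 1
    y = ys[:, 1:].reshape(-1, n_steps, 1)   # targets: the same values shifted by one step
    return X, y

rng = np.random.RandomState(22)
X_batch, y_batch = next_batch(50, n_steps, rng)
print(X_batch.shape, y_batch.shape)  # (50, 20, 1) (50, 20, 1)
```

Within each instance, the target at step t is the input at step t + 1, so the network is always asked to predict one step ahead.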
At each time step we now have an output vector of size 100. But what we actually want is a single output value at each time step. The simplest solution is to wrap the cell in an OutputProjectionWrapper.
The OutputProjectionWrapper adds a fully connected layer of linear neurons (i.e., without any activation function) on top of each output (but it does not affect the cell state).
Although using an OutputProjectionWrapper is the simplest solution to reduce the dimensionality of the RNN’s output sequences down to just one value per time step (per instance), it is not the most efficient.
There is a trickier but more efficient solution: you can reshape the RNN outputs from [batch_size, n_steps, n_neurons] to [batch_size * n_steps, n_neurons], then apply a single fully connected layer with the appropriate output size (in our case just 1), which will result in an output tensor of shape [batch_size * n_steps, n_outputs], and then reshape this tensor to [batch_size, n_steps, n_outputs].
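The reshape trick can be checked with plain NumPy. In the sketch below, random values stand in for the RNN outputs, and W and b are hypothetical weights of the fully connected layer; the point is only that one matrix product on the stacked outputs gives the same result as projecting every time step separately.

```python
import numpy as np

batch_size, n_steps, n_neurons, n_outputs = 4, 20, 100, 1
rng = np.random.RandomState(22)

rnn_outputs = rng.normal(size=(batch_size, n_steps, n_neurons))  # stand-in for the RNN outputs
W = rng.normal(size=(n_neurons, n_outputs))  # hypothetical fully connected weights
b = np.zeros(n_outputs)

stacked_rnn_outputs = rnn_outputs.reshape(-1, n_neurons)   # [batch_size * n_steps, n_neurons]
stacked_outputs = stacked_rnn_outputs @ W + b              # [batch_size * n_steps, n_outputs]
outputs = stacked_outputs.reshape(-1, n_steps, n_outputs)  # [batch_size, n_steps, n_outputs]

# Reference: project each time step on its own, then stack along the time axis
per_step = np.stack([rnn_outputs[:, t] @ W + b for t in range(n_steps)], axis=1)
print(outputs.shape)  # (4, 20, 1)
```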
If you build a very deep RNN, it may end up overfitting the training set. To prevent that, a common technique is to apply dropout.
You can simply add a dropout layer before or after the RNN as usual, but if you also want to apply dropout between the RNN layers, you need to use a DropoutWrapper.
Note: the input_keep_prob parameter can be a placeholder, making it possible to set it to any value you want during training, and to 1.0 during testing (effectively turning dropout off).
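What a keep probability does to a layer's inputs can be sketched in NumPy. The sketch assumes inverted dropout (surviving values are scaled by 1 / keep_prob, so the expected value is unchanged); the function name dropout and the all-ones input are made up for this example.

```python
import numpy as np

def dropout(inputs, keep_prob, rng):
    """Keep each value with probability keep_prob and scale the
    survivors by 1 / keep_prob."""
    if keep_prob >= 1.0:
        return inputs  # dropout is effectively off (e.g. during testing)
    mask = rng.rand(*inputs.shape) < keep_prob
    return inputs * mask / keep_prob

rng = np.random.RandomState(22)
layer_inputs = np.ones((1000, 100))  # stand-in for the inputs to an RNN layer

train_out = dropout(layer_inputs, keep_prob=0.5, rng=rng)  # training
test_out = dropout(layer_inputs, keep_prob=1.0, rng=rng)   # testing
```

With keep_prob=0.5 roughly half of the values are zeroed and the rest are doubled; with keep_prob=1.0 the inputs pass through untouched.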
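To train an RNN on long sequences, you will need to run it over many time steps, making the unrolled RNN a very deep network. Like any deep neural network, it may suffer from the vanishing/exploding gradients problem and take a very long time to train.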
Many of the tricks we discussed to alleviate this problem can be used for deep unrolled RNNs as well: good parameter initialization, nonsaturating activation functions (e.g., ReLU), Batch Normalization, Gradient Clipping, and faster optimizers. However, if the RNN needs to handle even moderately long sequences (e.g., 100 inputs), then training will still be very slow.
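Of the tricks listed, gradient clipping is the easiest to show in isolation. The NumPy sketch below clips a gradient by its L2 norm; the function name clip_by_norm and the example gradients are assumptions for illustration, and clipping by norm is only one of several clipping variants.

```python
import numpy as np

def clip_by_norm(grad, clip_norm):
    """Rescale grad so that its L2 norm is at most clip_norm."""
    norm = np.sqrt(np.sum(grad ** 2))
    if norm > clip_norm:
        return grad * (clip_norm / norm)
    return grad  # already small enough, left unchanged

large_grad = np.array([30.0, -40.0])  # L2 norm 50
small_grad = np.array([0.3, -0.4])    # L2 norm 0.5

clipped_large = clip_by_norm(large_grad, clip_norm=1.0)
clipped_small = clip_by_norm(small_grad, clip_norm=1.0)
```

The large gradient keeps its direction but is shrunk to norm 1, while the small one is returned as is.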
Besides the long training time, a second problem faced by long-running RNNs is the fact that the memory of the first inputs gradually fades away.
If you consider the LSTM cell as a black box, it can be used very much like a basic cell, except it will perform much better; training will converge faster and it will detect long-term dependencies in the data.
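To peek inside the black box, here is a simplified NumPy sketch of a stack of LSTM layers. It is not a reimplementation of BasicLSTMCell (for instance, it has no forget-gate bias offset), and the function lstm_step and the random parameters are hypothetical. What it shows is that each layer carries a pair of states (c, h), which is why the top layer's h state is read as states[-1][1].

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, c_prev, h_prev, W, b):
    """One step of a simplified LSTM cell.
    W: [n_inputs + n_neurons, 4 * n_neurons], b: [4 * n_neurons]."""
    z = np.concatenate([x, h_prev], axis=1) @ W + b
    i, g, f, o = np.split(z, 4, axis=1)  # input gate, candidate, forget gate, output gate
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)  # long-term state
    h = sigmoid(o) * np.tanh(c)                        # short-term state, also the output
    return c, h

n_steps, n_inputs, n_neurons, n_layers, batch = 28, 28, 150, 3, 4
rng = np.random.RandomState(22)
X = rng.rand(batch, n_steps, n_inputs)  # stand-in for a mini-batch of images

# Hypothetical random parameters, one (W, b) pair per layer
params = []
for layer in range(n_layers):
    in_size = n_inputs if layer == 0 else n_neurons
    W = rng.normal(scale=0.1, size=(in_size + n_neurons, 4 * n_neurons))
    params.append((W, np.zeros(4 * n_neurons)))

# One (c, h) pair per layer, all starting at zero
states = [(np.zeros((batch, n_neurons)), np.zeros((batch, n_neurons)))
          for _ in range(n_layers)]
for t in range(n_steps):
    layer_input = X[:, t]
    for layer, (W, b) in enumerate(params):
        c, h = lstm_step(layer_input, states[layer][0], states[layer][1], W, b)
        states[layer] = (c, h)
        layer_input = h  # the output of one layer feeds the next layer

top_layer_h_state = states[-1][1]  # h state of the last (top) layer
print(top_layer_h_state.shape)     # (4, 150)
```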
reset_graph()

n_steps = 28
n_inputs = 28
n_neurons = 150
n_outputs = 10
n_layers = 3
learning_rate = 0.001

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.int32, [None])

lstm_cell = [tf.nn.rnn_cell.BasicLSTMCell(num_units=n_neurons)
             for layer in range(n_layers)]
multi_cell = tf.nn.rnn_cell.MultiRNNCell(lstm_cell)
outputs, states = tf.nn.dynamic_rnn(multi_cell, X, dtype=tf.float32)
top_layer_h_state = states[-1][1]
logits = tf.layers.dense(top_layer_h_state, n_outputs, name="softmax")
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
loss = tf.reduce_mean(xentropy, name="loss")
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)
correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

init = tf.global_variables_initializer()

print(states)
print(top_layer_h_state)

(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
X_train = X_train.astype(np.float32).reshape(-1, 28*28) / 255.0
X_test = X_test.astype(np.float32).reshape(-1, 28*28) / 255.0
y_train = y_train.astype(np.int32)
y_test = y_test.astype(np.int32)
X_valid, X_train = X_train[:5000], X_train[5000:]
y_valid, y_train = y_train[:5000], y_train[5000:]

def shuffle_batch(X, y, batch_size):
    rnd_idx = np.random.permutation(len(X))
    n_batches = len(X) // batch_size
    for batch_idx in np.array_split(rnd_idx, n_batches):
        X_batch, y_batch = X[batch_idx], y[batch_idx]
        yield X_batch, y_batch

n_epochs = 10
batch_size = 150  # added
X_test = X_test.reshape(-1, n_steps, n_inputs)

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            X_batch = X_batch.reshape((-1, n_steps, n_inputs))
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_batch = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: X_test, y: y_test})
        print(epoch, "Last batch accuracy:", acc_batch, "Test accuracy:", acc_test)
"""
0 Last batch accuracy: 0.96 Test accuracy: 0.9496
1 Last batch accuracy: 0.9533333 Test accuracy: 0.968
2 Last batch accuracy: 0.99333334 Test accuracy: 0.9781
3 Last batch accuracy: 0.9866667 Test accuracy: 0.9789
4 Last batch accuracy: 0.97333336 Test accuracy: 0.9824
5 Last batch accuracy: 1.0 Test accuracy: 0.9866
6 Last batch accuracy: 0.97333336 Test accuracy: 0.9835
7 Last batch accuracy: 1.0 Test accuracy: 0.9867
8 Last batch accuracy: 0.9866667 Test accuracy: 0.9878
9 Last batch accuracy: 0.99333334 Test accuracy: 0.9868
"""
Hands-on practice [5] | [Word embeddings]
Natural Language Processing
Word Embeddings
Most of the state-of-the-art NLP applications, such as machine translation, automatic summarization, parsing, sentiment analysis, and more, are now based (at least in part) on RNNs.
from six.moves import urllib
import errno
import zipfile
from collections import Counter

WORDS_PATH = "datasets/words"
WORDS_URL = 'http://mattmahoney.net/dc/text8.zip'

def fetch_words_data(words_url=WORDS_URL, words_path=WORDS_PATH):
    os.makedirs(words_path, exist_ok=True)
    zip_path = os.path.join(words_path, "words.zip")
    if not os.path.exists(zip_path):
        urllib.request.urlretrieve(words_url, zip_path)
    with zipfile.ZipFile(zip_path) as f:
        data = f.read(f.namelist()[0])
    return data.decode("ascii").split()

words = fetch_words_data()
print(words[:5])   # ['anarchism', 'originated', 'as', 'a', 'term']
print(len(words))  # 17005207
vocabulary_size = 50000
# keep the 50000 most common words; "UNK" stands for unknown words
vocabulary = [("UNK", None)] + Counter(words).most_common(vocabulary_size - 1)
vocabulary = np.array([word for word, _ in vocabulary])
dictionary = {word: code for code, word in enumerate(vocabulary)}
data = np.array([dictionary.get(word, 0) for word in words])

print((" ".join(words[:9]), data[:9]))
# ('anarchism originated as a term of abuse first used', array([5234, 3081, 12, 6, 195, 2, 3134, 46, 59]))
print(" ".join([vocabulary[word_index] for word_index in [5241, 3081, 12, 6, 195, 2, 3134, 46, 59]]))
# cycles originated as a term of abuse first used
print((words[24], data[24]))
# ('culottes', 0)
# i.e. the vocabulary does not contain the word "culottes"

# Generate batches
from collections import deque
def generate_batch(batch_size, num_skips, skip_window):
    global data_index
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=[batch_size], dtype=np.int32)
    labels = np.ndarray(shape=[batch_size, 1], dtype=np.int32)
    span = 2 * skip_window + 1  # [ skip_window target skip_window ]
    buffer = deque(maxlen=span)
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    for i in range(batch_size // num_skips):
        target = skip_window  # target label at the center of the buffer
        targets_to_avoid = [skip_window]
        for j in range(num_skips):
            while target in targets_to_avoid:
                target = np.random.randint(0, span)
            targets_to_avoid.append(target)
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[target]
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    return batch, labels

np.random.seed(22)
data_index = 0
batch, labels = generate_batch(8, 2, 1)
print((batch, [vocabulary[word] for word in batch]))
print((labels, [vocabulary[word] for word in labels[:, 0]]))

# Build the model
batch_size = 128
embedding_size = 128  # Dimension of the embedding vector.
skip_window = 1       # How many words to consider left and right.
num_skips = 2         # How many times to reuse an input to generate a label.

# We pick a random validation set to sample nearest neighbors. Here we limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent.
valid_size = 16     # Random set of words to evaluate similarity on.
valid_window = 100  # Only pick dev samples in the head of the distribution.
valid_examples = np.random.choice(valid_window, valid_size, replace=False)
num_sampled = 64    # Number of negative examples to sample.

learning_rate = 0.01

reset_graph()

# Input data
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

# Look up embeddings for inputs.
init_embeds = tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0)
embeddings = tf.Variable(init_embeds)
train_inputs = tf.placeholder(tf.int32, shape=[None])     # from ids...
embed = tf.nn.embedding_lookup(embeddings, train_inputs)  # ...to embeddings

# Construct the variables for the NCE loss
nce_weights = tf.Variable(
    tf.truncated_normal([vocabulary_size, embedding_size],
                        stddev=1.0 / np.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

# Compute the average NCE loss for the batch.
# tf.nn.nce_loss automatically draws a new sample of the negative labels each
# time we evaluate the loss.
loss = tf.reduce_mean(
    tf.nn.nce_loss(nce_weights, nce_biases, train_labels, embed,
                   num_sampled, vocabulary_size))

# Construct the Adam optimizer
optimizer = tf.train.AdamOptimizer(learning_rate)
training_op = optimizer.minimize(loss)

# Compute the cosine similarity between minibatch examples and all embeddings.
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), axis=1, keepdims=True))
normalized_embeddings = embeddings / norm
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)

# Add variable initializer.
init = tf.global_variables_initializer()

# Train the model
num_steps = 10001

with tf.Session() as session:
    init.run()

    average_loss = 0
    for step in range(num_steps):
        print("\rIteration: {}".format(step), end="\t")
        batch_inputs, batch_labels = generate_batch(batch_size, num_skips, skip_window)
        feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}

        # We perform one update step by evaluating the training op (including it
        # in the list of returned values for session.run())
        _, loss_val = session.run([training_op, loss], feed_dict=feed_dict)
        average_loss += loss_val

        if step % 2000 == 0:
            if step > 0:
                average_loss /= 2000
            # The average loss is an estimate of the loss over the last 2000 batches.
            print("Average loss at step ", step, ": ", average_loss)
            average_loss = 0

        # Note that this is expensive (~20% slowdown if computed every 500 steps)
        if step % 10000 == 0:
            sim = similarity.eval()
            for i in range(valid_size):
                valid_word = vocabulary[valid_examples[i]]
                top_k = 8  # number of nearest neighbors
                nearest = (-sim[i, :]).argsort()[1:top_k + 1]
                log_str = "Nearest to %s:" % valid_word
                for k in range(top_k):
                    close_word = vocabulary[nearest[k]]
                    log_str = "%s %s," % (log_str, close_word)
                print(log_str)

    final_embeddings = normalized_embeddings.eval()
np.save("./my_final_embeddings.npy", final_embeddings)

# Plot the embeddings
def plot_with_labels(low_dim_embs, labels):
    assert low_dim_embs.shape[0] >= len(labels), "More labels than embeddings"
    plt.figure(figsize=(18, 18))  # in inches
    for i, label in enumerate(labels):
        x, y = low_dim_embs[i, :]
        plt.scatter(x, y)
        plt.annotate(label,
                     xy=(x, y),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.draw()
    save_fig("word embedding visualization")
    plt.show()

from sklearn.manifold import TSNE

final_embeddings = np.load("./my_final_embeddings.npy")
tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
plot_only = 500
low_dim_embs = tsne.fit_transform(final_embeddings[:plot_only, :])
labels = [vocabulary[i] for i in range(plot_only)]
plot_with_labels(low_dim_embs, labels)
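The training pairs produced by generate_batch() are easier to see on a toy sentence. The standard-library sketch below lists every (center word, context word) pair within skip_window of each center word. The function name skip_gram_pairs is hypothetical, and unlike generate_batch() it works on word strings rather than ids, does not sample num_skips context words at random, and does not wrap around at the end of the data.

```python
from collections import deque

def skip_gram_pairs(tokens, skip_window):
    """Return every (center, context) pair within skip_window of each center word."""
    span = 2 * skip_window + 1  # [ skip_window center skip_window ]
    buffer = deque(tokens[:span], maxlen=span)
    pairs = []
    for next_token in tokens[span:] + [None]:
        center = buffer[skip_window]
        for offset, context in enumerate(buffer):
            if offset != skip_window:  # skip the center word itself
                pairs.append((center, context))
        if next_token is not None:
            buffer.append(next_token)  # slide the window by one word
    return pairs

tokens = "anarchism originated as a term of abuse".split()
pairs = skip_gram_pairs(tokens, skip_window=1)
print(pairs[:4])
# [('originated', 'anarchism'), ('originated', 'as'), ('as', 'originated'), ('as', 'a')]
```

With skip_window=1 each center word has exactly two context words, which matches num_skips=2 in the settings used for training.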