In our previous article we discussed how to train an RNN-based chatbot on an AWS GPU instance. Today we will see how easily the same network can be trained on Google Cloud ML using only the Google Cloud Shell. With the Cloud Shell you will not need to do anything at all on your local machine! For the sake of the experiment, we will use exactly the same network as in the previous article; however, the same approach works for any TensorFlow network.

Special thanks to my patrons from my Patreon page who made the article possible:
Aleksandr Shepeliev, Sergei Ten, Alexey Polietaiev, Никита Пензин, Igor German, Mikhail Sobol, Карнаухов Андрей, Sergei, Matveev Evgeny, Anton Potemkin.
Even though the material in the article is self-contained, I strongly advise you to go through each and every link in order to understand what is going on under the hood.
Prerequisites
There is only one prerequisite the reader must satisfy in order to follow all the steps here: a Google Cloud account with billing enabled.
Let’s start our journey by answering two main questions:
- What is the Google Cloud ML?
- What is the Google Cloud Shell?
What Is The Google Cloud ML?
In order to understand what it is, let’s look into the official definition:
Google Cloud Machine Learning brings the power and flexibility of TensorFlow to the cloud. You can use its components to select and extract features from your data, train your machine learning models, and get predictions using the managed resources of Google Cloud Platform.
I don’t know about you, but for me, this definition does not say much. So, let me explain what it can actually do for you:
- deploy your code to an instance that has everything needed to train a TensorFlow model;
- give your code access to Google Cloud Storage buckets;
- execute your code that is responsible for the training;
- host the trained model in the cloud for you;
- use the trained model to produce predictions for new data.
The focus of the current article will primarily be on the first three items. In the following articles, we will see how to deploy the trained model to the Google Cloud ML and how to get predictions from your cloud-hosted model.
What Is The Google Cloud Shell?
Again, let’s start with the official definition:
Google Cloud Shell is a shell environment for managing resources hosted on Google Cloud Platform.
And, again, it does not say much to me. So, let me explain what this is actually about. The Cloud Shell is, basically, a provisioned instance in the cloud that:
- runs a Debian-based OS;
- gives you shell access via the Web;
- has everything needed to work with the Google Cloud.
Yes, you got it right: a 100% free instance with a shell that you can access from anywhere via the Web.

But nothing comes for free: the Cloud Shell can only be accessed via the Web, with some limitations (personally, I hate using any terminal but iTerm). I have asked on StackOverflow whether it is possible to use the Cloud Shell from your own terminal. For now, you can at least make your life easier by installing a special Chrome plugin that enables terminal-friendly key bindings.
You can read more about Cloud Shell features here.
The Overall Process
The overall process includes the following steps:
- Prepare the Cloud Shell environment for the training
- Prepare the Cloud Storage
- Prepare the data for training
- Prepare the train script
- Test the training process locally
- Train
- Talk to the chatbot
Prepare The Cloud Shell Environment For The Training
Now is the best time to open the Cloud Shell. It is super easy; you just need to:
- open your console: https://console.cloud.google.com/ and
- click the shell icon in the upper right corner:

In case of any problems, here is a small page that describes how to start the Shell in more detail.
All the following examples will be executed in the Cloud Shell.
Also, if this is the first time you are going to use the Cloud ML from the Cloud Shell, you need to prepare all the required dependencies. That can be done by executing a one-liner:
curl https://raw.githubusercontent.com/GoogleCloudPlatform/cloudml-samples/master/tools/setup_cloud_shell.sh | bash
It will install all the required packages. Also, you need to update your PATH variable:
export PATH=${HOME}/.local/bin:${PATH}
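The Cloud Shell keeps your home directory between sessions, but not the environment itself, so if you do not want to re-export PATH every time, you can append that line to your rc file (a small convenience, assuming you use bash; adjust the file name for your shell):
echo 'export PATH=${HOME}/.local/bin:${PATH}' >> ~/.bashrc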
You can check whether everything was installed successfully by running one simple command:
➜ curl https://raw.githubusercontent.com/GoogleCloudPlatform/cloudml-samples/master/tools/check_environment.py | python
...
You are using pip version 8.1.1, however version 9.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Your active configuration is: [cloudshell-12345]
Success! Your environment is configured
Now it’s time to decide which project you will use to train the network. I have a dedicated project for all my ML experiments; anyway, it is up to you which project to choose.
Let me show you my commands that allow me to switch between projects easily:
➜ gprojects
PROJECT_ID      NAME     PROJECT_NUMBER
ml-lab-123456   ml-lab   123456789012
...
➜ gproject ml-lab-123456
Updated property [core/project].
Now, if you would like to use the same magic, here is what you need to add to your .bashrc/.zshrc/other_rc file:
function gproject() { gcloud config set project "$1"; }
function gprojects() { gcloud projects list; }
Okay, so now we have prepared the Cloud Shell and switched to the desired project. What's next? If this is the very first time you are using the Cloud ML with this project, you will be required to initialize it. And again, this can be done with just one line:
➜ gcloud beta ml init-project
Cloud ML needs to add its service accounts to your project (ml-lab-123456) as Editors. This will enable Cloud Machine Learning to access resources in your project when running your training and prediction jobs.
Do you want to continue (Y/n)?
Added serviceAccount:cloud-ml-service@ml-lab-123456-1234a.iam.gserviceaccount.com as an Editor to project 'ml-lab-123456'.
Finally, the Cloud Shell can be considered prepared, and we can move on to the next step.
Prepare The Cloud Storage
First of all, let me explain why we need cloud storage at all. Since we are going to train the model in the cloud, it will not have any access to the local file system of your machine. This means that all the required input needs to be stored somewhere in the cloud, and we also need somewhere to store the output.
Let’s create a brand new bucket that will be used for the training:
➜ PROJECT_NAME=chatbot_generic
➜ TRAIN_BUCKET=gs://${PROJECT_NAME}
➜ gsutil mb ${TRAIN_BUCKET}
Creating gs://chatbot_generic/...
Here I need to tell you something: if you look at the official guide, you will find the following warning:
Warning: You must specify a region (like us-central1) for your bucket, not a multi-region location (like us). Learn more in the development environment overview.
However, if you try to use a regional bucket instead of a multi-regional one, the script will fail to write anything there (do not worry, the issue has been filed).
In a perfect world with ponies, where everything works as expected, it is very important to set the region here, and it should be consistent with the region that will be used during the training; otherwise, the training speed might suffer.
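For reference, once the issue above is resolved, making the bucket region-bound is just a matter of passing the location flag to gsutil mb; something like this (us-central1 is only an example, pick the region you plan to train in):
gsutil mb -l us-central1 ${TRAIN_BUCKET}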
Now we are ready to prepare the input data for the training.
Prepare The Data For The Training
This time (compared to the previous article) we will use a slightly modified version of the script that prepares the input data. I would encourage you to read in the README how the script works and what it does. But for now, here is how you can prepare the input data (you may replace “td src” with “mkdir src; cd src”):
➜ td src
➜ ~/src$ git clone https://github.com/b0noI/dialog_converter.git
Cloning into 'dialog_converter'...
remote: Counting objects: 63, done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 63 (delta 0), reused 0 (delta 0), pack-reused 59
Unpacking objects: 100% (63/63), done.
Checking connectivity... done.
➜ ~/src$ cd dialog_converter/
➜ ~/src/dialog_converter$ git checkout converter_that_produces_test_data_as_well_as_train_data
Branch converter_that_produces_test_data_as_well_as_train_data set up to track remote branch converter_that_produces_test_data_as_well_as_train_data from origin.
Switched to a new branch 'converter_that_produces_test_data_as_well_as_train_data'
➜ ~/src/dialog_converter$ python converter.py
➜ ~/src/dialog_converter$ ls
converter.py  LICENSE  movie_lines.txt  README.md  test.a  test.b  train.a  train.b
You might be wondering what “td” is. It is short for “to dir” and it is one of my most frequently used commands. In order to use it, you need to add the following code to your rc file:
function td() { mkdir "$1" && cd "$1"; }
This time we will improve the quality of our model by splitting our data set into two groups: training and test. That is why we see four files instead of the two we had in the process described in the previous article.
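If you want to sanity-check the result before uploading, a quick look at the line counts and a few sample pairs is enough (this assumes, as in the previous article, that the .a files hold the “from” side of each exchange and the .b files the corresponding “to” side):
➜ ~/src/dialog_converter$ wc -l train.a train.b test.a test.b
➜ ~/src/dialog_converter$ paste -d '|' train.a train.b | head -n 3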
Ok, now we have data, let’s upload it to the bucket:
➜ ~/src/dialog_converter$ gsutil cp test.* ${TRAIN_BUCKET}/input
Copying file://test.a [Content-Type=application/octet-stream]...
Copying file://test.b [Content-Type=chemical/x-molconn-Z]...
\ [2 files][  2.8 MiB/  2.8 MiB]    0.0 B/s
Operation completed over 2 objects/2.8 MiB.
➜ ~/src/dialog_converter$ gsutil cp train.* ${TRAIN_BUCKET}/input
Copying file://train.a [Content-Type=application/octet-stream]...
Copying file://train.b [Content-Type=chemical/x-molconn-Z]...
- [2 files][ 11.0 MiB/ 11.0 MiB]
Operation completed over 2 objects/11.0 MiB.
➜ ~/src/dialog_converter$ gsutil ls ${TRAIN_BUCKET}
gs://chatbot_generic/input/
➜ ~/src/dialog_converter$ gsutil ls ${TRAIN_BUCKET}/input
gs://chatbot_generic/input/test.a
gs://chatbot_generic/input/test.b
gs://chatbot_generic/input/train.a
gs://chatbot_generic/input/train.b
Prepare The Training Script
Now we can prepare the training script. We will use translate.py; however, the current implementation cannot be used with the Cloud ML as is, so we need a small refactoring. As usual, I have created a feature request and provided a branch with all the required changes. Let’s clone it:
➜ ~/src/dialog_converter$ cd ..
➜ ~/src$ git clone https://github.com/b0noI/models.git
Cloning into 'models'...
remote: Counting objects: 1813, done.
remote: Compressing objects: 100% (39/39), done.
remote: Total 1813 (delta 24), reused 0 (delta 0), pack-reused 1774
Receiving objects: 100% (1813/1813), 49.34 MiB | 39.19 MiB/s, done.
Resolving deltas: 100% (742/742), done.
Checking connectivity... done.
➜ ~/src$ cd models/
➜ ~/src/models$ git checkout translate_tutorial_supports_google_cloud_ml
Branch translate_tutorial_supports_google_cloud_ml set up to track remote branch translate_tutorial_supports_google_cloud_ml from origin.
Switched to a new branch 'translate_tutorial_supports_google_cloud_ml'
➜ ~/src/models$ cd tutorials/rnn/translate/
Again, pay attention: we are not using the master branch!
Test The Training Script Locally
Since remote training costs money, it is a good idea to test the training locally first. The problem is that actually training our network locally will definitely kill the Cloud Shell instance, and you would have to restart it. Do not worry, nothing is lost if this happens, but it is still something we would rather avoid. Luckily, our script includes a self-test mode that we can use. Let's start the training locally in self-test mode:
➜ ~/src/models/tutorials/rnn/translate$ cd ..
➜ ~/src/models/tutorials/rnn$ gcloud beta ml local train \
>   --package-path=translate \
>   --module-name=translate.translate \
>   -- \
>   --self_test
Self-test for neural translation model.
Pay attention to the folder from which we are executing the command.
It looks like the self-test finished successfully. Let's talk a little bit about the flags we are using here:
- package-path — the path to the python package that needs to be deployed to the remote server in order to execute the training;
- module-name — name of the module that needs to be executed during the training;
- “--” — everything that follows it will be passed as input arguments to the module;
- self_test — tells the module to run a self-test without actual training.
Training
This is the most exciting part. Before we start the training, we need to prepare all the required bucket paths that will be used during the process and set the local variables:
➜ ~/src/models/tutorials/rnn$ INPUT_TRAIN_DATA_A=${TRAIN_BUCKET}/input/train.a
➜ ~/src/models/tutorials/rnn$ INPUT_TRAIN_DATA_B=${TRAIN_BUCKET}/input/train.b
➜ ~/src/models/tutorials/rnn$ INPUT_TEST_DATA_A=${TRAIN_BUCKET}/input/test.a
➜ ~/src/models/tutorials/rnn$ INPUT_TEST_DATA_B=${TRAIN_BUCKET}/input/test.b
➜ ~/src/models/tutorials/rnn$ JOB_NAME=${PROJECT_NAME}_$(date +%Y%m%d_%H%M%S)
➜ ~/src/models/tutorials/rnn$ echo ${JOB_NAME}
chatbot_generic_20161224_203332
➜ ~/src/models/tutorials/rnn$ TRAIN_PATH=${TRAIN_BUCKET}/${JOB_NAME}
➜ ~/src/models/tutorials/rnn$ echo ${TRAIN_PATH}
gs://chatbot_generic/chatbot_generic_20161224_203332
The job name needs to be unique each time we start a training. Now let's change the current folder to translate (do not ask =) ):
➜ ~/src/models/tutorials/rnn$ cd translate/
At this point we are ready to start the training. Before the actual execution, let's first write out the command we will need to run and discuss its details:
gcloud beta ml jobs submit training ${JOB_NAME} \
  --package-path=. \
  --module-name=translate.translate \
  --staging-bucket="${TRAIN_BUCKET}" \
  --region=us-central1 \
  -- \
  --from_train_data=${INPUT_TRAIN_DATA_A} \
  --to_train_data=${INPUT_TRAIN_DATA_B} \
  --from_dev_data=${INPUT_TEST_DATA_A} \
  --to_dev_data=${INPUT_TEST_DATA_B} \
  --train_dir="${TRAIN_PATH}" \
  --data_dir="${TRAIN_PATH}" \
  --steps_per_checkpoint=5 \
  --from_vocab_size=45000 \
  --to_vocab_size=45000
Let’s first discuss some new flags of the training command:
- staging-bucket — the bucket that should be used for staging the code during deployment; it makes perfect sense to use the same bucket as for the training;
- region — the region where you want the training to run.
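If you are not sure which regions are available to your project, something like the following lists the Compute Engine regions (only a hint: Cloud ML may be available in just a subset of them):
➜ gcloud compute regions list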
Now let’s discuss the new flags that will be passed to our script:
- from_train_data — the “from” source that will be used during the training process;
- to_train_data — the same, but for the “to” source;
- from_dev_data/to_dev_data — the same, but for the test (or “dev”, as the script calls it) data that will be used to evaluate the loss;
- train_dir — the training directory, which can be cloud-based (YAY! exactly what we need);
- steps_per_checkpoint — how many steps should be executed before saving a checkpoint. 5 is a very small value; I have set it this low only to verify that the training process runs without any problems. Later on I will restart the process with a bigger value (200, for example);
- from_vocab_size/to_vocab_size — in order to understand what these are, you need to read the previous article. There you will find that the default value (40k) is smaller than the number of unique words in our dialogues. Though, in our case (after lowercasing all the dialogues), 45k might be more than actually needed; we will see (a rough way to check is shown right after this list).
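For the curious, here is a rough way to estimate the number of unique lowercased tokens in the training data. It is only a crude whitespace-based count, so it will not match the script's tokenizer exactly, but it gives an idea of whether 45k is in the right ballpark:
cat ~/src/dialog_converter/train.a ~/src/dialog_converter/train.b | tr '[:upper:]' '[:lower:]' | tr -s '[:space:]' '\n' | sort -u | wc -l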
It looks like everything is set to actually start the training, so let's rock (be patient, it will take some time for the execution to start)…
➜ ~/src/models/tutorials/rnn/translate$ gcloud beta ml jobs submit training ${JOB_NAME} \
>   --package-path=. \
>   --module-name=translate.translate \
>   --staging-bucket="${TRAIN_BUCKET}" \
>   --region=us-central1 \
>   -- \
>   --from_train_data=${INPUT_TRAIN_DATA_A} \
>   --to_train_data=${INPUT_TRAIN_DATA_B} \
>   --from_dev_data=${INPUT_TEST_DATA_A} \
>   --to_dev_data=${INPUT_TEST_DATA_B} \
>   --train_dir="${TRAIN_PATH}" \
>   --data_dir="${TRAIN_PATH}" \
>   --steps_per_checkpoint=5 \
>   --from_vocab_size=45000 \
>   --to_vocab_size=45000
INFO 2016-12-24 20:49:24 -0800 unknown_task Validating job requirements...
INFO 2016-12-24 20:49:25 -0800 unknown_task Job creation request has been successfully validated.
INFO 2016-12-24 20:49:26 -0800 unknown_task Job chatbot_generic_20161224_203332 is queued.
INFO 2016-12-24 20:49:31 -0800 service Waiting for job to be provisioned.
INFO 2016-12-24 20:49:36 -0800 service Waiting for job to be provisioned.
...
INFO 2016-12-24 20:53:15 -0800 service Waiting for job to be provisioned.
INFO 2016-12-24 20:53:20 -0800 service Waiting for job to be provisioned.
INFO 2016-12-24 20:53:20 -0800 service Waiting for TensorFlow to start.
...
INFO 2016-12-24 20:54:56 -0800 master-replica-0 Successfully installed translate-0.0.0
INFO 2016-12-24 20:54:56 -0800 master-replica-0 Running command: python -m translate.translate --from_train_data=gs://chatbot_generic/input/train.a --to_train_data=gs://chatbot_generic/input/train.b --from_dev_data=gs://chatbot_generic/input/test.a --to_dev_data=gs://chatbot_generic/input/test.b --train_dir=gs://chatbot_generic/chatbot_generic_20161224_203332 --steps_per_checkpoint=5 --from_vocab_size=45000 --to_vocab_size=45000
INFO 2016-12-24 20:56:21 -0800 master-replica-0 Creating vocabulary /tmp/vocab45000 from data gs://chatbot_generic/input/train.b
INFO 2016-12-24 20:56:21 -0800 master-replica-0 processing line 100000
INFO 2016-12-24 20:56:21 -0800 master-replica-0 Tokenizing data in gs://chatbot_generic/input/train.b
INFO 2016-12-24 20:56:21 -0800 master-replica-0 tokenizing line 100000
INFO 2016-12-24 20:56:21 -0800 master-replica-0 Tokenizing data in gs://chatbot_generic/input/train.a
INFO 2016-12-24 20:56:21 -0800 master-replica-0 tokenizing line 100000
INFO 2016-12-24 20:56:21 -0800 master-replica-0 Tokenizing data in gs://chatbot_generic/input/test.b
INFO 2016-12-24 20:56:21 -0800 master-replica-0 Tokenizing data in gs://chatbot_generic/input/test.a
INFO 2016-12-24 20:56:21 -0800 master-replica-0 Creating 3 layers of 1024 units.
INFO 2016-12-24 20:56:21 -0800 master-replica-0 Created model with fresh parameters.
INFO 2016-12-24 20:56:21 -0800 master-replica-0 Reading development and training data (limit: 0).
INFO 2016-12-24 20:56:21 -0800 master-replica-0 reading data line 100000
You can monitor the state of your training; you just need to open another tab in your Cloud Shell (or a tmux window) and recreate the required variables:
➜ JOB_NAME=chatbot_generic_20161224_213143
➜ gcloud beta ml jobs describe ${JOB_NAME}
...
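The gcloud beta ml jobs group should also have a stream-logs subcommand that streams the job logs straight into your terminal; treat this as an assumption on my side and fall back to describe if your gcloud version does not have it:
➜ gcloud beta ml jobs stream-logs ${JOB_NAME}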
Now we can actually stop the job and restart it with the default number of steps per checkpoint (200).
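Stopping the running job and preparing a fresh job name (remember, job names must be unique) should look roughly like this; the cancel subcommand is an assumption on my side, so double-check it against gcloud beta ml jobs --help:
➜ gcloud beta ml jobs cancel ${JOB_NAME}
➜ JOB_NAME=${PROJECT_NAME}_$(date +%Y%m%d_%H%M%S)
➜ TRAIN_PATH=${TRAIN_BUCKET}/${JOB_NAME}
The updated submit command should look like this: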
➜ ~/src/models/tutorials/rnn/translate$ gcloud beta ml jobs submit training ${JOB_NAME} \
>   --package-path=. \
>   --module-name=translate.translate \
>   --staging-bucket="${TRAIN_BUCKET}" \
>   --region=us-central1 \
>   -- \
>   --from_train_data=${INPUT_TRAIN_DATA_A} \
>   --to_train_data=${INPUT_TRAIN_DATA_B} \
>   --from_dev_data=${INPUT_TEST_DATA_A} \
>   --to_dev_data=${INPUT_TEST_DATA_B} \
>   --train_dir="${TRAIN_PATH}" \
>   --data_dir="${TRAIN_PATH}" \
>   --from_vocab_size=45000 \
>   --to_vocab_size=45000
Talk To The Chatbot
Probably the biggest advantage of using Cloud Storage is that you can start using the latest checkpoint from any other machine without interrupting or impacting the training process.
Now, for example, I'm going to show you how to chat with the bot, which is still being trained, after only 1600 iterations.
Also, this is the only step that cannot be done in the Cloud Shell. Why? If you are really asking: the Cloud Shell was never designed to run heavy tasks, so it will gloriously die with honor and an OutOfMemory error.
Okay, so here is how you can start chatting from your local machine:
mkdir ~/tmp-data
gsutil cp gs://chatbot_generic/chatbot_generic_20161224_232158/translate.ckpt-1600.meta ~/tmp-data
...
gsutil cp gs://chatbot_generic/chatbot_generic_20161224_232158/translate.ckpt-1600.index ~/tmp-data
...
gsutil cp gs://chatbot_generic/chatbot_generic_20161224_232158/translate.ckpt-1600.data-00000-of-00001 ~/tmp-data
...
gsutil cp gs://chatbot_generic/chatbot_generic_20161224_232158/checkpoint ~/tmp-data
TRAIN_PATH=...
python -m translate.translate \
  --data_dir="${TRAIN_PATH}" \
  --train_dir="${TRAIN_PATH}" \
  --from_vocab_size=45000 \
  --to_vocab_size=45000 \
  --decode
Reading model parameters from /Users/b0noi/tmp-data/translate.ckpt-1600
> Hi there
you ? . . . . . . . .
> What do you want?
i . . . . . . . . .
> yes, you
i ? . . . . . . . .
> hi
you ? . . . . . . . .
> who are you?
i . . . . . . . . .
> yes you!
what ? . . . . . . . .
> who are you?
i . . . . . . . . .
> you
' . . . . . . . .
TRAIN_PATH should point to the “tmp-data” folder, and you need to be in the “models/tutorials/rnn” folder of the models repo.
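Putting it together, the full invocation on my machine would look roughly like this (a sketch only: the exact paths depend on where you cloned the repo and copied the checkpoints on your local machine):
cd ~/src/models/tutorials/rnn
TRAIN_PATH=${HOME}/tmp-data
python -m translate.translate \
  --data_dir="${TRAIN_PATH}" \
  --train_dir="${TRAIN_PATH}" \
  --from_vocab_size=45000 \
  --to_vocab_size=45000 \
  --decode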
As you can see, the chatbot is not super smart after 1600 iterations. If you want to know what the conversation looks like after 50k+ iterations, please have a look at the previous article; the main purpose of this one was not to train the perfect chatbot but to explain how to do it, so that anyone can train a better one.
Post factum
I hope that my article helped you learn what the Google Cloud ML is and how you can use it to train your own NN. I also hope that you enjoyed reading it, and if so, you can support me on my Patreon page and/or by liking and sharing the article.
If you spot any problems while executing these steps, please write me a comment so I can update the article and no one else runs into the same problem.