19_Training and Deploying TensorFlow Models at Scale_directory walk_TensorFlow Serving_requests_REST_gRPC_Docker_Google API Client Library_gpu: https://blog.csdn.net/Linli522362242/article/details/119323411
19_2_Training & Deploying TensorFlow Models_%%writefile UsageError_colab_filenames containing spaces_No dashboard_gcp:
https://blog.csdn.net/Linli522362242/article/details/119626524
1. Create a Project
See https://blog.csdn.net/Linli522362242/article/details/119626524 for the console steps.
Project name: mnist 10272021
Click SELECT PROJECT, then note the values shown for the project:
Project ID: mnist-10272021
Project number: 97885218772
2. Authenticating the notebook to use your Google Cloud Project
This code authenticates the notebook, checking your valid Google Cloud credentials and identity. It is inside the if not tfc.remote() block to ensure that it is only run in the notebook, and will not be run when the notebook code is sent to Google Cloud.
Note: For Kaggle Notebooks click on "Add-ons"->"Google Cloud SDK" before running the cell below.
import os
import sys
import tensorflow_cloud as tfc  # installed in step 6 with pip

GCP_PROJECT_ID = "mnist-10272021"

# Using tfc.remote() to ensure this code only runs in notebook
if not tfc.remote():
    # Authentication for Colab Notebooks
    if "google.colab" in sys.modules:
        print('google.colab')
        from google.colab import auth
        auth.authenticate_user()
        os.environ['GOOGLE_CLOUD_PROJECT'] = GCP_PROJECT_ID

    # Authentication for Kaggle Notebooks
    if "kaggle_secrets" in sys.modules:
        from kaggle_secrets import UserSecretsClient
        UserSecretsClient().set_gcloud_credentials(project=GCP_PROJECT_ID)
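The guard pattern can be tried out without TensorFlow Cloud installed. The following is a minimal sketch, assuming the remote environment is signalled by an environment variable. The variable name RUNNING_REMOTELY and the helper is_remote() are hypothetical stand-ins for tfc.remote(), not the library's implementation.

```python
import os

# Hypothetical flag name; it stands in for whatever tfc.remote() checks internally.
REMOTE_FLAG = "RUNNING_REMOTELY"


def is_remote() -> bool:
    """Return True when the (hypothetical) remote flag is set to a non-empty value."""
    return bool(os.environ.get(REMOTE_FLAG))


def steps_for_current_environment() -> list:
    """Collect the names of the steps that would run in the current environment."""
    steps = []
    if not is_remote():
        steps.append("authenticate")   # notebook-only work
    if is_remote():
        steps.append("full_training")  # cloud-only work
    return steps


# Simulate the notebook run, then the remote run.
os.environ.pop(REMOTE_FLAG, None)
local_steps = steps_for_current_environment()
os.environ[REMOTE_FLAG] = "1"
remote_steps = steps_for_current_environment()
os.environ.pop(REMOTE_FLAG, None)

print("notebook:", local_steps)
print("remote:", remote_steps)
```

The same script therefore behaves differently depending on where it runs, which is what lets one notebook serve as both the launcher and the training program.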
3. Create Bucket
We will use this storage bucket for temporary assets as well as to save the model checkpoints. Make a note of the name of the bucket for future reference. Note that bucket names must be globally unique.
Browser ==> CREATE BUCKET
Name your bucket: mnist_10272021_bucket
OR
GCS_BUCKET = "mnist_10272021_bucket"
GCS_BUCKET_PATH = f"gs://{GCS_BUCKET}"  # gs://mnist_10272021_bucket
!gsutil mb -p $GCP_PROJECT_ID $GCS_BUCKET_PATH
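Before running the command it can help to check the bucket name locally. The sketch below covers only part of the Cloud Storage naming rules (3 to 63 characters; lowercase letters, digits, dashes, underscores and dots; a letter or digit at both ends; no "goog" prefix and no "google" substring). The helper name looks_like_valid_bucket_name is hypothetical, and the check cannot tell whether a name is already taken, since global uniqueness is only known to the service.

```python
import re

# Partial check: 3-63 characters, allowed character set, alphanumeric first and last.
_BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]")


def looks_like_valid_bucket_name(name: str) -> bool:
    """Hypothetical helper: apply a subset of the bucket naming rules."""
    if name.startswith("goog") or "google" in name:
        return False
    return _BUCKET_RE.fullmatch(name) is not None


for candidate in ["mnist_10272021_bucket", "Mnist_Bucket", "ab", "-mnist-"]:
    print(candidate, looks_like_valid_bucket_name(candidate))
```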
3. Link your billing account to your project
The next step is to set up the billing account for this project. Google Cloud creates a project for you by default, called “My First Project”. Use your Project ID (from step 1) to run the following commands. The first command shows your billing ACCOUNT_ID; make a note of it for the next step.
!gcloud beta billing accounts list
Use your billing ACCOUNT_ID from above and run the following to link your billing account with your project.
Note: if you use an existing project you may not see an ACCOUNT_ID. This means you do not have the proper permissions to run the following commands; contact your admin or create a new project.
BILLING_ACCOUNT_ID = '01F938-DE847D-A19F05'
# GCP_PROJECT_ID = "mnist-10272021"
!gcloud beta billing projects link $GCP_PROJECT_ID --billing-account $BILLING_ACCOUNT_ID
OR Billing account ID : 01F938-DE847D-A19F05
4. Enable Required APIs for tensorflow-cloud in your project
For tensorflow_cloud we use two specific APIs: the AI Platform Training Jobs API (ml.googleapis.com) and the Cloud Build API (cloudbuild.googleapis.com). Note that this is a one-time setup for this project; you do not need to rerun this command for every notebook.
# GCP_PROJECT_ID = "mnist-10272021"
!gcloud services --project $GCP_PROJECT_ID enable ml.googleapis.com cloudbuild.googleapis.com
OR, in the console, click ENABLE on each of the two API pages.
5. Create a service account
This step is required to use hyperparameter (HP) tuning on Google Cloud with CloudTuner. To create a service account and give it project editor access, run the following commands and make a note of your service account name.
# GCP_PROJECT_ID = "mnist-10272021"
# Service account name must be between 6 and 30 characters (inclusive),
# must begin with a lowercase letter, and consist of lowercase alphanumeric
# characters that can be separated by hyphens.
SERVICE_ACCOUNT_NAME = 'mnist-10272021-sa'
SERVICE_ACCOUNT_EMAIL = f'{SERVICE_ACCOUNT_NAME}@{GCP_PROJECT_ID}.iam.gserviceaccount.com'
!gcloud iam --project $GCP_PROJECT_ID service-accounts create $SERVICE_ACCOUNT_NAME
!gcloud projects add-iam-policy-binding $GCP_PROJECT_ID \
    --member serviceAccount:$SERVICE_ACCOUNT_EMAIL \
    --role=roles/editor
The default AI Platform service account is identified by an email address with the format service-PROJECT_NUMBER@cloud-ml.google.com.iam.gserviceaccount.com. Using your Project number from step 1, we construct the service account email and grant the default AI Platform service account the admin role (roles/iam.serviceAccountAdmin) on your new service account.
# GCP_PROJECT_ID = "mnist-10272021"
PROJECT_NUMBER = "97885218772"
DEFAULT_AI_PLATFORM_SERVICE_ACCOUNT = f'service-{PROJECT_NUMBER}@cloud-ml.google.com.iam.gserviceaccount.com'
!gcloud iam --project $GCP_PROJECT_ID service-accounts add-iam-policy-binding \
    --role=roles/iam.serviceAccountAdmin \
    --member=serviceAccount:$DEFAULT_AI_PLATFORM_SERVICE_ACCOUNT \
    $SERVICE_ACCOUNT_EMAIL
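The two email addresses used in this step follow fixed formats, so they can be built and checked in plain Python before any gcloud command is run. This is a minimal sketch based on the naming rule quoted in the comment above (6 to 30 characters, starting with a lowercase letter, lowercase alphanumerics and hyphens). The helper names are hypothetical, and the regular expression is one reading of that rule rather than an official pattern.

```python
import re

# One reading of the rule: lowercase letter first, alphanumeric last, 6-30 characters.
_SA_NAME_RE = re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]")


def service_account_email(name: str, project_id: str) -> str:
    """Hypothetical helper: validate the name, then build the account email."""
    if _SA_NAME_RE.fullmatch(name) is None:
        raise ValueError(f"invalid service account name: {name!r}")
    return f"{name}@{project_id}.iam.gserviceaccount.com"


def default_ai_platform_account(project_number: str) -> str:
    """Hypothetical helper: build the default AI Platform service account email."""
    return f"service-{project_number}@cloud-ml.google.com.iam.gserviceaccount.com"


sa_email = service_account_email("mnist-10272021-sa", "mnist-10272021")
ai_platform_email = default_ai_platform_account("97885218772")
print(sa_email)
print(ai_platform_email)
```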
OR in the navigation menu, go to IAM & admin → Service accounts, ==>CREATE SERVICE ACCOUNT
https://blog.csdn.net/Linli522362242/article/details/119626524
You are now ready to run tensorflow-cloud. Note that these steps only need to be run once. Once you have your project set up, you can reuse the same project and bucket configuration for future runs. For any new notebook you will need to repeat step 2 to add your Google Cloud auth credentials.
Make a note of the following values as they are needed to run tensorflow-cloud.
print(f"Your GCP_PROJECT_ID is: {GCP_PROJECT_ID}")
print(f"Your SERVICE_ACCOUNT_NAME is: {SERVICE_ACCOUNT_NAME}")
print(f"Your BUCKET_NAME is: {GCS_BUCKET}")
GCP_PROJECT_ID: mnist-10272021
GCS_BUCKET: mnist_10272021_bucket
JOB_NAME: mnist
The JOB_NAME is optional, and you can set it to any string. If you are running multiple training experiments as part of a larger project, for example, you may want to give each of them a unique JOB_NAME.
6. Import required modules
This guide requires TensorFlow Cloud, which you can install via:
!pip install tensorflow_cloud
import os
import sys
import tensorflow as tf
import tensorflow_cloud as tfc
7. Project Configurations
# Set Google Cloud Specific parameters
# set GCP_PROJECT_ID to your own Google Cloud project ID.
GCP_PROJECT_ID = "mnist-10272021"
# set GCS_BUCKET to your own Google Cloud Storage (GCS) bucket.
GCS_BUCKET = "mnist_10272021_bucket"
# DO NOT CHANGE: Currently only the 'us-central1' region is supported.
REGION = "us-central1"
# OPTIONAL: You can change the job name to any string.
JOB_NAME = "mnist"
# Setting location where training logs and checkpoints will be stored
# gs://mnist_10272021_bucket/mnist
GCS_BASE_PATH = f"gs://{GCS_BUCKET}/{JOB_NAME}"
TENSORBOARD_LOGS_DIR = os.path.join(GCS_BASE_PATH, "logs") # gs://mnist_10272021_bucket/mnist/logs
MODEL_CHECKPOINT_DIR = os.path.join(GCS_BASE_PATH, "checkpoints") # gs://mnist_10272021_bucket/mnist/checkpoints
SAVED_MODEL_DIR = os.path.join(GCS_BASE_PATH, "saved_model") # gs://mnist_10272021_bucket/mnist/saved_model
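One detail about the paths above: os.path.join uses the separator of the platform it runs on, so on Windows it would insert backslashes, which are not valid in gs:// URIs. Colab and Kaggle run on Linux, so the code above works there as written. A minimal sketch of a platform-independent alternative, assuming the same bucket and job name, uses posixpath.join, which always joins with a forward slash:

```python
import posixpath

GCS_BUCKET = "mnist_10272021_bucket"
JOB_NAME = "mnist"

# posixpath.join always uses "/", regardless of the operating system.
GCS_BASE_PATH = f"gs://{GCS_BUCKET}/{JOB_NAME}"
TENSORBOARD_LOGS_DIR = posixpath.join(GCS_BASE_PATH, "logs")
MODEL_CHECKPOINT_DIR = posixpath.join(GCS_BASE_PATH, "checkpoints")
SAVED_MODEL_DIR = posixpath.join(GCS_BASE_PATH, "saved_model")

print(TENSORBOARD_LOGS_DIR)
print(MODEL_CHECKPOINT_DIR)
print(SAVED_MODEL_DIR)
```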
8. Authenticating the notebook to use your Google Cloud Project
This code authenticates the notebook, checking your valid Google Cloud credentials and identity. It is inside the if not tfc.remote() block to ensure that it is only run in the notebook, and will not be run when the notebook code is sent to Google Cloud.
Note: For Kaggle Notebooks click on "Add-ons"->"Google Cloud SDK" before running the cell below.
# Using tfc.remote() to ensure this code only runs in notebook
if not tfc.remote():
    # Authentication for Colab Notebooks
    if "google.colab" in sys.modules:
        print('google.colab')
        from google.colab import auth
        auth.authenticate_user()
        os.environ['GOOGLE_CLOUD_PROJECT'] = GCP_PROJECT_ID

    # Authentication for Kaggle Notebooks
    if "kaggle_secrets" in sys.modules:
        from kaggle_secrets import UserSecretsClient
        UserSecretsClient().set_gcloud_credentials(project=GCP_PROJECT_ID)
9. Model and data setup
From here we are following the basic procedure for setting up a simple Keras model to run classification on the MNIST dataset.
9.1 Load and split data
Read the raw data and split it into training and test sets.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
9.2 Create a model and prepare for training
Create a simple model and set up a few callbacks for it.
from tensorflow.keras import layers
from tensorflow import keras

model = keras.Sequential(
    [
        keras.Input(shape=(28, 28)),
        # Use a Rescaling layer to make sure input values are in the [0, 1] range.
        layers.experimental.preprocessing.Rescaling(1.0 / 255),
        # The original images have shape (28, 28), so we reshape them to (28, 28, 1)
        layers.Reshape(target_shape=(28, 28, 1)),
        # Follow-up with a classic small convnet
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(32, 3, activation="relu"),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(10),  # outputs logits, one per digit class
    ]
)

model.compile(
    optimizer=keras.optimizers.Adam(),
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[keras.metrics.SparseCategoricalAccuracy()],
)
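To see how the layer sizes in this model fit together, the shape and parameter arithmetic can be worked through in plain Python. This sketch assumes the Keras defaults for the layers above ('valid' padding and stride 1 for Conv2D, pool stride equal to the pool size for MaxPooling2D); the helper names are hypothetical.

```python
def conv_out(size: int, kernel: int = 3) -> int:
    """Output side length of a 'valid', stride-1 convolution."""
    return size - kernel + 1


def pool_out(size: int, pool: int = 2) -> int:
    """Output side length of max pooling with stride equal to the pool size."""
    return size // pool


def conv_params(kernel: int, in_ch: int, out_ch: int) -> int:
    """Weights plus one bias per output channel."""
    return kernel * kernel * in_ch * out_ch + out_ch


def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out


side = 28
side = conv_out(side)   # 26
side = pool_out(side)   # 13
side = conv_out(side)   # 11
side = pool_out(side)   # 5
side = conv_out(side)   # 3
flat_features = side * side * 32  # input size of the first Dense layer

total_params = (
    conv_params(3, 1, 32)
    + 2 * conv_params(3, 32, 32)
    + dense_params(flat_features, 128)
    + dense_params(128, 10)
)
print(flat_features, total_params)
```

The Rescaling, Reshape, MaxPooling2D and Flatten layers have no trainable parameters, so they do not appear in the total. The result can be compared against the output of model.summary().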
Quick validation training
We'll train the model for one (1) epoch just to make sure everything is set up correctly, and we'll wrap that training command in `if not tfc.remote`, so that it only happens here in the runtime environment in which you are reading this, not when it is sent to Google Cloud.
if not tfc.remote():
    # Run the training for 1 epoch and a small subset of the data to validate setup
    model.fit(x=x_train[:100], y=y_train[:100], validation_split=0.2, epochs=1)
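With validation_split=0.2, Keras holds out the last 20% of the samples passed to fit (taken before any shuffling) and trains on the rest. The arithmetic for the quick check above, and for the full 60,000-image MNIST training set with the batch size of 100 used in the remote run, can be sketched as follows. The helper names are hypothetical, and the default batch size of 32 is assumed for the quick check.

```python
import math


def split_counts(n_samples: int, validation_split: float) -> tuple:
    """Return (n_train, n_val); the validation part is the tail of the data."""
    n_train = int(math.floor(n_samples * (1.0 - validation_split)))
    return n_train, n_samples - n_train


def steps_per_epoch(n_train: int, batch_size: int) -> int:
    """Number of batches per epoch, counting a final partial batch."""
    return math.ceil(n_train / batch_size)


quick_train, quick_val = split_counts(100, 0.2)
full_train, full_val = split_counts(60000, 0.2)

print(quick_train, quick_val, steps_per_epoch(quick_train, 32))
print(full_train, full_val, steps_per_epoch(full_train, 100))
```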
9.3 Prepare for remote training
The code below will only run when the notebook code is sent to Google Cloud, not inside the runtime in which you are reading this.
First, we set up callbacks which will:
- Create logs for TensorBoard.
- Create checkpoints and save them to the checkpoints directory specified above.
- Stop model training if loss is not improving sufficiently.
Then we call model.fit and model.save, which (when this code is running on Google Cloud) run the full training (up to 100 epochs, subject to early stopping) and then save the trained model in the GCS bucket and directory defined above.
if tfc.remote():
    # Configure Tensorboard logs
    callbacks = [
        # gs://mnist_10272021_bucket/mnist/logs
        tf.keras.callbacks.TensorBoard(log_dir=TENSORBOARD_LOGS_DIR),
        # gs://mnist_10272021_bucket/mnist/checkpoints
        tf.keras.callbacks.ModelCheckpoint(MODEL_CHECKPOINT_DIR, save_best_only=True),
        # patience: number of epochs with no improvement after which training will be stopped.
        tf.keras.callbacks.EarlyStopping(monitor="val_loss", min_delta=0.001, patience=3),
    ]

    model.fit(
        x=x_train,
        y=y_train,
        epochs=100,
        validation_split=0.2,
        callbacks=callbacks,
        batch_size=100,
    )

    # Let's save the model in GCS after the training is complete.
    model.save(SAVED_MODEL_DIR)  # gs://mnist_10272021_bucket/mnist/saved_model
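The effect of min_delta=0.001 and patience=3 can be illustrated with a small simulation. This is a simplified sketch of the early-stopping idea, not the Keras implementation: an epoch counts as an improvement only if val_loss drops by more than min_delta below the best value so far, and training stops after patience consecutive epochs without improvement. The function name and the loss values are hypothetical.

```python
def early_stop_epoch(val_losses, min_delta=0.001, patience=3):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss   # meaningful improvement: reset the counter
            wait = 0
        else:
            wait += 1     # no sufficient improvement this epoch
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation losses: improvements after epoch 2 are all below min_delta.
plateau = [0.50, 0.40, 0.3995, 0.3999, 0.3992, 0.39]
improving = [0.50, 0.40, 0.30, 0.20]

stopped_at = early_stop_epoch(plateau)
never_stopped = early_stop_epoch(improving)
print(stopped_at, never_stopped)
```

In the plateau case the run ends at epoch 5, long before the 100 epochs requested in model.fit, which is why the epochs argument acts as an upper bound here.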
Start the remote training
TensorFlow Cloud takes all the code from its local execution environment (this notebook), wraps it up, and sends it to Google Cloud for execution. (That's why the `if tfc.remote()` and `if not tfc.remote()` wrappers are important.)
This step will prepare your code from this notebook for remote execution and then start a remote training job on Google Cloud Platform to train the model.
- First we add the `tensorflow-cloud` Python package to a `requirements.txt` file, which will be sent along with the code in this notebook. You can add other packages here as needed.
- Then a GPU and a CPU image are specified. You only need to specify one or the other; the GPU is used in the code that follows.
- Finally, the heart of TensorFlow Cloud: the call to `tfc.run`. When this is executed inside this notebook, all the code from this notebook, and the rest of the files in this directory, will be packaged and sent to Google Cloud for execution. The parameters of the `run` method specify the details of the execution environment and the distribution strategy (if any) to be used.
# If you are using a custom image you can install modules via requirements txt file.
with open("requirements.txt", "w") as f:
    f.write("tensorflow-cloud\n")
# Optional: Some recommended base images.
# If you provide none the system will choose one for you.
TF_GPU_IMAGE = "gcr.io/deeplearning-platform-release/tf2-gpu.2-5"
TF_CPU_IMAGE = "gcr.io/deeplearning-platform-release/tf2-cpu.2-5"
# Submit a single node training job using GPU.
tfc.run(
    distribution_strategy="auto",
    requirements_txt="requirements.txt",
    # We can also use this storage bucket for Docker image building,
    # instead of your local Docker instance.
    # For this, just pass your bucket as the image_build_bucket parameter.
    docker_config=tfc.DockerConfig(
        parent_image=TF_GPU_IMAGE,
        image_build_bucket=GCS_BUCKET,  # GCS_BUCKET = "mnist_10272021_bucket"
    ),
    chief_config=tfc.COMMON_MACHINE_CONFIGS['K80_1X'],
    job_labels={'job': JOB_NAME},
)
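Because the parent image and the machine configuration are chosen separately, it is easy to pair a CPU-only image with a GPU machine. A small sanity check before submitting can catch that. This is a minimal sketch that relies only on the "-gpu" / "-cpu" naming visible in the two image names used in this post; the helper image_matches_accelerator is hypothetical and not part of tensorflow_cloud.

```python
TF_CPU_IMAGE = "gcr.io/deeplearning-platform-release/tf2-cpu.2-5"
TF_GPU_IMAGE = "gcr.io/deeplearning-platform-release/tf2-gpu.2-5"


def image_matches_accelerator(image: str, uses_gpu: bool) -> bool:
    """Hypothetical check: does the image name agree with the machine type?"""
    image_name = image.rsplit("/", 1)[-1]  # e.g. "tf2-gpu.2-5"
    return ("-gpu" in image_name) == uses_gpu


# A GPU machine such as 'K80_1X' should be paired with the GPU image.
print(image_matches_accelerator(TF_GPU_IMAGE, uses_gpu=True))
print(image_matches_accelerator(TF_CPU_IMAGE, uses_gpu=True))
```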
Once the job is submitted you can go to the next step to monitor the job's progress via TensorBoard.
TENSORBOARD_LOGS_DIR
Training Results
Reconnect your Colab instance
Most remote training jobs are long running. If you are using Colab, it may time out before the training results are available.
In that case, **rerun the following sections in order** to reconnect and configure your Colab instance to access the training results.
6. Import required modules
7. Project Configurations
8. Authenticating the notebook to use your Google Cloud Project
**DO NOT** rerun the rest of the code.
Load Tensorboard
While the training is in progress you can use Tensorboard to view the results. Note the results will show only after your training has started. This may take a few minutes.
# Commented out IPython magic to ensure Python compatibility. !!!!!!
# %load_ext tensorboard
%reload_ext tensorboard
%tensorboard --logdir $TENSORBOARD_LOGS_DIR
Load your trained model
Once training is complete, you can retrieve your model from the GCS Bucket you specified above.
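import numpy as np

trained_model = tf.keras.models.load_model(SAVED_MODEL_DIR)
trained_model.summary()

# Predict the first three test images; the model outputs one logit per class.
X_new = x_test[:3]
Y_pred = trained_model.predict(X_new)
# The index of the largest logit is the predicted digit.
np.argmax(Y_pred, axis=-1)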
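The last layer of the model is Dense(10) with no activation, and the loss was configured with from_logits=True, so predict returns raw logits rather than probabilities. Taking the argmax of the logits gives the same class as taking the argmax of the softmax probabilities. A minimal pure-Python sketch, using a made-up logit vector rather than real model output:

```python
import math


def softmax(logits):
    """Convert logits to probabilities, subtracting the max for numerical stability."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one image (one value per digit class 0-9).
example_logits = [-1.2, 0.3, 4.1, 0.0, -0.5, 1.7, -2.0, 0.9, 0.2, -0.1]
probs = softmax(example_logits)

pred_from_logits = max(range(10), key=lambda i: example_logits[i])
pred_from_probs = max(range(10), key=lambda i: probs[i])
print(pred_from_logits, pred_from_probs, round(sum(probs), 6))
```

Softmax is only needed when a confidence value per class is wanted; for the predicted label alone, the argmax over logits is enough.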
Disabling projects linked to your billing account