Kaggle competition Flower Classification with TPUs: configuring the TPU and loading the public dataset

Configuring the TPU

The official documentation (Tensor Processing Units (TPUs)) spells out how to configure a TPU perfectly clearly:
Once you have flipped the “Accelerator” switch in your notebook to “TPU v3-8”, this is how to enable TPU training in Tensorflow Keras:

# detect and init the TPU
tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)

# instantiate a distribution strategy
tpu_strategy = tf.distribute.experimental.TPUStrategy(tpu)

# instantiating the model in the strategy scope creates the model on the TPU
with tpu_strategy.scope():
    model = tf.keras.Sequential( … ) # define your model normally
    model.compile( … )
    
# train model normally
model.fit(training_dataset, epochs=EPOCHS, steps_per_epoch=…)
# This is how I actually wrote it: TPU acceleration is precious (only 30 hours
# per week), so I debug on CPU or GPU and only switch the TPU accelerator on
# when committing the notebook.
AUTO = tf.data.experimental.AUTOTUNE
ignore_order = tf.data.Options()
ignore_order.experimental_deterministic = False
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('Running on TPU', tpu.master())
except ValueError:
    tpu = None
    
if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
else:
    strategy = tf.distribute.get_strategy() 

BATCH_SIZE = 16 * strategy.num_replicas_in_sync
print(f'tensorflow version : {tf.__version__}')
print("REPLICAS: ", strategy.num_replicas_in_sync)

TPUs are network-connected accelerators and you must first locate them on the network. This is what TPUClusterResolver() does.

Two additional lines of boilerplate and you can define a TPUStrategy. This object contains the necessary distributed training code that will work on TPUs with their 8 compute cores (see hardware section below).

Finally, you use the TPUStrategy by instantiating your model in the scope of the strategy. This creates the model on the TPU. Model size is constrained by the TPU RAM only, not by the amount of memory available on the VM running your Python code. Model creation and model training use the usual Keras APIs.

To go fast on a TPU, increase the batch size. The rule of thumb is to use batches of 128 elements per core (ex: batch size of 128*8=1024 for a TPU with 8 cores). At this size, the 128x128 hardware matrix multipliers of the TPU (see hardware section below) are most likely to be kept busy. You start seeing interesting speedups from a batch size of 8 per core though. In the sample above, the batch size is scaled with the core count through this line of code:

BATCH_SIZE = 16 * tpu_strategy.num_replicas_in_sync
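The scaling rule is plain multiplication, which is easy to sanity-check without a TPU at all. A minimal sketch (the helper name is mine, not from the docs) of how the global batch size grows with the replica count reported by the strategy:

```python
# Sketch: global batch size = per-replica batch * number of replicas (cores).
def global_batch_size(per_replica, num_replicas):
    return per_replica * num_replicas

print(global_batch_size(16, 1))   # CPU/GPU fallback strategy: one replica -> 16
print(global_batch_size(16, 8))   # TPU v3-8: eight cores -> 128
print(global_batch_size(128, 8))  # the 128-per-core rule of thumb -> 1024
```

With the fallback `tf.distribute.get_strategy()`, `num_replicas_in_sync` is 1, so the same line of code works unchanged in CPU/GPU debugging sessions.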

Loading the public Kaggle dataset

The official documentation is much less friendly about how to load the dataset, especially for those of us who would rather not use a VPN. First, the official statement:

Because TPUs are very fast, many models ported to TPU end up with a data bottleneck. The TPU is sitting idle, waiting for data for the most part of each training epoch. TPUs read training data exclusively from GCS (Google Cloud Storage). And GCS can sustain a pretty large throughput if it is continuously streaming from multiple files in parallel. Following a couple of best practices will optimize the throughput:

For TPU training, organize your data in GCS in a reasonable number (10s to 100s) of reasonably large files (10s to 100s of MB).

With too few files, GCS will not have enough streams to get max throughput. With too many files, time will be wasted accessing each individual file.
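The "reasonable number of reasonably large files" advice turns into simple arithmetic when you shard your own data. A sketch under assumed numbers (the helper and the 100 MB target are mine, chosen to land inside the recommended "10s to 100s of MB" range):

```python
import math

# Hypothetical helper: pick a shard count so each TFRecord file is
# roughly target_shard_bytes in size (default ~100 MB).
def num_shards(total_bytes, target_shard_bytes=100 * 2**20):
    return max(1, math.ceil(total_bytes / target_shard_bytes))

# e.g. ~16 GB of JPEGs at ~100 MB per shard -> 164 files,
# comfortably inside the recommended "10s to 100s" of files.
print(num_shards(16 * 2**30))
```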

Data for TPU training typically comes sharded across the appropriate number of larger files. The usual container format is TFRecords. You can load a dataset from TFRecords files by writing:

# On Kaggle you can also use KaggleDatasets().get_gcs_path() to obtain the GCS path of a Kaggle dataset
filenames = tf.io.gfile.glob("gs://flowers-public/tfrecords-jpeg-512x512/*.tfrec") # list files on GCS
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(...) # TFRecord decoding here...

What does this mean? TPUs compute very fast, so if training data is read from local storage, the TPU sits idle most of the time and training hits a data bottleneck. To avoid this, Google requires everyone training models with TPUs on Kaggle to put their data on GCS (Google Cloud Storage). If you read local files the way you used to for GPU training, you will get a pile of baffling errors.

Fortunately, Kaggle hosts public datasets that we can read directly. Straight to the code:

from kaggle_datasets import KaggleDatasets
gcs_path = KaggleDatasets().get_gcs_path('flower-classification-with-tpus')
gcs_path
# gcs_path is the GCS link to the dataset's root directory
# If you need to use your own dataset, see my other blog post
We can use gsutil to list the directories under the returned gcs_path and find the path to the training data:
gsutil ls gs://kds-b2e6cdbc4af76dcf0363776c09c12fe46872cab211d1de9f60ec7aec
# the *a* glob matches the train and val directories (both names contain an "a") and skips test
dataset_filenames = tf.io.gfile.glob(r'gs://kds-b2e6cdbc4af76dcf0363776c09c12fe46872cab211d1de9f60ec7aec/tfrecords-jpeg-512x512/*a*/*.tfrec')
dataset = tf.data.TFRecordDataset(dataset_filenames)
dataset = dataset.with_options(ignore_order)
# How to read the .tfrec files: the competition description tells you what
# they contain, but in fact the features are named id, class, image,
# while the official description calls them id, label, img
feature_description = {
    'class': tf.io.FixedLenFeature([], tf.int64),
    'image': tf.io.FixedLenFeature([], tf.string)
}
def dataset_decode(data):
    decode_data = tf.io.parse_single_example(data, feature_description)
    label = decode_data['class']
    image = tf.image.decode_jpeg(decode_data['image'], channels=3)
    image = tf.reshape(image, [512, 512, 3])
    image = tf.cast(image, tf.float32)
    image = (image - 127.5) / 127.5  # scale pixels to [-1, 1]
    return image, label
dataset = dataset.map(dataset_decode)
# DATASET_SIZE is the number of training examples; define it beforehand
dataset = dataset.shuffle(DATASET_SIZE).repeat().batch(BATCH_SIZE).prefetch(AUTO)
print(dataset)
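A quick sanity check on the normalization used in dataset_decode above: `(x - 127.5) / 127.5` maps uint8 pixel values from [0, 255] onto [-1.0, 1.0]. In plain Python, mirroring the TF expression:

```python
# Plain-Python mirror of the normalization in dataset_decode.
def normalize(pixel):
    return (pixel - 127.5) / 127.5

print(normalize(0))      # -1.0 (black)
print(normalize(127.5))  # 0.0  (mid-gray)
print(normalize(255))    # 1.0  (white)
```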

At this point the data is ready, and everything else works the same as training on a GPU. Feel free to turn the batch size up; it will be fine.
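One last piece of arithmetic before calling model.fit: because the pipeline uses .repeat(), the dataset is infinite and Keras needs an explicit steps_per_epoch. A plain-Python sketch (the 12753-image train split size is an assumed example; substitute your own count):

```python
# Assumed example values; set DATASET_SIZE to your actual train split size.
DATASET_SIZE = 12753
BATCH_SIZE = 16 * 8  # 16 per core on a TPU v3-8

# Number of full batches per epoch; pass this as steps_per_epoch to model.fit.
steps = DATASET_SIZE // BATCH_SIZE
print(steps)  # 99
```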
