Reduce cost and horizontally scale deepspeech.pytorch using TorchElastic with Kubernetes.
End-to-End Speech To Text Models Using Deepspeech.pytorch
Deepspeech.pytorch provides training, evaluation and inference of End-to-End (E2E) speech-to-text models, in particular the highly popularised DeepSpeech2 architecture. Deepspeech.pytorch was developed to give users the flexibility and simplicity to scale, train and deploy their own speech recognition models, whilst maintaining a minimalist design. Deepspeech.pytorch is a lightweight package for research iterations and integrations that fills the gap between audio research and production.
Scale Training Horizontally Using TorchElastic
Training production E2E speech-to-text models currently requires thousands of hours of labelled transcription data. In recent cases, we see numbers exceeding 50k hours of labelled audio data. Training on these datasets requires optimised multi-GPU training and hyper-parameter configurations. As we move towards leveraging unlabelled audio data for our speech recognition models with the announcement of wav2vec 2.0, scaling and throughput will continue to be crucial for training larger models across larger datasets.
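At its core, the multi-GPU data-parallel training mentioned above splits each batch across workers, has each worker compute a gradient on its shard, and then averages the gradients (an all-reduce) so every replica applies the same update. A minimal pure-Python sketch of that idea follows; it is only an illustration with a hypothetical toy least-squares model, not deepspeech.pytorch code, and `all_reduce_mean` stands in for what `torch.distributed.all_reduce` would do across real GPUs.

```python
# Illustrative sketch of data-parallel training: each "worker" computes a
# gradient on its shard of the batch, the gradients are averaged (the
# all-reduce step), and every replica applies the identical update.

def local_gradient(w, shard):
    # Hypothetical toy model: fit y = w * x by minimising mean((w*x - y)^2).
    # Returns d/dw of that loss over this worker's shard.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(grads):
    # Stand-in for torch.distributed.all_reduce followed by dividing by the
    # world size: every worker ends up holding the mean gradient.
    return sum(grads) / len(grads)

def train_step(w, shards, lr):
    grads = [local_gradient(w, shard) for shard in shards]
    return w - lr * all_reduce_mean(grads)

# Two "workers", each holding half of a batch drawn from y = 3x.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = 0.0
for _ in range(50):
    w = train_step(w, shards, lr=0.02)
print(round(w, 3))  # converges to the true weight, 3.0
```

In real frameworks this loop is what `torch.nn.parallel.DistributedDataParallel` automates, and TorchElastic's contribution is letting the set of workers grow or shrink between runs without invalidating the training job.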
Multiple advancements in the field have improved training iteration times, such as the growth of cuDNN, introduction of