【DeepSpeed 教程翻译】三，在 DeepSpeed 中使用 PyTorch Profiler做性能调试和Flops Profiler教程翻译

最新推荐文章于 2025-04-18 02:39:44 发布

just_sort

最新推荐文章于 2025-04-18 02:39:44 发布

阅读量1.9k

点赞数 1

文章标签： pytorch 深度学习人工智能

本文链接：https://blog.csdn.net/just_sort/article/details/131402383

版权

文章目录

0x0. 前言
0x1. 在 DeepSpeed 中使用 PyTorch Profiler做性能调试
0x2. Flops Profiler

0x0. 前言

这篇翻译是对 https://www.deepspeed.ai/tutorials/pytorch-profiler/ 和 https://www.deepspeed.ai/tutorials/flops-profiler/ 两篇教程做的，使用DeepSpeed训练模型可以基于这两个教程做一下Profile工作判断模型的计算以及内存瓶颈在哪个地方。

0x1. 在 DeepSpeed 中使用 PyTorch Profiler做性能调试

对应原始的教程：https://www.deepspeed.ai/tutorials/pytorch-profiler/

这个教程描述的是如何在DeepSpeed中使用PyTorch Profiler工具（https://pytorch.org/blog/introducing-pytorch-profiler-the-new-and-improved-performance-tool/）。

PyTorch Profiler 是一个开源工具，能够为大规模深度学习模型提供精确且高效的性能分析和故障排查。分析结果可以输出为 .json 追踪文件，并在 Google Chrome 的追踪查看器 (chrome://tracing) 中查看。Microsoft Visual Studio Code 的 Python 扩展将 TensorBoard 集成到代码编辑器中，包括对 PyTorch Profiler 的支持。更多的细节可以参考（https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html#pytorch-profiler）

Profile模型训练的循环

下面展示了如何通过在 Profiler 上下文管理器中封装代码来分析训练循环。Profiler 假设训练过程由steps（从零开始编号）组成。PyTorch Profiler 接受许多参数，例如 schedule, on_trace_ready, with_stack 等。

在下面的例子中，分析器将跳过前5步，使用接下来的2步作为预热，并记录接下来的6步。由于repeat设为2，所以分析器将在两个周期后停止记录（这里的周期的意思是将active的step数重复repeat次）。关于 schedule 的详细使用方法，请参考使用Profiler分析长时间运行的任务（https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html#using-profiler-to-analyze-long-running-jobs）。


from torch.profiler import profile, record_function, ProfilerActivity

with torch.profiler.profile(
    schedule=torch.profiler.schedule(
        wait=5, # During this phase profiler is not active.
        warmup=2, # During this phase profiler starts tracing, but the results are discarded.
        active=6, # During this phase profiler traces and records data.
        repeat=2), # Specifies an upper bound on the number of cycles.
    on_trace_ready=tensorboard_trace_handler,
    with_stack=True # Enable stack tracing, adds extra profiling overhead.
) as profiler:
    for step, batch in enumerate(data_loader):
        print("step:{}".format(step))

        #forward() method
        loss = model_engine(batch)

        #runs backpropagation
        model_engine.backward(loss)

        #weight update
        model_engine.step()
        profiler.step() # Send the signal to the profiler that the next step has started.

标记任意代码范围

可以使用 record_function 上下文管理器标记任意用户指定的代码范围。例如，以下代码将 “model_forward” 标记为一个label：

with profile(record_shapes=True) as prof: # record_shapes indicates whether to record shapes of the operator inputs.
    with record_function("model_forward"):"
        model_engine(inputs)

后续在profile结果里面就可以看到标记的这个"model_forward"的耗时情况了。

Profile CPU/GPU的活动

传递给 Profiler 的 activities 参数指定了在使用 profiler 上下文管理器包装的代码范围执行期间要进行性能分析的活动列表：

ProfilerActivity.CPU - PyTorch 操作符、TorchScript 函数和用户定义的代码标签（record_function）。
ProfilerActivity.CUDA - 在设备上的 CUDA 核函数。请注意，CUDA 性能分析会带来不可忽视的开销。
下面的例子在模型的前向传播中对 CPU 和 GPU 的活动进行了性能分析，并按总 CUDA 时间排序打印了总结表。

with profile(activities=[
        ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
    with record_function("model_forward"):
        model_engine(inputs)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

Profile 内存消耗

通过向 PyTorch Profiler传递 profile_memory=True，我们启用了内存分析功能，该功能记录在模型OP的执行过程中分配（或释放）的模型张量所使用的内存量。例如：

with profile(activities=[ProfilerActivity.CUDA],
        profile_memory=True, record_shapes=True) as prof:
    model(inputs)

print(prof.key_averages().table(sort_by="self_cuda_memory_usage", row_limit=10))

0x2. Flops Profiler

对应原始的教程：https://www.deepspeed.ai/tutorials/flops-profiler/

在这个教程中，我们将介绍 DeepSpeed Flops Profiler，并提供其使用的示例。

总览

有效利用硬件资源对于良好的性能至关重要，但在现有的大规模模型训练和推理实现中，性能低效往往难以察觉并归因于特定的模块组件。DeepSpeed Flops Profiler帮助用户轻松测量模型及其子模块的训练/推理速度（延迟，吞吐量）和效率（每秒浮点运算次数，即FLOPS），旨在消除现有实现中的效率低下问题。

以下是在A100 GPU上，批量大小为80的BERT-Large（NVIDIA）的示例输出：

-------------------------- DeepSpeed Flops Profiler --------------------------
Profile Summary at step 10:
Notations:
data parallel size (dp_size), model parallel size(mp_size),
number of parameters (params), number of multiply-accumulate operations(MACs),
number of floating-point operations (flops), floating-point operations per second (FLOPS),
fwd latency (forward propagation latency), bwd latency (backward propagation latency),
step (weights update latency), iter latency (sum of fwd, bwd and step latency)

world size:                                                   1
data parallel size:                                           1
model parallel size:                                          1
batch size per GPU:                                           80
params per gpu:                                               336.23 M
params of model = params per GPU * mp_size:                   336.23 M
fwd MACs per GPU:                                             3139.93 G
fwd flops per GPU:                                            6279.86 G
fwd flops of model = fwd flops per GPU * mp_size:             6279.86 G
fwd latency:                                                  76.67 ms
bwd latency:                                                  108.02 ms
fwd FLOPS per GPU = fwd flops per GPU / fwd latency:          81.9 TFLOPS
bwd FLOPS per GPU = 2 * fwd flops per GPU / bwd latency:      116.27 TFLOPS
fwd+bwd FLOPS per GPU = 3 * fwd flops per GPU / (fwd+bwd latency):   102.0 TFLOPS
step latency:                                                 34.09 us
iter latency:                                                 184.73 ms
samples/second:                                               433.07

----------------------------- Aggregated Profile per GPU -----------------------------
Top modules in terms of params, MACs or fwd latency at different model depths:
depth 0:
    params      - {'BertForPreTrainingPreLN': '336.23 M'}
    MACs        - {'BertForPreTrainingPreLN': '3139.93 GMACs'}
    fwd latency - {'BertForPreTrainingPreLN': '76.39 ms'}
depth 1:
    params      - {'BertModel': '335.15 M', 'BertPreTrainingHeads': '32.34 M'}
    MACs        - {'BertModel': '3092.96 GMACs', 'BertPreTrainingHeads': '46.97 GMACs'}
    fwd latency - {'BertModel': '34.29 ms', 'BertPreTrainingHeads': '3.23 ms'}
depth 2:
    params      - {'BertEncoder': '302.31 M', 'BertLMPredictionHead': '32.34 M'}
    MACs        - {'BertEncoder': '3092.88 GMACs', 'BertLMPredictionHead': '46.97 GMACs'}
    fwd latency - {'BertEncoder': '33.45 ms', 'BertLMPredictionHead': '2.61 ms'}
depth 3:
    params      - {'ModuleList': '302.31 M', 'Embedding': '31.79 M', 'Linear': '31.26 M'}
    MACs        - {'ModuleList': '3092.88 GMACs', 'Linear': '36.23 GMACs'}
    fwd latency - {'ModuleList': '33.11 ms', 'BertPredictionHeadTransform': '1.83 ms''}
depth 4:
    params      - {'BertLayer': '302.31 M', 'LinearActivation': '1.05 M''}
    MACs        - {'BertLayer': '3092.88 GMACs', 'LinearActivation': '10.74 GMACs'}
    fwd latency - {'BertLayer': '33.11 ms', 'LinearActivation': '1.43 ms'}
depth 5:
    params      - {'BertAttention': '100.76 M', 'BertIntermediate': '100.76 M'}
    MACs        - {'BertAttention': '1031.3 GMACs', 'BertIntermediate': '1030.79 GMACs'}
    fwd latency - {'BertAttention': '19.83 ms', 'BertOutput': '4.38 ms'}
depth 6:
    params      - {'LinearActivation': '100.76 M', 'Linear': '100.69 M'}
    MACs        - {'LinearActivation': '1030.79 GMACs', 'Linear': '1030.79 GMACs'}
    fwd latency - {'BertSelfAttention': '16.29 ms', 'LinearActivation': '3.48 ms'}

------------------------------ Detailed Profile per GPU ------------------------------
Each module profile is listed after its name in the following order:
params, percentage of total params, MACs, percentage of total MACs, fwd latency, percentage of total fwd latency, fwd FLOPS

BertForPreTrainingPreLN(
  336.23 M, 100.00% Params, 3139.93 GMACs, 100.00% MACs, 76.39 ms, 100.00% latency, 82.21 TFLOPS,
  (bert): BertModel(
    335.15 M, 99.68% Params, 3092.96 GMACs, 98.50% MACs, 34.29 ms, 44.89% latency, 180.4 TFLOPS,
    (embeddings): BertEmbeddings(...)
    (encoder): BertEncoder(
      302.31 M, 89.91% Params, 3092.88 GMACs, 98.50% MACs, 33.45 ms, 43.79% latency, 184.93 TFLOPS,
      (FinalLayerNorm): FusedLayerNorm(...)
      (layer): ModuleList(
        302.31 M, 89.91% Params, 3092.88 GMACs, 98.50% MACs, 33.11 ms, 43.35% latency, 186.8 TFLOPS,
        (0): BertLayer(
          12.6 M, 3.75% Params, 128.87 GMACs, 4.10% MACs, 1.29 ms, 1.69% latency, 199.49 TFLOPS,
          (attention): BertAttention(
            4.2 M, 1.25% Params, 42.97 GMACs, 1.37% MACs, 833.75 us, 1.09% latency, 103.08 TFLOPS,
            (self): BertSelfAttention(
              3.15 M, 0.94% Params, 32.23 GMACs, 1.03% MACs, 699.04 us, 0.92% latency, 92.22 TFLOPS,
              (query): Linear(1.05 M, 0.31% Params, 10.74 GMACs, 0.34% MACs, 182.39 us, 0.24% latency, 117.74 TFLOPS,...)
              (key): Linear(1.05 M, 0.31% Params, 10.74 GMACs, 0.34% MACs, 57.22 us, 0.07% latency, 375.3 TFLOPS,...)
              (value): Linear(1.05 M, 0.31% Params, 10.74 GMACs, 0.34% MACs, 53.17 us, 0.07% latency, 403.91 TFLOPS,...)
              (dropout): Dropout(...)
              (softmax): Softmax(...)
            )
            (output): BertSelfOutput(
              1.05 M, 0.31% Params, 10.74 GMACs, 0.34% MACs, 114.68 us, 0.15% latency, 187.26 TFLOPS,
              (dense): Linear(1.05 M, 0.31% Params, 10.74 GMACs, 0.34% MACs, 64.13 us, 0.08% latency, 334.84 TFLOPS, ...)
              (dropout): Dropout(...)
            )
          )
          (PreAttentionLayerNorm): FusedLayerNorm(...)
          (PostAttentionLayerNorm): FusedLayerNorm(...)
          (intermediate): BertIntermediate(
            4.2 M, 1.25% Params, 42.95 GMACs, 1.37% MACs, 186.68 us, 0.24% latency, 460.14 TFLOPS,
            (dense_act): LinearActivation(4.2 M, 1.25% Params, 42.95 GMACs, 1.37% MACs, 175.0 us, 0.23% latency, 490.86 TFLOPS,...)
          )
          (output): BertOutput(
            4.2 M, 1.25% Params, 42.95 GMACs, 1.37% MACs, 116.83 us, 0.15% latency, 735.28 TFLOPS,
            (dense): Linear(4.2 M, 1.25% Params, 42.95 GMACs, 1.37% MACs, 65.57 us, 0.09% latency, 1310.14 TFLOPS,...)
            (dropout): Dropout(...)
          )
        )
        ...
        (23): BertLayer(...)
      )
    )
    (pooler): BertPooler(...)
  )
  (cls): BertPreTrainingHeads(...)
)
------------------------------------------------------------------------------

在 profile 总结中，DeepSpeed Flops Profiler输出了模型的参数量，浮点运算数（flops），FLOPS，延迟，以及样本/秒的吞吐量。此概况显示了当前模型执行与硬件峰值性能之间的性能差距，并帮助用户调整训练或推理设置（例如，超参数，数据并行性，模型并行性，系统配置等）以获得更好的性能。

DeepSpeed Flops Profiler还可以在不同的模型深度（聚合profile）和模型架构中的特定模块（详细profile）Profile重要模块。通过这些Profile，DeepSpeed用户可以理解每个层或子模块对整个模型复杂性/性能的贡献。然后，用户可以调整或重构模型设计以提高性能。例如，使用Profiler，DeepSpeed用户可以量化地判断堆叠较小的层是否比拥有较大的层更轻量或性能更好。聚合和详细 Profile 还允许用户快速识别瓶颈模块。在上面的BERT-Large示例中，使用DeepSpeed Flops Profiler，我们发现BertLayer是最重要的层，并且包含了很多dropout，softmax和layer norm以及线性层模块。这些模块在flops中并不heavy，但会触发许多GPU Kernel调用并创建过多的内存读/写请求。详细Profile中显示的Pattern表明这是Kernel融合的完美匹配，我们开发了fused transformer-kernels来减少数据移动（参见 https://www.deepspeed.ai/tutorials/bert-pretraining/）。在应用我们的优化之后，我们在DeepSpeed Flops Profiler的输出中看到每GPU的FLOPS和总体训练样本/秒提高了25%。

DeepSpeed Flops Profiler可以与DeepSpeed运行时一起使用，并无需任何用户代码更改，也可以独立于DeepSpeed作为一个独立的包使用。在使用DeepSpeed进行模型训练时，可以在DeepSpeed配置文件（https://www.deepspeed.ai/docs/config-json/#flops-profiler）中启用分析器。作为一个独立的包，分析器API可以在训练和推理代码中使用。DeepSpeed分析器仍在积极开发中，目前仅包含初始功能。请保持关注，很快会添加更多激动人心的功能。

Flops 测量

与现有的flops计算工具或方法类似，DeepSpeed Flops分析器测量Module前向传播的flops，而反向传播的flops则被估计为前向传播flops的两倍。与计算PyTorch Op的flops的PyTorch分析器不同，DeepSpeed Flops分析器测量模型中模块内部的flops，并为用户提供关于模型执行的更多洞察。flops估计部分受到ptflops（https://github.com/sovrasov/flops-counter.pytorch）的启发，主要区别在于，DeepSpeed Flops分析器不仅支持直接在模块级别进行FLOPS计算，还可以捕获在模块中调用的torch.nn.functional来估计flops。因此，DeepSpeed Flops分析器允许在模型中使用自定义模块，例如Megatron-LM中的ParallelTransformerLayerworks、ParallelSelfAttention、RowParallelLinear等。这与ptflops形成对比，ptflops要求用户为每个自定义模块编写自定义的flops计算函数。

多GPU，多节点，数据并行和模型并行

DeepSpeed Flops 分析器输出每个 GPU 的分析结果以及world size，数据并行大小和模型并行大小。

对于在多 GPU 或多节点上运行的模型，只有模型并行（例如，Megatron-LM 中的 --model-parallel-size）的改变会影响浮点操作数和Paramater的分析结果，即，model_parallel_size * flops = total_flops 和 model_parallel_size * parameters = total_parameters。数据并行大小或world size（与 GPU 或节点的数量相关）不会影响每个 GPU 的分析结果。

例子

DeepSpeed Flops 分析器可以与 DeepSpeed 运行时一起使用，也可以作为一个独立的包使用。当使用 DeepSpeed 进行模型训练时，用户无需更改代码，就可以在 deepspeed 配置文件（https://www.deepspeed.ai/docs/config-json/#flops-profiler）中配置分析器。要在 DeepSpeed 运行时之外使用 flops 分析器，安装 DeepSpeed 并导入 flops_profiler 包直接使用 API。下面给出了每种使用方法的示例。

和DeepSpeed运行时一起使用

当使用 DeepSpeed 进行模型训练时，可以在 deepspeed 配置文件中配置分析器。使用分析器不需要明确的 API 调用。可以通过在 deepspeed 的配置 json 文件中添加以下字段来启用分析器。具体详情请参考 flops profiler(https://www.deepspeed.ai/docs/config-json/#flops-profiler)。

{
  "flops_profiler": {
    "enabled": true,
    "profile_step": 1,
    "module_depth": -1,
    "top_modules": 1,
    "detailed": true,
    "output_file": null
    }
}

在Megatron-LM中使用

关于使用 DeepSpeed 运行 Megatron-LM 的信息，请参考我们的教程 Megatron-LM。

下面展示了一个 12 层 Megatron-LM 模型的示例输出（hidden_size = 8192，num_attention_heads = 32，batch_size = 1024，seq_length = 1024）。

-------------------------- DeepSpeed Flops Profiler --------------------------
Profile Summary at step 10:
Notations:
data parallel size (dp_size), model parallel size(mp_size),
number of parameters (params), number of multiply-accumulate operations(MACs),
number of floating-point operations (flops), floating-point operations per second (FLOPS),
fwd latency (forward propagation latency), bwd latency (backward propagation latency),
step (weights update latency), iter latency (sum of fwd, bwd and step latency)

world size:                                                   1
data parallel size:                                           1
model parallel size:                                          1
batch size per GPU:                                           1024
params per gpu:                                               1.29 M
params of model = params per GPU * mp_size:                   1.29 M
fwd MACs per GPU:                                             41271.95 G
fwd flops per GPU:                                            82543.9 G
fwd flops of model = fwd flops per GPU * mp_size:             82543.9 G
fwd latency:                                                  1.89 s
bwd latency:                                                  5.38 s
fwd FLOPS per GPU = fwd flops per GPU / fwd latency:          43.68 TFLOPS
bwd FLOPS per GPU = 2 * fwd flops per GPU / bwd latency:      30.7 TFLOPS
fwd+bwd FLOPS per GPU = 3 * fwd flops per GPU / (fwd+bwd latency):   34.07 TFLOPS
step latency:                                                 34.12 s
iter latency:                                                 41.39 s
samples/second:                                               24.74

----------------------------- Aggregated Profile per GPU -----------------------------
Top 1 modules in terms of params, MACs or fwd latency at different model depths:
depth 0:
    params      - {'GPT2Model': '1.29 M'}
    MACs        - {'GPT2Model': '41271.95 GMACs'}
    fwd latency - {'GPT2Model': '1.84 s'}
depth 1:
    params      - {'TransformerLanguageModel': '1.29 M'}
    MACs        - {'TransformerLanguageModel': '39584.03 GMACs'}
    fwd latency - {'TransformerLanguageModel': '1.83 s'}
depth 2:
    params      - {'ParallelTransformer': '1.29 M'}
    MACs        - {'ParallelTransformer': '39584.03 GMACs'}
    fwd latency - {'ParallelTransformer': '1.81 s'}
depth 3:
    params      - {'ModuleList': '1.28 M'}
    MACs        - {'ModuleList': '39584.03 GMACs'}
    fwd latency - {'ModuleList': '1.3 s'}
depth 4:
    params      - {'ParallelTransformerLayerPart2': '688.15 k'}
    MACs        - {'ParallelTransformerLayerPart2': '26388.28 GMACs'}
    fwd latency - {'ParallelTransformerLayerPart2': '865.73 ms'}
depth 5:
    params      - {'ParallelMLP': '491.54 k'}
    MACs        - {'ParallelMLP': '26388.28 GMACs'}
    fwd latency - {'ParallelMLP': '849.4 ms'}

------------------------------ Detailed Profile per GPU ------------------------------
Each module profile is listed after its name in the following order:
params, percentage of total params, MACs, percentage of total MACs, fwd latency, percentage of total fwd latency, fwd FLOPS

Note: 1. A module can have torch.nn.module or torch.nn.functional to compute logits (e.g. CrossEntropyLoss). They are not counted as submodules, thus not to be printed out. However they make up the difference between a parent's MACs(or latency) and the sum of its submodules'.
1. Number of floating-point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throughput.
2. The fwd latency listed in the top module's profile is directly captured at the module forward function in PyTorch, thus it's less than the fwd latency shown above which is captured in DeepSpeed.

GPT2Model(
  1.29 M, 100.00% Params, 41271.95 GMACs, 100.00% MACs, 1.84 s, 100.00% latency, 44.78 TFLOPS,
  (language_model): TransformerLanguageModel(
    1.29 M, 100.00% Params, 39584.03 GMACs, 95.91% MACs, 1.83 s, 99.11% latency, 43.34 TFLOPS,
    (embedding): Embedding(
      2, 0.00% Params, 0 MACs, 0.00% MACs, 18.1 ms, 0.98% latency, 0.0 FLOPS,
      (word_embeddings): VocabParallelEmbedding(1, 0.00% Params, 0 MACs, 0.00% MACs, 164.75 us, 0.01% latency, 0.0 FLOPS, )
      (position_embeddings): Embedding(1, 0.00% Params, 0 MACs, 0.00% MACs, 489.23 us, 0.03% latency, 0.0 FLOPS, 1024, 8192)
      (embedding_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 93.94 us, 0.01% latency, 0.0 FLOPS, p=0.1, inplace=False)
    )
    (transformer): ParallelTransformer(
      1.29 M, 100.00% Params, 39584.03 GMACs, 95.91% MACs, 1.81 s, 98.11% latency, 43.78 TFLOPS,
      (layers): ModuleList(
        1.28 M, 98.73% Params, 39584.03 GMACs, 95.91% MACs, 1.3 s, 70.66% latency, 60.79 TFLOPS,
        (0): ParallelTransformerLayerPart1(
          49.15 k, 3.80% Params, 1099.65 GMACs, 2.66% MACs, 23.5 ms, 1.27% latency, 93.6 TFLOPS,
          (input_layernorm): FusedLayerNorm(16.38 k, 1.27% Params, 0 MACs, 0.00% MACs, 128.75 us, 0.01% latency, 0.0 FLOPS, torch.Size([8192]), eps=1e-05, elementwise_affine=True)
          (attention): ParallelSelfAttention(
            32.77 k, 2.53% Params, 1099.65 GMACs, 2.66% MACs, 22.8 ms, 1.24% latency, 96.46 TFLOPS,
            (query_key_value): ColumnParallelLinear(24.58 k, 1.90% Params, 824.63 GMACs, 2.00% MACs, 8.93 ms, 0.48% latency, 184.7 TFLOPS, )
            (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 134.22 MMACs, 0.00% MACs, 151.16 us, 0.01% latency, 1.78 TFLOPS, )
            (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 79.63 us, 0.00% latency, 0.0 FLOPS, p=0.1, inplace=False)
            (dense): RowParallelLinear(8.19 k, 0.63% Params, 274.88 GMACs, 0.67% MACs, 2.67 ms, 0.14% latency, 205.81 TFLOPS, )
          )
        )
        (1): ParallelTransformerLayerPart2(
          57.35 k, 4.43% Params, 2199.02 GMACs, 5.33% MACs, 77.53 ms, 4.21% latency, 56.73 TFLOPS,
          (post_attention_layernorm): FusedLayerNorm(16.38 k, 1.27% Params, 0 MACs, 0.00% MACs, 116.11 us, 0.01% latency, 0.0 FLOPS, torch.Size([8192]), eps=1e-05, elementwise_affine=True)
          (mlp): ParallelMLP(
            40.96 k, 3.16% Params, 2199.02 GMACs, 5.33% MACs, 76.19 ms, 4.13% latency, 57.72 TFLOPS,
            (dense_h_to_4h): ColumnParallelLinear(32.77 k, 2.53% Params, 1099.51 GMACs, 2.66% MACs, 10.79 ms, 0.59% latency, 203.81 TFLOPS, )
            (dense_4h_to_h): RowParallelLinear(8.19 k, 0.63% Params, 1099.51 GMACs, 2.66% MACs, 14.38 ms, 0.78% latency, 152.95 TFLOPS, )
          )
        )
        ...
        (23): ParallelTransformerLayerPart2(...)
      )
      (final_layernorm): FusedLayerNorm(16.38 k, 1.27% Params, 0 MACs, 0.00% MACs, 110.86 us, 0.01% latency, 0.0 FLOPS, torch.Size([8192]), eps=1e-05, elementwise_affine=True)
    )
  )
)
------------------------------------------------------------------------------

可以参考最新的DeepSpeed-Megatron仓库，然后在训练模型时将DeepSpeed的config文件配置DeepSpeed Profiler。

在 DeepSpeed 运行环境之外的使用方法

profiler 可以在 DeepSpeed 运行时环境之外作为一个独立的包来使用。你只需要简单地安装 DeepSpeed 并导入 flops_profiler 包来直接使用 API。关于如何安装 DeepSpeed，请参考 DeepSpeed 的安装指南。

在模型推理中

要对推理状态的训练模型进行性能分析，请使用 get_model_profile 函数。下面给出了一些示例。

AlexNet例子

以下示例展示了如何使用 DeepSpeed flops 分析器对 AlexNet 进行性能分析。

import torchvision.models as models
import torch
from deepspeed.profiling.flops_profiler import get_model_profile
from deepspeed.accelerator import get_accelerator

with get_accelerator().device(0):
    model = models.alexnet()
    batch_size = 256
    flops, macs, params = get_model_profile(model=model, # model
                                    input_shape=(batch_size, 3, 224, 224), # input shape to the model. If specified, the model takes a tensor with this shape as the only positional argument.
                                    args=None, # list of positional arguments to the model.
                                    kwargs=None, # dictionary of keyword arguments to the model.
                                    print_profile=True, # prints the model graph with the measured profile attached to each module
                                    detailed=True, # print the detailed profile
                                    module_depth=-1, # depth into the nested modules, with -1 being the inner most modules
                                    top_modules=1, # the number of top modules to print aggregated profile
                                    warm_up=10, # the number of warm-ups before measuring the time of each module
                                    as_string=True, # print raw numbers (e.g. 1000) or as human-readable strings (e.g. 1k)
                                    output_file=None, # path to the output file. If None, the profiler prints to stdout.
                                    ignore_modules=None) # the list of modules to ignore in the profiling

BERT例子

from functools import partial
import torch
from transformers import BertForSequenceClassification, BertTokenizer
from deepspeed.profiling.flops_profiler import get_model_profile
from deepspeed.accelerator import get_accelerator


def bert_input_constructor(batch_size, seq_len, tokenizer):
    fake_seq = ""
    for _ in range(seq_len - 2):  # ignore the two special tokens [CLS] and [SEP]
      fake_seq += tokenizer.pad_token
    inputs = tokenizer([fake_seq] * batch_size,
                       padding=True,
                       truncation=True,
                       return_tensors="pt")
    labels = torch.tensor([1] * batch_size)
    inputs = dict(inputs)
    inputs.update({"labels": labels})
    return inputs


with get_accelerator().device(0):
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
    batch_size = 4
    seq_len = 128
    enable_profile = True
    if enable_profile:
      flops, macs, params = get_model_profile(
          model,
          kwargs=bert_input_constructor(batch_size, seq_len, tokenizer),
          print_profile=True,
          detailed=True,
      )
    else:
      inputs = bert_input_constructor((batch_size, seq_len), tokenizer)
      outputs = model(inputs)

在模型训练工作流中

要在训练工作流中对模型的前向过程进行性能分析，请使用 FlopsProfiler 类。FlopsProfiler 类提供了以下方法：

start_profile() - 开始profiling。
get_total_flops(as_string=False)- 返回模型中的浮点操作的总数。
get_total_macs(as_string=False- 返回模型中的macs的总数。
get_total_params(as_string=False)- 返回模型中参数的总数。
print_model_profile(profile_step=1, module_depth=-1, top_modules=3, detailed=True, output_file=None)-打印模型profile。
stop_profile()-停止性能分析。这将停止模型中的浮点运算计数。
end_profile()-进行清理。这将清理在性能分析过程中添加到模型的性能分析属性。这应该在性能分析结束并且在调用get_total_flops、get_total_params或print_model_profile之后进行。

训练工作流例子

以下是一个典型的训练工作流中使用该方法的示例。

from deepspeed.profiling.flops_profiler import FlopsProfiler

model = Model()
prof = FlopsProfiler(model)

profile_step = 5
print_profile= True

for step, batch in enumerate(data_loader):
  # start profiling at training step "profile_step"
  if step == profile_step:
    prof.start_profile()

  # forward() method
  loss = model(batch)

  # end profiling and print output
  if step == profile_step: # if using multi nodes, check global_rank == 0 as well
    prof.stop_profile()
    flops = prof.get_total_flops()
    macs = prof.get_total_macs()
    params = prof.get_total_params()
    if print_profile:
        prof.print_model_profile(profile_step=profile_step)
    prof.end_profile()

  # runs backpropagation
  loss.backward()

  # weight update
  optimizer.step()