RL4CO Code Study Notes 01


Getting familiar with the resources

The \experiment files (i.e., configs/experiment/)

Directory overview

  • configs/experiment/base.yaml:7 composes model/env/callbacks/trainer/logger via Hydra defaults, providing a tutorial template for AM + TSP + REINFORCE.
  • configs/experiment/base.yaml:18 sets the TSP generator to num_loc=50 and disables check_solution, showing how to override environment parameters at the task level.
  • configs/experiment/base.yaml:24 uses WandB to record project/tags/naming conventions, making it easy to aggregate experiments of the same problem size.
  • configs/experiment/base.yaml:34 fixes the data sizes, batch sizes, and max_epochs in the model and trainer blocks; finally seed: 1234 unifies the randomness.
  • configs/experiment/routing/am.yaml:3 shows that subdirectories override defaults to point at their own model/env config groups, deriving concrete experiment scripts.
# @package _global_
# Example configuration for experimenting. Trains the Attention Model on
# the TSP environment with 50 locations via REINFORCE with greedy rollout baseline.
# You may find comments on the most common hyperparameters below.

# Override defaults: take configs from relative path 
# Composes model/env/callbacks/trainer/logger via Hydra defaults, providing a tutorial template for AM + TSP + REINFORCE.
defaults:
  - override /model: am.yaml
  - override /env: tsp.yaml
  - override /callbacks: default.yaml
  - override /trainer: default.yaml
  # - override /logger: null # comment this line to enable logging
  - override /logger: wandb.yaml

# Environment configuration
# Note that here we load by default the `.npz` files for the TSP environment
# that are automatically generated with seed following Kool et al. (2019).
# Sets the TSP generator to num_loc=50 and disables check_solution, showing how to override environment parameters at the task level.
env:
  generator_params:
    num_loc: 50
  check_solution: False # optimization

# Logging: we use Wandb in this case
# Uses WandB to record project/tags/naming conventions so experiments of the same problem size can be grouped together.
logger:
  wandb:
    project: "rl4co"
    tags: ["am", "tsp"]
    group: "tsp${env.generator_params.num_loc}"
    name: "am-tsp${env.generator_params.num_loc}"

# Model: this contains the environment (which gets automatically passed to the model on
# initialization), the policy network and other hyperparameters.
# This is a `LightningModule` and can be trained with PyTorch Lightning.
# Fixes data sizes, batch sizes, and max_epochs in the model and trainer blocks; finally seed: 1234 unifies the randomness.
model:
  batch_size: 512
  val_batch_size: 1024
  test_batch_size: 1024
  train_data_size: 1_280_000
  val_data_size: 10_000
  test_data_size: 10_000
  optimizer_kwargs:
    lr: 1e-4

# Trainer: this is a customized version of the PyTorch Lightning trainer.
trainer:
  max_epochs: 100

seed: 1234

EDA & Graph

  • configs/experiment/eda/am.yaml:3 applies AM + REINFORCE to the decap placement problem (mDPP) from the EDA domain, shrinking the batch/data sizes, keeping lr=1e-4, and adding weight_decay=1e-3.
# @package _global_
# Subdirectories override defaults to point at their own model/env config groups, deriving concrete experiment scripts.
# Applies AM + REINFORCE to the decap placement problem (mDPP), shrinking batch/data sizes, keeping lr=1e-4 and adding weight_decay=1e-3.
defaults:
  - override /model: am.yaml
  - override /env: mdpp.yaml
  - override /callbacks: default.yaml
  - override /trainer: default.yaml
  - override /logger: wandb.yaml

logger:
  wandb:
    project: "rl4co"
    tags: ["am", "${env.name}"]
    group: ${env.name}
    name: am-${env.name}


model:
  batch_size: 64
  train_data_size: 500
  val_data_size: 100
  test_data_size: 100
  optimizer_kwargs:
    lr: 1e-4
    weight_decay: 1e-3

trainer:
  max_epochs: 10

seed: 1234
  • configs/experiment/eda/am-a2c.yaml:18 switches the model to rl4co.models.A2C, reusing the AM policy but splitting actor/critic optimizers and keeping the small-sample setup.
# @package _global_

defaults:
  - override /model: am.yaml
  - override /env: mdpp.yaml
  - override /callbacks: default.yaml
  - override /trainer: default.yaml
  - override /logger: wandb.yaml

logger:
  wandb:
    project: "rl4co"
    tags: ["am", "${env.name}"]
    group: ${env.name}
    name: am-a2c-${env.name}

# Switches the model to rl4co.models.A2C, reusing the AM policy but splitting actor/critic optimizers and keeping the small-sample setup.
model:
  _target_: rl4co.models.A2C
  policy:
    _target_: rl4co.models.AttentionModelPolicy
    env_name: "${env.name}"
  actor_optimizer_kwargs:
    lr: 1e-4
    weight_decay: 1e-3
  critic_optimizer_kwargs: null # default to actor_optimizer_kwargs
  batch_size: 64
  train_data_size: 500
  val_data_size: 100
  test_data_size: 100

trainer:
  max_epochs: 10

seed: 1234
  • configs/experiment/eda/am-ppo.yaml:19 switches to a PPO setup (clip_range/ppo_epochs/mini_batch_size), keeping the fast 10-epoch experiment.
# @package _global_

defaults:
  - override /model: am-ppo.yaml
  - override /env: mdpp.yaml
  - override /callbacks: default.yaml
  - override /trainer: default.yaml
  - override /logger: wandb.yaml


logger:
  wandb:
    project: "rl4co"
    tags: ["am-ppo", "${env.name}"]
    group: ${env.name}
    name: am-ppo-${env.name}

# Switches to a PPO setup (clip_range/ppo_epochs/mini_batch_size), keeping the fast 10-epoch experiment.
model:
  batch_size: 64
  train_data_size: 1000
  val_data_size: 100
  test_data_size: 100
  clip_range: 0.2
  ppo_epochs: 2
  mini_batch_size: ${model.batch_size}
  vf_lambda: 0.5
  entropy_lambda: 0.01
  normalize_adv: False
  max_grad_norm: 0.5
  optimizer_kwargs:
    lr: 1e-4
    weight_decay: 1e-3

trainer:
  max_epochs: 10
  gradient_clip_val: Null # not supported in manual optimization
  precision: "32-true" # NOTE: this seems to be important during manual optimization

seed: 1234
  • configs/experiment/graph/am.yaml:3 loads the AM policy for a graph task (FLP, facility location); it is similar to the EDA template, but points to a graph-related env group and uses batch_size=1000 with 100k training samples.
# @package _global_
# Loads the AM policy for a graph task (FLP, facility location); similar to the EDA template, but with a graph env group, batch_size=1000, and 100k training samples.
defaults:
  - override /model: am.yaml
  - override /env: flp.yaml
  - override /callbacks: default.yaml
  - override /trainer: default.yaml
  - override /logger: wandb.yaml

logger:
  wandb:
    project: "rl4co"
    tags: ["am", "${env.name}"]
    group: ${env.name}
    name: am-${env.name}

model:
  batch_size: 1000
  train_data_size: 100_000
  val_data_size: 1000
  test_data_size: 1000
  optimizer_kwargs:
    lr: 1e-4
  
trainer:
  max_epochs: 100

seed: 1234

Routing A

  • configs/experiment/routing/am.yaml:3 serves as the routing baseline: AM + TSP + REINFORCE with lr_scheduler=MultiStepLR (milestones 80/95) and batch_size=512.
# @package _global_
# Routing baseline: AM + TSP + REINFORCE with lr_scheduler=MultiStepLR (milestones 80/95) and batch_size=512.
defaults:
  - override /model: am.yaml
  - override /env: tsp.yaml
  - override /callbacks: default.yaml
  - override /trainer: default.yaml
  - override /logger: wandb.yaml

env:
  generator_params:
    num_loc: 50

logger:
  wandb:
    project: "rl4co"
    tags: ["am", "${env.name}"]
    group: ${env.name}${env.generator_params.num_loc}
    name: am-${env.name}${env.generator_params.num_loc}


model:
  batch_size: 512
  val_batch_size: 1024
  test_batch_size: 1024
  train_data_size: 1_280_000
  val_data_size: 10_000
  test_data_size: 10_000
  optimizer_kwargs:
    lr: 1e-4
    weight_decay: 1e-6
  lr_scheduler:
    "MultiStepLR"
  lr_scheduler_kwargs:
    milestones: [80, 95]
    gamma: 0.1

trainer:
  max_epochs: 100

seed: 1234
  • configs/experiment/routing/am-xl.yaml:22 scales the model up: num_encoder_layers=6, batch_size=2048, and 500 training epochs, targeting long, large-batch training runs.
# @package _global_

defaults:
  - override /model: am.yaml
  - override /env: tsp.yaml
  - override /callbacks: default.yaml
  - override /trainer: default.yaml
  - override /logger: wandb.yaml

env:
  generator_params:
    num_loc: 50

logger:
  wandb:
    project: "rl4co"
    tags: ["am", "${env.name}"]
    group: "${env.name}${env.generator_params.num_loc}"
    name: "am-xl-${env.name}${env.generator_params.num_loc}"

# Scaled-up model: num_encoder_layers=6, batch_size=2048, 500 training epochs, targeting long, large-batch training runs.
model:
  policy_kwargs:
    num_encoder_layers: 6
    normalization: 'instance'
  batch_size: 2048
  val_batch_size: 1024
  test_batch_size: 1024
  train_data_size: 1_280_000
  val_data_size: 10_000
  test_data_size: 10_000
  optimizer_kwargs:
    lr: 1e-4
    weight_decay: 1e-6
  lr_scheduler:
    "MultiStepLR"
  lr_scheduler_kwargs:
    milestones: [480, 495]
    gamma: 0.1

trainer:
  max_epochs: 500

seed: 1234
  • configs/experiment/routing/am-svrp.yaml:20 switches to the SVRP environment (stochastic demands) and lowers the learning rate to 1e-6 to stabilize training.
# @package _global_

defaults:
  - override /model: am.yaml
  - override /env: svrp.yaml
  - override /callbacks: default.yaml
  - override /trainer: default.yaml
  - override /logger: wandb.yaml

env:
  num_loc: 50

logger:
  wandb:
    project: "rl4co"
    tags: ["am", "${env.name}"]
    group: ${env.name}${env.generator_params.num_loc}
    name: am-${env.name}${env.generator_params.num_loc}
# Switches to the SVRP environment (stochastic demands) and lowers the learning rate to 1e-6 to stabilize training.
model:
  batch_size: 512
  val_batch_size: 1024
  test_batch_size: 1024
  train_data_size: 1_280_000
  val_data_size: 10_000
  test_data_size: 10_000
  optimizer_kwargs:
    lr: 1e-6
    weight_decay: 0
  lr_scheduler:
    "MultiStepLR"
  lr_scheduler_kwargs:
    milestones: [80, 95]
    gamma: 0.1

trainer:
  max_epochs: 100

seed: 1234
  • configs/experiment/routing/am-a2c.yaml:14 swaps the model for A2C while keeping the AM policy, reusing the same Hydra composition as REINFORCE for quick algorithm comparisons.
# @package _global_

# Use the following to take the default values from am.yaml
# Replace below only the values that you want to change compared to the default values
defaults:
  - routing/am.yaml
  - _self_

logger:
  wandb:
    tags: ["am-a2c", "${env.name}"]
    name: am-a2c-${env.name}${env.generator_params.num_loc}
# Swaps the model for A2C while keeping the AM policy, reusing the same Hydra composition as REINFORCE for quick algorithm comparisons.
model:
  _target_: rl4co.models.A2C
  policy:
    _target_: rl4co.models.AttentionModelPolicy
    env_name: "${env.name}"
  actor_optimizer_kwargs:
    lr: 1e-4
    weight_decay: 1e-6
  critic_optimizer_kwargs: null # default to actor_optimizer_kwargs
  • configs/experiment/routing/am-ppo.yaml:22 configures the PPO variant of AM: clip_range=0.2, ppo_epochs=2, max_grad_norm=0.5, and forces precision="32-true" for manual optimization.
# @package _global_

defaults:
  - override /model: am-ppo.yaml
  - override /env: tsp.yaml
  - override /callbacks: default.yaml
  - override /trainer: default.yaml
  - override /logger: wandb.yaml

env:
  generator_params:
    num_loc: 50

logger:
  wandb:
    project: "rl4co"
    tags: ["am-ppo", "${env.name}"]
    group: ${env.name}${env.generator_params.num_loc}
    name: ppo-${env.name}${env.generator_params.num_loc}

# PPO variant of AM: clip_range=0.2, ppo_epochs=2, max_grad_norm=0.5; precision="32-true" is forced for manual optimization.
model:
  batch_size: 512
  val_batch_size: 1024
  test_batch_size: 1024
  train_data_size: 1_280_000
  val_data_size: 10_000
  test_data_size: 10_000
  clip_range: 0.2
  ppo_epochs: 2
  mini_batch_size: 512
  vf_lambda: 0.5
  entropy_lambda: 0.01
  normalize_adv: False
  max_grad_norm: 0.5
  optimizer_kwargs:
    lr: 1e-4
    weight_decay: 1e-6
  lr_scheduler:
    "MultiStepLR"
  lr_scheduler_kwargs:
    milestones: [80, 95]
    gamma: 0.1

trainer:
  max_epochs: 100
  gradient_clip_val: Null # not supported in manual optimization
  precision: "32-true" # NOTE: this seems to be important during manual optimization

seed: 1234

Routing B

  • configs/experiment/routing/pomo.yaml:3 sets up the POMO (multi-start policy gradient) baseline with batch_size=64, 160k training samples, and the same MultiStepLR schedule.
# @package _global_
# POMO (multi-start policy gradient) baseline: batch_size=64, 160k training samples, MultiStepLR schedule.
defaults:
  - override /model: pomo.yaml
  - override /env: tsp.yaml
  - override /callbacks: default.yaml
  - override /trainer: default.yaml
  - override /logger: wandb.yaml

env:
  generator_params:
    num_loc: 50

logger:
  wandb:
    project: "rl4co"
    tags: ["pomo", "${env.name}"]
    group: "${env.name}${env.generator_params.num_loc}"
    name: "pomo-${env.name}${env.generator_params.num_loc}"


model:
  batch_size: 64
  train_data_size: 160_000
  val_data_size: 10_000
  test_data_size: 10_000
  optimizer_kwargs:
    lr: 1e-4
    weight_decay: 1e-6
  lr_scheduler:
    "MultiStepLR"
  lr_scheduler_kwargs:
    milestones: [80, 95]
    gamma: 0.1

trainer:
  max_epochs: 100

seed: 1234

Using the configuration files above:

  • First set up a virtual environment and install the dependencies as described in README.md / PROJECT_MANUAL.md (recommended: uv sync --all-extras or pip install -e .[all]).
  • From the project root, run the Hydra entry script, for example:
  python run.py experiment=routing/pomo

Here experiment=routing/pomo points to configs/experiment/routing/pomo.yaml; Hydra loads that file automatically and, following its defaults, composes the sub-configs model: pomo.yaml, env: tsp.yaml, and so on.
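
For reference, the same composition can also be reproduced directly in Python without Hydra. The sketch below is a minimal, illustrative example based on RL4CO's documented quickstart and mirrors configs/experiment/base.yaml (AM + TSP(50) + REINFORCE with a greedy rollout baseline); exact import paths and constructor arguments may differ slightly between RL4CO versions, so treat it as an assumption to verify against your installed version.

# Minimal sketch, assuming RL4CO's quickstart-style API; verify against your version.
from rl4co.envs import TSPEnv
from rl4co.models import AttentionModel
from rl4co.utils.trainer import RL4COTrainer

# env: tsp.yaml with generator_params.num_loc=50
env = TSPEnv(generator_params={"num_loc": 50})

# model: am.yaml (AM policy + REINFORCE + greedy rollout baseline);
# data sizes mirror base.yaml and can be reduced for a quick smoke test
model = AttentionModel(
    env,
    baseline="rollout",
    batch_size=512,
    train_data_size=1_280_000,
    val_data_size=10_000,
    optimizer_kwargs={"lr": 1e-4},
)

# trainer: default.yaml, a thin wrapper around the PyTorch Lightning Trainer
trainer = RL4COTrainer(max_epochs=100)
trainer.fit(model)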

How to modify the method

To change the "method" behind configs/experiment/routing/pomo.yaml, you need to adjust the Python classes/modules it points to. The main entry points are listed below (a subclassing sketch follows the list):

  • rl4co/models/zoo/pomo/model.py:15 defines the class behind _target_: rl4co.models.POMO, i.e., the LightningModule of the POMO algorithm; its __init__, shared_step, etc. control the training/inference flow and are the most direct implementation of the "method".
  • AttentionModelPolicy in rl4co/models/zoo/am/policy.py:10 is the policy network POMO uses by default; to change the encoding/decoding logic, drill down into
    • rl4co/models/zoo/am/encoder.py (the attention encoder) and
    • rl4co/models/zoo/am/decoder.py (the pointer-style decoder).
  • rl4co/envs/routing/tsp/env.py:24 defines TSPEnv; since pomo.yaml loads configs/env/tsp.yaml by default, changes to the environment's state transitions or reward go here.
  • Other dependencies:
    • The SharedBaseline baseline lives in rl4co/models/rl/baselines/shared.py.
    • The StateAugmentation data augmentation is in rl4co/data/transforms/state.py.
    • To change training epochs or the scheduler, look at configs/model/pomo.yaml (the Hydra config layer) or rl4co/models/rl/reinforce/reinforce.py (the parent-class logic).
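
A minimal subclassing sketch for the first entry point: only rl4co.models.POMO and the shared_step name come from the repo; MyPOMO and its module path are hypothetical, for illustration only.

# Hypothetical sketch: subclass the POMO LightningModule and hook into shared_step.
from rl4co.models import POMO

class MyPOMO(POMO):
    def shared_step(self, *args, **kwargs):
        # run the original POMO training/validation step
        out = super().shared_step(*args, **kwargs)
        # custom logging, reward shaping, etc. could be added here
        return out

The derived experiment config would then point model._target_ at this class (e.g. my_project.models.MyPOMO, a hypothetical path), exactly as routing/am-a2c.yaml does when it swaps in rl4co.models.A2C.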


  • configs/experiment/routing/ar-gnn.yaml:22 plugs a NARGNNNodeEncoder into the POMO framework, replacing the AM encoder with a GNN encoder.
# @package _global_

defaults:
  - override /model: pomo.yaml
  - override /env: tsp.yaml
  - override /callbacks: default.yaml
  - override /trainer: default.yaml
  - override /logger: wandb.yaml

env:
  generator_params:
    num_loc: 50

logger:
  wandb:
    project: "rl4co"
    tags: ["pomo", "${env.name}"]
    group: "${env.name}${env.generator_params.num_loc}"
    name: "pomo-${env.name}${env.generator_params.num_loc}"

# Plugs a NARGNNNodeEncoder into the POMO framework, replacing the AM encoder with a GNN encoder.
model:
  policy:
      _target_: rl4co.models.zoo.am.policy.AttentionModelPolicy
      encoder:
        _target_: rl4co.models.zoo.nargnn.encoder.NARGNNNodeEncoder
        embed_dim: 128
        env_name: "${env.name}"
      env_name: "${env.name}"
  batch_size: 64
  train_data_size: 160_000
  val_data_size: 10_000
  test_data_size: 10_000
  optimizer_kwargs:
    lr: 1e-4
    weight_decay: 1e-6
  lr_scheduler:
    "MultiStepLR"
  lr_scheduler_kwargs:
    milestones: [80, 95]
    gamma: 0.1

trainer:
  max_epochs: 100

seed: 1234
  • configs/experiment/routing/mdpomo.yaml:24 targets CVRP with a mixed-distribution generator, extending to a 10k training set and 10,000 epochs to match the paper's MDPOMO setup.
# @package _global_
#
defaults:
  - override /model: pomo.yaml
  - override /env: cvrp.yaml
  - override /callbacks: default.yaml
  - override /trainer: default.yaml
  - override /logger: wandb.yaml

env:
  generator_params:
    num_loc: 50
    loc_distribution: "mix_distribution"
    

logger:
  wandb:
    project: "rl4co"
    tags: ["mdpomo", "${env.name}"]
    group: "${env.name}${env.generator_params.num_loc}"
    name: "mdpomo-${env.name}${env.generator_params.num_loc}"

# CVRP with a mixed-distribution generator; extended to a 10k training set and 10,000 epochs to match the paper's MDPOMO setup.
model:
  batch_size: 512
  train_data_size: 10_000
  val_data_size: 10_000
  test_data_size: 10_000
  optimizer_kwargs:
    lr: 1e-4
    weight_decay: 1e-6
  lr_scheduler:
    "MultiStepLR"
  lr_scheduler_kwargs:
    milestones: [9001]
    gamma: 0.1

trainer:
  max_epochs: 10000

seed: 1234
  • configs/experiment/routing/ptrnet.yaml:21 reproduces the Pointer Network baseline and adds an extra data block so the DataModule batch sizes match the model's.
# @package _global_
defaults:
  - override /model: ptrnet.yaml
  - override /env: tsp.yaml
  - override /callbacks: default.yaml
  - override /trainer: default.yaml
  - override /logger: wandb.yaml

env:
  generator_params:
    num_loc: 50

logger:
  wandb:
    project: "rl4co"
    tags: ["ptrnet", "${env.name}"]
    group: ${env.name}${env.generator_params.num_loc}
    name: ptrnet-${env.name}${env.generator_params.num_loc}

# Pointer Network baseline; the extra data block keeps the DataModule batch sizes consistent with the model.
model:
  batch_size: 512
  val_batch_size: 1024
  test_batch_size: 1024
  train_data_size: 1_280_000
  val_data_size: 10_000
  test_data_size: 10_000
  optimizer_kwargs:
    lr: 1e-4
    weight_decay: 1e-6
  lr_scheduler:
    "MultiStepLR"
  lr_scheduler_kwargs:
    milestones: [80, 95]
    gamma: 0.1

trainer:
  max_epochs: 100

data:
  batch_size: 512
  train_size: 1_280_000
  val_size: 10_000

seed: 1234
  • configs/experiment/routing/polynet.yaml:22 trains PolyNet with multiple solutions, specifying k=100 and val_num_solutions, and injects the k value into the run name.
# @package _global_

defaults:
  - override /model: polynet.yaml
  - override /env: tsp.yaml
  - override /callbacks: default.yaml
  - override /trainer: default.yaml
  - override /logger: wandb.yaml

env:
  generator_params:
    num_loc: 50
  check_solution: False # optimization

logger:
  wandb:
    project: "rl4co"
    tags: ["polynet", "${env.name}"]
    group: "${env.name}${env.generator_params.num_loc}"
    name: "polynet-${env.name}${env.generator_params.num_loc}-${model.k}"
# PolyNet multi-solution training: k=100 and val_num_solutions, with the k value injected into the run name.
model:
  k: 100
  val_num_solutions: ${model.k}
  batch_size: 512
  val_batch_size: 1024
  test_batch_size: 1024
  train_data_size: 1_280_000
  val_data_size: 10_000
  test_data_size: 10_000
  optimizer_kwargs:
    lr: 1e-4
    weight_decay: 1e-6
  lr_scheduler:
    "MultiStepLR"
  lr_scheduler_kwargs:
    milestones: [80, 95]
    gamma: 0.1

trainer:
  max_epochs: 100

seed: 1234

Routing C

  • configs/experiment/routing/symnco.yaml:21 sets up SymNCO with num_augment=10 and num_starts=0, i.e., only symmetric augmentation is used, with no multi-starts.
# @package _global_

defaults:
  - override /model: symnco.yaml
  - override /env: tsp.yaml
  - override /callbacks: default.yaml
  - override /trainer: default.yaml
  - override /logger: wandb.yaml

env:
  generator_params:
    num_loc: 50

logger:
  wandb:
    project: "rl4co"
    tags: ["symnco", "${env.name}"]
    group: "${env.name}${env.generator_params.num_loc}"
    name: "symnco-${env.name}${env.generator_params.num_loc}"
# SymNCO: num_augment=10 with num_starts=0, i.e., only symmetric augmentation, no multi-starts.
model:
  batch_size: 512
  val_batch_size: 1024
  test_batch_size: 1024
  train_data_size: 1_280_000
  val_data_size: 10_000
  test_data_size: 10_000
  num_starts: 0 # 0 for no augmentation for multi-starts
  num_augment: 10
  optimizer_kwargs:
    lr: 1e-4
    weight_decay: 1e-6
  lr_scheduler:
    "MultiStepLR"
  lr_scheduler_kwargs:
    milestones: [80, 95]
    gamma: 0.1

trainer:
  max_epochs: 100

seed: 1234
  • configs/experiment/routing/deepaco.yaml:21 sets up DeepACO: a tiny dataset, train_with_local_search=True, fine-grained ant-colony policy_kwargs, and CosineAnnealingLR.
# @package _global_

defaults:
  - override /model: deepaco.yaml
  - override /env: tsp.yaml
  - override /callbacks: default.yaml
  - override /trainer: default.yaml
  - override /logger: wandb.yaml

env:
  generator_params:
    num_loc: 50

logger:
  wandb:
    project: "rl4co"
    tags: ["deepaco", "${env.name}"]
    group: ${env.name}${env.generator_params.num_loc}
    name: deepaco-${env.name}${env.generator_params.num_loc}
# DeepACO setup: tiny dataset, train_with_local_search=True, fine-grained ant-colony policy_kwargs, CosineAnnealingLR.
model:
  batch_size: 20
  val_batch_size: 20
  test_batch_size: 20
  train_data_size: 400
  val_data_size: 20
  test_data_size: 100
  optimizer: "AdamW"
  optimizer_kwargs:
    lr: 5e-4
    weight_decay: 0
  lr_scheduler:
    "CosineAnnealingLR"
  lr_scheduler_kwargs:
    T_max: 50
    eta_min: 1e-5
  metrics:
    test:
      - reward_000
      - reward_002
      - reward_009  # since n_iterations["test"] = 10
  train_with_local_search: True
  ls_reward_aug_W: 0.99

  policy_kwargs:
    n_ants:
      train: 30
      val: 30
      test: 100
    n_iterations:
      train: 1 # unused value
      val: 5
      test: 10
    temperature: 1.0
    top_p: 0.0
    top_k: 0
    start_node: null
    multistart: False
    k_sparse: 5  # this should be adjusted based on the `num_loc` value

    aco_kwargs:
      alpha: 1.0
      beta: 1.0
      decay: 0.95
      use_local_search: True
      use_nls: True
      n_perturbations: 5
      local_search_params:
        max_iterations: 1000
      perturbation_params:
        max_iterations: 20

trainer:
  max_epochs: 50
  gradient_clip_val: 3.0
  precision: "bf16-mixed"
  devices:
    - 0

seed: 1234
  • configs/experiment/routing/gfacs.yaml:21 extends DeepACO with GFACS, introducing annealed alpha/beta parameters and sharing the local-search configuration.
# @package _global_

defaults:
  - override /model: gfacs.yaml
  - override /env: tsp.yaml
  - override /callbacks: default.yaml
  - override /trainer: default.yaml
  - override /logger: wandb.yaml

env:
  generator_params:
    num_loc: 50

logger:
  wandb:
    project: "rl4co"
    tags: ["gfacs", "${env.name}"]
    group: ${env.name}${env.generator_params.num_loc}
    name: gfacs-${env.name}${env.generator_params.num_loc}
# GFACS extends DeepACO, introducing annealed alpha/beta parameters and sharing the local-search configuration.
model:
  batch_size: 20
  val_batch_size: 20
  test_batch_size: 20
  train_data_size: 400
  val_data_size: 20
  test_data_size: 100
  optimizer: "AdamW"
  optimizer_kwargs:
    lr: 5e-4
    weight_decay: 0
  lr_scheduler:
    "CosineAnnealingLR"
  lr_scheduler_kwargs:
    T_max: 50
    eta_min: 1e-5
  metrics:
    test:
      - reward_000
      - reward_002
      - reward_009  # since n_iterations["test"] = 10
  train_with_local_search: True
  alpha_min: 0.5
  alpha_max: 1.0
  alpha_flat_epochs: 5
  beta_min: 100
  beta_max: 500
  beta_flat_epochs: 5

  policy_kwargs:
    n_ants:
      train: 30
      val: 30
      test: 100
    n_iterations:
      train: 1 # unused value
      val: 5
      test: 10
    temperature: 1.0
    top_p: 0.0
    top_k: 0
    multistart: False
    k_sparse: 5  # this should be adjusted based on the `num_loc` value

    aco_kwargs:
      alpha: 1.0  # This alpha is different from the alpha in the model
      beta: 1.0  # This beta is different from the beta in the model
      decay: 0.95
      use_local_search: True
      use_nls: True
      n_perturbations: 5
      local_search_params:
        max_iterations: 1000
      perturbation_params:
        max_iterations: 20

trainer:
  max_epochs: 50
  gradient_clip_val: 3.0
  precision: "bf16-mixed"
  devices:
    - 0

seed: 1234
  • configs/experiment/routing/glop.yaml:22 targets large-scale CVRPMVC (num_loc=1000) with small batches, multiple samples (policy_kwargs.n_samples=20), and 50 training epochs.
# @package _global_

defaults:
  - override /model: glop.yaml
  - override /env: cvrpmvc.yaml
  - override /callbacks: default.yaml
  - override /trainer: default.yaml
  - override /logger: wandb.yaml

env:
  generator_params:
    num_loc: 1000

logger:
  wandb:
    project: "rl4co"
    tags: ["glop", "${env.name}"]
    group: ${env.name}${env.generator_params.num_loc}
    name: glop-${env.name}${env.generator_params.num_loc}

# Large-scale CVRPMVC (num_loc=1000): small batches, multiple samples (policy_kwargs.n_samples=20), 50 training epochs.
model:
  batch_size: 16
  val_batch_size: 128
  test_batch_size: 128
  train_data_size: 3200
  val_data_size: 1024
  test_data_size: 10_000
  optimizer_kwargs:
    lr: 1e-4
    weight_decay: 0
  lr_scheduler:
    "MultiStepLR"
  lr_scheduler_kwargs:
    milestones: [37, 45]
    gamma: 0.1
  policy_kwargs:
    n_samples: 20

trainer:
  max_epochs: 50
  precision: 32
  gradient_clip_val: 1

seed: 1234
  • configs/experiment/routing/tsp-stepwise-ppo.yaml:27 solves TSP with Stepwise PPO plus an L2D-style decoder, exposing embed_dim/num_heads explicitly for easy tuning.
# @package _global_

defaults:
  - override /model: l2d.yaml
  - override /callbacks: default.yaml
  - override /trainer: default.yaml
  - override /logger: wandb.yaml

env:
  _target_: rl4co.envs.TSPEnv4PPO
  generator_params:
    num_loc: 20

logger:
  wandb:
    project: "rl4co"
    tags: ["am-stepwise-ppo", "${env.name}"]
    group: ${env.name}${env.generator_params.num_loc}
    name: ppo-${env.name}${env.generator_params.num_loc}

trainer:
  max_epochs: 10
  precision: 32-true

embed_dim: 256
num_heads: 8
# Stepwise PPO with an L2D-style decoder for TSP; embed_dim/num_heads are exposed explicitly for easy tuning.
model:
  _target_: rl4co.models.StepwisePPO
  policy:
    _target_: rl4co.models.L2DPolicy4PPO
    decoder:
      _target_: rl4co.models.zoo.l2d.decoder.L2DDecoder
      env_name: ${env.name}
      embed_dim: ${embed_dim}
      feature_extractor:
        _target_: rl4co.models.zoo.am.encoder.AttentionModelEncoder
        embed_dim: ${embed_dim}
        num_heads: ${num_heads}
        num_layers: 4
        normalization: "batch"
        env_name: "tsp"
      actor:
        _target_: rl4co.models.zoo.l2d.decoder.AttnActor
        embed_dim: ${embed_dim}
        num_heads: ${num_heads}
        env_name: ${env.name}
    embed_dim: ${embed_dim}
    env_name: ${env.name}
    het_emb: False
  batch_size: 512
  mini_batch_size: 512
  train_data_size: 20000
  val_data_size: 1_000
  test_data_size: 1_000
  reward_scale: scale
  optimizer_kwargs:
    lr: 1e-4

Scheduling A

  • configs/experiment/scheduling/base.yaml:3 defines the WandB naming, 32-bit precision, and scaling_factor (taken from the env's max processing time) shared by all scheduling experiments.
# @package _global_
# Defines the WandB naming, 32-bit precision, and scaling_factor (from the env's max processing time) shared by all scheduling experiments.
defaults:
  - override /model: l2d.yaml
  - override /callbacks: default.yaml
  - override /trainer: default.yaml
  - override /logger: wandb.yaml

logger:
  wandb:
    project: "rl4co"
    log_model: "all"
    group: "${env.name}-${env.generator_params.num_jobs}-${env.generator_params.num_machines}"
    tags: ???
    name: ???

trainer:
  max_epochs: 10
  # NOTE for some reason l2d is extremely sensitive to precision
  # ONLY USE 32-true for l2d!
  precision: 32-true

seed: 12345678

scaling_factor: ${env.generator_params.max_processing_time}

model:
  _target_: ???
  batch_size: ???
  train_data_size: 2_000
  val_data_size: 1_000
  test_data_size: 100
  optimizer_kwargs:
    lr: 2e-4
    weight_decay: 1e-6
  lr_scheduler: "ExponentialLR"
  lr_scheduler_kwargs:
    gamma: 0.95
  reward_scale: scale
  max_grad_norm: 1
  • configs/experiment/scheduling/am-pomo.yaml:11 uses L2DAttnPolicy + POMO, keeping num_starts=10 multi-start evaluation and reusing the reward scaling from base.
  • configs/experiment/scheduling/am-ppo.yaml:14 builds a Stepwise PPO + L2D decoder stack, specifying a MatNet encoder and het_emb=True, and has the environment emit stepwise rewards.
# @package _global_

defaults:
  - scheduling/base

logger:
  wandb:
    tags: ["am-ppo", "${env.name}"]
    name: "am-ppo-${env.name}-${env.generator_params.num_jobs}j-${env.generator_params.num_machines}m"
embed_dim: 256
num_heads: 8
# Builds a Stepwise PPO + L2D decoder stack with a MatNet encoder and het_emb=True; the environment emits stepwise rewards.
model:
  _target_: rl4co.models.StepwisePPO
  policy:
    _target_: rl4co.models.L2DPolicy4PPO
    decoder:
      _target_: rl4co.models.zoo.l2d.decoder.L2DDecoder
      env_name: ${env.name}
      embed_dim: ${embed_dim}
      feature_extractor:
        _target_: rl4co.models.zoo.matnet.matnet_w_sa.Encoder
        embed_dim: ${embed_dim}
        num_heads: ${num_heads}
        num_layers: 4
        normalization: "batch"
        init_embedding:
          _target_: rl4co.models.nn.env_embeddings.init.FJSPMatNetInitEmbedding
          embed_dim: ${embed_dim}
          scaling_factor: ${scaling_factor}
      actor:
        _target_: rl4co.models.zoo.l2d.decoder.L2DAttnActor
        embed_dim: ${embed_dim}
        num_heads: ${num_heads}
        env_name: ${env.name}
        scaling_factor:  ${scaling_factor}
        stepwise: True
    env_name: ${env.name}
    embed_dim: ${embed_dim}
    scaling_factor: ${scaling_factor}
    het_emb: True
  batch_size: 128
  val_batch_size: 512
  test_batch_size: 64
  train_data_size: 2000
  mini_batch_size: 512

env:
  stepwise_reward: True
  • configs/experiment/scheduling/matnet-pomo.yaml:13 combines POMO with a MatNet encoder, highlighting het_emb=True and the FJSPMatNetInitEmbedding initialization.
# @package _global_

defaults:
  - scheduling/base

logger:
  wandb:
    tags: ["matnet-pomo", "${env.name}"]
    name: "matnet-pomo-${env.name}-${env.generator_params.num_jobs}j-${env.generator_params.num_machines}m"

embed_dim: 256
# Combines POMO with a MatNet encoder: het_emb=True and FJSPMatNetInitEmbedding initialization.
model:
  _target_: rl4co.models.POMO
  policy:
    _target_: rl4co.models.L2DPolicy
    encoder:
      _target_: rl4co.models.zoo.matnet.matnet_w_sa.Encoder
      embed_dim: ${embed_dim}
      num_heads: 8
      num_layers: 4
      normalization: "batch"
      init_embedding:
        _target_: rl4co.models.nn.env_embeddings.init.FJSPMatNetInitEmbedding
        embed_dim: ${embed_dim}
        scaling_factor: ${scaling_factor}
    env_name: ${env.name}
    embed_dim: ${embed_dim}
    stepwise_encoding: False
    het_emb: True
    scaling_factor: ${scaling_factor}
  batch_size: 64
  num_starts: 10
  num_augment: 0
  baseline: "shared"
  metrics:
    val: ["reward", "max_reward"]
    test: ${model.metrics.val}
  • configs/experiment/scheduling/matnet-ppo.yaml:13 is also based on Stepwise PPO, but uses a MatNet encoder and keeps the stepwise reward.
# @package _global_

defaults:
  - scheduling/base

logger:
  wandb:
    tags: ["matnet-ppo", "${env.name}"]
    name: "matnet-ppo-${env.name}-${env.generator_params.num_jobs}j-${env.generator_params.num_machines}m"

embed_dim: 256
# Also based on Stepwise PPO, but with a MatNet encoder and stepwise rewards.
model:
  _target_: rl4co.models.StepwisePPO
  policy:
    _target_: rl4co.models.L2DPolicy4PPO
    decoder:
      _target_: rl4co.models.zoo.l2d.decoder.L2DDecoder
      env_name: ${env.name}
      embed_dim: ${embed_dim}
      het_emb: True
      feature_extractor:
        _target_: rl4co.models.zoo.matnet.matnet_w_sa.Encoder
        embed_dim: ${embed_dim}
        num_heads: 8
        num_layers: 4
        normalization: "batch"
        init_embedding:
          _target_: rl4co.models.nn.env_embeddings.init.FJSPMatNetInitEmbedding
          embed_dim: ${embed_dim}
          scaling_factor: ${scaling_factor}
    env_name: ${env.name}
    embed_dim: ${embed_dim}
    scaling_factor: ${scaling_factor}
    het_emb: True
  batch_size: 128
  val_batch_size: 512
  test_batch_size: 64
  mini_batch_size: 512

env:
  stepwise_reward: True

Scheduling B

  • configs/experiment/scheduling/ffsp-matnet.yaml:3 applies MatNet to the flexible flow shop problem (FFSP): it overrides env: ffsp, increases train_data_size to 10_000, and sets max_epochs=50.
# @package _global_
# Applies MatNet to the flexible flow shop problem (FFSP): overrides env: ffsp, train_data_size=10_000, max_epochs=50.
defaults:
  - override /model: matnet.yaml
  - override /callbacks: default.yaml
  - override /trainer: default.yaml
  - override /logger: wandb.yaml
  - override /env: ffsp.yaml

logger:
  wandb:
    project: "rl4co"
    log_model: "all"
    group: "${env.name}-${env.generator_params.num_job}-${env.generator_params.num_machine}"
    tags: ["matnet", "${env.name}"]
    name: "matnet-${env.name}-${env.generator_params.num_job}j-${env.generator_params.num_machine}m"

env:
  generator_params:
    num_stage: 3
    num_machine: 4
    num_job: 20
    flatten_stages: False

trainer:
  max_epochs: 50
  # NOTE for some reason l2d is extremely sensitive to precision
  # ONLY USE 32-true for l2d!
  precision: 32-true
  gradient_clip_val: 10 # orig paper does not use grad clipping

seed: 12345678

model:
  batch_size: 50
  train_data_size: 10_000
  val_data_size: 1_000
  test_data_size: 1_000
  optimizer_kwargs:
    lr: 1e-4
    weight_decay: 1e-6
  lr_scheduler:
    "MultiStepLR"
  lr_scheduler_kwargs:
    milestones: [35, 45]
    gamma: 0.1
  • configs/experiment/scheduling/gnn-ppo.yaml:12 configures L2DPPOModel with a lighter GNN encoder (num_encoder_layers=3), keeping the fast 10-epoch iteration.
# @package _global_

defaults:
  - scheduling/base

logger:
  wandb:
    tags: ["gnn-ppo", "${env.name}"]
    name: "gnn-ppo-${env.name}-${env.generator_params.num_jobs}j-${env.generator_params.num_machines}m"

# params from Song et al.
# L2DPPOModel with a lighter GNN encoder (num_encoder_layers=3), keeping the fast 10-epoch iteration.
model:
  _target_: rl4co.models.L2DPPOModel
  policy_kwargs:
    embed_dim: 256
    num_encoder_layers: 3
    scaling_factor: ${scaling_factor}
    ppo_epochs: 2
    het_emb: False
    normalization: instance
    test_decode_type: greedy
  batch_size: 128
  val_batch_size: 512
  test_batch_size: 64
  mini_batch_size: 512


trainer:
  max_epochs: 10


env:
  stepwise_reward: True
  • configs/experiment/scheduling/hgnn-pomo.yaml:11 is the heterogeneous-GNN version of POMO, enabling het_emb=True and num_starts=10 for richer feature extraction.
# @package _global_

defaults:
  - scheduling/base

logger:
  wandb:
    tags: ["hgnn-pomo", "${env.name}"]
    name: "hgnn-pomo-${env.name}-${env.generator_params.num_jobs}j-${env.generator_params.num_machines}m"
# Heterogeneous-GNN version of POMO: het_emb=True and num_starts=10 for richer feature extraction.
model:
  _target_: rl4co.models.POMO
  policy:
    _target_: rl4co.models.L2DPolicy
    env_name: ${env.name}
    embed_dim: 256
    num_encoder_layers: 3
    stepwise_encoding: False
    scaling_factor: ${scaling_factor}
    het_emb: True
    normalization: instance
  num_starts: 10
  batch_size: 64
  num_augment: 0
  baseline: "shared"
  metrics:
    val: ["reward", "max_reward"]
    test: ${model.metrics.val}
  • configs/experiment/scheduling/hgnn-ppo.yaml:12 combines the heterogeneous GNN with PPO, keeping the stepwise reward and reusing the scaling from base.
# @package _global_

defaults:
  - scheduling/base

logger:
  wandb:
    tags: ["hgnn-ppo", "${env.name}"]
    name: "hgnn-ppo-${env.name}-${env.generator_params.num_jobs}j-${env.generator_params.num_machines}m"

# params from Song et al.
# Heterogeneous GNN + PPO, keeping the stepwise reward and reusing the scaling from base.
model:
  _target_: rl4co.models.L2DPPOModel
  policy_kwargs:
    embed_dim: 256
    num_encoder_layers: 3
    scaling_factor: ${scaling_factor}
    ppo_epochs: 2
    het_emb: True
    normalization: instance
  batch_size: 128
  val_batch_size: 512
  test_batch_size: 64
  mini_batch_size: 512

env:
  stepwise_reward: True

Summary

These notes covered the experiment configs under configs/experiment (base, EDA, graph, routing, and scheduling), how to launch them through the Hydra entry script, and the Python entry points (model, policy, environment) to modify when changing the underlying method.
