RL-CO SOLO: Search Online, Learn Offline for Combinatorial Optimization Problems

Article link: https://blog.csdn.net/qq_38480311/article/details/126550891

Oren, Joel, Chana Ross, Maksym Lefarov, Felix Richter, Ayal Taitler, Zohar Feldman, Christian Daniel, and Dotan Di Castro. "SOLO: Search Online, Learn Offline for Combinatorial Optimization Problems." arXiv, May 18, 2021. http://arxiv.org/abs/2104.01646.

The paper uses reinforcement learning (RL) to solve combinatorial optimization (CO) problems and presents two problem instances: Capacitated Vehicle Routing and Machine Scheduling.

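Why use RL to solve CO:

Drawbacks of traditional heuristic algorithms: 1. they are highly specialized; 2. they typically aim for worst-case scenarios.

RL is a more general-purpose way to solve combinatorial optimization problems.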

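Modeling combinatorial optimization problems:

A combinatorial optimization problem is represented by a triple $\langle \mathcal{I}, S, f \rangle$,

where $\mathcal{I}$ is the set of problem instances, $S$ maps an instance $I \in \mathcal{I}$ to its set of feasible solutions $S(I)$, and $f$ is the objective function mapping solutions in $S(I)$ to real values.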
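To make the triple concrete, here is a minimal Python sketch. It is an illustration only: the class name `COProblem`, the toy tour problem, and the distance matrix are all assumptions made up for this example, and the objective is written as a negative tour length so that larger values are better, matching the maximization view used in these notes.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Callable, Iterable, Tuple

Instance = Tuple[Tuple[float, ...], ...]  # hypothetical instance: a distance matrix
Solution = Tuple[int, ...]                # hypothetical solution: a visiting order


@dataclass(frozen=True)
class COProblem:
    """The S and f parts of the triple; instances are passed in by the caller."""
    feasible: Callable[[Instance], Iterable[Solution]]  # S: instance -> feasible solutions
    objective: Callable[[Instance, Solution], float]    # f: solution -> real value


def tours(dist: Instance) -> Iterable[Solution]:
    # All closed tours that start at node 0.
    n = len(dist)
    return ((0,) + p for p in permutations(range(1, n)))


def neg_tour_length(dist: Instance, tour: Solution) -> float:
    # Negative length of the closed tour, so that "larger is better".
    n = len(tour)
    return -sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))


def brute_force(problem: COProblem, instance: Instance) -> Solution:
    # Exhaustive search over S(I); only viable for tiny instances.
    return max(problem.feasible(instance),
               key=lambda s: problem.objective(instance, s))


toy_problem = COProblem(feasible=tours, objective=neg_tour_length)
dist = ((0, 1, 4, 3), (1, 0, 2, 5), (4, 2, 0, 1), (3, 5, 1, 0))
best = brute_force(toy_problem, dist)
```

Exhaustive enumeration of $S(I)$ grows factorially with instance size, which is why heuristics or learned policies are used in practice.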

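A Markov decision process (MDP) models the sequential solution process, with decisions split into $T$ steps.

At each step $t$, the state $s_t$ corresponds to a partial solution, the action $a_t$ corresponds to an extension of that state, and the reward is defined as

$r_{t+1} = f(s_{t+1}) - f(s_t)$ .

The state transition is $p(s'|s,a)$ .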

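My own reading of the reward definition: $f(s_{t+1})-f(s_{t})$ is the difference between the objective value at step $t+1$ and the objective value at step $t$, so it measures how much better the partial solution at $t+1$ is than the one at $t$. Maximizing the cumulative reward therefore amounts to continually improving the objective, so that its value at the final step $T$ exceeds its value at $t=0$ by as much as possible.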
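The telescoping argument behind this reading can be checked numerically. The sketch below assumes an undiscounted sum of rewards and uses a made-up objective (negative total completion time of jobs on a single machine); the function names are hypothetical and not taken from the paper.

```python
from itertools import accumulate


def f(partial):
    # Hypothetical objective: negative sum of completion times of the jobs
    # scheduled so far, where `partial` lists processing times in order.
    return -sum(accumulate(partial))


def rollout_rewards(actions):
    # Extend the partial solution one action at a time and record
    # r_{t+1} = f(s_{t+1}) - f(s_t) at every step.
    state, rewards = [], []
    for a in actions:
        nxt = state + [a]
        rewards.append(f(nxt) - f(state))
        state = nxt
    return rewards, state


rewards, final_state = rollout_rewards([3, 1, 2])
total = sum(rewards)  # intermediate terms cancel: f(s_T) - f(s_0)
```

Because every intermediate $f(s_t)$ appears once with a plus sign and once with a minus sign, the sum of rewards equals $f(s_T) - f(s_0)$. With a discount factor below 1 the cancellation would no longer be exact.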

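Let $\rho =(\langle s_t,a_t, r_{t+1}\rangle)_{t=0,\dots,T-1}$ denote a trajectory, whose distribution is induced by the policy $\pi$ and the transition model. The Q-function, in its standard finite-horizon undiscounted form, is defined as:

$Q^{\pi}(s,a) = \mathbb{E}_{\rho \sim \pi}\left[\sum_{t'=t}^{T-1} r_{t'+1} \,\middle|\, s_t = s,\ a_t = a\right]$

The goal is to find a policy that maximizes the expected cumulative reward:

$\pi^{*} = \arg\max_{\pi} \mathbb{E}_{\rho \sim \pi}\left[\sum_{t=0}^{T-1} r_{t+1}\right]$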
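As an illustration of how a Q-function yields a policy, the sketch below computes the optimal Q-values of a tiny, fully deterministic scheduling MDP by exhaustive recursion and then acts greedily on them. The job list, the objective, and all function names are assumptions for this example; with deterministic transitions the expectation in the Q-function reduces to a plain sum.

```python
from functools import lru_cache
from itertools import accumulate

JOBS = (3, 1, 2)  # hypothetical processing times, indexed by job id


def f(order):
    # Objective of a partial schedule: negative sum of completion times.
    return -sum(accumulate(JOBS[j] for j in order))


def actions(state):
    # Feasible extensions: jobs that are not scheduled yet.
    return [j for j in range(len(JOBS)) if j not in state]


@lru_cache(maxsize=None)
def q_star(state, action):
    # Q*(s, a) = r + max_a' Q*(s', a'), with r = f(s') - f(s).
    nxt = state + (action,)
    reward = f(nxt) - f(state)
    future = max((q_star(nxt, a) for a in actions(nxt)), default=0.0)
    return reward + future


def greedy_rollout():
    # Follow pi(s) = argmax_a Q*(s, a) until the schedule is complete.
    state = ()
    while actions(state):
        state += (max(actions(state), key=lambda a: q_star(state, a)),)
    return state


order = greedy_rollout()
```

On this toy instance the greedy policy recovers the shortest-processing-time-first order. Exact recursion like this is only feasible for very small problems, which is what motivates approximating Q with a learned model.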

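Application examples: the Parallel Machine Scheduling Problem (PMSP) and the Capacitated Vehicle Routing Problem (CVRP).

Online variants: both problems have online variations, in which some parts of the problem are not known in advance.

For example, in CVRP, customers arrive at different times while the vehicle is already travelling along its current route.

In PMSP, new jobs arrive while the machines are still processing previously arrived jobs.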
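The online setting can be sketched with a small simulation in which the scheduler only sees jobs that have already arrived. This is an assumption-laden illustration: the shortest-job-first dispatching rule, the `(arrival_time, duration)` job format, and the function name `online_schedule` are stand-ins chosen for this example, not the method from the paper.

```python
import heapq


def online_schedule(jobs, n_machines):
    """Dispatch jobs online; jobs is a list of (arrival_time, duration).

    Returns a dict mapping job index to its start time.
    """
    pending = sorted(range(len(jobs)), key=lambda j: jobs[j][0])  # by arrival
    free_at = [0.0] * n_machines  # min-heap of times at which machines free up
    heapq.heapify(free_at)
    queue = []   # arrived but unscheduled jobs, as (duration, index)
    start = {}
    i = 0        # next job in `pending` that has not arrived yet
    while len(start) < len(jobs):
        now = heapq.heappop(free_at)
        # Nothing to run yet: idle until the next job arrives.
        if not queue and i < len(pending) and jobs[pending[i]][0] > now:
            now = jobs[pending[i]][0]
        # Reveal every job that has arrived by `now`.
        while i < len(pending) and jobs[pending[i]][0] <= now:
            j = pending[i]
            heapq.heappush(queue, (jobs[j][1], j))
            i += 1
        # Baseline rule: run the shortest job among those already visible.
        duration, j = heapq.heappop(queue)
        start[j] = now
        heapq.heappush(free_at, now + duration)
    return start


jobs = [(0, 4), (0, 2), (1, 1), (5, 3)]
start = online_schedule(jobs, n_machines=1)
```

The key point is that each decision is made with only the jobs revealed so far, so a choice that looks good now can turn out poorly once later arrivals are known.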

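Solution Approach:

The approach has two main parts: offline training and online search.

Learn offline: uses off-policy DQN.

The DQN network approximates the Q-function. Because the state and action spaces may vary over the course of an episode, a GNN is used to approximate Q instead of feed-forward or convolutional neural network architectures.

States and actions are represented as graphs.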
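A minimal sketch of one off-policy Q-learning update with a variable number of actions per state is shown below. It makes several assumptions that are not from the paper: a linear scorer stands in for the GNN, states and actions are plain feature lists, and the hyperparameters (`gamma`, `lr`, buffer size) are arbitrary. Scoring each (state, action) pair separately is one simple way to cope with action sets that change from step to step.

```python
import random
from collections import deque


def q_value(weights, state, action):
    # Hypothetical linear stand-in for the GNN: scores one (state, action) pair.
    return sum(w * x for w, x in zip(weights, state + action))


def td_target(weights, reward, next_state, next_actions, gamma=1.0):
    # One-step target r + gamma * max_a' Q(s', a'); no bootstrap at terminal states.
    if not next_actions:
        return reward
    return reward + gamma * max(q_value(weights, next_state, a) for a in next_actions)


def dqn_step(weights, target_weights, batch, lr=0.1):
    # One gradient step on the mean of 0.5 * (Q(s, a) - y)^2 over the batch.
    new = list(weights)
    for state, action, reward, next_state, next_actions in batch:
        y = td_target(target_weights, reward, next_state, next_actions)
        err = q_value(weights, state, action) - y
        for i, x in enumerate(state + action):
            new[i] -= lr * err * x / len(batch)
    return new


# Replay buffer: stored transitions may come from any behaviour policy,
# which is what makes the update off-policy.
buffer = deque(maxlen=1000)
buffer.append(([1.0], [2.0], 1.0, [0.0], []))                # terminal transition
buffer.append(([1.0], [0.5], -1.0, [1.0], [[0.0], [3.0]]))   # two next actions
weights = [0.0, 0.0]
target_weights = list(weights)   # frozen copy used only for targets
batch = random.sample(list(buffer), 2)
weights = dqn_step(weights, target_weights, batch)
```

In a full training loop the target weights would be refreshed from the online weights periodically, and the linear scorer would be replaced by the graph network.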

The GNN consists of three components:

(1) embedding model: a vanilla feed forward model with leaky ReLU activation function. This component converts the features of each node to a higher dimension.

(2) encoder model.

(3) decoder model.
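The embedding component (1) can be sketched in plain Python as a feed-forward network applied to every node independently. The layer sizes, weights, and the 0.01 negative slope are assumptions for illustration; the notes do not specify them.

```python
def leaky_relu(x, slope=0.01):
    # Identity for positive inputs, small linear slope otherwise.
    return x if x > 0 else slope * x


def dense(vec, weights, bias):
    # Affine map: one output per weight row.
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def embed_nodes(node_feats, layers):
    # Apply the same feed-forward layers to each node's feature vector.
    embedded = []
    for x in node_feats:
        h = x
        for weights, bias in layers:
            h = [leaky_relu(z) for z in dense(h, weights, bias)]
        embedded.append(h)
    return embedded


# Two nodes with 2 raw features each, lifted to 4 dimensions (made-up weights).
layers = [([[1, 0], [0, 1], [1, 1], [1, -1]], [0, 0, 0, 0])]
emb = embed_nodes([[1.0, -2.0], [0.5, 0.5]], layers)
```

Since the weights are shared across nodes, the same model handles graphs with any number of nodes.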

The encoder and decoder models both use a variant of the message-passing GNN architecture.
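For intuition, one generic message-passing step can be written as follows. This is a textbook-style layer (sum the transformed features of incoming neighbours, add a transformed copy of the node's own features, apply leaky ReLU), not the specific variant used in the paper; the weight matrices and the toy graph are made up.

```python
def matvec(matrix, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x


def message_passing_layer(h, edges, w_self, w_msg):
    """h: list of node feature vectors; edges: list of directed (src, dst) pairs."""
    out_dim = len(w_self)
    agg = [[0.0] * out_dim for _ in h]
    # Each edge sends a transformed copy of the source features to the target.
    for src, dst in edges:
        msg = matvec(w_msg, h[src])
        agg[dst] = [a + m for a, m in zip(agg[dst], msg)]
    # Combine the node's own transformed features with the summed messages.
    return [
        [leaky_relu(s + a) for s, a in zip(matvec(w_self, hv), av)]
        for hv, av in zip(h, agg)
    ]


h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edges = [(0, 1), (1, 0), (2, 1)]
w_self = [[1.0, 0.0], [0.0, 1.0]]
w_msg = [[0.5, 0.0], [0.0, 0.5]]
h_next = message_passing_layer(h, edges, w_self, w_msg)
```

Stacking several such layers lets information travel across multiple hops, and nothing in the layer depends on the number of nodes or edges, which fits the varying state and action sizes mentioned above.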