NEURAL COMBINATORIAL OPTIMIZATION WITH REINFORCEMENT LEARNING

Irwan Bello∗, Hieu Pham∗, Quoc V. Le, Mohammad Norouzi, Samy Bengio
Google Brain
{ibello,hyhieu,qvl,mnorouzi,bengio}@google.com


Abstract

This paper presents a framework to tackle combinatorial optimization problems using neural networks and reinforcement learning. We focus on the traveling salesman problem (TSP) and train a recurrent neural network that, given a set of city coordinates, predicts a distribution over different city permutations. Using negative tour length as the reward signal, we optimize the parameters of the recurrent neural network using a policy gradient method. We compare learning the network parameters on a set of training graphs against learning them on individual test graphs. Despite the computational expense, without much engineering and heuristic designing, Neural Combinatorial Optimization achieves close to optimal results on 2D Euclidean graphs with up to 100 nodes. Applied to the KnapSack, another NP-hard problem, the same method obtains optimal solutions for instances with up to 200 items.


 

Introduction

Combinatorial optimization is a fundamental problem in computer science. A canonical example is the traveling salesman problem (TSP), where, given a graph, one needs to search the space of permutations to find an optimal sequence of nodes with minimal total edge weights (tour length). The TSP and its variants have myriad applications in planning, manufacturing, genetics, etc. (see (Applegate et al., 2011) for an overview).
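To make the objective concrete, the tour length for a given permutation of 2D city coordinates is simply the sum of Euclidean edge lengths along the closed tour, and its negative serves as the reward signal used throughout the paper. The snippet below is a minimal NumPy sketch of this computation; the function and variable names are ours, not the paper's.

```python
import numpy as np

def tour_length(coords: np.ndarray, perm: np.ndarray) -> float:
    """Total Euclidean length of the closed tour visiting `coords` in the order `perm`."""
    ordered = coords[perm]                  # (n, 2) cities in visiting order
    shifted = np.roll(ordered, -1, axis=0)  # next city for each position (wraps back to the start)
    return float(np.linalg.norm(ordered - shifted, axis=1).sum())

# Example: 5 random cities in the unit square; the negative length is the RL reward.
coords = np.random.rand(5, 2)
perm = np.random.permutation(5)
reward = -tour_length(coords, perm)
```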
Finding the optimal TSP solution is NP-hard, even in the two-dimensional Euclidean case (Papadimitriou, 1977), where the nodes are 2D points and edge weights are Euclidean distances between pairs of points. In practice, TSP solvers rely on handcrafted heuristics that guide their search procedures to find competitive tours efficiently. Even though these heuristics work well on TSP, once the problem statement changes slightly, they need to be revised. In contrast, machine learning methods have the potential to be applicable across many optimization tasks by automatically discovering their own heuristics based on the training data, thus requiring less hand-engineering than solvers that are optimized for one task only.
While most successful machine learning techniques fall into the family of supervised learning, where a mapping from training inputs to outputs is learned, supervised learning is not applicable to most combinatorial optimization problems because one does not have access to optimal labels. However, one can compare the quality of a set of solutions using a verifier, and provide some reward feedback to a learning algorithm. Hence, we follow the reinforcement learning (RL) paradigm to tackle combinatorial optimization. We empirically demonstrate that, even when using optimal solutions as labeled data to optimize a supervised mapping, the generalization is rather poor compared to an RL agent that explores different tours and observes their corresponding rewards.
We propose Neural Combinatorial Optimization, a framework to tackle combinatorial optimization problems using reinforcement learning and neural networks. We consider two approaches based on policy gradients (Williams, 1992). The first approach, called RL pretraining, uses a training set to optimize a recurrent neural network (RNN) that parameterizes a stochastic policy over solutions, using the expected reward as objective. At test time, the policy is fixed, and one performs inference by greedy decoding or sampling. The second approach, called active search, involves no pretraining.
It starts from a random policy and iteratively optimizes the RNN parameters on a single test instance, again using the expected reward objective, while keeping track of the best solution sampled during the search. We find that combining RL pretraining and active search works best in practice.
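The sketch below illustrates both ideas on a single TSP instance: a REINFORCE-style gradient step (Williams, 1992) on the expected negative tour length, wrapped in an active-search loop that keeps the best sampled tour. It is a simplified PyTorch illustration, not the authors' implementation: the toy `SimplePolicy` stands in for the paper's pointer-network RNN, and the exponential moving-average baseline is an assumption rather than the paper's exact baseline.

```python
import torch
import torch.nn as nn

def tour_length(coords, perm):
    """Closed-tour Euclidean length (coords: (n, 2) tensor, perm: (n,) LongTensor)."""
    ordered = coords[perm]
    return (ordered - ordered.roll(-1, dims=0)).norm(dim=1).sum()

class SimplePolicy(nn.Module):
    """Toy stochastic policy over city permutations (stand-in for the paper's pointer-network RNN).
    It scores each unvisited city against the current city's coordinates with a small MLP."""
    def __init__(self, hidden=64):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(4, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def sample(self, coords):
        n = coords.size(0)
        visited = torch.zeros(n, dtype=torch.bool)
        current = torch.randint(n, (1,)).item()
        visited[current] = True
        perm, log_prob = [current], 0.0
        for _ in range(n - 1):
            pairs = torch.cat([coords[current].expand(n, 2), coords], dim=1)  # (n, 4) features
            logits = self.score(pairs).squeeze(-1).masked_fill(visited, float('-inf'))
            dist = torch.distributions.Categorical(logits=logits)
            nxt = dist.sample()
            log_prob = log_prob + dist.log_prob(nxt)  # accumulate log p_theta(pi | s)
            visited[nxt] = True
            current = nxt.item()
            perm.append(current)
        return torch.tensor(perm), log_prob

def active_search(coords, steps=200, lr=1e-3):
    """Active search: optimize the policy on one test instance, tracking the best sampled tour."""
    policy = SimplePolicy()
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    baseline, best_len, best_perm = None, float('inf'), None
    for _ in range(steps):
        perm, log_prob = policy.sample(coords)
        length = tour_length(coords, perm)
        if length.item() < best_len:
            best_len, best_perm = length.item(), perm
        # Moving-average baseline reduces the variance of the gradient estimate (our assumption).
        baseline = length.item() if baseline is None else 0.9 * baseline + 0.1 * length.item()
        # REINFORCE: descend on (L(pi|s) - b(s)) * log p_theta(pi|s), i.e. reward = -tour length.
        loss = (length.detach() - baseline) * log_prob
        opt.zero_grad()
        loss.backward()
        opt.step()
    return best_perm, best_len

coords = torch.rand(20, 2)  # one random 2D Euclidean TSP instance
perm, length = active_search(coords)
```

RL pretraining differs only in where the updates happen: the same gradient step is applied over a training set of instances, and at test time the learned policy is frozen and decoded greedily or by sampling.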
On 2D Euclidean graphs with up to 100 nodes, Neural Combinatorial Optimization significantly outperforms the supervised learning approach to the TSP (Vinyals et al., 2015b) and obtains close to optimal results when allowed more computation time. We illustrate its flexibility by testing the same method on the KnapSack problem, for which we get optimal results for instances with up to 200 items. These results give insights into how neural networks can be used as a general tool for tackling combinatorial optimization problems, especially those that are difficult to design heuristics for.
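For the KnapSack case, the reward is the total value of the items collected while respecting the weight capacity. The function below is a minimal sketch of one way to turn a sampled item ordering into such a reward; the paper decodes with capacity masking, and this greedy-fill variant is our simplification.

```python
def knapsack_reward(values, weights, capacity, order):
    """Total value of items taken greedily in the sampled `order` without exceeding `capacity`.
    Simplified stand-in for capacity-masked decoding: infeasible items are skipped."""
    total_value, total_weight = 0.0, 0.0
    for i in order:
        if total_weight + weights[i] <= capacity:
            total_value += values[i]
            total_weight += weights[i]
    return total_value

# Example: reward for one sampled ordering of 4 items (prints 21.0).
print(knapsack_reward([10, 7, 4, 9], [3, 2, 1, 4], capacity=6, order=[0, 3, 1, 2]))
```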

 
