[Paper Reading] Learning 2-opt Heuristics for the Traveling Salesman Problem via Deep Reinforcement Learning


1. Abstract

  • Construction heuristics: such approaches find TSP solutions of good quality but require additional procedures such as beam search and sampling to improve solutions and achieve state-of-the-art performance.
  • Improvement heuristics, where a given solution is iteratively improved until reaching a near-optimal one; this is the approach taken in this work.
  • Our results show that the learned policies can improve even over random initial solutions and approach near-optimal solutions at a faster rate than previous state-of-the-art deep learning methods.
  • We propose a policy gradient algorithm to learn a stochastic policy. Moreover, we introduce a policy neural network that leverages a pointing attention mechanism that departs from previous works.

2. Conclusion

  • We proposed a neural architecture with graph and sequence embeddings
  • One drawback of our policy gradient method is the large number of samples required to train a good policy

3. Framework Or Graph

 DRL + Improvement Heuristic + Encoder - Decoder

 A2C

 RL Formulation

 Attention  K, Q, V

The network contains GCN layers and a pointing mechanism.


4. Reinforcement Learning Formulation


5. Policy Gradient Neural Architecture

5.1 encoder

Embedding Layer

x_i \in [0,1]^2

The 2-D coordinates are projected to d dimensions and edge information is added; the node feature matrix is then weighted.
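A minimal sketch of the embedding step, assuming a plain linear projection (the weight names `W_x`, `b_x` and the dimensions are illustrative, not the paper's exact parameterization):

```python
import numpy as np

# Sketch: project 2-D city coordinates x_i in [0,1]^2 to d-dim features.
rng = np.random.default_rng(0)
n, d = 5, 8                       # 5 cities, embedding dimension d
X = rng.random((n, 2))            # city coordinates in [0,1]^2
W_x = rng.standard_normal((2, d)) # learned projection (random here)
b_x = np.zeros(d)
H = X @ W_x + b_x                 # node feature matrix, shape (n, d)
print(H.shape)                    # (5, 8)
```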

Graph Convolutional Layers
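A hedged sketch of one generic graph-convolution step over the (fully connected) TSP graph: aggregate neighbor features, transform them, and add a residual connection. All names are illustrative; the paper's exact update additionally incorporates edge features.

```python
import numpy as np

def gcn_layer(Z, A, W):
    """One graph-convolution step (sketch): mean-aggregate neighbor
    features through adjacency A, transform with W, ReLU, residual."""
    deg = np.maximum(A.sum(axis=1, keepdims=True), 1)  # node degrees
    agg = (A @ Z) / deg                # mean over neighbors
    return Z + np.maximum(agg @ W, 0)  # residual + ReLU

rng = np.random.default_rng(0)
n, d = 5, 8
Z = rng.standard_normal((n, d))        # node embeddings from the embedding layer
A = np.ones((n, n)) - np.eye(n)        # fully connected TSP graph, no self-loops
W = rng.standard_normal((d, d))
Z_next = gcn_layer(Z, A, W)            # refined embeddings, shape (n, d)
```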

Sequence Embedding Layers

Next, we use node embeddings z_i to learn a sequence representation of the input and encode a tour

LSTM
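The sequence-embedding idea can be sketched as reading the node embeddings z_i in tour order with a recurrent cell. A simple tanh RNN stands in for the paper's LSTM here; all names are illustrative:

```python
import numpy as np

def rnn_tour_encoding(Z, tour, W_h, W_z):
    """Sketch: encode a tour by feeding node embeddings through a
    recurrent cell in visiting order (tanh RNN in place of an LSTM)."""
    h = np.zeros(W_h.shape[0])
    hidden = []
    for i in tour:                        # visit cities in tour order
        h = np.tanh(W_h @ h + W_z @ Z[i])
        hidden.append(h)
    return np.stack(hidden)               # (n, d) sequence representation

rng = np.random.default_rng(0)
n, d = 5, 8
Z = rng.standard_normal((n, d))           # node embeddings from the encoder
tour = [0, 2, 4, 1, 3]                    # a candidate TSP tour
H = rnn_tour_encoding(Z, tour,
                      0.1 * rng.standard_normal((d, d)),
                      0.1 * rng.standard_normal((d, d)))
```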

Dual Encoding

2 Encoders


         

6. Policy Decoder

Pointing Mechanism

We use a pointing mechanism to predict a distribution over node outputs given encoded actions (nodes) and a state representation (query vector). Our pointing mechanism is parameterized by two learned attention matrices K ∈ R^{d×d} and Q ∈ R^{d×d} and a vector v ∈ R^d.
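The mechanism above can be sketched as a compatibility score u_i = v^T tanh(K z_i + Q q) for each node, followed by a softmax. This is a minimal sketch assuming that standard pointer-network form; names are illustrative:

```python
import numpy as np

def pointing_distribution(Z, q, K, Q, v):
    """Score each node embedding z_i against query q with
    u_i = v^T tanh(K z_i + Q q), then softmax over nodes."""
    u = np.array([v @ np.tanh(K @ z + Q @ q) for z in Z])
    e = np.exp(u - u.max())            # numerically stable softmax
    return e / e.sum()                 # distribution over nodes

rng = np.random.default_rng(0)
n, d = 5, 8
Z = rng.standard_normal((n, d))        # encoded actions (nodes)
q = rng.standard_normal(d)             # state representation (query)
K, Q = rng.standard_normal((d, d)), rng.standard_normal((d, d))
v = rng.standard_normal(d)
p = pointing_distribution(Z, q, K, Q, v)
```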


7. Value Decoder


8. Policy Gradient Optimization

We maximize the expected return given a state S̄, defined as J(θ | S̄) = E_{π_θ}[G_t | S̄].
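The resulting gradient estimate has the familiar REINFORCE-with-baseline form, ∇J ≈ mean over steps of (G_t − b_t) ∇ log π_θ(a_t | s_t), where the baseline b_t comes from the value decoder. A toy sketch with made-up numbers (purely illustrative):

```python
import numpy as np

def reinforce_gradient(log_prob_grads, returns, baselines):
    """Policy-gradient estimate with a value baseline (sketch):
    average (G_t - b_t) * grad log pi(a_t | s_t) over steps."""
    advantages = returns - baselines              # variance reduction
    return np.mean(advantages[:, None] * log_prob_grads, axis=0)

# Toy inputs: 3 steps, a 2-parameter policy (illustrative only).
log_prob_grads = np.array([[0.1, -0.2],
                           [0.3,  0.0],
                           [-0.1, 0.2]])
returns = np.array([1.0, 0.5, 2.0])               # G_t
baselines = np.array([0.8, 0.8, 0.8])             # b_t from the value decoder
g = reinforce_gradient(log_prob_grads, returns, baselines)
```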


9. Experiments & Analysis
