1. Abstract
- Construction heuristics: such approaches find TSP solutions of good quality, but require additional procedures such as beam search and sampling to improve solutions and achieve state-of-the-art performance.
- Improvement heuristics: a given solution is iteratively improved until a near-optimal one is reached; this work focuses on learning such improvement heuristics.
- Our results show that the learned policies can improve even over random initial solutions and approach near-optimal solutions at a faster rate than previous state-of-the-art deep learning methods.
- We propose a policy gradient algorithm to learn a stochastic policy. Moreover, we introduce a policy neural network that leverages a pointing attention mechanism, which distinguishes it from previous works.
2. Conclusion
- We proposed a neural architecture with graph and sequence embeddings.
- One drawback of our policy gradient method is the large number of samples required to train a good policy.
3. Framework Overview
DRL + Improvement Heuristic + Encoder-Decoder
A2C
RL Formulation
Attention K, Q, V
The network includes graph convolutional layers (GCN) and a pointing mechanism.
4. Reinforcement Learning Formulation




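The notes frame the method as DRL applied to an improvement heuristic. Below is a minimal sketch of that setting as an environment step, assuming the state is the current tour, an action picks two positions for a segment-reversal (2-opt-style) move, and the reward is the resulting reduction in tour length; these specifics are illustrative assumptions, not taken from the paper text quoted here.

```python
# Hypothetical improvement-heuristic MDP step for the TSP.
import numpy as np

def tour_length(coords, tour):
    """Total length of a closed tour over 2-D city coordinates."""
    ordered = coords[tour]
    return np.linalg.norm(ordered - np.roll(ordered, -1, axis=0), axis=1).sum()

def step(coords, tour, action):
    """Apply one improvement move (segment reversal) and return (next_tour, reward)."""
    i, j = sorted(action)
    next_tour = tour.copy()
    next_tour[i:j + 1] = tour[i:j + 1][::-1]   # reverse the segment between positions i and j
    reward = tour_length(coords, tour) - tour_length(coords, next_tour)
    return next_tour, reward

# Usage: start from a random tour and apply a random move.
rng = np.random.default_rng(0)
coords = rng.random((20, 2))
tour = rng.permutation(20)
next_tour, reward = step(coords, tour, rng.choice(20, size=2, replace=False))
print(reward)  # positive when the move shortens the tour
```
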
5. Policy Gradient Neural Architecture
5.1 Encoder
Embedding Layer
The 2-D coordinates are mapped to d dimensions, and edge information is then added; the node feature matrix is weighted accordingly.
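
A minimal sketch of this embedding step, assuming a PyTorch implementation where a linear layer lifts the 2-D coordinates to d dimensions and pairwise distances provide the edge information used to weight the node feature matrix (the weighting scheme here is an assumption):

```python
import torch
import torch.nn as nn

class NodeEmbedding(nn.Module):
    def __init__(self, d=128):
        super().__init__()
        self.coord_proj = nn.Linear(2, d)   # 2-D coordinates -> d-dim node features

    def forward(self, coords):
        # coords: (batch, n, 2)
        x = self.coord_proj(coords)           # (batch, n, d) node feature matrix
        dist = torch.cdist(coords, coords)    # (batch, n, n) edge information (distances)
        w = torch.softmax(-dist, dim=-1)      # closer nodes receive larger weights
        x = torch.bmm(w, x)                   # weight the node feature matrix with edge info
        return x, dist
```
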
Graph Convolutional Layers


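The notes only name graph convolutional layers; the sketch below shows one possible dense GCN layer, assuming a (distance-derived) normalized adjacency matrix is passed in alongside the node embeddings:

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, d=128):
        super().__init__()
        self.linear = nn.Linear(d, d)

    def forward(self, x, adj):
        # x: (batch, n, d) node embeddings, adj: (batch, n, n) normalized adjacency
        h = torch.bmm(adj, self.linear(x))   # aggregate messages from neighbouring nodes
        return torch.relu(h) + x             # non-linearity plus residual connection
```
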
Sequence Embedding Layers
Next, we use node embeddings to learn a sequence representation of the input and encode a tour.
LSTM
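
A sketch of this step, assuming the LSTM reads the node embeddings in tour order and its final hidden state serves as a summary of the current solution:

```python
import torch
import torch.nn as nn

class TourEncoder(nn.Module):
    def __init__(self, d=128):
        super().__init__()
        self.lstm = nn.LSTM(input_size=d, hidden_size=d, batch_first=True)

    def forward(self, node_emb, tour):
        # node_emb: (batch, n, d); tour: (batch, n) node indices in visiting order
        idx = tour.unsqueeze(-1).expand(-1, -1, node_emb.size(-1))
        ordered = node_emb.gather(1, idx)        # reorder embeddings along the tour
        outputs, (h_n, _) = self.lstm(ordered)   # per-step outputs and final hidden state
        return outputs, h_n.squeeze(0)           # sequence embeddings and tour summary
```
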
Dual Encoding
Two encoders are used.
6. Policy Decoder



Pointing Mechanism
We use a pointing mechanism to predict a distribution over node outputs given encoded actions (nodes) and a state representation (query vector). Our pointing mechanism is parameterized by two learned attention matrices $K \in \mathbb{R}^{d \times d}$ and $Q \in \mathbb{R}^{d \times d}$, and a vector $v \in \mathbb{R}^{d}$.
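
A minimal sketch of such a pointing mechanism, with K and Q realized as bias-free linear layers and scores computed as $v^\top \tanh(K a_i + Q q)$ before a softmax over nodes (the exact scoring form is an assumption):

```python
import torch
import torch.nn as nn

class Pointer(nn.Module):
    def __init__(self, d=128):
        super().__init__()
        self.K = nn.Linear(d, d, bias=False)            # projects encoded actions (nodes)
        self.Q = nn.Linear(d, d, bias=False)            # projects the state/query vector
        self.v = nn.Parameter(torch.randn(d) / d ** 0.5)

    def forward(self, node_emb, query):
        # node_emb: (batch, n, d), query: (batch, d)
        scores = torch.tanh(self.K(node_emb) + self.Q(query).unsqueeze(1)) @ self.v  # (batch, n)
        return torch.softmax(scores, dim=-1)            # distribution over node outputs
```
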
7. Value Decoder

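The notes leave the value decoder unspecified; a common choice, assumed here, is a small MLP over a mean-pooled graph representation that outputs the scalar state value used as the critic/baseline:

```python
import torch
import torch.nn as nn

class ValueDecoder(nn.Module):
    def __init__(self, d=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, node_emb):
        # node_emb: (batch, n, d) -> mean-pool over nodes -> scalar value per instance
        return self.mlp(node_emb.mean(dim=1)).squeeze(-1)
```
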
8. Policy Gradient Optimization
We maximize the expected return given a state $\bar{S}$, defined as $J(\theta \mid \bar{S}) = \mathbb{E}_{\pi_\theta}[G_t \mid \bar{S}]$.
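
A sketch of the corresponding actor-critic (A2C-style) loss, assuming rollout tensors `log_probs`, `returns` ($G_t$), and `values` (from the value decoder) have been collected; the advantage $G_t - V(\bar{S})$ weights the log-probabilities to reduce variance:

```python
import torch
import torch.nn.functional as F

def policy_gradient_loss(log_probs, returns, values):
    # log_probs: (T,) log pi(a_t | s_t); returns: (T,) discounted returns G_t;
    # values: (T,) baseline estimates V(s_t) from the value decoder.
    advantage = returns - values.detach()            # G_t - V(s_t)
    policy_loss = -(advantage * log_probs).mean()    # ascend the expected return
    value_loss = F.mse_loss(values, returns)         # fit the baseline to the returns
    return policy_loss + 0.5 * value_loss
```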


9. Experiments & Analysis

