[Reinforcement Learning / Exploration] Beyond RND? Go-Explore: a New Approach for Hard-Exploration Problems

Detachment and derailment: two problems with existing intrinsically motivated (IM) exploration methods.

detachment -- Detachment is the idea that an agent driven by IM could become detached from the frontiers of high intrinsic reward (IR). To understand detachment, we must first consider that intrinsic reward is nearly always a consumable resource: a curious agent is curious about states to the extent that it has not often visited them (similar arguments apply for surprise, novelty, or prediction-error seeking agents [4, 14–16]). If an agent discovers multiple areas of the state space that produce high IR, its policy may in the short term focus on one such area.

After exhausting some of the IR offered by that area, the policy may by chance begin consuming IR in another area. This is what is meant by detachment.

Once it has exhausted that IR, it is difficult for it to rediscover the frontier it detached from in the initial area, because it has already consumed the IR that led to that frontier, and it likely will not remember how to return to that frontier due to catastrophic forgetting. Each time this process occurs, a potential avenue of exploration can be lost, or at least be difficult to rediscover. In the worst case, there may be a dearth of remaining IR near the areas of state space visited by the current policy (even though much IR might remain elsewhere), and therefore no learning signal remains to guide the agent to further explore in an effective and informed way.
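To make the "consumable resource" point concrete, here is a minimal count-based novelty bonus in Python. This is only an illustrative sketch, not the mechanism of any particular IM method (RND, for instance, uses a prediction-error bonus rather than counts); the state hashing and the 1/sqrt(count) decay are assumptions chosen for clarity.

```python
from collections import defaultdict

class CountBasedBonus:
    """Toy intrinsic-reward module: the bonus a state yields shrinks each
    time the state is visited, so intrinsic reward behaves like a
    consumable resource."""

    def __init__(self):
        self.visit_counts = defaultdict(int)  # state key -> number of visits

    def intrinsic_reward(self, state) -> float:
        key = hash(state)                     # illustrative: any state abstraction works
        self.visit_counts[key] += 1
        return 1.0 / (self.visit_counts[key] ** 0.5)

bonus = CountBasedBonus()
print([round(bonus.intrinsic_reward("room_A"), 3) for _ in range(5)])
# [1.0, 0.707, 0.577, 0.5, 0.447] -- revisiting the same state yields less and less IR
```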

One could slowly add intrinsic rewards back over time, but then the entire fruitless process could repeat indefinitely. (In other words, replenishing the IR would simply draw the agent back into areas it has already explored, so the detach-and-forget cycle can repeat without the lost frontier ever being recovered.)

In theory a replay buffer could prevent detachment, but in practice it would have to be large to prevent data about the abandoned frontier from being purged before it becomes needed, and large replay buffers introduce their own optimization stability difficulties [21, 22].

So a replay buffer could be used, though with the caveats above.

The Go-Explore algorithm addresses detachment by explicitly storing an archive of promising states visited so that they can then be revisited and explored from later.

The detachment example in the paper's figure is the clearest illustration of this; note that the purple area is later forgotten because of catastrophic forgetting in the neural network.
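A minimal sketch of such an archive, assuming the kind of coarse "cell" representation the paper describes for Atari (a downscaled, quantized grayscale frame). The specific 11x8 / 8-level downscaling parameters and the fields stored per cell are illustrative assumptions, not the definitive implementation.

```python
import numpy as np

def cell_of(frame: np.ndarray, size=(11, 8), levels=8) -> bytes:
    """Map an observation to a coarse, hashable cell key by downscaling
    and quantizing it (the 11x8 / 8-level parameters are illustrative)."""
    gray = frame.mean(axis=-1) if frame.ndim == 3 else frame
    ys = np.linspace(0, gray.shape[0] - 1, size[0]).astype(int)
    xs = np.linspace(0, gray.shape[1] - 1, size[1]).astype(int)
    quantized = (gray[np.ix_(ys, xs)] / 256.0 * levels).astype(np.uint8)
    return quantized.tobytes()

# Archive: cell key -> the best way found so far to reach that cell.
archive = {}

def maybe_add(cell_key, trajectory, score, sim_state, obs):
    """Insert a newly visited cell, or replace the stored entry when the
    cell was reached with a higher score (the paper also prefers shorter
    trajectories when scores tie)."""
    entry = archive.get(cell_key)
    if entry is None or score > entry["score"]:
        archive[cell_key] = {"trajectory": list(trajectory), "score": score,
                             "sim_state": sim_state, "obs": obs, "visits": 0}
```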

Derailment can occur when an agent has discovered a promising state and it would be beneficial to return to that state and explore from it. Typical RL algorithms attempt to enact such desirable behavior by running the policy that led to the initial state again, but with some stochastic perturbations to the existing policy mixed in to encourage a slightly different behavior (e.g. exploring further). The stochastic perturbation is performed because IM agents have two layers of exploration mechanisms: (1) the higher-level IR incentive that rewards when new states are reached, and (2) a more basic exploratory mechanism such as epsilon-greedy exploration, action-space noise, or parameter-space noise. 

Importantly, IM agents rely on the latter mechanism to discover states containing high IR, and the former mechanism to return to them. However, the longer, more complex, and more precise a sequence of actions needs to be in order to reach a previously-discovered high-IR state, the more likely it is that such stochastic perturbations will “derail” the agent from ever returning to that state. That is because the needed precise actions are naively perturbed by the basic exploration mechanism, causing the agent to only rarely succeed in reaching the known state to which it is drawn, and from which further exploration might be most effective. To address derailment, an insight in Go-Explore is that effective exploration can be decomposed into first returning to a promising state (without intentionally adding any exploration) before then exploring further.
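A quick back-of-the-envelope calculation shows why derailment becomes almost certain for long trajectories. Under epsilon-greedy exploration, and in the best case where the greedy action already matches the needed action at every step, each of the T precise actions is reproduced only with probability 1 - ε + ε/|A|, so the chance of retracing the whole path decays exponentially in T. The ε = 0.1 and |A| = 18 (the full Atari action set) below are illustrative choices.

```python
def p_exact_return(traj_len: int, epsilon: float = 0.1, n_actions: int = 18) -> float:
    """Probability that an epsilon-greedy policy whose greedy action already
    matches the demonstration reproduces all traj_len actions exactly."""
    p_step = (1.0 - epsilon) + epsilon / n_actions  # greedy choice, or a lucky random pick
    return p_step ** traj_len

for T in (10, 100, 1000):
    print(T, p_exact_return(T))
# T=10 -> ~0.37, T=100 -> ~5e-5, T=1000 -> ~1e-43:
# the basic exploration noise almost never lets the agent retrace a long, precise path.
```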

Go-Explore is an explicit response to both detachment and derailment that is also designed to achieve robust solutions in stochastic environments. The version presented here works in two phases (Fig. 2): (1) first solve the problem in a way that may be brittle, such as solving a deterministic version of the problem (i.e. discover how to solve the problem at all), and (2) then robustify (i.e. train to be able to reliably perform the solution in the presence of stochasticity). Similar to IM algorithms, Phase 1 focuses on exploring infrequently visited states, which forms the basis for dealing with sparse-reward and deceptive problems. In contrast to IM algorithms, Phase 1 addresses detachment and derailment by accumulating an archive of states and ways to reach them through two strategies:

(a) add all interestingly different states visited so far into the archive, and

(b) each time a state from the archive is selected to explore from, first Go back to that state (without adding exploration), and then Explore further from that state in search of new states (hence the name “Go-Explore”).

First collect all the interestingly different states into the archive, then explore starting from those states (the return to a chosen state is done without any random/exploratory policy). A minimal sketch of this loop follows.
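The sketch below assumes a deterministic, resettable simulator whose full state can be saved and restored (one of the two "go" options the paper discusses; the other is to replay the stored action sequence), and assumes the archive has already been seeded with the starting cell. The env.get_state / env.restore_state / env.num_actions interface, the selection heuristic, and the cell_of / maybe_add helpers from the earlier sketch are all illustrative assumptions.

```python
import random

def go_explore_phase1(env, n_iterations=1000, explore_steps=100):
    """Sketch of Go-Explore Phase 1, using the `archive`, `cell_of`, and
    `maybe_add` helpers from the earlier sketch: repeatedly pick a cell,
    return to it WITHOUT exploration, then explore randomly from there."""
    for _ in range(n_iterations):
        # 1. Select a promising cell (the paper uses heuristic selection weights;
        #    here we simply prefer cells that have been chosen least often).
        cell_key = min(archive, key=lambda k: archive[k]["visits"])
        entry = archive[cell_key]
        entry["visits"] += 1

        # 2. "Go": return to that cell deterministically, with no exploration noise.
        env.restore_state(entry["sim_state"])            # assumed API of a resettable simulator
        trajectory, score = list(entry["trajectory"]), entry["score"]

        # 3. "Explore": take random actions, archiving any interestingly different cells.
        for _ in range(explore_steps):
            action = random.randrange(env.num_actions)   # assumed attribute: size of the action set
            obs, reward, done = env.step(action)         # assumed to return (obs, reward, done)
            trajectory.append(action)
            score += reward
            maybe_add(cell_of(obs), trajectory, score, env.get_state(), obs)
            if done:
                break
    return archive
```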

An analogy of searching a house can help one contrast IM algorithms and Phase 1 of Go-Explore.

IM algorithms are akin to searching through a house with a flashlight, which casts a narrow beam of exploration first in one area of the house, then another, and another, and so on, with the light being drawn towards areas of intrinsic motivation at the edge of its small visible region. It can get lost if at any point the beam fails to fall on any area with intrinsic motivation remaining. Go-Explore more resembles turning the lights on in one room of a house, then its adjacent rooms, then their adjacent rooms, etc., until the entire house is illuminated. Go-Explore thus gradually expands its sphere of knowledge in all directions simultaneously until a solution is discovered.

If necessary, the second phase of Go-Explore robustifies high-performing trajectories from the archive such that they are robust to the stochastic dynamics of the true environment. Go-Explore robustifies via imitation learning (aka learning from demonstrations or LfD [26–29]), a technique that learns how to solve a task from human demonstrations.

The only difference with Go-Explore is that the solution demonstrations are produced automatically by Phase 1 of Go-Explore instead of being provided by humans. 

The input to this phase is one or more high-performing trajectories, and the output is a robust policy able to consistently achieve similar performance. The combination of both phases instantiates a powerful algorithm for hard-exploration problems, able to deeply explore sparse- and deceptive-reward environments and robustify high-performing trajectories into reliable solutions that perform well in the unmodified, stochastic test environment.
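As a concrete but deliberately simplified illustration of Phase 2, here is a behavioral-cloning sketch in PyTorch that fits a small policy to a demonstration trajectory produced by Phase 1. Behavioral cloning is the simplest form of LfD and is not the specific robustification method used in the paper's experiments; the network architecture, hyperparameters, and input shapes are assumptions.

```python
import torch
import torch.nn as nn

def robustify_by_cloning(demo_obs, demo_actions, n_actions, epochs=200, lr=1e-3):
    """Fit a small policy network to a Phase 1 demonstration via supervised
    learning on (observation -> action) pairs."""
    obs = torch.as_tensor(demo_obs, dtype=torch.float32)    # shape (T, obs_dim)
    acts = torch.as_tensor(demo_actions, dtype=torch.long)  # shape (T,)
    policy = nn.Sequential(
        nn.Linear(obs.shape[1], 128), nn.ReLU(),
        nn.Linear(128, n_actions),                           # logits over the action set
    )
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(policy(obs), acts)
        loss.backward()
        optimizer.step()
    return policy  # further RL-style training is what makes the policy robust to stochasticity
```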


The Go-Explore Algorithm

The insight that remembering and returning reliably to promising states is fundamental to effective exploration in sparse-reward problems is at the core of Go-Explore. Because this insight is so flexible and can be exploited in different ways, Go-Explore effectively encompasses a family of algorithms built around this key idea. The variant implemented for the experiments in this paper and described in detail in this section relies on two distinct phases. While it provides a canonical demonstration of the possibilities opened up by Go-Explore, other variants are also discussed (e.g. in Section 4) to provide a broader compass for future applications.
