Planning and Learning with Tabular Methods: Part 3

Heuristic Search

The classical state-space planning methods of AI are decision-time planning methods, collectively known as heuristic search.

The approximate value function is applied to the leaf nodes and then backed up toward the current state at the root. The backing up stops at the state-action nodes for the current state. Once the backed-up values of these nodes are computed, the best of them is chosen as the current action, and then all backed-up values are discarded.

The reason we search deeper than one step is to obtain better action selections. If one has a perfect model and an imperfect action-value function, then deeper search will usually yield better policies.

Certainly, if the search is all the way to the end of the episode, then the effect of the imperfect value function is eliminated, and the action determined in this way must be optimal. If the search is of sufficient depth $k$ such that $\gamma^k$ is very small, then the actions will be correspondingly near optimal.
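As a concrete illustration, here is a minimal sketch of such a depth-limited lookahead, assuming a deterministic sample model `model(s, a) -> (reward, next_state)`, a function `actions(s)` listing the legal actions, and an approximate value function `v_hat` applied at the leaves; all of these names are assumptions introduced for the sketch, not part of the text.

```python
def heuristic_search(root, model, actions, v_hat, gamma, depth):
    """Depth-limited lookahead: back up approximate values from the leaf
    nodes toward the root state and return the greedy action at the root."""

    def backed_up_value(s, d):
        candidates = actions(s)
        # At the search frontier (or a terminal state), fall back on the
        # approximate value function.
        if d == 0 or not candidates:
            return v_hat(s)
        # Otherwise back up the best one-step lookahead value (greedy backup).
        values = []
        for a in candidates:
            r, s_next = model(s, a)
            values.append(r + gamma * backed_up_value(s_next, d - 1))
        return max(values)

    # Backed-up values of the root's state-action nodes: the best of them is
    # chosen as the current action, and all backed-up values are then discarded.
    root_values = {}
    for a in actions(root):
        r, s_next = model(root, a)
        root_values[a] = r + gamma * backed_up_value(s_next, depth - 1)
    return max(root_values, key=root_values.get)
```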

Heuristic search focuses greatly on the current state. Much of the effectiveness of heuristic search is due to its search tree being tightly focused on the states and actions that might immediately follow the current state. Not only should computation be preferentially devoted to imminent events, but so should limited memory resources. This great focusing of memory and computational resources on the current decision is presumably the reason why heuristic search can be so effective.

We can use the methods of heuristic search to construct a search tree and then perform the individual one-step updates from the bottom up, as suggested by the following figure.
[Figure: a search tree built from individual one-step updates, backed up from the leaf nodes toward the root state]
Any state-space search can be viewed as the piecing together of a large number of individual one-step updates. The performance improvement observed with deeper search is due to the focus and concentration of updates on states and actions immediately downstream from the current state.

Rollout Algorithm

The idea of the rollout algorithm:

  • Estimate action values for a given policy by averaging the returns of many simulated trajectories that start with each possible action and then follow the given policy.
  • When the action-value estimates are considered accurate enough, the action (or one of the actions) with the highest estimated value is executed.
  • The process is then carried out anew from the resulting next state.

The rollout algorithm is a special case of the Monte Carlo control algorithm. It produces Monte Carlo estimates of action values only for the current state and for a given policy, usually called the rollout policy.

Given any two policies $\pi$ and $\pi'$ that are identical except that $\pi'(s) = a \neq \pi(s)$ for some state $s$, if $q_\pi(s, a) \geq v_\pi(s)$, then policy $\pi'$ is as good as, or better than, $\pi$. Moreover, if the inequality is strict, then $\pi'$ is in fact better than $\pi$.

This result applies to rollout algorithms, where $s$ is the current state and $\pi$ is the rollout policy.

The aim of a rollout algorithm is to improve upon the rollout policy rather than to find an optimal policy, and it works as follows:
Averaging the returns of the simulated trajectories produces estimates of $q_\pi(s, a')$ for each action $a' \in \mathcal{A}(s)$. Then the policy that selects an action in $s$ that maximizes these estimates and thereafter follows $\pi$ is a good candidate for a policy that improves over $\pi$.
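A minimal sketch of this procedure, where the names `actions`, `simulate_episode`, and `rollout_policy` are assumptions introduced for the illustration: `simulate_episode(s, a, rollout_policy)` is taken to return the full return of one simulated episode that starts by taking action `a` in state `s` and thereafter follows the rollout policy.

```python
def rollout_action(s, actions, simulate_episode, rollout_policy, n_trajectories):
    """Estimate q_pi(s, a') for each candidate action by averaging simulated
    returns, then act greedily with respect to these estimates (a sketch of
    the rollout idea, not a complete agent)."""
    q_estimates = {}
    for a in actions(s):
        returns = [
            simulate_episode(s, a, rollout_policy)    # one simulated trajectory
            for _ in range(n_trajectories)
        ]
        q_estimates[a] = sum(returns) / len(returns)  # Monte Carlo average
    # Acting greedily here implements one step of policy improvement over the
    # rollout policy, at the current state only.
    return max(q_estimates, key=q_estimates.get)
```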

The better the rollout policy and the more accurate the value estimates, the better the policy produced by a rollout algorithm is likely to be.

The rollout algorithm involves important tradeoffs, because better rollout policies typically mean that more time is needed to simulate enough trajectories to obtain good value estimates. But as decision-time planning methods, rollout algorithms have to meet strict time constraints, and there are mainly three ways to overcome this difficulty:

  1. Because the Monte Carlo trials are independent of one another, it is possible to run many trials in parallel on separate processors.
  2. Truncate the simulated trajectories short of complete episodes, correcting the truncated returns by means of a stored evaluation function (see the sketch after this list).
  3. Monitor the Monte Carlo simulations and prune away candidate actions that are unlikely to turn out to be the best, or whose values are close enough to that of the current best that choosing them instead would make no real difference.
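For point 2, here is a rough sketch of how a truncated return can be corrected with a stored evaluation function, assuming a sample model `step(s, a) -> (reward, next_state, done)`, a rollout policy `rollout_policy(s)`, and a stored state-value estimate `v_hat(s)`; all of these names are hypothetical.

```python
def truncated_return(s, a, step, rollout_policy, v_hat, gamma, horizon):
    """Simulate at most `horizon` steps from (s, a) under the rollout policy,
    then correct the truncated return with a stored evaluation function."""
    r, s, done = step(s, a)              # first step uses the candidate action
    g, discount = r, 1.0
    for _ in range(horizon - 1):
        if done:
            return g                     # episode ended before the horizon
        discount *= gamma
        r, s, done = step(s, rollout_policy(s))
        g += discount * r
    # Bootstrap: the missing tail of the episode is replaced by v_hat(s).
    return g if done else g + discount * gamma * v_hat(s)
```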

Rollout algorithms do not maintain long-term memories of values and policies, which is why they are not regarded as learning algorithms. But rollout algorithms are still closely related to RL to some extent, and they take advantage of these features:

  • They both avoid the exhaustive sweeps of dynamic programming by trajectory sampling.
  • They both avoid the need for distribution models by relying on sample, instead of expected, updates.
  • Rollout algorithms take advantage of the policy improvement property by acting greedily with respect to the estimated action values.

Monte Carlo Tree Search

MCTS (Monte Carlo Tree Search) is a recent and strikingly successful example of decision-time planning, and it is a kind of rollout algorithm. What sets it apart is that MCTS introduces a means for accumulating value estimates obtained from the Monte Carlo simulations in order to successively direct simulations toward more highly-rewarding trajectories.

MCTS is executed after encountering each new state to select the agent’s action for that state; it is executed again to select the action for the next state, and so on. Each execution is an iterative process that simulates many trajectories starting from the current state and running to a terminal state.

The core idea of MCTS is to successively focus multiple simulations starting at the current state by extending the initial portions of trajectories that have received high evaluations from earlier simulations.

Monte Carlo value estimates are maintained only for the subset of state-action pairs that are most likely to be reached in a few steps, and these form a tree rooted at the current state, as shown in the following figure.
[Figure: a Monte Carlo tree of state-action value estimates rooted at the current state]
MCTS incrementally extends the tree by adding nodes representing states that look promising based on the results of the simulated trajectories. Outside the tree and at the leaf nodes, the rollout policy is used for action selection. For the states inside the tree, we have value estimates for at least some of the actions, so we can pick among them using an informed policy, called the tree policy, which can be an $\epsilon$-greedy or UCB selection rule.
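One common UCB selection rule is UCB1, which gives the UCT variant of MCTS (this particular formula is an illustration, not something specified in the text above). At a tree node for state $s$ it selects

$$
a = \arg\max_{a'} \left[ Q(s, a') + c \sqrt{\frac{\ln N(s)}{N(s, a')}} \right],
$$

where $Q(s, a')$ is the current action-value estimate stored in the tree, $N(s)$ is the number of times the node has been visited, $N(s, a')$ is the number of times edge $(s, a')$ has been traversed, and $c > 0$ controls the amount of exploration.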

In general, each iteration of a basic version of MCTS consists of four steps: Selection, Expansion, Simulation, and Backup. MCTS keeps executing these four steps, and finally an action from the root node (which represents the current state) is selected according to some mechanism that depends on the accumulated statistics in the tree; for example, it may be the action having the largest action value, or perhaps the action with the largest visit count to avoid selecting outliers.
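As an illustration of how these four steps fit together, here is a compact, simplified sketch. All function and class names are assumptions introduced for the sketch: it assumes a deterministic sample model `model(s, a) -> (reward, next_state, done)`, a function `actions(s)` listing the legal actions, and a `rollout(s)` function that returns the simulated return of following the rollout policy from `s` (returning 0 for terminal states).

```python
import math


class Node:
    """One state node of the search tree, with statistics for its edges."""
    def __init__(self, state, actions):
        self.state = state
        self.untried = list(actions)   # actions not yet expanded
        self.children = {}             # action -> (reward, child Node)
        self.q = {}                    # action -> mean return estimate
        self.n = {}                    # action -> edge visit count
        self.visits = 0                # node visit count


def mcts(root_state, actions, model, rollout, gamma=1.0, n_iters=1000, c=1.4):
    root = Node(root_state, actions(root_state))
    for _ in range(n_iters):
        node, path = root, []

        # 1. Selection: descend the tree with a UCB tree policy.
        while not node.untried and node.children:
            a = max(node.children, key=lambda a: node.q[a]
                    + c * math.sqrt(math.log(node.visits) / node.n[a]))
            r, child = node.children[a]
            path.append((node, a, r))
            node = child

        # 2. Expansion: add one new child below the selected leaf.
        if node.untried:
            a = node.untried.pop()
            r, s_next, done = model(node.state, a)
            child = Node(s_next, [] if done else actions(s_next))
            node.children[a] = (r, child)
            node.q[a], node.n[a] = 0.0, 0
            path.append((node, a, r))
            node = child

        # 3. Simulation: complete the trajectory with the rollout policy.
        g = rollout(node.state)

        # 4. Backup: update the statistics of the tree edges on the path.
        for parent, a, r in reversed(path):
            g = r + gamma * g
            parent.visits += 1
            parent.n[a] += 1
            parent.q[a] += (g - parent.q[a]) / parent.n[a]

    # Act at the root, here by largest visit count.
    return max(root.n, key=root.n.get)
```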

In addition to the benefits it has as a kind of rollout algorithm, MCTS saves the action-value estimates attached to the tree edges and updates them using RL's sample updates, which explains why it produces such impressive results. This has the effect of focusing the Monte Carlo trials on trajectories whose initial segments are common to high-return trajectories previously simulated. And by expanding the tree, MCTS effectively grows a lookup table to store a partial action-value function, with memory allocated to the estimated values of state-action pairs visited in the initial segments of high-yielding sample trajectories. Thus, MCTS avoids the problem of globally approximating an action-value function, while it retains the benefit of using past experience to guide exploration.

The modifications and extensions of MCTS for use in both games and single-agent applications are still under research.
