Chapter 8: Planning and Learning with Tabular Methods

1 Introduction

  • In this chapter we develop a unified view of reinforcement learning methods that require a model of the environment, such as dynamic programming and heuristic search, and methods that can be used without a model, such as Monte Carlo and temporal-difference methods. These are respectively called model-based and model-free reinforcement learning methods. Model-based methods rely on planning as their primary component, while model-free methods primarily rely on learning.

  • Model-Based: By a model of the environment we mean anything that an agent can use to predict how the environment will respond to its actions. Some models produce a description of all possibilities and their probabilities; these we call distribution models. Other models produce just one of the possibilities, sampled according to the probabilities; these we call sample models.

  • Planning: We use this term to refer to any computational process that takes a model as input and produces or improves a policy for interacting with the modeled environment. The state-space planning methods we consider share a common structure: (1) they all involve computing value functions as a key intermediate step toward improving the policy, and (2) they compute those value functions by update or backup operations applied to simulated experience.

  • Planning and learning share the same basic structure but draw on different sources of experience: planning works with simulated experience generated from a model of the environment (i.e., from currently available information), while learning works with real experience from the environment itself. The heart of both is the estimation of value functions by backing-up update operations, and both aim to reach an optimal policy by computing value functions (here in tabular form). Real experience can also be used to improve the model, so planning and learning combine naturally, as in Dyna and Dyna-Q.

2 Summary of TD, DP, MC, (MCTS(HS), ES)

2.1 TD, MC, DP returns and update
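
For reference, the standard tabular state-value updates (with G_t denoting the Monte Carlo return) are:

  • MC: V(S_t) \leftarrow V(S_t) + \alpha \, [ G_t - V(S_t) ]
  • TD(0): V(S_t) \leftarrow V(S_t) + \alpha \, [ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) ]
  • DP (expected update): V(s) \leftarrow \sum_a \pi(a \mid s) \sum_{s',r} p(s',r \mid s,a) \, [ r + \gamma V(s') ]

MC uses the actual return of a completed (real or simulated) episode, TD(0) bootstraps from a single sampled next state, and DP bootstraps from all possible next states weighted by the model's probabilities.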

2.2 Expected & sample Update

Expected update Q:
Q(s,a) \leftarrow \sum_{s',r} p(s',r \mid s,a) \, \big[ r + \gamma \max_{a'} Q(s',a') \big]
The dynamics of the model, p(s',r \mid s,a), are needed to perform such an expected update.

Sample update Q:
Q(s,a) \leftarrow Q(s,a) + \alpha \, \big[ R + \gamma \max_{a'} Q(S',a') - Q(s,a) \big]
Only a single sampled transition (from real experience or a sample model) is needed for such a sample update.
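
As a minimal sketch of the contrast, assuming a tabular Q stored as a 2-D NumPy array and an illustrative distribution model p(s, a) that yields (s_next, r, prob) triples (these interfaces are assumptions, not from the text):

```python
import numpy as np

def expected_update(Q, s, a, p, gamma=0.95):
    """Expected (DP-style) update: needs the full distribution p(s', r | s, a)."""
    Q[s, a] = sum(prob * (r + gamma * np.max(Q[s_next]))
                  for s_next, r, prob in p(s, a))

def sample_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """Sample (Q-learning-style) update: needs only one sampled transition (s, a, r, s')."""
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
```

The expected update is exact with respect to the model but costs a sum over all successors, while the sample update is cheap but introduces sampling error that the step size α averages out.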

2.3 Trajectory sampling

As discussed previously, distributing updates uniformly during planning is often sub-optimal. This is because for many tasks, the majority of possible updates will be on irrelevant or low-probability trajectories.

We can instead generate experience and updates in planning by simulating the interaction of the current policy with the model and updating only the state-action pairs encountered along the simulated trajectories. We call this trajectory sampling. Naturally, trajectory sampling generates updates according to the on-policy distribution.

Focusing on the on-policy distribution can be beneficial because it causes uninteresting parts of the space to be ignored, but it can be detrimental because it causes the same parts of the space to be updated repeatedly. For larger problems, it is often the case that distributing updates according to the on-policy distribution is preferable to using the uniform distribution.
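
A sketch of trajectory sampling in planning, assuming a sample model exposing model.step(s, a) -> (r, s_next, done) and an ε-greedy policy derived from Q (these interfaces are assumptions for illustration):

```python
import numpy as np

def plan_by_trajectory_sampling(Q, model, start_state, n_trajectories=10,
                                max_steps=100, alpha=0.1, gamma=0.95, eps=0.1):
    """Simulate trajectories under the current eps-greedy policy and update
    only the state-action pairs that actually occur along them."""
    n_actions = Q.shape[1]
    for _ in range(n_trajectories):
        s = start_state
        for _ in range(max_steps):
            # eps-greedy action selection: the on-policy distribution emerges from this.
            if np.random.rand() < eps:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            r, s_next, done = model.step(s, a)              # simulated experience
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])           # sample update
            if done:
                break
            s = s_next
```

Only the pairs visited along simulated trajectories are updated, so the update distribution automatically matches the on-policy distribution.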

3 Dyna, Dyna-Q+

  • Indirect and direct learning:
    Both direct and indirect methods have advantages and disadvantages. Indirect methods often make fuller use of a limited amount of experience and thus achieve a better policy with fewer environmental interactions. On the other hand, direct methods are much simpler and are not affected by biases in the design of the model.

3.1 Dyna
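
Dyna-Q interleaves three processes in a single loop: direct RL on real experience (Q-learning), model learning (recording each observed transition), and planning (n Q-learning updates on transitions replayed from the learned model). A minimal sketch for a deterministic, episodic environment, assuming an illustrative env with reset() -> s and step(a) -> (s_next, r, done):

```python
import random
import numpy as np

def dyna_q(env, n_states, n_actions, episodes=50, n_planning=10,
           alpha=0.1, gamma=0.95, eps=0.1):
    Q = np.zeros((n_states, n_actions))
    model = {}  # (s, a) -> (r, s_next): deterministic model learned from experience

    def eps_greedy(s):
        if random.random() < eps:
            return random.randrange(n_actions)
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = eps_greedy(s)
            s_next, r, done = env.step(a)                          # (a) real experience
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])                  # (b) direct RL
            model[(s, a)] = (r, s_next)                            # (c) model learning
            for _ in range(n_planning):                            # (d) planning
                (ps, pa), (pr, pn) = random.choice(list(model.items()))
                Q[ps, pa] += alpha * (pr + gamma * np.max(Q[pn]) - Q[ps, pa])
            s = s_next
    return Q
```

With n_planning = 0 this reduces to plain Q-learning; increasing it lets the agent squeeze more policy improvement out of each real interaction.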



3.2 When the environment changes: Dyna-Q+

When the environment changes over time, the currently optimal policy may also change, so the agent must keep exploring. To encourage exploration, a special bonus reward can be given to long-untried actions, resulting in the Dyna-Q+ method.

In a planning context, exploration means trying actions that improve the model, whereas exploitation means behaving in the optimal way given the current model.

The Dyna-Q+ agent, which does solve the book's shortcut-maze example (where a shorter path opens up only after some time), uses one such heuristic. This agent keeps track, for each state–action pair, of how many time steps have elapsed since the pair was last tried in a real interaction with the environment. The more time that has elapsed, the greater (we might presume) the chance that the dynamics of this pair have changed and that the model of it is incorrect. To encourage behavior that tests long-untried actions, a special "bonus reward" is given on simulated experiences involving these actions.
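
Concretely, if the modeled reward for a transition is r and the pair has not been tried in τ real time steps, planning updates are done as if the transition produced r + κ√τ for some small κ. A sketch of one such planning update, reusing the Q and model tables from the Dyna-Q sketch above plus an assumed last_tried table and global step counter t_now:

```python
import math, random
import numpy as np

def dyna_q_plus_planning_step(Q, model, last_tried, t_now,
                              kappa=1e-3, alpha=0.1, gamma=0.95):
    """One simulated update with the Dyna-Q+ exploration bonus r + kappa * sqrt(tau)."""
    (s, a), (r, s_next) = random.choice(list(model.items()))
    tau = t_now - last_tried[(s, a)]        # real time steps since (s, a) was last tried
    bonus = kappa * math.sqrt(tau)          # grows the longer the pair goes untested
    Q[s, a] += alpha * (r + bonus + gamma * np.max(Q[s_next]) - Q[s, a])
```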

3.3 When the state space is large: Prioritized Sweeping

When the state space is large or the model is complex, it becomes important to focus computation on the states or state–action pairs that matter most. The ideas of backward focusing and prioritized sweeping can be used to concentrate updates on the most influential state or state–action values.

  • backward focusing
    In general, we want to work back not just from goal states but from any state whose value has changed.
  • forward focusing
    An alternative is to focus on states according to how easily they can be reached from the states that are visited frequently under the current policy; this might be called forward focusing.

It is natural to prioritize the updates according to a measure of their urgency and to perform them in order of priority. This is the idea behind prioritized sweeping: a queue is maintained of every state–action pair whose estimated value would change nontrivially if updated, prioritized by the size of the change.
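
A sketch of the prioritized-sweeping planning loop for a deterministic model, assuming the same model[(s, a)] = (r, s_next) table as above and an assumed predecessors[s] map listing (s_pred, a_pred) pairs known to lead to s:

```python
import heapq
import numpy as np

def prioritized_sweeping(Q, model, predecessors, s, a,
                         theta=1e-4, n_updates=50, alpha=0.5, gamma=0.95):
    """Seed the queue with the just-experienced pair, then repeatedly update the
    highest-priority pair and push its predecessors (backward focusing)."""
    pqueue = []  # min-heap over negative priority

    def push(state, action):
        rew, nxt = model[(state, action)]
        priority = abs(rew + gamma * np.max(Q[nxt]) - Q[state, action])
        if priority > theta:                     # only nontrivial changes enter the queue
            heapq.heappush(pqueue, (-priority, state, action))

    push(s, a)
    for _ in range(n_updates):
        if not pqueue:
            break
        _, ps, pa = heapq.heappop(pqueue)
        pr, pn = model[(ps, pa)]
        Q[ps, pa] += alpha * (pr + gamma * np.max(Q[pn]) - Q[ps, pa])
        for pre_s, pre_a in predecessors.get(ps, ()):
            push(pre_s, pre_a)                   # propagate the change backward
```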

4 Real-time dynamic programming

Real-time dynamic programming (RTDP) is an on-policy, trajectory-sampling version of value-iteration DP: it performs DP value-iteration updates, but distributes them according to the on-policy distribution. As such, it is a form of asynchronous DP.

In general, finding an optimal policy with an on-policy trajectory-sampling control method (e.g., Sarsa) requires visiting all state–action pairs infinitely many times in the limit. This is true for RTDP as well, but there are certain types of problems for which RTDP is guaranteed to find an optimal partial policy (one that is optimal on the relevant states) without visiting all states infinitely often, or even at all. This is an advantage for problems with very large state sets.
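
A sketch of one RTDP trial, assuming a known distribution model p(s, a) yielding (s_next, r, prob) triples, an actions(s) function, a sample_next(s, a) sampler for following the trajectory, and an is_goal(s) test (all of these interfaces are illustrative assumptions):

```python
def rtdp_trial(V, p, actions, sample_next, start_state, is_goal,
               gamma=1.0, max_steps=1000):
    """One RTDP trial: full value-iteration updates applied only along a greedy trajectory."""
    s = start_state
    for _ in range(max_steps):
        if is_goal(s):
            break
        # Expected update with a max over actions, i.e. value iteration at state s only.
        q = {a: sum(prob * (r + gamma * V[s_next]) for s_next, r, prob in p(s, a))
             for a in actions(s)}
        best_a = max(q, key=q.get)
        V[s] = q[best_a]
        s = sample_next(s, best_a)   # follow the greedy action to the next state
    return V
```

Repeating such trials from (possibly randomized) start states concentrates the value-iteration updates on the states actually reachable under greedy behavior.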

5 Decision-time planning

The type of planning we have considered so far is the gradual improvement of a policy or value function based on simulated experience. Because it is not focused on selecting an action for the current state, it can run in the background and is called background planning.

An alternative type of planning, decision-time planning, is a search (sometimes many actions deep) over possible future trajectories starting from the current state, carried out in order to select the next action.

5.1 Heuristic search

In heuristic search, for each state encountered, a large tree of possible continuations is considered. The approximate value function is applied to the leaf nodes and then backed up toward the current state at the root. The backing up within the search tree is the same as in the expected updates with maxes discussed throughout this book. The backing up stops at the state–action nodes for the current state. Once the backed-up values of these nodes are computed, the best of them is chosen as the current action, and then all backed-up values are discarded.
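
A sketch of a depth-limited heuristic-search backup, assuming a distribution model p(s, a), an actions(s) function, and an approximate leaf evaluation v_hat(s) (names are illustrative):

```python
def heuristic_search_action(s0, actions, p, v_hat, depth=3, gamma=0.95):
    """Expand expected updates with maxes to a fixed depth, evaluate the leaves
    with v_hat, back the values up, and return the best root action."""
    def value(s, d):
        if d == 0:
            return v_hat(s)                      # approximate value function at the leaves
        return max(q_value(s, a, d) for a in actions(s))

    def q_value(s, a, d):
        return sum(prob * (r + gamma * value(s_next, d - 1))
                   for s_next, r, prob in p(s, a))

    # The backup stops at the root's action nodes; pick the best and discard the rest.
    return max(actions(s0), key=lambda a: q_value(s0, a, depth))
```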

5.2 Rollout Algorithm

Rollout algorithms are decision-time planning algorithms based on Monte Carlo control applied to simulated trajectories that all begin at the current environment state. Starting from the current state, they estimate the value of each candidate action by averaging the returns of simulated trajectories that begin with that action and then follow a given policy, called the rollout policy. The action with the highest estimated value is executed, the environment transitions, and the process is repeated from the new state. This is useful when one already has a reasonable policy and needs to average over stochasticity in the environment to pick a good immediate action.
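
A sketch of one rollout decision, assuming a sample model model.step(s, a) -> (r, s_next, done), an actions(s) function, and a given rollout_policy(s) (these names are assumptions): each candidate action is tried first, the rollout policy is followed afterward, and the returns are averaged.

```python
import numpy as np

def rollout_action(s0, actions, model, rollout_policy,
                   n_sims=50, gamma=1.0, max_steps=200):
    """Estimate each action's value by Monte Carlo rollouts and return the best action."""
    def simulate(s, a):
        ret, discount = 0.0, 1.0
        for _ in range(max_steps):
            r, s, done = model.step(s, a)
            ret += discount * r
            discount *= gamma
            if done:
                break
            a = rollout_policy(s)        # after the first action, follow the rollout policy
        return ret

    estimates = {a: np.mean([simulate(s0, a) for _ in range(n_sims)]) for a in actions(s0)}
    return max(estimates, key=estimates.get)
```

By the policy improvement theorem, acting greedily with respect to these estimates and thereafter following the rollout policy is at least as good as the rollout policy itself.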

5.3 Monte Carlo tree search (MCTS)

Monte Carlo Tree Search (MCTS) is a successful example of decision-time planning. It is a rollout algorithm that accumulates value estimates from its Monte Carlo simulations in order to guide the search. A variant of MCTS was used in AlphaGo.

A basic version of MCTS repeats the following steps, starting at the current state:

  1. Selection: Starting at the root node, a tree policy based on action-values attached to the edges of the tree (that balances exploration and exploitation) traverses the tree to select a leaf node.
  2. Expansion: On some iterations (depending on the implementation), the tree is expanded from the selected leaf node by adding one or more child nodes reached from the selected node via unexplored actions.
  3. Simulation: From the selected node, or from one of its newly added child nodes (if any), a simulation of a complete episode is run with actions selected by the rollout policy. The result is a Monte Carlo trial with actions selected first by the tree policy and, beyond the tree, by the rollout policy.
  4. Backup: The return generated by the simulated episode is backed up to update, or to initialise, the action values attached to the edges of the tree traversed by the tree policy in this iteration of MCTS. No values are saved for the states and actions visited by the rollout policy beyond the tree.

MCTS executes this process iteratively, starting at the current state, until no more time is left or computational resources are exhausted. An action is then selected according to some statistic of the root node (e.g., the largest action value or the most-visited edge). After the environment transitions to a new state, MCTS is run again, sometimes starting with a tree containing only a single root node representing the new state, but often starting with a tree containing any descendants of this node left over from the tree constructed in the previous execution of MCTS; all remaining nodes are discarded, along with the action values associated with them.
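
A compact MCTS sketch with UCB1 as the tree policy and a uniform-random rollout policy, assuming a sample model model.step(s, a) -> (r, s_next, done) and a finite action list actions(s); for simplicity the model is treated as deterministic so each tree edge has a single child (all names are illustrative, and real implementations add tree reuse, transposition handling, etc.):

```python
import math, random

class Node:
    def __init__(self, state):
        self.state = state
        self.children = {}   # action -> child Node
        self.N = {}          # action -> visit count of the edge
        self.Q = {}          # action -> mean return backed up through the edge

def mcts(root_state, actions, model, n_iter=1000, gamma=1.0, c=1.4, max_rollout=100):
    root = Node(root_state)

    def ucb(node, a):        # tree policy: balance exploitation (Q) and exploration (N)
        total = sum(node.N.values())
        return node.Q[a] + c * math.sqrt(math.log(total) / node.N[a])

    def rollout(s):          # simulation beyond the tree; nothing is saved here
        ret, discount = 0.0, 1.0
        for _ in range(max_rollout):
            r, s, done = model.step(s, random.choice(actions(s)))
            ret += discount * r
            discount *= gamma
            if done:
                break
        return ret

    def simulate(node):
        untried = [a for a in actions(node.state) if a not in node.children]
        if untried:                                   # expansion
            a = random.choice(untried)
            r, s_next, done = model.step(node.state, a)
            node.children[a] = Node(s_next)
            node.N[a], node.Q[a] = 0, 0.0
            ret = r + (0.0 if done else gamma * rollout(s_next))
        else:                                         # selection: descend via UCB1
            a = max(node.children, key=lambda act: ucb(node, act))
            r, s_next, done = model.step(node.state, a)
            ret = r + (0.0 if done else gamma * simulate(node.children[a]))
        node.N[a] += 1                                # backup along the traversed edge
        node.Q[a] += (ret - node.Q[a]) / node.N[a]
        return ret

    for _ in range(n_iter):
        simulate(root)
    return max(root.N, key=root.N.get)                # act, e.g., by most-visited edge
```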
