[OpenAI SpinningUp] Key Concepts and Terminology

This post surveys core concepts in reinforcement learning, including states and observations, action spaces, policies, trajectories, and reward and return. It discusses deterministic and stochastic policies, such as discrete (categorical) and diagonal Gaussian policies, and explains the importance of value functions and the Bellman equations.

Key Concepts and Terminology

overview

An agent interacts with the environment as follows:

  1. the agent observes the environment and takes an action,
  2. the action changes the environment’s current state, and meanwhile the agent receives a reward,
  3. back to 1.

Over an episode, the return the agent gets is defined as the cumulative reward.
Briefly speaking, the goal of RL is to find an agent with the best possible policy, i.e. one that receives the maximal return.
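As a minimal sketch of this interaction loop, assuming the classic gym API where env.step returns an (obs, reward, done, info) tuple (newer gym/gymnasium versions split done into terminated and truncated):

```python
import gym

env = gym.make("CartPole-v1")
obs = env.reset()                                  # 1. the agent observes the environment
episode_return, done = 0.0, False

while not done:
    action = env.action_space.sample()             # stand-in for a policy: a random action
    obs, reward, done, info = env.step(action)     # 2. the action changes the state, a reward comes back
    episode_return += reward                       # the return is the cumulative reward
                                                   # 3. back to observing, until the episode ends
print("episode return:", episode_return)
```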

states and observations

| concept | meaning |
| --- | --- |
| state | full description of the environment |
| observation | partial description of the environment |

Both states and observations can be represented as vectors, matrices, or higher-dimensional tensors; e.g., visual observations often use RGB matrices, and robots in Gym often use high-dimensional vectors that encapsulate angles and velocities.

However, the literature does not differentiate the two concepts clearly. Often when researchers mention state in a paper, their experimental subjects actually have observations instead, because the subjects normally have only partial access to the environment.

action spaces

The set of all valid actions is termed action space.

| concept | meaning | representation |
| --- | --- | --- |
| discrete action space | finite options of actions | integer indices |
| continuous action space | infinite, smooth action changes | real-valued vectors |

Since the goal of RL is to find optimal policies, whose responsibility is to give appropriate actions upon observing the environment, the type of action space matters to the method used to find them.
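As a small illustration using gym's space classes (a sketch; the exact printed output depends on the gym version):

```python
from gym import spaces

# discrete action space: a finite set of actions, indexed by integers
discrete_space = spaces.Discrete(4)
print(discrete_space.sample())     # an integer in {0, 1, 2, 3}

# continuous action space: real-valued action vectors within a range
continuous_space = spaces.Box(low=-1.0, high=1.0, shape=(2,))
print(continuous_space.sample())   # a real-valued vector in [-1, 1]^2
```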

policies

The policy is what an agent conforms to in order to choose actions when observing the environment. It takes the current state (or observation) as input and outputs an action.

| concept | notation |
| --- | --- |
| deterministic policy | $a_t = \pi(s_t)$ |
| stochastic policy | $a_t \sim \pi(\cdot \mid s_t)$ |

I would interpret stochastic as meaning that actions are randomly sampled from a distribution that depends on the state.

Note that policy sometimes takes the place of agent in the literature. This is reasonable, since all an agent does is follow its policy.

In deep RL, we deal with parameterized policies. The policy is a parameterized computable function, e.g. in the form of a neural network, so that we can obtain a policy that produces optimal actions conditioned on given states by adjusting its parameters with some optimization method.

deterministic policies

A typical example is several stacked dense layers with activations, followed by a final dense layer that outputs logits.

The above example looks like a vanilla finite-class classifier. The essential difference from stochastic policies, from my perspective, is that it takes an argmax over the logits instead of sampling, rendering it a deterministic function.
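A minimal sketch of such a deterministic categorical policy, assuming TensorFlow 2 / tf.keras and made-up dimensions obs_dim and n_actions:

```python
import tensorflow as tf

obs_dim, n_actions = 8, 4      # hypothetical dimensions, for illustration only

# stacked dense layers with activations, then a final dense layer producing logits
policy_net = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="tanh", input_shape=(obs_dim,)),
    tf.keras.layers.Dense(64, activation="tanh"),
    tf.keras.layers.Dense(n_actions),       # one logit per discrete action
])

obs = tf.random.normal((1, obs_dim))        # a dummy observation
logits = policy_net(obs)
action = tf.argmax(logits, axis=-1)         # deterministic: argmax instead of sampling
```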

To think further, must a deterministic policy be categorical? The question is in essence asking whether a deterministic policy can have an infinite action space.
At first glance, I thought I just needed an infinite action space: get a logit for each action, take an argmax, and my work is over.
But how can we have logits for an infinite action space? There is no way to enumerate all possible actions (each of which is a real-valued vector) and get their logits, since the action space is infinite, so the argmax-over-logits construction breaks down.
The answer, though, is no: for a continuous action space the network can simply output the action vector directly, $a_t = \mu_\theta(s_t)$, with no logits or argmax involved. So a deterministic policy need not be categorical.

stochastic policies

| concept | action space |
| --- | --- |
| categorical policies | discrete |
| diagonal Gaussian policies | continuous |

There are two essential tasks: 1) sampling, and 2) computing log-likelihoods.

1. categorical policies

the common practice is:

  1. via network inference we get logits for the finite actions.

  2. sample from the softmax-ed logits:
    tf.multinomial draws samples from the categorical distribution defined by the logits; its name comes from the multinomial distribution, which models n trials over k options with specified probabilities.

    recall that
    >> bernoulli distribution models the outcome of a single trial of two options
    >> binomial distribution models the outcome of n trials of two options
    >> multinomial distribution models the outcome of n trials of k options

  3. log-likelihood: take the log-softmax of the logits and index it with the sampled action.
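Putting the three steps together, a minimal sketch assuming TensorFlow 2 (where tf.random.categorical replaces TF1's tf.multinomial) and hypothetical dimensions:

```python
import tensorflow as tf

obs_dim, n_actions = 8, 4                              # hypothetical dimensions
logits_net = tf.keras.Sequential([                     # 1. network producing logits
    tf.keras.layers.Dense(64, activation="tanh", input_shape=(obs_dim,)),
    tf.keras.layers.Dense(n_actions),
])

obs = tf.random.normal((5, obs_dim))                   # a batch of dummy observations
logits = logits_net(obs)

# 2. sample actions from the softmax-ed logits
actions = tf.squeeze(tf.random.categorical(logits, num_samples=1), axis=-1)

# 3. log-likelihood: index the log-softmax of the logits with the sampled actions
log_probs = tf.nn.log_softmax(logits)
logp_a = tf.reduce_sum(tf.one_hot(actions, n_actions) * log_probs, axis=-1)
```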

2. Diagonal Gaussian policies

As the name suggests, the distribution from which actions are sampled is a diagonal multivariate Gaussian distribution. As a special case of the general multivariate Gaussian, diagonal here refers to the covariance matrix of the distribution being diagonal, meaning the dimensions are independent of each other.

Normally such a distribution is represented by a mean vector and a diagonal covariance matrix (equivalent to a variance vector).

As common practice, the mean vector of the policy is modeled by a neural network that takes a state vector as input and outputs the mean action vector used for the distribution of actions.

Two options are commonly used for modeling the variance (in practice, the log standard deviation) vector:

  1. it can be a vector of standalone parameters independent of the state, i.e. $\log\sigma$;
  2. it can depend on the state, $\log\sigma_t = \phi_\theta(s_t)$, where the function $\phi$ is parameterized by $\theta$.

Note that we use the log standard deviation because it can take any value in $(-\infty, +\infty)$ while the standard deviation itself must be non-negative, and training turns out to be easier if we do not have to enforce such constraints.

The sampling process can be done by tf.random_normal in TensorFlow, following exactly $a = \mu_\theta(s_t) + \sigma_\theta(s_t) \odot z$, where $\mu$ and $\sigma$ are respectively the mean and the standard deviation, and $z \sim \mathcal{N}(0, I)$.

The log-likelihood is obtained via the following:

$\log \pi_\theta(a \mid s) = -\frac{1}{2}\left(\sum_{i=1}^{k}\left(\frac{(a_i - \mu_i)^2}{\sigma_i^2} + 2\log\sigma_i\right) + k\log 2\pi\right)$

where $a$ is a $k$-dimensional action, and $\mu$ and $\sigma$ are the mean and standard deviation of the diagonal Gaussian.
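A minimal sketch of both the sampling and the log-likelihood, assuming TensorFlow 2 (tf.random.normal is the TF2 name for tf.random_normal), the state-independent log-std option, and hypothetical dimensions:

```python
import numpy as np
import tensorflow as tf

obs_dim, act_dim = 8, 2                                 # hypothetical dimensions
mu_net = tf.keras.Sequential([                          # mean vector as a function of the state
    tf.keras.layers.Dense(64, activation="tanh", input_shape=(obs_dim,)),
    tf.keras.layers.Dense(act_dim),
])
log_std = tf.Variable(-0.5 * tf.ones(act_dim))          # standalone log sigma parameters (option 1)

obs = tf.random.normal((5, obs_dim))
mu, std = mu_net(obs), tf.exp(log_std)

# sampling: a = mu + sigma * z, with z ~ N(0, I)
z = tf.random.normal(tf.shape(mu))
a = mu + std * z

# log-likelihood of a diagonal Gaussian (the formula above)
logp_a = -0.5 * (tf.reduce_sum(((a - mu) / std) ** 2 + 2.0 * log_std, axis=-1)
                 + act_dim * np.log(2.0 * np.pi))
```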

trajectories

def
start-state
state-transition

  • mdp
  • determ
  • stochastic
    alias
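As a sketch of the definitions these outline notes point to, in SpinningUp's notation: a trajectory (also called an episode or a rollout) is a sequence of states and actions
$\tau = (s_0, a_0, s_1, a_1, \dots)$,
where the start state is sampled from the start-state distribution, $s_0 \sim \rho_0(\cdot)$, and state transitions in the MDP are either deterministic, $s_{t+1} = f(s_t, a_t)$, or stochastic, $s_{t+1} \sim P(\cdot \mid s_t, a_t)$.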

reward and return

reward function
why return
finite undiscounted return
infinite discounted return

  • why discount
    common practice: optimize undiscounted, value functions use discounted
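As a sketch of the corresponding formulas: the reward function gives $r_t = R(s_t, a_t, s_{t+1})$; the finite-horizon undiscounted return is
$R(\tau) = \sum_{t=0}^{T} r_t$,
and the infinite-horizon discounted return is
$R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t$, with discount factor $\gamma \in (0, 1)$,
which keeps the infinite sum finite and values rewards now more than rewards later.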

the RL problem

goal: expected return
(stochastic case) proba of a T-step Traj.
expected return
optimal policy
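A sketch of the formulas behind these notes: for a stochastic policy and stochastic transitions, the probability of a $T$-step trajectory is
$P(\tau \mid \pi) = \rho_0(s_0) \prod_{t=0}^{T-1} P(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t)$,
the expected return is
$J(\pi) = \mathbb{E}_{\tau \sim \pi}[R(\tau)]$,
and the optimal policy is $\pi^* = \arg\max_\pi J(\pi)$.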

value functions

def. what does the value mean
four types
on-policy off-policy
two lemma value and q
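A sketch of the four value functions these notes list, in SpinningUp's notation:
the on-policy value function $V^\pi(s) = \mathbb{E}_{\tau \sim \pi}[R(\tau) \mid s_0 = s]$,
the on-policy action-value function $Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi}[R(\tau) \mid s_0 = s, a_0 = a]$,
the optimal value function $V^*(s) = \max_\pi V^\pi(s)$,
and the optimal action-value function $Q^*(s, a) = \max_\pi Q^\pi(s, a)$.
The two lemmas connecting value and q are $V^\pi(s) = \mathbb{E}_{a \sim \pi}[Q^\pi(s, a)]$ and $V^*(s) = \max_a Q^*(s, a)$.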

the optimal Q-function and the optimal action

def. > rel.
note. the policy with respect to optimal action
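A sketch of the relation these notes refer to: the optimal action is obtained directly from the optimal Q-function,
$a^*(s) = \arg\max_a Q^*(s, a)$,
so a policy acting greedily with respect to $Q^*$ is an optimal policy (if several actions maximize $Q^*$, any of them may be chosen).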

Bellman Equations

in recurrence
the optimal value and q
bellman backup
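A sketch of the Bellman equations in recurrence form:
$V^\pi(s) = \mathbb{E}_{a \sim \pi,\, s' \sim P}\left[r(s, a) + \gamma V^\pi(s')\right]$,
$Q^\pi(s, a) = \mathbb{E}_{s' \sim P}\left[r(s, a) + \gamma\, \mathbb{E}_{a' \sim \pi}[Q^\pi(s', a')]\right]$,
and for the optimal value functions
$V^*(s) = \max_a \mathbb{E}_{s' \sim P}\left[r(s, a) + \gamma V^*(s')\right]$,
$Q^*(s, a) = \mathbb{E}_{s' \sim P}\left[r(s, a) + \gamma \max_{a'} Q^*(s', a')\right]$.
The right-hand side of a Bellman equation is what is referred to as the Bellman backup.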

Advantage function

meaning
form
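A sketch of the form these notes refer to:
$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$,
which measures how much better taking action $a$ in state $s$ is than acting according to $\pi$ on average.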
