State:
- State is the information used to determine what happens next
- Formally, state is a function of the history:

  $S_t = f(H_t)$
There are three definitions of state:

1. Environment state (the state of the environment as a whole)
   - The environment state $S_t^e$ is the environment's private representation
   - i.e. whatever data the environment uses to pick the next observation/reward
   - The environment state is not usually visible to the agent
   - Even if $S_t^e$ is visible, it may contain irrelevant information
2. Agent state (the agent's own internal state)
   - The agent state $S_t^a$ is the agent's internal representation
   - i.e. whatever information the agent uses to pick the next action
   - i.e. it is the information used by reinforcement learning algorithms
   - It can be any function of history:

     $S_t^a = f(H_t)$
3. Information state (a.k.a. Markov state): grounded in information theory, it can be regarded as the foundation of reinforcement learning.
   - An information state contains all useful information from the history.
Definition (Markov state)

A state $S_t$ is Markov if and only if

$P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, \ldots, S_t]$
By definition, the next state depends only on the current state; in other words, a Markov state captures all the useful information in the history, and earlier states add nothing further.
“The future is independent of the past given the present”
$H_{1:t} \rightarrow S_t \rightarrow H_{t+1:\infty}$
The current state, not the full history, determines the optimal decision.
Take a helicopter as an example: its velocity, wind direction, and position ten minutes ago are irrelevant to the next action; only the current velocity, wind direction, and position matter.
Once the state is known, the history may be thrown away
i.e. The state is a sufficient statistic of the future
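This "sufficient statistic" idea can be checked empirically with a toy Markov chain (the two-state transition matrix below is made up for illustration): because the transition probabilities depend only on the current state, the distribution of $S_{t+1}$ given $S_t$ is the same no matter how the chain arrived at $S_t$.

```python
import numpy as np

# Hypothetical two-state Markov chain illustrating
# P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t].
P = np.array([[0.9, 0.1],   # transition probabilities from state 0
              [0.3, 0.7]])  # transition probabilities from state 1

rng = np.random.default_rng(0)

def step(s):
    """Sample the next state given only the current state s."""
    return rng.choice(2, p=P[s])

# The empirical distribution of S_{t+1} given S_t = 0 matches row 0 of P,
# regardless of any earlier history.
samples = [step(0) for _ in range(10_000)]
print(np.mean(samples))  # close to 0.1 (the probability of moving to state 1)
```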
- The environment state $S_t^e$ is Markov
- The history $H_t$ is Markov
Fully Observable Environments

Full observability: the agent directly observes the environment state, so the observed environment state equals the agent's state:

$O_t = S_t^a = S_t^e$

- Agent state = environment state = information state
- Formally, this is a Markov decision process (MDP)
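An MDP can be sketched as a table of transition distributions (the states, actions, probabilities, and rewards below are invented for illustration):

```python
import random

# Hypothetical MDP: transitions[state][action] -> list of (probability, next_state, reward).
transitions = {
    "s0": {"left":  [(1.0, "s0", 0.0)],
           "right": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"left":  [(1.0, "s0", 0.0)],
           "right": [(1.0, "s1", 2.0)]},
}

def sample_step(state, action):
    """Sample (next_state, reward) from the transition distribution."""
    r = random.random()
    cumulative = 0.0
    for p, nxt, reward in transitions[state][action]:
        cumulative += p
        if r < cumulative:
            return nxt, reward
    return nxt, reward  # guard against floating-point round-off

next_state, reward = sample_step("s0", "right")
```

The key Markov feature is that `transitions` is indexed only by the current state and action, never by the history.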
Partially Observable Environments

Partial observability: the agent indirectly observes the environment; it cannot observe the environment state directly.

Formally this is a partially observable Markov decision process (POMDP).

Here the observation, the agent state, and the environment state are no longer equal, so the representation of the (Markov) state becomes crucial: the agent must choose a suitable representation. There are generally three main approaches:
The agent must construct its own state representation $S_t^a$, e.g.:
1. Complete history: $S_t^a = H_t$
2. Beliefs of environment state: a Bayesian approach. Since the agent cannot be certain which state the environment is in, it maintains a probability distribution over the possible environment states:

   $S_t^a = (P[S_t^e = s^1], \ldots, P[S_t^e = s^n])$

3. Recurrent neural network: combine the agent's previous state with the latest observation through a linear combination followed by a nonlinearity:

   $S_t^a = \sigma(S_{t-1}^a W_s + O_t W_o)$
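The recurrent update $S_t^a = \sigma(S_{t-1}^a W_s + O_t W_o)$ can be sketched with NumPy; the dimensions, random weights, and logistic-sigmoid choice of $\sigma$ below are assumptions for illustration:

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, one common choice for the nonlinearity sigma."""
    return 1.0 / (1.0 + np.exp(-x))

state_dim, obs_dim = 4, 3                      # arbitrary sizes for the sketch
rng = np.random.default_rng(0)
W_s = rng.normal(size=(state_dim, state_dim))  # recurrent weights
W_o = rng.normal(size=(obs_dim, state_dim))    # observation weights

def update_state(s_prev, o_t):
    """S_t^a = sigma(S_{t-1}^a W_s + O_t W_o)"""
    return sigmoid(s_prev @ W_s + o_t @ W_o)

s = np.zeros(state_dim)                        # initial agent state
for o in rng.normal(size=(5, obs_dim)):        # a short stream of observations
    s = update_state(s, o)                     # fold each observation into the state
```

The point of the recurrence is that the agent keeps a fixed-size state instead of the ever-growing history $H_t$.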
Components of an RL Agent

An RL agent may include one or more of these components:

- Policy: the agent's behaviour function; it takes the state as input and outputs an action
- Value function: evaluates how good each state and/or action is
- Model: the agent's representation of the environment, used to predict how the environment changes
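The three components can be sketched as plain lookup tables for a toy problem (all names, states, and values below are made up for illustration, not any particular library's API):

```python
# Policy: maps a state to an action (here a fixed lookup table).
policy = {"s0": "right", "s1": "right"}

# Value function: how good each state is (assumed values, for illustration).
value = {"s0": 0.5, "s1": 1.0}

# Model: the agent's estimate of the environment's dynamics and rewards,
# mapping (state, action) to a predicted (next_state, reward).
model = {
    ("s0", "right"): ("s1", 1.0),
    ("s1", "right"): ("s1", 2.0),
}

def act(state):
    """Pick the next action using the policy."""
    return policy[state]

next_state, predicted_reward = model[("s0", act("s0"))]
```

In practice each table would be replaced by a learned function (e.g. a neural network), but the division of labour is the same.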