State:
- State is the information used to determine what happens next
- Formally, state is a function of the history:

  $S_t = f(H_t)$
There are three definitions of state:

1. Environment state (the state of the environment as a whole)
   - The environment state $S_t^e$ is the environment's private representation
   - i.e. whatever data the environment uses to pick the next observation/reward
   - The environment state is not usually visible to the agent
   - Even if $S_t^e$ is visible, it may contain irrelevant information
2. Agent state (the agent's own internal state)
   - The agent state $S_t^a$ is the agent's internal representation
   - i.e. whatever information the agent uses to pick the next action
   - i.e. it is the information used by reinforcement learning algorithms
   - It can be any function of history:

     $S_t^a = f(H_t)$
3. Information state (a.k.a. Markov state): grounded in information theory, it can be regarded as the foundation of reinforcement learning.
   - An information state contains all useful information from the history.
Definition (Markov state)

A state $S_t$ is Markov if and only if

$P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, \ldots, S_t]$
By definition, the next state depends only on the current state; in other words, a Markov state captures all the useful information in the history, and earlier states add nothing further.
“The future is independent of the past given the present”
$H_{1:t} \rightarrow S_t \rightarrow H_{t+1:\infty}$
The current state, not the full history, determines the optimal decision.
Take a helicopter as an example: its velocity, wind direction, and position ten minutes ago are irrelevant to the next action; only the current velocity, wind direction, and position matter.
Once the state is known, the history may be thrown away
i.e. The state is a sufficient statistic of the future
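This "sufficient statistic" idea can be checked empirically with a toy Markov chain (the two-state transition matrix below is made up for illustration): because the transition probabilities depend only on the current state, the distribution of $S_{t+1}$ given $S_t$ is the same no matter how the chain arrived at $S_t$.

```python
import numpy as np

# Hypothetical two-state Markov chain illustrating
# P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t].
P = np.array([[0.9, 0.1],   # transition probabilities from state 0
              [0.3, 0.7]])  # transition probabilities from state 1

rng = np.random.default_rng(0)

def step(s):
    """Sample the next state given only the current state s."""
    return rng.choice(2, p=P[s])

# The empirical distribution of S_{t+1} given S_t = 0 matches row 0 of P,
# regardless of any earlier history.
samples = [step(0) for _ in range(10_000)]
print(np.mean(samples))  # close to 0.1 (the probability of moving to state 1)
```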
- The environment state $S_t^e$ is Markov
- The history $H_t$ is Markov
Fully Observable Environments

Full observability: the agent directly observes the environment state, so the observed environment state equals the agent's state:

$O_t = S_t^a = S_t^e$

- Agent state = environment state = information state
- Formally, this is a Markov decision process (MDP)
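An MDP can be sketched as a table of transition distributions (the states, actions, probabilities, and rewards below are invented for illustration):

```python
import random

# Hypothetical MDP: transitions[state][action] -> list of (probability, next_state, reward).
transitions = {
    "s0": {"left":  [(1.0, "s0", 0.0)],
           "right": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"left":  [(1.0, "s0", 0.0)],
           "right": [(1.0, "s1", 2.0)]},
}

def sample_step(state, action):
    """Sample (next_state, reward) from the transition distribution."""
    r = random.random()
    cumulative = 0.0
    for p, nxt, reward in transitions[state][action]:
        cumulative += p
        if r < cumulative:
            return nxt, reward
    return nxt, reward  # guard against floating-point round-off

next_state, reward = sample_step("s0", "right")
```

The key Markov feature is that `transitions` is indexed only by the current state and action, never by the history.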
Partially Observable Environments

Partial observability: the agent indirectly observes the environment; it cannot observe the environment state directly.

Formally this is a partially observable Markov decision process (POMDP).

Here the observation, the agent state, and the environment state are no longer equal, so the representation of the (Markov) state becomes crucial: the agent must choose a suitable representation. There are generally three main approaches:
The agent must construct its own state representation $S_t^a$, e.g.:
1. Complete history: $S_t^a = H_t$
2. Beliefs of environment state: a Bayesian approach. Since the agent cannot be certain which state the environment is in, it maintains a probability distribution over the possible environment states:

   $S_t^a = (P[S_t^e = s^1], \ldots, P[S_t^e = s^n])$

3. Recurrent neural network: combine the agent's previous state with the latest observation through a linear combination followed by a nonlinearity:

   $S_t^a = \sigma(S_{t-1}^a W_s + O_t W_o)$
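The recurrent update $S_t^a = \sigma(S_{t-1}^a W_s + O_t W_o)$ can be sketched with NumPy; the dimensions, random weights, and logistic-sigmoid choice of $\sigma$ below are assumptions for illustration:

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, one common choice for the nonlinearity sigma."""
    return 1.0 / (1.0 + np.exp(-x))

state_dim, obs_dim = 4, 3                      # arbitrary sizes for the sketch
rng = np.random.default_rng(0)
W_s = rng.normal(size=(state_dim, state_dim))  # recurrent weights
W_o = rng.normal(size=(obs_dim, state_dim))    # observation weights

def update_state(s_prev, o_t):
    """S_t^a = sigma(S_{t-1}^a W_s + O_t W_o)"""
    return sigmoid(s_prev @ W_s + o_t @ W_o)

s = np.zeros(state_dim)                        # initial agent state
for o in rng.normal(size=(5, obs_dim)):        # a short stream of observations
    s = update_state(s, o)                     # fold each observation into the state
```

The point of the recurrence is that the agent keeps a fixed-size state instead of the ever-growing history $H_t$.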
Components of an RL Agent

An RL agent may include one or more of these components:

- Policy: the agent's behaviour function; it takes the state as input and outputs an action
- Value function: evaluates how good each state and/or action is
- Model: the agent's representation of the environment, used to predict how the environment changes
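The three components can be sketched as plain lookup tables for a toy problem (all names, states, and values below are made up for illustration, not any particular library's API):

```python
# Policy: maps a state to an action (here a fixed lookup table).
policy = {"s0": "right", "s1": "right"}

# Value function: how good each state is (assumed values, for illustration).
value = {"s0": 0.5, "s1": 1.0}

# Model: the agent's estimate of the environment's dynamics and rewards,
# mapping (state, action) to a predicted (next_state, reward).
model = {
    ("s0", "right"): ("s1", 1.0),
    ("s1", "right"): ("s1", 2.0),
}

def act(state):
    """Pick the next action using the policy."""
    return policy[state]

next_state, predicted_reward = model[("s0", act("s0"))]
```

In practice each table would be replaced by a learned function (e.g. a neural network), but the division of labour is the same.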