Brief Introduction to Reinforcement Learning I (Background Info)
Markov Chain
Markov Decision Process
$M=(S,A,P,R)$
- States: $s_i\in S$
- Actions: $a_i\in A$
- Transition probability distribution: $p(s'|s,a)\in P_{sa}$
- Reward: $r(s'|s,a)$
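Before moving on, here is a minimal sketch of how such a tabular MDP can be represented in Python. The array layout (`P[s, a, s']`, `R[s, a, s']`) and all the numbers are illustrative assumptions, not from the text:

```python
import numpy as np

# Hypothetical tabular MDP with 2 states and 2 actions.
# P[s, a, s'] = p(s'|s,a); each row P[s, a] sums to 1.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],  # transitions from state 0 under actions 0 and 1
    [[0.5, 0.5], [0.0, 1.0]],  # transitions from state 1 under actions 0 and 1
])

# R[s, a, s'] = r(s'|s,a), the reward for arriving in s' after taking a in s.
R = np.array([
    [[0.0, 1.0], [0.0, 2.0]],
    [[1.0, 0.0], [0.0, 0.5]],
])

gamma = 0.9  # discount factor used by the value functions below
```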
Value function: Bellman Equation
RL learns a policy $\pi:S\rightarrow A$. The reward function $R$ reflects only the immediate reward; for the long-term reward, we introduce the value function $V^{\pi}(s)$.
- State value function
$$V^{\pi}(s)=\sum_{s'\in S}p(s'|s,\pi(s))\left[r(s'|s,\pi(s))+\gamma V^\pi(s')\right]$$
- Action value function
$$Q(s,a)=\sum_{s'\in S}p(s'|s,a)\left[r(s'|s,a)+\gamma V^\pi(s')\right]$$
- Connection
We now have two different value functions, one over states and one over actions. $V$ can be seen as a specialization of $Q$ in which the action in each state is prescribed by the policy, so $V$ directly gives the return of following a sequence of states and actions:
$$V^{\pi}(s)=Q(s,\pi(s))$$
- Difference
$Q$ is defined on state-action pairs, while $V$ is defined on states.
- MDP optimal policy
$$\pi^*=\arg\max_\pi V^\pi(s),\quad \forall s\in S$$
Basic Solutions
Dynamic Programming
Policy Iteration
- Policy Evaluation
For a given policy $\pi$, the Policy Evaluation algorithm computes the state values $v(s)$.

ALGORITHM: Policy_Evaluation
Input: $\pi(a|s)$, the (possibly stochastic) policy to be evaluated
Initialize $v(s)=0$ for all $s\in S$
Repeat
    $\Delta\leftarrow0$
    For each $s\in S$:
        $tmp\leftarrow\sum_{a}\pi(a|s)\sum_{s'\in S}p(s'|s,a)\left[r(s'|s,a)+\gamma v(s')\right]$
        $\Delta\leftarrow\max(\Delta,|tmp-v(s)|)$
        $v(s)\leftarrow tmp$
Until $\Delta<\theta$ (a small positive threshold)
Output: $v\approx v^\pi$, the approximate state values
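Below is a minimal Python sketch of this loop, assuming the tabular arrays `P[s, a, s']` and `R[s, a, s']` from the earlier example and a stochastic policy stored as `pi[s, a]` (my own conventions, not from the text):

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation on a tabular MDP.

    P[s, a, s'] = p(s'|s,a), R[s, a, s'] = r(s'|s,a), pi[s, a] = pi(a|s).
    Returns v with v[s] approximating V^pi(s).
    """
    v = np.zeros(P.shape[0])
    while True:
        delta = 0.0
        for s in range(P.shape[0]):
            # tmp = sum_a pi(a|s) sum_{s'} p(s'|s,a) [r(s'|s,a) + gamma v(s')]
            tmp = np.sum(pi[s][:, None] * P[s] * (R[s] + gamma * v[None, :]))
            delta = max(delta, abs(tmp - v[s]))
            v[s] = tmp
        if delta < theta:
            return v
```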
- Policy Improvement
For a given policy $\pi$ and state values $v(s)$, the Policy Improvement algorithm produces a better policy while leaving $v(s)$ untouched.

ALGORITHM: Policy_Improvement
Input: $\pi(s)$, $v(s)$
policy_stable $\leftarrow$ true
For each $s\in S$:
    $tmp\leftarrow\arg\max_a\sum_{s'\in S}p(s'|s,a)\left[r(s'|s,a)+\gamma v(s')\right]$
    If $tmp\not=\pi(s)$ Then policy_stable $\leftarrow$ false
    $\pi(s)\leftarrow tmp$
Output: $\pi$, the improved policy, and policy_stable
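A matching sketch of this greedy sweep, under the same assumed array layout; here the policy is a deterministic array of action indices:

```python
import numpy as np

def policy_improvement(P, R, pi, v, gamma=0.9):
    """Make pi greedy with respect to v; pi[s] is an action index.

    Returns the updated policy and whether it was already stable.
    """
    policy_stable = True
    for s in range(P.shape[0]):
        # q[a] = sum_{s'} p(s'|s,a) [r(s'|s,a) + gamma v(s')]
        q = np.sum(P[s] * (R[s] + gamma * v[None, :]), axis=1)
        best = int(np.argmax(q))
        if best != pi[s]:
            policy_stable = False
        pi[s] = best
    return pi, policy_stable
```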
- Policy Iteration
Combining Policy Evaluation and Policy Improvement, we obtain the Policy Iteration algorithm. The process is as follows:
$$\pi_0\xrightarrow{E} v_0\xrightarrow{I} \pi_1\xrightarrow{E} v_1\xrightarrow{I} \pi_2\rightarrow\cdots\xrightarrow{E} v^*\xrightarrow{I} \pi^*$$

ALGORITHM: Policy_Iteration
Initialize $v(s)\in\mathbb{R}$ and $\pi(s)\in A(s)$ arbitrarily for all $s\in S$
Repeat
    $v\leftarrow$ Policy_Evaluation$(\pi)$
    $\pi'$, policy_stable $\leftarrow$ Policy_Improvement$(\pi,v)$
    $\pi\leftarrow\pi'$
Until policy_stable = true
Output: $\pi\approx\pi^*$, $v\approx v^*$
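Composing the two sketches above gives a compact, illustrative policy iteration loop; the one-hot conversion is only needed because the evaluation sketch expects a stochastic `pi[s, a]`:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Alternate evaluation and improvement until the policy stops changing."""
    n_states, n_actions = P.shape[0], P.shape[1]
    pi = np.zeros(n_states, dtype=int)  # arbitrary initial deterministic policy
    while True:
        pi_stochastic = np.eye(n_actions)[pi]  # one-hot pi(a|s)
        v = policy_evaluation(P, R, pi_stochastic, gamma)
        pi, policy_stable = policy_improvement(P, R, pi, v, gamma)
        if policy_stable:
            return pi, v
```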
Value Iteration
Compared with the Policy Iteration algorithm, Value Iteration keeps the policy implicit and folds the improvement step into the value update, so each iteration needs only a single sweep over all states $s$.
ALGORITHM: Value_Iteration
Initialize $v(s)\in\mathbb{R}$ arbitrarily for all $s\in S$
Repeat
    $\Delta\leftarrow0$
    For each $s\in S$:
        $tmp\leftarrow\max_a\sum_{s'\in S}p(s'|s,a)\left[r(s'|s,a)+\gamma v(s')\right]$
        $\Delta\leftarrow\max(\Delta,|tmp-v(s)|)$
        $v(s)\leftarrow tmp$
Until $\Delta<\theta$ (a small positive threshold)
For each $s\in S$:
    $\pi(s)\leftarrow\arg\max_a\sum_{s'\in S}p(s'|s,a)\left[r(s'|s,a)+\gamma v(s')\right]$
Output: $\pi$
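A sketch of the same backup with the `max` folded in, again under the assumed array layout from earlier:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Sweep max-over-actions backups until convergence, then extract pi."""
    v = np.zeros(P.shape[0])
    while True:
        delta = 0.0
        for s in range(P.shape[0]):
            q = np.sum(P[s] * (R[s] + gamma * v[None, :]), axis=1)
            delta = max(delta, abs(q.max() - v[s]))
            v[s] = q.max()
        if delta < theta:
            break
    # One final sweep to read off the greedy policy from the values.
    pi = np.array([
        int(np.argmax(np.sum(P[s] * (R[s] + gamma * v[None, :]), axis=1)))
        for s in range(P.shape[0])
    ])
    return pi, v
```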
Pros and Cons
- pros
  - interpretable
  - based on mathematical derivation
- cons
  - requires complete information about the environment
Monte Carlo
The MC method is a sample-based, stochastic counterpart of the DP method. It is defined only for episodic tasks (tasks that terminate within finitely many steps). There are first-visit MC methods (which count the number of episodes in which $s$ appears) and every-visit MC methods (which count every occurrence of $s$). In this section, we discuss first-visit MC methods only.
Similar to the DP method, the MC method has its own versions of the Policy Evaluation, Policy Improvement, and Policy Iteration processes as well.
Monte Carlo Policy Evaluation
- Input: the policy to be evaluated
- Step 1: generate some state sequences (each sequence is an episode)
- Step 2: for each state $s$, calculate the average return over all episodes in which $s$ appears
- Step 3: set these averages as the state values (see the sketch below)
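Here is a minimal first-visit MC evaluation sketch. It assumes a `generate_episode()` helper that rolls out the policy and returns a list of `(state, reward)` pairs, where the reward is the one received after leaving that state; both the helper and the episode format are my own conventions:

```python
from collections import defaultdict

def mc_policy_evaluation(generate_episode, num_episodes=1000, gamma=0.9):
    """First-visit MC: average the return after the first visit to each state."""
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for _ in range(num_episodes):
        episode = generate_episode()  # [(s_0, r_1), (s_1, r_2), ...]
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        g = 0.0
        # Walk backwards so g accumulates the discounted return from each t.
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            g = r + gamma * g
            if first_visit[s] == t:  # only count the first visit to s
                returns_sum[s] += g
                returns_cnt[s] += 1
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}
```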
Monte Carlo Estimation of Action Values
To improve the policy, we first need action values (Q-values). We can follow steps similar to Monte Carlo Policy Evaluation: generate sequences, calculate the average returns, and set them as Q-values. After that, we can improve the policy as follows: $\pi'(s)=\arg\max_a Q^\pi(s,a)$
Maintaining Exploration
There is a problem with the MC method. If we already have pre-set Q-values $Q(s,a_1)$ and $Q(s,a_2)$ with $Q(s,a_1)>Q(s,a_2)$, then $Q(s,a_2)$ will never be updated, because the greedy MC method will never choose $a_2$. This is similar to a multi-armed bandit problem. Maintaining Exploration replaces deterministic policies with soft policies, for example the $\epsilon$-greedy policy: execute the best action with probability $1-\epsilon$, otherwise execute one of the other actions. Decrease $\epsilon$ over time and the algorithm will converge.
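The $\epsilon$-greedy rule itself is tiny; a sketch:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon):
    """Pick the greedy action with probability 1 - epsilon, else a random one."""
    if np.random.rand() < epsilon:
        return int(np.random.randint(len(q_values)))
    return int(np.argmax(q_values))
```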
Monte Carlo Control
The process of Monte Carlo Control is as follows:
$$\pi_0\xrightarrow{E} q_0\xrightarrow{I} \pi_1\xrightarrow{E} q_1\xrightarrow{I} \pi_2\rightarrow\cdots\xrightarrow{E} q^*\xrightarrow{I} \pi^*$$
We can also keep the policy implicit and update the action values directly, which gives a value-iteration version of Monte Carlo Control; at the end of this algorithm, we generate the policy from the Q-values. A sketch follows.
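A compact sketch of this control loop, reusing the `epsilon_greedy` helper above. The `env_reset()`/`env_step(s, a)` helpers (returning a start state and `(next_state, reward, done)`) are assumed for illustration, and $\epsilon$ is decayed as $1/k$ per the previous section:

```python
from collections import defaultdict
import numpy as np

def mc_control(env_reset, env_step, n_actions, num_episodes=5000, gamma=0.9):
    """First-visit MC control with an epsilon-greedy behavior policy."""
    Q = defaultdict(lambda: np.zeros(n_actions))
    counts = defaultdict(lambda: np.zeros(n_actions))
    for k in range(1, num_episodes + 1):
        eps = 1.0 / k  # decrease epsilon over time
        s, episode, done = env_reset(), [], False
        while not done:
            a = epsilon_greedy(Q[s], eps)
            s_next, r, done = env_step(s, a)
            episode.append((s, a, r))
            s = s_next
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)
        g = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            g = r + gamma * g
            if first_visit[(s, a)] == t:  # first-visit update of Q(s, a)
                counts[s][a] += 1
                Q[s][a] += (g - Q[s][a]) / counts[s][a]
    return Q  # derive the final policy via argmax over Q[s]
```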
Pros and Cons
- pros
  - based on experience samples rather than a complete model of the environment
- cons
  - works on episodic tasks only
Temporal-Difference
TD Prediction
Consider the Bellman equation for the state value:
$$V_\pi(s_t)=E_\pi\left[R(s_{t+1})+\gamma V_\pi(s_{t+1})\mid s_{t+1}=\pi(s_t)\right]$$
When the policy $\pi$ is fixed, we have
$$V(s_t)=R(s_{t+1})+\gamma V(s_{t+1})$$
Then we define the TD error:
$$td\_error=|R(s_{t+1})+\gamma V_\pi(s_{t+1})-V_\pi(s_t)|$$
To optimize the model, all we need is to modify the policy $\pi$ to minimize the TD error:
$$\pi^*=\arg\min_\pi|R(s_{t+1})+\gamma V_\pi(s_{t+1})-V_\pi(s_t)|$$
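In practice the TD error is shrunk incrementally rather than minimized in one shot. Below is a sketch of TD(0) prediction; the learning rate `alpha` is not introduced in the text, and `env_reset`/`env_step`/`policy` are assumed helpers:

```python
from collections import defaultdict

def td0_prediction(env_reset, env_step, policy, alpha=0.1, gamma=0.9,
                   num_episodes=1000):
    """Nudge V(s_t) toward the bootstrapped target R + gamma * V(s_{t+1})."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        s, done = env_reset(), False
        while not done:
            s_next, r, done = env_step(s, policy(s))
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])  # one step down the TD error
            s = s_next
    return dict(V)
```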
N-step TD
Consider the definition of the state value:
$$V(s_t)=R(s_{t+1})+\gamma R(s_{t+2})+\cdots+\gamma^{n-1} R(s_{t+n})+\gamma^{n} V(s_{t+n})$$
As in the TD algorithm, we can rewrite the state value this way and obtain a new TD error for any step count $n$. If we set $n\rightarrow\infty$, it degenerates into the MC algorithm, so to achieve good performance, $n$ needs to be tuned. To reduce the effect of the step size on the results, we can multiply $V(s)$ by $1-\gamma$; the expected value then stays in the same order of magnitude for different values of the hyper-parameter $\gamma$. A sketch of the $n$-step return follows.
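A small sketch of the $n$-step return defined above; `rewards` holds $R_{t+1},\dots,R_{t+n}$ and `v_boot` is the bootstrapped $V(s_{t+n})$ (argument names are my own):

```python
def n_step_return(rewards, v_boot, gamma=0.9):
    """R_{t+1} + gamma R_{t+2} + ... + gamma^{n-1} R_{t+n} + gamma^n V(s_{t+n})."""
    g = 0.0
    for r in reversed(rewards):  # accumulate the discounted reward sum
        g = r + gamma * g
    return g + gamma ** len(rewards) * v_boot

# Example: three rewards ahead, bootstrapping from V(s_{t+3}) = 0.5.
print(n_step_return([1.0, 0.0, 2.0], v_boot=0.5))  # 2.9845
```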
Pros and Cons
- pros
  - more flexible than MC
  - available in both on-policy (SARSA) and off-policy (Q-Learning) settings
  - TD methods usually perform much better than the alternatives, so most state-of-the-art algorithms are based on TD