[Chapter 5] Reinforcement Learning (3): Function Approximation

This article discusses how to represent and learn Q-values in continuous and infinite state spaces using function approximation. It introduces a linear function approximation example and explains how SARSA and Q-learning update the parameters. With the development of deep learning, applying deep networks to reinforcement learning has produced deep reinforcement learning (DRL), e.g. the DQN algorithm, which can handle large-scale and complex environments. Deep networks, with their strong representational power, are well suited to high-dimensional states and complex observations, which makes them the key component of DRL.

Function Approximation

While we are learning Q-functions, how do we represent or record the Q-values? For a discrete and finite state space and action space, we can use a big table of size $|S| \times |A|$ to store the Q-values for all $(s, a)$ pairs. However, if the state space or action space is very large, or, as is usually the case, continuous and infinite, a tabular method no longer works.
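For a small, discrete problem the table is literally just a 2-D array indexed by state and action. A minimal sketch, assuming a toy problem whose sizes are purely illustrative:

```python
import numpy as np

# Tabular Q-values: one entry per (s, a) pair.
# The sizes below are illustrative, not from any particular environment.
num_states, num_actions = 12, 4
Q = np.zeros((num_states, num_actions))

# Reading and writing a Q-value is a direct table lookup.
s, a = 3, 2
Q[s, a] = 0.5
print(Q[s, a])
```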

We need function approximation to represent the utility and Q-functions with some parameters $\theta$ to be learnt. Taking the grid environment as our example again, we can represent a state by a pair of coordinates $(x, y)$; then one simple function approximation can look like this:

$$\hat{U}_{\theta}(x, y) = \theta_0 + \theta_1 x + \theta_2 y$$

Of course, you can design more complex functions when you have a much larger state space.
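As a quick illustration, the linear approximator above is only a few lines of code; the parameter values here are made up:

```python
import numpy as np

# A minimal sketch of the linear approximator U_hat_theta(x, y),
# assuming the state is a pair of grid coordinates (x, y).
def u_hat(theta, x, y):
    """U_hat_theta(x, y) = theta_0 + theta_1 * x + theta_2 * y."""
    return theta[0] + theta[1] * x + theta[2] * y

theta = np.array([0.0, 0.1, -0.2])  # illustrative parameter values
print(u_hat(theta, 2, 3))           # estimated utility of state (2, 3)
```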

In this case, our reinforcement learning agent instead learns the parameters $\theta$ that approximate the evaluation functions ($\hat{U}_{\theta}$ or $\hat{Q}_{\theta}$).

For Monte Carlo learning, we can collect a set of training samples (trials) with inputs and labels, which turns this into a supervised learning problem. With a squared-error loss and a linear function, we get a standard linear regression problem.
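As a minimal sketch of this view, assuming a handful of made-up (state, return) samples from Monte Carlo rollouts, fitting $\theta$ by least squares is exactly ordinary linear regression:

```python
import numpy as np

# Monte Carlo samples: (x, y) states and their observed returns G_t.
# The data below is purely illustrative.
states  = np.array([[1, 1], [2, 3], [4, 1], [3, 3]])
returns = np.array([0.2, 0.5, 0.4, 0.7])

# Design matrix with a bias column: U_hat = theta_0 + theta_1 x + theta_2 y
X = np.hstack([np.ones((len(states), 1)), states])
theta, *_ = np.linalg.lstsq(X, returns, rcond=None)  # least-squares fit
print(theta)  # learned (theta_0, theta_1, theta_2)
```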

For Temporal Difference learning, the agent aims to adjust the parameters to reduce the temporal difference (i.e., to reduce the TD error). The parameters are updated with gradient descent (a code sketch follows the two update rules below):

  • For SARSA (on-policy method):

$$\theta_i \leftarrow \theta_i + \alpha \left( R(s) + \gamma \hat{Q}_{\theta}(s', a') - \hat{Q}_{\theta}(s, a) \right) \frac{\partial \hat{Q}_{\theta}(s, a)}{\partial \theta_i}$$

  • For Q-learning (off-policy method):

$$\theta_i \leftarrow \theta_i + \alpha \left( R(s) + \gamma \max_{a'} \hat{Q}_{\theta}(s', a') - \hat{Q}_{\theta}(s, a) \right) \frac{\partial \hat{Q}_{\theta}(s, a)}{\partial \theta_i}$$
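Here is a minimal sketch of both update rules for a linear Q-function $\hat{Q}_{\theta}(s,a) = \theta^{\top} \phi(s,a)$, whose gradient with respect to $\theta$ is simply $\phi(s,a)$; the feature map and all constants below are illustrative assumptions:

```python
import numpy as np

# Illustrative feature map: one (x, y, 1) block per action,
# with only the chosen action's block active.
def phi(s, a, num_actions=4):
    x, y = s
    feats = np.zeros(3 * num_actions)
    feats[3 * a: 3 * a + 3] = [x, y, 1.0]
    return feats

def q_hat(theta, s, a):
    return theta @ phi(s, a)

def sarsa_update(theta, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy target: uses the action a_next actually taken in s_next.
    td_error = r + gamma * q_hat(theta, s_next, a_next) - q_hat(theta, s, a)
    return theta + alpha * td_error * phi(s, a)

def q_learning_update(theta, s, a, r, s_next, alpha=0.1, gamma=0.99, num_actions=4):
    # Off-policy target: uses the greedy (max) action in s_next.
    best_next = max(q_hat(theta, s_next, a2) for a2 in range(num_actions))
    td_error = r + gamma * best_next - q_hat(theta, s, a)
    return theta + alpha * td_error * phi(s, a)

theta = np.zeros(12)  # 3 features per action x 4 actions (illustrative sizes)
theta = sarsa_update(theta, (1, 1), 0, -0.04, (1, 2), 1)
```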

Going Deep

One of the greatest advancements in reinforcement learning is combining it with deep learning. As stated above, we usually cannot use a tabular method to represent the evaluation functions; we need approximation! I know what you want to say: you must have thought that a deep network is a good function approximator. The network takes the state as input and outputs the Q-values or utilities, that's it! Using deep networks in RL is called deep reinforcement learning (DRL).

Why do we need deep networks?

  • Firstly, for environments with a nearly infinite state space, a deep network can hold a large set of parameters $\theta$ to be learnt and can map a large set of states to their expected Q-values.
  • Secondly, some environments have complex observations that are hard to handle without deep networks. For example, if the observation is an RGB image, we need convolutional neural network (CNN) layers at the front to read it; if the observation is a piece of audio, we need recurrent neural network (RNN) layers at the front.
  • Nowadays, designing and training a deep neural network has become much easier thanks to advances in hardware and software.

One of the best-known DRL algorithms is the Deep Q-Network (DQN); we show its outline here but will not go into the details:
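Below is a minimal sketch of the standard DQN training loop (an online Q-network, experience replay, and a periodically synced target network), assuming PyTorch and a toy stand-in environment; the network sizes and hyperparameters are illustrative:

```python
import random
from collections import deque
import torch
import torch.nn as nn

# Illustrative sizes and hyperparameters, not tuned for any real task.
STATE_DIM, NUM_ACTIONS, GAMMA, EPSILON = 4, 2, 0.99, 0.1

def make_net():
    return nn.Sequential(nn.Linear(STATE_DIM, 32), nn.ReLU(), nn.Linear(32, NUM_ACTIONS))

q_net, target_net = make_net(), make_net()
target_net.load_state_dict(q_net.state_dict())   # target starts as a copy
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)                    # experience replay buffer

def toy_step(state, action):
    """Stand-in for a real environment: random next state, reward, done flag."""
    return torch.randn(STATE_DIM), random.random(), random.random() < 0.05

state = torch.randn(STATE_DIM)
for step in range(500):
    # Epsilon-greedy action selection from the online network.
    if random.random() < EPSILON:
        action = random.randrange(NUM_ACTIONS)
    else:
        action = int(q_net(state).argmax())
    next_state, reward, done = toy_step(state, action)
    replay.append((state, action, reward, next_state, done))
    state = torch.randn(STATE_DIM) if done else next_state

    if len(replay) >= 64:
        batch = random.sample(replay, 64)
        s, a, r, s2, d = map(list, zip(*batch))
        s, s2 = torch.stack(s), torch.stack(s2)
        a = torch.tensor(a)
        r = torch.tensor(r, dtype=torch.float32)
        d = torch.tensor(d, dtype=torch.float32)
        # TD target uses the frozen target network (off-policy max).
        with torch.no_grad():
            target = r + GAMMA * (1 - d) * target_net(s2).max(dim=1).values
        pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        loss = nn.functional.mse_loss(pred, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    if step % 100 == 0:
        target_net.load_state_dict(q_net.state_dict())  # periodic target sync
```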
