DQN

In this article I will do my best to discuss the DQN algorithm.

Basic Introduction

We have witnessed the power of deep learning at solving high-dimensional, computation-heavy problems and the strength of reinforcement learning at decision-making. Combining the two is a natural step, and it gave rise to deep reinforcement learning. DQN, which was first put forward by DeepMind in the paper "Playing Atari with Deep Reinforcement Learning", is a typical deep reinforcement learning algorithm.

Meanwhile, the basic assumption behind reinforcement learning is that the agent can gain a deeper understanding of the environment by interacting with it, and that its ability to reach the goal improves over repeated interactions.

MDPs

A Markov Decision Process satisfies the Markov property: the next state is determined only by the current state and the current action, not by the history.


An MDP consists of 5 components:

S: the finite set of states
A: the finite set of actions
P: the state transition matrix
R: the reward function
gamma: the discount factor
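
As a small illustrative sketch (the two-state, two-action MDP below and all its numbers are made up, and NumPy is assumed), the five components can be written down like this:

    import numpy as np

    # A toy MDP with 2 states and 2 actions (all numbers are invented).
    states = [0, 1]                       # S: the finite set of states
    actions = [0, 1]                      # A: the finite set of actions
    # P[s, a, s']: probability of landing in s' after taking action a in state s
    P = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.1, 0.9]]])
    # R[s, a]: expected immediate reward for taking action a in state s
    R = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
    gamma = 0.9                           # the discount factor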

Value Function

We introduce the value function to measure the long-term reward and to evaluate a policy, so that better choices can be made.

The value function can be rewritten in the form of the Bellman equation, and the standard Bellman equation can be solved by iteration.

Overall, there are three different ways to solve the value function:

  • Dynamic programming
  • Monte-Carlo method
  • Temporal-difference method

Action-Value Function

We want to analyse the value of different actions in the current state, which leads to the action-value function.

The action-value function can be written in the following form:

   Q_pi(s,a) = E_pi[ r + gamma * Q_pi(s',a') | s, a ]

Optimal Value Function

The optimal value function can be written in this form:

   V*(s) = max_a Q*(s,a)

Then, substituting it into the action-value function, we get:

   Q*(s,a) = E[ r + gamma * max_a' Q*(s',a') | s, a ]

Iteration Based on the Bellman Equation

Widely used iteration methods for solving the Bellman equation can be categorized into policy iteration and value iteration.

Policy Iteration

  1. Policy Evaluation: estimate V_pi for the current policy
  2. Policy Improvement: generate an improved policy pi' >= pi

It can be proved theoretically that pi converges to the optimal policy.
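
A minimal sketch of policy iteration on the toy MDP above (it assumes the P, R and gamma arrays from the earlier snippet; the tolerance and the deterministic-policy representation are my own choices):

    import numpy as np

    def policy_evaluation(policy, P, R, gamma, tol=1e-6):
        """Iteratively estimate V_pi for a deterministic policy (one action per state)."""
        V = np.zeros(P.shape[0])
        while True:
            delta = 0.0
            for s in range(P.shape[0]):
                a = policy[s]
                v_new = R[s, a] + gamma * P[s, a] @ V   # Bellman expectation backup
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < tol:
                return V

    def policy_improvement(V, P, R, gamma):
        """Return the policy that is greedy with respect to V."""
        q = R + gamma * np.einsum('sat,t->sa', P, V)    # Q(s,a) for every pair
        return q.argmax(axis=1)

    def policy_iteration(P, R, gamma):
        policy = np.zeros(P.shape[0], dtype=int)
        while True:
            V = policy_evaluation(policy, P, R, gamma)        # 1. policy evaluation
            new_policy = policy_improvement(V, P, R, gamma)   # 2. policy improvement
            if np.array_equal(new_policy, policy):
                return policy, V                              # pi has converged
            policy = new_policy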

Value Iteration

Value iteration is based on the Bellman optimality equation, converted into an iterative update.
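
A matching sketch of value iteration under the same assumptions (P, R and gamma as in the toy MDP above):

    import numpy as np

    def value_iteration(P, R, gamma, tol=1e-6):
        """Apply the Bellman optimality backup until the values stop changing."""
        V = np.zeros(P.shape[0])
        while True:
            # Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) * V(s')
            q = R + gamma * np.einsum('sat,t->sa', P, V)
            V_new = q.max(axis=1)                  # take the best action in every state
            if np.max(np.abs(V_new - V)) < tol:
                return V_new, q.argmax(axis=1)     # optimal values and the greedy policy
            V = V_new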

Difference between Policy Iteration and Value Iteration

Policy iteration updates the value with the plain Bellman expectation equation; the resulting value (V_pi) is the value under the current policy, i.e. an evaluation of that particular policy.

Value iteration updates the value with the Bellman optimality equation; the resulting value is the optimal value for each state.

Value iteration is the more direct way to obtain the optimal value.

Q-Learning

The basic idea of Q-Learning is based on value iteration: in every iteration we would update the Q value of every state-action pair. But since we only have a limited number of samples in practice, Q-Learning proposes a different way to update the Q value:

   Q(s,a) <- Q(s,a) + alpha * ( r + gamma * max_a' Q(s',a') - Q(s,a) )

Like gradient descent, this update gradually diminishes the estimation error, and the Q values converge to the optimal Q values.
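
A sketch of a single tabular Q-Learning step, assuming the Q values are stored in a NumPy array indexed by state and action (the learning rate alpha is my own choice):

    import numpy as np

    def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
        """Move Q(s,a) a small step toward the target r + gamma * max_a' Q(s',a')."""
        td_target = r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (td_target - Q[s, a])    # shrink the error step by step
        return Q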

Exploration and Exploitation

  • Off-policy: the actions used for learning are generated by a separate behaviour policy, which is not the (optimal) policy being learned.
  • Model-free: the model (the detailed dynamics of the environment) is not used; only the observed states and rewards matter.

Generating Actions from the Policy

  • Exploration: generate an action randomly. Beneficial for updating the Q value and obtaining a better policy.
  • Exploitation: choose the best action according to the current Q value. Good for checking whether the algorithm works, but contributes little to updating the Q value.

Epsilon-Greedy Policy

We can combine exploration and exploitation by setting a fixed threshold epsilon; this method is called the epsilon-greedy policy.
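
A minimal sketch of the epsilon-greedy policy over a tabular Q array (the value of epsilon is arbitrary):

    import numpy as np

    def epsilon_greedy(Q, s, epsilon=0.1):
        """With probability epsilon explore randomly, otherwise exploit the current Q."""
        if np.random.rand() < epsilon:
            return np.random.randint(Q.shape[1])    # exploration: random action
        return int(Q[s].argmax())                   # exploitation: greedy action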

DQN

Dimension Disaster

We store Q(s,a) in a table that covers all states and actions. But when we deal with image inputs, the number of states explodes and the table can no longer be computed or stored. So we need to rethink how to represent the value function.

Firstly, we introduce Value Function Approximation.

Value Function Approximation

In order to reduce the dimension, we approximate the value function with another function. For example, we may use a linear function to approximate the value function, like this:

   Q(s,a) = f(s,a) = w1s + w2a + b

Thus we get Q(s,a) approximately equal to f(s,a;w).
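
A sketch of fitting that linear approximation by gradient descent on the squared error (the initial weights, learning rate and target are all made up; this is only meant to show the shape of the update):

    import numpy as np

    w = np.array([0.5, -0.2])        # w1, w2 (made-up initial values)
    b = 0.0

    def q_approx(s, a):
        """Q(s,a) ~= f(s,a;w) = w1*s + w2*a + b."""
        return w[0] * s + w[1] * a + b

    def sgd_step(s, a, target, lr=0.01):
        """One gradient-descent step on (target - Q(s,a))^2."""
        global b
        error = target - q_approx(s, a)
        w[0] += lr * error * s       # dQ/dw1 = s
        w[1] += lr * error * a       # dQ/dw2 = a
        b    += lr * error           # dQ/db  = 1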

Dimensionality Reduction

As is often the case, the dimension of the action space is much smaller than the dimension of the state space. In order to update the Q values more efficiently, we feed only the state into the function and let it output a Q value for every action, like this:

Q(s) approximates to f(s,w), where the output is a vector [Q(s,a1), ..., Q(s,an)].
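
A sketch of such a function as a small neural network in PyTorch (the layer sizes and the use of PyTorch are my own assumptions; the point is only that the input is the state and the output is one Q value per action):

    import torch
    import torch.nn as nn

    class QNetwork(nn.Module):
        """Maps a state vector to the vector [Q(s,a1), ..., Q(s,an)]."""
        def __init__(self, state_dim, n_actions, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, n_actions),   # one output per action
            )

        def forward(self, state):
            return self.net(state)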

Training Q-Network by DQN

Training a typical deep neural network is an optimization problem. The optimization target of the neural network is the loss function, i.e. the error between the label and the network output, and optimizing the loss function means minimizing that error. We need a lot of samples to train the parameters of the neural network by gradient descent with backpropagation.

Following the basic idea of neural networks, we regard the target Q value as the label, and then train the network so that its Q value approximates the target Q value.

Thus, the training loss function is:

   L(w) = E[ ( r + gamma * max_a' Q(s',a';w) - Q(s,a;w) )^2 ]
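
A sketch of that loss in PyTorch, assuming the QNetwork above and a minibatch of transitions (s, a, r, s', done) already converted to tensors; note that here the same network produces both the prediction and the target:

    import torch
    import torch.nn.functional as F

    def dqn_loss(q_net, s, a, r, s_next, done, gamma=0.99):
        """MSE between Q(s,a) and the target r + gamma * max_a' Q(s',a')."""
        # a: LongTensor of action indices; done: FloatTensor of 0/1 episode-end flags
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        with torch.no_grad():                     # the target is treated as a fixed label
            q_next = q_net(s_next).max(dim=1).values
            target = r + gamma * (1.0 - done) * q_next
        return F.mse_loss(q_sa, target)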

Naive DQN

(Figure: the Naive DQN algorithm)

The basic idea of Naive DQN is to train Q-Learning and the network with SGD at the same time: store all the transition samples and then sample from them randomly, which is what we call experience replay, i.e. learning by reflecting on past experience.

Run the agent for several episodes and store all the data; then, once a considerable amount of data has been collected, perform SGD on randomly sampled minibatches.
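
A minimal sketch of such an experience replay buffer (the capacity and the tuple layout are my own choices):

    import random
    from collections import deque

    class ReplayBuffer:
        """Store transitions and hand out random minibatches for SGD."""
        def __init__(self, capacity=10000):
            self.buffer = deque(maxlen=capacity)    # old samples are dropped automatically

        def push(self, s, a, r, s_next, done):
            self.buffer.append((s, a, r, s_next, done))

        def sample(self, batch_size):
            return random.sample(self.buffer, batch_size)

        def __len__(self):
            return len(self.buffer)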

Improved DQN

There are several methods for improving the efficiency of DQN. Double DQN, Prioritized Replay and Dueling Network are three representative ones.

Nature DQN

Nature DQN refers to the method described in the Nature paper (Human Level Control …) by DeepMind. Nature DQN is also based on experience replay; the difference from Naive DQN is that it introduces a target Q network, like this:

(Figure: Nature DQN with a separate target Q network)

In order to decrease the correlation between the target Q value and the current Q value, they designed a target Q network whose parameters are updated with a delay, i.e. copied from the trained network only after a period of training.
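
A sketch of that delayed update, assuming the QNetwork class from the earlier snippet (the network sizes and the sync interval of 1000 steps are made up):

    import copy

    q_net = QNetwork(state_dim=4, n_actions=2)       # online network (made-up sizes)
    target_net = copy.deepcopy(q_net)                # separate target Q network

    def maybe_sync_target(step, sync_every=1000):
        """Copy the online weights into the target network every sync_every steps."""
        if step % sync_every == 0:
            target_net.load_state_dict(q_net.state_dict())

During training, the target in the loss would then be computed with target_net instead of q_net.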

The details of Double DQN, Prioritized Replay and Dueling Network will be discussed later, after I have read those papers.

I will also give a basic summary of policy gradient methods and the A3C series of deep reinforcement learning algorithms.
