[note][reinforcement-learning] Using reinforcement learning to solve frozen-lake

We will use the OpenAI Gym library to solve the frozen-lake problem.

Gym provides an easy way to experiment with agents in grid-world games.

The frozen-lake environment is a 4-by-4 grid of blocks: a start block, a goal block, safe frozen blocks, and dangerous holes.

The objective is to move the agent from the start block to the goal block without falling into any hole.

At each step the agent can move one block in one of four directions: up, down, left, or right.

There may be a wind that occasionally blows the agent onto a block it did not choose.

As such, perfect performance every time is impossible.

But learning to avoid the holes and reach the goal is still doable.

The reward at each step is 0, except for entering the goal block, which gives a reward of 1.

Thus, we need an algorithm that learns long-term expected rewards.
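
Before turning to the algorithm, it helps to look at the environment itself. The short snippet below is only a sanity check (it assumes the same gym version and FrozenLake-v0 environment used in the rest of this note): it prints the 4-by-4 map and the sizes of the state and action spaces.

import gym

env = gym.make('FrozenLake-v0')
env.reset()
env.render()                     # prints the 4x4 map: S = start, F = frozen, H = hole, G = goal
print(env.observation_space.n)   # 16 states, one per grid cell
print(env.action_space.n)        # 4 actions: left, down, right, up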

Q-learning, in its simplest form, is a table of values with one row per state and one column per action possible in the environment.

In the case of the frozen-lake problem, we have 16 states and 4 actions, giving us a 16-by-4 table of Q-values.

We start by initializing the table to be uniform (all zeros), and then as we observe the rewards we obtain for various actions, we update the table accordingly.

We update the table using the Bellman equation. It states that the expected long-term reward for a given action is equal to the immediate reward from the current action plus the discounted expected reward from the best action taken at the following state. In this way, we reuse our own Q-table when estimating how to update it for future actions. In equation form, the update rule looks like this:

Q(s, a) = r + γ · max_a′ Q(s′, a′)

This says that the Q-value for a given state (s) and action (a) should represent the current reward (r) plus the maximum discounted (γ) future reward expected, according to our own table, for the next state (s′) we would end up in. The discount factor lets us decide how important possible future rewards are compared to the present reward. By updating in this way, the table slowly begins to obtain accurate estimates of the expected future reward for a given action in a given state.
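
As a minimal sketch of what this update looks like in code (the helper name q_update is just for illustration; the full script below applies the same rule inline, blended with a learning rate lr):

import numpy as np

def q_update(Q, s, a, r, s1, lr=0.85, y=0.99):
    # Bellman target: immediate reward plus discounted best future value
    target = r + y * np.max(Q[s1, :])
    # move Q(s, a) part of the way toward the target, controlled by the learning rate
    Q[s, a] = Q[s, a] + lr * (target - Q[s, a])
    return Q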

import gym
import numpy as np
env = gym.make('FrozenLake-v0') # gym ships with the FrozenLake environment, so the string must match exactly
[2016-08-28 18:52:39,722] Making new env: FrozenLake-v0
# Initialize the Q-table with all zeros: one row per state, one column per action
Q = np.zeros((env.observation_space.n, env.action_space.n))
# learning parameters
lr = 0.85            # learning rate
y = 0.99             # discount factor (gamma)
num_episodes = 2000  # number of training episodes

rList = []  # total reward collected in each episode
for i in range(num_episodes):
    # reset environment and get the first observation
    s = env.reset()
    rAll = 0
    d = False
    j = 0
    while j < 99:
        j += 1
        # choose an action greedily from the Q-table, with noise that decays over episodes
        a = np.argmax(Q[s, :] + np.random.randn(1, env.action_space.n) * 1.0 / (i + 1))
        # take the action and observe the new state, reward, and done flag
        s1, r, d, _ = env.step(a)
        # update the Q-table with the Bellman rule, blended by the learning rate
        Q[s, a] = Q[s, a] + lr * (r + y * np.max(Q[s1, :]) - Q[s, a])
        rAll += r
        s = s1
        if d:
            break
    rList.append(rAll)
print("Score over time: " +  str(sum(rList)/num_steps))
print("Final Q-Table Values")
print(Q)
Score over time: 0.35
Final Q-Table Values
[[  6.13590716e-03   6.14249158e-03   5.29860236e-01   1.56022837e-02]
 [  1.91698554e-04   2.95870014e-04   2.10001305e-04   3.33685432e-01]
 [  2.24586661e-03   2.56720079e-03   3.65946835e-03   2.80146626e-01]
 [  1.45247907e-03   1.82229915e-04   1.21586370e-03   2.62976533e-01]
 [  7.62102747e-01   9.49716764e-05   1.21329204e-03   1.12768247e-03]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]
 [  3.64798004e-01   4.95678153e-05   7.58364558e-04   8.00482449e-10]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]
 [  3.99071293e-03   1.31990383e-05   2.12142655e-03   5.37417612e-01]
 [  9.48052771e-04   8.17525930e-01   9.00640853e-04   1.09892593e-03]
 [  8.95540010e-01   4.88582407e-04   0.00000000e+00   0.00000000e+00]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]
 [  7.92931304e-04   0.00000000e+00   9.71283575e-01   2.33591697e-03]
 [  0.00000000e+00   0.00000000e+00   9.99524630e-01   0.00000000e+00]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]]
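
Once the table has converged, the learned (greedy) policy is simply the best-scoring action in each row. A quick way to read it off (purely illustrative, continuing with the Q array and numpy import from the script above):

# one action index per state; 0 = left, 1 = down, 2 = right, 3 = up
policy = np.argmax(Q, axis=1)
print(policy.reshape(4, 4))      # view the policy laid out on the 4x4 grid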

A 16-by-4 table works for this tiny grid, but tables do not scale to environments with many (or continuous) states. We instead need some way to take a description of our state and produce Q-values for actions without a table:

that is where neural networks come in.

By acting as a function approximator, a network can take any number of possible states represented as a vector and learn to map them to Q-values.

In the case of the frozen-lake example, we will be using a one-layer network which takes the state encoded as a one-hot vector (1x16) and produces a vector of 4 Q-values, one for each action.
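
The one-hot encoding is nothing more than a row of the 16-by-16 identity matrix; this is the same trick the script below uses to build the network input (a small sketch, with the state index s chosen arbitrarily):

import numpy as np

s = 5                             # any state index in 0..15
one_hot = np.identity(16)[s:s+1]  # shape (1, 16): zeros everywhere except a 1 at column s
print(one_hot)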

Such a simple network acts somewhat like a glorified table, with the network weights serving as the old table cells.

The key difference is that we can easily expand the TensorFlow network with added layers, activation functions, and different input types, none of which is possible with a regular table.

The method of updating is a little different as well.

Instead of directly updating our table, with a network we will use backpropagation and a loss function.

Our loss function will be a sum-of-squares loss, where the difference between the currently predicted Q-values and the "target" values is computed and the gradients are passed through the network.

In this case, our Q-target for the chosen action is equivalent to the Q-value computed in the equation above.

Loss = Σ (Q_target − Q)²
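
Note that only the entry for the action actually taken is changed in the target vector; the other three entries are copied from the network's own prediction, so they contribute nothing to the loss. A small numpy illustration of this (the numbers are made up; the variable names mirror those in the script below):

import numpy as np

allQ = np.array([[0.01, 0.02, 0.03, 0.04]])   # current network output for state s (1x4)
a, r, maxQ1, y = 2, 0.0, 0.05, 0.99           # chosen action, reward, max Q at s', discount

targetQ = allQ.copy()
targetQ[0, a] = r + y * maxQ1                 # Bellman target for the chosen action only
loss = np.sum(np.square(targetQ - allQ))      # sum-of-squares loss the optimizer minimizes
print(loss)
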
import gym
import numpy as np
import random
import tensorflow as tf
import matplotlib.pyplot as plt
%matplotlib inline
env = gym.make('FrozenLake-v0')
[2016-08-28 22:40:31,086] Making new env: FrozenLake-v0
tf.reset_default_graph()
#These lines establish the feed-forward part of the network used to choose actions
inputs1 = tf.placeholder(shape=(1,16),dtype=tf.float32)
W = tf.Variable(tf.random_uniform((16,4),0,0.01))
Qout = tf.matmul(inputs1,W)
predict = tf.argmax(Qout,1)

#Below we obtain the loss by taking the sum of squares difference between the target and prediction Q values.
nextQ = tf.placeholder(shape=(1,4),dtype=tf.float32)
loss = tf.reduce_sum(tf.square(nextQ - Qout))
trainer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
updateModel = trainer.minimize(loss)
init = tf.initialize_all_variables()

# Set learning parameters
y = .99              # discount factor (gamma)
e = 0.1              # epsilon: probability of taking a random action
num_episodes = 2000
#create lists to contain total rewards and steps per episode
jList = []
rList = []
with tf.Session() as sess:
    sess.run(init)
    for i in range(num_episodes):
        #Reset environment and get first new observation
        s = env.reset()
        rAll = 0
        d = False
        j = 0
        #The Q-Network
        while j < 99:
            j+=1
            #Choose an action greedily (with probability e of a random action) from the Q-network
            a,allQ = sess.run([predict,Qout],feed_dict={inputs1:np.identity(16)[s:s+1]})
            if np.random.rand(1) < e:
                a[0] = env.action_space.sample()
            #Get new state and reward from environment
            s1,r,d,_ = env.step(a[0])
            #Obtain the Q' values by feeding the new state through our network
            Q1 = sess.run(Qout,feed_dict={inputs1:np.identity(16)[s1:s1+1]})
            #Obtain maxQ' and set our target value for chosen action.
            maxQ1 = np.max(Q1)
            targetQ = allQ
            targetQ[0,a[0]] = r + y*maxQ1
            #Train our network using target and predicted Q values
            _,W1 = sess.run([updateModel,W],feed_dict={inputs1:np.identity(16)[s:s+1],nextQ:targetQ})
            rAll += r
            s = s1
            if d:
                #Reduce the chance of a random action as training progresses.
                e = 1. / ((i // 50) + 10)
                break
        jList.append(j)
        rList.append(rAll)
print "Percent of succesful episodes: " + str(sum(rList)/num_episodes) + "%"
Percent of succesful episodes: 0.442%
plt.plot(rList)