[note][reinforcement-learning] (0) using reinforcement to solve frozen-lake

转载 2016年08月29日 00:58:59

using reinforcement to solve frozen-lake

we will use GYM library to solve frozen-lake problem.

GYM provides an easy way to experiment with agents in a grid game.

The frozen-lake environment consists of a 4 by 4 grid of blocks. There are a start block, a goal block, a safe frozen block and a dangerous hole.

The objective is to move agent from start block to goal block and not fall into the hole block.

Each time agent can choose go four directions: up, down, left and right by one block.

There maybe a wind which occasionally blows the agent to a block that it didn't choose.

As such, a perfect performance everytime is impossible.

But learning to avoid hole and reach the goal is sill doable.

The reward of each step is 0, except for entering the goal block which rewards 1.

Thus, we will need an algorithm that learns long-time expected rewards.

Q-learning is a table of values for every state(row) and action (column) possible in the environment.

In the case of frozen-lake problem, we have 16 states and 4 actions, giving us a 16 by 4 table of Q-values.

We start by initializing the table to be uniform (all zeros), and then as we observe the rewards we obtain for various actions, we update the table accordingly.

We update the table by Bellman Equation. It states that the expect long-term reward for a given action is equal to the immediate reward from the current action combined with the expected reward from the best future action taken at the following state. In this way, we reuse our own Q-table when estimating how to update our table for future actions! In equation form, the rule looks like this:


This says that the Q-value for a given state (s) and action (a) should represent the current reward (r) plus the maximum discounted (γ) future reward expected according to our own table for the next state (s’) we would end up in. The discount variable allows us to decide how important the possible future rewards are compared to the present reward. By updating in this way, the table slowly beings to obtain accurate measures of the expected future reward for a given action in a given state.

import gym
import numpy as np
env = gym.make('FrozenLake-v0') # gym 自带FrozenLake环境,所以字符串必须一致
[2016-08-28 18:52:39,722] Making new env: FrozenLake-v0
# grids 
Q = np.zeros((env.observation_space.n, env.action_space.n))
# learning parameters

lr = 0.85
y = 0.99

num_steps = 2000

rList = []
for ii in xrange(num_steps):
    j = 0
    # reset environment, get first observation
    s = env.reset()
    rAll = 0
    d = False
    while j < 99:
        j += 1
        # choose action greedily
        a  = np.argmax(Q[s,:] + np.random.randn(1, env.action_space.n)*1.0/ (ii+1))
        s1, r, d, _ = env.step(a)
        # update Q table
        Q[s, a] = Q[s,a] + lr * (r + y * np.max(Q[s1,:]) - Q[s,a])
        rAll += r
        s = s1
        if d == True:break
print("Score over time: " +  str(sum(rList)/num_steps))
print("Final Q-Table Values")
Score over time: 0.35
Final Q-Table Values
[[  6.13590716e-03   6.14249158e-03   5.29860236e-01   1.56022837e-02]
 [  1.91698554e-04   2.95870014e-04   2.10001305e-04   3.33685432e-01]
 [  2.24586661e-03   2.56720079e-03   3.65946835e-03   2.80146626e-01]
 [  1.45247907e-03   1.82229915e-04   1.21586370e-03   2.62976533e-01]
 [  7.62102747e-01   9.49716764e-05   1.21329204e-03   1.12768247e-03]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]
 [  3.64798004e-01   4.95678153e-05   7.58364558e-04   8.00482449e-10]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]
 [  3.99071293e-03   1.31990383e-05   2.12142655e-03   5.37417612e-01]
 [  9.48052771e-04   8.17525930e-01   9.00640853e-04   1.09892593e-03]
 [  8.95540010e-01   4.88582407e-04   0.00000000e+00   0.00000000e+00]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]
 [  7.92931304e-04   0.00000000e+00   9.71283575e-01   2.33591697e-03]
 [  0.00000000e+00   0.00000000e+00   9.99524630e-01   0.00000000e+00]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]]

We instead need some way to take a description of our state, and produce Q-values for actions without a table:

that is where neural networks come in.

By acting as a function approximator, we can take any number of possible states that can be represented as a vector and learn to map them to Q-values.

In the case of the Frozen-lake example,

we will be using a one-layer network which takes the state encoded in a one-hot vector (1x16),and produces a vector of 4 Q-values,one for each action.

Such a simple network acts kind of like a glorified table, with the network weights serving as the old cells.

The key difference is that we can easily expand the tensorflow network with added layers, activation functions, and different input types, whereas all that is impossible with a regular table.

The method of updating is a little different as well.

Instead of directly updating our table, with a network we will be using backpropogation and a loss function.

Our loss function will be sum-of-squares loss, where the difference between the current predicted Q-values, and the “target” value is computed and the gradients passed through the network.

In this case, our Q-target for the chosen action is the equivalent to the Q-value computed in equation above.

import gym
import numpy as np
import random
import tensorflow as tf
import matplotlib.pyplot as plt
%matplotlib inline
env = gym.make('FrozenLake-v0')
[2016-08-28 22:40:31,086] Making new env: FrozenLake-v0
#These lines establish the feed-forward part of the network used to choose actions
inputs1 = tf.placeholder(shape=(1,16),dtype=tf.float32)
W = tf.Variable(tf.random_uniform((16,4),0,0.01))
Qout = tf.matmul(inputs1,W)
predict = tf.argmax(Qout,1)

#Below we obtain the loss by taking the sum of squares difference between the target and prediction Q values.
nextQ = tf.placeholder(shape=(1,4),dtype=tf.float32)
loss = tf.reduce_sum(tf.square(nextQ - Qout))
trainer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
updateModel = trainer.minimize(loss)
init = tf.initialize_all_variables()

# Set learning parameters
y = .99
e = 0.1
num_episodes = 2000
#create lists to contain total rewards and steps per episode
jList = []
rList = []
with tf.Session() as sess:
    for i in range(num_episodes):
        #Reset environment and get first new observation
        s = env.reset()
        rAll = 0
        d = False
        j = 0
        #The Q-Network
        while j < 99:
            #Choose an action by greedily (with e chance of random action) from the Q-network
            a,allQ = sess.run([predict,Qout],feed_dict={inputs1:np.identity(16)[s:s+1]})
            if np.random.rand(1) < e:
                a[0] = env.action_space.sample()
            #Get new state and reward from environment
            s1,r,d,_ = env.step(a[0])
            #Obtain the Q' values by feeding the new state through our network
            Q1 = sess.run(Qout,feed_dict={inputs1:np.identity(16)[s1:s1+1]})
            #Obtain maxQ' and set our target value for chosen action.
            maxQ1 = np.max(Q1)
            targetQ = allQ
            targetQ[0,a[0]] = r + y*maxQ1
            #Train our network using target and predicted Q values
            _,W1 = sess.run([updateModel,W],feed_dict={inputs1:np.identity(16)[s:s+1],nextQ:targetQ})
            rAll += r
            s = s1
            if d == True:
                #Reduce chance of random action as we train the model.
                e = 1./((i/50) + 10)
print "Percent of succesful episodes: " + str(sum(rList)/num_episodes) + "%"
Percent of succesful episodes: 0.442%




Reinforcement Learning in Continuous State and Action Spaces: A Brief Note

The problems of sequential decision making in continuous domains with delayed reward signals, the ma...

Andrew Ng, High Speed Obstacle Avoidance using Monocular Visionand Reinforcement Learning阅读笔记

High Speed Obstacle Avoidance using Monocular Vision and Reinforcement Learning阅读笔记 原作者:Jeff Mi...

End-to-end LSTM-based dialog control optimized with supervised and reinforcement learning


Introduction to Reinforcement Learning

  • 2017年11月15日 22:03
  • 10.47MB
  • 下载

deep reinforcement learning

  • 2017年10月25日 11:20
  • 34.53MB
  • 下载

[增强学习][Reinforcement Learning]学习笔记与回顾-2-马尔可夫决策过程MDP

Markov Decision Processes前言本文主要是视频学习的总结与回顾,想要了解更多内容请看视频或者学习专业课程。这一节主要是说马尔可夫决策过程-Markov Decision Proc...
您举报文章:[note][reinforcement-learning] (0) using reinforcement to solve frozen-lake