n05_Reinforcement Q-Learning_MDP_getattr_Monte Carlo_Blackjack_Driving office_ε-greedy_SARSA_Bellman

     Reinforcement learning (RL) is the third major branch of machine learning, after supervised and unsupervised learning. These techniques have gained a lot of traction in recent years in applications of artificial intelligence. In reinforcement learning, sequential decisions have to be made rather than a single one-shot decision, which makes training the models difficult in some cases. In this chapter, we will cover various techniques used in reinforcement learning, supported by practical examples. Although covering every topic is beyond the scope of this book, we do cover the most important fundamentals here, enough to create real enthusiasm for the subject in the reader. Topics discussed in this chapter are:

  • Markov decision process
  • Bellman equations
  • Dynamic programming
  • Monte Carlo methods
  • Temporal difference learning

Reinforcement learning basics

     Before we deep dive into the details of reinforcement learning, I would like to cover some of the basics necessary for understanding the various nuts and bolts of RL methodologies. These basics appear across various sections of this chapter, and we will explain them in detail whenever required:

  • Environment: This is any system that has states, and mechanisms to transition between states. For example, the environment for a robot is the landscape or facility in which it operates.
  • Agent: This is an automated system that interacts with the environment.
  • State: The state of the environment or system is the set of variables or features that fully describe the environment.
  • Goal or absorbing state or terminal state: This is the state that provides a higher discounted cumulative reward than any other state. A high cumulative reward prevents the best policy from being dependent on the initial state during training. Whenever the agent reaches its goal, one episode is finished.
  • Action: This defines the transition between states. The agent is responsible for performing, or at least recommending an action. Upon execution of the action, the agent collects a reward (or punishment) from the environment.
  • Policy: This defines the action to be selected and executed for any state of the environment. In other words, policy is the agent's behavior; it is a map from state to action. Policies could be either deterministic or stochastic.  A stochastic policy has a probability distribution over actions that an agent can take at a given state:

    The optimal policy π* is the policy that yields the highest return.
  • Best policy: This is the policy generated through training. It defines the model in Q-learning and is constantly updated with any new episode.
  • Rewards: This quantifies the positive or negative interaction of the agent with the environment. Rewards are usually immediate earnings made by the agent reaching each state.
  • Returns or value function: A value function (also called the return) is a prediction of the future rewards obtainable from each state. It is used to evaluate how good or bad a state is, and based on it the agent chooses the next best state:
    G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + ... = R_{t+1} + γG_{t+1}
    This means that the return at time t is equal to the immediate reward R_{t+1} plus the discounted future return at time t + 1. This is a very important property, which simplifies the computation of the return.
    #######################################      If an agent decides to go right three times in a row and gets a +10 reward after the 1st step, 0 after the 2nd step, and finally -50 after the 3rd step, then assuming we use a discount factor 𝛾 = 0.8, the first action will have a return of 10 + 0.8·0 + 0.8²·(−50) = −22:
    import numpy as np
     
    def discount_rewards( immediate_reward_list, discount_rate ):
      discounted = np.array( immediate_reward_list )
      # iterate from back to front so that we avoid recursion
      for reward_on_each_step in range( len(immediate_reward_list)-2, -1, -1 ):
        discounted[reward_on_each_step] += discounted[reward_on_each_step+1] * discount_rate
      return discounted
    
    discount_rewards([10, 0, -50], discount_rate=0.8)

    def discounted_future_return( immediate_reward_list, discount_rate, index=0 ):
      immediate_reward_list = np.array(immediate_reward_list)
     
      if len(immediate_reward_list) ==1:
        return immediate_reward_list[0]
     
      if index == len(immediate_reward_list)-1 :
        return immediate_reward_list[index]
     
      # use a recursion
      return immediate_reward_list[index] + \
             discount_rate * discounted_future_return( immediate_reward_list, 
                                                       discount_rate, 
                                                       index+1
                                                     )# index+1: 1, 2
     
    discounted_future_return( [10, 0, -50], discount_rate=0.8, index=0 )
    The same discounted returns can be obtained by applying the recursive helper at every index:
    def discount_rewards( immediate_reward_list, discount_rate=0.8 ):
      discount_reward_list = []
      for index in range( len(immediate_reward_list) ): 
        discount_reward_list.append( discounted_future_return( immediate_reward_list, 
                                                               discount_rate, index 
                                                             )
                                   )
      return discount_reward_list
     
    discount_rewards([10, 0, -50], discount_rate=0.8)

    #######################################
     
    • R_{t+1} = r is the immediate reward obtained after performing an action A_t at time t; the subsequent rewards are R_{t+2}, R_{t+3}, and so forth.
    • 𝛾 is the discount factor in the range [0, 1]. The parameter 𝛾 indicates how much the future rewards are "worth" at the current moment (time t); in other words, the discount factor penalizes rewards that lie further in the future.
      • Setting 𝛾 = 0 implies that we do not care about future rewards. In this case, the return is equal to the immediate reward, ignoring the subsequent rewards after t+1, and the agent is short-sighted. On the other hand,
      • if 𝛾 = 1, the return is the unweighted sum of all subsequent rewards.
    • Each state s is associated with a value function V(s), predicting the expected amount of future reward we are able to receive in this state by acting according to the corresponding policy.
    • Now, based on the return G_t, we define the value function of state s as the expected return (the average return over all possible episodes) after following policy 𝜋, given that we are in this state at time t (S_t = s):
      state-value function: V_π(s) = E_π[G_t | S_t = s]
  • Episode: This defines the number of steps necessary to reach the goal state from an initial state. Episodes are also known as trials.

  • Horizon: This is the number of future steps or actions used in the maximization of the reward. The horizon can be infinite, in which case, the future rewards are discounted in order for the value of the policy to converge.
  • Exploration versus Exploitation: RL is a type of trial-and-error learning. The goal is to find the best policy while, at the same time, remaining alert enough to explore some unknown policies. A classic example would be treasure hunting: if we just go to the known locations greedily (exploitation), we fail to look for other places where hidden treasure might also exist (exploration). By exploring unknown states, and by taking chances even when the immediate rewards are low, we might achieve greater goals without losing the maximum rewards. In other words, we escape a local optimum in order to reach the global optimum (exploration), rather than keeping a short-term focus purely on the immediate rewards (exploitation). A simple ε-greedy action-selection sketch is given after this list. Here are a couple of examples to explain the difference:
    • Restaurant selection: By exploring unknown restaurants once in a while, we might find a much better one than our regular favorite restaurant:
      • Exploitation: Going to your favorite restaurant
      • Exploration: Trying a new restaurant
    • Oil drilling example: By exploring new untapped locations, we may get newer insights that are more beneficial than just drilling the same place:
      • Exploitation: Drill for oil at best known location
      • Exploration: Drill at a new location
  • State-Value versus State-Action Function: The action-value function Q_π(s, a) represents the expected return (cumulative discounted reward) an agent receives when taking action a in state s and behaving according to a certain policy π(a|s) afterwards (where π(a|s) is the probability of taking action a in a given state s):
      Q_π(s, a) = E_π[G_t | S_t = s, A_t = a]
    • The state-value function V_π(s) is the expected return an agent receives from being in state s and behaving under policy π(a|s). More specifically, the state-value is an expectation over the action-values under the policy:
      V_π(s) = Σ_a π(a|s)·Q_π(s, a)
      The optimal state-value is obtained by looking one step ahead to find the action that gives the maximum value: V_*(s) = max_a Q_*(s, a)
  • On-policy(SARSA) versus off-policy TD control(Q-learning):
    An off-policy learner learns the value of the optimal policy independently of the agent's actions. Q-learning is an off-policy learner.
    An on-policy learner learns the value of the policy being carried out by the agent, including the exploration steps.
  • Prediction and control problems:
    Prediction talks about how well I do, based on the given policy: meaning, if someone has given me a policy and I implement it, how much reward I will get for that.
    Whereas, in control, the problem is to find the best policy so that I can maximize the reward.
  • Prediction: Evaluation of the values of states for a given policy.
    • For the uniform random policy, what is the value function for all states?
  • Control: Optimize the future by finding the best policy.
         What is the optimal value function over all possible policies, and what is the optimal policy?
         Usually, in reinforcement learning, we need to solve the prediction problem first, in order to solve the control problem after, as we need to figure out all the policies to figure out the best or optimal one.
  • RL Agent Taxonomy: An RL agent includes one or more of the following components:
    • Policy: Agent's behavior function (map from state to action); Policies can be either deterministic or stochastic
    • Value function: How good is each state (or) prediction of expected future reward for each state
    • Model: Agent's representation of the environment. A model predicts what the environment will do next:
      • Transitions: P predicts the next state (that is, the dynamics):
        P_ss'^a = ℙ[S_{t+1} = s' | S_t = s, A_t = a]
        e.g.  transitions P[s][a] == [(probability, nextstate, reward, done), ...]
      • Rewards: R predicts the next (immediate) reward
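
     Before moving on to the agent taxonomy example, here is a minimal ε-greedy action-selection sketch to make the exploration-versus-exploitation trade-off mentioned above concrete. The Q-table, state, and action names below are hypothetical placeholders, not part of the grid-world code used later in this chapter:

import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon pick a random action (exploration);
    otherwise pick the action with the highest estimated value (exploitation)."""
    if random.random() < epsilon:
        return random.choice(actions)                     # explore
    return max(actions, key=lambda a: Q[(state, a)])      # exploit

# Hypothetical Q-table with a single state 's0' and two actions:
Q = {('s0', 'a'): 0.2, ('s0', 'b'): 0.5}
print(epsilon_greedy(Q, 's0', ['a', 'b'], epsilon=0.1))   # mostly 'b', occasionally 'a'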

     Let us explain the various categories possible in the RL agent taxonomy, based on combinations of the policy, value, and model components, using the following maze example. In the maze, you have both a start and a goal; the agent needs to reach the goal as quickly as possible, taking a path that gains the maximum total reward and the minimum total negative reward. This problem can be solved in five major categorical ways:

  • Value based
  • Policy based
  • Actor critic
  • Model free
  • Model based

 

Category 1 - value based

     The value function looks like the right-hand side of the image (the sum of discounted future rewards), where every state has some value. Let's say the state one step away from the goal has a value of -1, and the state two steps away from the goal has a value of -2. In a similar way, the starting point has a value of -16. If the agent gets stuck in the wrong place, the value could be as low as -24. In fact, the agent moves across the grid based on the best possible values to reach its goal. For example, if the agent is at a state with a value of -15, it can choose to move either north or south; it chooses to move north because of the higher value, which is -14, rather than moving south, which has a value of -16. In this way, the agent chooses its path across the grid until it reaches the goal.

  • Value Function: Only values are defined at all states
  • No Policy (Implicit): No exclusive policy is present; policies are chosen based on the values at each state

Category 2 - policy based

     The arrows in the following image represent what an agent chooses as the direction of the next move while in any of these states. For example, the agent first moves east and then north, following all the arrows until the goal has been reached. This is also known as mapping from states to actions. Once we have this mapping, an agent just needs to read it and behave accordingly.

  • Policy: Policies or arrows that get adjusted to reach the maximum possible future rewards. As the name suggests, only policies are stored and optimized to maximize rewards.
  • No value function: No values exist for the states.

Category 3 - actor-critic

     In Actor-Critic, we have both policy and value functions (or a combination of value-based and policy-based). This method is the best of both worlds:

  • Policy
  • Value Function

Category 4 - model-free

     In RL, a fundamental distinction is whether the method is model-based or model-free. In the model-free case, we do not explicitly model the environment, and we do not know the entire dynamics of the complete environment. Instead, we go directly to the policy or value function to gain experience and figure out how the policy affects the reward:

  • Policy and/or value function
    • No model

Category 5 - model-based

In model-based RL, we first build the entire dynamics of the environment:

  • Policy and/or value function
  • Model

     After going through all the above categories, the following Venn diagram shows the entire landscape of the taxonomy of an RL agent in one single place. If you pick up any paper related to reinforcement learning, its methods will fit within some section of this landscape.

Fundamental categories in sequential decision making

There are two fundamental types of problems in sequential decision making:

  • Reinforcement learning (for example, autonomous helicopter, and so on):
    • Environment is initially unknown
    • Agent interacts with the environment, obtaining rewards and experience from it
    • Agent improves its policy
  • Planning (for example, chess, Atari games, and so on):
    • Model of environment or complete dynamics of environment is known
    • Agent performs computation with its model (without any external interaction)
    • Agent improves its policy
    • These types of problems are also known as reasoning, searching, introspection, and so on

     Though the preceding two categories can be linked together depending on the given problem, this is basically a broad view of the two types of setups.

Markov decision processes and Bellman equations 

     A Markov decision process (MDP) formally describes an environment for reinforcement learning, in which:

  • Environment is fully observable
  • Current state completely characterizes the process (which means the future state is entirely dependent on the current state rather than historic states or values)
  • Almost all RL problems can be formalized as MDPs (for example, optimal control primarily deals with continuous MDPs)

     Central idea of MDP: MDP works on the simple Markovian property of a state; that is, the next state S_{t+1} is entirely dependent on the latest state S_t rather than on any historic dependencies. In the following equation, the current state captures all the relevant information from the history, which means the current state is a sufficient statistic of the future:
     ℙ[S_{t+1} | S_t] = ℙ[S_{t+1} | S_1, S_2, ..., S_t]
     The probability distribution over the next state S_{t+1} = s' and reward R_{t+1} = r can be written as a conditional probability over the preceding state (S_t = s) and the action taken (A_t = a) at time step t:
     p(s', r | s, a) = ℙ[S_{t+1} = s', R_{t+1} = r | S_t = s, A_t = a]
     The transition function p records the probability of transitioning from state s to s' after taking action a while obtaining reward r (the reward received after taking the action). We use ℙ as a symbol for "probability". States, rewards, and actions come from predefined finite sets, denoted by s ∈ S, r ∈ R, and a ∈ A, respectively.
    An intuitive sense of this property can be explained with the autonomous helicopter example: the next step for the helicopter, whether to move right or left, to pitch, or to roll, and so on, is entirely dependent on the current position of the helicopter, rather than on where it was five minutes before.

    Modeling of MDP: RL problems model the world using the MDP formulation as a five-tuple
(S, A, {P_sa}, γ, R)

  • S - Set of states (the set of possible orientations of the helicopter)
  • A - Set of actions (the set of all possible positions to which the control stick can be pulled)
  • {P_sa} - State transition distributions (or state transition probability distributions), which provide the transitions from one state to another and the respective probabilities needed for the Markov process:
    P_sa(s') = ℙ[S_{t+1} = s' | S_t = s, A_t = a],  with Σ_s' P_sa(s') = 1 and P_sa(s') ≥ 0
  • γ - Discount factor, with 0 ≤ γ < 1
  • R - Reward function, R: S → ℝ (maps the set of states to real numbers, either positive or negative)

 Returns are calculated by discounting the future rewards until terminal state is reached.

"""A Markov Decision Process, defined by an init_pos_posial state, transition model,
and reward function. """

class MDP:

    def __init__(self, init_pos, actlist, terminals, transitions={}, states=None, gamma=0.99):
        if not (0 < gamma <= 1):
            raise ValueError("MDP should have 0 < gamma <= 1 values")

        if states:
            self.states = states
        else:
            self.states = set()
        self.init_pos = init_pos
        self.actlist = actlist         # action list
        self.terminals = terminals
        self.transitions = transitions
        self.gamma = gamma             # discount factor
        self.reward = {}

    """Returns a numeric reward for the state."""
    def R(self, state):
        return self.reward[state]

    """Transition model. From a state and an action, return a list of (probability, result-state) pairs"""
    def T(self, state, action):
        if(self.transitions == {}):
            raise ValueError("Transition model is missing")
        else:
            return self.transitions[state][action]

    """Set of actions that can be performed for a particular state"""
    def actions(self, state):
        if state in self.terminals:
            return [None]
        else:
            return self.actlist
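
     As a quick, hypothetical usage of the preceding MDP class: the two states 's0' and 's1', the single action 'move', and the probabilities below are made up purely to exercise the interface (the reward dictionary must be filled in by the caller):

# Two hypothetical states: 's0' (non-terminal) and 's1' (terminal), one action 'move'.
transitions = { 's0': { 'move': [(0.9, 's1'), (0.1, 's0')] },
                's1': { 'move': [(1.0, 's1')] } }

tiny_mdp = MDP(init_pos='s0', actlist=['move'], terminals=['s1'],
               transitions=transitions, states={'s0', 's1'}, gamma=0.9)
tiny_mdp.reward = {'s0': -0.1, 's1': +1.0}   # rewards are supplied by the caller

print(tiny_mdp.actions('s0'))    # ['move']
print(tiny_mdp.actions('s1'))    # [None], because 's1' is a terminal state
print(tiny_mdp.T('s0', 'move'))  # [(0.9, 's1'), (0.1, 's0')]
print(tiny_mdp.R('s0'))          # -0.1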

     Bellman Equations for MDP: Bellman equations are utilized for the mathematical formulation of an MDP, which is solved to obtain the optimal policies of the environment. Bellman equations are also known as dynamic programming equations and are a necessary condition for the optimality associated with the mathematical optimization method known as dynamic programming. Bellman equations are linear equations which can be solved for the entire environment. However, the time complexity for solving these equations is O(n³), where n is the number of states, which becomes computationally very expensive when the number of states in an environment is large; and sometimes it is not even feasible to explore all the states because the environment itself is very large. In those scenarios, we need to look at other ways of solving the problem.

In Bellman equations, the value function can be decomposed into two parts:

  • The immediate reward R_{t+1} received when you end up in the successor state
  • The discounted value of the successor state, collected from that timestep onwards:
    value function: V_π(s) = E_π[R_{t+1} + γ·V_π(S_{t+1}) | S_t = s]

     Grid world example of MDP: Robot navigation tasks live in the following type of grid world. An obstacle is shown in cell (2,2), through which the robot can't navigate. We would like the robot to move to the upper-right cell (4,3), and when it reaches that position, the robot gets a reward of +1. The robot should avoid the cell (4,2), as, if it moved into that cell, it would receive a reward of -1.

Robot can be in any of the following positions:

  • 11 states S (all cells except cell (2,2), in which there is an obstacle for the robot)
  • A = {N-north, S-south, E-east, W-west}

     In the real world, robot movements are noisy, and a robot may not be able to move exactly where it has been asked to. Examples might include that some of its wheels slipped, its parts were loosely connected, it had incorrect actuators, and so on. When asked to move by 1 meter, it may not be able to move exactly 1 meter; instead, it may move 90-105 centimeters, and so on.

     In a simplified grid world, the stochastic dynamics of a robot can be modeled as follows. If we command the robot to go north, there is a 10% chance that the robot drifts towards the left and a 10% chance that it drifts towards the right; only 80 percent of the time does it actually go north. If the robot bumps into a wall (or the obstacle), it bounces off and just stays in the same position; nothing else happens:
 
     Every state in this grid world example is represented by (x, y) coordinates. Let's say the robot is at state (3,1) (column_idx=3, row_idx=1) and we ask it to move north; then the state transition probabilities are as follows:
     P_(3,1),N((3,2)) = 0.8,   P_(3,1),N((2,1)) = 0.1,   P_(3,1),N((4,1)) = 0.1

The probability of staying in the same position is 0 for the robot: P_(3,1),N((3,1)) = 0

As we know, the sum of all the state transition probabilities adds up to 1: 0.8 + 0.1 + 0.1 + 0 = 1

Reward function: R((4,3)) = +1, R((4,2)) = -1, and R(s) = -0.02 for all other states.
     The small negative reward for all the other states means the robot is charged for battery or fuel consumption as it runs around the grid. This produces solutions that do not waste moves or time on the way to the +1 goal, and it encourages the robot to reach the goal as quickly as possible using as little fuel as possible.

     The world ends when the robot reaches either the +1 or the -1 state. No more rewards are possible after reaching any of these states; these can be called absorbing states. They are zero-cost absorbing states, and the robot stays there forever.

MDP working model (a sampling sketch follows this list):

  • At state s0
  • Choose a0
  • Get to s1 ~ P_s0,a0
  • Choose a1
  • Get to s2 ~ P_s1,a1
  • and so on ....
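     This state-action-state chain can be simulated directly. The following sketch samples one trajectory from a hypothetical transition model P (a dictionary invented for illustration, in the same (probability, next_state) format used by the grid-world code below), stopping at a terminal state:

import random

# Hypothetical transition model: P[(state, action)] = [(probability, next_state), ...]
P = { ('s0', 'a0'): [(0.8, 's1'), (0.2, 's0')],
      ('s1', 'a1'): [(1.0, 'goal')] }
policy = {'s0': 'a0', 's1': 'a1'}       # fixed toy policy
terminals = {'goal'}

def sample_next_state(state, action):
    probs, next_states = zip(*P[(state, action)])
    return random.choices(next_states, weights=probs)[0]

state, trajectory = 's0', ['s0']
while state not in terminals:
    action = policy[state]
    state = sample_next_state(state, action)
    trajectory.append(state)
print(trajectory)                        # e.g. ['s0', 's0', 's1', 'goal']
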
# Import the random package for generating moves in any of the N, E, S, W directions: 
import random,operator

# To add two vectors at component level, the following code has been utilized for:
def vector_add(a, b):
    return tuple( map(operator.add, a, b) )

# Orientations provide what the increment value would be, which needs to be added to the
# existing position of the agent; orientations can be applied on the x-axis or y-axis:    
# (1, 0):'>',  (0, 1):'^',  (-1, 0):'<', (0, -1):'v', None: '.' 
orientations = [(1,0), (0, 1), (-1, 0), (0, -1)] # (1,0) :(col_move, row_move)

# The following functions turn the agent's heading. At every command the agent moves in
# the commanded direction about 80% of the time, 10% of the time it drifts to the right,
# and 10% of the time it drifts to the left:
def turn_right(orientation):
    return orientations[orientations.index(orientation)-1] # index( orientation=(0,1) )=1 ==> 0 ==> (1,0):'>'

def turn_left(orientation):
    return orientations[(orientations.index(orientation)+1) % len(orientations)] # (1+1)%4 = 2 ==> (-1,0):'<'

def isnumber(x):
    return hasattr(x, '__int__')


# sequential_decision_environment = GridMDP([[-0.02, -0.02, -0.02, +1],
#                                            [-0.02, None, -0.02, -1],
#                                            [-0.02, -0.02, -0.02, -0.02]],
#                                           terminals=[(3, 2), (3, 1)])

        
"""A two-dimensional grid MDP"""
# Class GridMDP is created for modeling a 2D grid world with grid values at each state,
# terminal positions, initial position, and gamma value (discount):
class GridMDP(MDP):

    def __init__(self, grid, terminals, init_pos=(0, 0), gamma=0.99):
        
        """ because we want row 0 on bottom, not on top """
        grid.reverse() # flip axis=0
#         [[-0.02, -0.02, -0.02, -0.02],
#          [-0.02, None, -0.02, -1],
#          [-0.02, -0.02, -0.02, 1]]
        
        MDP.__init__(self, init_pos, actlist=orientations,
                     terminals=terminals, gamma=gamma)
        self.grid = grid           # reward_table for each position(or each state)
        self.rows = len(grid)
        self.cols = len(grid[0])
        for c in range(self.cols):
            for r in range(self.rows):
                self.reward[c, r] = grid[r][c] # e.g. self.reward[c=3][r=2]=1 <== grid[r=2][c=3]=1
                if grid[r][c] is not None:
                    self.states.add((c, r)) # self.states = set() , append 

    # State transitions provide randomly 80% toward the desired direction and 
    # 10% for left and right. This is to model the randomness in 
    # a robot which might slip on the floor, and so on:   
    def T(self, state, action): # State transition probabilities = P_s,a_(s') ==> return [ (P, new_state),...]
        if action is None:
            return [(0.0, state)]
        else:                                                  #    (3,1)+(0,1)==>(3,2)  # (0,1):(col_move, row_move)    
            return [(0.8, self.go(state, action)),             # P_s=(3,1),a_( s'=(3,2) ) # action(a: goes up) taken
                    (0.1, self.go(state, turn_right(action))), # P_s=(3,1),a_( s'=(4,1) ) # action(a: goes right) taken
                    (0.1, self.go(state, turn_left(action)))]  # P_s=(3,1),a_( s'=(2,1) ) # action(a: goes left) taken

    """Return the state that results from going in this direction."""
    # subject to where that state is in the list of valid states. 
    # If the next state is not in the list, like hitting the wall, 
    # then the agent should remain in the same state:
    def go(self, state, direction):
        state1 = vector_add(state, direction) # get new state(state1)
        return state1 if state1 in self.states else state # if state1 not in states, then state1 is an obstacle or wall

    """fill each state(x, y) with action_char ==> [[..., action_char, ...]...] state transition probability grid."""
    def to_grid(self, mapping):
        return list( reversed([ [ mapping.get( (x, y), None )# get(key) ==> value(here is direction char)
                                  for x in range(self.cols)
                                ]
                                for y in range(self.rows)
                             ])# reversed since previous grid did a reverse operation
                   )
                
    """Convert a mapping from state(x, y) to action into a [[..., action_char, ...]...] state transition probability grid."""
    def to_arrows(self, policy):
        chars = { (1, 0): '>', 
                  (0, 1): '^', 
                  (-1, 0): '<',
                  (0, -1): 'v',
                  None: '.'
                }
        return self.to_grid({ s: chars[a] 
                              for (s, a) in policy.items() # e.g. (1, 2): (1, 0) == current state: next action taken
                            })

 After the robot visits a sequence of states, it takes all the rewards and sums them up to obtain the total payoff:
     R(s0) + γR(s1) + γ²R(s2) + ...

     Discount factor models an economic application, in which one dollar earned today is more valuable than one dollar earned tomorrow.

The robot needs to choose actions over time (a0, a1, a2, ....) to maximize the expected payoff:
     E[R(s0) + γR(s1) + γ²R(s2) + ...]

     Over time, a reinforcement learning algorithm learns a policy π: S → A, which is a mapping from states to actions; that is, it gives the recommended action the robot needs to take based on the state it is in:

     Optimal Policy for Grid World: Policy maps from states to actions, which means that, if you are in a particular state, you need to take this particular action. The following policy is the optimal policy which maximizes the expected value of the total payoff or sum of discounted rewards. Policy always looks into the current state rather than previous states, which is the Markovian property:

     One tricky thing to look at is the position (3,1): the optimal policy says to go left (west) rather than north, even though going north might pass through fewer states, because going north puts the robot next to an even riskier state that it may drift into. So going left may take longer, but it safely arrives at the destination without falling into the negative trap. These kinds of results come out of the computation; they do not look obvious to humans, but a computer is very good at coming up with such policies:
Define: V^π, V*, π*
      For any given policy π, the value function is V^π: S → ℝ such that V^π(s) is the expected total
payoff starting in state s and executing π.

     Random policy for grid world: The following is an example of a random policy and its value functions. This policy is a rather bad policy with negative values. For any policy, we can write down the value function for that particular policy:


     In simple English, the Bellman equations state that the value of the current state is equal to the immediate reward plus the discount factor applied to the expected total payoff of the new states (s'), weighted by the probability of landing in those states when following the policy's action:
     V^π(s) = R(s) + γ·Σ_s' P_s,π(s)(s')·V^π(s')

     Bellman equations are used to solve the value function for a policy in closed form: given a fixed policy, they tell us how to solve the value function equations.

     Bellman equations impose a set of linear constraints on the value function. It turns out that we can solve for the value function at any state s by solving a system of linear equations.

Example of Bellman equations with a grid world problem:

     The chosen policy for cell (3,1) is to move north. However, we have stochasticity in the system that about 80 percent of the time it moves in the said direction, and 20% of the time it drifts sideways, either left (10 percent) or right (10 percent). 

     V^π((3,1)) = R((3,1)) + γ·[0.8·V^π((3,2)) + 0.1·V^π((4,1)) + 0.1·V^π((2,1))]

     Similar equations can be written for all 11 states of the MDP within the grid. From these we can solve for all the unknown values using a system of linear equations:

  • 11 equations
  • 11 unknown value function variables
  • 11 constraints

     This is an n-equations-with-n-unknowns problem, which we can solve exactly as a system of linear equations to obtain V^π in closed form for the entire grid, that is, for all the states.
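
     To make the "n equations, n unknowns" point concrete, here is a minimal sketch that solves V^π = R + γP^πV^π in closed form with numpy for a made-up three-state chain (the transition matrix and rewards below are hypothetical, not the 11-state grid):

import numpy as np

gamma = 0.99
# Hypothetical 3-state chain under a fixed policy pi: P_pi[i, j] = P(s_j | s_i, pi(s_i)).
# State 2 is terminal, so its row is all zeros (no further rewards after reaching it).
P_pi = np.array([[0.0, 0.8, 0.2],
                 [0.0, 0.0, 1.0],
                 [0.0, 0.0, 0.0]])
R = np.array([-0.02, -0.02, 1.0])   # immediate reward R(s) for each state

# Bellman equations for a fixed policy:  V = R + gamma * P_pi @ V
# Rearranged:                            (I - gamma * P_pi) @ V = R
V = np.linalg.solve(np.eye(3) - gamma * P_pi, R)
print(V)   # closed-form V_pi for the three states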

Dynamic programming

     Dynamic programming is a sequential way of solving complex problems by breaking them down into sub-problems and solving each of them. Once the sub-problems are solved, their solutions are put together to solve the original complex problem. In the reinforcement learning world, dynamic programming is a solution methodology for computing optimal policies given a perfect model of the environment as a Markov decision process (MDP).

     Dynamic programming holds good for problems which have the following two properties. MDPs, in fact, satisfy both properties, which makes DP a good fit for solving them by solving Bellman Equations:

  • Optimal substructure
    • Principle of optimality applies
    • Optimal solution can be decomposed into sub-problems
  • Overlapping sub-problems
    • Sub-problems recur many times
    • Solutions can be cached and reused
  • MDP satisfies both the properties - luckily!
    • Bellman equations have recursive decomposition of state-values
    • Value function stores and reuses solutions 

     Classical DP algorithms are of limited utility in reinforcement learning, both because of their assumption of a perfect model and because of their high computational expense. However, they are still important, as they provide an essential foundation for understanding all the methods in the RL domain.

Algorithms to compute optimal policy using dynamic programming

     Standard algorithms to compute optimal policies for MDP utilizing Dynamic Programming are as follows, and we will be covering both in detail in later sections of this chapter:

  • Value Iteration algorithm: An iterative algorithm, in which state values are iterated until it reaches optimal values; and, subsequently, optimum values are utilized to determine the optimal policy
  • Policy Iteration algorithm: An iterative algorithm, in which policy evaluation and policy improvements are utilized alternatively to reach optimal policy

Value(State) Iteration algorithm 

     Value Iteration algorithm: Value iteration is easy to compute for the very reason that it iterates only on the state values. First, we compute the optimal value function V*, then plug those values into the optimal policy equation to determine the optimal policy. To give a sense of the size of the problem: with 11 possible states and four actions per state (N-north, S-south, E-east, W-west), there are 4^11 possible policies overall. The value iteration algorithm consists of the following steps:

  • 1. Initialize V(S) = 0 for all states S
  • 2. For every s, update:
    V(s) := R(s) + γ·max_a Σ_s' P_sa(s')·V(s')
  • 3. By repeatedly computing step 2, we will eventually converge to optimal values for all the states:
"""Solving an MDP by value iteration and returns the optimum state values """
def value_iteration(mdp, epsilon=0.001):
    STSN = {s: 0 for s in mdp.states} # Initialize V(S) = 0 for all states S
    
    R, T, gamma = mdp.R, mdp.T, mdp.gamma # gamma=0.99
    while True:
        STS = STSN.copy()
        delta = 0
        
        for s in mdp.states:
            STSN[s] = R(s) + gamma * max( [ sum([ p * STS[s1]
                                                  for (p, s1) in T(s, a)
                                                ])
                                            for a in mdp.actions(s)
                                          ])
            
            delta = max(delta, abs(STSN[s] - STS[s]))
        if delta < epsilon * (1 - gamma) / gamma:
            return STS
""" A 4x3 grid environment that presents the agent with a sequential decision problem"""
sequential_decision_environment = GridMDP([[-0.02, -0.02, -0.02, +1],
                                           [-0.02, None, -0.02, -1],
                                           [-0.02, -0.02, -0.02, -0.02]],
                                          terminals=[(3, 2), (3, 1)])

state_value=value_iteration(sequential_decision_environment, .01)
state_value

 The output is the converged state value V(s) for each state.

There are two ways of updating the values in step 2 of the algorithm:

  • Synchronous update - In a synchronous update (also called the Bellman backup operator), we compute the right-hand side of the update equation for all states using the values from the previous sweep and then substitute the results into the left-hand side, as represented above.
  • Asynchronous update - Update the values of the states one at a time rather than updating all the states at the same time, with states updated in a fixed order (update state number 1, followed by 2, and so on). During convergence, asynchronous updates are a little faster than synchronous updates. A minimal in-place sketch is given right after this list.
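
     Below is a minimal sketch of such an in-place (asynchronous) variant. It assumes the same MDP/GridMDP interface defined earlier in this chapter; unlike value_iteration above, each state update immediately reuses values already refreshed in the current sweep:

def value_iteration_inplace(mdp, epsilon=0.001):
    """Asynchronous (in-place) value iteration sketch."""
    V = {s: 0 for s in mdp.states}
    R, T, gamma = mdp.R, mdp.T, mdp.gamma
    while True:
        delta = 0
        for s in mdp.states:   # states are swept one at a time, in a fixed order
            v_old = V[s]
            # Bellman optimality backup using the freshest available values
            V[s] = R(s) + gamma * max( sum( p * V[s1] for (p, s1) in T(s, a) )
                                       for a in mdp.actions(s) )
            delta = max(delta, abs(V[s] - v_old))
        if delta < epsilon * (1 - gamma) / gamma:
            return V

# value_iteration_inplace(sequential_decision_environment, .01) should converge to
# approximately the same state values as value_iteration, often in fewer sweeps.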

Illustration of value iteration on the grid world example: The application of value iteration on a grid world is explained in the following image, and the complete code for solving a real problem is provided at the end of this section. After applying the preceding value iteration algorithm to the MDP (Markov decision process) using the Bellman equations, we obtain the following optimal values V* for all the states (with the gamma value chosen as 0.99):

When we plug these values into our policy equation, we obtain the following policy grid, which gives the optimal action for each state:

     Here, at position (3,1), we can show mathematically why the optimal policy suggests going left (west) rather than moving up (north). The expected one-step payoff of moving west is:
     payoff(West) = 0.8·V((2,1)) + 0.1·V((3,2)) + 0.1·V((3,1)) = 0.8·V((2,1)) + 0.1·V((3,2)) + 0.1·(0.71)
     Due to the wall, whenever the robot (drifting left while commanded west) tries to move towards the south, it remains in the same place (it stays in the current state); hence we use the value of the current position, 0.71, with a probability of 0.1.

Similarly, for north, the total payoff is calculated as follows:
     payoff(North) = 0.8·V((3,2)) + 0.1·V((4,1)) + 0.1·V((2,1))
     Plugging in the converged state values from the value-iteration output, the expected payoff for west comes out higher than that for north.
     So, it would be optimal to move towards the west rather than north, and therefore the optimal policy is chosen to do so.

# The following argmax function calculated the maximum state among the given states,
# based on the value for each state:
def argmax(seq, fn):
    best = seq[0];
    best_score = fn(best) # pass the best action to fn ( expected_utility(a, s, STS, mdp) )
    for x in seq:
        x_score = fn(x)
        if x_score > best_score:
            best, best_score = x, x_score
    return best

"""Given an MDP and a utility function STS, determine the best policy,
as a mapping from state to action """
def best_policy(mdp, STS):######################
    pi = {}
    # I just use state{(2,0)} for test, please replace it with mdp.states later
    for s in {(2,0)}:# mdp.states: # e.g. s(col_idx, row_idx): ==> (0, 1): 0.8196979051091251
        # print(mdp.actions((2,0))) # [(1, 0), (0, 1), (-1, 0), (0, -1)] : ['>','^','<','V']
        # print(STS[(2,0)]) # 0.7086718796622142 ~~0.71
        pi[s] = argmax(mdp.actions(s), lambda a: expected_utility(a, s, STS, mdp))
    return pi

"""The expected utility of doing a in state s, according to the MDP and STS."""
def expected_utility(a, s, STS, mdp):
    print( s, round( sum([ round(p,4) * STS[s1] for (p, s1) in mdp.T(s, a) ]),
                     2
                   )
         )
    return sum([p * STS[s1] for (p, s1) in mdp.T(s, a)])

def print_table(table, header=None, sep='   ', numfmt='{}'):
    justs = ['rjust' if isnumber(x) else 'ljust' for x in table[0]]
    if header:
        table.insert(0, header)
    table = [[ numfmt.format(x) if isnumber(x) else x 
                                for x in row
             ]
             for row in table]

    # each column max width or maximum of the number of chars in each column
    sizes = list( map( lambda seq: max( map(len, seq) ),
                       list( zip( *[ map(str, row) # convert 'None' to str
                                     for row in table # each row is a list element
                                   ]# * convert [map object, map object, map object] to map map map
                                )# zip column by column 
                           )# zip(* matrix.shape(3,4) ) : 3x4 ==> 4x3 # do 2D transpose operation 
                     )
                )# https://blog.csdn.net/Linli522362242/article/details/118349075
    # sizes : [1, 4, 1, 1] # for width for each column
        
    for row in table:
        print( sep.join( getattr(
                                  str(x_in_row), 
                                  j # call ljust or rjust built-in function and width =size, fillchar = blank
                                )(size)
                         for (j, size, x_in_row) in zip(justs, sizes, row)
                       )
             )

""" A 4x3 grid environment that presents the agent with a sequential decision problem"""
sequential_decision_environment = GridMDP([[-0.02, -0.02, -0.02, +1],
                                           [-0.02, None, -0.02, -1],
                                           [-0.02, -0.02, -0.02, -0.02]],
                                          terminals=[(3, 2), (3, 1)])

# perform a value iteration on the given sequential decision making environment
value_iter = best_policy( sequential_decision_environment, 
                          value_iteration(sequential_decision_environment, .01)
                        )
print("\n Optimal Policy based on Value Iteration\n")
print_table(sequential_decision_environment.to_arrows(value_iter))
print(value_iter)

 Note: for each state, a commanded action leads to 3 possible next states (the intended direction plus the two drift directions).

The optimal action for the current state (3,1) is to move left (west).
Note: mdp.T(s, a) calls the go function, which returns state1 if state1 is in self.states, otherwise it returns the current state (if state1 is not in states, it is an obstacle or a wall). In that case STS[s1] becomes STS[state], that is, V(s) of the current state, which is 0.71 (0.7086718796622142).

# The following argmax function calculated the maximum state among the given states,
# based on the value for each state:
def argmax(seq, fn):
    best = seq[0];
    best_score = fn(best)
    for x in seq:
        x_score = fn(x) # pass the best action to fn ( expected_utility(a, s, STS, mdp) )
        if x_score > best_score:
            best, best_score = x, x_score
    return best

"""Given an MDP and a utility function STS, determine the best policy,
as a mapping from state to action """
def best_policy(mdp, STS):
    pi = {}
    # I just use state{(2,0)} for test, please replace it with mdp.states later
    for s in mdp.states: # e.g. s(col_idx, row_idx): ==> (0, 1): 0.8196979051091251
        # print(mdp.actions((2,0))) # [(1, 0), (0, 1), (-1, 0), (0, -1)] : ['>','^','<','V']
        pi[s] = argmax(mdp.actions(s), lambda a: expected_utility(a, s, STS, mdp))
    return pi

"""The expected utility of doing a in state s, according to the MDP and STS."""
def expected_utility(a, s, STS, mdp):
    return sum([p * STS[s1] for (p, s1) in mdp.T(s, a)])


""" A 4x3 grid environment that presents the agent with a sequential decision problem"""
sequential_decision_environment = GridMDP([[-0.02, -0.02, -0.02, +1],
                                           [-0.02, None, -0.02, -1],
                                           [-0.02, -0.02, -0.02, -0.02]],
                                          terminals=[(3, 2), (3, 1)])

# Value Iteration
value_iter = best_policy( sequential_decision_environment, 
                          value_iteration(sequential_decision_environment, .01)
                        )
print("\n Optimal Policy based on Value Iteration\n")
print_table(sequential_decision_environment.to_arrows(value_iter))

Value(state) for each state

Policy Iteration Algorithm

Policy Iteration Algorithm: Policy iteration is another way of obtaining optimal policies for an MDP, in which policy evaluation and policy improvement are applied iteratively until the solution converges to the optimal policy. The policy iteration algorithm consists of the following steps:

1. Initialize random policy π

    pi = {s: random.choice(mdp.actions(s)) for s in mdp.states}

2. Repeat the following until convergence:

  • Solve the Bellman equations for the current policy π to obtain V^π, using a system of linear equations (the code below approximates this with k sweeps of iterative evaluation, known as modified policy iteration):

    Use the current policy π (state → action) to update the current Value(state) grid:
    """Return an updated utility mapping U from each state in the MDP to its
    utility, using an approximation (modified policy iteration)"""
    def policy_evaluation(pi, STS, mdp, k=20):
        R, T, gamma = mdp.R, mdp.T, mdp.gamma
        for i in range(k):
            for s in mdp.states: # pi[s] ==> just 1 desired action, but T(s, pi[s]) return an available action list 
                STS[s] = R(s) + gamma * sum([p * STS[s1] for (p, s1) in T(s, pi[s])])   
        return STS
  • Update the policy as per the new value function, improving the policy by pretending the new value is the optimal value and using the argmax formula:
    π(s) := argmax_a Σ_s' P_sa(s')·V(s')
    """The expected utility of doing a in state s, according to the MDP and STS."""
    def expected_utility(a, s, STS, mdp):
        return sum([p * STS[s1] for (p, s1) in mdp.T(s, a)])
    
    """Solve an MDP by policy iteration"""
    def policy_iteration(mdp):
        STS = {s: 0 for s in mdp.states}
        #  Initialize random policy π
        pi = {s: random.choice(mdp.actions(s)) for s in mdp.states}
        while True:
            STS = policy_evaluation(pi, STS, mdp)
            unchanged = True
            for s in mdp.states:
                # mdp.actions(s) return an available action list
                a = argmax(mdp.actions(s),lambda a: expected_utility(a, s, STS, mdp))
                if a != pi[s]:
                    pi[s] = a
                    unchanged = False
            if unchanged:#########
                return pi

 3. By repeating these steps, both the value function and the policy converge to their optimal values.

#Policy Iteration
policy_iter = policy_iteration(sequential_decision_environment)
print("\n Optimal Policy based on Policy Iteration & Evaluation\n")
print_table(sequential_decision_environment.to_arrows(policy_iter))

     Policy iteration tends to do well on smaller problems. (Note the inner k-sweep evaluation loop; the outer loop keeps iterating until the policy π is unchanged, that is, until the action for every state is optimal and no longer changes.) If an MDP has an enormous number of states, policy iteration becomes computationally expensive. As a result, large MDPs tend to use value iteration rather than policy iteration.

When the transition probabilities are not known in advance, we need to estimate them from data by using the following simple formula:
     P_sa(s') = (number of times we took action a in state s and got to s') / (number of times we took action a in state s)

     If for some state-action pairs no data is available, which leads to a 0/0 problem, we can fall back to a default uniform distribution, P_sa(s') = 1/|S|.
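
     A minimal sketch of that estimation, assuming we have logged a list of (state, action, next_state) transitions (the log below is invented); unseen (state, action) pairs fall back to a uniform distribution over the states:

from collections import defaultdict

def estimate_transition_probs(transition_log, all_states):
    """P_sa(s') ~= #times (s, a) led to s'  /  #times (s, a) was taken."""
    counts = defaultdict(lambda: defaultdict(int))
    for s, a, s1 in transition_log:
        counts[(s, a)][s1] += 1

    P = {}
    for (s, a), next_counts in counts.items():
        total = sum(next_counts.values())
        P[(s, a)] = {s1: c / total for s1, c in next_counts.items()}

    def lookup(s, a, s1):
        if (s, a) in P:
            return P[(s, a)].get(s1, 0.0)
        return 1.0 / len(all_states)   # 0/0 case: default to a uniform distribution
    return lookup

# Hypothetical log of observed transitions
log = [('s0', 'N', 's1'), ('s0', 'N', 's1'), ('s0', 'N', 's0')]
P_hat = estimate_transition_probs(log, all_states=['s0', 's1'])
print(P_hat('s0', 'N', 's1'))   # 2/3
print(P_hat('s1', 'N', 's0'))   # unseen pair -> uniform 0.5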

Grid world example using value and policy iteration algorithms with basic Python

     The classic grid world example has been used to illustrate value and policy iteration with dynamic programming to solve the MDP's Bellman equations. In the following grid, the agent starts at the south-west corner of the grid, position (1,1), and the goal is to move towards the north-east corner, position (4,3). Once it reaches the goal, the agent gets a reward of +1. During the journey, it should avoid the danger zone (4,2), because moving into that cell incurs a penalty reward of -1. The agent cannot enter the position of the obstacle (2,2) from any direction. The goal and danger zones are the terminal states, which means the agent continues to move around until it reaches one of these two states. The reward for all the other states is -0.02. Here, the task is to determine the optimal policy (the direction to move) for the agent at every state (11 states altogether), so that the agent's total reward is maximized, or equivalently so that the agent reaches the goal as quickly as possible. The agent can move in 4 directions: north, south, east, and west.

     The complete code was written in the Python programming language with class implementation. For further reading, please refer to object oriented programming in Python to understand class, objects, constructors, and so on.

import random,operator

def vector_add(a, b):
    return tuple( map(operator.add, a, b) )
    
# (1, 0):'>',  (0, 1):'^',  (-1, 0):'<', (0, -1):'v', None: '.' 
orientations = [(1,0), (0, 1), (-1, 0), (0, -1)] # (1,0) :(col_move, row_move)

def turn_right(orientation):
    return orientations[orientations.index(orientation)-1] # index( orientation=(0,1) )=1 ==> 0 ==> (1,0):'>'

def turn_left(orientation):
    return orientations[(orientations.index(orientation)+1) % len(orientations)] # (1+1)%4 = 2 ==> (-1,0):'<'

def isnumber(x):
    return hasattr(x, '__int__')

   
"""A Markov Decision Process, defined by an init_pos_posial state, transition model,
and reward function. """

class MDP:

    def __init__(self, init_pos, actlist, terminals, transitions={}, states=None, gamma=0.99):
        if not (0 < gamma <= 1):
            raise ValueError("MDP should have 0 < gamma <= 1 values")

        if states:
            self.states = states
        else:
            self.states = set()
        self.init_pos = init_pos
        self.actlist = actlist         # action list
        self.terminals = terminals
        self.transitions = transitions
        self.gamma = gamma             # discount factor
        self.reward = {}

    """Returns a numeric reward for the state."""
    def R(self, state):
        return self.reward[state]

    """Transition model. From a state and an action, return a list of (probability, result-state) pairs"""
    def T(self, state, action):
        if(self.transitions == {}):
            raise ValueError("Transition model is missing")
        else:
            return self.transitions[state][action]

    """Set of actions that can be performed for a particular state"""
    def actions(self, state):
        if state in self.terminals:
            return [None]
        else:
            return self.actlist


# sequential_decision_environment = GridMDP([[-0.02, -0.02, -0.02, +1],
#                                            [-0.02, None, -0.02, -1],
#                                            [-0.02, -0.02, -0.02, -0.02]],
#                                           terminals=[(3, 2), (3, 1)])
        
"""A two-dimensional grid MDP"""
class GridMDP(MDP):

    def __init__(self, grid, terminals, init_pos=(0, 0), gamma=0.99):
        
        """ because we want row 0 on bottom, not on top """
        grid.reverse() # flip axis=0
#         [[-0.02, -0.02, -0.02, -0.02],
#          [-0.02, None, -0.02, -1],
#          [-0.02, -0.02, -0.02, 1]]
        
        MDP.__init__(self, init_pos, actlist=orientations,
                     terminals=terminals, gamma=gamma)
        self.grid = grid           # reward_table for each position(or each state)
        self.rows = len(grid)
        self.cols = len(grid[0])
        for c in range(self.cols):
            for r in range(self.rows):
                self.reward[c, r] = grid[r][c] # e.g. self.reward[c=3][r=2]=1 <== grid[r=2][c=3]=1
                if grid[r][c] is not None:
                    self.states.add((c, r)) # self.states = set() , append 

    def T(self, state, action): # State transition probabilities = P_s,a_(s') ==> return [ (P, new_state),...]
        if action is None:
            return [(0.0, state)]
        else:                                                  #    (3,1)+(0,1)==>(3,2)  # (0,1):(col_move, row_move)    
            return [(0.8, self.go(state, action)),             # P_s=(3,1),a_( s'=(3,2) ) # action(a: goes up) taken
                    (0.1, self.go(state, turn_right(action))), # P_s=(3,1),a_( s'=(4,1) ) # action(a: goes right) taken
                    (0.1, self.go(state, turn_left(action)))]  # P_s=(3,1),a_( s'=(2,1) ) # action(a: goes left) taken

    """Return the state that results from going in this direction."""
    def go(self, state, direction):
        state1 = vector_add(state, direction) # get new state(state1)
        return state1 if state1 in self.states else state # if state1 not in states, then state1 is an obstacle or wall

    """fill each state(x, y) with action_char ==> [[..., action_char, ...]...] state transition probability grid."""
    def to_grid(self, mapping):
        return list( reversed([ [ mapping.get( (x, y), None )# get(key) ==> value(here is direction char)
                                  for x in range(self.cols)
                                ]
                                for y in range(self.rows)
                             ])# reversed since previous grid did a reverse operation
                   )
                
    """Convert a mapping from state(x, y) to action into a [[..., action_char, ...]...] state transition probability grid."""
    def to_arrows(self, policy):
        chars = { (1, 0): '>', 
                  (0, 1): '^', 
                  (-1, 0): '<',
                  (0, -1): 'v',
                  None: '.'
                }
        return self.to_grid({ s: chars[a] 
                              for (s, a) in policy.items() # e.g. (1, 2): (1, 0) == current state: next action taken
                            })
    
   
"""Solving an MDP by value iteration and returns the optimum state values """
def value_iteration(mdp, epsilon=0.001):
    STSN = {s: 0 for s in mdp.states} # Initialize V(S) = 0 for all states S
    
    R, T, gamma = mdp.R, mdp.T, mdp.gamma # gamma=0.99
    while True:
        STS = STSN.copy()
        delta = 0
        
        for s in mdp.states:
            STSN[s] = R(s) + gamma * max( [ sum([ p * STS[s1]
                                                  for (p, s1) in T(s, a)
                                                ])
                                            for a in mdp.actions(s)
                                          ] )
            
            delta = max(delta, abs(STSN[s] - STS[s]))
        if delta < epsilon * (1 - gamma) / gamma:
            return STS

def argmax(seq, fn):
    best = seq[0];
    best_score = fn(best)
    for x in seq:
        x_score = fn(x)
        if x_score > best_score:
            best, best_score = x, x_score
    return best

"""Given an MDP and a utility function STS, determine the best policy,
as a mapping from state to action """
def best_policy(mdp, STS):
    pi = {}
    # iterate over all states (the earlier test-only restriction to {(2,0)} has been removed)
    for s in mdp.states: # e.g. s(col_idx, row_idx): ==> (0, 1): 0.8196979051091251
        # print(mdp.actions((2,0))) # [(1, 0), (0, 1), (-1, 0), (0, -1)] : ['>','^','<','V']
        # print(STS[(2,0)]) # 0.7086718796622142
        pi[s] = argmax(mdp.actions(s), lambda a: expected_utility(a, s, STS, mdp))
    return pi

"""The expected utility of doing a in state s, according to the MDP and STS."""
def expected_utility(a, s, STS, mdp):
    return sum([p * STS[s1] for (p, s1) in mdp.T(s, a)])


"""Solve an MDP by policy iteration"""
def policy_iteration(mdp):
    STS = {s: 0 for s in mdp.states}
    #  Initialize random policy π
    pi = {s: random.choice(mdp.actions(s)) for s in mdp.states}
    while True:
        STS = policy_evaluation(pi, STS, mdp)
        unchanged = True
        for s in mdp.states:
            # mdp.actions(s) return an available action list
            a = argmax(mdp.actions(s),lambda a: expected_utility(a, s, STS, mdp))
            if a != pi[s]:
                pi[s] = a
                unchanged = False
        if unchanged:
            return pi

"""Return an updated utility mapping U from each state in the MDP to its
utility, using an approximation (modified policy iteration)"""
def policy_evaluation(pi, STS, mdp, k=20):
    R, T, gamma = mdp.R, mdp.T, mdp.gamma
    for i in range(k):
        for s in mdp.states: # pi[s] ==> just 1 action, but T(s, pi[s]) return an action list 
            STS[s] = R(s) + gamma * sum([p * STS[s1] for (p, s1) in T(s, pi[s])])   
    return STS


def print_table(table, header=None, sep='   ', numfmt='{}'):
    justs = ['rjust' if isnumber(x) else 'ljust' for x in table[0]]
    if header:
        table.insert(0, header)
    table = [[ numfmt.format(x) if isnumber(x) else x 
                                for x in row
             ]
             for row in table]

    # each column max width or maximum of the number of chars in each column
    sizes = list( map( lambda seq: max( map(len, seq) ),
                       list( zip( *[ map(str, row) # convert 'None' to str
                                     for row in table # each row is a list element
                                   ]# * convert [map object, map object, map object] to map map map
                                )# zip column by column 
                           )# zip(* matrix.shape(3,4) ) : 3x4 ==> 4x3 # do 2D transpose operation 
                     )
                )# https://blog.csdn.net/Linli522362242/article/details/118349075
    # sizes : [1, 4, 1, 1] # for width for each column
        
    for row in table:
        print( sep.join( getattr(
                                  str(x_in_row), 
                                  j # call ljust or rjust built-in function and width =size, fillchar = blank
                                )(size)
                         for (j, size, x_in_row) in zip(justs, sizes, row)
                       )
             )



""" A 4x3 grid environment that presents the agent with a sequential decision problem"""
sequential_decision_environment = GridMDP([[-0.02, -0.02, -0.02, +1],
                                           [-0.02, None, -0.02, -1],
                                           [-0.02, -0.02, -0.02, -0.02]],
                                          terminals=[(3, 2), (3, 1)])

# Value Iteration
value_iter = best_policy( sequential_decision_environment, 
                          value_iteration(sequential_decision_environment, .01)
                        )
print("\n Optimal Policy based on Value Iteration\n")
print_table(sequential_decision_environment.to_arrows(value_iter))

#Policy Iteration
policy_iter = policy_iteration(sequential_decision_environment)
print("\n Optimal Policy based on Policy Iteration & Evaluation\n")
print_table(sequential_decision_environment.to_arrows(policy_iter))


     From the preceding two outputs, we can conclude that both value and policy iteration provide the same optimal policy for the agent to move across the grid and reach the goal state in the quickest way possible. When the problem size is large enough, it is computationally advisable to go for value iteration rather than policy iteration, as in policy iteration we need to perform two steps at every iteration: policy evaluation and policy improvement.

Monte Carlo methods

     Using Monte Carlo (MC) methods, we will compute the value functions first and then determine the optimal policies. In this method, we do not assume complete knowledge of the environment. MC methods require only experience, which consists of sample sequences of states, actions, and rewards from actual or simulated interaction with the environment. Learning from actual experience is striking because it requires no prior knowledge of the environment's dynamics (that is, we do not know 𝑝(𝑠′, 𝑟|𝑠, 𝑎), the state-transition probabilities of the environment, where r is the reward obtained in each state), yet it still attains optimal behavior. This is very similar to how humans or animals learn from actual experience rather than from any mathematical model. Surprisingly, in many cases it is easy to generate experience sampled according to the desired probability distributions, but infeasible to obtain the distributions in explicit form.

     Monte Carlo methods solve the reinforcement learning problem by averaging sample returns over each episode. This means we assume that experience is divided into episodes and that all episodes eventually terminate, no matter what actions are selected. Values are estimated and policies are changed only after the completion of each episode. MC methods are therefore incremental in an episode-by-episode sense, but not in a step-by-step (online) sense; we cover online, step-by-step learning in the Temporal difference learning section.

     Monte Carlo methods sample and average returns for each state-action pair over an episode. However, within the same episode, the return after taking an action in one state depends on the actions taken in later states. Because all the action selections are undergoing learning, the problem becomes non-stationary from the point of view of the earlier state. To handle this non-stationarity, we adapt the idea of policy iteration from dynamic programming: first we compute the value function for a fixed arbitrary policy, and then we improve the policy.

Monte Carlo prediction

     As we know, Monte Carlo methods predict the state-value function for a given policy. The value of any state is the expected return, that is, the expected cumulative future discounted reward, starting from that state. In MC methods, these values are estimated simply by averaging the returns observed after visits to that state. As more and more returns are observed, the average converges to the expected value by the law of large numbers. In fact, this is the principle underlying all Monte Carlo methods. The (first-visit) Monte Carlo policy evaluation algorithm consists of the following steps (a minimal Python sketch follows the outline):

1. Initialize:

  • π ← the policy to be evaluated
  • V ← an arbitrary state-value function
  • Returns(s) ← an empty list, for all states s

2. Repeat forever:

  • Generate an episode using π
  • For each state s appearing in the episode:
    • G ← the return following the first occurrence of s
    • Append G to Returns(s)
    • V(s) ← average(Returns(s))
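     Following the preceding outline, here is a minimal first-visit MC prediction sketch. The generate_episode(policy) helper (assumed to return a list of (state, reward) pairs for one episode) and the gamma argument are illustrative assumptions only, and are not part of the Blackjack code shown later in this section:

from collections import defaultdict

def mc_prediction( policy, generate_episode, nEpisodes, gamma=1.0 ):
    returns = defaultdict(list)   # Returns(s): observed returns per state
    V = defaultdict(float)        # V(s): current estimate, defaults to 0

    for _ in range(nEpisodes):
        # one episode as [(state, reward received after leaving that state), ...]
        episode = generate_episode(policy)
        states = [s for s, _ in episode]
        G = 0.0
        # walk backwards so G accumulates the discounted return
        for t in range(len(episode)-1, -1, -1):
            s, r = episode[t]
            G = gamma*G + r
            if s not in states[:t]:                       # first visit of s only
                returns[s].append(G)
                V[s] = sum(returns[s]) / len(returns[s])  # V(s) <- average(Returns(s))
    return V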

The suitability of Monte Carlo prediction on gridworld problems 

     The following diagram has been plotted for illustration purposes. However, in practice, Monte Carlo methods cannot easily be used to solve grid-world type problems, because termination is not guaranteed for all policies. If a policy were ever found that caused the agent to stay in the same state, the next episode would never end. Step-by-step learning methods such as SARSA (State-Action-Reward-State-Action), which we cover later in this chapter under TD learning control, do not have this problem, because they quickly learn during the episode that such policies are poor and switch to something else.

Modeling Blackjack example of Monte Carlo methods using Python

     The objective of the popular casino card game Blackjack is to obtain cards whose numerical values sum as close to 21 as possible without exceeding 21. All face cards (king, queen, and jack) count as 10, and an ace can count as either 1 or 11, depending on how the player wants to use it; only the ace has this flexibility. All other cards are valued at face value. The game begins with two cards dealt to both the dealer and the player. One of the dealer's cards is face up and the other is face down.

  • If the player has a 'natural 21' from the first two cards (an ace and a 10-valued card, A + 10 = 21), the player wins unless the dealer also has a natural, in which case the game is a draw.
  • If the player does not have a natural, he can ask for additional cards one by one (hit) until he either stops (sticks/stands) or exceeds 21 (goes bust). If the player goes bust, he loses.
  • If the player sticks (when standing, an ace is counted as the largest value that does not bust, e.g., A+9 counts as 20, A+4+8 as 13, A+3+A as 15), then it is the dealer's turn. The dealer hits or sticks according to a fixed strategy without choice: once all players have stopped taking cards, the dealer reveals the face-down card and keeps drawing until his sum is 17 or greater, counting an ace as the largest value that does not bust, and sticks otherwise.
    • If the dealer goes bust, the player automatically wins.
    • If the dealer sticks, the outcome is a win, loss, or draw, determined by whether the player's or the dealer's total is closer to 21.


     The Blackjack problem can be formulated as an episodic finite MDP, in which each game of Blackjack is one episode. Rewards of +1, -1, and 0 are given at the terminal state for winning, losing, and drawing, respectively; all other rewards within an episode are 0, and there is no discounting (gamma = 1). Therefore, the terminal reward is also the return for the game. We draw cards from an infinite deck (with replacement), so no traceable pattern exists. The entire game is modeled in Python in the following code.

     The following snippets of code take inspiration from Shangtong Zhang's Python code for RL, and are published in this book with permission from this student of Richard S. Sutton, the famous author of Reinforcement Learning: An Introduction (details provided in the Further reading section).

     The following packages are imported for array manipulation and visualization:

from __future__ import print_function
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

     At each turn, the player or dealer can take one of two possible actions: hit or stand:

ACTION_HIT = 0
ACTION_STAND = 1
actions = [ACTION_HIT, ACTION_STAND]

     The player's policy is modeled as an array of 22 values indexed by the current sum, since the player goes bust once the sum exceeds 21:

policyPlayer = np.zeros(22)# point_as_index:action

for i in range(12, 20): # 12 <= points <20
    policyPlayer[i] = ACTION_HIT # ask for additional cards

     The player sticks if he reaches a value of either 20 or 21; otherwise, he keeps hitting the deck to draw a new card:

# The player has taken the policy of stick if he gets a value of either 20 or 21
policyPlayer[20] = ACTION_STAND
policyPlayer[21] = ACTION_STAND

Function form of target policy of a player:

def targetPolicyPlayer( usableAcePlayer, playerSum, dealerCard ):
    return policyPlayer[playerSum]

Function form of behavior policy of a player:

     In general, if a random variable X follows a binomial distribution with parameters n and p, we write X ~ B(n, p). The probability of getting exactly k successes in n trials is given by the probability mass function P(X = k) = C(n, k) p^k (1 − p)^(n−k). Here, np.random.binomial(1, 0.5) performs a single trial and therefore returns 0 or 1 with equal probability, which is used below to pick an action at random:

def behaviorPolicyPlayer( usableAcePlayer, playerSum, dealerCard ):
    if np.random.binomial( 1, 0.5 ) == 1:
        return ACTION_STAND
    #else:
    return ACTION_HIT

     The dealer's fixed policy is to keep hitting until his sum reaches 17, and to stick on any sum from 17 to 21 (an on-policy lookup table):

policyDealer = np.zeros(22)

for i in range(12, 17):
    policyDealer[i] = ACTION_HIT # ask for additional cards

for i in range(17, 22):
    policyDealer[i] = ACTION_STAND

The following function is used for drawing a new card from the deck with replacement:

def getCard():
    card = np.random.randint(1,14) # range : [1,14)
    card = min( card, 10 ) # range : [1,10]
    return card

Let's play the game!

At the start of a round, the dealer deals each player two cards face up, and then deals himself two cards: one face up and one face down.

def play( policyPlayerFn, initialState=None, initialAction=None ):
    playerSum = 0           # sum of player
    playerTrajectory = []   # trajectory of player # [...[action, (usableAcePlayer, playerSum, dealerCard1)]...]
    usableAcePlayer = False #  whether player uses Ace as 11
    
    # dealer status
    dealerCard1 = 0
    dealerCard2 = 0
    usableAceDealer = False
    
    if initialState is None:
        # generate a random initial state
        
        numOfAce = 0
        
        # Only one player is considered here, and 
        # only the case where the total points of the first 2 cards exceeds 11 points
        # initialize cards of player
        while playerSum < 12:
            # if sum of player is less than 12, always hit
            card = getCard() # range : [1,10]
            
            # if get an Ace, use it as 11
            if card == 1:
                numOfAce +=1
                card = 11
                usableAcePlayer = True
            playerSum += card
            
        # If the player's sum is > 21, he must hold at least one ace, ##########
        # but two aces are also possible. In that case, he will use ace as 1 rather than 11.
        # If the player has only one ace, then he does not have a usable ace any more:   
        if playerSum > 21:
            playerSum -= 10 # use the Ace as 1 rather than 11
        
            if numOfAce == 1:
                # if the player only has one Ace, then he doesn't have usable Ace any more
                usableAcePlayer = False
        # initialize cards of dealer, suppose dealer will show the first card he gets
        dealerCard1 = getCard()
        dealerCard2 = getCard()
    else:
        # use specified initial state
        usableAcePlayer = initialState[0] # False or True
        playerSum = initialState[1]
        
        dealerCard1 = initialState[2]
        dealerCard2 = getCard()
        
    # initial state of the game
    state = [usableAcePlayer, playerSum, dealerCard1]
    
    # initialize dealer's sum
    dealerSum = 0
    if dealerCard1 == 1 and dealerCard2 !=1: # 1 : 'ace'
        dealerSum += ( 11 + dealerCard2 )
        usableAceDealer = True
    elif dealerCard1 !=1 and dealerCard2 ==1:
        dealerSum += ( 11 + dealerCard1 )
        usableAceDealer = True
    elif dealerCard1 ==1 and dealerCard2 ==1:
        dealerSum += ( 11 + 1 )
        usableAceDealer = True
    else:
        dealerSum += dealerCard1 + dealerCard2 # note <=20 since getCard() <=10
    
    # The game starts from here, 
    # as the player needs to draw extra cards from here onwards:
    # player's turn
    while True:
        if initialAction is not None:
            action = initialAction
            initialAction = None
        else:
            # get action based on the current sum of a player:
            action = policyPlayerFn( usableAcePlayer, playerSum, dealerCard1 )
            
        # track player's trajectory for importance sampling
        playerTrajectory.append([ action, 
                                  (usableAcePlayer, playerSum, dealerCard1)
                                ])
        
        if action == ACTION_STAND:
            break ################
        # elif hit, get new card
        playerSum += getCard() # only consider playerSum >= 12, then continue getCard(), and 'Ace' ==1
        
        # The player busts here if the total sum is greater than 21: the game ends and he gets a reward of -1.
        # However, if he has a usable ace at his disposal, he can use it as 1 to save the game; otherwise he loses.
        if playerSum > 21:
            # if player has a usable Ace, use it as 1 to avoid busting and continue
            if usableAcePlayer == True:
                playerSum -= 10
                usableAcePlayer = False
            else:
                return state, -1, playerTrajectory
    
    # Only when the player stops taking the cards, the dealer starts
    # Now it's the dealer's turn. 
    # He will draw cards based on a sum: 
    # if he reaches 17, he will stop, otherwise keep on drawing cards. 
    # If the dealer also has a usable ace, he can use it as 1 to avoid busting; otherwise, he goes bust
    while True:
        # get action based on current sum
        action = policyDealer[dealerSum]
        if action == ACTION_STAND:
            break ################
        # elif hit, get a new card
        dealerSum += getCard()# only consider dealSum >= 12, then continue getCard(), and 'Ace' ==1
        # dealer busts
        if dealerSum > 21:
            # if dealer has a usable Ace, use it as 1 to avoid busting and continue
            if usableAceDealer == True:
                dealerSum -= 10
                usableAceDealer = False
            else:
                # otherwise dealer loses
                return state, -1, playerTrajectory
    
    # both player and dealer chose to stand
    # compare the sum between player and dealer
    if playerSum > dealerSum:
        return state, 1, playerTrajectory
    elif playerSum == dealerSum:
        return state, 0, playerTrajectory
    else:
        return state, -1, playerTrajectory

The following code illustrates the Monte Carlo sample with On-Policy:

# Monte Carlo Sample with On-Policy
def monteCarloOnPolicy( nEpisodes ):
    statesUsableAceRewards = np.zeros( (10,10) ) # 21>=playerSum >=12  ==> gap = 10 
    statesNoUsableAceRewards = np.zeros( (10,10) )
    
    # initialize counts to 1 to avoid division by 0
    statesUsableAceCount = np.ones( (10,10) )
    statesNoUsableAceCount = np.ones( (10,10) )
    
    for i in range(0, nEpisodes):
        # def targetPolicyPlayer( usableAcePlayer, playerSum, dealerCard ):
        #     return policyPlayer[playerSum] 
        # policyPlayer[points <20] = ACTION_HIT=0,  policyPlayer[points=20 or 21] = ACTION_STAND =1
        state, reward, _ = play( targetPolicyPlayer ) # note 21>=playerSum >=12  ==> gap = 10 
        # state = [usableAcePlayer, playerSum, dealerCard1]
        state[1] -= 12 # for 2D grid states... row_index start from 0 # playerSum
        state[2] -= 1  # for 2D grid states... col_index start from 0 # dealercard1
        
        if state[0]:# usableAce : Whether to include Ace
            statesUsableAceCount[ state[1], state[2] ] += 1
            statesUsableAceRewards[ state[1], state[2] ] += reward
        else:
            statesNoUsableAceCount[ state[1], state[2] ] +=1
            statesNoUsableAceRewards[ state[1], state[2] ] += reward
            
    return statesUsableAceRewards / statesUsableAceCount,\
            statesNoUsableAceRewards / statesNoUsableAceCount

     The following code implements Monte Carlo with Exploring Starts, in which all the returns for each state-action pair are accumulated and averaged, irrespective of what policy was in force when they were observed:

     Given an action-value function q(s, a), we can generate a greedy (deterministic) policy as follows: π(s) = argmax_a q(s, a).

# Monte Carlo with Exploring Starts
def monteCarloES( nEpisodes ):
    # (playerSum, dealerCard1, usableAce, action).
    stateActionValues = np.zeros( (10,10,2,2) )
    # initialize counts to 1 to avoid division by 0
    stateActionPairCount = np.ones( (10,10,2,2) )
    
    # Behavior policy is greedy, which gets argmax of the average returns (s, a)
    def behaviorPolicy( usableAce, playerSum, dealerCard1 ):
        usableAce = int( usableAce )
        playerSum -= 12 # for 2D grid states... row_index start from 0
        dealerCard1 -= 1 # for 2D grid states... col_index start from 0
        
        # get argmax of the average returns(s, a)
        return np.argmax( stateActionValues[playerSum, dealerCard1, usableAce, :]
                            / stateActionPairCount[playerSum, dealerCard1, usableAce, :]
                        )
    
    # Play continues for several episodes and, at each episode, randomly initialized state, action,
    # and update values of state-action pairs:
    for episode in range( nEpisodes ):
        if episode % 1000 ==0:
            print( 'episode:', episode )
        # for each episode, use a randomly initialized state and action
        initialState = [ bool( np.random.choice([0,1]) ), # usableAcePlayer # Whether to include Ace
                         np.random.choice( range(12,22) ),# only consider playerSum >= 12
                         np.random.choice( range(1,11) )  # dealerCard1
                       ]
        # actions = [ACTION_HIT, ACTION_STAND]
        initialAction = np.random.choice( actions )
        _, reward, trajectory = play( behaviorPolicy, #return the best action_index
                                      initialState, initialAction )
        
        for action, ( usableAce, playerSum, dealerCard1 ) in trajectory:
            usableAce = int( usableAce )
            playerSum -= 12
            dealerCard1 -= 1
            # update values of state-action pairs
            stateActionValues[playerSum, dealerCard1, usableAce, action] += reward #################
            stateActionPairCount[playerSum, dealerCard1, usableAce, action] += 1   #################
            
    return stateActionValues / stateActionPairCount

Print the state value:

# Print the state value
# figureIndex = 0
def prettyPrint( fig, data, title, zlabel='reward' ):
    global figureIndex

    figureIndex += 1
    ax = fig.add_subplot(2,2, figureIndex, projection='3d')
    # or fig.suptitle( title )
    ax.set_title( title )
    
    x_axis = []
    y_axis = []
    z_axis = []
    for i in range(12,22): # # only consider playerSum >= 12
        for j in range(1,11): # dealerCard1
            x_axis.append(i)
            y_axis.append(j)
            z_axis.append( data[i-12, j-1] )
            
    ax.scatter( x_axis, y_axis, z_axis, c='blue' )
    ax.set_xlabel('player sum', fontsize=13)
    ax.set_ylabel('dealercard1 showing', fontsize=13)
    ax.set_zlabel( zlabel, fontsize=13)    

On-Policy results with or without a usable ace for 10,000 and 500,000 episodes:

# On-Policy results
def onPolicy(main_title):
    statesUsableAce1, statesNoUsableAce1 = monteCarloOnPolicy( 10000 )
    statesUsableAce2, statesNoUsableAce2 = monteCarloOnPolicy( 500000)
    
    fig = plt.figure( figsize=(15,14) )
    fig.suptitle( main_title,
               c='r', 
               fontdict={'fontsize':25,
                         'family':'serif',
                        },#fontsize='xxx-large', #
            
               fontweight='bold',  # or 'heavy', 'bold', 'normal'
               fontstyle='italic' ) # rotation=45, bbox=dict(facecolor='y', edgecolor='blue', alpha=0.65 )
    
    prettyPrint( fig, statesUsableAce1, 'Usable Ace & 10000 Episodes' )
    prettyPrint( fig, statesNoUsableAce1, 'No Usable Ace & 10000 Episodes' )
    
    prettyPrint( fig, statesUsableAce2, 'Usable Ace & 500000 Episodes' )
    prettyPrint( fig, statesNoUsableAce2, 'No Usable Ace & 500000 Episodes' )

    plt.show()
figureIndex = 0
main_title = 'Approximate state-value functions for Blackjack policy that sticks\n only on 20 or 21, computed by MC Policy Evaluation'
onPolicy( main_title )

 

     From the preceding diagram, we can conclude that a usable ace in the hand gives much higher rewards even at low playerSum combinations, whereas for a player without a usable ace, the rewards are noticeably lower whenever the sum is less than 20.

Monte Carlo control (Exploring Starts) to obtain the optimized policy:

# Optimized or Monte Carlo Control
def MC_ES_optimalPolicy(main_title):
    stateActionValues = monteCarloES( 500000 )
    stateValueUsableAce = np.zeros( (10,10) )
    stateValueNoUsableAce = np.zeros( (10,10) )
    
    # get the optimal policy
    actionUsableAce = np.zeros( (10,10), dtype='int' )
    actionNoUsableAce = np.zeros( (10,10), dtype='int' )
    
    for i in range(10):
        for j in range(10):
            # stateActionValues[playerSum, dealerCard1, usableAce, action]
            stateValueNoUsableAce[i,j] = np.max( stateActionValues[i,j, 0, :] )
            stateValueUsableAce[i,j] = np.max( stateActionValues[i,j, 1, :] )
            
            actionNoUsableAce[i,j] = np.argmax( stateActionValues[i,j, 0, :])
            actionUsableAce[i,j] = np.argmax( stateActionValues[i,j, 1, :])
            
    fig = plt.figure( figsize=(15,14) )
    fig.suptitle( main_title,
               c='r', 
               fontdict={'fontsize':25,
                         'family':'serif',
                        },#fontsize='xxx-large', #
            
               fontweight='bold',  # or 'heavy', 'bold', 'normal'
               fontstyle='italic' ) # rotation=45, bbox=dict(facecolor='y', edgecolor='blue', alpha=0.65 )
    
    prettyPrint( fig, stateValueUsableAce, 'Optimal state value with usable Ace' )
    prettyPrint( fig, stateValueNoUsableAce, 'Optimal state value without usable Ace' )
    
    prettyPrint( fig, actionUsableAce, 'Optimal policy with usable Ace', 'Action (0 Hit, 1 Stick)' )
    prettyPrint( fig, actionNoUsableAce, 'Optimal policy without usable Ace', 'Action( 0 Hit, 1 Stick)' )
    
    plt.show()

# Run Monte Carlo Control or Explored starts

figureIndex = 0
main_title = 'Optimal policy & state-value functions for Blackjack by Monte Carlo ES (Exploring Starts)'
MC_ES_optimalPolicy(main_title)


... ...
     From the optimal policies and state values, we can conclude that, with a usable ace at our disposal, we hit more often than we stick (more data points are distributed at the bottom), and also that the state values for rewards are much higher than when there is no usable ace in the hand (for low playerSum). Though these conclusions are intuitive, the plots show the magnitude of the impact of holding an ace.

Temporal difference learning

     Temporal Difference (TD) learning is the central and novel theme of reinforcement learning. TD learning combines Monte Carlo (MC) and Dynamic Programming (DP) ideas. Like Monte Carlo methods, TD methods can learn directly from experience without a model of the environment. Like Dynamic Programming, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome, unlike MC methods, in which estimates are updated only after the final outcome is reached.

TD prediction

     Both TD and MC use experience to solve the prediction problem. Given some policy π, both methods update their estimate V of v_π for the non-terminal states S_t occurring in that experience. Monte Carlo methods wait until the return following the visit is known, and then use that return as the target for V(S_t):

V(S_t) ← V(S_t) + α[G_t − V(S_t)]

  • G_t is the actual return following time t (it is used as the target return to update the estimated value V(S_t))
  • G_t − V(S_t) is a correction term added to our current estimate of the value V(S_t)
  • α is the constant step-size parameter (the learning rate, e.g. 0.01)

     The preceding method can be called constant-α MC, where MC must wait until the end of the episode to determine the increment to V(S_t) (only then is G_t known). To clarify this, we can rename the actual return G_t to G_{t:T}, where the subscript t:T indicates that this is the return obtained at time step t while considering all the events that occurred from time step t until the final time step T.

     TD methods need to wait only until the next time step. At time t+1, they immediately form a target and make a useful update using the observed reward R_{t+1} and the estimate V(S_{t+1}). The simplest TD method, known as TD(0), is:

V(S_t) ← V(S_t) + α[R_{t+1} + γV(S_{t+1}) − V(S_t)]

     The target for the MC update is G_t (or G_{t:T}), whereas the target for the TD update is R_{t+1} + γV(S_{t+1}).

     In the following diagram, a comparison is made between TD and MC methods. As written in the TD(0) equation, we use one step of real data and then the estimated value function of the next state. Similarly, we could use two steps of real data together with the estimated value function of the third state to get a better picture of reality. However, as we increase the number of steps, we need more and more data before each parameter update, and the updates take longer. When we take steps all the way to the terminal state before updating the parameters in each episode, TD becomes the Monte Carlo method.

     The TD(0) algorithm for estimating v_π consists of the following steps (a minimal Python sketch follows this outline):

  • Initialize:
    • Input: the policy π to be evaluated
    • Initialize V(s) arbitrarily for all states (for example, V(s) = 0), with V(terminal) = 0

  • Repeat (for each episode):
    • Initialize S
    • Repeat (for each step of the episode):
      • A ← action given by π for S
      • Take action A, observe R, S'
      • V(S) ← V(S) + α[R + γV(S') − V(S)]
      • S ← S'
    • Until S is terminal.
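     The following is a minimal tabular TD(0) sketch of the steps above. The env.reset()/env.step() interface and the policy(state) function are assumptions made for illustration only, and are not part of the examples in this chapter:

from collections import defaultdict

def td0_prediction( env, policy, nEpisodes, alpha=0.01, gamma=1.0 ):
    V = defaultdict(float)                  # V(s) initialized to 0 for every state
    for _ in range(nEpisodes):
        s = env.reset()                     # Initialize S
        done = False
        while not done:                     # Repeat for each step of the episode
            a = policy(s)                   # A <- action given by pi for S
            s_next, r, done = env.step(a)   # Take action A, observe R, S'
            # TD(0) update: V(S) <- V(S) + alpha*[ R + gamma*V(S') - V(S) ]
            # (V[s_next] defaults to 0, which matches V(terminal) = 0)
            V[s] += alpha * ( r + gamma*V[s_next] - V[s] )
            s = s_next                      # S <- S'
    return V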

Driving to the office example for TD learning

     In this simple example, you travel from home to the office every day and try to predict how long it will take to get to the office in the morning. When you leave home, you note the time, the day of the week, the weather (whether it is rainy, windy, and so on), and any other parameter you feel is relevant. For example, on Monday morning you leave at exactly 8 a.m. and estimate it will take 40 minutes to reach the office. At 8:10 a.m. you notice that a VIP is passing, and you have to wait until the complete convoy has moved on, so you re-estimate that it will take 45 minutes from then, or a total of 55 minutes. 15 minutes later you have completed the highway portion of your journey in good time. Now, as you enter a bypass road, you reduce your estimate of the total travel time to 50 minutes. Unfortunately, at this point you get stuck behind a line of bullock carts, and the road is too narrow to pass. You end up following those bullock carts until you turn onto the side street where your office is located, at 8:50. Seven minutes later, you reach your office parking. The sequence of states, times, and predictions is as follows:

     Rewards in this example are the elapsed times for each leg of the journey, and we use a discount factor gamma = 1, so the return from each state is the actual time to go from that state to the destination (the office). The value of each state is the predicted time to go, which is the second column in the preceding table, also known as the current estimated value for each state encountered.
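     For illustration, using the leg durations implied by the narrative above (10, 15, 25, and 7 minutes) and γ = 1, the return from each state is simply the sum of the remaining travel times:

Return(leaving home, 8:00)              = 10 + 15 + 25 + 7 = 57 minutes
Return(VIP convoy passes, 8:10)         = 15 + 25 + 7      = 47 minutes
Return(exiting the highway, 8:25)       = 25 + 7           = 32 minutes
Return(turning onto the side street, 8:50) = 7 minutes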

In the previous diagram,

  • Monte Carlo is used to plot the predicted total time over the sequence of events. The arrows show the changes in predictions recommended by the constant-α MC method; these are the errors between the estimated value at each stage and the actual return (57 minutes). In the MC method, learning happens only after the episode finishes, so it has to wait until the full 57 minutes have passed. However,
  • in reality, you can estimate before reaching the final outcome and correct your estimates accordingly. TD works on the same principle: at every stage it tries to predict and corrects its estimates accordingly. So TD methods learn immediately and do not need to wait for the final outcome. In fact, this is how humans predict in real life. Because of these many positive properties, TD learning is considered the novel contribution of reinforcement learning.

SARSA on-policy TD control

      The state of the agent, as illustrated in the previous figure, is the set of all of its variables (1). For example, in the case of a robot drone, these variables could include the drone's current position (longitude, latitude, and altitude), the drone's remaining battery life, the speed of each fan, and so forth. At each time step, the agent interacts with the environment through a set of available actions (2). Based on the action A_t taken by the agent while it is at state S_t, the agent receives a reward signal R_{t+1} (3), and its state becomes S_{t+1} (4).

     State-action-reward-state-action (SARSA) is an on-policy TD control method, in which the policy is optimized using generalized policy iteration (GPI), with TD methods used only for the evaluation (prediction) step. In the first step, the algorithm learns an action-value function rather than a state-value function. In particular, for an on-policy method we estimate q_π(s, a), the true action-value, for the current behavior policy π and for all states s and actions a, using essentially the same TD method used above for learning v_π, the true state-value.

     Now, we consider transitions from state-action pair to state-action pair, and learn the values of state-action pairs:

Q(S_t, A_t) ← Q(S_t, A_t) + α[R_{t+1} + γQ(S_{t+1}, A_{t+1}) − Q(S_t, A_t)]

     This update is done after every transition from a non-terminal state S_t. If S_{t+1} is terminal, then Q(S_{t+1}, A_{t+1}) is defined as zero. This rule uses every element of the quintuple of events (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}), which makes up a transition from one state-action pair to the next. This quintuple gives rise to the name SARSA for the algorithm.

     As in all on-policy methods, we continually estimate q_π for the behavior policy π and, at the same time, change π toward greediness with respect to q_π. The SARSA algorithm is given as follows (a minimal sketch of one episode is shown after the list):

1. Initialize Q(s, a) arbitrarily for all states s and actions a, with Q(terminal, ·) = 0.

2. Repeat (for each episode):

  • Initialize S
  • Choose A from S using the policy derived from Q (for example, ε-greedy)
  • Repeat (for each step of episode):
    • Take action A, observe R, S'
    • Choose A' from S' using the policy derived from Q (for example, ε-greedy)
    • Q(S, A) ← Q(S, A) + α[R + γQ(S', A') − Q(S, A)]
    • S ← S'; A ← A'

3. Until S is terminal
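     Expressed as code, one SARSA episode looks roughly like the following sketch. Here Q is assumed to be, for example, a defaultdict(float) keyed by (state, action), and env and epsilon_greedy are assumed helpers for illustration only; they are not the cliff-walking code given later in this chapter:

def sarsa_episode( Q, env, epsilon_greedy, alpha=0.5, gamma=1.0 ):
    s = env.reset()
    a = epsilon_greedy(Q, s)                       # choose A from S
    done = False
    while not done:
        s_next, r, done = env.step(a)              # take action A, observe R, S'
        a_next = epsilon_greedy(Q, s_next)         # choose A' from S' (on-policy)
        target = 0.0 if done else Q[s_next, a_next]   # Q(terminal, .) = 0
        # Q(S,A) <- Q(S,A) + alpha*[ R + gamma*Q(S',A') - Q(S,A) ]
        Q[s, a] += alpha * ( r + gamma*target - Q[s, a] )
        s, a = s_next, a_next
    return Q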

Q-learning - off-policy TD control 

     Q-learning is the most popular method used in practice for many reinforcement learning problems. This off-policy TD control algorithm is known as Q-learning. In this case, the learned action-value function Q directly approximates q*, the optimal action-value function, independent of the policy being followed. This approximation simplifies the analysis of the algorithm and enabled early convergence proofs. The policy still has an effect, in that it determines which state-action pairs are visited and updated; however, all that is required for correct convergence is that all state-action pairs continue to be updated. As we know, this is a minimal requirement, in the sense that any method guaranteed to find optimal behavior in the general case must require it. The algorithm consists of the following steps (a minimal sketch of one episode is shown after the list):

1. Initialize Q(s, a) arbitrarily for all states s and actions a, with Q(terminal, ·) = 0.

2. Repeat (for each episode):

  • Initialize S
  • Repeat (for each step of episode):
    • Choose A from S using the policy derived from Q (for example, ε-greedy)
    • Take action A, observe R, S'
    • Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S', a) − Q(S, A)]
    • S ← S'

3. Until S is terminal
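     The corresponding sketch of one Q-learning episode is given below, to contrast with the SARSA sketch above. Q, env, epsilon_greedy, and actions are again assumed helpers/structures for illustration only, not the cliff-walking code that follows; the key difference is that the target uses the maximum over actions in S', regardless of which action is actually taken next:

def q_learning_episode( Q, env, epsilon_greedy, actions, alpha=0.5, gamma=1.0 ):
    s = env.reset()
    done = False
    while not done:
        a = epsilon_greedy(Q, s)                   # behavior policy: epsilon-greedy
        s_next, r, done = env.step(a)              # take action A, observe R, S'
        # off-policy target: greedy with respect to Q in S'
        best_next = 0.0 if done else max( Q[s_next, a2] for a2 in actions )
        # Q(S,A) <- Q(S,A) + alpha*[ R + gamma*max_a Q(S',a) - Q(S,A) ]
        Q[s, a] += alpha * ( r + gamma*best_next - Q[s, a] )
        s = s_next
    return Q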

Cliff walking example of on-policy and off-policy of TD control

     A cliff-walking grid-world example is used to compare SARSA and Q-learning and to highlight the differences between on-policy (SARSA) and off-policy (Q-learning) methods. This is a standard undiscounted, episodic task with a start state and a goal state, and with movements permitted in four directions (north, south, east, and west). A reward of -1 is used for all transitions except those into the region marked The Cliff; stepping into this region penalizes the agent with a reward of -100 and instantly sends the agent back to the start position.

     The following snippets of code have taken inspiration from Shangtong Zhang's Python codes for RL and are published in this book with permission from the student of Richard S. Sutton, the famous author of Reinforcement Learning: An Introduction (details provided in the Further reading section):

# Cliff-Walking - TD Learning - SARSA & Q-Learning
from __future__ import print_function
import numpy as np
import matplotlib.pyplot as plt

# Grid dimensions
GRID_HEIGHT = 4
GRID_WIDTH = 12

# probability for exploration, step size,gamma 
EPSILON = 0.1
ALPHA = 0.5   # learning rate or step size
GAMMA = 1     # discount factor

# all possible actions
ACTION_UP = 0; ACTION_DOWN = 1; ACTION_LEFT = 2; ACTION_RIGHT = 3
actions = [ACTION_UP, ACTION_DOWN, ACTION_LEFT, ACTION_RIGHT]

# initial state action pair values
stateActionValues = np.zeros( (GRID_HEIGHT, GRID_WIDTH, 4) )
startState = [3,0] # row_index or col_index start from 0
goalState = [3,11]

# reward for each action in each state
actionRewards = np.zeros( (GRID_HEIGHT, GRID_WIDTH, 4) )
actionRewards[:,:,:] = -1.0
             # 2: row_index   
actionRewards[2, 1:11, ACTION_DOWN] = -100.0
actionRewards[3, 0, ACTION_RIGHT] = -100.0

# set up destinations for each action in each state
actionDestination = []
for i in range( 0, GRID_HEIGHT ):
    actionDestination.append([]) # 
    for j in range( 0, GRID_WIDTH ):
        destination = dict() # { ACTION_UP:[state_r, state_c], ACTION_LEFT:[state_r, state_c],... }
        destination[ACTION_UP] = [ max(i-1, 0), j ]
        destination[ACTION_LEFT] = [ i, max(j-1, 0) ]
        destination[ACTION_RIGHT] =[ i, min(j+1, GRID_WIDTH-1) ]
        if i == 2 and 1 <= j <= 10:
            destination[ACTION_DOWN] = startState
        else:
            destination[ACTION_DOWN] = [ min(i+1, GRID_HEIGHT-1), j ]
        actionDestination[-1].append( destination ) 
        #[ ...,
        #  [...,{ ACTION_UP:[state_r, state_c],ACTION_LEFT:[state_r, state_c],... },...],
        #  ...
        #] dimensions: GRID_HEIGHT x GRID_WIDTH
        
actionDestination[3][0][ACTION_RIGHT] = startState
  • ε-greedy policy (ε is epsilon): at each step the agent acts randomly with probability ε (i.e., np.random.choice( actions )), or greedily with probability 1 − ε (choosing the action with the highest Q-value).
  • Given an action-value function q(s, a), we can generate a greedy (deterministic) policy as π(s) = argmax_a q(s, a).
# choose an action based on epsilon greedy algorithm
#                        state-action pairs
def chooseAction( state, stateActionValues ):
    if np.random.binomial( 1, EPSILON ) ==1: # EPSILON = 0.1 # probability for exploration
        return np.random.choice( actions )
    else:
        return np.argmax( stateActionValues[ state[0], state[1], :] )#==>current state best action

     Alternatively, rather than relying only on chance for exploration, another approach is to encourage the exploration policy to try actions that it has not tried much before. This can be implemented as a bonus added to the Q-value estimates, as shown in Equation 18-6.

Equation 18-6. Q-learning using an exploration function

Q(s, a) ← Q(s, a) + α[r + γ · max_{a′} f(Q(s′, a′), N(s′, a′)) − Q(s, a)]

In this equation:

  • N(s′, a′) counts the number of times the action a′ was chosen in state s′.
  • f(Q, N) is an exploration function, such as f(Q, N) = Q + κ/(1 + N), where κ is a curiosity hyperparameter that measures how much the agent is attracted to the unknown.

A sketch of how such a bonus could be plugged into the tabular Q-learning update is given below.
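     The following is a minimal sketch of a single Q-learning step with this exploration bonus. The tables Q and N (assumed to be defaultdicts keyed by (state, action)), the constant KAPPA, and the helper names are illustrative assumptions rather than part of the cliff-walking code in this section:

KAPPA = 1.0   # curiosity hyperparameter (assumed value for illustration)

def exploration_f( q_value, visit_count, kappa=KAPPA ):
    # f(Q, N) = Q + kappa/(1 + N): the bonus shrinks as (s, a) is visited more often
    return q_value + kappa / ( 1.0 + visit_count )

def q_learning_step_with_bonus( Q, N, s, a, r, s_next, actions, alpha=0.5, gamma=1.0 ):
    N[s, a] += 1                                   # count the choice of action a in state s
    # greedy target computed on the exploration-adjusted values of the next state
    target = r + gamma * max( exploration_f( Q[s_next, a2], N[s_next, a2] )
                              for a2 in actions )
    Q[s, a] += alpha * ( target - Q[s, a] )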

The SARSA update is implemented as follows:

# SARSA update
def sarsa( stateActionValues, expected=False, stepSize=ALPHA ):
    # SA ###
    currentState = startState # startState = [3,0] # row_index or col_index start from 0
    currentAction = chooseAction( currentState, stateActionValues )# maybe the best action in current state
    rewards = 0.0
    
    while currentState != goalState: # goalState = [3,11]
        # RSA ###
        # actionDestination
        #[ ...,
        #  [...,{ ACTION_UP:[state_r, state_c],ACTION_LEFT:[state_r, state_c],... },...],
        #  ...
        #] dimensions: GRID_HEIGHT x GRID_WIDTH
        reward = actionRewards[ currentState[0], currentState[1], currentAction ]
        rewards += reward
        
        newState = actionDestination[ currentState[0] ][ currentState[1] ][currentAction]
        newAction = chooseAction( newState, stateActionValues )
        # Q(S_t+1, A_t+1)
        if not expected:
            # stateActionValues = np.zeros( (GRID_HEIGHT, GRID_WIDTH, 4) )
            valueTarget = stateActionValues[ newState[0], newState[1], newAction ]
        else:
            valueTarget = 0.0
            actionValues = stateActionValues[ newState[0], newState[1], : ]
            bestActions = np.argwhere( actionValues == np.max( actionValues ) )
            # actions = [ACTION_UP, ACTION_DOWN, ACTION_LEFT, ACTION_RIGHT]
            for action in actions:
                if action in bestActions:
                    # Q(S_t+1, A_t+1)
                    valueTarget += ( (1.0-EPSILON)/len(bestActions) + 
                                      EPSILON/len(actions)
                                   )* stateActionValues[newState[0],
                                                        newState[1],
                                                        action
                                                       ]
                                   
                else:
                    valueTarget += EPSILON/len(actions)*stateActionValues[newState[0],
                                                                          newState[1],
                                                                          action
                                                                         ]
                
        valueTarget *= GAMMA # discount factor
        stateActionValues[currentState[0], currentState[1], currentAction] += stepSize * (
            reward + valueTarget - stateActionValues[currentState[0], currentState[1], currentAction]
        )
        currentState = newState
        currentAction = newAction
    # end while loop
    return rewards

The Q-learning update is implemented as follows:

# Q-Learning update
def qLearning( stateActionValues, stepSize=ALPHA ):
    # S
    currentState = startState
    rewards = 0.0
    
    while currentState != goalState:
        # ARSA
        currentAction = chooseAction(currentState, stateActionValues)#######
        reward = actionRewards[ currentState[0], currentState[1], currentAction ]
        rewards += reward
        newState = actionDestination[ currentState[0] ][ currentState[1] ][currentAction]
        
        stateActionValues[ currentState[0], currentState[1], currentAction ] += stepSize * (
            reward + GAMMA*np.max( stateActionValues[newState[0], newState[1], :] ) - 
            stateActionValues[ currentState[0], currentState[1], currentAction ]
        )
        currentState = newState
    return rewards

The following function prints the learned (greedy) policy:

# print optimal policy
def printOptimalPolicy( stateActionValues ):
    optimalPolicy = []
    for i in range( 0, GRID_HEIGHT ):
        optimalPolicy.append([])
        for j in range(0, GRID_WIDTH):
            if [i,j] == goalState: # goalState = [3,11]
                optimalPolicy[-1].append('G')
                continue
            bestAction = np.argmax( stateActionValues[i,j,:] )
            if bestAction == ACTION_UP:
                optimalPolicy[-1].append('^')
            elif bestAction == ACTION_DOWN:
                optimalPolicy[-1].append('v')
            elif bestAction == ACTION_LEFT:
                optimalPolicy[-1].append('<')
            elif bestAction == ACTION_RIGHT:
                optimalPolicy[-1].append('>')
    for row in optimalPolicy:
        print(row)
def SARSA_QLPlot():
    # averaging the reward sums from 10 successive episodes
    averageRange = 10
    
    # episodes of each run
    nEpisodes = 500
    
    # perform 20 independent runs
    runs = 20
    
    rewardsSARSA = np.zeros( nEpisodes )
    rewardsQLearning = np.zeros( nEpisodes )
    for run in range( 0, runs ):
        stateActionValuesSARSA = np.copy( stateActionValues )
        stateActionValuesQLearning = np.copy( stateActionValues )
        for i in range( 0, nEpisodes ):
            # cut off the value by -100 to draw the figure more elegantly
            rewardsSARSA[i] += max( sarsa(stateActionValuesSARSA), -100 )
            rewardsQLearning[i] += max( qLearning(stateActionValuesQLearning), -100 )
    
    # averaging over independent runs
    rewardsSARSA /= runs
    rewardsQLearning /= runs
    
    # averaging over successive episodes
    smoothedRewardsSARSA = np.copy( rewardsSARSA )
    smoothedRewardsQLearning = np.copy( rewardsQLearning )
    
    for i in range( averageRange, nEpisodes ):
        smoothedRewardsSARSA[i] = np.mean( rewardsSARSA[i-averageRange : i+1] )
        smoothedRewardsQLearning[i] = np.mean( rewardsQLearning[i-averageRange : i+1] )
        
    # display optimal policy
    print( 'SARSA Optimal Policy:' )
    printOptimalPolicy( stateActionValuesSARSA )
    print( 'Q-Learning Optimal Policy:' )
    printOptimalPolicy( stateActionValuesQLearning )
    
    # draw reward curves
    plt.figure(1)
    plt.title( 'SARSA vs. Q-Learning Sum of Rewards during Episode',
               c='r', 
               fontdict={'fontsize':14,
                         'family':'serif',
                        },#fontsize='xxx-large', #
            
               fontweight='bold',  # or 'heavy', 'bold', 'normal'
               fontstyle='italic' 
             ) # rotation=45, bbox=dict(facecolor='y', edgecolor='blue', alpha=0.65 ))
    plt.plot( smoothedRewardsSARSA, label = 'SARSA' )
    plt.plot( smoothedRewardsQLearning, label = 'Q-Learning' )
    plt.xlabel( 'Episodes' )
    plt.ylabel( 'Sum of rewards during episode' )
    plt.legend()
# Sum of Rewards for SARSA vs. QLearning
SARSA_QLPlot()

After an initial transient, Q-learning learns the value of the optimal policy, which walks right along the edge of the cliff. Unfortunately, this results in the agent occasionally falling off the cliff because of the ε-greedy action selection. SARSA, on the other hand, takes the action selection into account and learns the longer but safer path through the upper part of the grid. Although Q-learning learns the value of the optimal policy, its online performance is worse than that of SARSA, which learns the roundabout, safer policy. This is also visible in the preceding sum-of-rewards plot: SARSA has a less negative sum of rewards per episode than Q-learning.

Further reading

 There are many classic resources available for reinforcement learning, and we encourage
the reader to go through them:

  • R.S. Sutton and A.G. Barto, Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, USA, 1998
  • RL Course by David Silver on YouTube: https://www.youtube.com/watch?v=2pWv7GOvuf0&list=PL7-jPKtc4r78-wCZcQn5IqyuWhBZ8fOxT
  • Machine Learning (Stanford) by Andrew Ng on YouTube (Lectures 16-20): https://www.youtube.com/watch?v=UzxYlbK2c7E&list=PLA89DCFA6ADACE599
  • Algorithms for Reinforcement Learning by Csaba Szepesvári, Morgan & Claypool Publishers
  • Artificial Intelligence: A Modern Approach 3rd Edition, by Stuart Russell and Peter Norvig, Prentice Hall

Summary

     In this chapter, you've learned various reinforcement learning techniques, such as the Markov decision process, Bellman equations, dynamic programming, Monte Carlo methods, and temporal difference learning, including both on-policy (SARSA) and off-policy (Q-learning) control, with Python examples to understand their implementation in a practical way. You also learned how Q-learning is used in many practical applications nowadays, as this method learns from trial and error by interacting with the environment.

     Finally, Further reading has been provided if you would like to pursue reinforcement learning full-time. We wish you all the best!
