Windy Gridworld: A Simple Implementation of Tabular RL (Sarsa and Q-Learning)

Description of the problem

The windy gridworld is a simple example from the textbook [1]. By reproducing and solving it in code, I began to really understand Sarsa and Q-learning. Let me introduce the problem; it goes like this:

There is a standard gridworld in which each cell represents a state, with a start state and a goal state placed on the grid. Your mission is to have an agent travel from the start to the goal. So far this is just a standard gridworld; adding a crosswind that blows upward through the middle of the grid turns it into the windy gridworld. To begin with, the action space contains only four actions: up, down, right and left, but in the windy region the resulting next state is shifted upward by the strength of the “wind”, which varies from column to column.
For example, if you are just one cell to the right of the goal state, then the action left takes you to the cell just above the goal. The whole task is an undiscounted episodic one, with a constant reward of -1 until the goal state is reached.
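To make the dynamics concrete, here is a minimal sketch of a single step in the 4-action case, written as a stand-alone helper (the name windy_step and the constants are mine, for illustration only; they are not part of the implementation that follows):

def windy_step(col, row, d_col, d_row):
    # apply the chosen move, then shift the row upward by the wind of the current column,
    # clamping the result so the agent stays on the 10 x 7 grid
    wind = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]  # wind strength per column
    new_col = min(max(col + d_col, 0), 9)
    new_row = min(max(row + d_row + wind[col], 0), 6)
    return new_col, new_row

For example, windy_step(6, 3, 1, 0) (moving right from column 6, row 3) lands in column 7, row 5, because the wind in column 6 has strength 2.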

To take the problem further, what if we expand the action space to eight actions? That is, four diagonal actions are added to the previous four: up-left, up-right, down-left and down-right (the so-called King's moves).
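In the (column, row) offset convention used by the code below, the eight King's-move actions would look like this (just an illustration of the action set; the class below builds it internally):

kings_moves = [(1, 0), (-1, 0), (0, 1), (0, -1),    # right, left, up, down
               (1, 1), (1, -1), (-1, 1), (-1, -1)]  # the four diagonal moves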

And finally, what if the wind strength is stochastic instead of constant in every column?

Now, I’ll show you how to solve this step by step.
Here comes the code:

  • First, we define a World class to represent the windy gridworld; it determines where the agent can move and what reward it receives (a short usage example follows the class).
import random

import numpy as np
import matplotlib.pyplot as plt


class World(object):
    def __init__(self, kingBool):  # if kingBool is True, the agent can use King's moves and the wind is stochastic; otherwise it can only move N, S, E, W under a deterministic wind
        self.gridthWidth = 10
        self.gridHeight = 7
        self.startPos = (3, 0)  # (column, row)
        self.currentPos = self.startPos  # init the current position
        self.goalPos = (3, 6)
        self.windVals = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]
        self.king = kingBool

    def movePosition(self, action):
        # move the agent to a new position based on the chosen action (plus wind),
        # and return the new position and the reward
        newPos = np.add(list(self.currentPos), list(action))

        if (self.king):  # in the King's-moves version the wind is stochastic
            windProbability = random.randint(1, 3)  # random integer in [1, 3]
        else:
            windProbability = 1

        # shift the agent's row according to the wind of the current column
        newPos[1] = newPos[1] + self.windVals[self.currentPos[0]]

        if (windProbability == 2):
            newPos[1] = newPos[1] - 1  # the wind blows the agent one cell down
        elif (windProbability == 3):
            newPos[1] = newPos[1] + 2  # the wind blows the agent two extra cells up

        # if the agent would leave the grid vertically, clamp it to the nearest edge cell
        if (newPos[1] < 0):
            newPos[1] = 0
        elif (newPos[1] >= self.gridHeight):
            newPos[1] = self.gridHeight - 1

        if (newPos[0] == self.goalPos[0] and newPos[1] == self.goalPos[1]):
            reward = 1  # the goal is reached
        else:
            reward = -1  # constant step cost

        return [(newPos[0], newPos[1]), reward]

    def allowedActionsFromPos(self, position):  # not allowing agent to move off grid, unless because of wind
        allowedMoves = []
        if (position[0] != 0):
            allowedMoves.append((-1, 0))
        if (position[0] != self.gridthWidth - 1):
            allowedMoves.append((1, 0))
        if (position[1] != self.gridHeight - 1):
            allowedMoves.append((0, 1))
        if (position[1] != 0):
            allowedMoves.append((0, -1))

        if (self.king):  # diagonal actions
            allowedMoves.append((1, 1))
            allowedMoves.append((-1, -1))
            allowedMoves.append((1, -1))
            allowedMoves.append((-1, 1))
            if (position == (0, 0)):  # bottom left(0, 0)
                if ((-1, 0) in allowedMoves):
                    allowedMoves.remove((-1, 0))
                if ((0, -1) in allowedMoves):
                    allowedMoves.remove((0, -1))
                allowedMoves.remove((-1, -1))
                allowedMoves.remove((-1, 1))
                allowedMoves.remove((1, -1))
            elif (position == (0, self.gridHeight - 1)):  # top left
                if ((0, 1) in allowedMoves):
                    allowedMoves.remove((0, 1))
                if ((-1, 0) in allowedMoves):
                    allowedMoves.remove((-1, 0))
                allowedMoves.remove((1, 1))
                allowedMoves.remove((-1, 1))
                allowedMoves.remove((-1, -1))
            elif (position == (self.gridthWidth - 1, 0)):  # bottom right
                if ((1, 0) in allowedMoves):
                    allowedMoves.remove((1, 0))
                if ((0, -1) in allowedMoves):
                    allowedMoves.remove((0, -1))
                allowedMoves.remove((1, 1))
                allowedMoves.remove((1, -1))
                allowedMoves.remove((-1, -1))
            elif (position == (self.gridthWidth - 1, self.gridHeight - 1)):  # top right
                if ((0, 1) in allowedMoves):
                    allowedMoves.remove((0, 1))
                if ((1, 0) in allowedMoves):
                    allowedMoves.remove((1, 0))
                allowedMoves.remove((1, 1))
                allowedMoves.remove((1, -1))
                allowedMoves.remove((-1, 1))
            else:
                if (position[0] == self.gridthWidth - 1):
                    if ((1, 0) in allowedMoves):
                        allowedMoves.remove((1, 0))
                    allowedMoves.remove((1, 1))
                    allowedMoves.remove((1, -1))
                if (position[0] == 0):
                    if ((-1, 0) in allowedMoves):
                        allowedMoves.remove((-1, 0))
                    allowedMoves.remove((-1, 1))
                    allowedMoves.remove((-1, -1))
                if (position[1] == self.gridHeight - 1):  # top edge
                    if ((0, 1) in allowedMoves):
                        allowedMoves.remove((0, 1))
                    allowedMoves.remove((1, 1))
                    allowedMoves.remove((-1, 1))
                if (position[1] == 0):
                    if ((0, -1) in allowedMoves):
                        allowedMoves.remove((0, -1))
                    allowedMoves.remove((1, -1))
                    allowedMoves.remove((-1, -1))

        return allowedMoves
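Before moving on, a quick sketch of how this class is used (the variable names here are mine, chosen for illustration; the printed values assume the deterministic 4-action world defined above):

world = World(False)                                 # 4-action world with deterministic wind
print(world.allowedActionsFromPos(world.startPos))   # -> [(-1, 0), (1, 0), (0, 1)]
newPos, reward = world.movePosition((1, 0))          # move right from the start; the wind in column 3 (strength 1) lifts the row by 1
print(newPos, reward)                                # new position (column 4, row 1), reward -1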
  • Second, the Agent class: its core is the implementation of Sarsa and Q-learning, and it also builds the Q-table and implements the $\epsilon$-greedy policy. It is worth noticing that the Q-table is a two-level (nested) dictionary of the form:
    {'position1': {'action1': value1, 'action2': value2}, 'position2': {'action1': value1, 'action2': value2}, …}
The update rules that the two methods implement are recalled below, just before the code.
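For reference, these are the updates the two methods below implement, with learning rate $\alpha$ and discount factor $\gamma = 0.9$:

    Sarsa (on-policy): $Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma Q(S', A') - Q(S, A) \right]$, where $A'$ is the action actually chosen in $S'$ by the $\epsilon$-greedy policy.
    Q-learning (off-policy): $Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \max_a Q(S', a) - Q(S, A) \right]$.
    $\epsilon$-greedy policy: with probability $\epsilon$ choose a uniformly random allowed action; otherwise choose the action with the highest Q-value in the current state.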
class Agent(object):
    def __init__(self, a, e, world: World):
        self.alpha = a  # learning rate
        self.epsilon = e  # the probability you want to explore
        self.world = world
        self.gamma = 0.9
        self.Q_values = self.createQTables()
        self.numOfSteps = 0

    def createQTables(self):
        QTable = {}  # dictionary where each key is a state that holds a different number of possible actions, each which has a q-value
        # QTable is a two dimension dictionary  {'position1':{'action1': value1, 'action2': value2}, 'position2':{'action1': value1, 'action2': value2}, ...}
        for x in range(self.world.gridthWidth):
            for y in range(self.world.gridHeight):
                position = (x, y)
                QTable[position] = {}  # all the possible actions in the current state
                allowedActions = self.world.allowedActionsFromPos(position)
                for action in allowedActions:
                    QTable[position][action] = 0
        return QTable

    def getBestAction(self, position):
        maxVal = -np.inf  # negative infinity
        bestAction = None
        for action, value in self.Q_values[position].items():
            if value > maxVal:
                maxVal = value
                bestAction = action
        return bestAction

    def chooseAction(self, position):
        # epsilon-greedy action selection over the allowed actions of this state
        allowedActions = list(self.Q_values[position].keys())  # all allowed actions in this state

        randomNum = np.random.rand()  # uniform random sample from [0, 1)
        if (randomNum < self.epsilon):
            randomIndex = random.randint(0, len(allowedActions) - 1)  # explore: choose a random action
            action = allowedActions[randomIndex]
        else:
            action = self.getBestAction(position)  # exploit: choose the greedy action
        return action

    def Sarsa(self):
        self.world.currentPos = self.world.startPos
        visited = []  # initialize
        actionsTaken = []
        visited.append(self.world.startPos)
        chosenAction = self.chooseAction(self.world.currentPos)
        newPosAndReward = None  # initialize
        converged = True

        while True:  # loop for each step of the episode
            # take action A, observe R and S'
            newPosAndReward = self.world.movePosition(chosenAction)  # [(column, row), reward]
            newPosition = newPosAndReward[0]
            reward = newPosAndReward[1]
            visited.append(newPosition)
            actionsTaken.append(chosenAction)
            self.numOfSteps = self.numOfSteps + 1

            # choose A' from S' using epsilon-greedy policy
            nextAction = self.chooseAction(newPosition)

            # update Q-value of the current position
            oldQ = self.Q_values[self.world.currentPos][chosenAction]
            self.Q_values[self.world.currentPos][chosenAction] += self.alpha * (
                    reward + self.gamma * self.Q_values[newPosition][nextAction] -
                    self.Q_values[self.world.currentPos][chosenAction])

            if (abs(oldQ - self.Q_values[self.world.currentPos][chosenAction]) > self.epsilon):
                converged = False

            # advance to the next state and action
            self.world.currentPos = newPosition
            chosenAction = nextAction

            if (self.world.currentPos == self.world.goalPos):  # next position is the goal
                break

        return converged, visited, actionsTaken

    def QLearning(self):
        self.world.currentPos = self.world.startPos
        visited = []  # initialize
        actionsTaken = []
        visited.append(self.world.startPos)
        newPosAndReward = None  # initialize
        converged = True

        while True:
            # choose A from S using the epsilon-greedy policy
            chosenAction = self.chooseAction(self.world.currentPos)
            # take A and observe R and S'
            newPosAndReward = self.world.movePosition(chosenAction)
            newPosition = newPosAndReward[0]
            reward = newPosAndReward[1]
            visited.append(newPosition)
            actionsTaken.append(chosenAction)
            self.numOfSteps = self.numOfSteps + 1
            # find the best action from S'
            bestAction = self.getBestAction(newPosition)

            oldQ = self.Q_values[self.world.currentPos][chosenAction]
            self.Q_values[self.world.currentPos][chosenAction] += self.alpha * (
                    reward + self.gamma * self.Q_values[newPosition][bestAction] -
                    self.Q_values[self.world.currentPos][chosenAction])

            if (abs(oldQ - self.Q_values[self.world.currentPos][chosenAction]) > self.epsilon):
                converged = False

            # advance to the next state
            self.world.currentPos = newPosition

            if (self.world.currentPos == self.world.goalPos):
                break

        return converged, visited, actionsTaken
  • Finally, the main part: a small helper builds a world (normal or King's moves), trains an agent for 170 episodes with the chosen algorithm, and plots the learning curve; main() then runs it for each of the four conditions so that Sarsa and Q-learning can be compared.
def runExperiment(kingBool, algorithmName, title):
    # build a world (normal or King's moves), train an agent for 170 episodes with the
    # chosen algorithm, and plot cumulative time steps against episodes
    world = World(kingBool)
    agent = Agent(0.9, 0.001, world)  # alpha = 0.9, epsilon = 0.001
    timeSteps = [0]
    episodes = [0]
    convergedX = []
    convergedY = []
    alreadyConverged = False
    for x in range(170):  # the max number of episodes
        result = agent.Sarsa() if algorithmName == "Sarsa" else agent.QLearning()
        timeSteps.append(agent.numOfSteps)
        episodes.append(x)
        if (result[0] and not alreadyConverged):
            print("Found the optimal policy using " + algorithmName + ". The path is ...")
            print(result[1])
            print("The actions taken on that path are ...")
            print(result[2])
            convergedX.append(x)
            convergedY.append(agent.numOfSteps)
            alreadyConverged = True

    plt.plot(timeSteps, episodes, 'r--', convergedY, convergedX, 'bs')
    plt.suptitle(title)
    plt.xlabel("Total Timesteps")
    plt.ylabel("Episodes")
    plt.show()


def main():
    runExperiment(False, "Sarsa", "Sarsa")                          # normal moves, Sarsa
    runExperiment(False, "Q-Learning", "Q_Learning")                # normal moves, Q-learning
    runExperiment(True, "Sarsa", "Sarsa - King's Moves")            # King's moves, stochastic wind, Sarsa
    runExperiment(True, "Q-Learning", "Q_Learning - King's Moves")  # King's moves, stochastic wind, Q-learning


if __name__ == '__main__':
    main()

Results

In the tests, the start point is set to (0, 3) and the goal point to (7, 3).

  • Normal moves, constant wind strength, Sarsa:
    From the line chart we can see that after approximately 1800 total time steps, i.e., around 65 episodes, the curve becomes a straight line, which means the agent takes a constant number of steps per episode and has found the best path from the start to the goal.
    (Figure: Sarsa, episodes vs. total time steps.)
    The resulting path and the actions taken along it are listed below:
    (Figure: the printed path and actions.)

  • Normal moves, constant wind strength, Q-learning:
    We get nearly the same result as in the first condition.
    (Figure: Q-learning, episodes vs. total time steps.)
    The resulting path and the actions taken along it are listed below:
    (Figure: the printed path and actions.)

  • King's moves, stochastic wind strength, Sarsa:
    The line chart does not converge to a straight line, which means the agent failed to find a best path within 170 episodes; the policy it followed kept switching. So we do not get a final path in this condition.
    (Figure: Sarsa with King's moves, episodes vs. total time steps.)

  • Final condition, King's moves, stochastic wind strength, Q-learning:
    As in the previous condition, the curve does not converge to a straight line.
    (Figure: Q-learning with King's moves, episodes vs. total time steps.)

References

[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction.

If any content here infringes on someone's rights, I promise to delete it immediately.
