Description of the problem
The windy gridworld is a simple example from the textbook. Reproducing and solving it in code is what made Sarsa and Q-learning really click for me, so let me first introduce the problem:
We start from a standard gridworld in which each cell represents a state, with a start state and a goal state marked on the grid. The task is to have an agent travel from the start to the goal. On its own this is just an ordinary gridworld; adding a crosswind that blows upward through the middle of the grid turns it into the windy gridworld. At first, the action space contains only four actions: up, down, right, and left. In the windy region, however, the resulting next state is shifted upward by the strength of the "wind", which varies from column to column.
For example, if you are one cell to the right of the goal state, the action left takes you to the cell just above the goal. The whole process is an undiscounted episodic task, with a constant reward of -1 on every step until the goal state is reached.
To take the problem further, what happens if we expand the action space to 8 actions? That is, on top of the previous four, we add four diagonal moves: up-left, up-right, down-left, and down-right (King's moves).
And finally, what if the wind strength in each column is stochastic instead of constant?
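Before walking through the solution, here is a minimal, self-contained sketch of the basic (deterministic-wind) transition rule. It is only an illustration, not part of the program below; the helper name windy_step and the (column, row) convention are mine. It reproduces the example above: standing one cell to the right of the goal, the action left lands just above the goal.
# Minimal sketch of a single wind-shifted transition (illustration only).
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]  # upward wind strength for each of the 10 columns
HEIGHT = 7                              # number of rows

def windy_step(col, row, d_col, d_row):
    """Apply an action, then push the agent upward by the wind of the column it started from."""
    new_col = col + d_col
    new_row = row + d_row + WIND[col]
    new_row = max(0, min(HEIGHT - 1, new_row))  # keep the agent on the grid
    return new_col, new_row

# The goal is at column 7, row 3. Starting one cell to its right and moving left:
print(windy_step(8, 3, -1, 0))  # (7, 4): the cell just above the goal, because WIND[8] == 1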
Now, I’ll show you how to solve this step by step.
Here comes the code:
- Firstly, we define a "World" class that represents the windy gridworld: it determines where the agent can move and what reward it receives.
import random

import numpy as np
import matplotlib.pyplot as plt


class World(object):
    def __init__(self, kingBool):
        # if kingBool is true, the agent uses King's moves (and, in this implementation,
        # the stochastic wind); otherwise it can only move N, S, W, E
        self.gridthWidth = 10
        self.gridHeight = 7
        self.startPos = (0, 3)  # (column, row)
        self.currentPos = self.startPos  # initialize the current position
        self.goalPos = (7, 3)
        self.windVals = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]
        self.king = kingBool

    def movePosition(self, action):
        # move the agent based on the chosen action, and return the new position and the reward
        newPos = np.add(list(self.currentPos), list(action))
        if self.king:  # in the King's-moves setting the wind is stochastic
            windProbability = random.randint(1, 3)  # integer in [1, 3], each value with probability 1/3
        else:
            windProbability = 1
        # push the agent upward according to the wind in the column it is leaving
        newPos[1] = newPos[1] + self.windVals[self.currentPos[0]]
        if windProbability == 2:
            newPos[1] = newPos[1] - 1  # the wind blows the agent one cell below the mean wind value
        elif windProbability == 3:
            newPos[1] = newPos[1] + 1  # the wind blows the agent one cell above the mean wind value
        # if the agent would leave the grid, keep it just inside the edge it would have crossed
        if newPos[1] < 0:
            newPos[1] = 0
        elif newPos[1] >= self.gridHeight:
            newPos[1] = self.gridHeight - 1
        if newPos[0] == self.goalPos[0] and newPos[1] == self.goalPos[1]:
            reward = 1
        else:
            reward = -1
        return [(newPos[0], newPos[1]), reward]
    def allowedActionsFromPos(self, position):
        # do not let the agent choose an action that steps off the grid
        # (it can still be pushed off by the wind, and movePosition clamps it back inside)
        candidates = [(-1, 0), (1, 0), (0, 1), (0, -1)]
        if self.king:  # diagonal (King's) actions
            candidates += [(1, 1), (-1, -1), (1, -1), (-1, 1)]
        allowedMoves = []
        for dCol, dRow in candidates:
            if 0 <= position[0] + dCol < self.gridthWidth and 0 <= position[1] + dRow < self.gridHeight:
                allowedMoves.append((dCol, dRow))
        return allowedMoves
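As a quick sanity check (this snippet is not part of the original program, and the exact values printed depend on the start position and wind configuration above), we can instantiate the world and take a single step:
# Illustrative usage of the World class.
world = World(False)                                 # four-action world with deterministic wind
print(world.allowedActionsFromPos(world.startPos))   # the actions available at the start state
nextPos, reward = world.movePosition((1, 0))         # move one column to the right
print(nextPos, reward)                               # the new (column, row) and the step reward (-1 off the goal)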
- Secondly, the core of the class named "Agent" is the implementation of Sarsa and Q-learning; the creation of the Q-table and of the $\epsilon$-greedy policy is also included in this part. It is worth noticing that the Q-table is a two-dimensional dictionary of the form:
{'position1': {'action1': value1, 'action2': value2}, 'position2': {'action1': value1, 'action2': value2}, ...}
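For reference, the one-step updates that Sarsa() and QLearning() below implement are the standard ones, with step size $\alpha$, discount $\gamma$, reward $R$, and next state-action pair $(S', A')$:

Sarsa: $Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma\, Q(S', A') - Q(S, A) \right]$

Q-learning: $Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \max_a Q(S', a) - Q(S, A) \right]$

For instance, with $\alpha = 0.9$, $\gamma = 0.9$, all Q-values initialized to 0, and a step reward of $-1$, the very first update of any visited state-action pair gives $Q \leftarrow 0 + 0.9\,(-1 + 0.9 \cdot 0 - 0) = -0.9$.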
class Agent(object):
    def __init__(self, a, e, world: World):
        self.alpha = a  # learning rate
        self.epsilon = e  # the probability of exploring
        self.world = world
        self.gamma = 0.9
        self.Q_values = self.createQTables()
        self.numOfSteps = 0

    def createQTables(self):
        # QTable is a two-dimensional dictionary: each key is a state, and its value maps every
        # action allowed in that state to a Q-value:
        # {'position1': {'action1': value1, 'action2': value2}, 'position2': {...}, ...}
        QTable = {}
        for x in range(self.world.gridthWidth):
            for y in range(self.world.gridHeight):
                position = (x, y)
                QTable[position] = {}  # all the possible actions in this state
                allowedActions = self.world.allowedActionsFromPos(position)
                for action in allowedActions:
                    QTable[position][action] = 0
        return QTable

    def getBestAction(self, position):
        # greedy action: the allowed action with the largest Q-value in this state
        maxVal = float('-inf')  # negative infinity
        bestAction = None
        for action, value in self.Q_values[position].items():
            if value > maxVal:
                maxVal = value
                bestAction = action
        return bestAction
    def chooseAction(self, position):
        # epsilon-greedy policy: explore with probability epsilon, otherwise act greedily
        allowedActions = list(self.Q_values[position].keys())  # all allowed actions in this state
        randomNum = np.random.rand()  # uniform sample from [0, 1), excluding 1
        if randomNum < self.epsilon:
            subOptimalActionChoiceIndex = random.randint(0, len(allowedActions) - 1)  # choose a random action
            action = allowedActions[subOptimalActionChoiceIndex]
        else:
            action = self.getBestAction(position)
        return action
    def Sarsa(self):
        self.world.currentPos = self.world.startPos
        visited = []  # states visited in this episode
        actionsTaken = []
        visited.append(self.world.startPos)
        chosenAction = self.chooseAction(self.world.currentPos)
        converged = True
        while True:  # loop for each step of the episode
            # take action A, observe R and S'
            newPosAndReward = self.world.movePosition(chosenAction)  # [(column, row), reward]
            newPosition = newPosAndReward[0]
            reward = newPosAndReward[1]
            visited.append(newPosition)
            actionsTaken.append(chosenAction)
            self.numOfSteps = self.numOfSteps + 1
            # choose A' from S' using the epsilon-greedy policy
            nextAction = self.chooseAction(newPosition)
            # update the Q-value of the current state-action pair
            oldQ = self.Q_values[self.world.currentPos][chosenAction]
            self.Q_values[self.world.currentPos][chosenAction] += self.alpha * (
                reward + self.gamma * self.Q_values[newPosition][nextAction]
                - self.Q_values[self.world.currentPos][chosenAction])
            if abs(oldQ - self.Q_values[self.world.currentPos][chosenAction]) > self.epsilon:
                converged = False
            # advance to the next state and action
            self.world.currentPos = newPosition
            chosenAction = nextAction
            if self.world.currentPos == self.world.goalPos:  # the next position is the goal
                break
        return converged, visited, actionsTaken
    def QLearning(self):
        self.world.currentPos = self.world.startPos
        visited = []  # states visited in this episode
        actionsTaken = []
        visited.append(self.world.startPos)
        converged = True
        while True:
            # choose A from S using the epsilon-greedy policy
            chosenAction = self.chooseAction(self.world.currentPos)
            # take A, observe R and S'
            newPosAndReward = self.world.movePosition(chosenAction)
            newPosition = newPosAndReward[0]
            reward = newPosAndReward[1]
            visited.append(newPosition)
            actionsTaken.append(chosenAction)
            self.numOfSteps = self.numOfSteps + 1
            # find the best action from S'
            bestAction = self.getBestAction(newPosition)
            oldQ = self.Q_values[self.world.currentPos][chosenAction]
            self.Q_values[self.world.currentPos][chosenAction] += self.alpha * (
                reward + self.gamma * self.Q_values[newPosition][bestAction]
                - self.Q_values[self.world.currentPos][chosenAction])
            if abs(oldQ - self.Q_values[self.world.currentPos][chosenAction]) > self.epsilon:
                converged = False
            # advance to the next state
            self.world.currentPos = newPosition
            if self.world.currentPos == self.world.goalPos:
                break
        return converged, visited, actionsTaken
- Finally, here comes the "main" part: we set up the normal-move and King's-move conditions separately, run 170 episodes for each, and apply both Sarsa and Q-learning to every condition to compare their performance.
def main():
    # Sarsa
    testWorld = World(False)
    testAgent = Agent(0.9, 0.001, testWorld)
    timeSteps = [0]
    episodes = [0]
    convergedX = []
    convergedY = []
    alreadyConverged = False
    for x in range(170):  # the maximum number of episodes
        result = testAgent.Sarsa()
        timeSteps.append(testAgent.numOfSteps)
        episodes.append(x)
        if result[0] and not alreadyConverged:
            print("Found the optimal policy using Sarsa. The path is ...")
            print(result[1])
            print("The actions taken on that path are ...")
            print(result[2])
            convergedX.append(x)
            convergedY.append(testAgent.numOfSteps)
            alreadyConverged = True
    plt.plot(timeSteps, episodes, 'r--', convergedY, convergedX, 'bs')
    plt.suptitle("Sarsa")
    plt.xlabel("Total Timesteps")
    plt.ylabel("Episodes")
    plt.show()

    # Q-learning
    testWorld = World(False)
    testAgent = Agent(0.9, 0.001, testWorld)
    timeSteps = [0]
    episodes = [0]
    convergedX = []
    convergedY = []
    alreadyConverged = False
    for x in range(170):  # the maximum number of episodes
        result = testAgent.QLearning()
        timeSteps.append(testAgent.numOfSteps)
        episodes.append(x)
        if result[0] and not alreadyConverged:
            print("Found the optimal policy using Q-Learning. The path is ...")
            print(result[1])
            print("The actions taken on that path are ...")
            print(result[2])
            convergedX.append(x)
            convergedY.append(testAgent.numOfSteps)
            alreadyConverged = True
    plt.plot(timeSteps, episodes, 'r--', convergedY, convergedX, 'bs')
    plt.suptitle("Q_Learning")
    plt.xlabel("Total Timesteps")
    plt.ylabel("Episodes")
    plt.show()

    # Sarsa with King's moves
    testWorld = World(True)
    testAgent = Agent(0.9, 0.001, testWorld)
    timeSteps = [0]
    episodes = [0]
    convergedX = []
    convergedY = []
    alreadyConverged = False
    for x in range(170):  # the maximum number of episodes
        result = testAgent.Sarsa()
        timeSteps.append(testAgent.numOfSteps)
        episodes.append(x)
        if result[0] and not alreadyConverged:
            print("Found the optimal policy using Sarsa. The path is ...")
            print(result[1])
            print("The actions taken on that path are ...")
            print(result[2])
            convergedX.append(x)
            convergedY.append(testAgent.numOfSteps)
            alreadyConverged = True
    plt.plot(timeSteps, episodes, 'r--', convergedY, convergedX, 'bs')
    plt.suptitle("Sarsa - King's Moves")
    plt.xlabel("Total Timesteps")
    plt.ylabel("Episodes")
    plt.show()

    # Q-Learning with King's moves
    testWorld = World(True)
    testAgent = Agent(0.9, 0.001, testWorld)
    timeSteps = [0]
    episodes = [0]
    convergedX = []
    convergedY = []
    alreadyConverged = False
    for x in range(170):  # the maximum number of episodes
        result = testAgent.QLearning()
        timeSteps.append(testAgent.numOfSteps)
        episodes.append(x)
        if result[0] and not alreadyConverged:
            print("Found the optimal policy using Q-Learning. The path is ...")
            print(result[1])
            print("The actions taken on that path are ...")
            print(result[2])
            convergedX.append(x)
            convergedY.append(testAgent.numOfSteps)
            alreadyConverged = True
    plt.plot(timeSteps, episodes, 'r--', convergedY, convergedX, 'bs')
    plt.suptitle("Q_Learning - King's Moves")
    plt.xlabel("Total Timesteps")
    plt.ylabel("Episodes")
    plt.show()


if __name__ == '__main__':
    main()
Results
In the test, the start point is (0, 3) and the goal point is (7, 3), in (column, row) coordinates.
- In the condition of normal moves and constant wind strength, with the Sarsa algorithm:
From the line chart, we can see that after approximately 1,800 time steps, that is, about 65 episodes, the curve becomes a straight line, which means the agent has found the best path from the start point to the goal point.
The path, together with the actions taken along it, is listed below:
- In the condition of normal moves and constant wind strength, with the Q-learning algorithm:
We get nearly the same results as in the first condition.
The path, together with the actions taken along it, is listed below:
- In the condition of King's moves and stochastic wind strength, with the Sarsa algorithm:
The line chart did not settle into a straight line, which means the agent failed to find the best path within 170 episodes; the policy it followed kept switching. So we did not obtain a final path in this condition.
- In the final condition, King's moves and stochastic wind strength, with the Q-learning algorithm:
The same as in the previous condition: the curve did not converge to a straight line.
References
[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction.