Description of the problem
The windy gridworld is a simple example from the textbook. Reproducing and solving it in code is what made Sarsa and Q-learning really click for me, so let me first introduce the problem:
We start from a standard gridworld in which each cell represents a state, with a start state and a goal state marked on the grid. The task is to have an agent travel from the start to the goal. On its own this is just an ordinary gridworld; adding a crosswind that blows upward through the middle of the grid turns it into the windy gridworld. At first, the action space contains only four actions: up, down, right, and left. In the windy region, however, the resulting next state is shifted upward by the strength of the "wind", which varies from column to column.
For example, if you are one cell to the right of the goal state, the action left takes you to the cell just above the goal. The whole process is an undiscounted episodic task, with a constant reward of -1 on every step until the goal state is reached.
To take the problem further, what happens if we expand the action space to 8 actions? That is, on top of the previous four, we add four diagonal moves: up-left, up-right, down-left, and down-right (King's moves).
And finally, what if the wind strength in each column is stochastic instead of constant?
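Before walking through the solution, here is a minimal, self-contained sketch of the basic (deterministic-wind) transition rule. It is only an illustration, not part of the program below; the helper name windy_step and the (column, row) convention are mine. It reproduces the example above: standing one cell to the right of the goal, the action left lands just above the goal.
# Minimal sketch of a single wind-shifted transition (illustration only).
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]  # upward wind strength for each of the 10 columns
HEIGHT = 7                              # number of rows

def windy_step(col, row, d_col, d_row):
    """Apply an action, then push the agent upward by the wind of the column it started from."""
    new_col = col + d_col
    new_row = row + d_row + WIND[col]
    new_row = max(0, min(HEIGHT - 1, new_row))  # keep the agent on the grid
    return new_col, new_row

# The goal is at column 7, row 3. Starting one cell to its right and moving left:
print(windy_step(8, 3, -1, 0))  # (7, 4): the cell just above the goal, because WIND[8] == 1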
Now, I’ll show you how to solve this step by step.
Here comes the code:
- Firstly, we define a "World" class that represents the windy gridworld: it determines where the agent can move and what reward it receives.
import random

import numpy as np
import matplotlib.pyplot as plt


class World(object):
    def __init__(self, kingBool):
        # if kingBool is true, the agent uses King's moves (and, in this implementation,
        # the stochastic wind); otherwise it can only move N, S, W, E
        self.gridthWidth = 10
        self.gridHeight = 7
        self.startPos = (0, 3)  # (column, row)
        self.currentPos = self.startPos  # initialize the current position
        self.goalPos = (7, 3)
        self.windVals = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]
        self.king = kingBool

    def movePosition(self, action):
        # move the agent based on the chosen action, and return the new position and the reward
        newPos = np.add(list(self.currentPos), list(action))
        if self.king:  # in the King's-moves setting the wind is stochastic
            windProbability = random.randint(1, 3)  # integer in [1, 3], each value with probability 1/3
        else:
            windProbability = 1
        # push the agent upward according to the wind in the column it is leaving
        newPos[1] = newPos[1] + self.windVals[self.currentPos[0]]
        if windProbability == 2:
            newPos[1] = newPos[1] - 1  # the wind blows the agent one cell below the mean wind value
        elif windProbability == 3:
            newPos[1] = newPos[1] + 1  # the wind blows the agent one cell above the mean wind value
        # if the agent would leave the grid, keep it just inside the edge it would have crossed
        if newPos[1] < 0:
            newPos[1] = 0
        elif newPos[1] >= self.gridHeight:
            newPos[1] = self.gridHeight - 1
        if newPos[0] == self.goalPos[0] and newPos[1] == self.goalPos[1]:
            reward = 1
        else:
            reward = -1
        return [(newPos[0], newPos[1]), reward]
    def allowedActionsFromPos(self, position):
        # do not let the agent choose an action that steps off the grid
        # (it can still be pushed off by the wind, and movePosition clamps it back inside)
        candidates = [(-1, 0), (1, 0), (0, 1), (0, -1)]
        if self.king:  # diagonal (King's) actions
            candidates += [(1, 1), (-1, -1), (1, -1), (-1, 1)]
        allowedMoves = []
        for dCol, dRow in candidates:
            if 0 <= position[0] + dCol < self.gridthWidth and 0 <= position[1] + dRow < self.gridHeight:
                allowedMoves.append((dCol, dRow))
        return allowedMoves
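As a quick sanity check (this snippet is not part of the original program, and the exact values printed depend on the start position and wind configuration above), we can instantiate the world and take a single step:
# Illustrative usage of the World class.
world = World(False)                                 # four-action world with deterministic wind
print(world.allowedActionsFromPos(world.startPos))   # the actions available at the start state
nextPos, reward = world.movePosition((1, 0))         # move one column to the right
print(nextPos, reward)                               # the new (column, row) and the step reward (-1 off the goal)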
- Secondly, the core of the class named "Agent" is the implementation of Sarsa and Q-learning; the creation of the Q-table and of the $\epsilon$-greedy policy is also included in this part. It is worth noticing that the Q-table is a two-dimensional dictionary of the form:
{'position1': {'action1': value1, 'action2': value2}, 'position2': {'action1': value1, 'action2': value2}, ...}
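For reference, the one-step updates that Sarsa() and QLearning() below implement are the standard ones, with step size $\alpha$, discount $\gamma$, reward $R$, and next state-action pair $(S', A')$:

Sarsa: $Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma\, Q(S', A') - Q(S, A) \right]$

Q-learning: $Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \max_a Q(S', a) - Q(S, A) \right]$

For instance, with $\alpha = 0.9$, $\gamma = 0.9$, all Q-values initialized to 0, and a step reward of $-1$, the very first update of any visited state-action pair gives $Q \leftarrow 0 + 0.9\,(-1 + 0.9 \cdot 0 - 0) = -0.9$.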
class Agent(object):
    def __init__(self, a, e, world: World):
        self.alpha = a  # learning rate
        self.epsilon = e  # the probability of exploring
        self.world = world
        self.gamma = 0.9
        self.Q_values = self.createQTables()
        self.numOfSteps = 0

    def createQTables(self):
        # QTable is a two-dimensional dictionary: each key is a state, and its value maps every
        # action allowed in that state to a Q-value:
        # {'position1': {'action1': value1, 'action2': value2}, 'position2': {...}, ...}
        QTable = {}
        for x in range(self.world.gridthWidth):
            for y in range(self.world.gridHeight):
                position = (x, y)
                QTable[position] = {}  # all the possible actions in this state
                allowedActions = self.world.allowedActionsFromPos(position)
                for action in allowedActions:
                    QTable[position][action] = 0
        return QTable

    def getBestAction(self, position):
        # greedy action: the allowed action with the largest Q-value in this state
        maxVal = float('-inf')  # negative infinity
        bestAction = None
        for action, value in self.Q_values[position].items():
            if value > maxVal:
                maxVal = value
                bestAction = action
        return bestAction
    def chooseAction(self, position):
        # epsilon-greedy policy: explore with probability epsilon, otherwise act greedily
        allowedActions = list(self.Q_values[position].keys())  # all allowed actions in this state
        randomNum = np.random.rand()  # uniform sample from [0, 1), excluding 1
        if randomNum < self.epsilon:
            subOptimalActionChoiceIndex = random.randint(0, len(allowedActions) - 1)  # choose a random action
            action = allowedActions[subOptimalActionChoiceIndex]
        else:
            action = self.getBestAction(position)
        return action
    def Sarsa(self):
        self.world.currentPos = self.world.startPos
        visited = []  # states visited in this episode
        actionsTaken = []
        visited.append(self.world.startPos)
        chosenAction = self.chooseAction(self.world.currentPos)
        converged = True
        while True:  # loop for each step of the episode
            # take action A, observe R and S'
            newPosAndReward = self.world.movePosition(chosenAction)  # [(column, row), reward]
            newPosition = newPosAndReward[0]
            reward = newPosAndReward[1]
            visited.append(newPosition)
            actionsTaken.append(chosenAction)
            self.numOfSteps = self.numOfSteps + 1
            # choose A' from S' using the epsilon-greedy policy
            nextAction = self.chooseAction(newPosition)
            # update the Q-value of the current state-action pair
            oldQ = self.Q_values[self.world.currentPos][chosenAction]
            self.Q_values[self.world.currentPos][chosenAction] += self.alpha * (
                reward + self.gamma * self.Q_values[newPosition][nextAction]
                - self.Q_values[self.world.currentPos][chosenAction])
            if abs(oldQ - self.Q_values[self.world.currentPos][chosenAction]) > self.epsilon:
                converged = False
            # advance to the next state and action
            self.world.currentPos = newPosition
            chosenAction = nextAction
            if self.world.currentPos == self.world.goalPos:  # the next position is the goal
                break
        return converged, visited, actionsTaken
    def QLearning(self):
        self.world.currentPos = self.world.startPos
        visited = []  # states visited in this episode
        actionsTaken = []
        visited.append(self.world.startPos)
        converged = True
        while True:
            # choose A from S using the epsilon-greedy policy
            chosenAction = self.chooseAction(self.world.currentPos)
            # take A, observe R and S'
            newPosAndReward = self.world.movePosition(chosenAction)
            newPosition = newPosAndReward[0]
            reward = newPosAndReward[1]
            visited.append(newPosition)
            actionsTaken.append(chosenAction)
            self.numOfSteps = self.numOfSteps + 1
            # find the best action from S'
            bestAction = self.getBestAction(newPosition)
            oldQ = self.Q_values[self.world.currentPos][chosenAction]
            self.Q_values[self.world.currentPos][chosenAction] += self.alpha * (
                reward + self.gamma * self.Q_values[newPosition][bestAction]
                - self.Q_values[self.world.currentPos][chosenAction])
            if abs(oldQ - self.Q_values[self.world.currentPos][chosenAction]) > self.epsilon:
                converged = False
            # advance to the next state
            self.world.currentPos = newPosition
            if self.world.currentPos == self.world.goalPos:
                break
        return converged, visited, actionsTaken
- Finally, here comes the "main" part: we set up the normal-move and King's-move conditions separately, run 170 episodes for each, and apply both Sarsa and Q-learning to every condition to compare their performance.
def main():
    # Sarsa
    testWorld = World(False)
    testAgent = Agent(0.9, 0.001, testWorld)
    timeSteps = [0]
    episodes = [0]
    convergedX = []
    convergedY = []
    alreadyConverged = False
    for x in range(170):  # the maximum number of episodes
        result = testAgent.Sarsa()
        timeSteps.append(testAgent.numOfSteps)
        episodes.append(x)
        if result[0] and not alreadyConverged:
            print("Found the optimal policy using Sarsa. The path is ...")
            print(result[1])
            print("The actions taken on that path are ...")
            print(result[2])
            convergedX.append(x)
            convergedY.append(testAgent.numOfSteps)
            alreadyConverged = True
    plt.plot(timeSteps, episodes, 'r--', convergedY, convergedX, 'bs')
    plt.suptitle("Sarsa")
    plt.xlabel("Total Timesteps")
    plt.ylabel("Episodes")
    plt.show()

    # Q-learning
    testWorld = World(False)
    testAgent = Agent(0.9, 0.001, testWorld)
    timeSteps = [0]
    episodes = [0]
    convergedX = []
    convergedY = []
    alreadyConverged = False
    for x in range(170):  # the maximum number of episodes
        result = testAgent.QLearning()
        timeSteps.append(testAgent.numOfSteps)
        episodes.append(x)
        if result[0] and not alreadyConverged:
            print("Found the optimal policy using Q-Learning. The path is ...")
            print(result[1])
            print("The actions taken on that path are ...")
            print(result[2])
            convergedX.append(x)
            convergedY.append(testAgent.numOfSteps)
            alreadyConverged = True
    plt.plot(timeSteps, episodes, 'r--', convergedY, convergedX, 'bs')
    plt.suptitle("Q_Learning")
    plt.xlabel("Total Timesteps")
    plt.ylabel("Episodes")
    plt.show()

    # Sarsa with King's moves
    testWorld = World(True)
    testAgent = Agent(0.9, 0.001, testWorld)
    timeSteps = [0]
    episodes = [0]
    convergedX = []
    convergedY = []
    alreadyConverged = False
    for x in range(170):  # the maximum number of episodes
        result = testAgent.Sarsa()
        timeSteps.append(testAgent.numOfSteps)
        episodes.append(x)
        if result[0] and not alreadyConverged:
            print("Found the optimal policy using Sarsa. The path is ...")
            print(result[1])
            print("The actions taken on that path are ...")
            print(result[2])
            convergedX.append(x)
            convergedY.append(testAgent.numOfSteps)
            alreadyConverged = True
    plt.plot(timeSteps, episodes, 'r--', convergedY, convergedX, 'bs')
    plt.suptitle("Sarsa - King's Moves")
    plt.xlabel("Total Timesteps")
    plt.ylabel("Episodes")
    plt.show()

    # Q-Learning with King's moves
    testWorld = World(True)
    testAgent = Agent(0.9, 0.001, testWorld)
    timeSteps = [0]
    episodes = [0]
    convergedX = []
    convergedY = []
    alreadyConverged = False
    for x in range(170):  # the maximum number of episodes
        result = testAgent.QLearning()
        timeSteps.append(testAgent.numOfSteps)
        episodes.append(x)
        if result[0] and not alreadyConverged:
            print("Found the optimal policy using Q-Learning. The path is ...")
            print(result[1])
            print("The actions taken on that path are ...")
            print(result[2])
            convergedX.append(x)
            convergedY.append(testAgent.numOfSteps)
            alreadyConverged = True
    plt.plot(timeSteps, episodes, 'r--', convergedY, convergedX, 'bs')
    plt.suptitle("Q_Learning - King's Moves")
    plt.xlabel("Total Timesteps")
    plt.ylabel("Episodes")
    plt.show()


if __name__ == '__main__':
    main()
Results
In the test, the start point is (0, 3) and the goal point is (7, 3), in (column, row) coordinates.
- In the condition of normal moves and constant wind strength, with the Sarsa algorithm:
From the line chart, we can see that after approximately 1,800 time steps, that is, about 65 episodes, the curve becomes a straight line, which means the agent has found the best path from the start point to the goal point.
The path, together with the actions taken along it, is listed below:
- In the condition of normal moves and constant wind strength, with the Q-learning algorithm:
We get nearly the same results as in the first condition.
The path, together with the actions taken along it, is listed below:
- In the condition of King's moves and stochastic wind strength, with the Sarsa algorithm:
The line chart did not settle into a straight line, which means the agent failed to find the best path within 170 episodes; the policy it followed kept switching. So we did not obtain a final path in this condition.
- In the final condition, King's moves and stochastic wind strength, with the Q-learning algorithm:
The same as in the previous condition: the curve did not converge to a straight line.
References
[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction.