0. Statement 🫡
This blog contains my self-study notes on reinforcement learning, based on this website. I have organized the core and most important contents of the course here. 🧐
Start 🤠
Reinforcement Learning - Reinforcement learning algorithms seek to find a policy that yields more return to the agent than all other policies.
- A Markov decision process (MDP) consists of: Agent, Environment, State, Action, Reward.
The goal of an agent in an MDP is to maximize its cumulative rewards. - The cumulative rewards an agent receives over time are called the return.
The expected return is what’s driving the agent to make the decisions it makes.
- It is the agent’s goal to maximize the expected return of rewards.
- Tasks
- Episodic tasks - The agent-environment interaction naturally breaks up into subsequences (episodes), such as individual games of ping-pong.
- Continuing tasks - The agent-environment interaction does not break up naturally into episodes, but goes on continually without limit.
- Discounted Return - Rather than the agent’s goal being to maximize the expected return of rewards, it will instead be to maximize the expected discounted return of rewards. The agent chooses an action A_t at each time t to maximize the expected discounted return.
- It is the agent’s goal to maximize the expected discounted return of rewards.
- The discount rate (usually denoted γ, with 0 ≤ γ < 1) is the rate at which we discount future rewards, and it determines the present value of future rewards. This definition of the discounted return makes the agent care more about immediate rewards than about future rewards, since future rewards are more heavily discounted.
- The infinite sum of discounted rewards is finite as long as the rewards are bounded (e.g., a nonzero constant) and the discount rate γ is less than 1. For example, with a constant reward of +1 at every step, the return is 1/(1 − γ).
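As a quick numerical check (a small sketch of my own, not from the course; gamma = 0.9 is an arbitrary value chosen for illustration), a long truncated sum of discounted constant +1 rewards comes out very close to the closed-form limit 1/(1 − gamma):

import numpy as np

gamma = 0.9                  # discount rate (arbitrary illustrative value)
rewards = np.ones(1000)      # a long stream of constant +1 rewards

# Truncated discounted return: sum over k of gamma^k * R_(k+1)
discounted_return = np.sum(gamma ** np.arange(len(rewards)) * rewards)

print(discounted_return)     # ~10.0
print(1 / (1 - gamma))       # closed-form limit of the infinite sum: 10.0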
- Policy(π) - How probable is it for an agent to select any action from a given state?
- Definition: A policy is a function that maps a given state to probabilities of selecting each possible action from that state.
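To make the definition concrete, here is a small sketch of my own (not from the course) of a policy stored as a table of action probabilities, one row per state and one column per action:

import numpy as np

# A hypothetical environment with 3 states and 2 actions.
# policy[s, a] = probability of selecting action a in state s.
policy = np.array([
    [0.5, 0.5],   # state 0: either action with equal probability
    [0.9, 0.1],   # state 1: strongly prefer action 0
    [0.0, 1.0],   # state 2: always choose action 1
])

# Each row is a probability distribution over actions, so it must sum to 1.
assert np.allclose(policy.sum(axis=1), 1.0)

# Sample an action for state 1 according to the policy.
action = np.random.choice(policy.shape[1], p=policy[1])
print(action)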
- Value functions - How good is any given action or any given state for an agent?
- Definition: Value functions are functions of states, or of state-action pairs, that estimate how good it is for an agent to be in a given state, or how good it is for the agent to perform a given action in a given state.
- State-Value Function Vπ(s)
- The state-value function for policy π, denoted as Vπ, tells us how good any given state is for an agent following policy π. (It gives us the value of a state under π)
- Action-Value Function qπ(s, a)
- The action-value function for policy π, denoted as qπ, tells us how good it is for the agent to take any given action from a given state while following policy π. (It gives us the value of an action under π)
- The action-value function qπ is referred to as the Q-function, and the output from the function for any given state-action pair is called a Q-value. (The letter “Q” is used to represent the quality of taking a given action in a given state.)
- Wrapping up
- The state-value function tells us how good any given state is for the agent, whereas the action-value function tells us how good it is for the agent to take any action from a given state.
- Value functions are defined in terms of the expected return and with respect to a specific way of acting, i.e. with respect to a policy.
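To illustrate how the two value functions relate (my own sketch with made-up numbers, not from the course): the value of a state under π is the probability-weighted average of the action values available in that state, vπ(s) = Σ over a of π(a|s) · qπ(s, a).

import numpy as np

# Hypothetical policy and Q-values for one state with 2 actions.
pi_s = np.array([0.25, 0.75])   # π(a|s) for actions 0 and 1
q_s = np.array([1.0, 3.0])      # qπ(s, a) for actions 0 and 1

v_s = np.sum(pi_s * q_s)        # vπ(s) = Σ π(a|s) · qπ(s, a)
print(v_s)                      # 0.25*1.0 + 0.75*3.0 = 2.5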
- Optimal policies - A policy that is better than or at least as good as all other policies.
- Optimal State-Value Function V*: V* gives the largest expected return achievable by any policy π for each state.
- Optimal Action-Value Function q*: q* gives the largest expected return achievable by any policy π for each possible state-action pair.
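- For reference (this equation is not stated above; it is the standard form, added here for context), q* satisfies the Bellman optimality equation: q*(s, a) = E[ R(t+1) + γ · max over a′ of q*(s′, a′) ]. In words: the optimal value of taking action a in state s is the expected immediate reward plus the discounted optimal value of the best action in the next state.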
- Q-learning: It is a reinforcement learning technique used for learning the optimal policy in a Markov Decision Process.
- The goal of Q-learning is to find the optimal policy by learning the optimal Q-values for each state-action pair.
- Exploration Vs. Exploitation:
- Exploration is the act of exploring the environment to find out information about it.
- Exploitation is the act of exploiting the information that is already known about the environment in order to maximize the return.
- To strike a balance between exploration and exploitation, we use what is called an epsilon-greedy strategy.
- Choosing Actions with an Epsilon-Greedy Strategy
- With this strategy, we define an exploration rate epsilon that we initially set to 1. This exploration rate is the probability that our agent will explore the environment rather than exploit it. With epsilon = 1, it is 100% certain that the agent will start out by exploring the environment.
As the agent learns more about the environment, epsilon will decay at the start of each new episode by some rate that we set, so that exploration becomes less and less probable as the agent learns more and more about the environment.
- To choose an action, we generate a random number between 0 and 1. If this number is greater than epsilon, the agent will choose its next action via exploitation, i.e. it will choose the action with the highest Q-value for its current state from the Q-table. Otherwise, its action will be chosen via exploration, i.e. it will randomly choose an action and explore what happens in the environment.
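Here is a small standalone sketch of my own of the epsilon-greedy choice and the decay schedule (the constants and the exponential decay form mirror the full program further below):

import numpy as np
import random

min_epsilon, max_epsilon = 0.01, 1.0
decay_rate = 0.001

def choose_action(q_table, state, epsilon):
    # Explore with probability epsilon, otherwise exploit the Q-table.
    if random.uniform(0, 1) > epsilon:
        return int(np.argmax(q_table[state, :]))   # exploit: best known action
    return random.randrange(q_table.shape[1])      # explore: uniformly random action

def decayed_epsilon(episode):
    # Exponential decay of the exploration rate toward its minimum.
    return min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)

# Example: epsilon after 0, 500, and 5000 episodes.
print([round(decayed_epsilon(e), 3) for e in (0, 500, 5000)])   # [1.0, 0.61, 0.017]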
- All of the steps for Q-learning.
- Initialize all Q-values in the Q-table to 0.
- For each time-step in each episode:
- Choose an action (Considering the exploration-exploitation trade-off).
- Observe the reward and next state.
- Update the Q-value for the chosen state-action pair (moving it toward the right-hand side of the Bellman equation); see the update rule spelled out right after this list.
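Concretely, the update in the last step is the standard Q-learning rule, which the code below implements:

q_new(s, a) = (1 − α) · q(s, a) + α · ( R + γ · max over a′ of q(s′, a′) )

where α is the learning rate, γ is the discount rate, R is the reward observed after taking action a in state s, and s′ is the next state.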
import numpy as np
import gym
import random
import time
from IPython.display import clear_output
env = gym.make('FrozenLake-v1', render_mode='ansi')
action_space_size = env.action_space.n
state_space_size = env.observation_space.n
q_table = np.zeros((state_space_size, action_space_size))
print(q_table)
num_episodes = 1000
max_steps_per_episode = 100
learning_rate = 0.1
discount_rate = 0.99
exploration_rate = 1
max_exploration_rate = 1
min_exploration_rate = 0.01
exploration_decay_rate = 0.001
rewards_all_episodes = []
# Q-learning algorithm
for episode in range(num_episodes):
    state = env.reset()[0]
    done = False
    rewards_current_episode = 0

    for step in range(max_steps_per_episode):
        # Exploration-exploitation trade-off
        exploration_rate_threshold = random.uniform(0, 1)
        if exploration_rate_threshold > exploration_rate:
            action = np.argmax(q_table[state, :])
        else:
            action = env.action_space.sample()

        new_state, reward, done, truncated, info = env.step(action)

        # Update Q-table for Q(s, a)
        q_table[state, action] = q_table[state, action] * (1 - learning_rate) + \
            learning_rate * (reward + discount_rate * np.max(q_table[new_state, :]))

        state = new_state
        rewards_current_episode += reward

        if done or truncated:
            break

    # Exploration rate decay
    exploration_rate = min_exploration_rate + \
        (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate * episode)

    rewards_all_episodes.append(rewards_current_episode)
# Calculate and print the average reward per thousand episodes
rewards_per_thousand_episodes = np.split(np.array(rewards_all_episodes), num_episodes // 1000)
count = 1000
print("======== Average reward per thousand episodes ========\n")
for r in rewards_per_thousand_episodes:
    print(count, ": ", str(sum(r) / 1000))
    count += 1000

# Print the updated Q-table
print("\n\n======== Q-table ========\n")
print(q_table)
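Once training finishes, a simple way to inspect what was learned (a sketch of my own, not part of the snippet above; it reuses the env, q_table, and max_steps_per_episode defined there) is to play one episode greedily, always taking the best action from the Q-table:

# Play one episode greedily with the learned Q-table.
state = env.reset()[0]
total_reward = 0

for step in range(max_steps_per_episode):
    print(env.render())                      # 'ansi' render mode returns the grid as text
    action = np.argmax(q_table[state, :])    # always exploit: greedy action
    state, reward, done, truncated, info = env.step(action)
    total_reward += reward
    if done or truncated:
        break

print("Return of the greedy episode:", total_reward)
env.close()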