[RL] Study Notes on Reinforcement Learning.

0. Statement 🫡

This blog contains my self-study notes on reinforcement learning, based on this website. I have organized some of the core and most important contents of the course here. 🧐

Start 🤠

Reinforcement Learning - Reinforcement learning algorithms seek to find a policy that yields more return to the agent than any other policy.

  1. A Markov decision process (MDP) - Agent, Environment, State, Action, Reward.
    The goal of an agent in an MDP is to maximize its cumulative rewards.
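As a minimal illustration of this agent-environment loop, here is a sketch (my own, not from the course) using the same Gym FrozenLake environment and API conventions as the full example later in these notes: the agent observes a state, picks an action, and receives a reward and the next state.

import gym

env = gym.make('FrozenLake-v1')
state = env.reset()[0]  # initial state (reset returns (state, info))

for t in range(5):
    action = env.action_space.sample()  # a random action, just to show the loop
    next_state, reward, done, truncated, info = env.step(action)
    print(t, state, action, reward, next_state)
    state = next_state
    if done or truncated:
        break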
  2. The cumulative rewards.
    The expected return is what’s driving the agent to make the decisions it makes.
  • It is the agent’s goal to maximize the expected return of rewards.
  3. Tasks
    1. Episodic tasks - The agent-environment interaction naturally breaks up into subsequences (episodes), such as a game of ping-pong.
    2. Continuing tasks.
      1. Discounted Return - Rather than the agent’s goal being to maximize the expected return of rewards, it will instead be to maximize the expected discounted return of rewards. The agent will be choosing action A_t at each time t to maximize the expected discounted return.
  • It is the agent’s goal to maximize the expected discounted return of rewards.
  • The discount rate is the rate at which we discount future rewards, and it determines the present value of future rewards. This definition of the discounted return means that the agent cares more about immediate rewards than about future rewards, since future rewards are more heavily discounted.
  • The infinite sum of discounted rewards is finite as long as the reward is a nonzero constant and the discount rate is less than 1. (A short sketch of this computation follows below.)
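As a quick numerical sketch (my own, with an arbitrary discount rate), the discounted return G_t = R(t+1) + γ·R(t+2) + γ²·R(t+3) + … can be computed from a list of rewards like this:

def discounted_return(rewards, gamma=0.9):
    # G = R_1 + gamma * R_2 + gamma^2 * R_3 + ...
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

print(discounted_return([1, 1, 1, 1], gamma=0.9))  # 1 + 0.9 + 0.81 + 0.729 = 3.439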
  4. Policy (π) - How probable is it for an agent to select any action from a given state?
    1. Definition: A policy is a function that maps a given state to probabilities of selecting each possible action from that state.
  5. Value functions - How good is any given action or any given state for an agent?
    1. Definition: Value functions are functions of states, or of state-action pairs, that estimate how good it is for an agent to be in a given state, or how good it is for the agent to perform a given action in a given state.
  6. State-Value Function (Vπ): Vπ(s)
    1. The state-value function for policy π, denoted as Vπ, tells us how good any given state is for an agent following policy π. (It gives us the value of a state under π)
  7. Action-Value Function (qπ): qπ(s, a)
    1. The action-value function for policy π, denoted as qπ, tells us how good it is for the agent to take any given action from a given state while following policy π. (It gives us the value of an action under π)
    2. The action-value function qπ is referred to as the Q-function, and the output from the function for any given state-action pair is called a Q-value. (The letter “Q” is used to represent the quality of taking a given action in a given state.)
  8. Wrapping up
    1. The state-value function tells us how good any given state is for the agent, whereas the action-value function tells us how good it is for the agent to take any action from a given state.
    2. Value functions are defined with respect to the expected return and to a specific way of acting, i.e. the policy. (A small numerical sketch of these definitions follows this item.)
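To make these definitions concrete, here is a small tabular sketch with made-up numbers (my own illustration, not from the course): a policy is stored as per-state action probabilities, qπ as a table of state-action values, and the state value is the probability-weighted average of the Q-values, Vπ(s) = Σa π(a|s) · qπ(s, a).

import numpy as np

# Hypothetical example: 3 states, 2 actions.
policy = np.array([  # policy[s, a] = probability of choosing action a in state s
    [0.5, 0.5],
    [0.9, 0.1],
    [0.2, 0.8],
])
q_pi = np.array([    # q_pi[s, a] = action-value of (s, a) under this policy
    [1.0, 2.0],
    [0.0, 5.0],
    [3.0, 3.0],
])

# State values under the policy: V(s) = sum over a of policy[s, a] * q_pi[s, a]
v_pi = (policy * q_pi).sum(axis=1)
print(v_pi)  # [1.5 0.5 3. ]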
  9. Optimal policies - A policy that is better than or at least as good as all other policies.
  10. Optimal State-Value Function V*: V* gives the largest expected return achievable by any policy π for each state.
  11. Optimal Action-Value Function q*: q* gives the largest expected return achievable by any policy π for each possible state-action pair.
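Continuing the toy-table style above (again with made-up numbers), V* and a greedy optimal action can be read directly off a table of optimal Q-values, since V*(s) = max over a of q*(s, a):

import numpy as np

# Hypothetical table of optimal Q-values q*(s, a) for 3 states and 2 actions.
q_star = np.array([
    [1.0, 2.0],
    [0.0, 5.0],
    [3.0, 3.0],
])

v_star = q_star.max(axis=1)             # V*(s) = max_a q*(s, a)
greedy_actions = q_star.argmax(axis=1)  # an optimal policy acts greedily w.r.t. q*
print(v_star)          # [2. 5. 3.]
print(greedy_actions)  # [1 1 0]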
  12. Q-learning: It is a reinforcement learning technique used for learning the optimal policy in a Markov Decision Process.
    1. The goal of Q-learning is to find the optimal policy by learning the optimal Q-values for each state-action pair.
    2. Exploration Vs. Exploitation:
      1. Exploration is the act of exploring the environment to find out information about it.
      2. Exploitation is the act of exploiting the information that is already known about the environment in order to maximize the return.
      3. To strike this balance between exploration and exploitation, we use what is called an epsilon greedy strategy.
  13. Choosing Actions with an Epsilon Greedy Strategy.
    1. With this strategy, we define an exploration rate epsilon that we initially set to 1. This exploration rate is the probability that our agent will explore the environment rather than exploit it. With epsilon = 1, it is 100% certain that the agent will start out by exploring the environment.
      As the agent learns more about the environment, at the start of each new episode, epsilon will decay by some rate that we set so that the likelihood of exploration becomes less and less probable as the agent learns more and more about the environment.
    2. We generate a random number between 0 and 1. If this number is greater than epsilon, then the agent will choose its next action via exploitation, i.e. it will choose the action with the highest Q-value for its current state from the Q-table. Otherwise, its action will be chosen via exploration, i.e. randomly choosing its action and exploring what happens in the environment. (This decision rule is sketched below.)
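Here is a minimal standalone sketch of this decision rule (assuming the Q-table is stored as a NumPy array, as in the full example further below):

import random
import numpy as np

def choose_action(q_table, state, epsilon, action_space_size):
    # Exploit with probability (1 - epsilon), explore otherwise.
    if random.uniform(0, 1) > epsilon:
        return int(np.argmax(q_table[state, :]))  # exploitation: best known action
    return random.randrange(action_space_size)    # exploration: random action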
  14. All of the steps for Q-learning.
    1. Initialize all Q-values in the Q-table to 0.
    2. For each time-step in each episode:
      1. Choose an action (Considering the exploration-exploitation trade-off).
      2. Observe the reward and next state.
      3. Update the Q-value for the chosen state-action pair (moving the Q-value toward the right-hand side of the Bellman optimality equation).
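The full example below trains a tabular Q-learning agent on Gym's FrozenLake-v1 environment. The update in the inner loop implements q_new(state, action) = (1 − learning_rate) · q_old(state, action) + learning_rate · (reward + discount_rate · max over a' of q(new_state, a')).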
import numpy as np
import gym
import random
import time
from IPython.display import clear_output

env = gym.make('FrozenLake-v1', render_mode='ansi')

action_space_size = env.action_space.n
state_space_size = env.observation_space.n

# Initialize all Q-values in the Q-table to 0.
q_table = np.zeros((state_space_size, action_space_size))
print(q_table)

num_episodes = 1000
max_steps_per_episode = 100

learning_rate = 0.1
discount_rate = 0.99

exploration_rate = 1
max_exploration_rate = 1
min_exploration_rate = 0.01
exploration_decay_rate = 0.001

rewards_all_episodes = []

# Q-learning algorithm
for episode in range(num_episodes):
    state = env.reset()[0]

    done = False
    rewards_current_episode = 0
    for step in range(max_steps_per_episode):

        # Exploration-exploitation trade-off
        exploration_rate_threshold = random.uniform(0, 1)
        if exploration_rate_threshold > exploration_rate:
            action = np.argmax(q_table[state,:])
        else:
            action = env.action_space.sample()

        new_state, reward, done, truncated, info = env.step(action)


        # Update Q-table for Q(s, a)
        q_table[state, action] = q_table[state, action] * (1 - learning_rate) + \
            learning_rate * (reward + discount_rate * np.max(q_table[new_state, :]))

        state = new_state
        rewards_current_episode += reward

        if done or truncated:
            break

    # Exploration rate decay
    exploration_rate = min_exploration_rate + \
                       (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate * episode)

    rewards_all_episodes.append(rewards_current_episode)

# Calculate and print the average reward per thousand episodes
rewards_per_thousand_episodes = np.split(np.array(rewards_all_episodes), num_episodes // 1000)
count = 1000
print("======== Average reward per thousand episodes========\n")
for r in rewards_per_thousand_episodes:
    print(count, ": ", str(sum(r) / 1000))
    count += 1000

# print updated Q-table
print("\n\n========Q-table========\n")
print(q_table)
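As a follow-up sketch (my own addition; it assumes the training script above has already run in the same session), the learned Q-table can be evaluated by acting greedily, with no exploration:

# Roll out the greedy policy from the learned Q-table.
num_eval_episodes = 3

for episode in range(num_eval_episodes):
    state = env.reset()[0]
    total_reward = 0
    for step in range(max_steps_per_episode):
        action = np.argmax(q_table[state, :])  # always exploit the learned Q-values
        state, reward, done, truncated, info = env.step(action)
        total_reward += reward
        if done or truncated:
            break
    print("Episode", episode + 1, "reward:", total_reward)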
