Reinforcement Learning Algorithms, Part 1: Q-Learning

Reinforcement learning algorithms:

Choosing actions based on value: Q-Learning, Sarsa, Deep Q Network
Choosing actions directly: Policy Gradients
Imagining an environment and learning from it: Model-Based RL

  • Reinforcement learning: with no prior guidance, the machine decides its next action purely from rewards and punishments.
  • Rewards and punishments are the agent's teacher. There are no labels at the start; the agent learns a label (a value) for each action during training and then uses those values to make decisions.
  • Q-Learning: choosing actions based on value.
  • How Q-Learning works: the key is the value table (Q-table). At each step the agent executes the action with the largest value in the table. The table starts from an arbitrary (e.g. all-zero) initialization, and each action's value is updated from the gap between the current estimate and a target value, namely the reward plus the (predicted) value of the next state.

Let's walk through the decision process using the example of doing homework versus watching TV:

Starting from state S1, we must decide: watch TV or do homework? From the Q-table we see that watching TV (a1) is punished (value -2) while doing homework is rewarded (value 1), so we pick the highest-valued action a2. We then reach state S2 and decide again: watch TV or do homework? The Q-table says watching TV (a1) is punished (value -4) and doing homework is rewarded (value 2), so again we pick the highest-valued action a2, and so on, repeating at every state (a small sketch of this lookup step follows).
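
This greedy lookup is just an argmax over one row of the Q-table. The sketch below only illustrates that step; the state names, action names and Q-values are made up to match the example above and are not part of the tutorial code.

import pandas as pd

# Hypothetical Q-table for the homework-vs-TV example (values are illustrative only)
q_table = pd.DataFrame(
    [[-2.0, 1.0],    # state s1: a1 = watch TV, a2 = do homework
     [-4.0, 2.0]],   # state s2
    index=['s1', 's2'],
    columns=['a1_watch_tv', 'a2_do_homework'],
)

state = 's1'
best_action = q_table.loc[state].idxmax()   # pick the action with the largest value
print(best_action)                          # -> 'a2_do_homework'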

So how is the Q-table defined (and updated)?

[The Q-table updated in each round of training is what the next round uses.]

At decision point S1 we chose the highest-valued action a2 and arrived at S2. We then update Q(s1, a2). The target ("real") value is R + γ * maxQ(s2), where R is the reward actually received (here no reward is given, so R = 0) and γ is a discount factor in [0, 1] that determines how strongly future value counts. The "predicted" value is the current table entry Q(s1, a2), and the entry is moved toward the target with learning rate α:

Q(s1, a2) ← Q(s1, a2) + α * [R + γ * maxQ(s2) − Q(s1, a2)]

This is exactly the update performed in the code below (q_predict, q_target and the ALPHA * (q_target - q_predict) line); a small numeric sketch follows.
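
As a minimal numeric sketch of one such update, with made-up values (α = 0.1, γ = 0.9, R = 0, current Q(s1, a2) = 1, best next-state value maxQ(s2) = 2):

ALPHA, GAMMA = 0.1, 0.9          # learning rate and discount factor (illustrative values)

q_s1_a2 = 1.0                    # current estimate Q(s1, a2), made up for this example
R = 0.0                          # no immediate reward on this step
max_q_s2 = 2.0                   # best value available from the next state s2

q_target = R + GAMMA * max_q_s2              # "real" value: 0 + 0.9 * 2 = 1.8
q_s1_a2 += ALPHA * (q_target - q_s1_a2)      # 1.0 + 0.1 * (1.8 - 1.0) = 1.08
print(q_s1_a2)                               # the entry moves a little toward the target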

 

What this example does: starting from the leftmost cell of a 1-D world, the agent can move one step left or right at a time; the treasure sits in the rightmost cell (5 moves away, since the world below has 6 cells). Q-Learning is used to learn to reach the treasure in as few steps as possible.

Partial sample output (screenshots omitted; running the code prints the moving agent and the final Q-table).

"""
A simple example for Reinforcement Learning using table lookup Q-learning method.
An agent "o" is on the left of a 1 dimensional world, the treasure is on the rightmost location.
Run this program to see how the agent improves its strategy for finding the treasure.
View more on my tutorial page: https://morvanzhou.github.io/tutorials/
"""

import numpy as np
import pandas as pd
import time

np.random.seed(2)  # reproducible


N_STATES = 6   # the length of the 1 dimensional world
ACTIONS = ['left', 'right']     # available actions
EPSILON = 0.9   # greedy policy: probability of acting greedily (exploiting the best known action)
ALPHA = 0.1     # learning rate
GAMMA = 0.9    # discount factor
MAX_EPISODES = 13   # maximum episodes
FRESH_TIME = 0.3    # fresh time for one move

# build the Q-table, initialized to all zeros
def build_q_table(n_states, actions):
    table = pd.DataFrame(
        np.zeros((n_states, len(actions))),     # q_table initial values 
        columns=actions,    # action names
    )
    # print(table)    # show table
    return table

# decide the next action, either greedily from the Q-table or at random (epsilon-greedy)
def choose_action(state, q_table):
    # This is how to choose an action
    state_actions = q_table.iloc[state, :]
    if (np.random.uniform() > EPSILON) or ((state_actions == 0).all()):  # act non-greedily, or when this state's actions all still have value 0
        action_name = np.random.choice(ACTIONS)
    else:   # act greedy
        action_name = state_actions.idxmax()    # replace argmax to idxmax as argmax means a different function in newer version of pandas
    return action_name


def get_env_feedback(S, A):
    # This is how agent will interact with the environment
    if A == 'right':    # move right
        if S == N_STATES - 2:   # terminate
            S_ = 'terminal'
            R = 1
        else:
            S_ = S + 1
            R = 0
    else:   # move left
        R = 0
        if S == 0:
            S_ = S  # reach the wall
        else:
            S_ = S - 1
    return S_, R


def update_env(S, episode, step_counter):
    # This is how the environment is updated
    env_list = ['-']*(N_STATES-1) + ['T']   # '---------T' our environment
    if S == 'terminal':
        interaction = 'Episode %s: total_steps = %s' % (episode+1, step_counter)
        print('\r{}'.format(interaction), end='')
        time.sleep(2)
        print('\r                                ', end='')
    else:
        env_list[S] = 'o'
        interaction = ''.join(env_list)
        print('\r{}'.format(interaction), end='')
        time.sleep(FRESH_TIME)


def rl():
    # main part of RL loop
    q_table = build_q_table(N_STATES, ACTIONS)
    for episode in range(MAX_EPISODES):
        step_counter = 0
        S = 0
        is_terminated = False
        update_env(S, episode, step_counter)
        while not is_terminated:

            A = choose_action(S, q_table)
            S_, R = get_env_feedback(S, A)  # take action & get next state and reward
            q_predict = q_table.loc[S, A]
            if S_ != 'terminal':
                q_target = R + GAMMA * q_table.iloc[S_, :].max()   # next state is not terminal
            else:
                q_target = R     # next state is terminal
                is_terminated = True    # terminate this episode

            q_table.loc[S, A] += ALPHA * (q_target - q_predict)  # update
            S = S_  # move to next state

            update_env(S, episode, step_counter+1)
            step_counter += 1
    return q_table


if __name__ == "__main__":
    q_table = rl()
    print('\r\nQ-table:\n')
    print(q_table)

Code from: https://morvanzhou.github.io/tutorials/
