Reinforcement Learning Series (10): On-policy Control with Approximation

1. Preface

In this chapter we focus on the on-policy control problem, using a parametric method to approximate the action-value function, $\hat{q}(s,a,\mathbf{w}) \approx q(s,a)$, where $\mathbf{w}$ is the weight vector. Off-policy methods are discussed in Chapter 11. This chapter introduces the semi-gradient Sarsa algorithm, an extension of the semi-gradient TD(0) algorithm from the previous chapter, applied to approximating action values for on-policy control. In the episodic case this extension is straightforward, but for continuing tasks we must reconsider how the discount factor is used to define an optimal policy. Notably, once function approximation is used in continuing tasks, we have to abandon discounting and switch to a new formulation with an "average reward" and a "differential" value function.

First, for episodic tasks, we extend the function-approximation ideas that the previous chapter applied to state values over to action values. We then extend those ideas to the on-policy GPI process, using $\varepsilon$-greedy action selection. Finally, for continuing tasks, we apply the same ideas to the average-reward formulation with differential values.

2. Episodic Semi-gradient Control

We now extend the semi-gradient prediction methods of Chapter 9 to the control problem. Here the approximate action value $\hat{q} \approx q_\pi$ is a function of the weight vector $\mathbf{w}$. When approximating state values in Chapter 9, the training examples were of the form $S_t \mapsto U_t$; in this chapter they are of the form $S_t, A_t \mapsto U_t$. The update target $U_t$ can be any approximation of $q_\pi(S_t, A_t)$, whether obtained by MC or by n-step Sarsa. The gradient-descent update for action-value prediction is:
$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \big[ U_t - \hat{q}(S_t, A_t, \mathbf{w}_t) \big] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)$$
For one-step Sarsa, the target is $U_t \doteq R_{t+1} + \gamma\, \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t)$, which yields the episodic semi-gradient one-step Sarsa update:

$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \big[ R_{t+1} + \gamma\, \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t) \big] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)$$
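To make the update concrete, here is a minimal sketch of episodic semi-gradient one-step Sarsa with $\varepsilon$-greedy action selection, assuming a linear approximator $\hat{q}(s,a,\mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s,a)$, so that $\nabla \hat{q}(s,a,\mathbf{w}) = \mathbf{x}(s,a)$. The `env` interface (`reset`/`step`) and the `feature` function are hypothetical placeholders, not part of the original text:

```python
import numpy as np

def epsilon_greedy(w, feature, state, actions, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return actions[rng.integers(len(actions))]
    q_values = [w @ feature(state, a) for a in actions]
    return actions[int(np.argmax(q_values))]

def semi_gradient_sarsa(env, feature, n_features, actions,
                        alpha=0.1, gamma=1.0, epsilon=0.1,
                        num_episodes=500, seed=0):
    """Episodic semi-gradient one-step Sarsa with linear q_hat = w . x(s, a).

    Assumes a hypothetical env with reset() -> state and
    step(action) -> (next_state, reward, done), and a feature(s, a)
    that returns an np.ndarray of length n_features.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(n_features)
    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(w, feature, state, actions, epsilon, rng)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            x = feature(state, action)
            q = w @ x                      # q_hat(S_t, A_t, w)
            if done:
                # Terminal transition: the target is just R_{t+1}
                w += alpha * (reward - q) * x
                break
            next_action = epsilon_greedy(w, feature, next_state,
                                         actions, epsilon, rng)
            q_next = w @ feature(next_state, next_action)
            # Semi-gradient update: the gradient is taken only through
            # q_hat(S_t, A_t, w), not through the bootstrapped target
            w += alpha * (reward + gamma * q_next - q) * x
            state, action = next_state, next_action
    return w
```

Note that the update treats the bootstrapped target as a constant: the gradient is taken only through $\hat{q}(S_t, A_t, \mathbf{w}_t)$, which is exactly what makes the method "semi-gradient".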

