Reinforcement Learning with Code
This note records how the author began to learn RL. Both theoretical understanding and code practice are presented. Many materials are referenced, such as Zhao Shiyu's Mathematical Foundation of Reinforcement Learning.
Chapter 6. Stochastic Approximation
Stochastic approximation refers to a broad class of stochastic iterative algorithms for solving root-finding or optimization problems.
6.1 Robbins-Monro algorithm
Compared to many other root-finding algorithms such as gradient-based methods, stochastic approximation is powerful in the sense that it does not require knowing the expression of the objective function or its derivative. The Robbins-Monro algorithm is a pioneering work in the field of stochastic approximation.
Consider a function $g:\mathbb{R}\to\mathbb{R}$ for which we would like to find the root of the equation $g(w)=0$, where $w\in\mathbb{R}$ is the variable to be solved. Suppose the expression of $g$ is unknown. We can only supply the input $\textcolor{red}{w}$ and observe the noisy output $\textcolor{red}{\tilde{g}(w,\eta)=g(w)+\eta}$, where $\eta\in\mathbb{R}$ is the observation error.
(Robbins-Monro Theorem). The Robbins-Monro algorithm iterates as
$$\textcolor{red}{w_{k+1} = w_k - a_k\tilde{g}(w_k,\eta_k)}, \quad k=1,2,3,\cdots$$
where $a_k$ is a positive coefficient. If
(1) $0<c_1\le \nabla_w g(w) \le c_2$ for all $w$;
(2) $\sum_{k=1}^\infty a_k=\infty$ and $\sum_{k=1}^\infty a_k^2<\infty$;
(3) $\mathbb{E}[\eta_k|\mathcal{H}_k] = 0$ and $\mathbb{E}[\eta_k^2|\mathcal{H}_k] < \infty$,
where $\mathcal{H}_k = \{w_k,w_{k-1},\dots \}$, then $w_k$ converges with probability $1$ (w.p. $1$) to the root $w^*$ satisfying $g(w^*)=0$.
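To make the iteration concrete, here is a minimal Python sketch of the RM update on a toy problem. The hidden function $g(w)=2w-6$ (true root $w^*=3$), the Gaussian noise, and the step size $a_k=1/k$ are illustrative assumptions chosen to satisfy the three conditions above, not prescribed by the theorem.

```python
import numpy as np

rng = np.random.default_rng(0)

def g_tilde(w):
    """Noisy measurement g~(w, eta) = g(w) + eta of the hidden g(w) = 2w - 6."""
    return 2.0 * w - 6.0 + rng.normal(scale=1.0)

w = 0.0                       # initial guess w_1
for k in range(1, 10001):
    a_k = 1.0 / k             # satisfies sum a_k = inf and sum a_k^2 < inf
    w = w - a_k * g_tilde(w)  # RM update: w_{k+1} = w_k - a_k * g~(w_k, eta_k)

print(w)  # close to the true root w* = 3
```

Note that the loop never evaluates $g$ itself or its derivative; it only queries the noisy black box $\tilde{g}$, which is exactly the setting the RM algorithm is designed for.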
Example:
We use the Robbins-Monro theorem to solve the mean estimation problem. Consider the function
$$g(w) = w - \mathbb{E}[X]$$
When we take a sample $x$ of the random variable $X$, the observation is
$$\begin{aligned} \tilde{g}(w,x) & = w - x \\ & = w - \mathbb{E}[X] + (\mathbb{E}[X]-x) \\ & = \underbrace{w - \mathbb{E}[X]}_{g(w)} + \underbrace{\mathbb{E}[X]-x}_{\eta} \end{aligned}$$
Therefore, the observation $\tilde{g}(w,x)$ is the sum of $g(w)$ and an observation error $\eta$. Hence the Robbins-Monro algorithm for solving $g(w)=0$ is
$$\begin{aligned} w_{k+1} & = w_k - a_k\tilde{g}(w_k,\eta_k) \\ & = w_k - a_k(w_k - x_k) \end{aligned}$$
When $a_k = \frac{1}{k}$, the iterative equation becomes $w_{k+1} = \frac{k-1}{k}w_k +\frac{x_k}{k}$, which is exactly the incremental average of the samples $x_1,\dots,x_k$.
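This iteration can be checked numerically. A minimal sketch, assuming Gaussian samples purely for illustration: with $a_k = 1/k$ the RM iterate reproduces the running sample mean exactly, since $a_1 = 1$ discards the initial guess.

```python
import numpy as np

rng = np.random.default_rng(1)
samples = rng.normal(loc=5.0, scale=2.0, size=5000)  # i.i.d. samples x_k of X

w = 0.0  # initial estimate w_1 (irrelevant, since a_1 = 1 overwrites it)
for k, x_k in enumerate(samples, start=1):
    a_k = 1.0 / k
    w = w - a_k * (w - x_k)  # equivalently: w = (k-1)/k * w + x_k / k

print(w, samples.mean())  # the two values agree up to floating-point error
```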
6.2 Stochastic gradient descent
Stochastic gradient descent (SGD) is a special Robbins-Monro algorithm, and mean estimation is a special SGD algorithm.
Stochastic gradient descent (SGD) is used to solve the following optimization problem:
$$\textcolor{blue}{\min_w \quad J(w) = \mathbb{E}[f(w,X)]}$$
where $w$ is the parameter to be optimized, $X$ is a random variable, and the expectation is taken with respect to $X$. Here $w$ and $X$ can be either scalars or vectors, and the function $f(\cdot)$ is scalar-valued.
Gradient descent (GD):
We can use gradient descent to solve the above optimization problem,
$$\begin{aligned} w_{k+1} & = w_k - a_k \nabla_{w_k} \mathbb{E}[f(w_k,X)] \\ \textcolor{red}{w_{k+1}} & \textcolor{red}{= w_k - a_k \mathbb{E}[\nabla_{w_k} f(w_k,X)]} \end{aligned}$$
The problem with gradient descent is that the expected value on the right-hand side is difficult to calculate. One potential way to calculate it is based on the probability distribution of $X$, which is unlikely to be known in practice.
Batch gradient descent (BGD):
Inspired by Monte Carlo learning, we can collect a large number of i.i.d. samples $\{x_i\}_{i=1}^n$ of $X$ so that the expected value can be approximated as
$$\mathbb{E}[\nabla_{w_k} f(w_k,X)] \approx \frac{1}{n} \sum_{i=1}^n \nabla_{w_k} f(w_k,x_i)$$
Then the gradient descent equation becomes
$$\textcolor{red}{w_{k+1} = w_k - a_k \frac{1}{n} \sum_{i=1}^n \nabla_{w_k} f(w_k,x_i)}$$
which is also called batch gradient descent.
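A minimal BGD sketch, assuming $f(w,x)=\frac{1}{2}\Vert w-x\Vert^2$ so that $\nabla_w f(w,x)=w-x$ and the minimizer of $J$ is the sample mean; the 2-D Gaussian data and the helper `grad_f` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
xs = rng.normal(loc=3.0, scale=1.0, size=(100, 2))  # n = 100 samples of a 2-D X

def grad_f(w, x):
    """Gradient of f(w, x) = 0.5 * ||w - x||^2 with respect to w."""
    return w - x

w = np.zeros(2)
for k in range(1, 201):
    a_k = 1.0 / k
    batch_grad = np.mean([grad_f(w, x) for x in xs], axis=0)  # uses ALL n samples
    w = w - a_k * batch_grad  # BGD update

print(w, xs.mean(axis=0))  # w approaches the sample mean, the minimizer of J
```

The line computing `batch_grad` is what makes BGD expensive: every single update touches the full data set.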
Stochastic gradient descent (SGD):
The problem with BGD is that it requires all the samples in each iteration. In practice, since the samples may be collected incrementally, it is preferable to update $w$ immediately every time a sample is collected. We use the following algorithm,
$$\textcolor{red}{w_{k+1} = w_k - a_k \nabla_{w_k} f(w_k,x_k)}$$
where $x_k$ is the sample collected at time step $k$. The algorithm is called stochastic because it relies on the stochastic samples $\{x_k\}$.
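Continuing the same assumed objective $f(w,x)=\frac{1}{2}\Vert w-x\Vert^2$, the SGD version touches exactly one sample per step; here each $x_k$ is freshly drawn from the (assumed) distribution of $X$.

```python
import numpy as np

rng = np.random.default_rng(3)

w = np.zeros(2)
for k in range(1, 5001):
    x_k = rng.normal(loc=3.0, scale=1.0, size=2)  # one fresh sample at step k
    a_k = 1.0 / k
    w = w - a_k * (w - x_k)  # SGD update with the single-sample gradient w - x_k

print(w)  # converges to E[X] = (3, 3), the minimizer of J(w) = E[0.5||w - X||^2]
```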
We will show that SGD is a special form of the Robbins-Monro algorithm. The problem to be solved by SGD is to minimize $J(w)=\mathbb{E}_X[f(w,X)]$. This can be converted to a root-finding problem: find the root of $\nabla_wJ(w)=\mathbb{E}_X[\nabla_w f(w,X)]=0$. The measurement we can obtain is $\nabla_w f(w,x)$, where $x$ is a sample of the random variable $X$. Hence, we have
$$\begin{aligned} \tilde{g}(w,\eta) & = \nabla_w f(w,x) \\ & = \underbrace{\mathbb{E}_X[\nabla_w f(w,X)]}_{g(w)} + \underbrace{\nabla_w f(w,x) - \mathbb{E}_X[\nabla_w f(w,X)]}_{\eta(w,x)} \end{aligned}$$
Then, the RM algorithm for solving $g(w)=0$ is
$$w_{k+1} = w_k - a_k \tilde{g}(w_k,\eta_k) = w_k - a_k \nabla_w f(w_k, x_k)$$
which is exactly the SGD algorithm.
How to solve the deterministic formulation?
Consider the optimization problem
$$\min_w \quad J(w) = \frac{1}{n} \sum_{i=1}^n f(w,x_i)$$
This optimization problem is deterministic; it does not involve any random variables. Nevertheless, we can manually introduce a random variable and convert the deterministic formulation to the stochastic formulation of SGD. Introduce a random variable $X$ with the uniform distribution $p(X=x_i)=\frac{1}{n}$. Hence, we have
$$\min_w \quad J(w) = \frac{1}{n} \sum_{i=1}^n f(w,x_i) = \mathbb{E}[f(w,X)]$$
Then, we can use SGD to solve the optimization problem as
$$w_{k+1} = w_k - a_k \nabla_{w_k} f(w_k,x_k)$$
where $x_k$ is drawn independently and uniformly from $\{x_i\}_{i=1}^n$ at each step, so that its expectation matches the finite-sum objective.
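A minimal sketch of this trick, again assuming $f(w,x)=\frac{1}{2}\Vert w-x\Vert^2$ for illustration: the data set is fixed and deterministic, and the randomness is created manually by drawing the index uniformly with replacement, so that $\mathbb{E}[\nabla_w f(w,X)]$ equals the finite-sum gradient.

```python
import numpy as np

rng = np.random.default_rng(4)
xs = rng.normal(loc=3.0, scale=1.0, size=(100, 2))  # a fixed, deterministic data set

w = np.zeros(2)
for k in range(1, 5001):
    x_k = xs[rng.integers(len(xs))]  # draw x_k uniformly: p(X = x_i) = 1/n
    a_k = 1.0 / k
    w = w - a_k * (w - x_k)          # the same SGD update as in the stochastic case

print(w, xs.mean(axis=0))  # w approaches the minimizer of the finite-sum J
```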