Reinforcement Learning with Code 【Chapter 6. Stochastic Approximation】

This article introduces stochastic approximation methods in reinforcement learning, in particular the Robbins-Monro algorithm and stochastic gradient descent. The Robbins-Monro algorithm finds the root of an equation without knowing the expression of the objective function. Stochastic gradient descent is a special stochastic approximation algorithm for optimization, commonly used to minimize an expected value; batch gradient descent and stochastic gradient descent are two different forms of it. The article also shows how a deterministic optimization problem can be converted into a form solvable by stochastic gradient descent.

Reinforcement Learning with Code

This note records how the author began to learn RL. Both theoretical understanding and code practice are presented. Much of the material is referenced from Zhao Shiyu's Mathematical Foundations of Reinforcement Learning.

Chapter 6. Stochastic Approximation

Stochastic approximation refers to a broad class of stochastic iterative algorithms for solving root-finding or optimization problems.

6.1 Robbins-Monro algorithm

Compared to many other root-finding algorithms, such as gradient-based methods, stochastic approximation is powerful in the sense that it does not require knowing the expression of the objective function or its derivative. The Robbins-Monro (RM) algorithm is a pioneering work in the field of stochastic approximation.

Consider a function $g(w)$ whose root we would like to find, i.e., we want to solve $g(w)=0$, where $w\in\mathbb{R}$ is the variable to be solved for and $g:\mathbb{R}\to\mathbb{R}$. Suppose the expression of $g$ is unknown. We can only choose the input $\textcolor{red}{w}$ and observe the noisy measured output $\textcolor{red}{\tilde{g}(w,\eta)=g(w)+\eta}$, where $\eta\in\mathbb{R}$ is the observation error.

(Robbins-Monro Theorem). The Robbins-Monro algorithm estimates the root iteratively as

$$\textcolor{red}{w_{k+1} = w_k - a_k\tilde{g}(w_k,\eta_k)}, \quad k=1,2,3,\cdots$$

where $a_k$ is a positive coefficient. If

(1) $0<c_1\le \nabla_w g(w) \le c_2$ for all $w$;

(2) $\sum_{k=1}^\infty a_k=\infty$ and $\sum_{k=1}^\infty a_k^2<\infty$;

(3) $\mathbb{E}[\eta_k\mid\mathcal{H}_k] = 0$ and $\mathbb{E}[\eta_k^2\mid\mathcal{H}_k] < \infty$,

where $\mathcal{H}_k = \{w_k,w_{k-1},\dots\}$, then $w_k$ converges with probability $1$ (w.p. $1$) to the root $w^*$ satisfying $g(w^*)=0$.
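As an illustration, here is a minimal sketch of the RM iteration. The hypothetical unknown function $g(w)=\tanh(w-1)$ (with root $w^*=1$), the Gaussian noise level, and the initial guess are all assumptions made for this demo, not part of the theorem.

```python
import numpy as np

rng = np.random.default_rng(0)

def g_tilde(w, noise_std=0.2):
    # Noisy measurement g~(w, eta) = g(w) + eta of the (assumed unknown)
    # function g(w) = tanh(w - 1), whose root is w* = 1.
    return np.tanh(w - 1.0) + noise_std * rng.normal()

w = 3.0                       # initial guess w_1
for k in range(1, 1001):
    a_k = 1.0 / k             # satisfies sum a_k = inf and sum a_k^2 < inf
    w = w - a_k * g_tilde(w)  # Robbins-Monro update
print(w)                      # approaches the root w* = 1
```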

Example:

We use the Robbins-Monro algorithm to solve the mean estimation problem. Consider the function

$$g(w)=w-\mathbb{E}[X]$$

When we take a sample $x$ of the random variable $X$, the observation is

$$\begin{aligned} \tilde{g}(w,x) & = w - x \\ & = w - \mathbb{E}[X] + (\mathbb{E}[X]-x) \\ & = \underbrace{w - \mathbb{E}[X]}_{g(w)} + \underbrace{\mathbb{E}[X]-x}_{\eta} \end{aligned}$$

Therefore, the observation $\tilde{g}(w,x)$ is the sum of $g(w)$ and an observation error $\eta$. Hence the Robbins-Monro algorithm for solving $g(w)=0$ is

$$\begin{aligned} w_{k+1} & = w_k - a_k\tilde{g}(w_k,\eta_k) \\ & = w_k - a_k \tilde{g}(w_k,x_k) \\ & = w_k - a_k(w_k - x_k) \end{aligned}$$

When $a_k$ takes $\frac{1}{k}$, the iterative equation becomes $w_{k+1} = \frac{k-1}{k}w_k +\frac{x_k}{k}$, which is the familiar incremental form of the sample-mean estimate.
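A minimal sketch of this mean-estimation iteration (the Gaussian samples and their mean of 5 are purely an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=5.0, scale=2.0, size=10_000)  # iid samples x_k of X, E[X] = 5

w = 0.0                          # initial estimate w_1
for k, x_k in enumerate(samples, start=1):
    a_k = 1.0 / k                # step size a_k = 1/k
    w = w - a_k * (w - x_k)      # w_{k+1} = w_k - a_k (w_k - x_k)

print(w, samples.mean())         # with a_k = 1/k, w equals the running sample mean
```

With $a_k=1/k$ the update reproduces the incremental sample mean exactly; other step sizes satisfying the RM conditions still give convergence to $\mathbb{E}[X]$ w.p. $1$.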

6.2 Stochastic gradient descent

Stochastic gradient descent (SGD) is a special Robbins-Monro algorithm, and mean estimation is in turn a special SGD algorithm.

SGD is used to solve the following optimization problem:

$$\textcolor{blue}{\min_w \quad J(w) = \mathbb{E}[f(w,X)]}$$

where $w$ is the parameter to be optimized and $X$ is a random variable; the expectation is taken with respect to $X$. Here $w$ and $X$ can be either scalars or vectors, and the function $f(\cdot)$ is scalar-valued.

Gradient descent (GD):

We can use gradient descent to solve the above optimization problem:

$$\begin{aligned} w_{k+1} & = w_k - a_k \nabla_{w_k} \mathbb{E}[f(w_k,X)] \\ \textcolor{red}{w_{k+1}} & \textcolor{red}{= w_k - a_k \mathbb{E}[\nabla_{w_k} f(w_k,X)]} \end{aligned}$$

The problem with gradient descent is that the expected value on the right-hand side is difficult to calculate. One potential way to calculate it is to use the probability distribution of $X$, which is unlikely to be known in practice.

Batch gradient descent (BGD):

Inspired by Monte Carlo learning, we can collect a large number of iid samples $\{x_i\}_{i=1}^n$ of $X$ so that the expected value can be approximated as

$$\mathbb{E}[\nabla_{w_k} f(w_k,X)] \approx \frac{1}{n} \sum_{i=1}^n \nabla_{w_k} f(w_k,x_i)$$

Then the gradient descent equation becomes

$$\textcolor{red}{w_{k+1} = w_k - a_k \frac{1}{n} \sum_{i=1}^n \nabla_{w_k} f(w_k,x_i)}$$

which is also called batch gradient descent (BGD).
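As a sketch, take the assumed example objective $f(w,x)=\frac{1}{2}\|w-x\|^2$, for which the minimizer of $J(w)$ is $\mathbb{E}[X]$ and the batch gradient is simply the average of $(w-x_i)$:

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(loc=[2.0, -1.0], scale=1.0, size=(1000, 2))  # iid samples {x_i} of X

def batch_grad(w, xs):
    # (1/n) * sum_i grad_w f(w, x_i) with f(w, x) = 0.5 * ||w - x||^2
    return np.mean(w - xs, axis=0)

w = np.zeros(2)
for k in range(200):
    w = w - 0.5 * batch_grad(w, xs)   # BGD with a constant step size a_k = 0.5

print(w, xs.mean(axis=0))             # w converges to the sample mean of {x_i}
```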

Stochastic gradient descent (SGD):

The problem with BGD is that it requires all the samples in each iteration. In practice, samples may be collected incrementally, so it is preferable to update $w$ immediately every time a sample is collected. This leads to the following algorithm:

$$\textcolor{red}{w_{k+1} = w_k - a_k \nabla_{w_k} f(w_k,x_k)}$$

where $x_k$ is the sample collected at time step $k$. The algorithm is called stochastic because it relies on the stochastic samples $\{x_k\}$.
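The corresponding SGD sketch, for the same assumed quadratic objective $f(w,x)=\frac{1}{2}\|w-x\|^2$, updates $w$ with a single sample per step:

```python
import numpy as np

rng = np.random.default_rng(1)

w = np.zeros(2)
for k in range(1, 10_001):
    x_k = rng.normal(loc=[2.0, -1.0], scale=1.0)  # sample collected at time step k
    a_k = 1.0 / k                                 # diminishing step size
    w = w - a_k * (w - x_k)                       # grad_w f(w_k, x_k) = w_k - x_k

print(w)                                          # approaches E[X] = [2, -1]
```

Compared with the BGD sketch above, each update touches only one sample, so $w$ can be refined online as data arrive.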

We will now show that SGD is a special form of the Robbins-Monro algorithm. The problem solved by SGD is to minimize $J(w)=\mathbb{E}_X[f(w,X)]$. This can be converted to a root-finding problem: find the root of $\nabla_w J(w)=\mathbb{E}_X[\nabla_w f(w,X)]=0$. The measurement we can obtain is $\nabla_w f(w,x)$, where $x$ is a sample of the random variable $X$. Hence, we have

$$\begin{aligned} \tilde{g}(w,\eta) & = \nabla_w f(w,x) \\ & = \underbrace{\mathbb{E}_X[\nabla_w f(w,X)]}_{g(w)} + \underbrace{\nabla_w f(w,x) - \mathbb{E}_X[\nabla_w f(w,X)]}_{\eta(w,x)} \end{aligned}$$

Then the Robbins-Monro algorithm for solving $g(w)=0$ is

$$w_{k+1} = w_k - a_k \tilde{g}(w_k,\eta_k) = w_k - a_k \nabla_w f(w_k, x_k)$$

which is exactly the SGD algorithm.

How to solve a deterministic formulation?

Consider the optimization problem

$$\min_w \quad J(w) = \frac{1}{n} \sum_{i=1}^n f(w,x_i)$$

This formulation is deterministic: it does not involve any random variable. Nevertheless, we can introduce a random variable manually and convert the deterministic formulation into the stochastic formulation of SGD. Introduce a random variable $X$ with distribution $p(X=x_i)=\frac{1}{n}$. Then

$$\min_w \quad J(w) = \frac{1}{n} \sum_{i=1}^n f(w,x_i) = \mathbb{E}[f(w,X)]$$

and we can use SGD to solve the optimization problem:

$$w_{k+1} = w_k - a_k \nabla_{w_k} f(w_k,x_k)$$

where $x_k$ is drawn uniformly at random from $\{x_i\}_{i=1}^n$ at each step.
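A sketch of this conversion (the finite dataset and the quadratic $f$ are illustrative assumptions): drawing $x_k$ uniformly from $\{x_i\}$ at each step realizes the distribution $p(X=x_i)=\frac{1}{n}$.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.uniform(-1.0, 1.0, size=(500, 3))   # fixed deterministic dataset {x_i}

# With f(w, x) = 0.5 * ||w - x||^2, J(w) = (1/n) sum_i f(w, x_i) is minimized
# at the dataset mean (1/n) sum_i x_i.
w = np.zeros(3)
for k in range(1, 20_001):
    x_k = xs[rng.integers(len(xs))]          # draw x_k with p(X = x_i) = 1/n
    a_k = 1.0 / k
    w = w - a_k * (w - x_k)                  # SGD update: grad_w f(w, x_k) = w - x_k

print(w, xs.mean(axis=0))                    # w approaches the minimizer of J(w)
```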


Reference

Professor Zhao Shiyu's course (Mathematical Foundations of Reinforcement Learning).
