Reinforcement Learning with Code
This note records how the author began to learn RL. Both theoretical understanding and code practice are presented. Many materials are referenced, such as Zhao Shiyu's Mathematical Foundation of Reinforcement Learning.
Chapter 6. Stochastic Approximation
Stochastic approximation refers to a broad class of stochastic iterative algorithms for solving root-finding or optimization problems.
6.1 Robbins-Monro algorithm
Compared to many other root-finding algorithms such as gradient-based methods, stochastic approximation is powerful in the sense that it does not require knowing the expression of the objective function or its derivative. The Robbins-Monro algorithm is a pioneering work in the field of stochastic approximation.
Consider a function $g:\mathbb{R}\to\mathbb{R}$ for which we would like to find the root of the equation $g(w)=0$, where $w\in\mathbb{R}$ is the variable to be solved. Suppose the expression of $g$ is unknown. We can only supply the input $\textcolor{red}{w}$ and observe the noisy output $\textcolor{red}{\tilde{g}(w,\eta)=g(w)+\eta}$, where $\eta\in\mathbb{R}$ is the observation error.
(Robbins-Monro Theorem). The Robbins-Monro algorithm iterates as
$$\textcolor{red}{w_{k+1} = w_k - a_k\tilde{g}(w_k,\eta_k)}, \quad k=1,2,3,\cdots$$
where $a_k$ is a positive coefficient. If
(1) $0<c_1\le \nabla_w g(w) \le c_2$ for all $w$;
(2) $\sum_{k=1}^\infty a_k=\infty$ and $\sum_{k=1}^\infty a_k^2<\infty$;
(3) $\mathbb{E}[\eta_k|\mathcal{H}_k] = 0$ and $\mathbb{E}[\eta_k^2|\mathcal{H}_k] < \infty$,
where $\mathcal{H}_k = \{w_k,w_{k-1},\dots \}$, then $w_k$ converges with probability $1$ (w.p. $1$) to the root $w^*$ satisfying $g(w^*)=0$.
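To make the iteration concrete, here is a minimal Python sketch of the RM update on a toy problem. The hidden function $g(w)=2w-6$ (true root $w^*=3$), the Gaussian noise, and the step size $a_k=1/k$ are illustrative assumptions chosen to satisfy the three conditions above, not prescribed by the theorem.

```python
import numpy as np

rng = np.random.default_rng(0)

def g_tilde(w):
    """Noisy measurement g~(w, eta) = g(w) + eta of the hidden g(w) = 2w - 6."""
    return 2.0 * w - 6.0 + rng.normal(scale=1.0)

w = 0.0                       # initial guess w_1
for k in range(1, 10001):
    a_k = 1.0 / k             # satisfies sum a_k = inf and sum a_k^2 < inf
    w = w - a_k * g_tilde(w)  # RM update: w_{k+1} = w_k - a_k * g~(w_k, eta_k)

print(w)  # close to the true root w* = 3
```

Note that the loop never evaluates $g$ itself or its derivative; it only queries the noisy black box $\tilde{g}$, which is exactly the setting the RM algorithm is designed for.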
Example:
We use the Robbins-Monro theorem to solve the mean estimation problem. Consider the function
$$g(w) = w - \mathbb{E}[X]$$
When we take a sample $x$ of the random variable $X$, the observation is
$$\begin{aligned} \tilde{g}(w,x) & = w - x \\ & = w - \mathbb{E}[X] + (\mathbb{E}[X]-x) \\ & = \underbrace{w - \mathbb{E}[X]}_{g(w)} + \underbrace{\mathbb{E}[X]-x}_{\eta} \end{aligned}$$
Therefore, the observation $\tilde{g}(w,x)$ is the sum of $g(w)$ and an observation error $\eta$. Hence the Robbins-Monro algorithm for solving $g(w)=0$ is
$$\begin{aligned} w_{k+1} & = w_k - a_k\tilde{g}(w_k,\eta_k) \\ & = w_k - a_k(w_k - x_k) \end{aligned}$$
When $a_k = \frac{1}{k}$, the iterative equation becomes $w_{k+1} = \frac{k-1}{k}w_k +\frac{x_k}{k}$, which is exactly the incremental average of the samples $x_1,\dots,x_k$.
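This iteration can be checked numerically. A minimal sketch, assuming Gaussian samples purely for illustration: with $a_k = 1/k$ the RM iterate reproduces the running sample mean exactly, since $a_1 = 1$ discards the initial guess.

```python
import numpy as np

rng = np.random.default_rng(1)
samples = rng.normal(loc=5.0, scale=2.0, size=5000)  # i.i.d. samples x_k of X

w = 0.0  # initial estimate w_1 (irrelevant, since a_1 = 1 overwrites it)
for k, x_k in enumerate(samples, start=1):
    a_k = 1.0 / k
    w = w - a_k * (w - x_k)  # equivalently: w = (k-1)/k * w + x_k / k

print(w, samples.mean())  # the two values agree up to floating-point error
```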
6.2 Stochastic gradient descent
Stochastic gradient descent (SGD) is a special Robbins-Monro algorithm, and mean estimation is a special SGD algorithm.
Stochastic gradient descent (SGD) is used to solve the following optimization problem:
$$\textcolor{blue}{\min_w \quad J(w) = \mathbb{E}[f(w,X)]}$$
where $w$ is the parameter to be optimized, $X$ is a random variable, and the expectation is taken with respect to $X$. Here $w$ and $X$ can be either scalars or vectors, and the function $f(\cdot)$ is scalar-valued.
Gradient descent (GD):
We can use gradient descent to solve the above optimization problem,
$$\begin{aligned} w_{k+1} & = w_k - a_k \nabla_{w_k} \mathbb{E}[f(w_k,X)] \\ \textcolor{red}{w_{k+1}} & \textcolor{red}{= w_k - a_k \mathbb{E}[\nabla_{w_k} f(w_k,X)]} \end{aligned}$$
The problem with gradient descent is that the expected value on the right-hand side is difficult to calculate. One potential way to calculate it is based on the probability distribution of $X$, which is unlikely to be known in practice.
Batch gradient descent (BGD):
Inspired by Monte Carlo learning, we can collect a large number of i.i.d. samples $\{x_i\}_{i=1}^n$ of $X$ so that the expected value can be approximated as
$$\mathbb{E}[\nabla_{w_k} f(w_k,X)] \approx \frac{1}{n} \sum_{i=1}^n \nabla_{w_k} f(w_k,x_i)$$
Then the gradient descent equation becomes
$$\textcolor{red}{w_{k+1} = w_k - a_k \frac{1}{n} \sum_{i=1}^n \nabla_{w_k} f(w_k,x_i)}$$
which is also called batch gradient descent.
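A minimal BGD sketch, assuming $f(w,x)=\frac{1}{2}\Vert w-x\Vert^2$ so that $\nabla_w f(w,x)=w-x$ and the minimizer of $J$ is the sample mean; the 2-D Gaussian data and the helper `grad_f` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
xs = rng.normal(loc=3.0, scale=1.0, size=(100, 2))  # n = 100 samples of a 2-D X

def grad_f(w, x):
    """Gradient of f(w, x) = 0.5 * ||w - x||^2 with respect to w."""
    return w - x

w = np.zeros(2)
for k in range(1, 201):
    a_k = 1.0 / k
    batch_grad = np.mean([grad_f(w, x) for x in xs], axis=0)  # uses ALL n samples
    w = w - a_k * batch_grad  # BGD update

print(w, xs.mean(axis=0))  # w approaches the sample mean, the minimizer of J
```

The line computing `batch_grad` is what makes BGD expensive: every single update touches the full data set.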
Stochastic gradient descent (SGD):
The problem with BGD is that it requires all the samples in each iteration. In practice, since the samples may be collected incrementally, it is preferable to update $w$ immediately every time a sample is collected. We use the following algorithm,
$$\textcolor{red}{w_{k+1} = w_k - a_k \nabla_{w_k} f(w_k,x_k)}$$
where $x_k$ is the sample collected at time step $k$. The algorithm is called stochastic because it relies on the stochastic samples $\{x_k\}$.
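Continuing the same assumed objective $f(w,x)=\frac{1}{2}\Vert w-x\Vert^2$, the SGD version touches exactly one sample per step; here each $x_k$ is freshly drawn from the (assumed) distribution of $X$.

```python
import numpy as np

rng = np.random.default_rng(3)

w = np.zeros(2)
for k in range(1, 5001):
    x_k = rng.normal(loc=3.0, scale=1.0, size=2)  # one fresh sample at step k
    a_k = 1.0 / k
    w = w - a_k * (w - x_k)  # SGD update with the single-sample gradient w - x_k

print(w)  # converges to E[X] = (3, 3), the minimizer of J(w) = E[0.5||w - X||^2]
```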
We will show that SGD is a special form of the Robbins-Monro algorithm. The problem to be solved by SGD is to minimize $J(w)=\mathbb{E}_X[f(w,X)]$. This can be converted to a root-finding problem: find the root of $\nabla_wJ(w)=\mathbb{E}_X[\nabla_w f(w,X)]=0$. The measurement we can obtain is $\nabla_w f(w,x)$, where $x$ is a sample of the random variable $X$. Hence, we have
$$\begin{aligned} \tilde{g}(w,\eta) & = \nabla_w f(w,x) \\ & = \underbrace{\mathbb{E}_X[\nabla_w f(w,X)]}_{g(w)} + \underbrace{\nabla_w f(w,x) - \mathbb{E}_X[\nabla_w f(w,X)]}_{\eta(w,x)} \end{aligned}$$
Then, the RM algorithm for solving $g(w)=0$ is
$$w_{k+1} = w_k - a_k \tilde{g}(w_k,\eta_k) = w_k - a_k \nabla_w f(w_k, x_k)$$
which is exactly the SGD algorithm.
How to solve the deterministic formulation?
Consider the optimization problem
$$\min_w \quad J(w) = \frac{1}{n} \sum_{i=1}^n f(w,x_i)$$
This optimization problem is deterministic; it does not involve any random variables. Nevertheless, we can manually introduce a random variable and convert the deterministic formulation to the stochastic formulation of SGD. Introduce a random variable $X$ with the uniform distribution $p(X=x_i)=\frac{1}{n}$. Hence, we have
$$\min_w \quad J(w) = \frac{1}{n} \sum_{i=1}^n f(w,x_i) = \mathbb{E}[f(w,X)]$$
Then, we can use SGD to solve the optimization problem as
$$w_{k+1} = w_k - a_k \nabla_{w_k} f(w_k,x_k)$$
where $x_k$ is drawn independently and uniformly from $\{x_i\}_{i=1}^n$ at each step, so that its expectation matches the finite-sum objective.
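A minimal sketch of this trick, again assuming $f(w,x)=\frac{1}{2}\Vert w-x\Vert^2$ for illustration: the data set is fixed and deterministic, and the randomness is created manually by drawing the index uniformly with replacement, so that $\mathbb{E}[\nabla_w f(w,X)]$ equals the finite-sum gradient.

```python
import numpy as np

rng = np.random.default_rng(4)
xs = rng.normal(loc=3.0, scale=1.0, size=(100, 2))  # a fixed, deterministic data set

w = np.zeros(2)
for k in range(1, 5001):
    x_k = xs[rng.integers(len(xs))]  # draw x_k uniformly: p(X = x_i) = 1/n
    a_k = 1.0 / k
    w = w - a_k * (w - x_k)          # the same SGD update as in the stochastic case

print(w, xs.mean(axis=0))  # w approaches the minimizer of the finite-sum J
```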