Classic Reinforcement Learning Algorithm Notes (20): Cross Entropy Method
This post fills in the notes for one more classic RL algorithm.
Thanks to https://the0demiurge.blogspot.com/2017/08/cross-entropy-method-cem.html
Thanks to https://en.wikipedia.org/wiki/Cross-entropy_method
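Basic concepts of CEM:
The cross-entropy method (CEM) is a Monte Carlo method used mainly for optimization and importance sampling. It resembles evolutionary algorithms, which scatter points in the search space according to some rule, measure the error at each point, and then use that error information to decide the rule for the next round of sampling. The method gets its name because, in theory, its goal is to minimize the cross entropy between the distribution of the randomly sampled points and the actual distribution of the data (equivalent to minimizing the KL divergence), so that the sampling distribution matches the actual distribution as closely as possible.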
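The equivalence between minimizing cross entropy and minimizing KL divergence follows from a standard identity. The notation here is an assumption made for illustration and is not from the original post: $p$ is the actual (target) distribution and $q_\theta$ is the sampling distribution with parameters $\theta$.

```latex
H(p, q_\theta) = -\sum_x p(x) \log q_\theta(x), \qquad
H(p) = -\sum_x p(x) \log p(x)

D_{\mathrm{KL}}(p \,\|\, q_\theta)
  = \sum_x p(x) \log \frac{p(x)}{q_\theta(x)}
  = H(p, q_\theta) - H(p)

\arg\min_\theta D_{\mathrm{KL}}(p \,\|\, q_\theta) = \arg\min_\theta H(p, q_\theta)
\quad \text{because } H(p) \text{ does not depend on } \theta
```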
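The CEM procedure:
- First, model the problem by parameterizing its solution. In reinforcement learning, for example, suppose the state $S$ is an $n$-dimensional vector and there are 2 actions in total. The simplest idea is to build an $n$-dimensional parameter vector $W$ and compute $S^T W$ to obtain a scalar $Q$; when $Q > 0$ take the first action, otherwise take the second. Any optimization algorithm could be used to find the optimal $W$; the cross-entropy method simply tends to converge quickly and stably.
- Assume the parameter $W$ follows a Gaussian distribution, and randomly initialize an $n$-dimensional vector $\mu$ and an $n$-dimensional vector $\sigma^2$, whose entries correspond to each dimension of $W$.
- Using $\mu$ and $\sigma^2$ as the mean and variance, sample $m$ parameter vectors $w_1, w_2, w_3, \dots, w_m$.
- Compute the reward obtained by each $w$.
- Select the $k$ ($k < m$) parameter vectors with the highest reward, compute the mean and variance of this group, and update $\mu$ and $\sigma^2$ to the values just computed.
- If the procedure has converged, return the $w$ with the largest reward; otherwise repeat steps 3 to 6.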
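The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the author's implementation: the function name `cem`, its default arguments, the convergence threshold, and the quadratic `toy_reward` with its target vector are all made up for the example, and `sigma` stores a standard deviation rather than a variance.

```python
import numpy as np


def cem(reward_fn, dim, n_samples=50, n_elite=10, n_iters=30, seed=0):
    """Diagonal-Gaussian CEM sketch; returns final mean, best sample, best reward."""
    rng = np.random.default_rng(seed)
    mu = rng.uniform(size=dim)   # step 2: random initial mean
    sigma = np.ones(dim)         # step 2: initial standard deviation (assumed 1)
    best_w, best_r = mu.copy(), -np.inf
    for _ in range(n_iters):
        # step 3: sample m parameter vectors, one per row
        W = rng.normal(loc=mu, scale=sigma, size=(n_samples, dim))
        # step 4: evaluate the reward of every sample
        rewards = np.array([reward_fn(w) for w in W])
        # step 5: keep the k best samples and refit the Gaussian to them
        elite_idx = np.argsort(rewards)[-n_elite:]
        elite = W[elite_idx]
        mu, sigma = elite.mean(axis=0), elite.std(axis=0)
        # remember the best single sample seen so far
        if rewards[elite_idx[-1]] > best_r:
            best_r = rewards[elite_idx[-1]]
            best_w = W[elite_idx[-1]].copy()
        # step 6: stop once the sampling distribution has collapsed
        if sigma.max() < 1e-6:
            break
    return mu, best_w, best_r


def toy_reward(w):
    # Hypothetical objective: negative squared distance to a fixed target.
    target = np.array([1.0, -2.0, 0.5])
    return -np.sum((w - target) ** 2)


mu, best_w, best_r = cem(toy_reward, dim=3)
print(mu, best_w, best_r)
```

For a control task, `reward_fn` would run one episode with the policy defined by `w` and return the episode reward.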
Pseudocode for the Cross Entropy Method, from Wikipedia.
// Initialize parameters
μ := −6
σ2 := 100
t := 0
maxits := 100
N := 100
Ne := 10
// While maxits not exceeded and not converged
while t < maxits and σ2 > ε do
    // Obtain N samples from current sampling distribution
    X := SampleGaussian(μ, σ2, N)
    // Evaluate objective function at sampled points
    S := exp(−(X − 2) ^ 2) + 0.8 exp(−(X + 2) ^ 2)
    // Sort X by objective function values in descending order
    X := sort(X, S)
    // Update parameters of sampling distribution
    μ := mean(X(1:Ne))
    σ2 := var(X(1:Ne))
    t := t + 1
// Return mean of final sampling distribution as solution
return μ
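For readers who want to run the pseudocode, here is one possible NumPy port. It is a sketch under assumptions: the pseudocode leaves ε unspecified, so the default `eps=1e-8` is a guess, and the function names and the fixed `seed` are introduced only for this example.

```python
import numpy as np


def objective(x):
    # Bimodal objective from the pseudocode: peaks near x = 2 and x = -2.
    return np.exp(-(x - 2) ** 2) + 0.8 * np.exp(-(x + 2) ** 2)


def cem_1d(mu=-6.0, sigma2=100.0, max_its=100, n=100, n_elite=10,
           eps=1e-8, seed=0):
    """Run 1-D CEM; returns the final mean and the number of iterations used."""
    rng = np.random.default_rng(seed)
    t = 0
    while t < max_its and sigma2 > eps:
        # Sample N points; rng.normal expects a standard deviation, not a variance.
        x = rng.normal(mu, np.sqrt(sigma2), size=n)
        s = objective(x)
        # Sort by objective value in descending order and keep the Ne best.
        elite = x[np.argsort(s)[::-1][:n_elite]]
        # Refit the sampling distribution to the elite samples.
        mu, sigma2 = elite.mean(), elite.var()
        t += 1
    return mu, t


mu_final, n_iters_used = cem_1d()
print(mu_final, n_iters_used)
```

The returned mean should end up close to one of the two local maxima of the objective; which one can depend on the random seed.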
An example of using CEM to control the CartPole game.
# modified from https://gist.github.com/andrewliao11/d52125b52f76a4af73433e1cf8405a8f
import gym
import numpy as np
import matplotlib.pyplot as plt

# NOTE: written against the older gym API (gym < 0.26), where reset() returns
# only the state and step() returns four values.
env = gym.make('CartPole-v0')
env = env.unwrapped  # remove the TimeLimit wrapper; episodes are capped manually below
# env.render()

# vectors of means (mu) and standard deviations (sigma), one entry per parameter
mu = np.random.uniform(size=env.observation_space.shape)
sigma = np.random.uniform(low=0.001, size=env.observation_space.shape)
print(mu.shape)
print(sigma.shape)


def noisy_evaluation(env, W, render=False):
    """
    uses parameter vector W to choose policy for 1 episode,
    returns reward from that episode
    """
    reward_sum = 0
    state = env.reset()
    t = 0
    while True:
        t += 1
        action = int(np.dot(W, state) > 0)  # use parameters/state to choose action
        state, reward, done, info = env.step(action)
        reward_sum += reward
        if render and t % 3 == 0:
            env.render()
        if done or t > 2000:  # stop when the episode ends or after 2000 steps
            # print("finished episode, got reward:{}".format(reward_sum))
            break
    return reward_sum


def init_params(mu, sigma, n):
    """
    Sample n points (n = 40 here) per dimension, using the components of mu and
    sigma (4 dimensions) as mean and standard deviation, to form n 4-dimensional
    vectors.
    """
    l = mu.shape[0]  # l = 4
    w_matrix = np.zeros((n, l))
    for p in range(l):
        w_matrix[:, p] = np.random.normal(loc=mu[p], scale=sigma[p] + 1e-7, size=(n,))
    return w_matrix


def get_constant_noise(step):
    # extra noise added to sigma; for step <= 40 this evaluates to 1.0
    return np.clip(5 - step / 10., a_min=0.5, a_max=1)


running_reward = 0
n = 40      # number of parameter vectors sampled per iteration
p = 8       # number of top vectors kept
n_iter = 40
render = False
state = env.reset()
i = 0
while i < n_iter:
    # initialize an array of parameter vectors
    wvector_array = init_params(mu, sigma, n)
    reward_sums = np.zeros((n))
    for k in range(n):
        # sample rewards based on policy parameters in row k of wvector_array
        reward_sums[k] = noisy_evaluation(env, wvector_array[k, :], render)
    # sort params/vectors based on total reward of an episode using that policy
    rankings = np.argsort(reward_sums)
    # pick p vectors with highest reward
    top_vectors = wvector_array[rankings, :]
    top_vectors = top_vectors[-p:, :]
    print("top vectors shape:{}".format(top_vectors.shape))
    # fit new gaussian from which to sample policy
    for q in range(top_vectors.shape[1]):
        mu[q] = top_vectors[:, q].mean()
        # add extra noise to the standard deviation update
        sigma[q] = top_vectors[:, q].std() + get_constant_noise(i)
    running_reward = 0.99 * running_reward + 0.01 * reward_sums.mean()
    print("#############################################################################")
    print("iteration:{},mean reward:{}, running reward mean:{} \n"
          " reward range:{} to {},".format(
              i, reward_sums.mean(), running_reward, reward_sums.min(), reward_sums.max(),
          ))
    i += 1
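The script above relies on the older gym interface. On newer Gym or Gymnasium releases, `reset()` returns `(observation, info)` and `step()` returns five values, so the evaluation function needs a small change. The following is a sketch of how that adaptation might look, not part of the original script; the name `evaluate_linear_policy` and the `max_steps` parameter are introduced here for illustration.

```python
import numpy as np


def evaluate_linear_policy(env, w, max_steps=2000):
    """Run one episode with action = 1 if w . state > 0 else 0; return total reward.

    Assumes a Gymnasium-style environment:
      reset() -> (observation, info)
      step(a) -> (observation, reward, terminated, truncated, info)
    """
    state, _info = env.reset()
    total = 0.0
    for _ in range(max_steps):
        # Same linear decision rule as noisy_evaluation in the script above.
        action = int(np.dot(w, state) > 0)
        state, reward, terminated, truncated, _info = env.step(action)
        total += reward
        # The episode ends on failure (terminated) or on a time limit (truncated).
        if terminated or truncated:
            break
    return total
```

With Gymnasium installed, an environment created by `gymnasium.make('CartPole-v1')` could be passed as `env`, and the final `mu` from the training loop could be passed as `w` to check the learned policy.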