Background and references:
- Bilibili lecture mirror: https://www.bilibili.com/video/BV1dJ411W78A
- Official course page: http://rail.eecs.berkeley.edu/deeprlcourse/
- My code: https://gitee.com/kin_zhang/drl-hwprogramm/tree/solution/

Please read readme.md, installation.md, etc. in the original repo first. The lectures are from 2019 Fall, but I did the homework directly against the latest 2020 Fall version. Some references (mainly reference code) are listed in the last section.

The RL book I committed to last time still isn't finished... This course was recommended by my classmate pjc and looked interesting, so I started watching the lectures and doing the assignments, treating it as preparation for this whole workflow. I'll keep working on Carla later and set up an environment there for some DRL/RL. I'll see if I can organize the course notes better down the line... right now probably only I can read them, hhhh
For a detailed write-up of homework 1, see: https://blog.csdn.net/qq_39537898/article/details/116905668
Code reading order:

Files already present in homework 1:
- infrastructure/rl_trainer.py
- infrastructure/utils.py
- policies/MLP_policy.py

Even though these exist in hw1, you still have to copy them over. One difference is that utils.py adds three small helper functions:
```python
def normalize(data, mean, std, eps=1e-8):
    return (data - mean) / (std + eps)

def unnormalize(data, mean, std):
    return data * std + mean

def add_noise(data_inp, noiseToSignal=0.01):
    # ..... Please see detail at utils.py
```
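As a quick sanity check (my own usage sketch, not part of the homework), normalize and unnormalize are inverses of each other up to the eps guard:

```python
import numpy as np

def normalize(data, mean, std, eps=1e-8):
    return (data - mean) / (std + eps)

def unnormalize(data, mean, std):
    return data * std + mean

x = np.array([1.0, 2.0, 3.0])
z = normalize(x, x.mean(), x.std())         # zero mean, roughly unit scale
x_back = unnormalize(z, x.mean(), x.std())  # recovers x up to eps error
```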
Next, in MLP_policy.py, the original MLPPolicySL becomes MLPPolicyPG. Because the later update step needs log_prob, forward now has to return a distribution object; we can't reuse the hw1 MLP forward, which just returned an output_placeholder: the result of a forward pass through the hidden layers + the output layer.
First, the PyTorch introduction to distributions: https://pytorch.org/docs/stable/distributions.html

You can see that its first example, the score function, starts from a log:
$$\Delta \theta=\alpha r \frac{\partial \log p\left(a \mid \pi^{\theta}(s)\right)}{\partial \theta}$$
Here $\theta$ is the parameter vector, $\alpha$ the learning rate, $r$ the reward, and $p(a \mid \pi^{\theta}(s))$ the probability of taking action $a$ in state $s$ under policy $\pi^{\theta}$. The corresponding code:
```python
probs = policy_network(state)
# Note that this is equivalent to what used to be called multinomial
m = Categorical(probs)
action = m.sample()
next_state, reward = env.step(action)
loss = -m.log_prob(action) * reward
loss.backward()
```
The idea is to use the returned probabilities to build a distribution over actions in the current state. In the update step we additionally multiply by the Q-value minus a baseline; I believe that is what "advantage" means. Yes, this becomes clear when we jump to the train step in pg_agent.py:
```python
# step 1: calculate q values of each (s_t, a_t) point, using rewards (r_0, ..., r_t, ..., r_T)
q_values = self.calculate_q_vals(rewards_list)

# step 2: calculate advantages that correspond to each (s_t, a_t) point
advantages = self.estimate_advantage(observations, q_values)
```
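A toy numeric sketch of those two steps (the trajectory-level computation here is my own stand-in, not the homework's calculate_q_vals):

```python
import numpy as np

gamma = 0.9
rewards_list = [[1.0, 1.0], [2.0]]   # two trajectories

# step 1: q-value of each (s_t, a_t); here the full discounted return of its
# trajectory, repeated for every step of that trajectory
q_values = np.concatenate([
    np.full(len(r), np.sum(np.array(r) * gamma ** np.arange(len(r))))
    for r in rewards_list
])                                    # [1.9, 1.9, 2.0]

# step 2: advantages = q - baseline (a constant mean baseline for this sketch)
advantages = q_values - q_values.mean()
```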
So everything is fairly clear now:

- forward reads the action distribution for the current state, and log_prob gives the log-probability of the chosen actions:
```python
def forward(self, observation: torch.FloatTensor):
    if self.discrete:
        prob_action = self.logits_na(observation)
        return distributions.Categorical(logits=prob_action)
    else:
        mean_prob = self.mean_net(observation)
        std_prob = torch.exp(self.logstd)
        return distributions.Normal(loc=mean_prob, scale=std_prob)
```
This works because __init__ builds the model according to whether the action space is discrete:

```python
if self.discrete:
    self.logits_na = ptu.build_mlp(input_size=self.ob_dim,
                                   output_size=self.ac_dim,
                                   n_layers=self.n_layers,
                                   size=self.size)
    self.logits_na.to(ptu.device)
    self.mean_net = None
    self.logstd = None
    self.optimizer = optim.Adam(self.logits_na.parameters(),
                                self.learning_rate)
else:
    self.logits_na = None
    self.mean_net = ptu.build_mlp(input_size=self.ob_dim,
                                  output_size=self.ac_dim,
                                  n_layers=self.n_layers,
                                  size=self.size)
    self.logstd = nn.Parameter(
        torch.zeros(self.ac_dim, dtype=torch.float32, device=ptu.device)
    )
    self.mean_net.to(ptu.device)
    self.logstd.to(ptu.device)
    self.optimizer = optim.Adam(
        itertools.chain([self.logstd], self.mean_net.parameters()),
        self.learning_rate
    )
```
So in forward we likewise need to pick the matching model to produce the action output and feed it into the corresponding distribution. The input differs from the earlier PyTorch example because here we take the log of the action probability, which is the objective described in the PDF. The PyTorch docs explain the two possible inputs: a categorical distribution is parameterized by either `probs` or `logits` (but not both):

```python
@lazy_property
def logits(self):
    return probs_to_logits(self.probs)

@lazy_property
def probs(self):
    return logits_to_probs(self.logits)
```
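To make the "either/or" concrete, here is a small check (my own sketch) that the two parameterizations describe the same distribution:

```python
import torch
from torch import distributions

probs = torch.tensor([0.1, 0.2, 0.7])
c_probs = distributions.Categorical(probs=probs)          # normalized probabilities
c_logits = distributions.Categorical(logits=probs.log())  # unnormalized log-probs

a = torch.tensor(2)
# both give log(0.7) for action 2
print(c_probs.log_prob(a), c_logits.log_prob(a))
```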
$$\begin{aligned}\nabla_{\theta} J(\theta) &=\nabla_{\theta} \int \pi_{\theta}(\tau) r(\tau) d \tau \\&=\int \pi_{\theta}(\tau) \nabla_{\theta} \log \pi_{\theta}(\tau) r(\tau) d \tau \\&=\mathbb{E}_{\tau \sim \pi_{\theta}(\tau)}\left[\nabla_{\theta} \log \pi_{\theta}(\tau) r(\tau)\right]\end{aligned}$$
What we compute in this step is $\log \pi_{\theta}(a \mid s)$, but as a distribution over all actions in the given state. In the second, continuous case, the network outputs the mean, so we just build the matching Gaussian: Normal takes the mean and the standard deviation.
- Following the formula, multiply by $(Q-b)$, which in the code is advantages (with a choice between reward-to-go and plain discounting). Then take the mean to get the expectation. Since we want to maximize the expectation but loss functions are minimized, we simply negate it.
For the log_prob mentioned in the hint, the PyTorch docs say: "Returns the log of the probability density/mass function evaluated at value." This is exactly our update step:
```python
def update(self, observations, actions, advantages, q_values=None):
    observations = ptu.from_numpy(observations)
    actions = ptu.from_numpy(actions)
    advantages = ptu.from_numpy(advantages)

    log_pi = self.forward(observations).log_prob(actions)
    # mul: multiplies element-wise; mean: expectation;
    # neg: minimize the negative expectation = maximize the positive expectation
    loss = torch.neg(torch.mean(torch.mul(log_pi, advantages)))

    self.optimizer.zero_grad()
    loss.backward()
    self.optimizer.step()
```
Feeding the observations into forward gives the action distribution; calling log_prob(actions) on it then gives $\log \pi_{\theta}(\tau)$ for those actions, i.e. along the whole trajectory.
- Then, as in hw1: zero the gradients, call backward(), and step the optimizer.
- Check whether a learned baseline $b$ is enabled; if so, move on to the baseline learning step.

The PDF explains why adding $b$ does not change the overall expectation of the policy gradient: the expectation contributed by the $b$ term is zero. Proof:
$$E\left[\nabla_{\theta} \log p_{\theta}(\tau) b\right]=\int p_{\theta}(\tau) \nabla_{\theta} \log p_{\theta}(\tau) b \, d \tau=\int \nabla_{\theta} p_{\theta}(\tau) b \, d \tau=b \nabla_{\theta} \int p_{\theta}(\tau) d \tau=b \nabla_{\theta} 1=0$$
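The identity can also be checked numerically. A small sketch of my own, using a 3-action softmax policy, where the gradient of $\log p(a)$ with respect to the logits is the indicator vector minus the probabilities:

```python
import numpy as np

theta = np.array([0.3, -1.2, 0.7])   # logits of a 3-action softmax policy
probs = np.exp(theta) / np.exp(theta).sum()

b = 5.0                              # any constant baseline
grad_log = np.eye(3) - probs         # row a: d log p(a) / d theta

# E_a[ grad log p(a) * b ] = sum_a p(a) * grad_log[a] * b
expectation = (probs[:, None] * grad_log * b).sum(axis=0)
print(expectation)                   # ~ [0, 0, 0] up to float error
```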
- One thing that puzzled me: the hint for computing the baseline loss says "use F.mse_loss", but I didn't seem to need it...
That wraps up the first three files. Next are... oh, well, looks like I've already covered them early:
- agent/pg_agent.py
- policies/MLP_policy.py
Next, pg_agent.py, which computes the advantage; that part is fairly simple... The main thing is the dimensions: at first I went straight through pow with one-dimensional arrays, then found that torch.mul didn't broadcast the way I expected and the dimensions didn't match...

One gotcha in pg_agent.py: at first I thought "standardize" meant passing mean=0 and std=1 as inputs; only later did I realize it means the standardized result should have a mean of zero:
# standardize the advantages to have a mean of zero
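In other words (a sketch of my reading of that comment, with an eps guard like the normalize helper; the variable names are my own):

```python
import numpy as np

advantages = np.array([2.0, -1.0, 4.0, 0.5])   # toy advantage estimates
# standardize the advantages themselves: subtract their mean, divide by their std;
# mean=0 and std=1 describe the OUTPUT, they are not the inputs
standardized = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
```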
For the issue of lining the computed reward list up with the observations, see the bugs section below.
Results

See solution.md in the repo; I won't repeat it here. Code: https://gitee.com/kin_zhang/drl-hwprogramm/tree/solution/
Bugs I wrote

np.power & pow & x**power

The first time I finished writing, the dimensions of reward and observation never matched. After a lot of printing I traced it to the discount computation in pg_agent.py: the exponent array wasn't aligned with the right dimension.
The second experiment's average return stuck between 0 and 10

This one was strange. The rough cause, I eventually figured out, is the interaction between how the advantage is computed and what you feed the distribution. As mentioned at the end, I referenced two repos, and they compute the advantage differently and also use different distributions in the continuous case.
```python
def forward(self, observation: torch.FloatTensor):
    if self.discrete:
        prob_action = self.logits_na(observation)
        return distributions.Categorical(logits=prob_action)
    else:
        mean_prob = self.mean_net(observation)
        std_prob = torch.exp(self.logstd)
        return distributions.Normal(loc=mean_prob, scale=std_prob)

def _discounted_return(self, rewards):
    discount_np = np.power(np.array(self.gamma), np.arange(len(rewards)))
    discounted_returns = np.sum(np.array(rewards) * discount_np, keepdims=True)
    list_of_discounted_returns = discounted_returns.tolist()
    return list_of_discounted_returns

def _discounted_cumsum(self, rewards):
    discount_np = np.power(np.array(self.gamma), np.arange(len(rewards)))
    r = np.array(rewards)
    list_of_discounted_cumsums = [np.sum(discount_np[:len(rewards) - t] * r[t:])
                                  for t in range(len(rewards))]
    return list_of_discounted_cumsums

def update(self, observations, actions, advantages, q_values=None):
    observations = ptu.from_numpy(observations)
    actions = ptu.from_numpy(actions)
    advantages = ptu.from_numpy(advantages)

    log_pi = self.forward(observations).log_prob(actions)
    loss = -(log_pi * advantages.view(-1, 1)).mean()
```
First, this approach is the dimension mismatch I mentioned. An example: observation and action both have length 1005, while reward has length 41. This method computes log_pi * advantages.view(-1, 1) directly, i.e. (1005,) multiplied by (41, 1). Because of numpy-style broadcasting (presumably torch behaves the same), the product still goes through, and the resulting size is (41, 1005). Note that forward here uses distributions.Normal(loc=mean_prob, scale=std_prob) in the continuous case.
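The silent broadcast is easy to reproduce in plain numpy (shapes copied from the example above):

```python
import numpy as np

log_pi = np.zeros(1005)       # one log-prob per (s_t, a_t) step
adv = np.zeros((41, 1))       # 41 per-path returns, reshaped like view(-1, 1)

# no error is raised: (1005,) broadcasts against (41, 1) into (41, 1005),
# so the "loss" silently averages over a meaningless outer product
print((log_pi * adv).shape)   # (41, 1005)
```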
The two combinations, in short: view(-1, 1) + Normal versus torch.mul + MultivariateNormal.
The other approach is the one I initially thought was at least dimensionally intuitive (I later realized I had skipped view(-1, 1) and called torch.mul directly, which raises a dimension mismatch). This method expands reward to (1005,) so it can be multiplied element-wise. The main difference is the forward method: here the standard deviation we pass in is a diagonal matrix. I found a comparison of Normal and MultivariateNormal, though it doesn't really bring out this particular point: https://bochang.me/blog/posts/pytorch-distributions/

The key point is that the two combinations cannot be mixed: if you compute the repeated reward but forward with Normal, the average return never goes up.
```python
def forward(self, observation: torch.FloatTensor):
    if self.discrete:
        prob_action = self.logits_na(observation)
        return distributions.Categorical(logits=prob_action)
    else:
        mean_prob = self.mean_net(observation)
        std_prob = torch.exp(self.logstd)
        return distributions.MultivariateNormal(mean_prob,
                                                scale_tril=torch.diag(std_prob))

def _discounted_return(self, rewards):
    len_reward = len(rewards)
    power_array = np.arange(0, len_reward)
    discount = self.gamma ** power_array
    discounted_returns = np.sum(rewards * discount, keepdims=True)
    list_of_discounted_returns = np.repeat(discounted_returns, len_reward)
    return list_of_discounted_returns

def _discounted_cumsum(self, rewards):
    list_of_discounted_cumsums = []
    len_reward = len(rewards)
    power_array = np.arange(0, len_reward)
    discount = self.gamma ** power_array
    np_rewards = np.array(rewards)
    for t in range(len_reward):
        list_of_discounted_cumsums.append(
            np.sum(np_rewards[t:] * discount[:len_reward - t]))
    return list_of_discounted_cumsums

def update(self, observations, actions, advantages, q_values=None):
    observations = ptu.from_numpy(observations)
    actions = ptu.from_numpy(actions)
    advantages = ptu.from_numpy(advantages)

    log_pi = self.forward(observations).log_prob(actions)
    loss = torch.neg(torch.mean(torch.mul(log_pi, advantages)))
```
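Pulled out as standalone functions (my own renaming of the two methods above) and run on a tiny example:

```python
import numpy as np

def discounted_return(rewards, gamma):
    # full-trajectory discounted return, repeated once per timestep
    discount = gamma ** np.arange(len(rewards))
    total = np.sum(np.array(rewards) * discount, keepdims=True)
    return np.repeat(total, len(rewards))

def discounted_cumsum(rewards, gamma):
    # reward-to-go: sum over t' >= t of gamma^(t'-t) * r_{t'}
    r = np.array(rewards)
    discount = gamma ** np.arange(len(r))
    return [np.sum(r[t:] * discount[:len(r) - t]) for t in range(len(r))]

print(discounted_return([1.0, 1.0, 1.0], 0.5))   # [1.75 1.75 1.75]
print(discounted_cumsum([1.0, 1.0, 1.0], 0.5))   # [1.75, 1.5, 1.0]
```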
Later I worked out why reward had length 41: the return value doesn't concatenate the rewards onto one array, but keeps them as a separate list:
```python
def convert_listofrollouts(paths):
    """
    Take a list of rollout dictionaries
    and return separate arrays,
    where each array is a concatenation of that array from across the rollouts
    """
    observations = np.concatenate([path["observation"] for path in paths])
    actions = np.concatenate([path["action"] for path in paths])
    next_observations = np.concatenate([path["next_observation"] for path in paths])
    terminals = np.concatenate([path["terminal"] for path in paths])
    concatenated_rewards = np.concatenate([path["reward"] for path in paths])
    unconcatenated_rewards = [path["reward"] for path in paths]
    return observations, actions, next_observations, terminals, concatenated_rewards, unconcatenated_rewards
```
You can see there is a concatenated version, but what actually gets returned and used is unconcatenated_rewards. So by this reasoning the second method, repeating the corresponding reward, should be the correct one: each path yields one return, and every observation in that path shares that return, since the rewards really do follow along with it. My understanding is that MultivariateNormal with the diagonal scale makes the mean reduce over each path's observations multiplied by that path's return. I drew a diagram at the time to analyze this, but it was a bit hard to read, even after redrawing it.

In short: that's the method, and the two combinations must not be mixed...
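The shape difference between the two forwards can be seen directly (a minimal sketch of my own, assuming a 2-dim continuous action over 5 steps):

```python
import torch
from torch import distributions

mean = torch.zeros(5, 2)      # 5 steps, 2-dim action mean
std = torch.ones(2)
acts = torch.zeros(5, 2)

n = distributions.Normal(loc=mean, scale=std)
mvn = distributions.MultivariateNormal(mean, scale_tril=torch.diag(std))

print(n.log_prob(acts).shape)    # torch.Size([5, 2]): per-dimension log-probs
print(mvn.log_prob(acts).shape)  # torch.Size([5]): one log-prob per step
```

So MultivariateNormal already gives one log-probability per step, matching the repeated per-step reward vector, while Normal leaves an extra action-dimension axis hanging.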
References
- https://github.com/welkin-feng/cs285-homework-2020/tree/master/hw2 (I thought its discount computation in pg_agent had a problem, but that turned out to be a misunderstanding; see the pull requests in that repo for details)
- https://github.com/vincentkslim/cs285_homework_fall2020/tree/master/hw2