协方差自适应调整的进化策略（CMA-ES）转载自知乎——补充——补充matlab代码

本文链接：https://blog.csdn.net/weixin_50084948/article/details/124841912

写的挺好，居然看懂了，转载一下，随后贴上相应的matlab代码。

之前转载的原文好像挂了，于是在知乎上又找了一篇相关的文章，原文链接为：进化策略及其在深度学习中的应用 - 知乎

本文仅作个人学习用，若有侵权请联系删除

本文翻译自：https://lilianweng.github.io/lil-log/2019/09/05/evolution-strategies.html,这篇博文很早就看过，但是每次都是似懂非懂，各种数学公式，虽然我做这个方向的研究很久，但是仍然不能全懂。这个端午，闲来无事，便萌生了翻译的念头，在此分享给各位对这个方向感兴趣的同学们。进化计算近年来的应用领域也一直被拓展，在很多领域展现了很好的性能，是个不错的研究方向。当研究到Fisher信息矩阵时竟发现与几何流形竟也有一定的相关性。深知知识的相通性，学海无涯，希望能再接再厉，对这一领域探知一二。

在学习最优模型参数时，梯度下降不是唯一的选择。在我们不知道目标函数的精确解析形式或不能直接计算梯度的情况下，进化策略(ES)是有效的。这篇文章深入探讨了几个经典的ES方法，以及如何在深度强化学习中使用ES。

随机梯度下降是优化深度学习模型的普遍选择。然而，这并不是唯一的选择。在黑盒优化算法中,可以评估目标函数 $f(x):\mathbb{R}n\rightarrow \mathbb{R}$ ，即使你不知道目标函数精确的解析形式,因此无需计算梯度或海赛矩阵。黑箱优化方法的实例有模拟退火法、爬坡法和 Nelder-Mead method等一系列算法。

进化策略(ES)是一种黑箱优化算法，诞生于进化算法(EA)家族。这篇文章将深入介绍一些经典的ES方法，并介绍ES如何在深度强化学习中发挥作用的一些应用。

什么是进化策略

进化策略(ES)属于进化算法的大家庭。ES的优化目标是实数的向量， $x\in \mathbb{R}n$ 。

进化算法是一种基于种群划分的优化算法，其灵感来自自然选择。自然选择认为，具有有利于生存的特性的个体可以世代生存，并将好的特性传给下一代。进化是在选择过程中逐渐发生的，进化使种群生长得能更好地适应环境。

preview

进化算法作为一种通用的优化方案，可以归纳为以下格式:

假设我们想优化一个函数f(x)但f(x)的梯度难以求解。只能得到在特定x下f(x)的确定值。

我们认为随机变量 x 的概率分布 $p_\theta(x)$ 是函数 f (x) 优化问题的一个较优的解，θ 是分布 $p_\theta(x)$ 的参数。j我们最终的目标是找到 θ 的最优设置。

在给定固定分布形式（例如，高斯分布）的情况下，参数 θ 包含了最优解的知识，在一代与一代间进行迭代更新。

初始化参数θ，我们可以连续地更新θ通过以下三步循环：

产生初始采样种群 $D=\{ (x_i,f(x_i))\},x_i\sim p_\theta(x)$
评估采样种群的”适应度“值。
选择最优子集来更新参数θ，最优子集的选择通常是根据适应度值或者是排序。

简单高斯进化策略

简单高斯进化策略是进化策略的最基础和最经典的版本。它将 $p_\theta(x)$ 建模为一个n维的各向同性的高斯分布，在高斯分布中 θ 指的是均值 $\mu$ 和标准差 $\sigma$

$\theta=(\mu,\sigma),p_\theta(x)\sim \mathcal N(\mu,\sigma^2I)=\mu+\sigma\mathcal N(0,I)$

简单高斯进化策略的流程如下，给定 $x\in\mathcal R^n$ ：

初始化参数 $\theta=\theta_0$ ，计数器t=0
从高斯分布中采样种群大小为 $\Lambda$ 的后代种群：
选择出使得最优的 λ 个样本组成的子集，该子集被称为「精英集」。为了不失一般性，我们可以考虑 D(t+1) 中适应度排名靠前的 k 个样本，将它们放入「精英集」。我们可以将其标注为：
选择出使得最优的 λ 个样本组成的子集，该子集被称为「精英集」。为了不失一般性，我们可以考虑 D(t+1) 中适应度排名靠前的 k 个样本，将它们放入「精英集」。我们可以将其标注为：
公式（7）：
重复（2）-（4）步直到结果足够好。

协方差自适应进化策略

基本原理

标准差 σ 决定了探索的程度：当 σ 越大时，我们就可以在更大的搜索空间中对后代种群进行采样。在简单高斯演化策略中，σ(t+1) 与 σ(t) 密切相关，因此算法不能在需要时（即置信度改变时）迅速调整探索空间。

「协方差矩阵自适应演化策略」（CMA-ES）通过使用协方差矩阵 C 跟踪分布上得到的样本两两之间的依赖关系，解决了这一问题。新的分布参数变为了：

其中，σ 控制分布的整体尺度，我们通常称之为「步长」。

在我们深入研究 CMA-ES 中的参数更新方法前，不妨先回顾一下多元高斯分布中协方差矩阵的工作原理。作为一个对称阵，协方差矩阵 C 有下列良好的性质：

C 始终是对角阵
C 始终是半正定矩阵
所有的特征值都是非负实数
所有特征值都是正交的
C 的特征向量可以组成 Rn 的一个标准正交基

令矩阵 C 有一个特征向量 B=[b_1,b_2,...,b_n] 组成的标准正交基，相应的特征值分别为 $\lambda_1^2,\lambda_2^2,...,\lambda_n^2$ 令 D=diag (λ1,…,λn)。

公式（1）：

C 的平方根为：

公式（2）：

更新均值

CMA-ES 使用 $\alpha_\mu\leq 1$ 的学习率控制均值 μ 更新的速度。通常情况下，该学习率被设置为1，从而使上述等式与简单高斯演化策略中的均值更新方法相同。

控制步长

采样过程可以与均值和标准差的更新解耦：

公式（6）：

参数 σ 控制着分布的整体尺度。它是从协方差矩阵中分离出来的，所以我们可以比改变完整的协方差更快地改变步长。步长较大会导致参数更新较快。为了评估当前的步长是否合适，CMA-ES 通过将连续的移动步长序列相加 $\small \frac{1}{\lambda}\sum _i^\lambda y_i^{(j)},j=1,2,...,t$ 构建了一个演化路径（evolution path） $p_\sigma$ 通过比较该路径与随机选择（意味着每一步之间是不相关的）状态下期望会生成的路径长度，我们可以相应地调整 σ（详见下图）。

preview

每次演化路径都会以同代中的平均移动步长 y_i 进行更新。

通过与 $C^{-1/2}$ 相乘，我们将演化路径转化为与其方向相独立的形式。

$C^{(t)-1/2}=B^{(t)^T}D^{(t)^{-1/2}B(t)}$ 原理如下：

B(t) 包含 C 的特征向量的行向量。它将原始空间投影到了正交的主轴上。
将各主轴的长度放缩到相等的状态。
$B^{(t)^T}$ 将空间转换回原始坐标系.

为了给最近几代的种群赋予更高的权重，我们使用了「Polyak 平均」算法（平均优化算法在参数空间访问轨迹中的几个点），以学习率 $\alpha_\sigma$ 更新演化路径。同时，我们平衡了权重，从而使 $p_\sigma$ 在更新前和更新后都为服从 N (0,I) 的共轭分布（更新前后的先验分布和后验分布类型相同）。

公式（8）：

随机选择得到的 $p_\sigma$ 的期望长度为 E‖N (0,I)‖，该值是服从 N (0,I) 的随机变量的 L2 范数的数学期望。按照图 2 中的思路，我们将根据 ‖pσ(t+1)‖/E‖N (0,I)‖ 的比值调整步长：

公式（3）：E||N(0,I)|| == E(norm(randn(N,1)))

公式（11）：

自适应协方差矩阵

我们可以使用精英样本的 y_i 从头开始估计协方差矩阵 $(y_i\sim N(0,C))$

只有当我们选择出的种群足够大，上述估计才可靠。然而，在每一代中，我们确实希望使用较小的样本种群进行快速的迭代。这就是 CMA-ES 发明了一种更加可靠，但同时也更加复杂的方式去更新 C 的原因。它包含两种独立的演化路径：

秩 min (λ, n) 更新：使用 { $C_\lambda$ } 中的历史，在每一代中都是从头开始估计的。
秩 1 更新：根据历史估计移动步长以及符号信息

第一条路径根据 { $C_\lambda$ }的全部历史考虑 C 的估计。例如，如果我们拥有很多代种群的经验，

就是一种很好的估计方式。类似于 $p_\sigma$ ，我们也可以使用「polyak」平均，并且通过学习率引入历史信息：

公式（5）：

通常我们选择的学习率为：

第二条路径试图解决丢失符号信息的问题。与我们调整步长 σ 的方法相类似，我们使用了一个演化路径 p_c 来记录符号信息， p_c 仍然是种群更新前后都服从于 N (0,C) 的共轭分布。

我们可以认为我们可以认为 p_c 是另一种计算的（请注意它们都服从于 N (0,C)），此时我们使用了完整的历史信息，并且能够保留符号信息。请注意，在上一节中，我们已经知道了是另一种计算的（请注意它们都服从于 N (0,C)），此时我们使用了完整的历史信息，并且能够保留符号信息。请注意，在上一节中，我们已经知道了

的计算方法如下：

公式（9）：

接下来，通过 p_c 来更新协方差矩阵：

公式（4）：

当 k 较小时，秩 1 更新方法相较于秩 min (λ, n) 更新有很大的性能提升。这是因为我们在这里利用了移动步长的符号信息和连续步骤之间的相关性，而且这些信息可以随着种群的更新被一代一代传递下去。

最后，我们将两种方法结合起来：

公式（10）：

preview

在上面所有的例子中，我们认为每个优秀的样本对于权重的贡献是相等的，都为 1/λ。该过程可以很容易地被扩展至根据具体表现为抽样得到的样本赋予不同权重的情况。

preview

下面的例子引用自一项外骨骼在线参数优化的工作。[1]. Zhang, J., Fiers, P., Witte, K. A., Jackson, R. W., Poggensee, K. L., Atkeson, C. G., Collins, S. H. Human-in-the-loop optimization of exoskeleton assistance during walking. Science 2017, 356, 1280-1284.

后续有时间我对代码加上中文注释，与上面的理论部分对应起来。

%% Sample code for the CMA-ES process used in the main study.

% This function emulates the CMA-ES optimization used in the system. In
% actual experiments, the following process was implemented using simulink
% state machine.

% USER DEFINED PARAMETERS
% N: Number of objective variables
% xmean: Initial guess of objective variable distribution mean
%        xmean(1): peak torque
%        xmean(2): 2 * percent stride at which peak torque occurs
%        xmean(3): rise time, as a percentage of stride period
%        xmean(4): 2 * fall time, as a percentage of stride period
% sigma: Standard deviation of variable distribution. Initial value needs
% selecting
% stopeval: Number of evaluations before stopping the optimization
% lambda: Population size of each generation
% pc, ps   % Evolution paths for C and sigma. Initial values needs
% selecting
% B: Matrix that defines the coordinate system. Initial values needs
% selecting
% D: Vector that defines the scaling. Initial values needs selecting
% C: Covariance matrix C of variable distribution. Initial values needs
% selecting


% OUTPUTS
% xmin: The estimated optimal objective variable value that minimizes the
% objective function.

% This function was adapted from  Nikolaus Hansen's code located at
% URL: http://www.lri.fr/~hansen/purecmaes.m and references at the end. It
% demonstrates a basic CMA-ES process as used in our main study.
clear all
close all
clc
%%
% function xmin=CMAESsamplecode   % (mu/mu_w, lambda)-CMA-ES
% --------------------  Initialization --------------------------------
% User defined input parameters (need to be edited)
objectfun = 'getmetabolicrates';  
% Name of objective/fitness function
N = 4;               
% 变量维度的大小Number of objective variables/problem dimension
subjweight = 50;      
% Subject weight in kg
xmean = [0.55*subjweight 90 25 20]';
% 待求解的x的N维度的初始值
% Objective variables initial point
sigma = 10;          
% 初始步长Coordinate wise standard deviation (step size)
stopeval = 32;   
% 最大迭代步数Stop after stopeval number of function evaluations

% Strategy parameter setting: Selection
lambda = 4+floor(3*log(N));  
%种群大小 Population size, offspring number
mu = lambda/2;               
%精英集的大小，取了一半 Number of parents/points for recombination
weights = log(mu+1/2)-log(1:mu)'; 
%精英集中成员的权重，对应理论部分最后几句 mu*1 array for weighted recombination
mu = floor(mu);
weights = weights/sum(weights);     
%精英集中成员的权重归一化Normalize recombination weights array
mueff=sum(weights)^2/sum(weights.^2); 
% 理论部分没有，精英集中成员权重的方差-有效性Variance-effectiveness of sum w_i x_i
%
% Initialize dynamic (internal) strategy parameters and constants
pc = zeros(N,1);
%协方差的演化路径Evolution paths for C
ps = zeros(N,1);   
%sigma步长的演化路径p_sigma
B = eye(N,N);                       
%协方差矩阵C的标准正交基矩阵B defines the coordinate system
D = ones(N,1);                      
%协方差矩阵C的特征值矩阵D Diagonal D defines the scaling
C = B * diag(D.^2) * B';            
%公式（1） Covariance matrix C
invsqrtC = B * diag(D.^-1) * B';    
% 公式（2）C^-1/2
eigeneval = 0;                      
% Track update of B and D
chiN=N^0.5*(1-1/(4*N)+1/(21*N^2));  
% 公式（3）高斯分布二范数的期望Expectation of
%   ||N(0,I)|| == norm(randn(N,1))




% Strategy parameter setting: Adaptation
cc = (4+mueff/N) / (N+4 + 2*mueff/N);  
% 对应公式（9），协方差路径的系数，cc为alph_cp,Time constant for cumulation for C
cs = (mueff+2) / (N+mueff+5);  
% 对应公式（8）中的alpa_sigma，t-const for cumulation for sigma control
c1 = 2 / ((N+1.3)^2+mueff);    
% 对应公式（4）中alpha_c1，Learning rate for rank-one update of C
cmu = min(1-c1, 2 * (mueff-2+1/mueff) / ((N+2)^2+mueff));  
% 对应公式（5）中alpha_c_lambda，and for rank-mu update
damps = 1 + 2*max(0, sqrt((mueff-1)/(N+1))-1) + cs; 
% 对应公式（11）d_sigma，Damping for sigma
% usually close to 1

% -------------------- Generation Loop --------------------------------
counteval = 0;  %计数器Number of evalution finished
while counteval < stopeval
    fprintf("counteval = %d\n", counteval)
    % Generate and evaluate lambda offspring
    for k=1:lambda,%生成整体种群并评估
        x(:, k) = xmean + sigma * B * (D .* randn(N,1)); 
        % 对应公式（6）m + sig * Normal(0,C)和公式（2）
        x(:, k) = apply_constraints(x(:,k), 50); 
        % 根据约束修正生成的x，Apply hard constraints
        % to the random generated parameters. See apply_constraints.m for
        % seperate definition of the function.
        metrates(k) = feval(objectfun, x(:,k)); 
        % 带入目标函数评估，得到结果metrates，Objective function call
        counteval = counteval+1;
    end
    
    % Sort by fitness and compute weighted mean into xmean
    [metrates, idx] = sort(metrates); 
    %排序，用于筛选出较小值 Minimization
    xold = xmean;
    xmean = x(:,idx(1:mu))*weights;
    %对应公式（7）筛选出精英集，并依据权重求出精英集均值mu，Recombination, new mean value
    fprintf("xmean = %d\n", xmean)
    
    % Cumulation: Update evolution paths
    ps = (1-cs)*ps ...
        + sqrt(cs*(2-cs)*mueff) * invsqrtC * (xmean-xold) / sigma;
    %对应公式（8）更新进化路径，与理论不一致没有用精英集大小mu，mueff与mu有关
    hsig = norm(ps)/sqrt(1-(1-cs)^(2*counteval/lambda))/chiN < 1.4 + 2/(N+1);
    %理论部分没有
    pc = (1-cc)*pc ...
        + hsig * sqrt(cc*(2-cc)*mueff) * (xmean-xold) / sigma;
    %对应公式（9），cc为alph_cp。hsig为记录的符号信息，理论部分没有
    % Adapt covariance matrix C
    artmp = (1/sigma) * (x(:,idx(1:mu))-repmat(xold,1,mu));
    %对应公式（6）求解出yi
    C = (1-c1-cmu) * C ...                  % Regard old matrix
        + c1 * (pc*pc' ...                 % Plus rank one update
        + (1-hsig) * cc*(2-cc) * C) ... % Minor correction if hsig==0
        + cmu * artmp * diag(weights) * artmp'; % Plus rank mu update
    %对应公式（10），hsig这项理论没有
    % Adapt step size sigma
    sigma = sigma * exp((cs/damps)*(norm(ps)/chiN - 1));
    %对应公式（11），自适应更新步长
    % Decomposition of C into B*diag(D.^2)*B' (diagonalization)
    if counteval - eigeneval > lambda/(c1+cmu)/N/10  % to achieve O(N^2)
         %矩阵异常处理，理论部分没有
        eigeneval = counteval;
        C = triu(C) + triu(C,1)'; % Enforce symmetry
        [B,D] = eig(C);           % Eigen decomposition, B==normalized eigenvectors
        D = sqrt(diag(D));        % D is a vector of standard deviations now
        invsqrtC = B * diag(D.^-1) * B';
    end
    
    pause(0.5)
    
 end 

xmin = xmean; 
%得到的一组精英集的平均值，按照权重weights平均
%Return the lastest variable distribution as an estimate of the opitmal value


%% REFERENCES
%
% Hansen, N. and S. Kern (2004). Evaluating the CMA Evolution
% Strategy on Multimodal Test Functions.  Eighth International
% Conference on Parallel Problem Solving from Nature PPSN VIII,
% Proceedings, pp. 282-291, Berlin: Springer.
% (http://www.bionik.tu-berlin.de/user/niko/ppsn2004hansenkern.pdf)
%
% Hansen, N. and A. Ostermeier (2001). Completely Derandomized
% Self-Adaptation in Evolution Strategies. Evolutionary Computation,
% 9(2), pp. 159-195.
% (http://www.bionik.tu-berlin.de/user/niko/cmaartic.pdf).
%
% Hansen, N., S.D. Mueller and P. Koumoutsakos (2003). Reducing the
% Time Complexity of the Derandomized Evolution Strategy with
% Covariance Matrix Adaptation (CMA-ES). Evolutionary Computation,
% 11(1).  (http://mitpress.mit.edu/journals/pdf/evco_11_1_1_0.pdf).
%

function f=getmetabolicrates(x)
% This is a pseudo function used to reprent the process of metabolic rate
% evalution in actual human-in-the-loop optimization. In actual operation,
% this process was accomplished by measuring the human subject's resparitory
% flow rates of oxygen and carbon dioxide, converting them to raw metabolic
% rate measurements and getting a single estimate of the actual metabolic
% rate while the subject is walking with exoskeleton under one fixed
% assistance condition. The assistance condition was defined by parameter
% x, with a corresponding torque curve (see torquecurve_example.m) defined
% by HLCParams = [x(1) 0.5*x(2) x(3) 0.5*x(4)];

f = 250 + 100*randn(1); % just random noise, for illustration purposes only

%% Apply Constraints to Control Parameters during Optimization

% This function is used during human-in-the-loop optimization. After CMA-ES
% generated a generation of conditions in the form of [peak torque, 2* peak
% time, rise time, 2* fall time], each set of condition is put through this
% function to be hard constraint.

% INPUTS
% original params: One set of control parameters randomly generated by
% CMA-ES. It is a 4x1 or 1x4 vector made of [peak torque, 2* peak time,
% rise time, 2* fall time] or [peak torque, 2*peak time, rise time, 2* fall
% time]'.
% maxTorque: The maximum allowed peak torque of the current subject. It is
% usually set as 0.825*(subject body mass), but can be changed based on
% subject feedback. Normally it's between [0.8 1]*(subject body mass).

% OUTPUTS
% params: Control parameters [peak torque, 2*peak time, rise time, 2*fall
% time]' after constraints applied.

% This function constrains peak torque to be [0 maxTorque]. 2*peak time is
% constraint to be at least 10% of strid period. When peak torque is
% smaller or equal to 75% of maxTroque, 2*peak time is upper bounded at 110%
% stride period. When peak torque is in[75% 100%] of maxTorque, the upper
% bound of 2*peak time is linearly interploted between 100% and 110% ob
% stride period. Both rise time and 2*fall time is bounded in [10% 40%] of
% stride period. Further more, when peak time + fall time >65% of stride
% period, we force fall time = 65- peak time. Then if peak time - rise time
% < 5% of stride period, ew enforce rise time = peak time -5.

% Copyright@Juanjuan Zhang, Steven H Collins 11/30/2016





function params = apply_constraints(original_params, maxTorque)
params = zeros(length(original_params), 1);

% assert params to be column vector;
if iscolumn(original_params)
    params = original_params;
else
    params = original_params';
end;


params(find(params<0))=0; %% All parameters are asserted to be bigger or equal to 0.
params(3,1) = max(10, params(3,1)); % rise time at least 10%
params(3,1) = min(40, params(3,1)); % rise time me at most 40%
params(4,1) = max(10, params(4,1)); % drop time at least 5%
params(4,1) = min(40, params(4,1)); % drop time at most 20%

%% max peak desired torque
params(1,1) = min(maxTorque, params(1,1));
peakTorque = params(1,1);

if peakTorque < 0.75*maxTorque
    maxParam2 = 110;
else
    maxParam2 = 110-4*(peakTorque-0.75*maxTorque)/(0.25*maxTorque);
end;

params(2,1) = min(maxParam2, max(20,params(2,1)));


if params(4,1)/2+params(2,1)/2>65
    params(4,1) = max(2*(65-params(2,1)/2),10);
end;
% if torque starts too early, reduce rise time
if params(2,1)/2-params(3,1)<5
    params(3,1) = max(params(2,1)/2-5, 10);
end;


return;