How the Adam Optimizer Performs in Intelligent Optimization

Machine Learning and Optimization Algorithms

Published 2024-07-16 15:00:29

Tags: python, algorithms

Article link: https://blog.csdn.net/m0_59257547/article/details/140463979

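A long time ago I started wondering: since the Adam optimizer works so well for backpropagation in neural networks, could it also serve as an optimization algorithm in the intelligent optimization field, doing the job usually given to algorithms such as the genetic algorithm (GA) or particle swarm optimization (PSO)? Today I happened to have some free time, so I ran a simple comparison of Adam and GA on 23 benchmark functions.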

1. Introduction to the Adam Optimizer

Source: Adam: A Method for Stochastic Optimization
[Figure from the paper; image not included]
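First, a quick introduction to Adam. It is arguably the dominant neural network optimizer: the paper has reportedly been cited around 180,000 times, which would put it among the most-cited papers in any field, and practically everyone in deep learning knows it. Adam was proposed by Diederik P. Kingma and Jimmy Ba in 2014 and combines the strengths of AdaGrad and RMSProp. It maintains exponentially weighted moving averages of the gradient's first moment (momentum) and second moment (uncentered variance), and uses them to adapt the learning rate of each parameter. The method performs very well when training deep learning models, especially with large datasets and highly complex models. Because it is simple to implement and usually converges quickly, Adam has become one of the most popular optimization algorithms in deep learning.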

2. How the Adam Optimizer Works

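Adam (Adaptive Moment Estimation) is an optimization algorithm for training machine learning models, and it is especially prominent in deep learning. Its main feature is that it adapts the learning rate of each parameter individually, which makes it particularly effective for non-stationary objectives and sparse gradients.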

2.1 Basic Concepts

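First-moment estimate of the gradient: $m_t$ , i.e. the momentum, which estimates the mean of the gradient.
Second-moment estimate of the gradient: $v_t$ , which estimates the uncentered variance of the gradient and is used to scale the learning rate.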
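
To make "uncentered variance" concrete, here is a small NumPy sketch. The gradient samples are made up purely for illustration. It shows that the second raw moment $E[g^2]$ equals the ordinary variance plus the squared mean, so it stays large whenever the gradients are consistently nonzero.

```python
import numpy as np

# Hypothetical gradient samples for a single parameter (illustrative values only).
g = np.array([0.9, 1.1, 1.0, 0.8, 1.2])

first_moment = g.mean()               # the quantity m_t estimates: E[g]
second_raw_moment = (g ** 2).mean()   # the quantity v_t estimates: E[g^2]
centered_variance = g.var()           # E[(g - E[g])^2], the usual variance

# Identity: E[g^2] = Var[g] + (E[g])^2
print(first_moment, second_raw_moment, centered_variance + first_moment ** 2)
```

Adam divides by the square root of the second raw moment, not by the standard deviation, which is why the step size is normalized even when every gradient sample points the same way.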

2.2 Update Rule

Suppose we have a differentiable loss function $L(\theta)$ , where $\theta$ is the vector of model parameters. At each time step $t$ , Adam updates the parameters $\theta_t$ as follows:

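Initialization:
- Set the initial parameters $\theta_0$
- Initialize the first-moment vector $m_0 = 0$ (momentum)
- Initialize the second-moment vector $v_0 = 0$ (uncentered variance)
- Set the hyperparameters $\alpha$ (learning rate), $\beta_1$ , $\beta_2$ , and $\epsilon$ (a small constant that prevents division by zero)
Loop until convergence:
- For each time step $t$ :
  - Compute the gradient $\nabla_\theta L(\theta_{t-1})$
  - Update the first-moment estimate:
    $m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla_\theta L(\theta_{t-1})$
  - Update the second-moment estimate (the square is element-wise):
    $v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla_\theta L(\theta_{t-1}))^2$
  - Compute the bias-corrected first and second moments:
    $\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$
    $\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$
  - Update the parameters:
    $\theta_t = \theta_{t-1} - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$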

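With these steps, Adam adapts the learning rate of each parameter individually, which often speeds up convergence. It remains a gradient-based local method, though, so it offers no guarantee of escaping local optima.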
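
The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the PyTorch implementation used later in this post: the function name `adam_minimize` and the argument `grad_fn` are made up for the example, and the default hyperparameters are the values suggested in the Adam paper.

```python
import numpy as np

def adam_minimize(grad_fn, theta0, alpha=0.001, beta1=0.9, beta2=0.999,
                  eps=1e-8, n_steps=1000):
    """Plain-NumPy Adam loop following the update rule in section 2.2."""
    theta = np.asarray(theta0, dtype=float).copy()
    m = np.zeros_like(theta)  # first-moment estimate, m_0 = 0
    v = np.zeros_like(theta)  # second-moment estimate, v_0 = 0
    for t in range(1, n_steps + 1):
        g = grad_fn(theta)                     # gradient at theta_{t-1}
        m = beta1 * m + (1 - beta1) * g        # update first moment
        v = beta2 * v + (1 - beta2) * g ** 2   # update second moment (element-wise square)
        m_hat = m / (1 - beta1 ** t)           # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)           # bias-corrected second moment
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Sphere function f(x) = sum(x_i^2): its gradient is 2x and its minimum is at the origin.
theta = adam_minimize(lambda x: 2 * x, theta0=[3.0, -2.0], alpha=0.1, n_steps=500)
print(theta)
```

The learning rate of 0.1 is chosen here only so that the toy problem converges within 500 steps; it is not a recommendation for the experiments below.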

3. Adam vs. GA

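There are countless introductions to GA online, mostly because it is such a classic: it is arguably the best-known evolutionary algorithm in intelligent optimization. I won't repeat them here; for interested readers, here are two write-ups:
Principles of the genetic algorithm and its MATLAB implementation
The genetic algorithm explained in detail, with Python code
Next come the test results on the 23 standard benchmark functions commonly used in intelligent optimization: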

Sphere function:

$\sum_{i=1}^n x_i^2$

Rosenbrock function:

$\sum_{i=1}^{n-1} [100(x_{i+1} - x_i^2)^2 + (1 - x_i)^2]$

Ackley function:

$-20\exp\left(-0.2 \sqrt{\frac{1}{n}\sum_{i=1}^n x_i^2}\right) - \exp\left(\frac{1}{n}\sum_{i=1}^n \cos(2\pi x_i)\right) + 20 + e$

Griewank function:

$\sum_{i=1}^n \frac{x_i^2}{4000} - \prod_{i=1}^n \cos\left(\frac{x_i}{\sqrt{i}}\right) + 1$

Rastrigin function:

$10n + \sum_{i=1}^n [x_i^2 - 10\cos(2\pi x_i)]$

Schwefel function:

$418.9829n - \sum_{i=1}^n x_i \sin(\sqrt{|x_i|})$

Levy function:

$\sin^2(\pi w_1) + \sum_{i=1}^{n-1} (w_i - 1)^2 [1 + 10\sin^2(\pi w_i + 1)] + (w_n - 1)^2 [1 + \sin^2(2\pi w_n)]$

where $w_i = 1 + \frac{x_i - 1}{4}$ for all $i = 1, \dots, n$

Michalewicz function:

$-\sum_{i=1}^n \sin(x_i) \left[\sin\left(\frac{i x_i^2}{\pi}\right)\right]^{2m}$

where $m = 10$

Zakharov function:

$\sum_{i=1}^n x_i^2 + \left(\sum_{i=1}^n 0.5ix_i\right)^2 + \left(\sum_{i=1}^n 0.5ix_i\right)^4$

Dixon-Price function:

$(x_1 - 1)^2 + \sum_{i=2}^n i(2x_i^2 - x_{i-1})^2$

Styblinski-Tang function:

$\frac{1}{2}\sum_{i=1}^n (x_i^4 - 16x_i^2 + 5x_i)$

Powell function:

$\sum_{i=1}^{n/4} [(x_{4i-3} + 10x_{4i-2})^2 + 5(x_{4i-1} - x_{4i})^2 + (x_{4i-2} - 2x_{4i-1})^4 + 10(x_{4i-3} - x_{4i})^4]$

Alpine function:

$\sum_{i=1}^n |x_i \sin(x_i) + 0.1x_i|$

Cigar function:

$x_1^2 + 10^6 \sum_{i=2}^n x_i^2$

Schaffer F6 function:

$0.5 + \frac{\sin^2(\sqrt{x_1^2 + x_2^2}) - 0.5}{[1 + 0.001(x_1^2 + x_2^2)]^2}$

Happy Cat function:

$\left[(x^T x - n)^2\right]^\alpha + \frac{1}{n}\left(\frac{1}{2} x^T x + \sum_{i=1}^n x_i\right) + \frac{1}{2}$

where $\alpha = \frac{1}{8}$

Bent Cigar function:

$x_1^2 + 10^6 \sum_{i=2}^n x_i^2$

Bohachevsky function:

$x_1^2 + 2x_2^2 - 0.3\cos(3\pi x_1) - 0.4\cos(4\pi x_2) + 0.7$

Drop-Wave function:

$-\frac{1 + \cos(12\sqrt{x_1^2 + x_2^2})}{0.5(x_1^2 + x_2^2) + 2}$

Cross-in-Tray function:

$-0.0001\left(|\sin(x_1)\sin(x_2)\exp(|100 - \frac{\sqrt{x_1^2 + x_2^2}}{\pi}|)| + 1\right)^{0.1}$

Holder Table function:

$-|\sin(x_1)\cos(x_2)\exp(|1 - \frac{\sqrt{x_1^2 + x_2^2}}{\pi}|)|$

Langermann function:

$-\sum_{i=1}^m c_i \exp(-\frac{1}{\pi} \sum_{j=1}^n (x_j - A_{ij})^2) \cos(\pi \sum_{j=1}^n (x_j - A_{ij})^2)$

where $m = 5$ , and $c$ and $A$ are a predefined constant vector and matrix.
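
A quick way to sanity-check a benchmark definition is to evaluate it at its known minimizer. The sketch below does this for four of the functions in their usual forms, where the global minimum value is 0. These are illustrative NumPy re-implementations written for this check, not the torch code from section 4.

```python
import numpy as np

def ackley(x):
    # -20 exp(-0.2 sqrt(mean(x^2))) - exp(mean(cos(2 pi x))) + 20 + e
    return (-20.0 * np.exp(-0.2 * np.sqrt(np.mean(x ** 2)))
            - np.exp(np.mean(np.cos(2.0 * np.pi * x))) + 20.0 + np.e)

def rastrigin(x):
    # 10 n + sum(x^2 - 10 cos(2 pi x))
    return 10 * x.size + np.sum(x ** 2 - 10 * np.cos(2 * np.pi * x))

def rosenbrock(x):
    return np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (1 - x[:-1]) ** 2)

def griewank(x):
    i = np.arange(1, x.size + 1)
    return np.sum(x ** 2) / 4000.0 - np.prod(np.cos(x / np.sqrt(i))) + 1

n = 30
checks = {
    "ackley at 0": ackley(np.zeros(n)),
    "rastrigin at 0": rastrigin(np.zeros(n)),
    "griewank at 0": griewank(np.zeros(n)),
    "rosenbrock at 1": rosenbrock(np.ones(n)),
}
for name, value in checks.items():
    print(f"{name}: {value:.3e}")
```

If a constant term such as the $-20$ factor in Ackley or the $10n$ offset in Rastrigin is dropped, the value at the minimizer is no longer 0, which makes this kind of check a cheap guard against transcription mistakes.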

4. Code

Test functions:

# benchmark_functions.py

import torch
import numpy as np

class BenchmarkFunctions:
    @staticmethod
    def sphere(x):
        return torch.sum(x**2)

    @staticmethod
    def rosenbrock(x):
        return torch.sum(100.0 * (x[1:] - x[:-1]**2)**2 + (1 - x[:-1])**2)

    @staticmethod
    def ackley(x):
        return -20.0 * torch.exp(-0.2 * torch.sqrt(torch.mean(x**2))) - \
               torch.exp(torch.mean(torch.cos(2.0 * np.pi * x))) + 20.0 + np.e

    @staticmethod
    def griewank(x):
        return torch.sum(x**2) / 4000.0 - torch.prod(torch.cos(x / torch.sqrt(torch.arange(1, len(x)+1).float()))) + 1

    @staticmethod
    def rastrigin(x):
        return 10 * len(x) + torch.sum(x**2 - 10 * torch.cos(2 * np.pi * x))

    @staticmethod
    def schwefel(x):
        return 418.9829 * len(x) - torch.sum(x * torch.sin(torch.sqrt(torch.abs(x))))

    @staticmethod
    def levy(x):
        w = 1 + (x - 1) / 4
        return torch.sin(np.pi * w[0])**2 + \
               torch.sum((w[:-1] - 1)**2 * (1 + 10 * torch.sin(np.pi * w[:-1] + 1)**2)) + \
               (w[-1] - 1)**2 * (1 + torch.sin(2 * np.pi * w[-1])**2)

    @staticmethod
    def michalewicz(x):
        m = 10
        return -torch.sum(torch.sin(x) * torch.sin(torch.arange(1, len(x)+1) * x**2 / np.pi)**(2*m))

    @staticmethod
    def zakharov(x):
        return torch.sum(x**2) + (0.5 * torch.sum(torch.arange(1, len(x)+1) * x))**2 + \
               (0.5 * torch.sum(torch.arange(1, len(x)+1) * x))**4

    @staticmethod
    def dixon_price(x):
        return (x[0] - 1)**2 + torch.sum((torch.arange(2, len(x)+1) * (2 * x[1:]**2 - x[:-1])**2))

    @staticmethod
    def styblinski_tang(x):
        return 0.5 * torch.sum(x**4 - 16 * x**2 + 5 * x)

    @staticmethod
    def powell(x):
        return torch.sum((x[::4] + 10*x[1::4])**2 + 5*(x[2::4] - x[3::4])**2 + \
                         (x[1::4] - 2*x[2::4])**4 + 10*(x[::4] - x[3::4])**4)

    @staticmethod
    def alpine(x):
        return torch.sum(torch.abs(x * torch.sin(x) + 0.1 * x))

    @staticmethod
    def cigar(x):
        return x[0]**2 + 1e6 * torch.sum(x[1:]**2)

    @staticmethod
    def schaffer_f6(x):
        num = torch.sin(torch.sqrt(torch.sum(x**2)))**2 - 0.5
        den = (1 + 0.001 * torch.sum(x**2))**2
        return 0.5 + num / den

    @staticmethod
    def happy_cat(x, alpha=1/8):
        return ((torch.sum(x**2) - len(x))**2)**alpha + (0.5 * torch.sum(x**2) + torch.sum(x))/len(x) + 0.5

    @staticmethod
    def bent_cigar(x):
        return x[0]**2 + 1e6 * torch.sum(x[1:]**2)

    @staticmethod
    def bohachevsky(x):
        return x[0]**2 + 2*x[1]**2 - 0.3*torch.cos(3*np.pi*x[0]) - 0.4*torch.cos(4*np.pi*x[1]) + 0.7

    @staticmethod
    def drop_wave(x):
        return -(1 + torch.cos(12 * torch.sqrt(torch.sum(x**2)))) / (0.5 * torch.sum(x**2) + 2)

    @staticmethod
    def cross_in_tray(x):
        return -0.0001 * (torch.abs(torch.sin(x[0]) * torch.sin(x[1]) * 
                          torch.exp(torch.abs(100 - torch.sqrt(torch.sum(x**2))/np.pi))) + 1)**0.1

    @staticmethod
    def holder_table(x):
        return -torch.abs(torch.sin(x[0]) * torch.cos(x[1]) * 
                          torch.exp(torch.abs(1 - torch.sqrt(torch.sum(x**2))/np.pi)))

    @staticmethod
    def langermann(x):
        dim = x.shape[0]
        a = torch.tensor([[3, 5, *[0] * (dim - 2)],
                          [5, 2, *[0] * (dim - 2)],
                          [2, 1, *[0] * (dim - 2)],
                          [1, 4, *[0] * (dim - 2)],
                          [7, 9, *[0] * (dim - 2)]])
        c = torch.tensor([1, 2, 5, 2, 3])
        m = 5
        a = a[:, :dim]  # keep only the first dim columns
        return -torch.sum(c * torch.exp(-(1 / np.pi) * torch.sum((x.unsqueeze(0) - a) ** 2, dim=1)) *
                          torch.cos(np.pi * torch.sum((x.unsqueeze(0) - a) ** 2, dim=1)))

    @classmethod
    def get_function(cls, name):
        return getattr(cls, name.lower())

    @classmethod
    def get_all_functions(cls):
        return [func for func in dir(cls) if callable(getattr(cls, func)) and not func.startswith("__") and func != "get_function" and func != "get_all_functions"]

Optimization code:

# main.py

import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.optim as optim
from deap import base, creator, tools, algorithms
from benchmark_functions import BenchmarkFunctions  # module name matches benchmark_functions.py above


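# Define the GA optimizer (DEAP)
def ga_optimize(func, dim, bounds, ngen=1000, npop=50):
    # Create the DEAP classes only once; re-creating them on every call triggers warnings
    if not hasattr(creator, "FitnessMin"):
        creator.create("FitnessMin", base.Fitness, weights=(-1.0,))
    if not hasattr(creator, "Individual"):
        creator.create("Individual", list, fitness=creator.FitnessMin)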

    toolbox = base.Toolbox()
    toolbox.register("attr_float", np.random.uniform, bounds[0], bounds[1])
    toolbox.register("individual", tools.initRepeat, creator.Individual, toolbox.attr_float, n=dim)
    toolbox.register("population", tools.initRepeat, list, toolbox.individual)
    toolbox.register("evaluate", lambda x: (func(torch.tensor(x).float()).item(),))
    toolbox.register("mate", tools.cxBlend, alpha=0.5)
    toolbox.register("mutate", tools.mutGaussian, mu=0, sigma=1, indpb=0.1)
    toolbox.register("select", tools.selTournament, tournsize=3)

    pop = toolbox.population(n=npop)
    stats = tools.Statistics(lambda ind: ind.fitness.values)
    stats.register("min", np.min)

    _, log = algorithms.eaSimple(pop, toolbox, cxpb=0.7, mutpb=0.2, ngen=ngen, stats=stats, verbose=False)
    return log.select("min")


# Define the Adam optimizer
def adam_optimize(func, dim, bounds, niter=1000):
    x = torch.tensor(np.random.uniform(bounds[0], bounds[1], dim), requires_grad=True)
    optimizer = optim.Adam([x])  # uses PyTorch's default learning rate, lr=1e-3
    losses = []

    for _ in range(niter):
        optimizer.zero_grad()
        loss = func(x)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())

    return losses


def main():
    function_names = BenchmarkFunctions.get_all_functions()

    for name in function_names:
        func = BenchmarkFunctions.get_function(name)
        dim = 30  # assume every function is 30-dimensional
        bounds = (-100, 100)  # assume every function is searched over [-100, 100]

        print(f"Optimizing {name} function...")

        # GA optimization
        ga_results = ga_optimize(func, dim, bounds)

        # Adam optimization
        adam_results = adam_optimize(func, dim, bounds)

        # Create a figure with two subplots
        fig = plt.figure(figsize=(20, 10))

        # Left subplot: 3D surface of the 2-D version of the function
        ax1 = fig.add_subplot(121, projection='3d')
        x = np.linspace(bounds[0], bounds[1], 100)
        y = np.linspace(bounds[0], bounds[1], 100)
        X, Y = np.meshgrid(x, y)
        Z = np.array([func(torch.tensor([xi, yi])).item() for xi, yi in zip(np.ravel(X), np.ravel(Y))]).reshape(X.shape)

        surf = ax1.plot_surface(X, Y, Z, cmap='viridis')
        ax1.set_title(f'{name.capitalize()} Function')
        ax1.set_xlabel('X')
        ax1.set_ylabel('Y')
        ax1.set_zlabel('Z')
        fig.colorbar(surf, ax=ax1, shrink=0.5, aspect=5)

        # Right subplot: best fitness per iteration
        ax2 = fig.add_subplot(122)
        ax2.plot(ga_results, label='GA')
        ax2.plot(adam_results, label='Adam')
        ax2.set_title(f'{name.capitalize()} Function Optimization')
        ax2.set_xlabel('Iterations')
        ax2.set_ylabel('Fitness')
        ax2.legend()
        ax2.set_yscale('symlog')  # symlog also handles functions whose values are zero or negative

        plt.tight_layout()
        # Note: the ./img1 directory must already exist
        plt.savefig(f'./img1/{name}_optimization.png')
        plt.close()

        print(f"Optimization of {name} function completed.")


if __name__ == "__main__":
    main()

Conclusion

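From the results above it is clear that, in this experiment, Adam performs far worse than GA as an optimizer for these benchmark functions, and on some functions it makes essentially no progress at all, for example:
[Figure: example result; image not available]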
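As for the deeper reasons, my current understanding is limited and I can't yet give a rigorous proof. If I come up with something later, I'll add it to this conclusion.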
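
One factor that may contribute, offered as an observation about the setup and not as a proof: `optim.Adam([x])` in the code above uses PyTorch's default learning rate of 1e-3, and the Adam paper notes that each per-coordinate step is approximately bounded by the learning rate. Over 1000 iterations a coordinate can therefore travel only about 1 unit, while the starting points are drawn from [-100, 100]. The NumPy sketch below reproduces this effect on the Sphere function; the helper name `adam_run`, the starting point, and the second learning rate of 1.0 are all illustrative choices.

```python
import numpy as np

def adam_run(grad_fn, theta0, alpha, n_steps, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run Adam and return (final theta, largest per-coordinate displacement)."""
    theta = np.asarray(theta0, dtype=float).copy()
    origin = theta.copy()
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, n_steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, float(np.max(np.abs(theta - origin)))

def sphere_grad(x):
    return 2 * x  # gradient of sum(x_i^2)

start = np.full(30, 50.0)  # illustrative starting point inside [-100, 100]

results = {}
for alpha in (1e-3, 1.0):
    theta, moved = adam_run(sphere_grad, start, alpha=alpha, n_steps=1000)
    results[alpha] = (moved, float(np.sum(theta ** 2)))
    print(f"alpha={alpha}: max displacement={moved:.3f}, f(theta)={results[alpha][1]:.3e}")
```

With the small learning rate the iterate barely leaves its starting point, so the loss curve looks flat regardless of the function. A learning rate scaled to the search range would likely make the comparison fairer, although on multimodal functions a purely gradient-based method still has no mechanism for leaving a local basin.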
Finally, if you found this useful, please like, bookmark, and follow. Many thanks!