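For a long time I have wondered: since the Adam optimizer works so well for backpropagation in neural networks, could it also be used as an optimization algorithm in the intelligent-optimization field, doing the job that genetic algorithms (GA), particle swarm optimization (PSO), and similar methods do? Today I had some free time, so I ran a simple comparison of Adam and GA on 23 benchmark functions.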
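1. Introduction to the Adam Optimizer
Source: Adam: A Method for Stochastic Optimization
First, a quick introduction to Adam. It is arguably the most widely used neural-network optimizer: the paper has been cited roughly 180,000 times, which as far as I know puts it among the ten most-cited papers of all, and practically everyone in deep learning knows it. Adam was proposed by Diederik P. Kingma and Jimmy Ba in 2014 and combines the strengths of AdaGrad and RMSProp. It maintains exponentially weighted moving averages of the gradient's first moment (momentum) and second moment (uncentered variance) and uses them to adapt the learning rate of each parameter. This works very well for training deep learning models, especially with large datasets and highly complex models. Because it is simple to implement and usually converges quickly, Adam has become one of the most popular optimization algorithms in deep learning.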
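2. How the Adam Optimizer Works
Adam (Adaptive Moment Estimation) is an optimization algorithm for training machine-learning models, and it does particularly well in deep learning. Its key property is that it adapts the learning rate of each parameter individually, which makes it especially effective on non-stationary objectives and sparse gradients.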
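2.1 Basic concepts
- First-moment estimate of the gradient: $m_t$, i.e. momentum, which estimates the mean of the gradient.
- Second-moment estimate of the gradient: $v_t$, which estimates the uncentered variance of the gradient and is used to scale the learning rate.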
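To make "estimate of the mean" concrete, here is a short derivation sketch. It uses the recursion and the decay rate $\beta_1$ from section 2.2, and the shorthand $g_t = \nabla_\theta L(\theta_{t-1})$, which is my own notation and not used elsewhere in this post.

```latex
\begin{align*}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad m_0 = 0 \\
    &= (1-\beta_1) \sum_{s=1}^{t} \beta_1^{\,t-s}\, g_s, \\
(1-\beta_1) \sum_{s=1}^{t} \beta_1^{\,t-s} &= 1 - \beta_1^{\,t}.
\end{align*}
```

So $m_t$ is a weighted sum of past gradients in which recent gradients count most, but the weights add up to $1 - \beta_1^t$ rather than 1. Early in training this shrinks $m_t$ toward zero, which is why the update rule divides by $1 - \beta_1^t$ (the bias correction). The same argument applies to $v_t$ with $\beta_2$ and $g_s^2$.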
2.2 Update rule
Suppose we have a differentiable loss function $L(\theta)$, where $\theta$ is the vector of model parameters. At each time step $t$, Adam updates the parameters $\theta_t$ as follows:
- Initialization:
  - Set the initial parameters $\theta_0$
  - Initialize the first-moment vector $m_0 = 0$ (momentum)
  - Initialize the second-moment vector $v_0 = 0$ (uncentered variance)
  - Set the hyperparameters $\alpha$ (learning rate), $\beta_1$, $\beta_2$, and $\epsilon$ (a small constant that prevents division by zero)
- Loop until convergence. For each time step $t$:
  - Compute the gradient $\nabla_\theta L(\theta_{t-1})$
  - Update the first-moment estimate:
    $$m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla_\theta L(\theta_{t-1})$$
  - Update the second-moment estimate (the square is element-wise):
    $$v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla_\theta L(\theta_{t-1}))^2$$
  - Compute the bias-corrected first and second moments:
    $$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$
    $$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
  - Update the parameters:
    $$\theta_t = \theta_{t-1} - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$
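With these steps, Adam adapts the learning rate of each parameter individually, which often speeds up convergence. It is still a local, gradient-based method, though, so it comes with no guarantee of escaping local optima.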
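The update rule can be written out in a few lines of plain Python. This is only a minimal sketch for illustration, not the PyTorch implementation: the names `adam_minimize` and `sphere_grad`, the starting point, and the learning rate of 0.1 are all my own assumptions.

```python
import math


def adam_minimize(grad, theta0, alpha=0.001, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=1000):
    """Run Adam on a list of floats; `grad` maps a point to its gradient list."""
    theta = list(theta0)
    m = [0.0] * len(theta)  # first-moment estimate, m_0 = 0
    v = [0.0] * len(theta)  # second-moment estimate, v_0 = 0
    for t in range(1, steps + 1):
        g = grad(theta)
        for i, gi in enumerate(g):
            m[i] = beta1 * m[i] + (1 - beta1) * gi
            v[i] = beta2 * v[i] + (1 - beta2) * gi * gi
            m_hat = m[i] / (1 - beta1 ** t)  # bias-corrected first moment
            v_hat = v[i] / (1 - beta2 ** t)  # bias-corrected second moment
            theta[i] -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta


def sphere_grad(x):
    """Gradient of the Sphere function f(x) = sum(x_i^2)."""
    return [2.0 * xi for xi in x]


# Hypothetical starting point and learning rate, chosen only for the demo.
one_step = adam_minimize(sphere_grad, [3.0, -4.0], alpha=0.1, steps=1)
result = adam_minimize(sphere_grad, [3.0, -4.0], alpha=0.1, steps=500)
print(one_step, result)
```

One thing the sketch makes visible: at $t = 1$ the bias correction gives $\hat{m}_1 = g$ and $\hat{v}_1 = g^2$, so the first step moves every coordinate by almost exactly $\alpha$, regardless of how large the gradient is.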
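3. Adam vs. the GA
There are countless introductions to the GA online, mainly because it is such a classic; it is probably the best-known evolutionary algorithm in intelligent optimization. I won't repeat them here and will just leave a couple of links for interested readers:
Genetic algorithm principles and a MATLAB implementation
Genetic algorithms explained, with a Python implementation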
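For readers who would rather not follow the links, here is a bare-bones GA in plain Python. It is a rough sketch of the general idea (tournament selection, blend crossover, Gaussian mutation), not DEAP's implementation and not the code used for the experiments in this post. The function names, the seed, and the defaults are my own assumptions, and the returned curve is the best-so-far fitness rather than the per-generation minimum.

```python
import random


def sphere(x):
    """Sphere benchmark on a plain list of floats."""
    return sum(xi * xi for xi in x)


def ga_minimize(func, dim, bounds, ngen=50, npop=50, cxpb=0.7, mutpb=0.2,
                alpha=0.5, sigma=1.0, indpb=0.1, seed=0):
    """Return the best fitness found so far after each generation."""
    rng = random.Random(seed)
    low, high = bounds
    pop = [[rng.uniform(low, high) for _ in range(dim)] for _ in range(npop)]
    best, curve = float("inf"), []
    for _ in range(ngen + 1):
        fitness = [func(ind) for ind in pop]
        best = min(best, min(fitness))
        curve.append(best)
        # Selection: each child copies the fittest of 3 randomly drawn individuals.
        children = []
        for _ in range(npop):
            picks = [rng.randrange(npop) for _ in range(3)]
            winner = min(picks, key=lambda i: fitness[i])
            children.append(list(pop[winner]))
        # Blend crossover on consecutive pairs: mix the two parents gene by gene.
        for a, b in zip(children[::2], children[1::2]):
            if rng.random() < cxpb:
                for i in range(dim):
                    gamma = (1 + 2 * alpha) * rng.random() - alpha
                    a[i], b[i] = ((1 - gamma) * a[i] + gamma * b[i],
                                  gamma * a[i] + (1 - gamma) * b[i])
        # Gaussian mutation: perturb each gene of a chosen child with probability indpb.
        for child in children:
            if rng.random() < mutpb:
                for i in range(dim):
                    if rng.random() < indpb:
                        child[i] += rng.gauss(0.0, sigma)
        pop = children
    return curve


curve = ga_minimize(sphere, dim=5, bounds=(-100, 100))
print(curve[0], curve[-1])
```

Note that nothing here uses a gradient: the GA only ever calls `func`, which is the main practical difference from Adam.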
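Next come the tests on standard benchmark functions commonly used in intelligent optimization. The functions are defined as follows (the list gives 22 definitions, and Cigar and Bent Cigar share the same formula):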
- Sphere function:
$$f(x) = \sum_{i=1}^n x_i^2$$
- Rosenbrock function:
$$f(x) = \sum_{i=1}^{n-1} \left[100(x_{i+1} - x_i^2)^2 + (1 - x_i)^2\right]$$
- Ackley function:
$$f(x) = -20 \exp\left(-0.2 \sqrt{\frac{1}{n}\sum_{i=1}^n x_i^2}\right) - \exp\left(\frac{1}{n}\sum_{i=1}^n \cos(2\pi x_i)\right) + 20 + e$$
- Griewank function:
$$f(x) = \sum_{i=1}^n \frac{x_i^2}{4000} - \prod_{i=1}^n \cos\left(\frac{x_i}{\sqrt{i}}\right) + 1$$
- Rastrigin function:
$$f(x) = 10n + \sum_{i=1}^n \left[x_i^2 - 10\cos(2\pi x_i)\right]$$
- Schwefel function:
$$f(x) = 418.9829n - \sum_{i=1}^n x_i \sin\left(\sqrt{|x_i|}\right)$$
- Levy function:
$$f(x) = \sin^2(\pi w_1) + \sum_{i=1}^{n-1} (w_i - 1)^2 \left[1 + 10\sin^2(\pi w_i + 1)\right] + (w_n - 1)^2 \left[1 + \sin^2(2\pi w_n)\right]$$
where $w_i = 1 + \frac{x_i - 1}{4}$ for all $i = 1, \dots, n$
- Michalewicz function:
$$f(x) = -\sum_{i=1}^n \sin(x_i) \left[\sin\left(\frac{i x_i^2}{\pi}\right)\right]^{2m}$$
where $m = 10$
- Zakharov function:
$$f(x) = \sum_{i=1}^n x_i^2 + \left(\sum_{i=1}^n 0.5ix_i\right)^2 + \left(\sum_{i=1}^n 0.5ix_i\right)^4$$
- Dixon-Price function:
$$f(x) = (x_1 - 1)^2 + \sum_{i=2}^n i(2x_i^2 - x_{i-1})^2$$
- Styblinski-Tang function:
$$f(x) = \frac{1}{2}\sum_{i=1}^n (x_i^4 - 16x_i^2 + 5x_i)$$
- Powell function:
$$f(x) = \sum_{i=1}^{n/4} \left[(x_{4i-3} + 10x_{4i-2})^2 + 5(x_{4i-1} - x_{4i})^2 + (x_{4i-2} - 2x_{4i-1})^4 + 10(x_{4i-3} - x_{4i})^4\right]$$
- Alpine function:
$$f(x) = \sum_{i=1}^n |x_i \sin(x_i) + 0.1x_i|$$
- Cigar function:
$$f(x) = x_1^2 + 10^6 \sum_{i=2}^n x_i^2$$
- Schaffer F6 function:
$$f(x) = 0.5 + \frac{\sin^2\left(\sqrt{x_1^2 + x_2^2}\right) - 0.5}{\left[1 + 0.001(x_1^2 + x_2^2)\right]^2}$$
- Happy Cat function:
$$f(x) = \left[(x^T x - n)^2\right]^\alpha + \frac{1}{n}\left(\frac{1}{2}x^T x + \sum_{i=1}^n x_i\right) + \frac{1}{2}$$
where $\alpha = \frac{1}{8}$
- Bent Cigar function:
$$f(x) = x_1^2 + 10^6 \sum_{i=2}^n x_i^2$$
- Bohachevsky function:
$$f(x) = x_1^2 + 2x_2^2 - 0.3\cos(3\pi x_1) - 0.4\cos(4\pi x_2) + 0.7$$
- Drop-Wave function:
$$f(x) = -\frac{1 + \cos\left(12\sqrt{x_1^2 + x_2^2}\right)}{0.5(x_1^2 + x_2^2) + 2}$$
- Cross-in-Tray function:
$$f(x) = -0.0001 \left(\left|\sin(x_1)\sin(x_2)\exp\left(\left|100 - \frac{\sqrt{x_1^2 + x_2^2}}{\pi}\right|\right)\right| + 1\right)^{0.1}$$
- Holder Table function:
$$f(x) = -\left|\sin(x_1)\cos(x_2)\exp\left(\left|1 - \frac{\sqrt{x_1^2 + x_2^2}}{\pi}\right|\right)\right|$$
- Langermann function:
$$f(x) = -\sum_{i=1}^m c_i \exp\left(-\frac{1}{\pi} \sum_{j=1}^n (x_j - A_{ij})^2\right) \cos\left(\pi \sum_{j=1}^n (x_j - A_{ij})^2\right)$$
where $m = 5$, and $c$ and $A$ are a predefined constant vector and matrix.
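A quick way to catch transcription mistakes in formulas like these is to evaluate them at their well-known global minimizers. The snippet below is a small sanity check in plain Python for six of the functions; it is independent of the torch code in section 4, and the dimension `n = 5` is an arbitrary choice of mine.

```python
import math


def sphere(x):
    return sum(v * v for v in x)


def rosenbrock(x):
    return sum(100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (1 - x[i]) ** 2
               for i in range(len(x) - 1))


def rastrigin(x):
    return 10 * len(x) + sum(v * v - 10 * math.cos(2 * math.pi * v) for v in x)


def ackley(x):
    n = len(x)
    return (-20 * math.exp(-0.2 * math.sqrt(sum(v * v for v in x) / n))
            - math.exp(sum(math.cos(2 * math.pi * v) for v in x) / n)
            + 20 + math.e)


def griewank(x):
    prod = 1.0
    for i, v in enumerate(x, start=1):
        prod *= math.cos(v / math.sqrt(i))
    return sum(v * v for v in x) / 4000 - prod + 1


def styblinski_tang(x):
    return 0.5 * sum(v ** 4 - 16 * v * v + 5 * v for v in x)


n = 5  # arbitrary dimension for the check
checks = {
    "sphere": sphere([0.0] * n),          # minimum 0 at the origin
    "rosenbrock": rosenbrock([1.0] * n),  # minimum 0 at (1, ..., 1)
    "rastrigin": rastrigin([0.0] * n),    # minimum 0 at the origin
    "ackley": ackley([0.0] * n),          # minimum 0 at the origin (up to rounding)
    "griewank": griewank([0.0] * n),      # minimum 0 at the origin
    # minimum of about -39.166 per dimension near x_i = -2.903534
    "styblinski_tang_per_dim": styblinski_tang([-2.903534] * n) / n,
}
for name, value in checks.items():
    print(f"{name}: {value:.6g}")
```

Several of these functions have minima at or below zero, which matters later when the fitness curves are drawn on a logarithmic axis.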
4. Code
Test functions:
# benchmark_functions.py
import numpy as np
import torch


class BenchmarkFunctions:
    """Benchmark functions. Each takes a 1-D torch tensor and returns a scalar tensor."""

    @staticmethod
    def sphere(x):
        return torch.sum(x**2)

    @staticmethod
    def rosenbrock(x):
        return torch.sum(100.0 * (x[1:] - x[:-1]**2)**2 + (1 - x[:-1])**2)

    @staticmethod
    def ackley(x):
        return -20.0 * torch.exp(-0.2 * torch.sqrt(torch.mean(x**2))) - \
            torch.exp(torch.mean(torch.cos(2.0 * np.pi * x))) + 20.0 + np.e

    @staticmethod
    def griewank(x):
        idx = torch.arange(1, len(x) + 1).float()
        return torch.sum(x**2) / 4000.0 - torch.prod(torch.cos(x / torch.sqrt(idx))) + 1

    @staticmethod
    def rastrigin(x):
        return 10 * len(x) + torch.sum(x**2 - 10 * torch.cos(2 * np.pi * x))

    @staticmethod
    def schwefel(x):
        return 418.9829 * len(x) - torch.sum(x * torch.sin(torch.sqrt(torch.abs(x))))

    @staticmethod
    def levy(x):
        w = 1 + (x - 1) / 4
        return torch.sin(np.pi * w[0])**2 + \
            torch.sum((w[:-1] - 1)**2 * (1 + 10 * torch.sin(np.pi * w[:-1] + 1)**2)) + \
            (w[-1] - 1)**2 * (1 + torch.sin(2 * np.pi * w[-1])**2)

    @staticmethod
    def michalewicz(x):
        m = 10
        return -torch.sum(torch.sin(x) * torch.sin(torch.arange(1, len(x) + 1) * x**2 / np.pi)**(2 * m))

    @staticmethod
    def zakharov(x):
        weighted = 0.5 * torch.sum(torch.arange(1, len(x) + 1) * x)
        return torch.sum(x**2) + weighted**2 + weighted**4

    @staticmethod
    def dixon_price(x):
        return (x[0] - 1)**2 + torch.sum(torch.arange(2, len(x) + 1) * (2 * x[1:]**2 - x[:-1])**2)

    @staticmethod
    def styblinski_tang(x):
        return 0.5 * torch.sum(x**4 - 16 * x**2 + 5 * x)

    @staticmethod
    def powell(x):
        # Powell is defined on blocks of four variables. Drop any trailing remainder so that
        # dimensions that are not a multiple of 4 (such as 30) do not raise a shape error.
        x = x[:len(x) // 4 * 4]
        return torch.sum((x[::4] + 10 * x[1::4])**2 + 5 * (x[2::4] - x[3::4])**2 +
                         (x[1::4] - 2 * x[2::4])**4 + 10 * (x[::4] - x[3::4])**4)

    @staticmethod
    def alpine(x):
        return torch.sum(torch.abs(x * torch.sin(x) + 0.1 * x))

    @staticmethod
    def cigar(x):
        return x[0]**2 + 1e6 * torch.sum(x[1:]**2)

    @staticmethod
    def schaffer_f6(x):
        num = torch.sin(torch.sqrt(torch.sum(x**2)))**2 - 0.5
        den = (1 + 0.001 * torch.sum(x**2))**2
        return 0.5 + num / den

    @staticmethod
    def happy_cat(x, alpha=1/8):
        return ((torch.sum(x**2) - len(x))**2)**alpha + \
            (0.5 * torch.sum(x**2) + torch.sum(x)) / len(x) + 0.5

    @staticmethod
    def bent_cigar(x):
        return x[0]**2 + 1e6 * torch.sum(x[1:]**2)

    @staticmethod
    def bohachevsky(x):
        # Uses only the first two coordinates
        return x[0]**2 + 2 * x[1]**2 - 0.3 * torch.cos(3 * np.pi * x[0]) - \
            0.4 * torch.cos(4 * np.pi * x[1]) + 0.7

    @staticmethod
    def drop_wave(x):
        return -(1 + torch.cos(12 * torch.sqrt(torch.sum(x**2)))) / (0.5 * torch.sum(x**2) + 2)

    @staticmethod
    def cross_in_tray(x):
        return -0.0001 * (torch.abs(torch.sin(x[0]) * torch.sin(x[1]) *
                                    torch.exp(torch.abs(100 - torch.sqrt(torch.sum(x**2)) / np.pi))) + 1)**0.1

    @staticmethod
    def holder_table(x):
        return -torch.abs(torch.sin(x[0]) * torch.cos(x[1]) *
                          torch.exp(torch.abs(1 - torch.sqrt(torch.sum(x**2)) / np.pi)))

    @staticmethod
    def langermann(x):
        dim = x.shape[0]
        # Five 2-D centres, zero-padded to the problem dimension
        a = torch.tensor([[3, 5, *[0] * (dim - 2)],
                          [5, 2, *[0] * (dim - 2)],
                          [2, 1, *[0] * (dim - 2)],
                          [1, 4, *[0] * (dim - 2)],
                          [7, 9, *[0] * (dim - 2)]], dtype=x.dtype)
        c = torch.tensor([1, 2, 5, 2, 3], dtype=x.dtype)
        a = a[:, :dim]  # keep only the first dim columns
        dist2 = torch.sum((x.unsqueeze(0) - a)**2, dim=1)
        return -torch.sum(c * torch.exp(-(1 / np.pi) * dist2) * torch.cos(np.pi * dist2))

    @classmethod
    def get_function(cls, name):
        return getattr(cls, name.lower())

    @classmethod
    def get_all_functions(cls):
        # Names of all benchmark functions, excluding dunder attributes and the two helpers
        helpers = ("get_function", "get_all_functions")
        return [func for func in dir(cls)
                if callable(getattr(cls, func)) and not func.startswith("__") and func not in helpers]
Optimization code:
# main.py
import os

import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.optim as optim
from deap import algorithms, base, creator, tools

from benchmark_functions import BenchmarkFunctions


# GA optimization (DEAP)
def ga_optimize(func, dim, bounds, ngen=1000, npop=50):
    # Create the DEAP classes only once; re-creating them on every call triggers warnings
    if not hasattr(creator, "FitnessMin"):
        creator.create("FitnessMin", base.Fitness, weights=(-1.0,))
    if not hasattr(creator, "Individual"):
        creator.create("Individual", list, fitness=creator.FitnessMin)
    toolbox = base.Toolbox()
    toolbox.register("attr_float", np.random.uniform, bounds[0], bounds[1])
    toolbox.register("individual", tools.initRepeat, creator.Individual, toolbox.attr_float, n=dim)
    toolbox.register("population", tools.initRepeat, list, toolbox.individual)
    toolbox.register("evaluate", lambda x: (func(torch.tensor(x).float()).item(),))
    toolbox.register("mate", tools.cxBlend, alpha=0.5)
    toolbox.register("mutate", tools.mutGaussian, mu=0, sigma=1, indpb=0.1)
    toolbox.register("select", tools.selTournament, tournsize=3)
    pop = toolbox.population(n=npop)
    stats = tools.Statistics(lambda ind: ind.fitness.values)
    stats.register("min", np.min)
    _, log = algorithms.eaSimple(pop, toolbox, cxpb=0.7, mutpb=0.2, ngen=ngen, stats=stats, verbose=False)
    return log.select("min")  # minimum fitness of each generation


# Adam optimization
def adam_optimize(func, dim, bounds, niter=1000):
    x = torch.tensor(np.random.uniform(bounds[0], bounds[1], dim), requires_grad=True)
    optimizer = optim.Adam([x])  # default hyperparameters, i.e. lr=1e-3
    losses = []
    for _ in range(niter):
        optimizer.zero_grad()
        loss = func(x)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses


def main():
    os.makedirs('./img1', exist_ok=True)  # savefig fails if the folder does not exist
    function_names = BenchmarkFunctions.get_all_functions()
    for name in function_names:
        func = BenchmarkFunctions.get_function(name)
        dim = 30  # assume every function is 30-dimensional
        bounds = (-100, 100)  # assume every function is searched over [-100, 100]
        print(f"Optimizing {name} function...")
        # GA optimization
        ga_results = ga_optimize(func, dim, bounds)
        # Adam optimization
        adam_results = adam_optimize(func, dim, bounds)
        # Figure with two subplots
        fig = plt.figure(figsize=(20, 10))
        # Left subplot: 3D surface of the 2-D version of the function
        ax1 = fig.add_subplot(121, projection='3d')
        x = np.linspace(bounds[0], bounds[1], 100)
        y = np.linspace(bounds[0], bounds[1], 100)
        X, Y = np.meshgrid(x, y)
        Z = np.array([func(torch.tensor([xi, yi])).item()
                      for xi, yi in zip(np.ravel(X), np.ravel(Y))]).reshape(X.shape)
        surf = ax1.plot_surface(X, Y, Z, cmap='viridis')
        ax1.set_title(f'{name.capitalize()} Function')
        ax1.set_xlabel('X')
        ax1.set_ylabel('Y')
        ax1.set_zlabel('Z')
        fig.colorbar(surf, ax=ax1, shrink=0.5, aspect=5)
        # Right subplot: fitness over iterations
        ax2 = fig.add_subplot(122)
        ax2.plot(ga_results, label='GA')
        ax2.plot(adam_results, label='Adam')
        ax2.set_title(f'{name.capitalize()} Function Optimization')
        ax2.set_xlabel('Iterations')
        ax2.set_ylabel('Fitness')
        ax2.legend()
        # A log axis cannot show zero or negative fitness values, so fall back to symlog
        if min(min(ga_results), min(adam_results)) > 0:
            ax2.set_yscale('log')
        else:
            ax2.set_yscale('symlog')
        plt.tight_layout()
        plt.savefig(f'./img1/{name}_optimization.png')
        plt.close()
        print(f"Optimization of {name} function completed.")


if __name__ == "__main__":
    main()
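One detail of the setup above deserves a closer look when reading the curves: `optim.Adam([x])` runs with PyTorch's default learning rate of 1e-3, while the search range is [-100, 100]. The sketch below reruns the Adam update in plain Python on the Sphere function to show how far a point can travel in 1000 steps at that rate. It is an illustration under my own assumptions (the starting point of 50 in every coordinate and the contrasting rate of 1.0 are hypothetical), not a rerun of the experiment, and it says nothing about how a larger rate would fare against the GA on the multimodal functions.

```python
import math


def adam_sphere(x0, lr, steps=1000, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize f(x) = sum(x_i^2) with Adam; return (final point, final loss)."""
    x = list(x0)
    m = [0.0] * len(x)
    v = [0.0] * len(x)
    for t in range(1, steps + 1):
        for i in range(len(x)):
            g = 2.0 * x[i]  # gradient of the Sphere function
            m[i] = beta1 * m[i] + (1 - beta1) * g
            v[i] = beta2 * v[i] + (1 - beta2) * g * g
            m_hat = m[i] / (1 - beta1 ** t)
            v_hat = v[i] / (1 - beta2 ** t)
            x[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x, sum(xi * xi for xi in x)


x0 = [50.0] * 30  # hypothetical start, well inside [-100, 100]
start_loss = sum(xi * xi for xi in x0)
x_small, loss_small = adam_sphere(x0, lr=1e-3)  # same rate as optim.Adam's default
x_large, loss_large = adam_sphere(x0, lr=1.0)   # larger rate, for contrast only
max_move_small = max(abs(a - b) for a, b in zip(x_small, x0))

print(f"start loss:         {start_loss:.1f}")
print(f"lr=1e-3 final loss: {loss_small:.1f} (max move {max_move_small:.3f})")
print(f"lr=1.0  final loss: {loss_large:.3g}")
```

With the default rate each coordinate moves by about one unit in total, so the loss barely changes even on a convex function. Any comparison against the GA is therefore also a comparison of this particular learning rate and iteration budget.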
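Conclusion
The results above make it plain that, in this intelligent-optimization setting, the Adam optimizer falls far short of the GA, and on some functions it simply gives up (see the per-function plots that main.py saves under ./img1/).
As for the deeper reasons, my current knowledge is limited and I cannot yet give a rigorous explanation. If I come up with one later, I will update this conclusion.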
And finally, if you found this useful, a like, a favorite, and a share would be much appreciated. Thank you!