Multivariable Calculus
10. Second Derivative Test
The global extrema of a function occur at critical points, on the boundary of the domain, or at infinity.
Types of critical points: local maximum, local minimum, or saddle point, determined by the second derivatives.
Suppose the second derivatives of f are:

$$A=f_{xx} \quad B=f_{xy} \quad C=f_{yy}$$

Then:
$$AC-B^2>0 \text{ and } A>0 \Rightarrow \text{local minimum} \\ AC-B^2>0 \text{ and } A<0 \Rightarrow \text{local maximum} \\ AC-B^2<0 \Rightarrow \text{saddle point} \\ AC-B^2=0 \Rightarrow \text{test is inconclusive}$$
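The four cases above can be wrapped in a small helper (a minimal Python sketch; `classify_critical_point` and the example $f(x,y)=x^2-y^2$ are my own, not from the notes):

```python
def classify_critical_point(A, B, C):
    """Classify a critical point from second derivatives A=f_xx, B=f_xy, C=f_yy."""
    det = A * C - B ** 2  # determinant of the Hessian [[A, B], [B, C]]
    if det > 0 and A > 0:
        return "local minimum"
    if det > 0 and A < 0:
        return "local maximum"
    if det < 0:
        return "saddle point"
    return "inconclusive"

# f(x, y) = x^2 - y^2 has f_xx = 2, f_xy = 0, f_yy = -2 at the origin
print(classify_critical_point(2, 0, -2))  # saddle point
```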
Rough derivation:
Suppose f is:

$$f=ax^2+bxy+cy^2$$

Then its first-order partial derivatives are:
$$\frac{\partial f}{\partial x}=2ax+by = f_x \\ \frac{\partial f}{\partial y}=bx+2cy = f_y$$

The second-order partial derivatives are:
$$\frac{\partial f_x}{\partial x}=2a=f_{xx}=A \\ \frac{\partial f_x}{\partial y}=b=f_{xy}=B \\ \frac{\partial f_y}{\partial x}=b=f_{yx}=B \\ \frac{\partial f_y}{\partial y}=2c=f_{yy}=C$$
Classification in terms of these second derivatives (since $A=2a$, $B=b$, $C=2c$, we have $AC-B^2=4ac-b^2$, matching the general test):

$$4ac-b^2>0 \text{ and } a > 0 \Rightarrow \text{local minimum} \\ 4ac-b^2>0 \text{ and } a < 0 \Rightarrow \text{local maximum} \\ 4ac-b^2<0 \Rightarrow \text{saddle point} \\ 4ac-b^2=0 \Rightarrow \text{test is inconclusive}$$
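One way to see where the $4ac-b^2$ criterion comes from (a sketch, assuming $a \neq 0$) is to complete the square:

$$f = ax^2+bxy+cy^2 = a\left(x+\frac{b}{2a}y\right)^2 + \frac{4ac-b^2}{4a}y^2$$

When $4ac-b^2>0$, both terms carry the sign of $a$, so $f$ has a minimum at the origin if $a>0$ and a maximum if $a<0$; when $4ac-b^2<0$, the two terms have opposite signs, giving a saddle point.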
11. Differentials and the Chain Rule
Total differential: the precise name for the differential of a multivariable function; it accounts for every factor that can change the function's value.
$$\begin{aligned} f&=f(x,y,z) \\ \mathrm{d} f &= f_x \mathrm{d}x + f_y \mathrm{d}y + f_z \mathrm{d}z \\ &= \frac{\partial f}{\partial x} \mathrm{d}x + \frac{\partial f}{\partial y} \mathrm{d}y + \frac{\partial f}{\partial z} \mathrm{d}z \end{aligned}$$

$$\Delta f \approx f_x \Delta x + f_y \Delta y + f_z \Delta z$$

Important:
$$\mathrm{d}f \neq \Delta f$$

$\mathrm{d}f$ is a limit, while $\Delta f$ is a finite quantity: when $x$, $y$, $z$ change, $\Delta f$ is the actual change in the function's value. As the changes tend to 0, the $\approx$ becomes $=$ and $\Delta f$ becomes $\mathrm{d}f$.
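A small numeric check of this (Python sketch; the function $f=x^2+3xy$ and the step sizes are my own choices): the linear approximation $\Delta f \approx f_x\Delta x + f_y\Delta y$ improves as the steps shrink.

```python
def f(x, y):
    return x ** 2 + 3 * x * y

# exact partials at (1, 2): f_x = 2x + 3y = 8, f_y = 3x = 3
x, y = 1.0, 2.0
fx, fy = 2 * x + 3 * y, 3 * x
for h in (0.1, 0.01, 0.001):
    dx, dy = h, 2 * h
    delta_f = f(x + dx, y + dy) - f(x, y)  # the actual change
    df = fx * dx + fy * dy                 # the differential estimate
    print(h, abs(delta_f - df))            # error shrinks quadratically in h
```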
Chain rule:
$$\begin{aligned} & f=f(x,y),\quad x=x(u,v),\quad y=y(u,v) \\ \mathrm{d}f &= f_x \mathrm{d}x+f_y \mathrm{d}y \\ &= f_x (x_u \mathrm{d}u + x_v \mathrm{d}v) + f_y(y_u \mathrm{d}u + y_v \mathrm{d}v) \\ &= \left(\frac{\partial f}{\partial x} \frac{\partial x}{\partial u} + \frac{\partial f}{\partial y} \frac{\partial y}{\partial u}\right) \mathrm{d}u + \left(\frac{\partial f}{\partial x} \frac{\partial x}{\partial v} + \frac{\partial f}{\partial y} \frac{\partial y}{\partial v}\right)\mathrm{d}v \\ \frac{\partial f}{\partial u} &= \frac{\partial f}{\partial x} \frac{\partial x}{\partial u} + \frac{\partial f}{\partial y} \frac{\partial y}{\partial u} \end{aligned}$$

Note: the factors cannot be canceled, because these are partial derivatives, not ordinary derivatives; ordinary derivatives can be canceled.
Partial differential: the differential taken with respect to a single variable.
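The chain rule can be sanity-checked numerically (Python sketch; the functions $f$, $x(u,v)$, $y(u,v)$ are invented for illustration):

```python
# f(x, y) = x*y with x = u + v, y = u - v, so f(u, v) = u^2 - v^2.
# Chain rule: df/du = f_x * x_u + f_y * y_u = y*1 + x*1 = 2u
u, v = 1.5, 0.5
x, y = u + v, u - v
chain = y * 1 + x * 1  # f_x = y, f_y = x; x_u = y_u = 1
direct = 2 * u         # derivative of u^2 - v^2 with respect to u
print(chain, direct)   # the two agree
```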
12. Gradient, Directional Derivative, Tangent Plane
The partial derivatives tell us how sensitive f is to a change in each variable.
Rewriting the derivative in a new way:
$$\begin{aligned} \frac{dw}{dt}&=w_x \frac{dx}{dt} + w_y \frac{dy}{dt} \\ &= \nabla w \cdot \frac{d\vec{r}}{dt} \end{aligned}$$
Gradient:

$$\nabla w=\langle w_x,w_y \rangle \qquad \frac{d \vec{r}}{dt}=\left\langle \frac{dx}{dt},\frac{dy}{dt} \right\rangle$$
Directional derivative (for a unit vector $\vec{u}$):

$$\left.\frac{dw}{ds}\right|_{\vec{u}} = \nabla w \cdot \vec{u} = | \nabla w | \cos \theta$$
Meaning of the gradient vector: its components are the partial derivatives.
Property: 1. The gradient vector is perpendicular to the level surface of the function, i.e. it is a normal vector of the tangent plane to the level surface (think of it by raising or lowering the dimension).
Applications: 1. Finding tangent-plane equations. 2. Directional derivatives: the function changes fastest when the angle between the direction vector and the gradient is 0, i.e. the function increases fastest along the gradient direction.
A directional derivative is a number (a slope), while the gradient is a vector; the directional derivative at a point is the slope of the curve cut out of the surface by a vertical plane through that direction (see appendix).
Partial derivatives solve many physics problems; many physical laws are described by partial differential equations (equations built from the partial derivatives of an unknown function).
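A short Python sketch of these ideas (the function $w=x^2+y^2$ and the point $(1,-1)$ are my own example): sampling directions $\vec{u}=(\cos\theta,\sin\theta)$ shows that the largest directional derivative equals $|\nabla w|$.

```python
import math

def grad_w(x, y):
    # w(x, y) = x^2 + y^2, so grad w = <2x, 2y>
    return (2 * x, 2 * y)

gx, gy = grad_w(1.0, -1.0)
norm = math.hypot(gx, gy)  # |grad w| = 2*sqrt(2)

# Directional derivative dw/ds in direction (cos t, sin t) is grad w . u
angles = [i * 2 * math.pi / 3600 for i in range(3600)]
best = max(gx * math.cos(t) + gy * math.sin(t) for t in angles)
print(best, norm)  # the maximum over all directions equals |grad w|
```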
13. Lagrange Multipliers
Lagrange multipliers: minimize or maximize a multivariable function subject to a constraint. For example, minimize or maximize f(x,y) when x and y are not independent, say because g(x,y)=C. The method is most useful when the constraint is too complicated to solve explicitly.
Here the extremum cannot be found simply from critical points, because the critical points of f generally do not satisfy the constraint; unconstrained methods such as least squares or gradient descent therefore do not apply directly.

$$\nabla f = \lambda \nabla g \quad (\text{i.e. } \nabla f \parallel \nabla g)$$

Open question: how to find the extrema of a linear function of several variables?
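A minimal numeric illustration (Python sketch; the problem, minimize $f=x^2+y^2$ subject to $g=x+y=1$, is my own example, not from the notes): at the constrained minimum the two gradients are parallel.

```python
# Minimize f(x, y) = x^2 + y^2 subject to g(x, y) = x + y = 1.
# Substituting y = 1 - x gives f = x^2 + (1 - x)^2, minimized at x = 1/2.
x, y = 0.5, 0.5

grad_f = (2 * x, 2 * y)  # gradient of f
grad_g = (1.0, 1.0)      # gradient of g

# grad f = lambda * grad g with lambda = 1: the 2-D cross product is 0
cross = grad_f[0] * grad_g[1] - grad_f[1] * grad_g[0]
print(grad_f, grad_g, cross)  # cross == 0 means the gradients are parallel
```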
14. Non-independent Variables
- Rates of change among constrained variables (moving along a level surface):

$$g=g(x,y,z)=C \\ \mathrm{d}g=g_x \mathrm{d}x + g_y \mathrm{d}y + g_z \mathrm{d}z = 0 \\ \mathrm{d}z = -\frac{g_x}{g_z} \mathrm{d}x - \frac{g_y}{g_z} \mathrm{d}y \\ \Rightarrow \frac{\partial z}{\partial x} = -\frac{g_x}{g_z} \quad \frac{\partial z}{\partial y} = -\frac{g_y}{g_z}$$

- Constrained partial derivatives, e.g. for f(x,y,z) where g(x,y,z)=C:
$\left(\frac{\partial f}{\partial z}\right)_y$ (y held fixed): use differentials and the chain rule,

$$\left(\frac{\partial f}{\partial z}\right)_y=\frac{\partial f}{\partial x} \left(\frac{\partial x}{\partial z}\right)_y + \frac{\partial f}{\partial y} \left(\frac{\partial y}{\partial z}\right)_y + \frac{\partial f}{\partial z} \left(\frac{\partial z}{\partial z}\right)_y$$

where $\left(\frac{\partial y}{\partial z}\right)_y = 0$ and $\left(\frac{\partial z}{\partial z}\right)_y = 1$.
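The constrained rate $\frac{\partial z}{\partial x} = -\frac{g_x}{g_z}$ can be checked numerically (Python sketch; the sphere $g=x^2+y^2+z^2=9$ is my own example): solve the constraint for z locally and compare a finite difference with the formula.

```python
import math

# Constraint g(x, y, z) = x^2 + y^2 + z^2 = 9 (a sphere), upper branch z > 0
def z_of(x, y):
    return math.sqrt(9 - x ** 2 - y ** 2)

x, y = 1.0, 2.0
z = z_of(x, y)  # z = 2 here

formula = -(2 * x) / (2 * z)  # -g_x / g_z
h = 1e-6
numeric = (z_of(x + h, y) - z_of(x - h, y)) / (2 * h)  # central difference at fixed y
print(formula, numeric)  # the two agree
```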
Appendix
Appendix 1. The gradient vector is perpendicular to the tangent plane of the level surface
clear; clc; clf;
f = @(x, y, z) x.^3 + 3*y.^2 + 2.*x.*y + z.^2 - 1;
fimplicit3(f);
hold on;
quiver3(0, 0, 1, 0, 0, 2);
xlabel('x'); ylabel('y'); zlabel('z');
quiver3(1, 0, 0, 3, 2, 0);
axis vis3d;
Appendix 2. Gradient Descent
Suppose

$$y=\beta_0+\beta_1 x \\ L(\beta) = \frac{1}{N} \sum_{j=1}^N (\beta_0 + \beta_1 x_j - y_j)^2 \\ \nabla L = \left (\frac{\partial L}{\partial \beta_0},\frac{\partial L}{\partial \beta_1} \right )=\left (\frac{2}{N} \sum_{j=1}^N(\beta_0 + \beta_1 x_j - y_j),\ \frac{2}{N} \sum_{j=1}^N (\beta_0 + \beta_1 x_j - y_j) x_j \right )$$
Steps of gradient descent:
- When $i = 0$: choose an initial point $\beta^0=(\beta_0^0,\beta_1^0)$, a step size (also called the learning rate) $\alpha$, and a stopping tolerance tol.
- Compute the gradient $\nabla L_{\beta^i}$ of the objective $L(\beta)$ at the point $(\beta_0^i,\beta_1^i)$.
- Compute $\beta^{i+1}$ by the update

$$\beta^{i+1} = \beta^i - \alpha \nabla L_{\beta^i}$$

- Compute the gradient $\nabla L_{\beta^{i+1}}$. If its 2-norm satisfies $||\nabla L_{\beta^{i+1}}||_2 \leq \text{tol}$, stop iterating and take $\beta^{i+1}$ as the optimum; otherwise set $i = i + 1$ and return to the update step.
% Gradient descent
function gd()
clear
clc
clf
% Training data
X = 1:9;
Y = [1 2 6 7 9 12 13 15 20];
% X = 1:9;
% Y = [742 400 388 762 821 876 854 793 327];
% Initial settings
beta = [1, 1];
alpha = 0.2;
tol_L = 0.01;
batch_size = 4;
% Normalize X
max_x = max(X);
X = X / max_x;
subplot(1, 2, 1);
% syms beta_0 beta_1;
% L = 1/length(x)*sum((beta_0 + beta_1*x - y)^2);
% L = mean((beta_0 + beta_1.*X - Y).^2);
[bb_0, bb_1] = meshgrid(-15:.5:15);
LL = bb_0;
[m n] = size(bb_0);
for i = 1:m
for j = 1:n
% LL(i,j) = subs(L, {beta_0, beta_1}, {bb_0(i,j), bb_1(i,j)});
LL(i,j) = rmse([bb_0(i,j), bb_1(i,j)], X, Y);
end
end
mesh(bb_0, bb_1, LL);
% First computation
% grad = compute_grad(beta, X, Y);
% grad = compute_grad_SGD(beta, X, Y);
grad = compute_grad_batch(beta, batch_size, X, Y);
loss = rmse(beta, X, Y);
% plotting begin
hold on
plot3(beta(1), beta(2), loss, 'ro');
quiver3(beta(1), beta(2), loss, -grad(1), -grad(2), 0, 'Color', 'r');
subplot(1, 2, 2);
plot(X, Y, 'o');
hold on
XA = 0:0.01:1.2;
YA = beta(1) + beta(2) .* XA;
plot(XA, YA);
% plotting end
beta = update_beta(beta, alpha, grad);
% grad = compute_grad(beta, X, Y);
% grad = compute_grad_SGD(beta, X, Y);
grad = compute_grad_batch(beta, batch_size, X, Y);
loss_new = rmse(beta, X, Y);
% Start iterating
i = 1;
while abs(loss_new - loss) > tol_L
% plotting
subplot(1, 2, 1);
plot3(beta(1), beta(2), loss_new, 'bo');
quiver3(beta(1), beta(2), loss_new, -grad(1), -grad(2), 0, 'Color', 'r');
subplot(1, 2, 2);
plot(X, Y, 'o');
hold on
XA = 0:0.01:1.2;
YA = beta(1) + beta(2) .* XA;
plot(XA, YA);
axis([0 1.2 0 25]);
hold off
% subplot(2, 2, [3 4]);
% hold on
% plot(i, abs(loss_new - loss), 'or');
% M(i) = getframe;
getframe;
beta = update_beta(beta, alpha, grad);
% grad = compute_grad(beta, X, Y);
% loss = loss_new;
% loss_new = rmse(beta, X, Y);
% fprintf('Round %d Diff RMSE %f\n', i, abs(loss_new - loss));
% grad = compute_grad_SGD(beta, X, Y);
grad = compute_grad_batch(beta, batch_size, X, Y);
if mod(i, 2) == 0
loss = loss_new;
loss_new = rmse(beta, X, Y);
fprintf('Round %d Diff RMSE %f\n', i, abs(loss_new - loss));
end
i = i + 1;
end
fprintf('Coef: %f, Intercept: %f\n', beta(2), beta(1));
fprintf('Our Coef: %f, Intercept: %f\n', beta(2) / max_x, beta(1))
res = rmse(beta, X, Y);
fprintf('Our RMSE: %f\n', res);
end
% Full-batch gradient
% Pros/cons: stable, but slow
function grad = compute_grad(beta, x, y)
grad = [0, 0];
grad(1) = 2 .* mean(beta(1) + beta(2) .* x - y);
grad(2) = 2 .* mean((beta(1) + beta(2) .* x - y) .* x);
end
% Stochastic gradient (SGD)
% Pros/cons: fast, but less stable
function grad = compute_grad_SGD(beta, x, y)
grad = [0, 0];
r = randperm(length(x), 1);
grad(1) = 2 .* mean(beta(1) + beta(2) .* x(r) - y(r));
grad(2) = 2 .* mean((beta(1) + beta(2) .* x(r) - y(r)) .* x(r));
end
% Mini-batch stochastic gradient
% A compromise between speed and stability
function grad = compute_grad_batch(beta, batch_size, x, y)
grad = [0, 0];
r = randperm(length(x), batch_size);
grad(1) = 2 .* mean(beta(1) + beta(2) .* x(r) - y(r));
grad(2) = 2 .* mean((beta(1) + beta(2) .* x(r) - y(r)) .* x(r));
end
% Update beta
function new_beta = update_beta(beta, alpha, grad)
new_beta = beta - alpha .* grad;
end
% RMSE (Root Mean Squared Error)
function res = rmse(beta, x, y)
squared_err = (beta(1) + beta(2) .* x - y).^2;
res = sqrt(mean(squared_err));
end
Appendix 3. Directional Derivative and Gradient Vector
clear; clc; clf;
% Plot using a 3-D implicit function
f = @(x, y, z) x.^2 + y.^2 - z;
fimplicit3(f);
% Gradient at the point (1, -1)
hold on;
syms x y;
z = x^2 + y^2;
gradz = gradient(z);
gradzv = subs(gradz, {x,y}, {1,-1});
quiver(1, -1, gradzv(1,1), gradzv(2,1));
% Directional derivative at (1, -1); its direction makes angle theta with the x-axis
theta = pi*3/4;
f_2 = @(x, y, z) x*cos(theta) + y*sin(theta) - (cos(theta) - sin(theta));
fimplicit3(f_2);
% Axis configuration
axis vis3d;
xlabel('x'); ylabel('y'); zlabel('z');
Appendix 4. Lagrange Multipliers
% With a constraint, this amounts to minimizing along the curve where the two surfaces intersect
clear
clc
clf
subplot(1, 2, 1);
% Constraint xy = 3
f = @(x, y, z) x .* y - 3;
fimplicit3(f, [-10 10 -10 10 0 60]);
hold on
f = @(x, y, z) x.^2 + y.^2 - z;
fimplicit3(f);
% The gradient vectors are parallel
quiver([-sqrt(3)], [-sqrt(3)], [-2*sqrt(3)], [-2*sqrt(3)]);
quiver([-sqrt(3)], [-sqrt(3)], [-sqrt(3)], [-sqrt(3)]);
xlabel('x');
ylabel('y');
zlabel('z');
subplot(1, 2, 2);
% Intersection of the vertical cylinder xy = 3 with z = x^2 + y^2
x = 0.01:0.01:10;
y = 3./x;
z = x.^2 + y.^2;
plot3(x, y, z);
hold on
plot(x, y);
x = -10:0.01:-0.01;  % stop before x = 0 to avoid division by zero
y = 3./x;
z = x.^2 + y.^2;
plot3(x, y, z);
plot(x, y);
axis([-10 10 -10 10 0 60]);
xlabel('x');
ylabel('y');
zlabel('z');