Mathematics Basics - Multivariate Calculus (Optimization)

Newton-Raphson Method

We have learned previously that a continuous and differentiable function can be approximated by a straight line tangent to it at a point.

$$f(x+\Delta x)\approx f(x)+f^{(1)}(x)\,\Delta x$$

We can use this approximation formula to help us find the roots of the function, i.e. where $f(x+\Delta x)=0$. First, let's rearrange the equation to get an expression for $\Delta x$.

$$\Delta x\approx\frac{f(x+\Delta x)-f(x)}{f^{(1)}(x)}$$

At a root of the function, we have $f(x+\Delta x)=0$.

$$\begin{aligned}\Delta x&\approx\frac{0-f(x)}{f^{(1)}(x)}\\&\approx-\frac{f(x)}{f^{(1)}(x)}\end{aligned}$$

Therefore, we have obtained an expression for $\Delta x$ in terms of the function $f(x)$ and its first-order derivative $f^{(1)}(x)$. Let's see how the sign of $\Delta x$ changes with different values of $f(x)$ and $f^{(1)}(x)$.

1. $f(x)>0$ and $f^{(1)}(x)>0$

When both $f(x)$ and $f^{(1)}(x)$ are positive, $\Delta x$ is negative. In order to reach the root value $x'$, we need to decrease $x$ by the distance $|\Delta x|$. Therefore, $x'=x+\Delta x$.

2. $f(x)<0$ and $f^{(1)}(x)>0$

When $f(x)$ is negative and $f^{(1)}(x)$ is positive, $\Delta x$ is positive. In order to reach the root value $x'$, we need to increase $x$ by the distance $|\Delta x|$. Therefore, $x'=x+\Delta x$.

3. $f(x)>0$ and $f^{(1)}(x)<0$

When $f(x)$ is positive and $f^{(1)}(x)$ is negative, $\Delta x$ is positive. In order to reach the root value $x'$, we need to increase $x$ by the distance $|\Delta x|$. Therefore, $x'=x+\Delta x$.

4. $f(x)<0$ and $f^{(1)}(x)<0$

When both $f(x)$ and $f^{(1)}(x)$ are negative, $\Delta x$ is negative. In order to reach the root value $x'$, we need to decrease $x$ by the distance $|\Delta x|$. Therefore, $x'=x+\Delta x$.

We can see that no matter what values $f(x)$ and $f^{(1)}(x)$ take, we always use the same formula to adjust $x$ by $\Delta x$.

$$x'=x+\Delta x$$

There is one hitch here. In practice, the function $f(x)$ is not exactly a straight line, so the tangent-line approximation is not exact and we cannot reach the root value $x'$ with one single update. Instead, we have to apply the formula repeatedly to update $x$ until we hit a value that is close enough to the root.

$$x_{n+1}=x_n+\Delta x=x_n-\frac{f(x_n)}{f^{(1)}(x_n)}$$

This is called the Newton-Raphson method. It is used to find an approximate solution of $f(x)=0$ when there are not enough computational resources to evaluate the function at every point.

Let's walk through one example to demonstrate how this method is used. We define a cubic function $f(x)$ and its first-order derivative $f^{(1)}(x)$ as

$$\begin{aligned} f(x)&=x^3-2x+2\\ f^{(1)}(x)&=3x^2-2 \end{aligned}$$

*Figure: Newton-Raphson example*

We will use the Newton-Raphson method to find a solution of $x^3-2x+2=0$.

Starting at the point $x=-2$, let's update the value of $x$ by $\frac{f(x_n)}{f^{(1)}(x_n)}$ in the table below. Each row represents one update iteration.

| $n$ | $x_n$ | $f(x_n)$ | $f^{(1)}(x_n)$ | $\frac{f(x_n)}{f^{(1)}(x_n)}$ |
|---|---|---|---|---|
| 0 | -2 | -2 | 10 | -0.2 |
| 1 | -1.8 | -0.232 | 7.72 | -0.0301 |
| 2 | -1.770 | -0.00485 | 7.398 | -0.000656 |
| 3 | -1.769 | -2.28E-06 | 7.391 | -3.087E-07 |

Within just three iterations, we have obtained a fairly good value of $x$ that solves the equation $f(x)=0$. Although in theory we could keep iterating until $f(x)$ is arbitrarily close to zero, we usually choose to stop at an acceptable distance from $f(x)=0$. The Newton-Raphson method can therefore be implemented conveniently on a computer for finding roots of equations.
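To make the procedure concrete, here is a minimal sketch in Python (the helper name `newton_raphson` and the stopping rule are my own illustrative choices, not from the original text):

```python
def newton_raphson(f, f_prime, x0, tol=1e-6, max_iter=50):
    """Iterate x_{n+1} = x_n - f(x_n)/f'(x_n) until |f(x)| < tol."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:           # close enough to the root, stop here
            return x
        x = x - fx / f_prime(x)     # the Newton-Raphson update
    raise RuntimeError("did not converge within max_iter iterations")

f = lambda x: x**3 - 2*x + 2        # the cubic from the example
f_prime = lambda x: 3*x**2 - 2      # its first-order derivative

print(newton_raphson(f, f_prime, x0=-2))   # ~ -1.7693, matching the table
```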

There are, however, some limitations to this method. Firstly, the adjustment to $x$ at each iteration is controlled by $\frac{f(x)}{f^{(1)}(x)}$. This ratio can be very big when $f^{(1)}(x)$ is small. As a result, the $x$ values may diverge instead of converging around the stationary points of the function: $x$ can be increased or decreased so much that it overshoots the root value. Moreover, if we were unlucky enough to land $x$ exactly at a point where $f^{(1)}(x)=0$, the ratio $\frac{f(x)}{f^{(1)}(x)}$ becomes undefined.

Secondly, the starting point of $x$ has a significant influence on whether a solution can be found. Using the same function $f(x)$, let's now start at the point $x=0$.

| $n$ | $x_n$ | $f(x_n)$ | $f^{(1)}(x_n)$ | $\frac{f(x_n)}{f^{(1)}(x_n)}$ |
|---|---|---|---|---|
| 0 | 0 | 2 | -2 | -1 |
| 1 | 1 | 1 | 1 | 1 |
| 2 | 0 | 2 | -2 | -1 |
| 3 | 1 | 1 | 1 | 1 |
| 4 | 0 | 2 | -2 | -1 |

$x_n$ oscillates between 0 and 1 at every iteration without ever reaching the root value. Therefore, the Newton-Raphson method does not guarantee that a solution can be found from every starting point.
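We can see this failure mode directly by reusing the `newton_raphson` sketch defined above with the new starting point; the iterates just bounce between 0 and 1 until the iteration budget runs out:

```python
# Starting at x = 0, the iterates oscillate between 0 and 1 forever,
# so the helper defined earlier raises after max_iter updates.
try:
    newton_raphson(f, f_prime, x0=0)
except RuntimeError as e:
    print(e)   # did not converge within max_iter iterations
```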

Gradient Descent

Gradient descent is a method used extensively in modern machine learning and deep learning algorithms. Interestingly, it shares some intuition with the Newton-Raphson method: we iteratively update the input variables of a function until certain conditions are met. However, instead of finding the roots of a function, gradient descent is applied to find the maxima or minima of a function. Moreover, the Newton-Raphson method is usually used for univariate functions only, while gradient descent can be used with multivariate functions.

We first define a term called the gradient vector. It is a vector consisting of the partial derivatives of a function $f$ with respect to each of its input variables, and it is denoted by $\nabla f$.

$$\nabla f(x_1,x_2,\cdots,x_n)= \begin{pmatrix} \frac{\partial f}{\partial x_1}\\ \frac{\partial f}{\partial x_2}\\ \vdots\\ \frac{\partial f}{\partial x_n} \end{pmatrix}$$

You may recall that we defined the Jacobian in a similar way, except that the Jacobian is a row vector by convention. The gradient vector is therefore just the transpose of the Jacobian.

$$\nabla f=(J_f)^T$$

From any starting point of function $f$, we would like to find the direction in which the function increases (or decreases) the fastest. It turns out that the gradient vector gives us exactly this direction.

To illustrate this idea, let's look at the case of a two-dimensional function $f(x, y)$. At a point $(x_1, y_1)$, we can move a small distance $(\delta x, \delta y)$ away from $(x_1, y_1)$ in some direction. This changes our function value by

$$df=\frac{\partial f}{\partial x}\cdot\delta x+\frac{\partial f}{\partial y}\cdot\delta y$$

We represent the movement in the $x$ and $y$ directions by a unit vector $r$, which has length 1 and components $\delta x$ and $\delta y$.

$$r=\begin{pmatrix}\delta x\\\delta y\end{pmatrix},\qquad |r|=1$$

Therefore, the change in the value of function $f$ can be expressed as the dot product of the gradient vector $\nabla f$ and the direction vector $r$:

$$df=\nabla f\cdot r$$

Now, in order to find the direction in which function $f$ increases the most, we need to maximize the dot product $\nabla f\cdot r$. We learned previously in linear algebra that the dot product of two vectors can be expressed in terms of their magnitudes and the angle $\theta$ between them.

$$\nabla f\cdot r=||\nabla f||\cdot||r||\cdot\cos(\theta)$$

To maximize this expression, we need $\cos(\theta)=1$, which means $\theta=0$. Therefore, the direction of vector $r$ is parallel to $\nabla f$: to increase function $f$ the most, we need to move in the direction of the gradient vector.

Moreover, since vector $r$ is a unit vector in the same direction as $\nabla f$, it is just the normalized version of $\nabla f$.

$$r=\frac{\nabla f}{||\nabla f||}$$

Substituting this into our equation for $df$:

$$\begin{aligned} df&=\nabla f\cdot r\\ &=\nabla f\cdot\frac{\nabla f}{||\nabla f||}\\ &=\frac{\nabla f \cdot \nabla f}{||\nabla f||}\\ &=\frac{||\nabla f||^2}{||\nabla f||}\\ &=||\nabla f|| \end{aligned}$$

The maximum increase achievable by a unit step in the direction of $\nabla f$ is the magnitude of $\nabla f$ itself. Isn't that amazing?
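We can sanity-check this claim numerically. The sketch below (my own example function, assuming NumPy is available) scans many unit directions and confirms that the largest rate of increase matches $||\nabla f||$:

```python
import numpy as np

f = lambda p: p[0]**2 + 3*p[1]             # an arbitrary smooth f(x, y)
grad = lambda p: np.array([2*p[0], 3.0])   # its gradient: (2x, 3)

p = np.array([1.0, 2.0])
eps = 1e-4

# Largest change in f over many unit directions, per unit of distance moved.
best = max(f(p + eps*np.array([np.cos(t), np.sin(t)])) - f(p)
           for t in np.linspace(0, 2*np.pi, 1000))
print(best / eps)                    # ~ 3.606
print(np.linalg.norm(grad(p)))       # ||grad f|| = sqrt(2^2 + 3^2) ~ 3.606
```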

Since $\nabla f$ points in a direction that increases function $f$, we take the negative of $\nabla f$ to decrease $f$ and thus find a minimum point of $f$. This is what we usually use in machine learning algorithms to minimize a cost function of multiple parameters. Just like the Newton-Raphson method, we iteratively update the input variables using the gradient vector until $\nabla f$ is 0 or very close to 0. This process is called gradient descent.

$$\mathbf{x}_{n+1}=\mathbf{x}_n-\alpha\nabla f(\mathbf{x}_n)$$

We use $\mathbf{x}_n$ to represent the vector of input variables at iteration $n$, and a constant $\alpha$ (often called the learning rate) is introduced to control the pace of gradient descent. As we approach a minimum of function $f$, $\nabla f$ gets smaller and smaller. When $\nabla f$ has decreased to 0 or an extremely small value, we are at the minimum point and $\mathbf{x}_{n+1}$ is no longer updated.
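Here is a minimal gradient-descent sketch in Python (the function being minimized, the helper name, and the stopping rule are my own illustrative choices):

```python
import numpy as np

def gradient_descent(grad, x0, alpha=0.1, tol=1e-8, max_iter=10_000):
    """Iterate x_{n+1} = x_n - alpha * grad f(x_n) until the gradient is tiny."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # gradient ~ 0: we have reached a minimum
            break
        x = x - alpha * g             # step against the gradient
    return x

# Example: f(x, y) = (x - 1)^2 + 2*(y + 3)^2, whose minimum is at (1, -3).
grad_f = lambda v: np.array([2*(v[0] - 1), 4*(v[1] + 3)])
print(gradient_descent(grad_f, x0=[0.0, 0.0]))   # ~ [ 1. -3.]
```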

Although we have only demonstrated gradient descent in the two-dimensional case, it works in exactly the same way in higher dimensions. With the gradient descent method, we do not need to evaluate the function everywhere to find its minimum. Instead, we take a small step from our current point, with the step determined by the gradient at that point, and repeat the same process until we arrive at the minimum. This is a task computers can accomplish fairly efficiently, which is why we see gradient descent applied universally in machine learning and deep learning algorithms to optimize highly complex functions.

Lagrange Multipliers

We have learned to use gradient descent to find the minimum (or maximum) of a multivariate function. There is another type of optimization problem we encounter in practice called constrained optimization. It requires us to find the maximum or minimum of a function, subject to some constraints.

For example,

$$\max f(x,y)=x^2y\qquad\text{s.t. } x^2+y^2=1$$

The constraint equation tells us that $x$ and $y$ must come from the set of values that satisfy $x^2+y^2=1$. A brute-force approach is therefore to simply enumerate all possible values of $x$ and $y$ and find which combination yields the maximum $f(x,y)$. However, this approach is often not practical, either because too many combinations exist or because it is computationally too expensive to evaluate them all.

Luckily, the mathematician Joseph-Louis Lagrange discovered that there is an implicit relationship between the gradient of the function and the gradient of the constraint. At a maximum or minimum point that satisfies the constraint, the gradient of the function and the gradient of the constraint must be parallel to each other. Therefore,

$$\nabla f(\mathbf{x})=\lambda\nabla g(\mathbf{x})$$

where $\nabla f(\mathbf{x})$ is the gradient vector of function $f$ at the maximum or minimum point and $\nabla g(\mathbf{x})$ is the gradient vector of the constraint function $g$ at the same point. The constant $\lambda$ is called the Lagrange multiplier.

In our earlier example, we can evaluate the gradient vectors of the function and the constraint to get

$$\begin{pmatrix}2xy\\x^2\end{pmatrix}=\lambda\begin{pmatrix}2x\\2y\end{pmatrix}$$

There are 3 unknowns but, so far, only 2 equations. Don't forget that the constraint equation $x^2+y^2=1$ gives us a third. We can then write out a system of simultaneous equations relating the input variables $x$ and $y$ and the Lagrange multiplier $\lambda$.

$$\begin{aligned} 2xy&=2\lambda x\\ x^2&=2\lambda y\\ x^2+y^2&=1 \end{aligned}$$

Solving this system (and setting aside the degenerate solutions with $x=0$, which give $f=0$), we obtain the following four pairs of $x$ and $y$ values.

$$\frac{1}{\sqrt{3}}\begin{pmatrix}\sqrt{2}\\1\end{pmatrix},\quad \frac{1}{\sqrt{3}}\begin{pmatrix}\sqrt{2}\\-1\end{pmatrix},\quad \frac{1}{\sqrt{3}}\begin{pmatrix}-\sqrt{2}\\1\end{pmatrix},\quad \frac{1}{\sqrt{3}}\begin{pmatrix}-\sqrt{2}\\-1\end{pmatrix}$$

It is easy to verify that two of these pairs give us the maximum value of function $f$.

$$f\left(\sqrt{\tfrac{2}{3}},\sqrt{\tfrac{1}{3}}\right)=f\left(-\sqrt{\tfrac{2}{3}},\sqrt{\tfrac{1}{3}}\right)=\frac{2}{3\sqrt{3}}$$

And the other two pairs give us the minimum value of function $f$.

$$f\left(\sqrt{\tfrac{2}{3}},-\sqrt{\tfrac{1}{3}}\right)=f\left(-\sqrt{\tfrac{2}{3}},-\sqrt{\tfrac{1}{3}}\right)=-\frac{2}{3\sqrt{3}}$$
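As a check, we can solve the same system symbolically. Here is a small sketch using SymPy (assuming it is installed; the variable names are my own):

```python
import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)
f = x**2 * y
g = x**2 + y**2 - 1

# The Lagrange conditions grad f = lam * grad g, plus the constraint itself.
eqs = [sp.diff(f, x) - lam*sp.diff(g, x),
       sp.diff(f, y) - lam*sp.diff(g, y),
       g]

for s in sp.solve(eqs, [x, y, lam], dict=True):
    print(s, ' f =', sp.simplify(f.subs(s)))
# The four non-degenerate solutions give f = ±2/(3*sqrt(3));
# the degenerate ones with x = 0 give f = 0.
```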

What about the Lagrange multiplier $\lambda$ in the system of equations? This constant $\lambda$ carries a meaning by itself, too. It shows how much the maximum (or minimum) value of function $f$ increases (or decreases) for a unit change in the constraint.

From the first equation (with $x\neq 0$), the solution for $\lambda$ is

$$\lambda=y$$

At the maximum points of the function,

$$\lambda=\frac{1}{\sqrt{3}}$$

This means that for a unit increase in the constraint, the maximum of function $f$ increases by approximately $\frac{1}{\sqrt{3}}$. Suppose our constrained optimization problem becomes

$$\max f(x,y)=x^2y\qquad\text{s.t. } x^2+y^2=2$$

where the constraint value is 2 instead of 1. The maximum value of function $f$ will then increase from $\frac{2}{3\sqrt{3}}$ to roughly $\frac{5}{3\sqrt{3}}$ (roughly, because the Lagrange multiplier gives only a first-order estimate of the change).

Conversely, at the minimum points of the function,

$$\lambda=-\frac{1}{\sqrt{3}}$$

Therefore, for a unit increase in the constraint, the minimum of function $f$ decreases by approximately $\frac{1}{\sqrt{3}}$.
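We can verify this sensitivity interpretation with SymPy as well. Under the constraint $x^2+y^2=c$, repeating the working above gives the constrained maximum $M(c)=\frac{2c^{3/2}}{3\sqrt{3}}$, and its derivative at $c=1$ should equal $\lambda$ (this closed form is my own derivation following the steps in this section, not a formula from the original text):

```python
import sympy as sp

c = sp.symbols('c', positive=True)
M = 2*c**sp.Rational(3, 2) / (3*sp.sqrt(3))   # constrained maximum as a function of c

print(sp.diff(M, c).subs(c, 1))   # sqrt(3)/3, i.e. lambda = 1/sqrt(3)
print(float(M.subs(c, 1)))        # ~ 0.385, i.e. 2/(3*sqrt(3))
print(float(M.subs(c, 2)))        # ~ 1.089 (vs the first-order estimate 5/(3*sqrt(3)) ~ 0.962)
```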

This result is very important. With the Lagrange multiplier, we not only know the values of the variables that maximize (or minimize) a function subject to some constraint; we also know how much the constraint itself influences the maximum (or minimum) attainable value of the function. One typical application is solving for the maximum revenue of a project subject to some resource constraint. We can evaluate how much the maximum revenue would increase if we could relax the resource constraint a little.

In the last three articles, we have seen how multivariate calculus is applied to different optimization algorithms. Starting with the Newton-Raphson method, we repeatedly update $x$ using the function's derivative until we reach a point where the function value is very close to zero. We then extended this idea to multivariate problems and used the gradient descent method to find the maximum or minimum point of a function. Moreover, when our optimization problem involves constraints, we can use the Lagrange multiplier method to solve it. I hope you have started to see the beauty of multivariate calculus through these optimization examples.

This also concludes our journey of introducing the mathematics basics of multivariate calculus. It is such an important tool that you will encounter it very often in future studies of machine learning and deep learning. I hope you have enjoyed reading so far and have gained a good understanding of multivariate calculus and its applications.


(Inspired by Mathematics for Machine Learning lecture series from Imperial College London)
