Mathematics Basics - Multivariate Calculus (Optimization)

Newton-Raphson Method

We have learned previously that a continuous and differentiable function can be approximated by a straight line tangent to it at a point.

$$f(x+\Delta x)\approx f(x)+f^{(1)}(x)\,\Delta x$$

We can use this approximation formula to help us find the roots of the function, i.e. the points where $f(x+\Delta x)=0$. First, let's rearrange the equation to find an expression for $\Delta x$.

$$\Delta x\approx\frac{f(x+\Delta x)-f(x)}{f^{(1)}(x)}$$

At the root of the function, we have $f(x+\Delta x)=0$.

$$\begin{aligned}\Delta x&\approx\frac{0-f(x)}{f^{(1)}(x)}\\&\approx-\frac{f(x)}{f^{(1)}(x)}\end{aligned}$$

Therefore, we have obtained an expression for $\Delta x$ in terms of the function $f(x)$ and its first-order derivative $f^{(1)}(x)$. Let's see how the sign of $\Delta x$ changes with different values of $f(x)$ and $f^{(1)}(x)$.

1. $f(x)>0$ and $f^{(1)}(x)>0$

   When both $f(x)$ and $f^{(1)}(x)$ are positive, $\Delta x$ is negative. In order to reach the root value $x'$, we need to decrease $x$ by the distance $|\Delta x|$. Therefore, $x'=x+\Delta x$.

2. $f(x)<0$ and $f^{(1)}(x)>0$

   When $f(x)$ is negative and $f^{(1)}(x)$ is positive, $\Delta x$ is positive. In order to reach the root value $x'$, we need to increase $x$ by the distance $|\Delta x|$. Therefore, $x'=x+\Delta x$.

3. $f(x)>0$ and $f^{(1)}(x)<0$

   When $f(x)$ is positive and $f^{(1)}(x)$ is negative, $\Delta x$ is positive. In order to reach the root value $x'$, we need to increase $x$ by the distance $|\Delta x|$. Therefore, $x'=x+\Delta x$.

4. $f(x)<0$ and $f^{(1)}(x)<0$

   When both $f(x)$ and $f^{(1)}(x)$ are negative, $\Delta x$ is negative. In order to reach the root value $x'$, we need to decrease $x$ by the distance $|\Delta x|$. Therefore, $x'=x+\Delta x$.

We can see that no matter what values $f(x)$ and $f^{(1)}(x)$ take, we always use the same formula to adjust $x$ by $\Delta x$.

$$x'=x+\Delta x$$

There is one hitch here. In practice, the function $f(x)$ is not exactly a straight line, so we cannot reach the root value $x'$ with a single update. Instead, we have to apply the formula repeatedly to update $x$ until we hit a value that is close enough to the root.

$$x_{n+1}=x_n+\Delta x=x_n-\frac{f(x_n)}{f^{(1)}(x_n)}$$

This is called the Newton-Raphson method. It is used to find an approximate solution of $f(x)=0$ when there are not enough computational resources to evaluate the function at every point.

Let's walk through one example to demonstrate how this method is used. We define a cubic function $f(x)$ and its first-order derivative $f^{(1)}(x)$ as

$$\begin{aligned} f(x)&=x^3-2x+2\\ f^{(1)}(x)&=3x^2-2 \end{aligned}$$

(Figure: Newton-Raphson example)

We will use the Newton-Raphson method to find a solution of $x^3-2x+2=0$.

Starting at the point $x=-2$, let's update the value of $x$ by subtracting $\frac{f(x_n)}{f^{(1)}(x_n)}$ at each step, as shown in the table below. Each row represents one update iteration.

| $n$ | $x_n$ | $f(x_n)$ | $f^{(1)}(x_n)$ | $\frac{f(x_n)}{f^{(1)}(x_n)}$ |
|---|---|---|---|---|
| 0 | $-2$ | $-2$ | $10$ | $-0.2$ |
| 1 | $-1.8$ | $-0.232$ | $7.72$ | $-0.0301$ |
| 2 | $-1.770$ | $-0.00485$ | $7.398$ | $-0.000656$ |
| 3 | $-1.769$ | $-2.28\times10^{-6}$ | $7.391$ | $-3.087\times10^{-7}$ |

Within just three iterations, we have obtained a fairly good value of $x$ that solves the equation $f(x)=0$. Although in theory we could keep iterating until $f(x)$ is arbitrarily close to zero, we usually choose to stop at an acceptable distance from $f(x)=0$. The Newton-Raphson method can therefore be conveniently implemented on a computer for finding the roots of equations.
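To make the procedure concrete, here is a minimal Python sketch of the iteration (the function name `newton_raphson`, the tolerance `tol`, and the iteration cap `max_iter` are our own illustrative choices, not part of the original derivation):

```python
def newton_raphson(f, f_prime, x0, tol=1e-6, max_iter=50):
    """Iterate x <- x - f(x)/f'(x) until |f(x)| falls below tol."""
    x = x0
    for n in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:          # close enough to a root
            return x
        fpx = f_prime(x)
        if fpx == 0:               # derivative vanishes: update undefined
            raise ZeroDivisionError("f'(x) = 0, cannot continue")
        x = x - fx / fpx
    return x                       # best estimate after max_iter steps

# The worked example: f(x) = x^3 - 2x + 2, starting at x = -2
f = lambda x: x**3 - 2*x + 2
f_prime = lambda x: 3*x**2 - 2
print(newton_raphson(f, f_prime, x0=-2.0))  # about -1.769, matching the table
```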

There are, however, some limitations to this method. First, the adjustment to $x$ at each iteration is controlled by $\frac{f(x)}{f^{(1)}(x)}$. This ratio can be very large when $f^{(1)}(x)$ is small. As a result, the $x$ values may diverge instead of converging around the stationary points of the function: $x$ can be increased or decreased so much that it overshoots the root value. Moreover, if we are unlucky enough to land $x$ exactly at a point where $f^{(1)}(x)=0$, the ratio $\frac{f(x)}{f^{(1)}(x)}$ becomes undefined.

Second, the starting point of $x$ has a significant influence on whether a solution can be found. Let's use the same function $f(x)$, but now start at the point $x=0$.

| $n$ | $x_n$ | $f(x_n)$ | $f^{(1)}(x_n)$ | $\frac{f(x_n)}{f^{(1)}(x_n)}$ |
|---|---|---|---|---|
| 0 | $0$ | $2$ | $-2$ | $-1$ |
| 1 | $1$ | $1$ | $1$ | $1$ |
| 2 | $0$ | $2$ | $-2$ | $-1$ |
| 3 | $1$ | $1$ | $1$ | $1$ |
| 4 | $0$ | $2$ | $-2$ | $-1$ |

$x_n$ oscillates between 0 and 1 at each iteration without ever reaching the root value. Therefore, the Newton-Raphson method does not guarantee that a solution can be found from an arbitrary starting point.
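To see this oscillation in code, a short self-contained sketch (same function as above; the iteration count is an arbitrary illustrative choice):

```python
# Same function as before, but starting at x0 = 0
f = lambda x: x**3 - 2*x + 2
f_prime = lambda x: 3*x**2 - 2

x = 0.0
for n in range(6):
    print(n, x)                 # prints 0, 1, 0, 1, 0, 1
    x = x - f(x) / f_prime(x)   # the iterates never settle down
```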

Gradient Descent

Gradient descent is a method used heavily in modern machine learning and deep learning algorithms. Interestingly, it shares some intuition with the Newton-Raphson method: we again iteratively update the input variables of a function until certain conditions are met. However, instead of finding the roots of a function, gradient descent is applied to find the maxima or minima of a function. Moreover, the Newton-Raphson method is usually used for univariate functions only, while gradient descent can be used on multivariate functions.

We first define a term called the gradient vector. It is a vector consisting of the partial derivatives of function $f$ with respect to each of its input variables. This is denoted by $\nabla f$.

$$\nabla f(x_1,x_2,\cdots,x_n)=\begin{pmatrix}\frac{\partial f}{\partial x_1}\\\frac{\partial f}{\partial x_2}\\\vdots\\\frac{\partial f}{\partial x_n}\end{pmatrix}$$

You may recall that we defined the Jacobian in a similar way, except that the Jacobian is a row vector by convention. Therefore, the gradient vector is just the transpose of the Jacobian.

$$\nabla f=(J_f)^T$$
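To make the definition concrete, here is a small sketch that approximates a gradient vector by central finite differences (the helper `grad` and the step size `h` are our own illustrative choices, not a standard library API):

```python
import numpy as np

def grad(f, x, h=1e-6):
    """Approximate the gradient of f at x by central differences."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)  # partial df/dx_i
    return g

# Example: f(x, y) = x^2 * y has gradient (2xy, x^2)
f = lambda v: v[0]**2 * v[1]
print(grad(f, [1.0, 2.0]))  # approximately [4.0, 1.0]
```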

For any starting point of function $f$, we would like to find the steepest direction, the one in which the function increases (or decreases) the most. It turns out that the gradient vector gives us exactly this direction.

To illustrate this idea, let's look at the case of a two-dimensional function $f(x,y)$. At a point $(x_1,y_1)$, we can move a small distance $(\delta x,\delta y)$ away from $(x_1,y_1)$ in some direction. This will change our function value by

$$df=\frac{\partial f}{\partial x}\cdot\delta x+\frac{\partial f}{\partial y}\cdot\delta y$$

We represent the distance moved in the $x$ and $y$ directions by a unit vector $r$, which has length 1 and components $\delta x$ and $\delta y$.

$$r=\begin{pmatrix}\delta x\\\delta y\end{pmatrix},\qquad |r|=1$$

Therefore, the change in value of function $f$ can be expressed as the dot product of the gradient vector $\nabla f$ and the direction vector $r$:

$$df=\nabla f\cdot r$$

Now, in order to find the direction in which function $f$ increases the most, we need to maximize the dot product $\nabla f\cdot r$. We learned previously in linear algebra that the dot product of two vectors can be written in terms of their magnitudes and the angle $\theta$ between them.

$$\nabla f\cdot r=||\nabla f||\cdot||r||\cdot\cos(\theta)$$

To maximize this expression, we need $\cos(\theta)=1$, which means $\theta=0$. Therefore, the direction of vector $r$ is parallel to $\nabla f$: to increase function $f$ the most, we need to move in the direction of the gradient vector.

Moreover, since vector $r$ is a unit vector in the same direction as $\nabla f$, it is just the normalized version of $\nabla f$.

$$r=\frac{\nabla f}{||\nabla f||}$$

Substituting this into our equation for $df$:

$$\begin{aligned} df&=\nabla f\cdot r\\ &=\nabla f\cdot\frac{\nabla f}{||\nabla f||}\\ &=\frac{\nabla f \cdot \nabla f}{||\nabla f||}\\ &=\frac{||\nabla f||^2}{||\nabla f||}\\ &=||\nabla f|| \end{aligned}$$

The maximum increase per unit step, achieved in the direction of $\nabla f$, is the magnitude of $\nabla f$ itself. Isn't that amazing?
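As a quick sanity check, the sketch below compares the rate of change $df$ along several unit directions against $||\nabla f||$ (the gradient value and the sampled angles are our own illustrative choices):

```python
import numpy as np

# Gradient of f(x, y) = x^2 * y at (1, 2) is (2xy, x^2) = (4, 1)
g = np.array([4.0, 1.0])

# Rate of change along unit directions at angle theta to the x-axis
for theta in np.linspace(0, 2 * np.pi, 8, endpoint=False):
    r = np.array([np.cos(theta), np.sin(theta)])  # unit vector
    print(round(theta, 2), round(g @ r, 3))

# The largest value occurs when r points along g, and equals ||g||:
print(g @ (g / np.linalg.norm(g)), np.linalg.norm(g))  # both ~4.123
```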

As $\nabla f$ points in a direction that increases function $f$, we take the negative of $\nabla f$ to decrease $f$ and thus find the minimum point of $f$. This is what we usually do in machine learning algorithms to minimize a cost function of multiple parameters. Just like the Newton-Raphson method, we iteratively update the input variables using the gradient vector until $\nabla f$ is 0 or very close to 0. This process is called gradient descent.

$$\mathbf{x}_{n+1}=\mathbf{x}_n-\alpha\nabla f(\mathbf{x}_n)$$

We use $\mathbf{x}_n$ to represent the vector of input variables at iteration $n$, and a constant $\alpha$ is introduced to control the pace of gradient descent. As we approach the minimum of function $f$, $\nabla f$ gets smaller and smaller. When $\nabla f$ decreases to 0 or an extremely small value, we are at the minimum point and $\mathbf{x}_{n+1}$ is no longer updated.

Although we have only demonstrated gradient descent in the two-dimensional case, it works practically the same way in higher dimensions. With the gradient descent method, we do not need to evaluate the function everywhere to find its minimum. Instead, we take a small step from our current estimate, with the step determined by the gradient at that point, and repeat the same process until we arrive at the minimum. This is a task computers can accomplish fairly efficiently, which is why gradient descent is applied universally in machine learning and deep learning to optimize highly complex functions.
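Below is a minimal sketch of this update rule in Python (the example function, the learning rate `alpha`, and the stopping threshold `tol` are illustrative choices):

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha=0.1, tol=1e-8, max_iter=10_000):
    """Iterate x <- x - alpha * grad_f(x) until the gradient is tiny."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:   # gradient ~ 0: at a minimum
            break
        x = x - alpha * g
    return x

# Example: f(x, y) = (x - 1)^2 + (y + 2)^2, minimized at (1, -2)
grad_f = lambda v: np.array([2 * (v[0] - 1), 2 * (v[1] + 2)])
print(gradient_descent(grad_f, x0=[5.0, 5.0]))  # approximately [1, -2]
```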

Lagrange Multipliers

We have learned to use gradient descent to find the minimum (or maximum) of a multivariate function. There is another type of optimization problem we encounter in practice called constrained optimization. It requires us to find the maximum or minimum of a function, subject to some constraints.

For example,

$$\max f(x,y)=x^2y\quad\text{s.t. }x^2+y^2=1$$

The constraint equation tells us that $x$ and $y$ must come from the set of values that satisfy $x^2+y^2=1$. A brute-force approach is therefore to simply enumerate all possible values of $x$ and $y$ and find which combination yields the maximum $f(x,y)$. However, this approach is often not practical, either because too many combinations exist or because it is computationally too expensive to evaluate them all.

Luckily, the mathematician Joseph-Louis Lagrange discovered that there is an implicit relationship between the gradient of the function and the gradient of the constraint: at a maximum or minimum point that satisfies the constraint, the two gradients must be parallel to each other. Therefore,

$$\nabla f(\mathbf{x})=\lambda\nabla g(\mathbf{x})$$

where $\nabla f(\mathbf{x})$ is the gradient vector of function $f$ at the maximum or minimum point and $\nabla g(\mathbf{x})$ is the gradient vector of the constraint function $g$ at the same point. The constant $\lambda$ is called the Lagrange multiplier.

In our earlier example, we can evaluate the gradient vectors of the function and the constraint to get

$$\begin{pmatrix}2xy\\x^2\end{pmatrix}=\lambda\begin{pmatrix}2x\\2y\end{pmatrix}$$

There are 3 variables here but only 2 equations. Don't forget, though, that we also have $x^2+y^2=1$ from the constraint. We can then write out a system of simultaneous equations relating our input variables $x$ and $y$ and the Lagrange multiplier $\lambda$.

$$\begin{aligned} 2xy&=2\lambda x\\ x^2&=2\lambda y\\ x^2+y^2&=1 \end{aligned}$$

Solving these simultaneous equations (and setting aside the solutions with $x=0$, where $f=0$), we obtain the following four pairs of $x$ and $y$ values.

$$\frac{1}{\sqrt{3}}\begin{pmatrix}\sqrt{2}\\1\end{pmatrix},\quad \frac{1}{\sqrt{3}}\begin{pmatrix}\sqrt{2}\\-1\end{pmatrix},\quad \frac{1}{\sqrt{3}}\begin{pmatrix}-\sqrt{2}\\1\end{pmatrix},\quad \frac{1}{\sqrt{3}}\begin{pmatrix}-\sqrt{2}\\-1\end{pmatrix}$$

It is easy to verify that two of these pairs give us the maximum value of function $f$:

$$f\left(\sqrt{\tfrac{2}{3}},\sqrt{\tfrac{1}{3}}\right)=f\left(-\sqrt{\tfrac{2}{3}},\sqrt{\tfrac{1}{3}}\right)=\frac{2}{3\sqrt{3}}$$

And the other two pairs give us the minimum value of function $f$:

$$f\left(\sqrt{\tfrac{2}{3}},-\sqrt{\tfrac{1}{3}}\right)=f\left(-\sqrt{\tfrac{2}{3}},-\sqrt{\tfrac{1}{3}}\right)=-\frac{2}{3\sqrt{3}}$$
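If you would like to check this by machine, here is a small sketch using SymPy to solve the system symbolically (SymPy's `solve` and `diff` are real APIs; the variable names and equation setup are our own):

```python
import sympy as sp

x, y, lam = sp.symbols('x y lambda', real=True)
f = x**2 * y
g = x**2 + y**2 - 1

# Lagrange conditions: grad f = lambda * grad g, plus the constraint
eqs = [sp.diff(f, x) - lam * sp.diff(g, x),
       sp.diff(f, y) - lam * sp.diff(g, y),
       g]
for s in sp.solve(eqs, [x, y, lam], dict=True):
    print(s, '  f =', f.subs(s))
# The largest f is 2/(3*sqrt(3)) and the smallest is -2/(3*sqrt(3))
```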

What about the Lagrange multiplier $\lambda$ in these equations? This constant $\lambda$ carries a meaning of its own: it shows the rate at which the maximum (or minimum) value of function $f$ increases (or decreases) as the constraint value changes.

The solution for $\lambda$ is

$$\lambda=y$$

At the maximum points of the function,

$$\lambda=\frac{1}{\sqrt{3}}$$

It means that for a unit increase in the constraint value, the maximum of function $f$ increases by approximately $\frac{1}{\sqrt{3}}$. If our constrained optimization problem becomes

$$\max f(x,y)=x^2y\quad\text{s.t. }x^2+y^2=2$$

where the constraint value is 2 instead of 1, the maximum value of function $f$ will increase from $\frac{2}{3\sqrt{3}}$ to approximately $\frac{5}{3\sqrt{3}}$ (a first-order estimate based on $\lambda$; the multiplier itself changes as the constraint moves).

Conversely, at the minimum points of the function,

$$\lambda=-\frac{1}{\sqrt{3}}$$

Therefore, for a unit increase in the constraint, the minimum of function $f$ decreases by approximately $\frac{1}{\sqrt{3}}$.
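We can sanity-check this interpretation numerically by solving the problem for two nearby constraint values and comparing the change in the maximum with $\lambda$ (a sketch using SciPy's `minimize`, which accepts equality constraints; the helper `max_f`, the starting point, and the step `d` are our own illustrative choices):

```python
from scipy.optimize import minimize

def max_f(c):
    """Maximize x^2*y subject to x^2 + y^2 = c (minimize the negative)."""
    res = minimize(lambda v: -(v[0]**2 * v[1]), x0=[0.8, 0.5],
                   constraints={'type': 'eq',
                                'fun': lambda v: v[0]**2 + v[1]**2 - c})
    return -res.fun

# Change in the maximum per unit change in the constraint ~ lambda
d = 1e-3
print((max_f(1 + d) - max_f(1)) / d)  # roughly 0.577 = 1/sqrt(3)
```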

This result is very important. With the Lagrange multiplier, we not only know the values of the variables that maximize (or minimize) a function subject to a constraint; we also know how much the constraint itself influences the maximum (or minimum) attainable value of the function. One typical application is solving for the maximum revenue of a project subject to a resource constraint: we can evaluate how much the maximum revenue would increase if we could lift the resource constraint a bit.

In the last three articles, we have seen how multivariate calculus is applied in different optimization algorithms. Starting with the Newton-Raphson method, we repeatedly update a point using the function and its derivative until we reach a point where the function value is very close to zero. We then extended this idea to multivariate problems and used the gradient descent method to find the maximum or minimum point of a function. Moreover, when our optimization problem involves constraints, we can use the Lagrange multiplier method to solve it. I hope you are starting to see the beauty of multivariate calculus through these optimization examples.

This also concludes our journey introducing the mathematics basics of multivariate calculus. It is such an important tool that you will encounter it very often in future studies of machine learning and deep learning. I hope you have enjoyed reading so far and have gained a good understanding of multivariate calculus and its applications.


(Inspired by Mathematics for Machine Learning lecture series from Imperial College London)
