Mathematics Basics - Multivariate Calculus (Partial Derivatives)

Partial Derivatives

Building on what we learned previously about univariate calculus, we can extend the concept of gradient to the multivariate case. For example, for a function $f(x,y,z)=\sin(x)e^{yz^2}$ we want to understand the influence of each input variable $x$, $y$ and $z$ on this function. This requires us to differentiate the function $f$ with respect to each input variable separately. The results obtained are called partial derivatives of the function $f$, because only one variable is differentiated at a time.

$$\begin{aligned} \frac{\partial f}{\partial x}&=\cos(x)e^{yz^2}\\ \frac{\partial f}{\partial y}&=z^2\sin(x)e^{yz^2}\\ \frac{\partial f}{\partial z}&=2yz\sin(x)e^{yz^2} \end{aligned}$$

Notice two things here. First, the derivative symbol changes from $\frac{df}{dx}$ to $\frac{\partial f}{\partial x}$ to signify that we only partially differentiate the function. Second, when we partially differentiate with respect to one variable, all the other variables are treated as constants. Therefore, when performing $\frac{\partial f}{\partial x}$, only the $\sin(x)$ term is differentiated and $e^{yz^2}$ remains as a constant factor.
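If you want to check these results programmatically, here is a minimal sketch (assuming SymPy is available) that reproduces the three partial derivatives:

```python
import sympy as sp

x, y, z = sp.symbols('x y z')
f = sp.sin(x) * sp.exp(y * z**2)

# Differentiate with respect to one variable at a time;
# SymPy automatically treats the other variables as constants.
print(sp.diff(f, x))  # cos(x) * exp(y*z**2)
print(sp.diff(f, y))  # z**2 * sin(x) * exp(y*z**2)
print(sp.diff(f, z))  # 2*y*z * sin(x) * exp(y*z**2)
```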

Let’s take one step further. Suppose our input variables $x$, $y$ and $z$ are themselves expressed in terms of another variable $t$ as follows.

$$\begin{aligned} x&=t-1\\ y&=t^2\\ z&=\frac{1}{t} \end{aligned}$$

We can obtain the derivative of the function $f$ with respect to $t$ by applying the chain rule to each of its partial derivatives with respect to $x$, $y$ and $z$.

$$\begin{aligned} \frac{df(x,y,z)}{dt}&=\frac{\partial f}{\partial x}\cdot\frac{dx}{dt}+\frac{\partial f}{\partial y}\cdot\frac{dy}{dt}+\frac{\partial f}{\partial z}\cdot\frac{dz}{dt}\\ &=\cos(x)e^{yz^2}\cdot(1)+z^2\sin(x)e^{yz^2}\cdot(2t)+2yz\sin(x)e^{yz^2}\cdot\left(-\frac{1}{t^2}\right)\\ &=e^{yz^2}\left[\cos(x)+2tz^2\sin(x)-\frac{2yz}{t^2}\sin(x)\right] \end{aligned}$$

Next we substitute the $x$, $y$ and $z$ terms with their respective expressions in $t$. The result is the derivative of the function $f$ with respect to $t$ only. This process of building up a derivative through intermediate variables is called the total derivative.

$$\begin{aligned} \frac{df(x,y,z)}{dt}&=e^{t^2\cdot\frac{1}{t^2}}\left[\cos(t-1)+2t\cdot\frac{1}{t^2}\sin(t-1)-\frac{2t^2\cdot\frac{1}{t}}{t^2}\sin(t-1)\right]\\ &=e\left[\cos(t-1)+\frac{2}{t}\sin(t-1)-\frac{2}{t}\sin(t-1)\right]\\ &=e\cos(t-1) \end{aligned}$$

We can verify this result by substituting the $t$ expressions into $f(x,y,z)$ from the beginning and differentiating it with respect to $t$ directly.

$$\begin{aligned} \frac{df(x,y,z)}{dt}&=\frac{d}{dt}\,\sin(t-1)\,e^{t^2\cdot\frac{1}{t^2}}\\ &=\frac{d}{dt}\,\sin(t-1)\,e\\ &=e\cos(t-1) \end{aligned}$$

That is exactly the same as our total derivative approach. You might be wondering why we take a detour by computing partial derivatives first and then substituting. In real-world applications we can seldom find a nice analytical expression that explicitly relates the function $f$ to its input variable $t$, or such an expression might be too complicated to differentiate. Therefore, we break the differentiation process into smaller, manageable pieces and join the results back together afterwards.
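As a quick sanity check, a short sketch (again assuming SymPy is available) confirms symbolically that the chain-rule route and the substitute-then-differentiate route agree:

```python
import sympy as sp

x, y, z, t = sp.symbols('x y z t')
f = sp.sin(x) * sp.exp(y * z**2)
subs = {x: t - 1, y: t**2, z: 1 / t}

# Route 1: total derivative via the chain rule through x, y and z.
total = sum(sp.diff(f, v) * sp.diff(expr, t) for v, expr in subs.items())
total = sp.simplify(total.subs(subs))

# Route 2: substitute first, then differentiate directly with respect to t.
direct = sp.simplify(sp.diff(f.subs(subs), t))

print(total)                             # E*cos(t - 1)
print(sp.simplify(total - direct) == 0)  # True
```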

Jacobian

We can express the partial derivatives of a function in vector form. This vector representation is called the Jacobian and is denoted by the letter $J$. For example, suppose we have a function $f(x,y,z)=x^2y+z^3$ with partial derivatives

$$\begin{aligned} \frac{\partial f}{\partial x}&=2xy\\ \frac{\partial f}{\partial y}&=x^2\\ \frac{\partial f}{\partial z}&=3z^2 \end{aligned}$$

Then we write the Jacobian of the function $f$ as

$$J(x,y,z)=[2xy,\;x^2,\;3z^2]$$

One property of the Jacobian is that it points in the direction of steepest slope at any point where the function is differentiable. For example, at the point $(1,1,1)$, our previously calculated Jacobian evaluates to

$$J(1,1,1)=[2,\;1,\;3]$$

That is equivalent to saying that at the point $(1,1,1)$ the steepest slope points in the direction $[2,1,3]$. Furthermore, the steeper the slope, the greater the magnitude of the Jacobian. Therefore, we can compare the slope at one point with the slope at another directly through the magnitudes of their corresponding Jacobians, as in the sketch below. I will save the proof that the Jacobian points along the steepest slope for a later discussion; for now, please just accept it as a key property of the Jacobian.
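Here is a minimal sketch of that comparison (assuming SymPy; the second point $(2,0,1)$ is an arbitrary choice for illustration):

```python
import sympy as sp

x, y, z = sp.symbols('x y z')
f = x**2 * y + z**3

# Jacobian (row vector of partial derivatives) of the scalar function f.
J = sp.Matrix([f]).jacobian([x, y, z])
print(J)  # [2*x*y, x**2, 3*z**2]

# Evaluate at two points and compare slope magnitudes via the vector norm.
J_a = J.subs({x: 1, y: 1, z: 1})
J_b = J.subs({x: 2, y: 0, z: 1})
print(J_a, J_a.norm())  # [2, 1, 3], sqrt(14)
print(J_b, J_b.norm())  # [0, 4, 3], 5 -- a steeper point than (1, 1, 1)
```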

In addition, since the Jacobian tells us the direction of steepest slope, it is also a good indicator of special points of a function. If the Jacobian evaluated at a point is the zero vector, the function is locally flat there: such a point must be a maximum, a minimum or a saddle point. As a result, we can find all points satisfying $J=0$ as the special (stationary) points of a function.
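As a small illustration, here is a sketch that finds the stationary points of a hypothetical two-variable function $g(x,y)=x^3-3x+y^2$ (not one from the text) by solving $J=0$:

```python
import sympy as sp

x, y = sp.symbols('x y')
g = x**3 - 3*x + y**2  # hypothetical example function

# Stationary points: set every partial derivative to zero and solve the system.
J = [sp.diff(g, x), sp.diff(g, y)]  # [3*x**2 - 3, 2*y]
stationary = sp.solve(J, [x, y], dict=True)
print(stationary)  # [{x: -1, y: 0}, {x: 1, y: 0}]
```

We will come back to these two points when we discuss the Hessian below.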

The Jacobian has some other interesting properties, too. For instance, when it is applied to multiple functions of multiple variables, the Jacobian takes a matrix form. The Jacobian matrix is a very important tool for studying nonlinear transformations. We have learned from linear algebra that a transformation matrix maps a vector from one vector space to another. However, this requires the transformation itself to be linear. What do we do with a transformation like the one below?

$$\begin{aligned} u(x,y)&=x+\sin(y)\\ v(x,y)&=y+\sin(x) \end{aligned}$$

Although the transformation defined by $u$ and $v$ is not linear, it turns out that a small region around each point still behaves very close to linearly after the transformation. This is called local linearity. The Jacobian matrix describes the linear transformation that approximates the map at each point where the function is differentiable. In addition, the determinant of the Jacobian matrix (if it is a square matrix) represents the local change in size (area or volume) around any given point after the transformation. This property is used when evaluating multiple integrals by a change of variables.
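A minimal sketch of the Jacobian matrix and its determinant for this transformation (assuming SymPy):

```python
import sympy as sp

x, y = sp.symbols('x y')
u = x + sp.sin(y)
v = y + sp.sin(x)

# Jacobian matrix of the transformation (u, v) with respect to (x, y).
J = sp.Matrix([u, v]).jacobian([x, y])
print(J)  # [[1, cos(y)], [cos(x), 1]]

# Its determinant gives the local scaling factor of area under the map.
print(sp.simplify(J.det()))        # 1 - cos(x)*cos(y)
print(J.det().subs({x: 0, y: 0}))  # 0: area collapses locally at the origin
```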

We are not going to walk through the details of the Jacobian matrix and its applications here, as they are not relevant to our later topics. Nonetheless, for readers who want to know more about the Jacobian, I highly recommend the introductory courses offered by Khan Academy and MIT. Both have an excellent discussion of this topic.

Applying the Jacobian in Reality

There is one practical concern about using the Jacobian. If we have found a point with Jacobian $J=0$, how do we tell whether it is a maximum or a minimum point? One way is to also evaluate points around it to see whether the neighbours all lie above or below the point of interest. However, this method is not very robust. Instead, we can use a simple extension of the Jacobian called the Hessian. We know that every element of the Jacobian vector is obtained by partially differentiating the function $f$ with respect to one of its input variables $x$, $y$, $z$, etc. The Hessian is the second-order derivative of $f$: it differentiates each element of the Jacobian vector again with respect to each input variable.

Let’s consider our previous example of $f(x,y,z)=x^2y+z^3$. Given that it has the Jacobian vector $J=(2xy,\,x^2,\,3z^2)$, we can calculate its Hessian, $H$, as

$$H=\begin{pmatrix}\frac{\partial^2f}{\partial x^2}&\frac{\partial^2f}{\partial x\partial y}&\frac{\partial^2f}{\partial x\partial z}\\ \frac{\partial^2f}{\partial y\partial x}&\frac{\partial^2f}{\partial y^2}&\frac{\partial^2f}{\partial y\partial z}\\ \frac{\partial^2f}{\partial z\partial x}&\frac{\partial^2f}{\partial z\partial y}&\frac{\partial^2f}{\partial z^2}\end{pmatrix}=\begin{pmatrix}2y&2x&0\\ 2x&0&0\\ 0&0&6z\end{pmatrix}$$

The Hessian is an $n \times n$ square matrix, where $n$ is the number of variables of the function $f$. It is also symmetric across the leading diagonal whenever the second partial derivatives of $f$ are continuous. We can use the Hessian to classify the maximum, minimum and saddle points of a function after we have obtained all the points with Jacobian $J=0$ (a small worked sketch follows the list below).

  • If the determinant of the Hessian at a point is positive, this point is either a maximum or a minimum point.
  • If the determinant of the Hessian at a point is positive and the first term (i.e. top-left corner) of the Hessian is positive, this point is a minimum point.
  • If the determinant of the Hessian at a point is positive and the first term (i.e. top-left corner) of the Hessian is negative, this point is a maximum point.
  • If the determinant of the Hessian at a point is negative, this point is a saddle point.

Strictly speaking, these determinant rules apply to functions of two variables; with more variables we instead check whether the Hessian is positive definite (minimum), negative definite (maximum) or indefinite (saddle point), for example via its eigenvalues.
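Here is the promised sketch (assuming SymPy): it reproduces the Hessian of the three-variable example above, and then classifies the two stationary points of the hypothetical $g(x,y)=x^3-3x+y^2$ found earlier using the two-variable determinant test:

```python
import sympy as sp

x, y, z = sp.symbols('x y z')
f = x**2 * y + z**3

# Hessian of the three-variable example from the text.
print(sp.hessian(f, (x, y, z)))  # [[2*y, 2*x, 0], [2*x, 0, 0], [0, 0, 6*z]]

# Classify the stationary points of the hypothetical g(x, y) = x**3 - 3*x + y**2
# using the two-variable determinant test described in the list above.
g = x**3 - 3*x + y**2
H = sp.hessian(g, (x, y))  # [[6*x, 0], [0, 2]]
for point in ({x: 1, y: 0}, {x: -1, y: 0}):
    Hp = H.subs(point)
    det, top_left = Hp.det(), Hp[0, 0]
    kind = 'saddle' if det < 0 else ('minimum' if top_left > 0 else 'maximum')
    print(point, kind)  # {x: 1, y: 0} minimum, {x: -1, y: 0} saddle
```

(The test is inconclusive when the determinant is zero; the points chosen here avoid that case.)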

We typically use the Jacobian (and the Hessian) to help us solve optimization problems, because it gives us the combination of input variables that yields a maximum or a minimum value. However, this requires a function relating the input and output variables to exist in the first place. In reality, this could be the most challenging step. For example, in many optimization problems the dimensionality can easily go up to hundreds or thousands (think of the neurons in a neural network). It is not possible to write out an explicit expression for such a large number of variables. Moreover, even if we are just solving a 2-dimensional optimization problem, it might happen that no clear analytical expression exists or that evaluating the function at each point is computationally too expensive. So it is still not viable to write out the function and subsequently evaluate the Jacobian at every single point.

What can we do if there is no function for us to optimize? One approach we can adopt is called numerical methods. Recall that we derived the gradient at a point by approximating rise over run across a finite interval. Given a number of data points, we can calculate the gradient between any pair of neighbouring points. These are our approximations of the function’s derivatives at different points when a well-defined analytical expression does not exist. Starting from an initial point, we take a step in the direction given by the gradient. In each subsequent step, we recalculate the gradient at the current point and move in its direction until the gradient is zero. This is how we arrive at a maximum or minimum point without an explicit function. In the multi-dimensional case, we take partial derivatives with respect to each of the input variables and take a step in each dimension until all partial derivatives evaluate to zero.

One practical consideration is how big a step we should take. Clearly, if we take a large step each time, we might overshoot and miss the optimal point. But there are problems with taking too small a step, too: not only will it take a long time to reach the optimal point, but we are also constrained by computational precision. Computers cannot store infinitely precise numbers, so if a step change is too small, our computer might not be able to detect it at all. This happens a lot in actual machine learning practice.
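To make the idea concrete, here is a minimal numerical sketch (assuming NumPy, a hypothetical objective function, and that we are looking for a minimum, so we step along the negative gradient; for a maximum we would step along the positive gradient). The `learning_rate` parameter is the step size discussed above:

```python
import numpy as np

def numerical_gradient(func, point, h=1e-6):
    """Approximate each partial derivative by a central difference (rise over run)."""
    point = np.asarray(point, dtype=float)
    grad = np.zeros_like(point)
    for i in range(point.size):
        step = np.zeros_like(point)
        step[i] = h
        grad[i] = (func(point + step) - func(point - step)) / (2 * h)
    return grad

def gradient_descent(func, start, learning_rate=0.1, tolerance=1e-6, max_steps=10_000):
    """Step downhill until all partial derivatives are (approximately) zero."""
    point = np.asarray(start, dtype=float)
    for _ in range(max_steps):
        grad = numerical_gradient(func, point)
        if np.linalg.norm(grad) < tolerance:   # gradient ~ 0: stationary point reached
            break
        point = point - learning_rate * grad   # the step size controls overshooting
    return point

# Hypothetical smooth objective with its minimum at (1, -2); no analytical
# derivative is used anywhere, only function evaluations.
f = lambda p: (p[0] - 1) ** 2 + (p[1] + 2) ** 2
print(gradient_descent(f, start=[5.0, 5.0]))  # approximately [ 1. -2.]
```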

Other problems we might encounter in practice include discontinuous functions and noisy data. If we blindly follow the steepest-gradient path, we could hit a sudden stop where no value exists. With noise in the data, the Jacobian evaluated at those points might lead us in the wrong direction. These practical issues remind us to treat every gradient step with caution and to perform validation where possible.

This concludes our discussion on partial derivatives and the Jacobian. We now know that partial derivatives are useful for finding the maximum and minimum points of a function, and that they can be conveniently collected into a Jacobian vector. There are, however, some challenges to overcome in actual applications, because we are not always working with a nice analytical expression. We accept this fact and make use of whatever data points are available to keep moving towards the optimal point.


(Inspired by Mathematics for Machine Learning lecture series from Imperial College London)
