Matrix Calculus in Deep Learning (2): Trace and Matrix Differential

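  This post covers differentiating scalar-valued functions of a vector or matrix. The basic procedure is:

1. Wrap the scalar function in a trace (a scalar equals its own trace);
2. Use the properties of the trace and of matrix differentials to simplify until the expression has the form $df = \operatorname{tr}\left(\left(\frac{\partial f}{\partial x}\right)^T dx\right)$;
3. Read off $\frac{\partial f}{\partial x}$ from that form.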
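As a small worked instance of these steps, consider the bilinear form $f(X) = a^T X b$. The vectors $a$, $b$ and this choice of $f$ are illustrative assumptions, not taken from the original post. The sketch uses only three facts: a scalar equals its own trace, the trace is cyclic ($\operatorname{tr}(ABC) = \operatorname{tr}(CAB)$), and the differential is linear.

```latex
\begin{aligned}
df &= a^T\,dX\,b \\
   &= \operatorname{tr}\!\left(a^T\,dX\,b\right)
      && \text{(a scalar equals its trace)} \\
   &= \operatorname{tr}\!\left(b\,a^T\,dX\right)
      && \text{(cyclic property)} \\
   &= \operatorname{tr}\!\left(\left(a\,b^T\right)^T dX\right)
\end{aligned}
\qquad\Longrightarrow\qquad
\frac{\partial f}{\partial X} = a\,b^T
```

Entry by entry this agrees with the direct computation $\partial f / \partial X_{ij} = a_i b_j$.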

  Therefore, in deep learning, if the loss is the squared L2 norm, i.e. $f = \text{loss} = \|a^N - y\|_2^2$, then $\frac{\partial f}{\partial a^N} = 2(a^N - y)$.
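Writing $a$ for the activation vector in the loss, the trace procedure gives $df = d\big((a - y)^T (a - y)\big) = 2(a - y)^T da = \operatorname{tr}\big((2(a - y))^T da\big)$, so the gradient is $2(a - y)$. The following is a minimal pure-Python sketch that compares this formula with a central finite-difference approximation. The function names `l2_loss`, `analytic_grad`, `numeric_grad` and the sample values are hypothetical, chosen only for illustration.

```python
def l2_loss(a, y):
    """Squared L2 norm ||a - y||_2^2 for two equal-length sequences."""
    return sum((ai - yi) ** 2 for ai, yi in zip(a, y))


def analytic_grad(a, y):
    """Gradient read off from df = tr((2(a - y))^T da)."""
    return [2.0 * (ai - yi) for ai, yi in zip(a, y)]


def numeric_grad(f, a, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at point a."""
    grad = []
    for i in range(len(a)):
        plus = list(a)
        minus = list(a)
        plus[i] += eps
        minus[i] -= eps
        grad.append((f(plus) - f(minus)) / (2.0 * eps))
    return grad


# Illustrative values only (not from the original post).
a = [0.5, -1.0, 2.0]
y = [1.0, 0.0, 1.5]

g_exact = analytic_grad(a, y)
g_approx = numeric_grad(lambda v: l2_loss(v, y), a)
max_err = max(abs(e - n) for e, n in zip(g_exact, g_approx))

print("analytic:", g_exact)
print("max abs difference from finite differences:", max_err)
```

Because the loss is quadratic, the central difference should match the analytic gradient up to floating-point rounding, which makes this a convenient sanity check for hand-derived gradients.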
Reference material follows:

Contents

Preface

Part One - Matrices

1 Basic properties of vectors and matrices
    1 Introduction
    2 Sets
    3 Matrices: addition and multiplication
    4 The transpose of a matrix
    5 Square matrices
    6 Linear forms and quadratic forms
    7 The rank of a matrix
    8 The inverse
    9 The determinant
    10 The trace
    11 Partitioned matrices
    12 Complex matrices
    13 Eigenvalues and eigenvectors
    14 Schur's decomposition theorem
    15 The Jordan decomposition
    16 The singular-value decomposition
    17 Further results concerning eigenvalues
    18 Positive (semi)definite matrices
    19 Three further results for positive definite matrices
    20 A useful result
    Miscellaneous exercises
    Bibliographical notes

2 Kronecker products, the vec operator and the Moore-Penrose inverse
    1 Introduction
    2 The Kronecker product
    3 Eigenvalues of a Kronecker product
    4 The vec operator
    5 The Moore-Penrose (MP) inverse
    6 Existence and uniqueness of the MP inverse
    7 Some properties of the MP inverse
    8 Further properties
    9 The solution of linear equation systems
    Miscellaneous exercises
    Bibliographical notes

3 Miscellaneous matrix results
    1 Introduction
    2 The adjoint matrix
    3 Proof of Theorem 1
    4 Bordered determinants
    5 The matrix equation $AX = 0$
    6 The Hadamard product
    7 The commutation matrix $K_{mn}$
    8 The duplication matrix $D_n$
    9 Relationship between $D_{n+1}$ and $D_n$, I
    10 Relationship between $D_{n+1}$ and $D_n$, II
    11 Conditions for a quadratic form to be positive (negative) subject to linear constraints
    12 Necessary and sufficient conditions for $r(A : B) = r(A) + r(B)$
    13 The bordered Gramian matrix
    14 The equations $X_1 A + X_2 B' = G_1$, $X_1 B = G_2$
    Miscellaneous exercises
    Bibliographical notes

Part Two - Differentials: the theory

4 Mathematical preliminaries
    1 Introduction
    2 Interior points and accumulation points
    3 Open and closed sets
    4 The Bolzano-Weierstrass theorem
    5 Functions
    6 The limit of a function
    7 Continuous functions and compactness
    8 Convex sets
    9 Convex and concave functions
    Bibliographical notes

5 Differentials and differentiability
    1 Introduction
    2 Continuity
    3 Differentiability and linear approximation
    4 The differential of a vector function
    5 Uniqueness of the differential
    6 Continuity of differentiable functions
    7 Partial derivatives
    8 The first identification theorem
    9 Existence of the differential, I
    10 Existence of the differential, II
    11 Continuous differentiability
    12 The chain rule
    13 Cauchy invariance
    14 The mean-value theorem for real-valued functions
    15 Matrix functions
    16 Some remarks on notation
    Miscellaneous exercises
    Bibliographical notes

6 The second differential
    1 Introduction
    2 Second-order partial derivatives
    3 The Hessian matrix
    4 Twice differentiability and second-order approximation, I
    5 Definition of twice differentiability
    6 The second differential
    7 (Column) symmetry of the Hessian matrix
    8 The second identification theorem
    9 Twice differentiability and second-order approximation, II
    10 Chain rule for Hessian matrices
    11 The analogue for second differentials
    12 Taylor's theorem for real-valued functions
    13 Higher-order differentials
    14 Matrix functions
    Bibliographical notes

7 Static optimization
    1 Introduction
    2 Unconstrained optimization
    3 The existence of absolute extrema
    4 Necessary conditions for a local minimum
    5 Sufficient conditions for a local minimum: first-derivative test
    6 Sufficient conditions for a local minimum: second-derivative test
    7 Characterization of differentiable convex functions
    8 Characterization of twice differentiable convex functions
    9 Sufficient conditions for an absolute minimum
    10 Monotonic transformations
    11 Optimization subject to constraints
    12 Necessary conditions for a local minimum under constraints
    13 Sufficient conditions for a local minimum under constraints
    14 Sufficient conditions for an absolute minimum under constraints
    15 A note on constraints in matrix form
    16 Economic interpretation of Lagrange multipliers
    Appendix: the implicit function theorem
    Bibliographical notes

Part Three - Differentials: the practice

8 Some important differentials
    1 Introduction
    2 Fundamental rules of differential calculus
    3 The differential of a determinant
    4 The differential of an inverse
    5 Differential of the Moore-Penrose inverse
    6 The differential of the adjoint matrix
    7 On differentiating eigenvalues and eigenvectors
    8 The differential of eigenvalues and eigenvectors: symmetric case
    9 The differential of eigenvalues and eigenvectors: complex case
    10 Two alternative expressions for $d\lambda$
    11 Second differential of the eigenvalue function
    12 Multiple eigenvalues
    Miscellaneous exercises
    Bibliographical notes

9 First-order differentials and Jacobian matrices
    1 Introduction
    2 Classification
    3 Bad notation
    4 Good notation
    5 Identification of Jacobian matrices
    6 The first identification table
    7 Partitioning of the derivative
    8 Scalar functions of a vector
    9 Scalar functions of a matrix, I: trace
    10 Scalar functions of a matrix, II: determinant
    11 Scalar functions of a matrix, III: eigenvalue
    12 Two examples of vector functions
    13 Matrix functions
    14 Kronecker products
    15 Some other problems
    Bibliographical notes

10 Second-order differentials and Hessian matrices
    1 Introduction
    2 The Hessian matrix of a matrix function
    3 Identification of Hessian matrices
    4 The second identification table
    5 An explicit formula for the Hessian matrix
    6 Scalar functions
    7 Vector functions
    8 Matrix functions, I
    9 Matrix functions, II

Part Four - Inequalities

11 Inequalities
    1 Introduction
    2 The Cauchy-Schwarz inequality
    3 Matrix analogues of the Cauchy-Schwarz inequality
    4 The theorem of the arithmetic and geometric means
    5 The Rayleigh quotient
    6 Concavity of $\lambda_1$, convexity of $\lambda_n$
    7 Variational description of eigenvalues
    8 Fischer's min-max theorem
    9 Monotonicity of the eigenvalues
    10 The Poincaré separation theorem
    11 Two corollaries of Poincaré's theorem
    12 Further consequences of the Poincaré theorem
    13 Multiplicative version
    14 The maximum of a bilinear form
    15 Hadamard's inequality
    16 An interlude: Karamata's inequality
    17 Karamata's inequality applied to eigenvalues
    18 An inequality concerning positive semidefinite matrices
    19 A representation theorem for $(\sum a_i^p)^{1/p}$
    20 A representation theorem for $(\operatorname{tr} A^p)^{1/p}$
    21 Hölder's inequality
    22 Concavity of $\log|A|$
    23 Minkowski's inequality
    24 Quasilinear representation of $|A|^{1/n}$
    25 Minkowski's determinant theorem
    26 Weighted means of order $p$
    27 Schlömilch's inequality
    28 Curvature properties of $M_p(x, a)$
    29 Least squares
    30 Generalized least squares
    31 Restricted least squares
    32 Restricted least squares: matrix version
    Miscellaneous exercises
    Bibliographical notes

Part Five - The linear model

12 Statistical preliminaries
    1 Introduction
    2 The cumulative distribution function
    3 The joint density function
    4 Expectations
    5 Variance and covariance
    6 Independence of two random variables
    7 Independence of $n$ random variables
    8 Sampling
    9 The one-dimensional normal distribution
    10 The multivariate normal distribution
    11 Estimation
    Miscellaneous exercises
    Bibliographical notes

13 The linear regression model
    1 Introduction
    2 Affine minimum-trace unbiased estimation
    3 The Gauss-Markov theorem
    4 The method of least squares
    5 Aitken's theorem
    6 Multicollinearity
    7 Estimable functions
    8 Linear constraints: the case $M(R') \subset M(X')$
    9 Linear constraints: the general case
    10 Linear constraints: the case $M(R') \cap M(X') = \{0\}$
    11 A singular variance matrix: the case $M(X) \subset M(V)$
    12 A singular variance matrix: the case $r(X'V^{+}X) = r(X)$
    13 A singular variance matrix: the general case, I
    14 Explicit and implicit linear constraints
    15 The general linear model, I
    16 A singular variance matrix: the general case, II
    17 The general linear model, II
    18 Generalized least squares
    19 Restricted least squares
    Miscellaneous exercises
    Bibliographical notes

14 Further topics in the linear model
    1 Introduction
    2 Best quadratic unbiased estimation of $\sigma^2$
    3 The best quadratic and positive unbiased estimator of $\sigma^2$
    4 The best quadratic unbiased estimator of $\sigma^2$
    5 Best quadratic invariant estimation of $\sigma^2$
    6 The best quadratic and positive invariant estimator of $\sigma^2$
    7 The best quadratic invariant estimator of $\sigma^2$
    8 Best quadratic unbiased estimation: multivariate normal case
    9 Bounds for the bias of the least squares estimator of $\sigma^2$, I
    10 Bounds for the bias of the least squares estimator of $\sigma^2$, II
    11 The prediction of disturbances
    12 Best linear unbiased predictors with scalar variance matrix
    13 Best linear unbiased predictors with fixed variance matrix, I
    14 Best linear unbiased predictors with fixed variance matrix, II
    15 Local sensitivity of the posterior mean
    16 Local sensitivity of the posterior precision
    Bibliographical notes

Part Six - Applications to maximum likelihood estimation

15 Maximum likelihood estimation
    1 Introduction
    2 The method of maximum likelihood (ML)
    3 ML estimation of the multivariate normal distribution
    4 Symmetry: implicit versus explicit treatment
    5 The treatment of positive definiteness
    6 The information matrix
    7 ML estimation of the multivariate normal distribution: distinct means
    8 The multivariate linear regression model
    9 The errors-in-variables model
    10 The non-linear regression model with normal errors
    11 Special case: functional independence of mean- and variance parameters
    12 Generalization of Theorem 6
    Miscellaneous exercises
    Bibliographical notes

16 Simultaneous equations
    1 Introduction
    2 The simultaneous equations model
    3 The identification problem
    4 Identification with linear constraints on $B$ and $\Gamma$ only
    5 Identification with linear constraints on $B$, $\Gamma$ and $\Sigma$
    6 Non-linear constraints
    7 Full-information maximum likelihood (FIML): the information matrix (general case)
    8 Full-information maximum likelihood (FIML): the asymptotic variance matrix (special case)
    9 Limited-information maximum likelihood (LIML): the first-order conditions
    10 Limited-information maximum likelihood (LIML): the information matrix
    11 Limited-information maximum likelihood (LIML): the asymptotic variance matrix
    Bibliographical notes