Calculus Pdf 168787

Partial capture of text on file.
                                                                              Vector Differential Calculus in Statistics
                  M.P.WAND
                                                                                                    Section4illustratestheiruseinstatistics.Section5describesan
                                                                                                    extension of the methodology to matrix arguments.
                  Many statistical operations beneﬁt from differential calculus.
                  Examplesinclude optimization of likelihood functions and cal-                                2.   SIMPLEILLUSTRATIVEEXAMPLE
                  culation of information matrices. For multiparameter models                           Let (x ,y), 1 ≤ i ≤ n, be a set of measurements on two
                  differential calculus suited to vector argument functions is usu-                             i   i
                                                                                                    variables x and y, and consider the problem of ﬁtting a line y =
                  ally the most efﬁcient means of performing the required calcu-                    β +β xtothedata. The homoscedastic Gaussian regression
                  lations. We present a primer on vector differential calculus and                    0      1
                                                                                                    model is
                  demonstrate its application to statistics through several worked                                                                           2
                  examples.                                                                                           y=Xβ+ε,                 ε ∼ N(0,σ I),
                  KEY WORDS: Best linear prediction; Generalized linear                             where
                                                                                                                 y                  1 x                               
                  model; Generalized linear mixed model; Information matrix;                                         1                        1
                                                                                                                 .                   .    .                        β
                  Matrix differential calculus; Maximum likelihood estimation;                            y= . , X= . .  and β=                                      0   .
                                                                                                                    .                    .   .                         β
                  Penalized quasi-likelihood; Score equation.                                                      y                     1 x                             1
                                                                                                                     n                        n
                                                                                                    For σ2 known, the log-likelihood for estimation of β is (up to
                                                                                                    an additive constant)
                                          1.   INTRODUCTION                                                                        1
                                                                                                                                                    ⊤
                     Matrix notation is an indispensable tool in statistics. In par-                                ℓ(β)=2σ2(yXβ) (yXβ).                                     (1)
                  ticular, the storage of model parameters in a vector allows for                       The scalar differential calculus approach to maximization of
                  economical expression of models that relate the parameters to                     ℓ(β) involves rewriting (1) as
                  the data. Functions used for estimation of the parameter vector,
                  suchaslikelihoodfunctionsandposteriordensities,arealsocon-                                                             n
                                                                                                                                   1 	                            2
                  veniently expressed using matrix notation. But when it comes                                      ℓ(β)=                  (y β β x )
                                                                                                                                 2σ2           i     0      1 i
                  to optimization of these functions through differential calculus                                                     i=1
                  the statistician usually reverts to scalar subscript notation and                 and then obtaining
                  applies ordinary univariate calculus to the entries of the param-                                                   

                                                                                                                   β                     n
                  eter vector. Subsequent calculus steps, such as those required                                ∂ℓ(ββ)  =(1/σ2)              (y β β x ),
                                                                                                                 ∂β                      i=1    i      0      1 i
                                                                                                                    0                 
                                         (2)
                                                                                                                   β                     n
                  to obtain the information matrix, are also usually done with                                  ∂ℓ(ββ)  =(1/σ2)               x (y β β x ).
                                                                                                                 ∂β                      i=1    i   i      0      1 i
                  anelement-wiseapproach—sometimesfollowedbyconversion                                              1
                  back to matrix notation.                                                          Settingthistozeroweobtaintheuniquestationarypointofℓ(β)
                     This article demonstrates that it is often possible to perform                 occurring at
                  theentireoperationusingmatrixalgebra.Apartfrombeingmore                                                      
                                                                                                                β = yβ x
                  streamlined, it saves conversion between matrix notation and                                    0              1
                                                                                                                           n                         n
                  subscript notation. The key is a differential calculus suited to                                       	                            	              2
                                                                                                                β =           (x x)(y y)                 (x x) .
                  vector argument and scalar-valued functions.                                                    1              i          i                 i
                     The book by Magnus and Neudecker (1988) describes an                                                 i=1                          i=1
                  elegant approach to differential calculus in statistics for gen-                  Asiswell-known, (2) is algebraically equivalent to
                  eral matrix-argument functions. However, the simpliﬁcations
                  that arise for scalar-valued vector-argument functions are some-                                                      ⊤     1 ⊤
                                                                                                                              β=(X X) X y.                                      (3)
                  what opaque in that reference. This article aims to redress
                  this deﬁciency, while otherwise using the tools of Magnus and                     TheHessian matrix of second-order partial derivatives is
                  Neudecker. Other contributions to matrix differential calcu-                            2 β           2  β    
                                                                                                             ∂ ℓ(ββ)    ∂ ℓ(ββ)
                                                                                                                 2     ∂β ∂β
                  lus may be found in Dwyer (1967), Searle (1982), Basilevsky                                 ∂β          0   1
                                                                                                          2 0           2       
                                                                                                                 β          β
                  (1983), and Harville (1997) but each of these references have                              ∂ ℓ(ββ)    ∂ ℓ(ββ)
                                                                                                            ∂β ∂β           2
                                                                                                               0   1     ∂β
                                                                                                                            1
                  similar shortcomings for vector-argument functions.                                                                     
           
                                                                                                                                   n           n   x
                     Section 2 begins with a simple illustrative example. Section                           =(1/σ2) 
                     
i=1 i         =(1/σ2)X⊤X
                                                                                                                                  n           n      2
                                                                                                                                      x            x
                  3 presents the fundamentals of vector differential calculus, and                                                i=1 i       i=1 i
                                                                                                                                                                              
                                                                                                    which can be shown to be negative deﬁnite. Therefore β =
                                                                                                        ⊤
                                                                                                    [β β ] as deﬁned in (2) is the maximizer of ℓ(β) and returns
                    M. P. Wand is Associate Professor, Department of Biostatistics, School of          0   1
                  PublicHealth,HarvardUniversity,665HuntingtonAvenue,Boston,MA02115                 the least squares line. The information matrix of β is
                  (E-mail: mwand@hsph.harvard.edu). I am grateful to Mahlet Tadesse, Misha                                               2    ⊤                2    ⊤
                  Salganik, Bhaswati Ganguli, and Yannis Jemiai for comments on the article.                   I(β)=E{(1/σ )X X}=(1/σ )X X
                  c
                  2002AmericanStatistical Association                                                              The American Statistician, February 2002, Vol. 56, No. 1      1
                                           
              so the covariance matrix of β is                                    use such subscript notation liberally in the examples of Section
                                       1      2   ⊤    1                        4.
                                  I(β)     =σ (X X) .
                 Vector differential calculus achieves the same result more di-         3.  APRIMERONVECTORDIFFERENTIAL
              rectly:                                                                                       CALCULUS
                                           1                                         Consider x ∈ R and the function
                                                         ⊤
                             ℓ(β)=2σ2(yXβ) (yXβ)                                                         y = x3 sinx.
                    =⇒ dℓ(β)= 1 [{d(yXβ)}⊤(yXβ)
                                          2σ2                                     Thenscalar differential calculus leads to
                                                    ⊤                                                 
                   
           
                                       +(yXβ) {d(yXβ)}]                                    dy         d 3               3   d
                                           1                                                      =        x    sin x +x         sin x
                                                         ⊤                                   dx         dx                    dx
                                    = 2σ2[(Xdβ) (yXβ)                                                2           3
                                                    ⊤                                             =3x sinx+x cosx.
                                       (yXβ) Xdβ]
                                    =(1/σ2)(yXβ)⊤Xdβ                             Analternative piece of calculus is
                    =⇒ Dℓ(β)=(1/σ2)(yXβ)⊤X=0                                                              3
                                                                                               dy = d(x sinx)
                                                       ⊤
                                       iff  (yXβ) X=0                                                     3            3
                                          ⊤    1 ⊤                                                 =(dx )sin x+x (dsin x)
                         =⇒ β =(X X) X y.                                                                  2               3
                                                                                                    =(3x dx)sin x+x (cosxdx)
                                                                                                           2          3
                                                                                                    =(3x sinx+x cosx)dx.
              Also,                                                               Hence
                            2              ⊤
 1 ⊤ 
                           d ℓ(β)=(dβ)         σ2X X (dβ)                                           dy       2           3
                                                                                                     dx =3x sinx+x cosx
              so
                               1                                                  as before. This second derivation has used the rules:
                                    ⊤                  1      2    ⊤    1
                  Hℓ(β)=σ2X X =⇒ I(β) =σ (X X) .                                            1. d(uv)=(du)v+u(dv)
                 The symbols “Dℓ(β)” and “Hℓ(β)” may be unfamiliar to                        2. du3 =3u2du
              somereaders, and even “dβ” may require clariﬁcation. The lat-
              ter is a vector of differentials, analogous to the numerator and               3. dsinu = cosudu.
              denominator of dy/dx. In vector differential calculus these dif-
              ferentialsareoftenvectors,socannotbedividedaccordingtothe           Eachoftheserulesseemsreasonablegiventheproductruleand
              usual rules of algebra. The D and H notation is used throughout     rules for differentiation of polynomials and trigonometric func-
              MagnusandNeudecker(1988)and,forscalar-valuedfunctions               tions. The only difference is that they do not involve division
              of vectors, is deﬁned by the following                              by differentials such as du. In scalar differential calculus such
                 Deﬁnitions: Let f be a scalar-valued function with argument      division is “allowable,” but in the vector world division is not
              x ∈ Rp. The derivative vector of f, Df(x),isthe1 × p vector         standard. Therefore, for vector differential calculus, it is more
              whoseith entry is                                                   appropriate to work with products and then compute the deriva-
                                                                                  tive vector using (Magnus and Neudecker 1988):
                                            ∂f(x)
                                             ∂x                                      First Identiﬁcation Theorem: If a is a 1 × p vector for which
                                                i
              TheHessianmatrix of f is                                                                     df(x)=adx,
                                                        ⊤
                                   Hf(x)=D{Df(x) }                                then
              and is, alternatively, the p × p matrix with (i,j) entry equal to                              a=Df(x).
                                           ∂2f(x)
                                           ∂x∂x .
                                              i  j                                TheHessian matrix can be found from
                 Manyofourexamplesareconcernedwithscoreandinforma-                   Second Identiﬁcation Theorem: If A is a p × p matrix for
              tion calculations. If the data vector y has density f(y;β) then     which
              the score vector and information matrices are, respectively,
                                                                                                       2               ⊤
                    D logf(y;θ) and I(β)≡E{H logf(y;θ)}.                                             d f(x)=(dx) Adx,
                     θ                                     θ
                     θθ                                    θθ
              Notethenecessity of a subscript on the D and H in this instance     then
              to specify the argument of logf(y;θ) being differentiated. We                                 A=Hf(x).
              2     Statistical Practice
                    It follows from these theorems that all that is required are                     3.3.1     Rules for Scalar Functions
                 rules for expressing df (x) in the forms
                                                              ⊤                                                                   α          α1
                                       adx and (dx) Adx.                                                                       du     = αu         du,
                                                                                                                           dlogu = u1du,
                 3.1    Notation                                                                                                  u        u
                                                                                                                               de     = e du.
                    For vectors
                                                                                                 3.3.2     Chain Rule
                                         a                                b
                                           1                               1
                                       .                              . 
                                 a= .               and        b= .                                                       ′                          ′
                                          .                                .                                  d{s(u)} = s (u)⊙(du)=diag{s (u)}du.
                                         a                                b
                                           p                               p
                                                                                                     3.3.3     Rules Involving Linear Functions
                 we will introduce the notation (used by modern programming
                 languages such as S-Plus and Matlab)
                                                                                                                           d(AU)=AdU,
                                                          a b
                                                            1 1                                                          d(U+V)=dU+dV,
                                                        . 
                                            a⊙b= . ,
                                                             .                                                            ddiag(u)=diag(du),
                                                          a b
                                                            p p                                                                     ⊤              ⊤
                                                                                                                                dU =(dU) ,
                                                                                                                           dvecU = vec(dU),
                                                        a /b
                                                          1   1                                                               d(trU)=tr(dU),
                                                           .    
                                            a/b= . ,
                                                            .                                                                d(EU)=E(dU).
                                                        a /b
                                                          p   p
                 and                                                                                 3.3.4     Product and Quotient Rules
                                                      s(a ) 
                                                             1
                                                       .                                                      d(U⊙V)=(dU)⊙V+U⊙(dV),
                                            s(a)=           .      .
                                                       .                                                          d(UV)=(dU)V+U(dV),
                                                         s(a )
                                                             p                                                                      (du)⊙vu⊙(dv)
                 wheres : R → Risascalarfunction.Wewilluse1todenotea                                                d(u/v)=                  v⊙v               .
                 vector of ones, with dimension clear from the context. Another
                 useful notation is                                                                  3.3.5     Rules for Determinant and Matrix Inverse
                                                  a 0 ··· 0                                                                                   1
                                                       1                                                                 d|U| = |U|tr(U            dU),
                                                   0 a ··· 0                                                              1            1           1
                                                          2                                                           dU       = U (dU)U .
                                    diag(a)=                               .
                                                   .      .   .     .  
                                                   .      .    ..   .  
                                                      .    .         .
                                                      00··· a                                        Quadratic forms, particularly those involving symmetric ma-
                                                                      p                           trices, are very commoninstatistics, so the following results are
                 For a p × q matrix A we deﬁne vec(A) to be the pq × 1 vector                     worth learning:
                 obtained by stacking the columns of A underneath each other,                        3.3.6     Rules Involving Quadratic Forms
                 in order from left to right.
                                                                                                                    ⊤              ⊤           ⊤
                    If v is a random vector then we use E(v) to denote is expec-                                 du Au = u (A+A )du,
                 tation and cov(v) to denote its covariance matrix.                                                 ⊤               ⊤
                                                                                                                 du Au =2u Adu, Asymmetric.
                 3.2    SomeMatrixAlgebraicRules
                    Thefollowingmatrixalgebraicrelationshipsareusefulinvec-                                                   4.   EXAMPLES
                 tor differential calculus. The ﬁrst one is particularly crucial.
                                     diag(a)b = a⊙b,                                              4.1    Generalized Linear Models
                                     diag(a)1 = a,                                                Let y be a vector of responses and X be a corresponding de-
                                       tr(AB)=tr(BA),                                             sign matrix. The one-parameter exponential family model, with
                                     tr(A⊤                      ⊤                                 canonical link, is characterized by the joint density
                                             B)=vec(A) vec(B).
                                                                                                          f(y;β) = exp{y⊤(Xβ)1⊤b(Xβ)+1⊤c(y)}
                 3.3    Rules for Differentials                                                   whereβisthevectorofcoefﬁcients(e.g.,McCullaghandNelder
                                                                                                                                                  x
                                                                                                  1990). For example, b(x) = log(1 + e ) corresponds to binary
                    LetuandvbevectorfunctionsandUandVbematrixfunc-                                regression with a logit link function.
                 tions. A will denote a constant matrix and s a scalar function.                     Thelog-likelihood of β is
                                                                                                                  The American Statistician, February 2002, Vol. 56, No. 1     3
                          ℓ(β)=y⊤Xβ1⊤b(Xβ)+1⊤c(y)                                where
                                                                                                                                           ⊤
                                                                                       c =[cov{S(x ),S(x )},...cov{S(x ),S(x )}] .
                   =⇒ dℓ(β)=y⊤Xdβ1⊤db(Xβ)                                              0             0       1                0       n
                                  = y⊤Xdβ1⊤diag{b′(Xβ)}d(Xβ)                     Therefore
                                       ⊤           ′     ⊤                                             ⊤           ⊤     ⊤        2       ⊤
                                  = y Xdβb(Xβ) Xdβ                                 daC(a,b)=2{(µ1 a+b)µ1 +a (C+σ I)c0}da
                                            ′       ⊤
                                  = {yb(Xβ)} Xdβ.                                and
              Fromtheﬁrst identiﬁcation theorem the score is                                 2                 ⊤ 2 ⊤               2
                                                                                           daC(a,b)=2(da) (µ 11 +C+σ I)da
                                                 ′       ⊤                        leading to
                                Dℓ(β)={yb(Xβ)} X.
                                                                                                        ⊤          ⊤      ⊤        2       ⊤
                                                                                    D C(a,b)=2{(µ1 a+b)µ1 +a (C+σ I)c },(4)
              Theinformation matrix of β is obtained from                             a                                                    0
                       2                   ′       ⊤                              and
                      d ℓ(β)=d{yb(Xβ)} Xdβ
                               = {diag{b′′(Xβ)}Xdβ}⊤Xdβ                                             2   ⊤           2
                                                                                    H C(a,b)=2(µ 11 +C+σ I)>0 forall b∈R.(5)
                                                                                     a
                                        ⊤ ⊤            ′′
                               =(dβ) X [diag{b (Xβ)}]X(dβ)                       Since
              which leads to                                                                         d C(a,b)=2(µ1⊤a+b),
                              Hℓ(β)=X⊤diag{b′′(Xβ)}X                                               db
                                                                                  and
              and the information matrix being                                                 d2
                                                                                                  C(a,b)=2>0 forall a∈Rd
                 I(β)=E{Hℓ(β)}                                                                db2
                                   =X⊤diag{b′′(Xβ)}X=X⊤cov(y)X.                   it follows from (4) and (5) that the unique minimizers of C(a,b)
                                                                                  are
                                                                                                                 ⊤
              Therefore, the asymptotic covariance matrix of β is                                      b = µ1 a
                      1       ⊤        ′′         1        ⊤          1                                           2  1
                                                                                                       a =(C+σ I) c
                 I(β)     =[X diag{b (Xβ)}X]           =[X cov(y)X] .                                                        0
              4.2   Kriging                                                       and the best linear predictor of y at x is
                                                                                                                        0
                 Let(x ,y)beasetofobservationswithx ∈ Rdandy ∈ R.
                       i   i                               i            i                               ⊤         2  1
                                                                             d                    µ+c (C+σ I) (yµ1).
              Thesimple kriging model for estimation of E(y|x ), x ∈ R ,                                0
                                                                  0    0
              is                                                                  In practice, µ and σ2 are replaced by estimators µ and σ2, and
                                                                                                                        
                                    y =µ+S(x)+ε                                   Cparameterized and estimated by C, leading to the kriging
                                     i            i     i                         formula
              where the random vectors
                                                                                                         ⊤        2   1
                                                                                             y(x )=µ+c (C+σ I) (yµ1).
                                   S(x )                     ε                                   0           0
                                       1                      1
                                     .                   . 
                           S= .  and ε= . 
                                      .                       .                   4.3  Variance Regression Models
                                   S(x )                     ε
                                       n                      n                     Thegeneral Gaussian variance regression model is
              have the moment structure                                                                  y∼N(Xβ,V )
                                                                                                                         θ
                                                                                                                         θθ
                 E(S)=E(ε)=0, cov(S)=C and cov(ε)=σ2I
                                                                                  whereVθrepresentsaparameterizationofthecovariancematrix
                                                                                           θθ
              (e.g., Cressie 1993). Kriging involves determination of the best    of y in terms of θ. An important special case is the Gaussian
                                                                                 mixed model
              linear predictor, S(x ),ofS(x ) in the sense that
                                    0         0
                                                       2                                               y=Xβ+Zu+ε
                                   E[{S(x )S(x )} ]
                                           0         0                            with the random effects vector u and the error ε having joint
              is minimized among all linear combinations of the form              distribution
                                               ⊤                                                u        
0 Gζ 0 
                                     S(x )=a y+b.                                                                         ζζ
                                         0                                                        ε    ∼N 0 , 0R
                                                                                                                               τ
                                                                                                                               ττ
              Thefunction to be minimized over a ∈ Rd and b ∈ R is then
                                                                                  for some parameterized matrices G and R (e.g., Robinson
                                                                                                                       ζ        τ
                                                                                                                       ζζ       ττ
                                          ⊤                  2                    1991). In this instance
                         C(a,b)=E[{a y+bS(x )} ]
                                                         0
                                         ⊤       2     ⊤        2                                    
                                 =(µ1 a+b) +a (C+σ I)a                                              ζ
                                                                                                                              ⊤
                                                                                             θ =          and    V =ZG Z +R .
                                         ⊤               2                                                         θ       ζ          τ
                                     2c a+E{S(x0) },                                               τ              θθ      ζζ        ττ
                                         0
              4    Statistical Practice
The words contained in this file might help you see if this file matches what you are looking for:

...Vector differential calculus in statistics m p wand sectionillustratestheiruseinstatistics sectiondescribesan extension of the methodology to matrix arguments many statistical operations benet from examplesinclude optimization likelihood functions and cal simpleillustrativeexample culation information matrices for multiparameter models let x y i n be a set measurements on two suited argument is usu variables consider problem tting line ally most efcient means performing required calcu xtothedata homoscedastic gaussian regression lations we present primer model demonstrate its application through several worked examples key words best linear prediction generalized where mixed maximum estimation penalized quasi score equation known log up an additive constant introduction notation indispensable tool par ticular storage parameters allows scalar approach maximization economical expression that relate involves rewriting as data used parameter suchaslikelihoodfunctionsandposteriordensities a...
Related files

Share

Help

Related files

Share

Share to social media

Help

Login Area