285x Filetype PDF File size 0.14 MB Source: matt-wand.utsacademics.info
Vector Differential Calculus in Statistics
M.P.WAND
Section4illustratestheiruseinstatistics.Section5describesan
extension of the methodology to matrix arguments.
Many statistical operations benefit from differential calculus.
Examplesinclude optimization of likelihood functions and cal- 2. SIMPLEILLUSTRATIVEEXAMPLE
culation of information matrices. For multiparameter models Let (x ,y), 1 ≤ i ≤ n, be a set of measurements on two
differential calculus suited to vector argument functions is usu- i i
variables x and y, and consider the problem of fitting a line y =
ally the most efficient means of performing the required calcu- β +β xtothedata. The homoscedastic Gaussian regression
lations. We present a primer on vector differential calculus and 0 1
model is
demonstrate its application to statistics through several worked 2
examples. y=Xβ+ε, ε ∼ N(0,σ I),
KEY WORDS: Best linear prediction; Generalized linear where
y 1 x
model; Generalized linear mixed model; Information matrix; 1 1
. . . β
Matrix differential calculus; Maximum likelihood estimation; y= . , X= . . and β= 0 .
. . . β
Penalized quasi-likelihood; Score equation. y 1 x 1
n n
For σ2 known, the log-likelihood for estimation of β is (up to
an additive constant)
1. INTRODUCTION 1
⊤
Matrix notation is an indispensable tool in statistics. In par- ℓ(β)=2σ2(yXβ) (yXβ). (1)
ticular, the storage of model parameters in a vector allows for The scalar differential calculus approach to maximization of
economical expression of models that relate the parameters to ℓ(β) involves rewriting (1) as
the data. Functions used for estimation of the parameter vector,
suchaslikelihoodfunctionsandposteriordensities,arealsocon- n
1 2
veniently expressed using matrix notation. But when it comes ℓ(β)= (y β β x )
2σ2 i 0 1 i
to optimization of these functions through differential calculus i=1
the statistician usually reverts to scalar subscript notation and and then obtaining
applies ordinary univariate calculus to the entries of the param-
β n
eter vector. Subsequent calculus steps, such as those required ∂ℓ(ββ) =(1/σ2) (y β β x ),
∂β i=1 i 0 1 i
0
(2)
β n
to obtain the information matrix, are also usually done with ∂ℓ(ββ) =(1/σ2) x (y β β x ).
∂β i=1 i i 0 1 i
anelement-wiseapproach—sometimesfollowedbyconversion 1
back to matrix notation. Settingthistozeroweobtaintheuniquestationarypointofℓ(β)
This article demonstrates that it is often possible to perform occurring at
theentireoperationusingmatrixalgebra.Apartfrombeingmore
β = yβ x
streamlined, it saves conversion between matrix notation and 0 1
n n
subscript notation. The key is a differential calculus suited to 2
β = (x x)(y y) (x x) .
vector argument and scalar-valued functions. 1 i i i
The book by Magnus and Neudecker (1988) describes an i=1 i=1
elegant approach to differential calculus in statistics for gen- Asiswell-known, (2) is algebraically equivalent to
eral matrix-argument functions. However, the simplifications
that arise for scalar-valued vector-argument functions are some- ⊤ 1 ⊤
β=(X X) X y. (3)
what opaque in that reference. This article aims to redress
this deficiency, while otherwise using the tools of Magnus and TheHessian matrix of second-order partial derivatives is
Neudecker. Other contributions to matrix differential calcu- 2 β 2 β
∂ ℓ(ββ) ∂ ℓ(ββ)
2 ∂β ∂β
lus may be found in Dwyer (1967), Searle (1982), Basilevsky ∂β 0 1
2 0 2
β β
(1983), and Harville (1997) but each of these references have ∂ ℓ(ββ) ∂ ℓ(ββ)
∂β ∂β 2
0 1 ∂β
1
similar shortcomings for vector-argument functions.
n n x
Section 2 begins with a simple illustrative example. Section =(1/σ2)
i=1 i =(1/σ2)X⊤X
n n 2
x x
3 presents the fundamentals of vector differential calculus, and i=1 i i=1 i
which can be shown to be negative definite. Therefore β =
⊤
[β β ] as defined in (2) is the maximizer of ℓ(β) and returns
M. P. Wand is Associate Professor, Department of Biostatistics, School of 0 1
PublicHealth,HarvardUniversity,665HuntingtonAvenue,Boston,MA02115 the least squares line. The information matrix of β is
(E-mail: mwand@hsph.harvard.edu). I am grateful to Mahlet Tadesse, Misha 2 ⊤ 2 ⊤
Salganik, Bhaswati Ganguli, and Yannis Jemiai for comments on the article. I(β)=E{(1/σ )X X}=(1/σ )X X
c
2002AmericanStatistical Association The American Statistician, February 2002, Vol. 56, No. 1 1
so the covariance matrix of β is use such subscript notation liberally in the examples of Section
1 2 ⊤ 1 4.
I(β) =σ (X X) .
Vector differential calculus achieves the same result more di- 3. APRIMERONVECTORDIFFERENTIAL
rectly: CALCULUS
1 Consider x ∈ R and the function
⊤
ℓ(β)=2σ2(yXβ) (yXβ) y = x3 sinx.
=⇒ dℓ(β)= 1 [{d(yXβ)}⊤(yXβ)
2σ2 Thenscalar differential calculus leads to
⊤
+(yXβ) {d(yXβ)}] dy d 3 3 d
1 = x sin x +x sin x
⊤ dx dx dx
= 2σ2[(Xdβ) (yXβ) 2 3
⊤ =3x sinx+x cosx.
(yXβ) Xdβ]
=(1/σ2)(yXβ)⊤Xdβ Analternative piece of calculus is
=⇒ Dℓ(β)=(1/σ2)(yXβ)⊤X=0 3
dy = d(x sinx)
⊤
iff (yXβ) X=0 3 3
⊤ 1 ⊤ =(dx )sin x+x (dsin x)
=⇒ β =(X X) X y. 2 3
=(3x dx)sin x+x (cosxdx)
2 3
=(3x sinx+x cosx)dx.
Also, Hence
2 ⊤
1 ⊤
d ℓ(β)=(dβ) σ2X X (dβ) dy 2 3
dx =3x sinx+x cosx
so
1 as before. This second derivation has used the rules:
⊤ 1 2 ⊤ 1
Hℓ(β)=σ2X X =⇒ I(β) =σ (X X) . 1. d(uv)=(du)v+u(dv)
The symbols “Dℓ(β)” and “Hℓ(β)” may be unfamiliar to 2. du3 =3u2du
somereaders, and even “dβ” may require clarification. The lat-
ter is a vector of differentials, analogous to the numerator and 3. dsinu = cosudu.
denominator of dy/dx. In vector differential calculus these dif-
ferentialsareoftenvectors,socannotbedividedaccordingtothe Eachoftheserulesseemsreasonablegiventheproductruleand
usual rules of algebra. The D and H notation is used throughout rules for differentiation of polynomials and trigonometric func-
MagnusandNeudecker(1988)and,forscalar-valuedfunctions tions. The only difference is that they do not involve division
of vectors, is defined by the following by differentials such as du. In scalar differential calculus such
Definitions: Let f be a scalar-valued function with argument division is “allowable,” but in the vector world division is not
x ∈ Rp. The derivative vector of f, Df(x),isthe1 × p vector standard. Therefore, for vector differential calculus, it is more
whoseith entry is appropriate to work with products and then compute the deriva-
tive vector using (Magnus and Neudecker 1988):
∂f(x)
∂x First Identification Theorem: If a is a 1 × p vector for which
i
TheHessianmatrix of f is df(x)=adx,
⊤
Hf(x)=D{Df(x) } then
and is, alternatively, the p × p matrix with (i,j) entry equal to a=Df(x).
∂2f(x)
∂x∂x .
i j TheHessian matrix can be found from
Manyofourexamplesareconcernedwithscoreandinforma- Second Identification Theorem: If A is a p × p matrix for
tion calculations. If the data vector y has density f(y;β) then which
the score vector and information matrices are, respectively,
2 ⊤
D logf(y;θ) and I(β)≡E{H logf(y;θ)}. d f(x)=(dx) Adx,
θ θ
θθ θθ
Notethenecessity of a subscript on the D and H in this instance then
to specify the argument of logf(y;θ) being differentiated. We A=Hf(x).
2 Statistical Practice
It follows from these theorems that all that is required are 3.3.1 Rules for Scalar Functions
rules for expressing df (x) in the forms
⊤ α α1
adx and (dx) Adx. du = αu du,
dlogu = u1du,
3.1 Notation u u
de = e du.
For vectors
3.3.2 Chain Rule
a b
1 1
. .
a= . and b= . ′ ′
. . d{s(u)} = s (u)⊙(du)=diag{s (u)}du.
a b
p p
3.3.3 Rules Involving Linear Functions
we will introduce the notation (used by modern programming
languages such as S-Plus and Matlab)
d(AU)=AdU,
a b
1 1 d(U+V)=dU+dV,
.
a⊙b= . ,
. ddiag(u)=diag(du),
a b
p p ⊤ ⊤
dU =(dU) ,
dvecU = vec(dU),
a /b
1 1 d(trU)=tr(dU),
.
a/b= . ,
. d(EU)=E(dU).
a /b
p p
and 3.3.4 Product and Quotient Rules
s(a )
1
. d(U⊙V)=(dU)⊙V+U⊙(dV),
s(a)= . .
. d(UV)=(dU)V+U(dV),
s(a )
p (du)⊙vu⊙(dv)
wheres : R → Risascalarfunction.Wewilluse1todenotea d(u/v)= v⊙v .
vector of ones, with dimension clear from the context. Another
useful notation is 3.3.5 Rules for Determinant and Matrix Inverse
a 0 ··· 0 1
1 d|U| = |U|tr(U dU),
0 a ··· 0 1 1 1
2 dU = U (dU)U .
diag(a)= .
. . . .
. . .. .
. . .
00··· a Quadratic forms, particularly those involving symmetric ma-
p trices, are very commoninstatistics, so the following results are
For a p × q matrix A we define vec(A) to be the pq × 1 vector worth learning:
obtained by stacking the columns of A underneath each other, 3.3.6 Rules Involving Quadratic Forms
in order from left to right.
⊤ ⊤ ⊤
If v is a random vector then we use E(v) to denote is expec- du Au = u (A+A )du,
tation and cov(v) to denote its covariance matrix. ⊤ ⊤
du Au =2u Adu, Asymmetric.
3.2 SomeMatrixAlgebraicRules
Thefollowingmatrixalgebraicrelationshipsareusefulinvec- 4. EXAMPLES
tor differential calculus. The first one is particularly crucial.
diag(a)b = a⊙b, 4.1 Generalized Linear Models
diag(a)1 = a, Let y be a vector of responses and X be a corresponding de-
tr(AB)=tr(BA), sign matrix. The one-parameter exponential family model, with
tr(A⊤ ⊤ canonical link, is characterized by the joint density
B)=vec(A) vec(B).
f(y;β) = exp{y⊤(Xβ)1⊤b(Xβ)+1⊤c(y)}
3.3 Rules for Differentials whereβisthevectorofcoefficients(e.g.,McCullaghandNelder
x
1990). For example, b(x) = log(1 + e ) corresponds to binary
LetuandvbevectorfunctionsandUandVbematrixfunc- regression with a logit link function.
tions. A will denote a constant matrix and s a scalar function. Thelog-likelihood of β is
The American Statistician, February 2002, Vol. 56, No. 1 3
ℓ(β)=y⊤Xβ1⊤b(Xβ)+1⊤c(y) where
⊤
c =[cov{S(x ),S(x )},...cov{S(x ),S(x )}] .
=⇒ dℓ(β)=y⊤Xdβ1⊤db(Xβ) 0 0 1 0 n
= y⊤Xdβ1⊤diag{b′(Xβ)}d(Xβ) Therefore
⊤ ′ ⊤ ⊤ ⊤ ⊤ 2 ⊤
= y Xdβb(Xβ) Xdβ daC(a,b)=2{(µ1 a+b)µ1 +a (C+σ I)c0}da
′ ⊤
= {yb(Xβ)} Xdβ. and
Fromthefirst identification theorem the score is 2 ⊤ 2 ⊤ 2
daC(a,b)=2(da) (µ 11 +C+σ I)da
′ ⊤ leading to
Dℓ(β)={yb(Xβ)} X.
⊤ ⊤ ⊤ 2 ⊤
D C(a,b)=2{(µ1 a+b)µ1 +a (C+σ I)c },(4)
Theinformation matrix of β is obtained from a 0
2 ′ ⊤ and
d ℓ(β)=d{yb(Xβ)} Xdβ
= {diag{b′′(Xβ)}Xdβ}⊤Xdβ 2 ⊤ 2
H C(a,b)=2(µ 11 +C+σ I)>0 forall b∈R.(5)
a
⊤ ⊤ ′′
=(dβ) X [diag{b (Xβ)}]X(dβ) Since
which leads to d C(a,b)=2(µ1⊤a+b),
Hℓ(β)=X⊤diag{b′′(Xβ)}X db
and
and the information matrix being d2
C(a,b)=2>0 forall a∈Rd
I(β)=E{Hℓ(β)} db2
=X⊤diag{b′′(Xβ)}X=X⊤cov(y)X. it follows from (4) and (5) that the unique minimizers of C(a,b)
are
⊤
Therefore, the asymptotic covariance matrix of β is b = µ1 a
1 ⊤ ′′ 1 ⊤ 1 2 1
a =(C+σ I) c
I(β) =[X diag{b (Xβ)}X] =[X cov(y)X] . 0
4.2 Kriging and the best linear predictor of y at x is
0
Let(x ,y)beasetofobservationswithx ∈ Rdandy ∈ R.
i i i i ⊤ 2 1
d µ+c (C+σ I) (yµ1).
Thesimple kriging model for estimation of E(y|x ), x ∈ R , 0
0 0
is In practice, µ and σ2 are replaced by estimators µ and σ2, and
y =µ+S(x)+ε Cparameterized and estimated by C, leading to the kriging
i i i formula
where the random vectors
⊤ 2 1
y(x )=µ+c (C+σ I) (yµ1).
S(x ) ε 0 0
1 1
. .
S= . and ε= .
. . 4.3 Variance Regression Models
S(x ) ε
n n Thegeneral Gaussian variance regression model is
have the moment structure y∼N(Xβ,V )
θ
θθ
E(S)=E(ε)=0, cov(S)=C and cov(ε)=σ2I
whereVθrepresentsaparameterizationofthecovariancematrix
θθ
(e.g., Cressie 1993). Kriging involves determination of the best of y in terms of θ. An important special case is the Gaussian
mixed model
linear predictor, S(x ),ofS(x ) in the sense that
0 0
2 y=Xβ+Zu+ε
E[{S(x )S(x )} ]
0 0 with the random effects vector u and the error ε having joint
is minimized among all linear combinations of the form distribution
⊤ u
0 Gζ 0
S(x )=a y+b. ζζ
0 ε ∼N 0 , 0R
τ
ττ
Thefunction to be minimized over a ∈ Rd and b ∈ R is then
for some parameterized matrices G and R (e.g., Robinson
ζ τ
ζζ ττ
⊤ 2 1991). In this instance
C(a,b)=E[{a y+bS(x )} ]
0
⊤ 2 ⊤ 2
=(µ1 a+b) +a (C+σ I)a ζ
⊤
θ = and V =ZG Z +R .
⊤ 2 θ ζ τ
2c a+E{S(x0) }, τ θθ ζζ ττ
0
4 Statistical Practice
no reviews yet
Please Login to review.