11:55 Wednesday 14th October, 2015
See updates and corrections at http://www.stat.cmu.edu/~cshalizi/mreg/
Lecture 13: Simple Linear Regression in Matrix
Format
36-401, Section B, Fall 2015
13 October 2015
Contents
1 Least Squares in Matrix Form
  1.1 The Basic Matrices
  1.2 Mean Squared Error
  1.3 Minimizing the MSE
2 Fitted Values and Residuals
  2.1 Residuals
  2.2 Expectations and Covariances
3 Sampling Distribution of Estimators
4 Derivatives with Respect to Vectors
  4.1 Second Derivatives
  4.2 Maxima and Minima
5 Expectations and Variances with Vectors and Matrices
6 Further Reading
So far, we have not used any notions, or notation, that go beyond basic algebra and calculus (and probability). This has forced us to do a fair amount of book-keeping, as it were by hand. This is just about tolerable for the simple linear model, with one predictor variable. It will get intolerable if we have multiple predictor variables. Fortunately, a little application of linear algebra will let us abstract away from a lot of the book-keeping details, and make multiple linear regression hardly more complicated than the simple version.¹
These notes will not remind you of how matrix algebra works. However, they will review some results about calculus with matrices, and about expectations and variances with vectors and matrices.
Throughout, bold-faced letters will denote matrices, such as $\mathbf{a}$, as opposed to a scalar $a$.
1 Least Squares in Matrix Form
Our data consists of $n$ paired observations of the predictor variable $X$ and the response variable $Y$, i.e., $(x_1, y_1), \ldots, (x_n, y_n)$. We wish to fit the model
\[ Y = \beta_0 + \beta_1 X + \epsilon \tag{1} \]
where $E[\epsilon|X=x] = 0$, $\mathrm{Var}[\epsilon|X=x] = \sigma^2$, and $\epsilon$ is uncorrelated across measurements.²
1.1 The Basic Matrices
Group all of the observations of the response into a single column ($n \times 1$) matrix $\mathbf{y}$,
\[ \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \tag{2} \]
Similarly, we group both of the coefficients into a single vector (i.e., a $2 \times 1$ matrix)
\[ \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} \tag{3} \]
We'd also like to group the observations of the predictor variable together, but we need something which looks a little unusual at first:
\[ \mathbf{x} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \tag{4} \]
¹ Historically, linear models with multiple predictors evolved before the use of matrix algebra for regression. You may imagine the resulting drudgery.
² When I need to also assume that $\epsilon$ is Gaussian, and strengthen "uncorrelated" to "independent", I'll say so.
This is an $n \times 2$ matrix, where the first column is always 1, and the second column contains the actual observations of $X$. We have this apparently redundant first column because of what it does for us when we multiply $\mathbf{x}$ by $\beta$:
\[ \mathbf{x}\beta = \begin{bmatrix} \beta_0 + \beta_1 x_1 \\ \beta_0 + \beta_1 x_2 \\ \vdots \\ \beta_0 + \beta_1 x_n \end{bmatrix} \tag{5} \]
That is, $\mathbf{x}\beta$ is the $n \times 1$ matrix which contains the point predictions. The matrix $\mathbf{x}$ is sometimes called the design matrix.
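As a concrete illustration, here is a minimal numpy sketch of the design matrix and the point predictions; the variable names and toy data are mine, not from the notes:

```python
import numpy as np

# Toy data: n = 4 observations of the predictor (illustrative values)
x_obs = np.array([1.0, 2.0, 3.0, 4.0])
n = len(x_obs)

# Design matrix: a column of 1s next to the observed predictor values
x = np.column_stack([np.ones(n), x_obs])   # shape (n, 2)

# A candidate coefficient vector (beta_0, beta_1)
beta = np.array([0.5, 2.0])

# x @ beta is the n x 1 vector of point predictions beta_0 + beta_1 * x_i
predictions = x @ beta                      # [2.5, 4.5, 6.5, 8.5]
```

The column of 1s is what lets a single matrix product handle the intercept and the slope at once.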
1.2 Mean Squared Error
At each data point, using the coefficients $\beta$ results in some error of prediction, so we have $n$ prediction errors. These form a vector:
\[ e(\beta) = \mathbf{y} - \mathbf{x}\beta \tag{6} \]
(You can check that this subtracts an $n \times 1$ matrix from an $n \times 1$ matrix.)
When we derived the least squares estimator, we used the mean squared error,
\[ \mathrm{MSE}(\beta) = \frac{1}{n} \sum_{i=1}^{n} e_i^2(\beta) \tag{7} \]
How might we express this in terms of our matrices? I claim that the correct form is
\[ \mathrm{MSE}(\beta) = \frac{1}{n} e^T e \tag{8} \]
To see this, look at what the matrix multiplication really involves:
\[ \begin{bmatrix} e_1 & e_2 & \ldots & e_n \end{bmatrix} \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix} \tag{9} \]
This clearly equals $\sum_i e_i^2$, so the MSE has the claimed form.
Let us expand this a little for further use.
\begin{align}
\mathrm{MSE}(\beta) &= \frac{1}{n} e^T e \tag{10} \\
&= \frac{1}{n} (\mathbf{y} - \mathbf{x}\beta)^T (\mathbf{y} - \mathbf{x}\beta) \tag{11} \\
&= \frac{1}{n} (\mathbf{y}^T - \beta^T \mathbf{x}^T)(\mathbf{y} - \mathbf{x}\beta) \tag{12} \\
&= \frac{1}{n} \left( \mathbf{y}^T \mathbf{y} - \mathbf{y}^T \mathbf{x}\beta - \beta^T \mathbf{x}^T \mathbf{y} + \beta^T \mathbf{x}^T \mathbf{x}\beta \right) \tag{13}
\end{align}
Notice that $(\mathbf{y}^T \mathbf{x}\beta)^T = \beta^T \mathbf{x}^T \mathbf{y}$. Further notice that this is a $1 \times 1$ matrix, so $\mathbf{y}^T \mathbf{x}\beta = \beta^T \mathbf{x}^T \mathbf{y}$. Thus
\[ \mathrm{MSE}(\beta) = \frac{1}{n} \left( \mathbf{y}^T \mathbf{y} - 2\beta^T \mathbf{x}^T \mathbf{y} + \beta^T \mathbf{x}^T \mathbf{x}\beta \right) \tag{14} \]
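The expansion can be checked numerically: for any candidate coefficient vector, the expanded matrix form must agree with the direct average of squared errors. A small numpy sketch (simulated data and names are mine, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = np.column_stack([np.ones(n), rng.normal(size=n)])   # design matrix
y = x @ np.array([1.0, 2.0]) + rng.normal(size=n)       # simulated response
beta = np.array([0.9, 2.1])                             # any candidate coefficients

# Direct form, Eq. 7: average of the squared prediction errors
e = y - x @ beta
mse_direct = np.mean(e ** 2)

# Expanded form, Eq. 14: (1/n)(y'y - 2 b'x'y + b'x'x b)
mse_expanded = (y @ y - 2 * beta @ (x.T @ y) + beta @ (x.T @ x) @ beta) / n
```

The two quantities agree to floating-point precision, as the algebra says they must.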
1.3 Minimizing the MSE
First, we find the gradient of the MSE with respect to $\beta$:
\begin{align}
\nabla \mathrm{MSE}(\beta) &= \frac{1}{n} \left( \nabla \mathbf{y}^T \mathbf{y} - 2\nabla \beta^T \mathbf{x}^T \mathbf{y} + \nabla \beta^T \mathbf{x}^T \mathbf{x}\beta \right) \tag{15} \\
&= \frac{1}{n} \left( 0 - 2\mathbf{x}^T \mathbf{y} + 2\mathbf{x}^T \mathbf{x}\beta \right) \tag{16} \\
&= \frac{2}{n} \left( \mathbf{x}^T \mathbf{x}\beta - \mathbf{x}^T \mathbf{y} \right) \tag{17}
\end{align}
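One way to convince yourself of the gradient formula is a finite-difference check: perturb each coordinate of $\beta$ slightly and compare the numerical slope to the closed form. A sketch (my own simulated data, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
x = np.column_stack([np.ones(n), rng.normal(size=n)])
y = x @ np.array([-1.0, 3.0]) + rng.normal(size=n)

def mse(beta):
    e = y - x @ beta
    return np.mean(e ** 2)

beta = np.array([0.5, 1.5])

# Closed-form gradient from Eq. 17: (2/n)(x'x beta - x'y)
grad = 2.0 / n * (x.T @ x @ beta - x.T @ y)

# Central finite-difference approximation of the same gradient
h = 1e-6
fd = np.array([(mse(beta + h * np.eye(2)[j]) - mse(beta - h * np.eye(2)[j])) / (2 * h)
               for j in range(2)])
```

The two vectors agree to several decimal places, which is what the calculus predicts.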
We now set this to zero at the optimum, $\hat{\beta}$:
\[ \mathbf{x}^T \mathbf{x}\hat{\beta} - \mathbf{x}^T \mathbf{y} = 0 \tag{18} \]
This equation, for the two-dimensional vector $\hat{\beta}$, corresponds to our pair of normal or estimating equations for $\hat{\beta}_0$ and $\hat{\beta}_1$. Thus, it, too, is called an estimating equation.
Solving,
\[ \hat{\beta} = (\mathbf{x}^T \mathbf{x})^{-1} \mathbf{x}^T \mathbf{y} \tag{19} \]
That is, we've got one matrix equation which gives us both coefficient estimates.
If this is right, the equation we've got above should in fact reproduce the least-squares estimates we've already derived, which are of course
\[ \hat{\beta}_1 = \frac{c_{XY}}{s_X^2} = \frac{\overline{xy} - \bar{x}\bar{y}}{\overline{x^2} - \bar{x}^2} \tag{20} \]
and
\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \tag{21} \]
Let’s see if that’s right.
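The algebraic check that follows can also be done numerically: compute $\hat{\beta}$ from the matrix formula and from the scalar formulas, and confirm they match. A numpy sketch under my own simulated data (solving the normal equations with `np.linalg.solve` rather than forming the inverse explicitly, which is the standard numerically stable choice):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x_obs = rng.normal(size=n)
y = 2.0 + 3.0 * x_obs + rng.normal(size=n)

# Matrix solution, Eq. 19: solve x'x beta-hat = x'y
x = np.column_stack([np.ones(n), x_obs])
beta_hat = np.linalg.solve(x.T @ x, x.T @ y)

# Scalar solutions, Eqs. 20-21
b1 = ((np.mean(x_obs * y) - x_obs.mean() * y.mean())
      / (np.mean(x_obs ** 2) - x_obs.mean() ** 2))
b0 = y.mean() - b1 * x_obs.mean()
```

Both routes give the same pair of estimates, up to floating-point error.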
As a first step, let's introduce normalizing factors of $1/n$ into both the matrix products:
\[ \hat{\beta} = (n^{-1} \mathbf{x}^T \mathbf{x})^{-1} (n^{-1} \mathbf{x}^T \mathbf{y}) \tag{22} \]
Now let's look at the two factors in parentheses separately, from right to left.
\begin{align}
\frac{1}{n} \mathbf{x}^T \mathbf{y} &= \frac{1}{n} \begin{bmatrix} 1 & 1 & \ldots & 1 \\ x_1 & x_2 & \ldots & x_n \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \tag{23} \\
&= \frac{1}{n} \begin{bmatrix} \sum_i y_i \\ \sum_i x_i y_i \end{bmatrix} \tag{24} \\
&= \begin{bmatrix} \bar{y} \\ \overline{xy} \end{bmatrix} \tag{25}
\end{align}
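This identity — that $n^{-1}\mathbf{x}^T\mathbf{y}$ stacks the sample means $\bar{y}$ and $\overline{xy}$ — is easy to verify directly (again with my own illustrative data):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
x_obs = rng.normal(size=n)
y = rng.normal(size=n)
x = np.column_stack([np.ones(n), x_obs])

# (1/n) x'y: first entry is the mean of y, second the mean of x*y (Eq. 25)
v = (x.T @ y) / n
```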