11:55 Wednesday 14th October, 2015
See updates and corrections at http://www.stat.cmu.edu/~cshalizi/mreg/
Lecture 13: Simple Linear Regression in Matrix
Format
36-401, Section B, Fall 2015
13 October 2015
Contents
1 Least Squares in Matrix Form
  1.1 The Basic Matrices
  1.2 Mean Squared Error
  1.3 Minimizing the MSE
2 Fitted Values and Residuals
  2.1 Residuals
  2.2 Expectations and Covariances
3 Sampling Distribution of Estimators
4 Derivatives with Respect to Vectors
  4.1 Second Derivatives
  4.2 Maxima and Minima
5 Expectations and Variances with Vectors and Matrices
6 Further Reading
So far, we have not used any notions, or notation, that go beyond basic algebra and calculus (and probability). This has forced us to do a fair amount of book-keeping, as it were by hand. This is just about tolerable for the simple linear model, with one predictor variable. It will get intolerable if we have multiple predictor variables. Fortunately, a little application of linear algebra will let us abstract away from a lot of the book-keeping details, and make multiple linear regression hardly more complicated than the simple version.¹
These notes will not remind you of how matrix algebra works. However, they will review some results about calculus with matrices, and about expectations and variances with vectors and matrices.
Throughout, bold-faced letters will denote matrices, such as $\mathbf{a}$, as opposed to a scalar $a$.
1 Least Squares in Matrix Form
Our data consists of $n$ paired observations of the predictor variable $X$ and the response variable $Y$, i.e., $(x_1, y_1), \ldots, (x_n, y_n)$. We wish to fit the model
\[ Y = \beta_0 + \beta_1 X + \epsilon \tag{1} \]
where $E[\epsilon|X=x] = 0$, $\mathrm{Var}[\epsilon|X=x] = \sigma^2$, and $\epsilon$ is uncorrelated across measurements.²
1.1 The Basic Matrices
Group all of the observations of the response into a single column ($n \times 1$) matrix $\mathbf{y}$,
\[ \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \tag{2} \]
Similarly, we group both of the coefficients into a single vector (i.e., a $2 \times 1$ matrix)
\[ \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} \tag{3} \]
We'd also like to group the observations of the predictor variable together, but we need something which looks a little unusual at first:
\[ \mathbf{x} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \tag{4} \]
¹ Historically, linear models with multiple predictors evolved before the use of matrix algebra for regression. You may imagine the resulting drudgery.
² When I need to also assume that $\epsilon$ is Gaussian, and strengthen "uncorrelated" to "independent", I'll say so.
This is an $n \times 2$ matrix, where the first column is always 1, and the second column contains the actual observations of $X$. We have this apparently redundant first column because of what it does for us when we multiply $\mathbf{x}$ by $\beta$:
\[ \mathbf{x}\beta = \begin{bmatrix} \beta_0 + \beta_1 x_1 \\ \beta_0 + \beta_1 x_2 \\ \vdots \\ \beta_0 + \beta_1 x_n \end{bmatrix} \tag{5} \]
That is, $\mathbf{x}\beta$ is the $n \times 1$ matrix which contains the point predictions. The matrix $\mathbf{x}$ is sometimes called the design matrix.
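As a concrete illustration, here is a minimal numpy sketch of the design matrix and the point predictions; the variable names and toy data are mine, not from the notes:

```python
import numpy as np

# Toy data: n = 4 observations of the predictor (illustrative values)
x_obs = np.array([1.0, 2.0, 3.0, 4.0])
n = len(x_obs)

# Design matrix: a column of 1s next to the observed predictor values
x = np.column_stack([np.ones(n), x_obs])   # shape (n, 2)

# A candidate coefficient vector (beta_0, beta_1)
beta = np.array([0.5, 2.0])

# x @ beta is the n x 1 vector of point predictions beta_0 + beta_1 * x_i
predictions = x @ beta                      # [2.5, 4.5, 6.5, 8.5]
```

The column of 1s is what lets a single matrix product handle the intercept and the slope at once.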
1.2 Mean Squared Error
At each data point, using the coefficients $\beta$ results in some error of prediction, so we have $n$ prediction errors. These form a vector:
\[ e(\beta) = \mathbf{y} - \mathbf{x}\beta \tag{6} \]
(You can check that this subtracts an $n \times 1$ matrix from an $n \times 1$ matrix.)
When we derived the least squares estimator, we used the mean squared error,
\[ \mathrm{MSE}(\beta) = \frac{1}{n} \sum_{i=1}^{n} e_i^2(\beta) \tag{7} \]
How might we express this in terms of our matrices? I claim that the correct form is
\[ \mathrm{MSE}(\beta) = \frac{1}{n} e^T e \tag{8} \]
To see this, look at what the matrix multiplication really involves:
\[ \begin{bmatrix} e_1 & e_2 & \ldots & e_n \end{bmatrix} \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix} \tag{9} \]
This clearly equals $\sum_i e_i^2$, so the MSE has the claimed form.
Let us expand this a little for further use.
\begin{align}
\mathrm{MSE}(\beta) &= \frac{1}{n} e^T e \tag{10} \\
&= \frac{1}{n} (\mathbf{y} - \mathbf{x}\beta)^T (\mathbf{y} - \mathbf{x}\beta) \tag{11} \\
&= \frac{1}{n} (\mathbf{y}^T - \beta^T \mathbf{x}^T)(\mathbf{y} - \mathbf{x}\beta) \tag{12} \\
&= \frac{1}{n} \left( \mathbf{y}^T \mathbf{y} - \mathbf{y}^T \mathbf{x}\beta - \beta^T \mathbf{x}^T \mathbf{y} + \beta^T \mathbf{x}^T \mathbf{x}\beta \right) \tag{13}
\end{align}
Notice that $(\mathbf{y}^T \mathbf{x}\beta)^T = \beta^T \mathbf{x}^T \mathbf{y}$. Further notice that this is a $1 \times 1$ matrix, so $\mathbf{y}^T \mathbf{x}\beta = \beta^T \mathbf{x}^T \mathbf{y}$. Thus
\[ \mathrm{MSE}(\beta) = \frac{1}{n} \left( \mathbf{y}^T \mathbf{y} - 2\beta^T \mathbf{x}^T \mathbf{y} + \beta^T \mathbf{x}^T \mathbf{x}\beta \right) \tag{14} \]
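The expansion can be checked numerically: for any candidate coefficient vector, the expanded matrix form must agree with the direct average of squared errors. A small numpy sketch (simulated data and names are mine, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = np.column_stack([np.ones(n), rng.normal(size=n)])   # design matrix
y = x @ np.array([1.0, 2.0]) + rng.normal(size=n)       # simulated response
beta = np.array([0.9, 2.1])                             # any candidate coefficients

# Direct form, Eq. 7: average of the squared prediction errors
e = y - x @ beta
mse_direct = np.mean(e ** 2)

# Expanded form, Eq. 14: (1/n)(y'y - 2 b'x'y + b'x'x b)
mse_expanded = (y @ y - 2 * beta @ (x.T @ y) + beta @ (x.T @ x) @ beta) / n
```

The two quantities agree to floating-point precision, as the algebra says they must.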
1.3 Minimizing the MSE
First, we find the gradient of the MSE with respect to $\beta$:
\begin{align}
\nabla \mathrm{MSE}(\beta) &= \frac{1}{n} \left( \nabla \mathbf{y}^T \mathbf{y} - 2\nabla \beta^T \mathbf{x}^T \mathbf{y} + \nabla \beta^T \mathbf{x}^T \mathbf{x}\beta \right) \tag{15} \\
&= \frac{1}{n} \left( 0 - 2\mathbf{x}^T \mathbf{y} + 2\mathbf{x}^T \mathbf{x}\beta \right) \tag{16} \\
&= \frac{2}{n} \left( \mathbf{x}^T \mathbf{x}\beta - \mathbf{x}^T \mathbf{y} \right) \tag{17}
\end{align}
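One way to convince yourself of the gradient formula is a finite-difference check: perturb each coordinate of $\beta$ slightly and compare the numerical slope to the closed form. A sketch (my own simulated data, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
x = np.column_stack([np.ones(n), rng.normal(size=n)])
y = x @ np.array([-1.0, 3.0]) + rng.normal(size=n)

def mse(beta):
    e = y - x @ beta
    return np.mean(e ** 2)

beta = np.array([0.5, 1.5])

# Closed-form gradient from Eq. 17: (2/n)(x'x beta - x'y)
grad = 2.0 / n * (x.T @ x @ beta - x.T @ y)

# Central finite-difference approximation of the same gradient
h = 1e-6
fd = np.array([(mse(beta + h * np.eye(2)[j]) - mse(beta - h * np.eye(2)[j])) / (2 * h)
               for j in range(2)])
```

The two vectors agree to several decimal places, which is what the calculus predicts.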
We now set this to zero at the optimum, $\hat{\beta}$:
\[ \mathbf{x}^T \mathbf{x}\hat{\beta} - \mathbf{x}^T \mathbf{y} = 0 \tag{18} \]
This equation, for the two-dimensional vector $\hat{\beta}$, corresponds to our pair of normal or estimating equations for $\hat{\beta}_0$ and $\hat{\beta}_1$. Thus, it, too, is called an estimating equation.
Solving,
\[ \hat{\beta} = (\mathbf{x}^T \mathbf{x})^{-1} \mathbf{x}^T \mathbf{y} \tag{19} \]
That is, we've got one matrix equation which gives us both coefficient estimates.
If this is right, the equation we've got above should in fact reproduce the least-squares estimates we've already derived, which are of course
\[ \hat{\beta}_1 = \frac{c_{XY}}{s_X^2} = \frac{\overline{xy} - \bar{x}\bar{y}}{\overline{x^2} - \bar{x}^2} \tag{20} \]
and
\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \tag{21} \]
Let’s see if that’s right.
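The algebraic check that follows can also be done numerically: compute $\hat{\beta}$ from the matrix formula and from the scalar formulas, and confirm they match. A numpy sketch under my own simulated data (solving the normal equations with `np.linalg.solve` rather than forming the inverse explicitly, which is the standard numerically stable choice):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x_obs = rng.normal(size=n)
y = 2.0 + 3.0 * x_obs + rng.normal(size=n)

# Matrix solution, Eq. 19: solve x'x beta-hat = x'y
x = np.column_stack([np.ones(n), x_obs])
beta_hat = np.linalg.solve(x.T @ x, x.T @ y)

# Scalar solutions, Eqs. 20-21
b1 = ((np.mean(x_obs * y) - x_obs.mean() * y.mean())
      / (np.mean(x_obs ** 2) - x_obs.mean() ** 2))
b0 = y.mean() - b1 * x_obs.mean()
```

Both routes give the same pair of estimates, up to floating-point error.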
As a first step, let's introduce normalizing factors of $1/n$ into both the matrix products:
\[ \hat{\beta} = (n^{-1} \mathbf{x}^T \mathbf{x})^{-1} (n^{-1} \mathbf{x}^T \mathbf{y}) \tag{22} \]
Now let's look at the two factors in parentheses separately, from right to left.
\begin{align}
\frac{1}{n} \mathbf{x}^T \mathbf{y} &= \frac{1}{n} \begin{bmatrix} 1 & 1 & \ldots & 1 \\ x_1 & x_2 & \ldots & x_n \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \tag{23} \\
&= \frac{1}{n} \begin{bmatrix} \sum_i y_i \\ \sum_i x_i y_i \end{bmatrix} \tag{24} \\
&= \begin{bmatrix} \bar{y} \\ \overline{xy} \end{bmatrix} \tag{25}
\end{align}
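This identity — that $n^{-1}\mathbf{x}^T\mathbf{y}$ stacks the sample means $\bar{y}$ and $\overline{xy}$ — is easy to verify directly (again with my own illustrative data):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
x_obs = rng.normal(size=n)
y = rng.normal(size=n)
x = np.column_stack([np.ones(n), x_obs])

# (1/n) x'y: first entry is the mean of y, second the mean of x*y (Eq. 25)
v = (x.T @ y) / n
```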