Regression Course Reflections: The Hat Matrix and Influence
Introduction
What is Regression and Linear Regression?
Regression is a statistical method for estimating a response variable from one or more predictors in the sample data. Linear regression assumes a linear relationship between the response variable and the predictor(s), and estimates the parameters of that linear relationship.
When I first encountered regression in an Introduction to Machine Learning course on Coursera, years ago, I simply used the scikit-learn function and I had two vague notions of how to measure the quality of my fitted line. For example, one was that the $R^2$ value should tend to 1 to indicate a good fit. In my final semester of college, I decided to take Regression because I was curious about how this seemingly simple subject had the depth to be covered over a whole semester. I'm glad I took this course, because it ended up being my first in-depth introduction to Statistics and I loved it!
The technique for deriving Simple Linear Regression (one predictor) and Multiple Linear Regression was day one material. The main objective of the class was analyzing our models and measuring how good the fit is. In addition, we did many proofs backing up the properties of different attributes. These properties were mainly used to derive the variance and then the standard error of various quantities (coefficients, mean response, etc.). The standard error was then used to do hypothesis testing to figure out whether the estimated parameters indicate a linear relationship, as well as to construct confidence intervals for the population parameters or predictions.
While there were many interesting topics covered in this course, today I will be talking about the hat matrix and the concept of leverage, and how you can use them to identify whether a point is high leverage or an outlier or both, deeming it influential in a sense*!
Additionally, while writing this post I refer to my notes which are derived from Dr. Maggie Cheng’s Fall 2020 Regression course. I also use the course textbook and the stat462 module by Penn State, which explain these concepts better than I could!
Derivation of the Hat Matrix
Suppose that the $n \times (p+1)$ dimensional matrix $X$ is the matrix representation of your $p$-dimensional (number of features) sample data and a constant column of $1$s for the intercept. The sample looks like this:

$$
X = \begin{bmatrix}
1 & x_{11} & \cdots & x_{1p} \\
1 & x_{21} & \cdots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & \cdots & x_{np}
\end{bmatrix}
$$
$\hat{Y} = X\hat{\beta}$, where $\hat{Y}$ is the fitted response variable and $\hat{\beta}$ is the vector representation of the coefficients. After minimizing the sum of squared errors (the sum of squared residuals), the coefficients are solved as $\hat{\beta} = (X^TX)^{-1}X^TY$. Substituting this into the linear regression equation, we can derive the $H$ matrix:

$$
\hat{Y} = X\hat{\beta} = X(X^TX)^{-1}X^TY = HY, \qquad H = X(X^TX)^{-1}X^T
$$
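To make this concrete, here is a minimal numpy sketch on made-up data (the names `rng`, `beta_true`, `Y_hat`, and the data itself are my own illustration, not the course's code):

```python
import numpy as np

# Made-up sample: n = 20 observations, p = 2 predictors, plus a constant column.
rng = np.random.default_rng(0)
n, p = 20, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Hat matrix: H = X (X^T X)^{-1} X^T
H = X @ np.linalg.inv(X.T @ X) @ X.T

# The fitted responses are a linear transformation of the observed responses.
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y   # least-squares coefficients
Y_hat = X @ beta_hat
assert np.allclose(Y_hat, H @ Y)
```

(In real code you would solve the least-squares problem with `np.linalg.lstsq` rather than explicitly inverting $X^TX$; the explicit inverse here just mirrors the formula.)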
As you see above, the hat matrix $H$ allows the fitted response variable $\hat{Y}$ to be written as a linear transformation of $Y$, the observed response values. To understand the significance of this, we will rewrite the fitted value for a single observation:
$$
\hat{y}_i = \sum_{j=1}^{n} h_{ij} y_j = h_{ii} y_i + \sum_{j \neq i} h_{ij} y_j
$$
Notice $h_{ii}$ in the above equation. $h_{ii}$ is also known as the leverage and measures the effect of the observed response $y_i$ on the fitted response $\hat{y}_i$. The larger the $h_{ii}$, the greater the effect. This also means that a point with a small leverage and a large observed response, even if it's an outlier, would end up having a small effect on the calculated fitted value.
Properties of the hat matrix, $H$, and leverage, $h_{ii}$
$H$ is symmetric ($H^T = H$) and idempotent ($HH = H$). These properties are used to prove certain properties of $h_{ii}$ and to derive the variance of the residual, $e_i$.
Symmetric Proof
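A sketch of the argument, using the transpose rule $(ABC)^T = C^T B^T A^T$ and the fact that $X^TX$ is symmetric (so its inverse is symmetric too):

$$
H^T = \left( X (X^TX)^{-1} X^T \right)^T
    = X \left( (X^TX)^{-1} \right)^T X^T
    = X \left( (X^TX)^T \right)^{-1} X^T
    = X (X^TX)^{-1} X^T = H
$$

Idempotency follows similarly, since the $X^T X (X^TX)^{-1}$ in the middle cancels:

$$
HH = X (X^TX)^{-1} X^T X (X^TX)^{-1} X^T = X (X^TX)^{-1} X^T = H
$$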
- $h_{ii}$ is between $[0, 1]$
- the trace of $H$, or $\sum_{i=1}^{n} h_{ii}$, is $p + 1$, the number of parameters in the model
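These properties are easy to check numerically, continuing the hypothetical numpy snippet from above:

```python
# Leverages are the diagonal entries of the hat matrix.
leverages = np.diag(H)

assert np.allclose(H, H.T)                           # symmetric
assert np.allclose(H @ H, H)                         # idempotent
assert np.all((leverages >= 0) & (leverages <= 1))   # each h_ii is in [0, 1]
assert np.isclose(np.trace(H), p + 1)                # trace = number of parameters
```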
Analysis of residuals
With a single predictor it may be easy to plot the data (box plot, scatter plot, residual plot) and figure out which values exert influence and even which points are outliers. However, with multiple predictors it becomes difficult to analyze visually. Here we turn to analytical tools and use leverages and residuals.
Standardized residual
We use measures like the standardized residual to identify potential outliers. A raw residual by itself doesn't tell us whether it is unusually large without something to compare it against. So we calculate the variance of the residual, starting from the residual vector:

$$
e = Y - \hat{Y} = Y - HY = (I - H)Y
$$
Note the variance of the observed response is the error variance, $\sigma^2$, which is estimated by the $MSE$.
At this point you can prove that $(I - H)$ is both symmetric and idempotent, like the hat matrix. The fact that it is idempotent is useful when deriving the variance:

$$
\mathrm{Var}(e) = (I - H)\,\mathrm{Var}(Y)\,(I - H)^T = (I - H)\,\mathrm{Var}(Y)\,(I - H)
$$

Here we can substitute $\mathrm{Var}(Y)$ with $\sigma^2 I$:

$$
\mathrm{Var}(e) = \sigma^2 (I - H)(I - H) = \sigma^2 (I - H)
$$

The variance of the $i$th residual is $\mathrm{Var}(e_i) = \sigma^2 (1 - h_{ii})$.
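To see this variance formula in action, here is a small simulation sketch (reusing the made-up `X`, `H`, `beta_true`, `rng`, and `leverages` from the snippets above): draw many response vectors with error standard deviation $\sigma = 0.3$ and compare the empirical variance of each residual to $\sigma^2(1 - h_{ii})$.

```python
# Simulate many datasets with the same X and check Var(e_i) ≈ sigma^2 * (1 - h_ii).
sigma = 0.3
I_n = np.eye(n)
residual_draws = np.stack([
    (I_n - H) @ (X @ beta_true + rng.normal(scale=sigma, size=n))
    for _ in range(20_000)
])
print(np.allclose(residual_draws.var(axis=0), sigma**2 * (1 - leverages), rtol=0.05))
# Prints True: the empirical variances match sigma^2 * (1 - h_ii).
```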
This means that points with higher leverage have a smaller residual standard error. This makes way for the standardized residual formula, where the residual is normalized by its estimated standard error: $r_i = \frac{e_i}{\sqrt{MSE(1 - h_{ii})}}$. A point with higher leverage and raw residual $e$ will therefore produce a larger standardized residual than a point with lower leverage and the same raw residual $e$.
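In code, continuing the same hypothetical example:

```python
# Standardized (internally studentized) residuals: r_i = e_i / sqrt(MSE * (1 - h_ii))
e = Y - Y_hat                    # raw residuals
sse = np.sum(e**2)
mse = sse / (n - p - 1)          # estimates sigma^2 (p predictors + intercept)
r = e / np.sqrt(mse * (1 - leverages))
```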
An influential outlier (high leverage + outlier) can steer the line towards itself, which means its residual would be low when in reality the point should be flagged as an outlier. With one predictor this can easily be spotted in a scatter plot, but with more dimensions this is infeasible. So how can we deal with this?
The studentized residual
For this measure, we calculate the deleted residual, $d_i = y_i - \hat{y}_{(i)}$, where $\hat{y}_{(i)}$ is the prediction for the $i$th point after that point is removed from the fit.
After some derivations, this ends up being:

$$
t_i = \frac{d_i}{s(d_i)} = \frac{e_i}{\sqrt{MSE_{(i)}(1 - h_{ii})}}
$$

where $MSE_{(i)}$ is the mean squared error of the fit with the $i$th point deleted.
This measure allows for a better sense of what's really an outlier by accounting for the influence of a high leverage + outlier point. You might be wondering whether it's computationally intensive to refit the line for each point, but $t_i$ can be rewritten in terms of quantities from the original fit.**
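That rewrite isn't derived in this post (see the second footnote), but the standard closed form from regression texts is $t_i = e_i \sqrt{\frac{n - p - 2}{SSE(1 - h_{ii}) - e_i^2}}$. Here is a sketch comparing it against a brute-force leave-one-out refit, again on the hypothetical data from the earlier snippets:

```python
# Closed form: only needs residuals, SSE, and leverages from the original fit.
t_closed = e * np.sqrt((n - p - 2) / (sse * (1 - leverages) - e**2))

# Brute force: refit without point i, then standardize the deleted residual.
t_brute = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    Xi, Yi = X[keep], Y[keep]
    beta_i = np.linalg.lstsq(Xi, Yi, rcond=None)[0]
    resid_i = Yi - Xi @ beta_i
    mse_i = np.sum(resid_i**2) / ((n - 1) - p - 1)   # MSE with point i deleted
    se_pred = np.sqrt(mse_i * (1 + X[i] @ np.linalg.inv(Xi.T @ Xi) @ X[i]))
    t_brute[i] = (Y[i] - X[i] @ beta_i) / se_pred

assert np.allclose(t_closed, t_brute)
```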
Conclusion
Approached from a statistical point of view, as opposed to machine learning, the same concepts come with a deeper concern for the robustness of the model and for developing diagnostics to assess the quality of the fit.
What’s Next? + Reflecting
This post is for no particular target audience and was more of an exercise for myself, which is not a bad thing, because writing it helped me revisit assumptions I made when going through the course the first time. However, this was a bit out of my comfort zone because I'm more familiar with programming than I am with math or statistics. In fact, statistics has been one of the most confusing things I've encountered. I think it was a problem of lacking intuition. However, the more I struggle with statistics, the more I am intrigued and the more I learn.
I will be making a part two of this post regarding link functions in Generalized Linear Models. This was a topic that was briefly covered near the end of class and we didn't get homework problems or exam questions on it, so I'm not sure how much I really understand. I will also be updating this post with plots :)
I might also cover Stepwise Regression and subset analysis for building models and selecting predictors (otherwise known as feature selection).
Short note/question about equal variance of residuals
When a linear regression model is appropriate for a dataset, the residuals (the differences between the observed and fitted responses) have equal variance. Does this mean that each $\mathrm{Var}(e_i) = \sigma^2(1 - h_{ii})$ needs to hover around $\sigma^2$? Do high/low leverage points violate the E in the LINE conditions?
Cite
This course was taught by Dr. Maggie Cheng and we used the Analysis of Regression textbook. A lifesaver for this course was: https://online.stat.psu.edu/stat462/node/116/
*Note that the word "influence" is used in reference to a sample point that influences the parameters (and therefore the fitted values as well) rather than one that just influences the MSE. A point could inflate the MSE and this might not be captured by measures like the studentized or standardized residuals. One could say that inflating the MSE makes the point influential, so you should use multiple metrics to find problematic points; this way, if one metric misses a point it can be flagged by another.
**not provided here