# Chapter 3

In linear regression we assume that the true $f(x)$ is linear, i.e. can be modeled by $f(x) = B\_0 + B\_1X\_1 + ... + B\_pX\_p$. We estimate $f(x)$ with $\\hat f(x) = \\hat B\_0 + \\hat B\_1X\_1 + ... + \\hat B\_pX\_p$.

## Estimating Coefficients

Simple linear regression is a parametric problem, in that we need to estimate parameters $B\_0$ and $B\_1$.

Simple linear regression fits a line that is as close to the data points as possible. We quantify closeness with the residual sum of squares (RSS), defined as:

$RSS = \\sum\_{i = 1}^{n}(y\_i - \\hat f(x\_i))^2 = \\sum\_{i = 1}^{n}(y\_i - \\hat B\_0 - \\hat B\_1x\_i)^2$

We want to find the parameters that minimize RSS! The estimated parameters are

$\\hat B\_1 = \\frac{\\sum\_{i=1}^{n}(x\_i - \\bar x)(y\_i - \\bar y)}{\\sum\_{i=1}^{n}(x\_i - \\bar x)^2}$
$\\hat B\_0 = \\bar y - \\hat B\_1 \\bar x$

where $\\bar y$ and $\\bar x$ are the averages over all x and y values respectively.
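The closed-form estimates above can be sketched directly in numpy; the data here is made up for illustration.

```python
import numpy as np

# Toy data, roughly y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_bar, y_bar = x.mean(), y.mean()

# Slope: covariance of x and y divided by variance of x
b1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
# Intercept: forces the fitted line through (x_bar, y_bar)
b0_hat = y_bar - b1_hat * x_bar
```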

## Assessing Accuracy of Coefficient Estimates

Hypothesis testing: Let there be a null hypothesis $H\_0$ and an alternative hypothesis $H\_a$. After running some tests, if the results are statistically significant in favor of $H\_a$, then we reject $H\_0$.

In the case of assessing coefficient estimates, we let $H\_0: B\_1 = 0$ and $H\_a: B\_1 \\neq 0$.

In order to judge whether or not $B\_1$ might be 0, we can use a t-statistic, a measurement drawn from analysis of our estimations.

$t = \\frac{\\hat B\_1 - 0}{SE(\\hat B\_1)}$

The t-statistic measures the number of standard errors $\\hat B\_1$ is away from 0, and follows a student's t-distribution under $H\_0$. If $|t|$ is large, it indicates that $B\_1$ is likely non-zero. More precisely, if the p-value associated with the t-statistic is small, such a large $|t|$ is unlikely to have arisen by chance, so $B\_1$ is likely non-zero. Generally we use $|t| > 2$ as our threshold for rejecting $H\_0$.
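A minimal sketch of the t-statistic with made-up data, assuming the standard formula $SE(\\hat B\_1)^2 = \\sigma^2 / \\sum(x\_i - \\bar x)^2$ with $\\sigma^2$ estimated by $RSS/(n-2)$:

```python
import numpy as np

# Nearly linear toy data, so the slope should be clearly non-zero
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 6.1])

n = len(x)
x_bar = x.mean()
b1 = np.sum((x - x_bar) * (y - y.mean())) / np.sum((x - x_bar) ** 2)
b0 = y.mean() - b1 * x_bar

resid = y - (b0 + b1 * x)
rss = np.sum(resid ** 2)
# Standard error of the slope estimate
se_b1 = np.sqrt(rss / (n - 2) / np.sum((x - x_bar) ** 2))

t = b1 / se_b1  # large |t| -> evidence that B1 != 0
```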

## Assessing Model Accuracy

We've rejected $B\_1 = 0$. Now we want to quantify how well the model fits the data.

Residual Standard Error: $RSE = \\sqrt{\\frac{1}{n-2}RSS}$; the average amount that the response deviates from the true regression line; it is an estimate of the standard deviation of $\\epsilon$.

$R^2$ statistic: $R^2 = \\frac{TSS - RSS}{TSS}$; proportion of variance explained by the linear model, where $TSS = \\sum(y\_i - \\bar y)^2$

• TSS is the total variability in the response, using $\\bar y$ as our predictor.
• RSS is the variability left unexplained after the regression.
• TSS - RSS is the variability that the regression explains.

$R^2$ has an interpretive advantage over RSE since it's always between 0 and 1.
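The two fit measures can be sketched from a set of fitted values; the responses and fitted values below are invented.

```python
import numpy as np

y = np.array([2.0, 4.1, 5.9, 8.2, 10.0])
y_hat = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # pretend fitted values

rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)

rse = np.sqrt(rss / (len(y) - 2))   # n - 2 degrees of freedom in simple regression
r2 = (tss - rss) / tss              # fraction of variance explained
```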

Can use confidence intervals:

• We are 95% confident that the true value of $B\_1$ lies in $[\\hat B\_1 - 2SE(\\hat B\_1), \\hat B\_1 + 2SE(\\hat B\_1)]$, since roughly 2.5% of a student's t-distribution lies above $t = 2$ and 2.5% below $t = -2$.

## Multiple Linear Regression

Now, the prediction takes the form $\\hat Y = \\hat B\_0 + \\hat B\_1X\_1 + ... + \\hat B\_pX\_p$

$B\_j$: the average effect on $Y$ of a unit increase in $X\_j$, holding all other regressors fixed

Why not just run a simple linear regression for each regressor?

1. Can't see effect of 1 regressor in presence of others
2. Each equation will ignore other regressors
3. Can't account for correlations between regressors

We estimate regression coefficients by minimizing RSS, given by

$RSS = \\sum\_{i = 1 }^{n}(Y\_i - \\hat f(x\_i))^2 = \\sum\_{i = 1 }^{n}(Y\_i - \\hat B\_0 - \\hat B\_1X\_{i1} - ... - \\hat B\_pX\_{ip})^2$
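Minimizing this RSS is exactly the least-squares problem that `np.linalg.lstsq` solves; a sketch with synthetic data (the coefficients and noise level are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
true_beta = np.array([1.0, 2.0, -1.0, 0.5])    # intercept first

X1 = np.column_stack([np.ones(n), X])           # prepend an intercept column
y = X1 @ true_beta + 0.1 * rng.normal(size=n)   # response with small noise

# Coefficients minimizing RSS
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
rss = np.sum((y - X1 @ beta_hat) ** 2)
```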

### Correlation between regressors is an issue (example from ISLR)

Newspaper advertising showed high correlation with sales in a simple linear regression setting. However, newspaper advertising's effect was not statistically significant in the presence of radio and TV advertising. In fact, since newspaper advertising was heavily correlated with radio advertising, it received the credit for radio advertising's effect (in the simple linear regression setting).

We can use t-statistics to find out whether a regressor has a statistically significant effect in the presence of other regressors, as seen in the corresponding table from ISLR.

## Answering: Is there any relationship between the response and any predictor?

Let $H\_0: B\_1 = B\_2 = ... = B\_p = 0$ and $H\_a: B\_j \\neq 0$ for at least one $j$.

We can compute the F-statistic and its corresponding p-value, and then decide whether to reject the null hypothesis. Note that we can't just threshold the F-statistic itself to reject $H\_0$: how large it needs to be depends on $n$ and $p$, which is what the p-value accounts for.

$F = \\frac{(TSS - RSS) / p}{RSS/(n-p-1)}$
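The F-statistic can be sketched with the same least-squares machinery; the data below is made up, with only one predictor actually mattering.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 60, 2
X = rng.normal(size=(n, p))
y = 3.0 + 1.5 * X[:, 0] + 0.1 * rng.normal(size=n)   # only X1 has an effect

X1 = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

rss = np.sum((y - X1 @ beta_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)

# Explained variance per predictor vs leftover variance per residual d.o.f.
f_stat = ((tss - rss) / p) / (rss / (n - p - 1))
```

Since $X\_1$ has a strong effect relative to the noise, the F-statistic comes out far above 1 and we would reject $H\_0$.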

## Answering: Is there any relationship between a subset of regressors and the response?

• Select a subset of size $q$ from the coefficients.
• Run a hypothesis test where $H\_0: B\_{p-q+1} = B\_{p-q+2} = ... = B\_p = 0$.
• Compute $RSS\_0$ from a linear regression that uses all regressors except those $q$.
• Use $F = \\frac{(RSS\_0 - RSS) / q}{RSS/(n-p-1)}$ and use the associated p-value to determine the statistical significance of the subset of regressors.
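The steps above can be sketched as follows; the data, the choice $q = 1$, and the decision to drop the last column are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q = 80, 3, 1
X = rng.normal(size=(n, p))
# X3 carries no signal, so the partial F-statistic should usually be small
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.2 * rng.normal(size=n)

full = np.column_stack([np.ones(n), X])
reduced = full[:, :-q]                       # drop the last q regressors

def rss_of(design, y):
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return np.sum((y - design @ beta) ** 2)

rss = rss_of(full, y)
rss0 = rss_of(reduced, y)
f_stat = ((rss0 - rss) / q) / (rss / (n - p - 1))
```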

Note: we still need to look at the overall F-statistic even if we have individual p-values for each regressor. When $p$ is large, some (about 5%) of the p-values will be small just by chance. The F-statistic adjusts for the number of predictors.

## Deciding on Important Regressors

• Forward Selection:
1. Begin with a model containing only the intercept
2. Fit $p$ simple linear regressions and add to the model the variable that results in the lowest RSS
3. Repeat step 2, each time adding the not-yet-included variable that most reduces RSS, until some stopping rule is satisfied
• Backward Selection:
1. Begin with all regressors
2. Run regression
3. Remove variable with largest p-value
4. Repeat 2 - 3 until stopping rule satisfied (i.e: all regressors meet a p-value threshold)
• Mixed Selection:
1. Begin with no variables and add them one at a time as in forward selection
2. Since p-values can increase as new regressors are added to the model, remove any regressor whose p-value exceeds a certain threshold
3. Continue until all regressors in the model have sufficiently low p-values
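Backward selection can be sketched in a few lines. As a simplification, this uses the $|t| > 2$ rule of thumb from earlier in place of exact p-values; the data and threshold are illustrative, and the intercept is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
X = rng.normal(size=(n, 4))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * rng.normal(size=n)  # cols 2, 3 are noise

def t_stats(X, y):
    """t-statistic for each coefficient in an OLS fit (no intercept)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta / np.sqrt(np.diag(cov))

active = list(range(X.shape[1]))
while active:
    t = t_stats(X[:, active], y)
    worst = np.argmin(np.abs(t))
    if abs(t[worst]) > 2:      # every remaining regressor looks significant
        break
    active.pop(worst)          # drop the least significant regressor, refit
```

The two genuine predictors (columns 0 and 1) should always survive; the noise columns are usually removed.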

## Model Fit

Note: $R^2$ will always increase when more regressors are added to the linear model, because extra regressors let the model fit the training data more closely. One way to use this fact: if adding a regressor increases the model's $R^2$ by only a minuscule amount, the variable likely does not have a statistically significant effect on the response once the other regressors in the model are accounted for.

Note: $RSE = \\sqrt{\\frac{1}{n-p-1}RSS}$. Models with more regressors can have higher RSE if the decrease in RSS is small relative to the increase in $p$.

## Sources of Uncertainty in Prediction

1. Predicted least squares plane is only an estimate for true population regression plane.
1. We can use confidence intervals to determine how close $\\hat Y$ is to $f(x)$
2. Model bias - Assuming data is linear when it might not be
3. Response can't be predicted perfectly because of irreducible error

### Prediction intervals

We can construct confidence intervals to quantify the uncertainty in the average response over a large number of observations, and a prediction interval to quantify the uncertainty in a specific prediction. Prediction intervals are wider than confidence intervals because they also account for the irreducible error $\\epsilon$.

## 1. Qualitative Predictors

Some regressors will be qualitative. We can capture the effect of these regressors using indicator variables. For example, if we wish to measure the effect of gender (a variable with two levels, male and female), we can create a variable $x\_i$, where

$x\_i = \\begin{cases} 1 & male\\\\ 0 & female \\end{cases}$

Then, the regression becomes

$y\_i = B\_0 + B\_1x\_i + \\epsilon\_i = \\begin{cases} B\_0 + B\_1 + \\epsilon\_i & male\\\\ B\_0 + \\epsilon\_i & female \\end{cases}$

Which level gets assigned 0 or 1 is somewhat arbitrary. This dummy variable models the effect of the difference between male and female. For example, if the p-value for $\\hat B\_1$ is not statistically significant, then we fail to reject the null hypothesis and conclude that an observation being male or female doesn't affect output (the difference between male and female is not statistically significant). In this case, $B\_0$ can be interpreted as the average output for females and $B\_1$ as the average amount by which males are above females.

However, the choice of encoding affects the interpretation of the results. For example, we could have modeled the same data with $x\_i = 1$ for males and $x\_i = -1$ for females, giving

$y\_i = \\begin{cases} B\_0 + B\_1 + \\epsilon\_i & male\\\\ B\_0 - B\_1 + \\epsilon\_i & female \\end{cases}$

Now, $B\_0$ is the overall average output, and $B\_1$ tells us the amount by which males are above the average and females below it.

Note: Different choices of encoding lead to different interpretations.

We can also encode qualitative predictors with more than two levels by using multiple dummy variables. For example, if we want to encode an observation being either an Engineering, College, or Wharton student, we can let:

$x\_{i1} = \\begin{cases} 1 & \\text{Engineering} \\\\0 & \\text{not Engineering}\\end{cases}$
$x\_{i2} = \\begin{cases} 1 & \\text{College} \\\\0 & \\text{not College}\\end{cases}$

Now, our regression model becomes

$y = \\begin{cases} B\_0 + B\_1 + \\epsilon\_i && \\text{if Engineering} \\\\ B\_0 + B\_2 + \\epsilon\_i && \\text{if College} \\\\ B\_0 + \\epsilon\_i && \\text{if Wharton} \\end{cases}$

In this case, the Wharton category is the baseline. $B\_0$ is the average output for Wharton students, $B\_1$ is the average amount (in units of response) by which Engineering students are above Wharton students, and $B\_2$ is the average amount by which College students are above Wharton students.
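The three-level encoding above can be sketched with explicit indicator columns; the labels are from the example, the observations are made up (in practice a helper like pandas' `get_dummies` does this).

```python
import numpy as np

schools = np.array(["Engineering", "College", "Wharton", "Engineering", "Wharton"])

# Wharton is the baseline, so it gets no dummy column
x1 = (schools == "Engineering").astype(float)   # 1 if Engineering, else 0
x2 = (schools == "College").astype(float)       # 1 if College, else 0

# Design matrix: intercept, Engineering dummy, College dummy
design = np.column_stack([np.ones(len(schools)), x1, x2])
```

A Wharton observation's row is `[1, 0, 0]`, so its fitted value is just $\\hat B\_0$.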

## 2. Extending the Linear Model

We can model interaction effects between different regressors by adding an interaction term. Let's say we begin with the model $Y = B\_0 + B\_1X\_1 + B\_2X\_2 + \\epsilon$. If $X\_1$ and $X\_2$ seem to have a synergistic effect, then we can add a third term $X\_1X\_2$ to the model. Our model now becomes $Y = B\_0 + B\_1X\_1 + B\_2X\_2 + B\_3X\_1X\_2 + \\epsilon$. Doing so relaxes the additive assumption in the following way:

Note that we can re-write our model as $Y = B\_0 + (B\_1 + B\_3X\_2)X\_1 + B\_2X\_2 + \\epsilon$. Thus, a unit increase in $X\_2$ actually affects the amount by which $X\_1$ affects $Y$.
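An interaction term is just one more column in the design matrix, the elementwise product $X\_1X\_2$; a sketch with invented coefficients and noise:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# True model includes a synergistic X1*X2 term with coefficient 1.5
y = 1.0 + 2.0 * x1 + 3.0 * x2 + 1.5 * x1 * x2 + 0.1 * rng.normal(size=n)

design = np.column_stack([np.ones(n), x1, x2, x1 * x2])  # interaction column
beta, *_ = np.linalg.lstsq(design, y, rcond=None)        # beta[3] estimates B3
```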

If the effect of the interaction term is statistically significant and the true model is not purely additive, we should see an increase in the $R^2$ value of the model with the interaction term included.

Remember the hierarchical principle: include the base terms $X\_1$ and $X\_2$ in the model whenever we include the interaction term $X\_1X\_2$, even if neither $X\_1$ nor $X\_2$ has a statistically significant p-value.

Note: to include interaction effects between qualitative and quantitative variables, we can simply multiply their regressors to form an interaction term as before.

### Non-Linear Relationships

Simply include some transformation of an existing regressor, e.g. an $X^2$ term to capture a quadratic relationship. The model remains linear in the coefficients, so least squares still applies.
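For instance, fitting a quadratic is just least squares with an extra $x^2$ column (the data below is synthetic):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(-2, 2, size=100)
# True relationship is quadratic with coefficient 2.0 on x^2
y = 1.0 + 0.5 * x + 2.0 * x**2 + 0.1 * rng.normal(size=100)

design = np.column_stack([np.ones_like(x), x, x**2])  # transformed regressor
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
```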

## 3. Potential Problems

1. Non-linearity of response-predictor relationships
2. Correlation of error terms
1. Time-series data is a common example of correlated error terms.
2. Correlation is an issue: if there is correlation, estimated standard errors tend to underestimate true standard errors, resulting in confidence intervals that are narrower than they should be.
3. Non-constant variance of error terms
1. Can transform the response (e.g. $\\log Y$ or $\\sqrt Y$) to make the variance of the error terms more constant (when residuals grow with the fitted values).
4. Outliers
1. Possible that outliers may not have a large effect on least squares slope or intercept. However, they might have a significant effect on the RSE or $R^2$ statistic.
2. Can identify outliers by looking at residual plots. Use studentized residuals: if an observation's studentized residual exceeds 3 in absolute value, we can classify it as an outlier (a general rule of thumb).
5. High-leverage points
1. Outliers are points for which $y\_i$ is unusual given $x\_i$. High-leverage points are those for which $x\_i$ itself is unusual. Removing high-leverage points can have a substantial impact on the slope and intercept of the least squares line.

2. Fairly easy to identify high-leverage points in simple linear regression.

3. In multiple linear regression, it's important to look at the predictors holistically to find high-leverage points.

Notice how, in ISLR's figure, the red point does not have an unusual value for either $X\_1$ or $X\_2$ individually; however, the combined $(X\_1, X\_2)$ value is unusual.

4. Leverage statistic given by

$h\_i = \\frac{1}{n} + \\frac{(x\_i - \\bar x)^2}{\\sum\_{j=1}^{n}(x\_j - \\bar x)^2}$

Noting that the average leverage for all observations is $\\frac{p+1}{n}$, an observation with a leverage statistic that greatly exceeds $\\frac{p+1}{n}$ could be classified as a high-leverage point.
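The leverage statistic above can be sketched directly; the last $x$ value below is deliberately extreme so that it stands out.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 20.0])   # last point is far out in x
n = len(x)
x_bar = x.mean()

# Leverage of each observation (simple linear regression formula)
h = 1.0 / n + (x - x_bar) ** 2 / np.sum((x - x_bar) ** 2)

avg_leverage = (1 + 1) / n                  # (p + 1) / n with p = 1
```

The leverages always sum to $p + 1$, and the extreme point's leverage greatly exceeds the average, flagging it as high-leverage.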

6. Collinearity
 1. When two or more regressors are highly correlated, it is hard to separate their individual effects on the response; the standard errors of the affected coefficients grow, so we may fail to reject $H\_0: B\_j = 0$ even when the regressor matters.

## Comparing KNN with Linear Regression

KNN Regression: Given a value for $K$ and a prediction point $x\_0$, KNN regression identifies the $K$ training observations that are closest to $x\_0$ (denoted $N\_0$). It then estimates $f(x\_0)$ using the average of the training responses in $N\_0$. Formally, $\\hat f(x\_0) = \\frac{1}{K}\\sum\_{x\_i\\in N\_0}y\_i$
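A minimal KNN regression sketch matching that definition; the training data is made up.

```python
import numpy as np

def knn_predict(X_train, y_train, x0, k):
    """Average the responses of the k training points closest to x0."""
    dists = np.linalg.norm(X_train - x0, axis=1)
    nearest = np.argsort(dists)[:k]        # indices of the k closest points
    return y_train[nearest].mean()

X_train = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
y_train = np.array([0.0, 1.0, 2.0, 3.0, 10.0])

# The 3 nearest neighbors of 1.2 are x = 1, 2, 0, so the prediction is 1.0
pred = knn_predict(X_train, y_train, np.array([1.2]), k=3)
```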

The optimal value of $K$ depends on the bias-variance tradeoff. For small $K$ (e.g. $K=1$), the variance is high but the bias extremely low: with $K=1$, each prediction is based entirely on a single training observation.

Parametric methods outperform non-parametric methods when the parametric methods make correct assumptions about the form of the data. For example, if we fit a linear model to a fairly linear dataset, least squares regression will likely fit the data better than KNN regression. However, as the data becomes more non-linear, non-parametric methods such as KNN regression fit the data better.

Note: KNN regression suffers from the curse of dimensionality: as the number of dimensions increases, the $K$ observations nearest to $x\_0$ tend to be further and further away from $x\_0$.