Chapter 3
In linear regression we assume that the true $f(x)$ is linear, or can be modeled by $f(x) = B\_0 + B\_1X\_1 + ... + B\_pX\_p$. We estimate $f(x)$ with $\\hat f(x) = \\hat B\_0 + \\hat B\_1X\_1 + ... + \\hat B\_pX\_p$.
Estimating Coefficients
Simple linear regression is a parametric problem, in that we need to estimate parameters $B\_0$ and $B\_1$.
Simple linear regression fits a line that is as close to the datapoints as possible. We quantify closeness with RSS, defined as:
$RSS = \\sum\_{i = 1}^{n}(y\_i - \\hat f(x\_i))^2 = \\sum\_{i = 1}^{n}(y\_i - \\hat B\_0 - \\hat B\_1x\_i)^2$

We want to find the parameters that minimize RSS! The estimated parameters are
$\\hat B\_1 = \\frac{\\sum\_{i=1}^{n}(x\_i - \\bar x)(y\_i - \\bar y)}{\\sum\_{i=1}^{n}(x\_i - \\bar x)^2}$

$\\hat B\_0 = \\bar y - \\hat B\_1 \\bar x$

where $\\bar x$ and $\\bar y$ are the averages over all x and y values respectively.
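The estimators above can be sketched directly in numpy (a minimal illustration; the function and variable names are my own):

```python
import numpy as np

def fit_simple_ols(x, y):
    """Least squares estimates for the model y = B0 + B1*x."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    # B1-hat: sample covariance of (x, y) over sample variance of x
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    # B0-hat: forces the fitted line through (x-bar, y-bar)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

# Points that lie exactly on y = 1 + 2x should be recovered perfectly.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
b0, b1 = fit_simple_ols(x, y)
```

With noiseless data as above, the estimates match the generating line exactly; with noisy data they minimize the RSS defined earlier.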
Assessing Accuracy of Coefficient Estimates
Hypothesis testing: Let there be a null hypothesis $H\_0$ and an alternative hypothesis $H\_a$. After running some tests, if we find results that are statistically significant in favor of $H\_a$, then we reject $H\_0$.
In the case of assessing coefficient estimates, we let $H\_0: B\_1 = 0$ and $H\_a: B\_1 \\neq 0$.
In order to judge whether or not $B\_1$ might be 0, we can use a t-statistic, a measurement drawn from analysis of our estimations.
$t = \\frac{\\hat B\_1 - 0}{SE(\\hat B\_1)}$

The t-statistic measures the number of standard errors that $\\hat B\_1$ is away from 0, referenced against a Student's t-distribution. If $|t|$ is large, it indicates that $B\_1$ is likely nonzero. More precisely, if the p-value associated with the t-statistic is small, it is likely that $B\_1$ is nonzero. Generally we use $|t| > 2$ as our threshold for rejecting $H\_0$.
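A small numpy sketch of this test, using the standard formula $SE(\\hat B\_1)^2 = \\hat\\sigma^2 / \\sum(x\_i - \\bar x)^2$ with $\\hat\\sigma^2 = RSS/(n-2)$ (names are my own):

```python
import numpy as np

def t_statistic_b1(x, y):
    """t-statistic for H0: B1 = 0 in a simple linear regression of y on x."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    b0 = y.mean() - b1 * x.mean()
    rss = np.sum((y - b0 - b1 * x) ** 2)
    se_b1 = np.sqrt((rss / (n - 2)) / sxx)   # standard error of B1-hat
    return b1 / se_b1

# Data with a clear slope should produce |t| far above the threshold of 2.
x = np.arange(10.0)
noise = np.array([0.1, -0.1] * 5)            # small fixed perturbations
y = 3.0 * x + noise
t = t_statistic_b1(x, y)
```

Converting $t$ to an exact p-value requires the t-distribution's CDF (e.g., from statistical software or a table), but the $|t| > 2$ rule of thumb is often enough at a glance.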
Assessing Model Accuracy
We've rejected $B\_1 = 0$. Now we need to see how close our estimate is to the true $B\_1$.
Residual Standard Error: $RSE = \\sqrt{\\frac{1}{n-2}RSS}$; the average amount that the response will deviate from the true regression line; is an estimate for the standard deviation of $\\epsilon$.
$R^2$ statistic: $R^2 = \\frac{TSS - RSS}{TSS}$; proportion of variance explained by a linear model, where $TSS = \\sum(y\_i - \\bar y)^2$
- TSS is the total variability in the response, with $\\bar y$ as our predictor.
- RSS is the variability left unexplained after the regression.
- TSS - RSS is the variability that the regression explains.
$R^2$ has an interpretational advantage over RSE since it's always between 0 and 1.
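A quick numpy illustration of RSE and $R^2$ on made-up responses and fitted values:

```python
import numpy as np

# Hypothetical responses and predictions from some fitted simple regression.
y    = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
yhat = np.array([1.1, 1.9, 3.2, 3.8, 5.0])

n = len(y)
rss = np.sum((y - yhat) ** 2)          # unexplained variability
tss = np.sum((y - y.mean()) ** 2)      # total variability (y-bar as predictor)
rse = np.sqrt(rss / (n - 2))           # estimate of the sd of epsilon
r2  = (tss - rss) / tss                # fraction of variance explained
```

Here the fit is close, so $R^2$ is near 1; RSE, by contrast, is in the units of the response, which is why $R^2$ is easier to interpret across problems.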
Can use confidence intervals:
- 95% confident that the true value of $B\_1$ lies in $[\\hat B\_1 - 2SE(\\hat B\_1), \\hat B\_1 + 2SE(\\hat B\_1)]$, since roughly 2.5% of the probability mass lies beyond $t = 2$ on each side of a Student's t-distribution.
Multiple Linear Regression
Now, the prediction takes the form $\\hat Y = \\hat B\_0 + \\hat B\_1X\_1 + ... + \\hat B\_pX\_p$
$B\_j$: the average effect on $Y$ of a one-unit increase in $X\_j$, holding all other regressors fixed
Why not just run a simple linear regression for each regressor?
- Can't see the effect of one regressor in the presence of the others
- Each equation will ignore the other regressors
- Can't account for correlations between regressors
We estimate regression coefficients by minimizing RSS, given by
$RSS = \\sum\_{i = 1}^{n}(y\_i - \\hat f(x\_i))^2 = \\sum\_{i = 1}^{n}(y\_i - \\hat B\_0 - \\hat B\_1X\_{i1} - ... - \\hat B\_pX\_{ip})^2$

Correlation between regressors is an issue. Example from ISLR:
Newspaper advertising showed a high correlation with sales in a simple linear regression setting. However, newspaper advertising's effect was not statistically significant in the presence of radio and TV advertising. In fact, since newspaper advertising was heavily correlated with radio advertising, it received the credit for radio advertising's effect (in the simple linear regression setting).
We can use t-statistics to find out whether or not a regressor has a statistically significant effect in the presence of the other regressors, as seen in the coefficient table from ISLR.
Answering: Is there any relationship between the response and any of the predictors?
Let $H\_0: B\_1 = B\_2 = ... = B\_p = 0$ and $H\_a: B\_j \\neq 0$ for some $j$.
We can compute the F-statistic, then compute the corresponding p-value to discern whether or not we should reject the null hypothesis. Note that we can't just use the raw F-statistic to reject $H\_0$; its size must be judged via the p-value, which depends on $n$ and $p$.
$F = \\frac{(TSS - RSS)/p}{RSS/(n-p-1)}$

Answering: Is there any relationship between a subset of the regressors and the response?
- Select a subset of size $q$ from the coefficients.
- Run a hypothesis test where $H\_0: B\_{p-q+1} = B\_{p-q+2} = ... = B\_p = 0$.
- Compute an $RSS\_0$ from a linear regression with those $q$ regressors removed.
- Use $F = \\frac{(RSS\_0 - RSS)/q}{RSS/(n-p-1)}$ and the associated p-value to determine the statistical significance of the subset of regressors.
Note: We still need to look at the overall F-statistic even if we have individual p-values for each regressor. When $p$ is large, about 5% of the p-values will fall below 0.05 just by chance. The F-statistic, however, adjusts for the number of predictors.
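The overall F-statistic is easy to compute from summary quantities. A sketch with illustrative numbers (in the spirit of ISLR's Advertising data, but the values here are stand-ins):

```python
# Overall F-statistic for H0: B1 = B2 = ... = Bp = 0.
n, p = 200, 3                 # observations and predictors (illustrative)
tss, rss = 5417.1, 556.8      # total and residual sums of squares (illustrative)

# F compares explained variance per predictor to residual variance per
# remaining degree of freedom; near 1 suggests no relationship.
f = ((tss - rss) / p) / (rss / (n - p - 1))
```

A value of $F$ this far above 1 would be overwhelming evidence against $H\_0$; the p-value would come from the F-distribution with $(p, n-p-1)$ degrees of freedom.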
Deciding on Important Regressors
- Forward Selection:
  1. Begin with a model containing only the intercept
  2. Fit $p$ simple linear regressions
  3. Add the regressor with the smallest RSS to the model
  4. Repeat steps 2-3 (without repeating regressors) until some stopping rule is satisfied
- Backward Selection:
  1. Begin with all regressors
  2. Run the regression
  3. Remove the variable with the largest p-value
  4. Repeat steps 2-3 until a stopping rule is satisfied (e.g., all remaining regressors meet a p-value threshold)
- Mixed Selection:
  1. Start with no regressors and only the intercept
  2. Add the regressor giving the minimum RSS
  3. Since p-values can increase as new regressors are added to the model, remove any regressor whose p-value exceeds a certain threshold
  4. Continue until all regressors in the model have sufficiently low p-values
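The forward selection steps above can be sketched with numpy's least squares solver (a toy version with a fixed model size $k$ rather than a principled stopping rule; names are my own):

```python
import numpy as np

def forward_selection(X, y, k):
    """Greedily grow a model one regressor at a time, always adding the
    column of X whose inclusion yields the lowest RSS."""
    n, p = X.shape
    chosen = []
    for _ in range(k):
        best_j, best_rss = None, np.inf
        for j in range(p):
            if j in chosen:
                continue
            # Candidate design matrix: intercept + current + trial column
            A = np.column_stack([np.ones(n), X[:, chosen + [j]]])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = np.sum((y - A @ beta) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        chosen.append(best_j)
    return chosen

# Synthetic data where only column 2 drives the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = 5.0 * X[:, 2] + 0.1 * rng.normal(size=50)
picked = forward_selection(X, y, 1)
```

With the response driven by a single column, the first regressor selected should be that column.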
Model Fit
Note: $R^2$ will always increase when more regressors are added to the linear model, because additional regressors allow the model to fit the training data more closely. One way to use this fact: if a regressor added to a model increases the model's $R^2$ by only a minuscule amount, it is possible that the variable does not have a statistically significant effect on the response in the presence of the other regressors in the model.
Note: $RSE = \\sqrt{\\frac{1}{n-p-1}RSS}$. Models with more regressors can have a higher RSE if the decrease in RSS is small relative to the increase in $p$.
Sources of Uncertainty in Prediction
- The predicted least squares plane is only an estimate of the true population regression plane.
  - We can use confidence intervals to determine how close $\\hat Y$ is to $f(x)$.
- Model bias: assuming the data is linear when it might not be.
- The response can't be predicted perfectly because of irreducible error.
Prediction intervals
We can construct confidence intervals to quantify the uncertainty surrounding the average response over a large number of observations. We use a prediction interval to quantify the uncertainty of a specific prediction; prediction intervals are wider because they also account for the irreducible error.
Other Considerations in Regression Model
1. Qualitative Predictors
Some regressors will be qualitative. We can capture the effect of these regressors using indicator (dummy) variables. For example, if we wish to measure the effect of gender (a variable with only two levels, male and female), we can create a variable $x\_i$, where
$x\_i = \\begin{cases}
1 & male\\\\
0 & female
\\end{cases}$

Then, the regression becomes
$y\_i = B\_0 + B\_1x\_i + \\epsilon\_i = \\begin{cases}
B\_0 + B\_1 + \\epsilon\_i & male\\\\
B\_0 + \\epsilon\_i & female
\\end{cases}$

Which level gets assigned 0 or 1 is somewhat arbitrary. This dummy variable models the effect of the difference between male and female. For example, if $\\hat B\_1$ is not statistically significant, then we fail to reject the null hypothesis and say that an observation being male or female doesn't affect the output (the difference between male and female is not statistically significant). In this case, $B\_0$ can be interpreted as the average output for females and $B\_1$ as the average amount by which males are above females.
However, choosing the numbers 0 and 1 has an effect on the interpretation of the results. For example, we could have modeled the same data as
$y\_i = \\begin{cases}
B\_0 + B\_1 + \\epsilon\_i & male\\\\
B\_0 - B\_1 + \\epsilon\_i & female
\\end{cases}$

This corresponds to the coding $x\_i = 1$ for males and $x\_i = -1$ for females. Now, $B\_0$ is the overall average measure of output. $B\_1$ tells us the amount that females are below the average and males are above the average.
Note: Different choices for encodings lead to different interpretations.
We can also encode qualitative predictors with more than two levels by using multiple dummy variables. For example, if we want to encode an observation being either an Engineering, College, or Wharton student, we can let:
$x\_{i1} = \\begin{cases} 1 & \\text{Engineering} \\\\0 & \\text{not Engineering}\\end{cases}$

$x\_{i2} = \\begin{cases} 1 & \\text{College} \\\\0 & \\text{not College}\\end{cases}$

Now, our regression model becomes
$y = \\begin{cases}
B\_0 + B\_1 + \\epsilon\_i && \\text{if Engineering} \\\\
B\_0 + B\_2 + \\epsilon\_i && \\text{if College} \\\\
B\_0 + \\epsilon\_i && \\text{if Wharton}
\\end{cases}$

In this case, the Wharton category is the baseline: $B\_0$ is the average output for Wharton students, $B\_1$ is the average amount that Engineering students are above Wharton students, and $B\_2$ is the average amount (in units of the response) that College students are above Wharton students.
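Building this dummy coding by hand is straightforward (a small sketch; the school labels and data are the hypothetical ones from above):

```python
import numpy as np

# Dummy coding with "Wharton" as the baseline category.
schools = ["Engineering", "College", "Wharton", "Engineering"]
x1 = np.array([1.0 if s == "Engineering" else 0.0 for s in schools])
x2 = np.array([1.0 if s == "College" else 0.0 for s in schools])

# Design matrix: intercept, x1, x2. A Wharton row is [1, 0, 0], so its
# prediction reduces to B0 alone -- that is what makes it the baseline.
X = np.column_stack([np.ones(len(schools)), x1, x2])
```

In general, a qualitative variable with $k$ levels needs $k - 1$ dummy variables; the omitted level becomes the baseline.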
2. Extending the Linear Model
We can model interaction effects between different regressors by adding an interaction term. Let's say we begin with the model $Y = B\_0 + B\_1X\_1 + B\_2X\_2 + \\epsilon$. If $X\_1$ and $X\_2$ seem to have a synergistic effect, then we can add a third term $X\_1X\_2$ to the model. Our model now becomes $Y = B\_0 + B\_1X\_1 + B\_2X\_2 + B\_3X\_1X\_2 + \\epsilon$. Doing so relaxes the additive assumption in the following way:
Note that we can rewrite our model as $Y = B\_0 + (B\_1 + B\_3X\_2)X\_1 + B\_2X\_2 + \\epsilon$. Thus, a unit increase in $X\_2$ actually affects the amount by which $X\_1$ affects $Y$.
If the interaction effect is statistically significant and the true model is not purely additive, we should see an increase in the $R^2$ value of the model when the interaction term is included.
Remember the hierarchical principle: include the base terms $X\_1$ and $X\_2$ in the model when we wish to include the interaction term $X\_1X\_2$, even if neither $X\_1$ nor $X\_2$ has a statistically significant p-value.
Note: to include interaction effects between qualitative and quantitative variables, we can simply multiply their regressors to form an interaction term as before.
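Concretely, the interaction term is just an extra column of the design matrix, formed as an elementwise product (a minimal sketch with made-up values):

```python
import numpy as np

# Two quantitative regressors (hypothetical values).
x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([0.5, 1.0, 1.5, 2.0])

# Columns correspond to B0, B1, B2, and the interaction coefficient B3.
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
```

Fitting this matrix by least squares estimates $B\_3$ directly; the same elementwise product works when one of the columns is a 0/1 dummy variable.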
Non Linear Relationships
Simply include some transformation of an existing regressor (e.g., $X^2$, $\\sqrt X$, or $\\log X$) as an additional term in the model.
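For example, polynomial regression is still linear in the coefficients, so ordinary least squares handles it directly (a sketch on a noiseless quadratic for illustration):

```python
import numpy as np

# Data generated from y = 1 + 3x^2 (no noise, purely illustrative).
x = np.linspace(-2.0, 2.0, 21)
y = 1.0 + 3.0 * x ** 2

# Design matrix with columns 1, x, x^2: "linear" model, nonlinear fit.
A = np.column_stack([np.ones_like(x), x, x ** 2])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
```

The recovered coefficients are approximately $[1, 0, 3]$: adding a transformed regressor lets the linear machinery capture a nonlinear relationship.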
3. Potential Problems
- Non-linearity of the response-predictor relationships
- Correlation of error terms
  - Example of correlated error terms: time-series data
  - Correlation is an issue: if the errors are correlated, estimated standard errors tend to underestimate the true standard errors, resulting in confidence intervals that are narrower than they should be.
- Non-constant variance of error terms
  - Can perform transformations on $Y$ (such as $\\log Y$ or $\\sqrt Y$) to make the variance of the error terms constant (when residuals increase with the fitted values).
- Outliers
  - Possible that outliers may not have a large effect on the least squares slope or intercept. However, they might have a significant effect on the RSE or the $R^2$ statistic.
  - Can identify outliers by looking at residual plots. Use studentized residuals: if an observation has a studentized residual with absolute value $\\geq 3$, we can classify it as an outlier (just a general rule).
- High-leverage points

Outliers are points for which $y\_i$ is unusual given $x\_i$. High-leverage points are those for which $x\_i$ itself is unusual. Removing high-leverage points can have a substantial impact on the slope and intercept of the least squares line.

Fairly easy to identify highleverage points in simple linear regression.

In multiple linear regression, it's important to look at the predictors holistically to find high-leverage points.
In ISLR's example figure, the red point does not have an unusual value for either $X\_1$ or $X\_2$ individually; however, the combined $(X\_1, X\_2)$ value is unusual.

Leverage statistic given by
$h\_i = \\frac{1}{n} + \\frac{(x\_i - \\bar x)^2}{\\sum\_{j=1}^{n}(x\_j - \\bar x)^2}$

Noting that the average leverage over all observations is $\\frac{p+1}{n}$, an observation with a leverage statistic that greatly exceeds $\\frac{p+1}{n}$ can be classified as a high-leverage point.
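The leverage formula is easy to evaluate directly; a small numpy sketch on made-up data with one unusual $x$ value:

```python
import numpy as np

# One point with an unusual x value (hypothetical data).
x = np.array([1.0, 2.0, 3.0, 4.0, 20.0])
n = len(x)

# Leverage statistic for simple linear regression.
h = 1.0 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)
```

A useful check: the leverages always sum to $p + 1$ (here 2), so the average leverage is $\\frac{p+1}{n}$; the unusual point's leverage stands far above that average.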
- Collinearity
Comparing KNN with Linear Regression
KNN Regression: Given a value for $K$ and a prediction point $x\_0$, KNN regression identifies the $K$ training observations that are closest to $x\_0$ (represented by $N\_0$). It then estimates $f(x\_0)$ using the average of all training responses in $N\_0$. Formally, $\\hat f(x\_0) = \\frac{1}{K}\\sum\_{x\_i\\in N\_0}y\_i$
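This estimator is a few lines of numpy (a brute-force sketch using Euclidean distance; names are my own):

```python
import numpy as np

def knn_regress(X_train, y_train, x0, k):
    """KNN regression: average the responses of the k training points
    nearest to x0."""
    d = np.linalg.norm(X_train - x0, axis=1)   # distances from each row to x0
    nearest = np.argsort(d)[:k]                # indices of the k closest points
    return y_train[nearest].mean()             # average response over N0

# Tiny 1-D example: the two nearest neighbors of x0 = 0.5 are 0 and 1.
X_train = np.array([[0.0], [1.0], [2.0], [10.0]])
y_train = np.array([0.0, 1.0, 2.0, 10.0])
pred = knn_regress(X_train, y_train, np.array([0.5]), k=2)
```

Real implementations use spatial index structures (e.g., k-d trees) rather than scanning all training points, but the estimator itself is exactly this average.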
The optimal value for $K$ depends on the bias-variance tradeoff. Usually, for small $K$ (e.g., $K=1$), the variance is high but the bias is extremely low. The variance is high because any prediction (when $K=1$) is based entirely on a single training observation.
Parametric methods outperform nonparametric methods when the parametric methods make correct assumptions about the form of the data. For example, if we fit a linear model to a fairly linear dataset, then a least squares regression might fit the data better than KNN regression. However, with increasing nonlinearity, nonparametric methods such as KNN regression fit the data better.
Note: KNN regression suffers from the curse of dimensionality: as the number of dimensions increases, the $K$ observations nearest to $x\_0$ tend to be further and further away from $x\_0$, degrading the quality of the local average.