Chapter 3

In linear regression we assume that the true $f(x)$ is linear, or can be modeled by $f(x) = B_0 + B_1X_1 + \dots + B_pX_p$. We estimate $f(x)$ with $\hat f(x) = \hat B_0 + \hat B_1X_1 + \dots + \hat B_pX_p$.

Estimating Coefficients

Simple linear regression is a parametric problem, in that we need to estimate the parameters $B_0$ and $B_1$.

Simple linear regression fits a line that is as close to the data points as possible. We quantify closeness with the residual sum of squares (RSS), defined as:

$$RSS = \sum_{i=1}^{n}(y_i - \hat f(x_i))^2 = \sum_{i=1}^{n}(y_i - \hat B_0 - \hat B_1 x_i)^2$$

We want to find the parameters that minimize RSS! The estimated parameters are

$$\hat B_1 = \frac{\sum_{i=1}^{n}(x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n}(x_i - \bar x)^2}$$
$$\hat B_0 = \bar y - \hat B_1 \bar x$$

where $\bar y$ and $\bar x$ are the averages of the $y$ and $x$ values, respectively.
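
As a minimal sketch (assuming two equal-length 1-D NumPy arrays `x` and `y`; the function name and example data are illustrative, not from ISLR), these closed-form estimates translate directly into code:

```python
import numpy as np

def fit_simple_ols(x, y):
    """Closed-form least squares estimates for simple linear regression."""
    x_bar, y_bar = x.mean(), y.mean()
    # B1_hat = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
    b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # B0_hat = y_bar - B1_hat * x_bar
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Illustrative data generated from a known linear model
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=100)
b0_hat, b1_hat = fit_simple_ols(x, y)  # estimates should be near 2 and 3
```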

Assessing Accuracy of Coefficient Estimates

Hypothesis testing: Let there be a null hypothesis $H_0$ and an alternative hypothesis $H_a$. After running some tests, if we find results that are statistically significant in favor of $H_a$, then we reject $H_0$.

In the case of assessing coefficient estimates, we let $H_0: B_1 = 0$ and $H_a: B_1 \neq 0$.

In order to judge whether or not $B_1$ might be 0, we can use a t-statistic, a measurement derived from our estimates:

$$t = \frac{\hat B_1 - 0}{SE(\hat B_1)}$$

The t-statistic measures the number of standard errors that $\hat B_1$ is away from 0, using a Student's t-distribution. If $|t|$ is large, it indicates that $B_1$ is likely non-zero. More precisely, if the p-value associated with the t-statistic is small, such an estimate is unlikely to have arisen by chance, so we reject $H_0$ and conclude that $B_1$ is likely non-zero. Generally we use $|t| > 2$ as our threshold for rejecting $H_0$.
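
A sketch of the computation (reusing the hypothetical `x`, `y`, `b0_hat`, `b1_hat` from the previous sketch; $SE(\hat B_1)$ is estimated via the standard formula $SE(\hat B_1)^2 = \hat\sigma^2 / \sum_i (x_i - \bar x)^2$):

```python
import numpy as np

def t_statistic_slope(x, y, b0_hat, b1_hat):
    """t-statistic for H0: B1 = 0 in simple linear regression."""
    n = len(x)
    residuals = y - (b0_hat + b1_hat * x)
    rss = np.sum(residuals ** 2)
    sigma2_hat = rss / (n - 2)                       # estimate of Var(epsilon)
    se_b1 = np.sqrt(sigma2_hat / np.sum((x - x.mean()) ** 2))
    return (b1_hat - 0) / se_b1

# Reject H0 when |t| > 2 (roughly the 5% level for moderate n); an exact
# p-value would come from a t-distribution with n - 2 degrees of freedom.
```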

Assessing Model Accuracy

We've rejected $H_0: B_1 = 0$. Now we want to quantify how well our model fits the data.

Residual Standard Error: $RSE = \sqrt{\frac{1}{n-2}RSS}$; the average amount that the response will deviate from the true regression line; it is an estimate of the standard deviation of $\epsilon$.

$R^2$ statistic: $R^2 = \frac{TSS - RSS}{TSS}$; the proportion of variance explained by the linear model, where $TSS = \sum(y_i - \bar y)^2$ is the total sum of squares.

$R^2$ has an interpretational advantage over RSE since it always lies between 0 and 1.
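
Both quantities fall out directly from the residuals; a minimal sketch (reusing the hypothetical arrays and fitted coefficients from the earlier sketches):

```python
import numpy as np

def rse_and_r2(x, y, b0_hat, b1_hat):
    """Residual standard error and R^2 for a fitted simple linear regression."""
    residuals = y - (b0_hat + b1_hat * x)
    rss = np.sum(residuals ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    rse = np.sqrt(rss / (len(y) - 2))   # estimate of sd(epsilon)
    r2 = (tss - rss) / tss              # proportion of variance explained
    return rse, r2
```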

We can also construct confidence intervals for the coefficients; for example, $\hat B_1 \pm 2 \cdot SE(\hat B_1)$ is approximately a 95% confidence interval for $B_1$.

Multiple Linear Regression

Now, the predictor takes the form $\hat Y = \hat B_0 + \hat B_1X_1 + \dots + \hat B_pX_p$.

$B_j$: the average effect on $Y$ of a one-unit increase in $X_j$, holding all other regressors fixed.

Why not just run a simple linear regression for each regressor?

  1. Can't see effect of 1 regressor in presence of others
  2. Each equation will ignore other regressors
  3. Can't account for correlations between regressors

We estimate regression coefficients by minimizing RSS, given by

$$RSS = \sum_{i=1}^{n}(y_i - \hat f(x_i))^2 = \sum_{i=1}^{n}(y_i - \hat B_0 - \hat B_1X_{i1} - \dots - \hat B_pX_{ip})^2$$
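
In practice these coefficients are obtained by solving the least squares problem numerically. A minimal sketch (assuming `X` is an $n \times p$ NumPy array of regressors and `y` a length-$n$ response vector; the helper name is illustrative):

```python
import numpy as np

def fit_multiple_ols(X, y):
    """Least squares coefficients for y ~ B0 + B1*X1 + ... + Bp*Xp."""
    # Prepend a column of ones so the intercept B0 is estimated too
    X_design = np.column_stack([np.ones(len(y)), X])
    beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    return beta_hat  # [B0_hat, B1_hat, ..., Bp_hat]
```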

Correlation between regressors is an issue. Example from ISLR:

Newspaper advertising showed a high correlation with sales in a simple linear regression setting. However, newspaper advertising's effect was not statistically significant in the presence of radio and TV advertising. In fact, since newspaper advertising was heavily correlated with radio advertising, it received the credit for radio advertising's effect (in the simple linear regression setting).

We can use t-statistics to find out whether or not a regressor has a statistically significant effect in the presence of the other regressors, as in the multiple regression coefficient table in ISLR.

Answering: Is there any relationship between the response and the predictors?

Let $H_0: B_1 = B_2 = \dots = B_p = 0$ and $H_a: B_j \neq 0$ for at least one $j$.

We can compute the F-statistic, then the corresponding p-value, and use that to decide whether to reject the null hypothesis. Note that we can't use the raw F-statistic alone to reject $H_0$: how much larger than 1 it needs to be depends on $n$ and $p$.

$$F = \frac{(TSS - RSS)/p}{RSS/(n-p-1)}$$
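
A sketch of the computation (reusing the hypothetical `fit_multiple_ols` output from the earlier sketch; the p-value step is noted in a comment rather than implemented):

```python
import numpy as np

def f_statistic(X, y, beta_hat):
    """Overall F-statistic for H0: B1 = ... = Bp = 0."""
    n, p = X.shape
    X_design = np.column_stack([np.ones(n), X])
    rss = np.sum((y - X_design @ beta_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return ((tss - rss) / p) / (rss / (n - p - 1))

# The p-value comes from an F(p, n - p - 1) distribution,
# e.g. scipy.stats.f.sf(F, p, n - p - 1).
```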

Answering: Is there any relationship between a subset of the regressors and the response?

Note: We still need to look at the overall F-statistic even if we have individual p-values for each regressor. When $p$ is large, some p-values (about 5% of them) will be small just by chance. The F-statistic, however, adjusts for the number of predictors.

Deciding on Important Regressors

Model Fit

Note: $R^2$ will always increase when more regressors are added to the linear model, because extra regressors allow the model to fit the training data more closely. One way to use this fact: if a regressor added to a model increases the model's $R^2$ by only a minuscule amount, it is possible that the variable does not have a statistically significant effect on the response, given the other regressors in the model.

Note: $RSE = \sqrt{\frac{1}{n-p-1}RSS}$. Models with more regressors can have a higher RSE if the decrease in RSS is small relative to the increase in $p$.

Sources of Uncertainty in Prediction

  1. Predicted least squares plane is only an estimate for true population regression plane.
    1. We can use confidence intervals to determine how close $\hat Y$ is to $f(x)$
  2. Model bias - Assuming data is linear when it might not be
  3. Response can't be predicted perfectly because of irreducible error

Prediction intervals

We can construct a confidence interval to quantify the uncertainty surrounding the average response over a large number of observations, and a prediction interval to quantify the uncertainty for a single specific prediction. Prediction intervals are always wider, since they also account for the irreducible error.

Other Considerations in Regression Model

1. Qualitative Predictors

Some regressors will be qualitative. We can capture the effect of these regressors using indicator variables. For example, if we wish to measure the effect of gender (a variable with only two levels, male and female), we can create a variable $x_i$, where

$$x_i = \begin{cases} 1 & \text{male} \\ 0 & \text{female} \end{cases}$$

Then, the regression becomes

$$y_i = B_0 + B_1x_i + \epsilon_i = \begin{cases} B_0 + B_1 + \epsilon_i & \text{male} \\ B_0 + \epsilon_i & \text{female} \end{cases}$$

Which level gets assigned 0 or 1 is somewhat arbitrary. This dummy variable models the effect of the difference between male and female. For example, if the p-value for $\hat B_1$ is not statistically significant, we fail to reject the null hypothesis and conclude there is no evidence that an observation being male or female affects the output (the difference between male and female is not statistically significant). With this coding, $B_0$ can be interpreted as the average output for females and $B_1$ as the average amount by which males exceed females.

However, the choice of the numbers 0 and 1 affects the interpretation of the results. For example, we could have modeled the same data as

$$y_i = \begin{cases} B_0 + B_1 + \epsilon_i & \text{male} \\ B_0 - B_1 + \epsilon_i & \text{female} \end{cases}$$

Now, $B_0$ is the overall average output, and $B_1$ tells us the amount that males are above the average and females are below it.

Note: Different choices of encoding lead to different interpretations.

We can also encode qualitative predictors with more than two levels by using multiple dummy variables. For example, if we want to encode an observation being either an Engineering, College, or Wharton student, we can let:

$$x_{i1} = \begin{cases} 1 & \text{Engineering} \\ 0 & \text{not Engineering} \end{cases}$$
$$x_{i2} = \begin{cases} 1 & \text{College} \\ 0 & \text{not College} \end{cases}$$

Now, our regression model becomes

$$y_i = \begin{cases} B_0 + B_1 + \epsilon_i & \text{if Engineering} \\ B_0 + B_2 + \epsilon_i & \text{if College} \\ B_0 + \epsilon_i & \text{if Wharton} \end{cases}$$

In this case, the Wharton category is the baseline: $B_0$ is the average output for Wharton students, $B_1$ is the average amount (in units of the response) by which Engineering students are above Wharton students, and $B_2$ is the average amount by which College students are above Wharton students.
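
A minimal sketch of building these dummy columns (the label values and helper name are illustrative; the resulting columns would simply be appended to the design matrix):

```python
import numpy as np

def encode_school(labels):
    """Dummy-encode a 3-level qualitative predictor with Wharton as the baseline."""
    labels = np.asarray(labels)
    x1 = (labels == "Engineering").astype(float)  # x_i1
    x2 = (labels == "College").astype(float)      # x_i2
    return np.column_stack([x1, x2])              # Wharton maps to (0, 0)

# encode_school(["Wharton", "Engineering", "College"])
# -> [[0., 0.], [1., 0.], [0., 1.]]
```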

2. Extending the Linear Model

We can model interaction effects between different regressors by adding an interaction term. Say we begin with the model $Y = B_0 + B_1X_1 + B_2X_2 + \epsilon$. If $X_1$ and $X_2$ seem to have a synergistic effect, we can add a third term $X_1X_2$, so the model becomes $Y = B_0 + B_1X_1 + B_2X_2 + B_3X_1X_2 + \epsilon$. Doing so relaxes the additive assumption in the following way:

Note that we can rewrite the model as $Y = B_0 + (B_1 + B_3X_2)X_1 + B_2X_2 + \epsilon$. Thus, a unit increase in $X_2$ changes the amount by which $X_1$ affects $Y$.

If the effect of the interaction term is statistically significant and the true relationship is not purely additive, we should see an increase in the $R^2$ of the model when the interaction term is included.

Remember the hierarchical principle: include the main effects $X_1$ and $X_2$ in the model whenever we include the interaction term $X_1X_2$, even if neither $X_1$ nor $X_2$ has a statistically significant p-value.

Note: to include interaction effects between qualitative and quantitative variables, we can simply multiply their regressors to form an interaction term as before.
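
Constructing the interaction column is just an element-wise product; a minimal sketch (the helper name and column indices are illustrative):

```python
import numpy as np

def add_interaction(X, i=0, j=1):
    """Return X with an extra column equal to X[:, i] * X[:, j]."""
    return np.column_stack([X, X[:, i] * X[:, j]])

# The augmented matrix is then fit with the same least squares machinery,
# giving Y = B0 + B1*X1 + B2*X2 + B3*(X1*X2) + eps.
```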

Non-Linear Relationships

Simply include some transformation of an existing regressor (e.g., $X^2$ or $\log X$) as an additional term. The model remains linear in the coefficients, so least squares applies unchanged.
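
For example, a quadratic term is just another column in the design matrix (a minimal sketch; the helper name is illustrative):

```python
import numpy as np

def add_quadratic(X, i=0):
    """Return X with an extra column equal to X[:, i] squared."""
    return np.column_stack([X, X[:, i] ** 2])
```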

3. Potential Problems

  1. Non-linearity of response-predictor relationships
  2. Correlation of error terms
    1. An example of correlated error terms: time-series data
    2. Correlation is an issue: if there is correlation, estimated standard errors tend to underestimate true standard errors, resulting in confidence intervals that are narrower than they should be.
  3. Non-constant variance of error terms
    1. Can transform the response $Y$ (e.g., $\log Y$ or $\sqrt Y$) to make the variance of the error terms more constant (useful when the spread of the residuals increases with $x$).
  4. Outliers
    1. Possible that outliers may not have a large effect on the least squares slope or intercept. However, they might have a significant effect on the RSE or $R^2$ statistic.
    2. We can identify outliers by looking at residual plots. Use studentized residuals: if an observation's studentized residual exceeds 3 in absolute value, we can classify it as an outlier (just a general rule).
  5. High-leverage points
    1. Outliers are points for which $y_i$ is unusual given $x_i$; high-leverage points are those for which $x_i$ itself is unusual. Removing high-leverage points can have a substantial impact on the slope and intercept of the least squares line.

    2. Fairly easy to identify high-leverage points in simple linear regression.

    3. In multiple linear regression, it's important to look at the predictors holistically to find high-leverage points.

      In the ISLR example figure, the red point does not have an unusual value for either $X_1$ or $X_2$ on its own; however, its combined $(X_1, X_2)$ value is unusual.

    4. The leverage statistic (for simple linear regression) is given by the formula below; see the code sketch after this list.

      $$h_i = \frac{1}{n} + \frac{(x_i - \bar x)^2}{\sum_{j=1}^{n}(x_j - \bar x)^2}$$

      Noting that the average leverage across all observations is $\frac{p+1}{n}$, an observation whose leverage statistic greatly exceeds $\frac{p+1}{n}$ can be classified as a high-leverage point.

  6. Collinearity
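
A minimal sketch of the simple-regression leverage formula referenced above (assuming a 1-D NumPy array `x`; the "3x the average" cutoff in the comment is just one illustrative choice, not a rule from the text):

```python
import numpy as np

def leverage_simple(x):
    """Leverage statistic h_i for each observation in simple linear regression."""
    x_bar = x.mean()
    return 1.0 / len(x) + (x - x_bar) ** 2 / np.sum((x - x_bar) ** 2)

# Average leverage is (p + 1) / n with p = 1 here; flag points well above it,
# e.g. leverage_simple(x) > 3 * (1 + 1) / len(x) as an illustrative cutoff.
```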

Comparing KNN with Linear Regression

KNN Regression: Given a value for $K$ and a prediction point $x_0$, KNN regression identifies the $K$ training observations that are closest to $x_0$ (denoted $N_0$). It then estimates $f(x_0)$ using the average of the training responses in $N_0$: $\hat f(x_0) = \frac{1}{K}\sum_{x_i\in N_0}y_i$.
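
A minimal sketch (assuming `X_train` is an $n \times d$ NumPy array, `y_train` a length-$n$ vector, `x0` a length-$d$ point, and Euclidean distance; the function name is illustrative):

```python
import numpy as np

def knn_regress(X_train, y_train, x0, K):
    """Predict f(x0) as the average response of the K nearest training points."""
    dists = np.linalg.norm(X_train - x0, axis=1)  # Euclidean distances to x0
    nearest = np.argsort(dists)[:K]               # indices of the K closest points
    return y_train[nearest].mean()
```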

The optimal value of $K$ depends on the bias-variance tradeoff. For small $K$ (e.g., $K = 1$), the variance is high but the bias is extremely low; the variance is high because any prediction (when $K = 1$) is based entirely on a single training observation.

Parametric methods outperform non-parametric methods when their assumptions about the form of the data are correct. For example, if we fit a linear model to a fairly linear data set, least squares regression might fit the data better than KNN regression. However, as the true relationship becomes increasingly non-linear, non-parametric methods such as KNN regression fit the data better.

Note: KNN regression suffers from the curse of dimensionality. As the number of dimensions increases, the observations in $N_0$ that are closest to $x_0$ tend to be further and further away from $x_0$.