Chapter 2 Notes

Fundamental Problem

$Y = f(x) + \epsilon$ is a function we don't know but want to estimate. $x$ is a regressor vector, $Y$ is our output, and $\epsilon$ is our irreducible error, which results either from regressors missing from $x$ or from unmeasurable variation in nature.

Two classes of problems

Prediction: Estimate $Y$ with $\hat{Y} = \hat{f}(x)$ (the error term averages to zero, so it drops out of the prediction).

Inference: see how $Y$ changes with changes in $X_1, \ldots, X_p$

Classes of Problems

Regression problems deal with quantitative responses.

  1. parametric: reduces the problem of estimating $f$ to the problem of estimating a set of parameters
    1. i.e.: when we want to use the linear model $f(x) = B_0 + B_1X_1 + \ldots + B_pX_p$, we only have to estimate the parameters $B_0, B_1, \ldots, B_p$
    2. Generally less flexible - the assumed form may be far from the true form of $f$
    3. More restrictive but generally more interpretable than non-parametric models
  2. non-parametric:
    1. no assumption about the shape/form of $f$
    2. tries to find an $\hat{f}$ that is as close to the data points as possible
    3. Needs a very large number of observations
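As a sketch of the parametric approach, the linear model above can be fit by least squares. The synthetic data and NumPy routine here are my own illustration, not from the text:

```python
import numpy as np

# Hypothetical synthetic data: true f is linear with p = 2 regressors.
rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 2))
beta_true = np.array([1.0, 2.0, -3.0])           # B_0, B_1, B_2
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.1, size=n)

# Estimating f reduces to estimating the parameters B_0 ... B_p.
X_design = np.column_stack([np.ones(n), X])      # prepend intercept column
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print(beta_hat)  # should be close to [1, 2, -3]
```

The whole estimation problem collapses to finding p + 1 numbers, which is what makes parametric methods restrictive but tractable.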

Classification problems deal with categorical or qualitative responses. Note that the type of response (quantitative vs. qualitative) indicates what type of problem we are trying to solve.

Assessing Model Accuracy

Measuring Quality of Fit - to quantify how "off" predictions are from true response data

We can use mean squared error

$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{f}(x_i))^2$
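A minimal sketch of computing MSE between observed responses and fitted values (the helper name and toy numbers are my own):

```python
import numpy as np

def mse(y, y_hat):
    # Mean squared error: average squared gap between y_i and f_hat(x_i).
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.mean((y - y_hat) ** 2)

print(mse([1.0, 2.0, 3.0], [1.0, 2.5, 2.0]))  # (0 + 0.25 + 1) / 3
```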

Bias-Variance Tradeoff

$E[\text{test MSE}] = Var(\hat{f}(x_0)) + [Bias(\hat{f}(x_0))]^2 + Var(\epsilon)$

Bias: Error introduced by approximating a real-life problem with a simple model (i.e.: a linear model).

Variance: the amount by which $\hat{f}$ would change if it were estimated using a different training data set; ideally, $\hat{f}$ shouldn't vary much across training sets
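A small simulation can make the variance term concrete: fit a rigid and a flexible model to many training sets drawn from the same process and compare how much $\hat{f}(x_0)$ moves around. Everything here (the sine true $f$, the polynomial fits, the noise level) is an illustrative assumption, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                     # assumed true f
x0 = 1.0                       # fixed test point
n, reps = 30, 200

preds = {1: [], 9: []}         # degree 1 (rigid) vs. degree 9 (flexible)
for _ in range(reps):
    # Draw a fresh training set from the same data-generating process.
    x = rng.uniform(0, 3, size=n)
    y = f(x) + rng.normal(scale=0.3, size=n)
    for d in preds:
        coefs = np.polyfit(x, y, deg=d)
        preds[d].append(np.polyval(coefs, x0))

var_rigid = np.var(preds[1])
var_flexible = np.var(preds[9])
print(var_rigid, var_flexible)  # flexible fit varies more across training sets
```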


Training Error Rate

Use the training error rate to quantify the accuracy of $\hat{f}$:

$\frac{1}{n}\sum_{i=1}^{n}I(Y_i \neq \hat{Y}_i)$

where $I$ is an indicator random variable that is 1 if $Y_i \neq \hat{Y}_i$ and 0 otherwise

Basically, the training error rate averages all misclassifications across the $n$ observations
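The averaging of the indicator can be sketched directly (helper name and toy labels are my own):

```python
import numpy as np

def error_rate(y, y_hat):
    # Average of the indicator I(y_i != y_hat_i) over all observations.
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.mean(y != y_hat)

y     = np.array([0, 1, 1, 0, 1])
y_hat = np.array([0, 1, 0, 0, 0])
print(error_rate(y, y_hat))  # 2 misclassifications out of 5 -> 0.4
```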

Test Error Rate

$Ave(I(Y_i \neq \hat{Y}_i))$, averaged over the test observations - a good classifier minimizes this

The Bayes classifier

Assign a test observation $x_0$ the class $j$ for which $P(Y = j \mid X = x_0)$ is the highest.

Bayes Error Rate: $1 - E(\max_j \Pr(Y = j \mid X))$ - the expectation just averages the probability over all possible $X$
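If the conditional probabilities were actually known (they never are in practice), the Bayes classifier is just an argmax. A toy sketch with made-up probabilities of my own:

```python
# Made-up conditional distribution Pr(Y = j | X = x) for two x values.
cond_probs = {
    "x_a": {"orange": 0.8, "blue": 0.2},
    "x_b": {"orange": 0.3, "blue": 0.7},
}

def bayes_classify(x):
    # Assign the class j with the highest Pr(Y = j | X = x).
    return max(cond_probs[x], key=cond_probs[x].get)

print(bayes_classify("x_a"))  # orange
print(bayes_classify("x_b"))  # blue

# Bayes error rate: 1 - average over X of max_j Pr(Y = j | X),
# assuming X is uniform over the two values.
bayes_error = 1 - sum(max(p.values()) for p in cond_probs.values()) / len(cond_probs)
print(bayes_error)  # 1 - (0.8 + 0.7)/2 ≈ 0.25
```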


K-Nearest Neighbors (KNN)

Brief Description: For an observation $i$, we look at the classes of its $K$ nearest neighbors. The class $k$ with the highest proportion wins and we assign $i$ the class $k$.

Classifier that classifies according to

$\Pr(Y = j \mid X = x_0) = \frac{1}{K}\sum_{i \in N_0}I(y_i = j)$

where $N_0$ is the set of the $K$ training points nearest $x_0$; $x_0$ gets the class with the highest estimated probability.
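A minimal KNN sketch (my own illustration, with made-up training points): estimate $\Pr(Y = j \mid X = x_0)$ as the fraction of the $K$ nearest neighbors in class $j$, then classify $x_0$ to the class with the highest estimated probability.

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x0, K=3):
    X_train = np.asarray(X_train, dtype=float)
    # Euclidean distances from x0 to every training point.
    dists = np.linalg.norm(X_train - np.asarray(x0, dtype=float), axis=1)
    neighbors = np.argsort(dists)[:K]               # indices forming N_0
    votes = Counter(y_train[i] for i in neighbors)  # counts of I(y_i = j)
    # Highest count = highest estimated Pr(Y = j | X = x0).
    return votes.most_common(1)[0][0]

X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train = ["blue", "blue", "blue", "orange", "orange", "orange"]
print(knn_classify(X_train, y_train, [0.5, 0.5], K=3))  # blue
print(knn_classify(X_train, y_train, [5.5, 5.5], K=3))  # orange
```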