Chapter 4: Classification


We generally try to find the probability that a given observation $x_0$ belongs to a class $k$.

Pitfalls of Linear Regression

  1. Linear regression can handle a qualitative response with a binary encoding (just 2 levels), or one whose levels have an inherent ordering with roughly equal spacing between them. Beyond these restrictions, however, there is no natural way to use linear regression for a qualitative response.
  2. Linear regression produces a line, which can produce negative outputs or outputs greater than 1 for some values of $x$. This violates the requirement that probabilities lie in $[0,1]$.

Logistic Regression

Linear regression models the expected value of the response given the regressors. Logistic regression models the probability that the response belongs to a particular class given the regressors.

Logistic model

Let the odds of an event with probability $p$ be defined as $\frac{p}{1-p}$.

Let $\mathrm{logit}(p) = \ln\left(\frac{p}{1-p}\right)$. We want to be able to map $\mathrm{logit}(p)$ to a linear combination of the regressors. Let $\ln\left(\frac{p}{1-p}\right) = \alpha$, where $\alpha$ is the linear combination of regressors.

Note that $\mathrm{logit}^{-1}(\alpha) = \frac{e^{\alpha}}{1+e^{\alpha}}$. Notice that this function has a range restricted to $(0,1)$. Let

$$p(x) = \frac{e^{\alpha}}{1+e^{\alpha}} = \frac{e^{B_0 + B_1X_1}}{1+e^{B_0 + B_1X_1}}$$

Notice that $\ln\left(\frac{p(x)}{1-p(x)}\right) = B_0 + B_1X$, so a unit increase in $X$ yields a $B_1$ increase in the log-odds.
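
As a minimal sketch of this relationship (in NumPy, with made-up coefficient values $B_0 = -3$ and $B_1 = 0.5$ chosen purely for illustration), the logistic function maps the linear predictor to a probability in $(0,1)$, and each unit increase in $x$ adds $B_1$ to the log-odds:

```python
import numpy as np

def logistic(alpha):
    """Inverse logit: maps a linear predictor alpha to a probability in (0, 1)."""
    return np.exp(alpha) / (1 + np.exp(alpha))

# Hypothetical coefficients, chosen only for illustration.
b0, b1 = -3.0, 0.5
x = np.array([0.0, 1.0, 2.0])

p = logistic(b0 + b1 * x)
log_odds = np.log(p / (1 - p))              # recovers b0 + b1 * x
print(np.allclose(np.diff(log_odds), b1))   # each unit step in x adds b1 to the log-odds
```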

We estimate the logistic regression coefficients by seeking estimates for $B_0$ and $B_1$ such that $p(x_i)$ is close to 1 for observations of class 1 and close to 0 for observations of class 0. We find these coefficients by maximizing the likelihood function:

$$\ell(B_0,B_1) = \prod_{i:y_i=1}p(x_i) \prod_{j:y_j=0}(1-p(x_j))$$

We can measure the accuracy of the coefficient estimates by computing their standard errors, finding their z-statistics, and then finding the associated p-values. A large absolute value of the z-statistic is evidence against $H_0: B_1 = 0$.
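
A quick sketch of this fitting procedure, assuming statsmodels is available and using simulated data whose "true" coefficients ($B_0 = -1$, $B_1 = 2$) are arbitrary demo values; the summary table reports exactly the quantities above (coefficient estimates, standard errors, z-statistics, and p-values):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated data; the "true" coefficients B0 = -1, B1 = 2 are arbitrary demo values.
n = 1000
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-(-1.0 + 2.0 * x)))
y = rng.binomial(1, p)

X = sm.add_constant(x)               # adds the intercept column
fit = sm.Logit(y, X).fit(disp=0)     # maximum likelihood estimation of B0, B1
print(fit.summary())                 # coef, std err, z, and P>|z| for each term
```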

Multiple Logistic Regression

Now $\ln\left(\frac{p(x)}{1-p(x)}\right) = B_0 + B_1X_1 + \ldots + B_pX_p$ and thus

$$p(x) = \frac{e^{\alpha}}{1+e^{\alpha}} = \frac{e^{B_0 + B_1X_1 + \ldots + B_pX_p}}{1+e^{B_0 + B_1X_1 + \ldots + B_pX_p}}$$

Keep in mind that any qualitative predictors $X_i$ must still be encoded as indicator (dummy) variables; quantitative predictors can enter the model directly.

It's possible that we now find confounding between different variables in the multiple logistic regression setting: a predictor's coefficient can even change sign once a correlated predictor is added to the model.
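
A sketch of this confounding effect on simulated data (the data-generating coefficients and the correlation between the two predictors are invented for the illustration, and statsmodels is assumed to be available):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 5000

# Hypothetical confounding setup: x1 is strongly correlated with x2, but
# holding x2 fixed, larger x1 actually *lowers* the probability that y = 1.
x2 = rng.normal(size=n)
x1 = x2 + 0.5 * rng.normal(size=n)
p = 1 / (1 + np.exp(-(-1.0 - 1.0 * x1 + 3.0 * x2)))
y = rng.binomial(1, p)

simple = sm.Logit(y, sm.add_constant(x1)).fit(disp=0)
multiple = sm.Logit(y, sm.add_constant(np.column_stack([x1, x2]))).fit(disp=0)
print(simple.params[1])     # positive: x1 looks "risky" when considered alone
print(multiple.params[1])   # negative: the sign flips once x2 enters the model
```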

Linear Discriminant Analysis

Important assumptions (for a single predictor, $p = 1$):

  1. Observations within each class $k$ are drawn from a normal (Gaussian) distribution with a class-specific mean $\mu_k$.
  2. All $K$ classes share a common variance $\sigma^2$.

Note that the density function in one dimension is

$$f_k(x) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2\sigma^2}(x-\mu_k)^2}$$

We let $\pi_k$ be the proportion of training data points that lie within class $k$. In other words, $\pi_k$ estimates the prior probability that a randomly selected observation comes from class $k$.

Remember that Bayes' theorem states that $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$. From this, we see that

$$P(Y=k|X=x) = \frac{P(X=x|Y=k)P(Y=k)}{\sum_{i=1}^{K}P(X=x|Y=i)P(Y=i)} = \frac{\pi_k f_k(x)}{\sum_{i=1}^{K}\pi_i f_i(x)}$$
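
A minimal sketch of evaluating this posterior, assuming SciPy is available and using hypothetical priors, class means, and a shared $\sigma$ (all made up for the example):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical two-class setup: priors pi_k, class means mu_k, shared sigma.
pi = np.array([0.7, 0.3])
mu = np.array([0.0, 2.0])
sigma = 1.0

def posterior(x):
    """P(Y = k | X = x) via Bayes' theorem with Gaussian class densities f_k."""
    numer = pi * norm.pdf(x, loc=mu, scale=sigma)   # pi_k * f_k(x) for each class k
    return numer / numer.sum()

print(posterior(1.5))   # posterior probabilities over the two classes; they sum to 1
```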

Note that if $\pi_1f_1(x) > \pi_2f_2(x)$ then $\log(\pi_1f_1(x)) > \log(\pi_2f_2(x))$, since the log is a monotonically increasing function.

Thus we can use the following linear function to classify observations based on whether or not $\delta_1(x) > \delta_2(x)$ for some observation $x$ and two discriminant functions $\delta_1$ and $\delta_2$.

$$\delta_k(x) = x\frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)$$

This is $\log(\pi_k f_k(x))$ with the terms that do not depend on $k$ dropped, since those terms do not affect which class maximizes the expression.

However, in order to estimate the discriminant functions, we need $\hat\pi_k$, $\hat\mu_k$, and $\hat\sigma$. We can use the plug-in estimates $\hat\pi_k = n_k/n$, $\hat\mu_k = \frac{1}{n_k}\sum_{i:y_i=k}x_i$, and the pooled variance estimate $\hat\sigma^2 = \frac{1}{n-K}\sum_{k=1}^{K}\sum_{i:y_i=k}(x_i-\hat\mu_k)^2$, where $n_k$ is the number of training observations in class $k$.

We let the decision boundary be the value of $x$ for which $\delta_1(x) = \delta_2(x)$; we classify $x$ into class 1 when $\delta_1(x) > \delta_2(x)$. In general, we would like the LDA decision boundary to be as close to the Bayes decision boundary as possible, since the Bayes decision boundary is the theoretically optimal one.
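
A sketch of one-dimensional LDA built from these plug-in estimates (the toy training data is drawn from two made-up Gaussians, purely for illustration):

```python
import numpy as np

def lda_1d_fit(x, y):
    """Plug-in estimates for one-dimensional LDA: pi_k, mu_k, pooled sigma^2."""
    classes = np.unique(y)
    pi = np.array([np.mean(y == k) for k in classes])
    mu = np.array([x[y == k].mean() for k in classes])
    # Pooled variance: squared deviations summed across classes, divided by n - K.
    sigma2 = sum(((x[y == k] - x[y == k].mean()) ** 2).sum() for k in classes)
    sigma2 /= len(x) - len(classes)
    return classes, pi, mu, sigma2

def lda_1d_predict(x0, classes, pi, mu, sigma2):
    """Assign x0 to the class with the largest discriminant delta_k(x0)."""
    delta = x0 * mu / sigma2 - mu ** 2 / (2 * sigma2) + np.log(pi)
    return classes[np.argmax(delta)]

# Toy training data drawn from two made-up Gaussians.
rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-1, 1, 40), rng.normal(2, 1, 60)])
y = np.array([0] * 40 + [1] * 60)

params = lda_1d_fit(x, y)
print(lda_1d_predict(0.8, *params))
```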

Linear Discriminant Analysis for $p > 1$

Assume that $X = (X_1, X_2, \ldots, X_p)$ is drawn from a multivariate Gaussian distribution with a class-specific mean vector and a common covariance matrix. We write $X \sim N(\mu, \Sigma)$ to mean that $X$ has a multivariate Gaussian distribution with mean vector $\mu$ and covariance matrix $\Sigma$.

Multivariate Gaussian density:

$$f(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}e^{-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)}$$

LDA for $p>1$ assumes that observations from each class $k$ are drawn from $N(\mu_k, \Sigma)$. Here $\mu_k$ is a class-specific mean vector, and $\Sigma$ is a covariance matrix common to all $K$ classes (an assumption).

LDA Classifier:

$$\delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log\pi_k$$

Notice that this is linear in $x$, since $x$ appears only in the first term $x^T\Sigma^{-1}\mu_k$. This is a key observation when examining the differences between LDA and QDA.
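
A direct NumPy sketch of this classifier, with the mean vectors, shared covariance matrix, and priors all invented for the example:

```python
import numpy as np

def lda_discriminants(x, mus, Sigma, pis):
    """delta_k(x) = x^T Sigma^{-1} mu_k - (1/2) mu_k^T Sigma^{-1} mu_k + log pi_k."""
    Sigma_inv = np.linalg.inv(Sigma)
    return np.array([
        x @ Sigma_inv @ mu - 0.5 * mu @ Sigma_inv @ mu + np.log(pi)
        for mu, pi in zip(mus, pis)
    ])

# Hypothetical parameters for a two-class, two-predictor problem.
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])        # covariance matrix shared by both classes
pis = [0.6, 0.4]

x0 = np.array([1.5, 0.5])
print(np.argmax(lda_discriminants(x0, mus, Sigma, pis)))   # index of the predicted class
```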

Note: In LDA with $K$ classes, the Bayes decision boundaries consist of $\binom{K}{2}$ pairwise boundaries, one for each pair of classes.

Interpreting confusion matrices in the context of LDA performance on the credit data set

The credit data set contains information about individuals, including whether or not those individuals defaulted on their loans.

In a confusion matrix, we can see the true and false positives and negatives that result from a model. The general form of a confusion matrix is:

|               | True: No   | True: Yes  | Total |
|---------------|------------|------------|-------|
| Predicted No  | True Neg.  | False Neg. | N*    |
| Predicted Yes | False Pos. | True Pos.  | P*    |
| Total         | N          | P          |       |

For the default status in the credit card data set that results from running LDA, we obtain the following confusion matrix:

|               | True: No | True: Yes | Total  |
|---------------|----------|-----------|--------|
| Predicted No  | 9,644    | 252       | 9,896  |
| Predicted Yes | 23       | 81        | 104    |
| Total         | 9,667    | 333       | 10,000 |

We see that only 3.33% of individuals in the training sample defaulted. This means that even a null classifier that predicts "no default" for every individual will have an error rate of only 3.33%. LDA on this training set gives an error rate of 2.75%, which is very close to the error rate of the null classifier. We can also see that of the 333 individuals that defaulted, LDA missed 252 of them (that's 75.7%!).
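
These error rates can be read straight off a confusion matrix. A small sketch with scikit-learn on stand-in 0/1 labels (the arrays here are toy values, not the credit data; note that scikit-learn's `confusion_matrix` puts true classes on the rows, transposed relative to the tables above):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy 0/1 labels (1 = default) standing in for a classifier's output.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([0, 0, 1, 0, 0, 0, 0, 1])

cm = confusion_matrix(y_true, y_pred)   # scikit-learn: rows = true class, cols = predicted
tn, fp, fn, tp = cm.ravel()

total_error = (fp + fn) / cm.sum()
missed_defaulters = fn / (fn + tp)      # fraction of true defaulters the model misses
print(total_error, missed_defaulters)
```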

An important note to make here: vanilla LDA approximates the Bayes classifier, which minimizes the total error rate (i.e., the total number of misclassifications). In this case, that's not necessarily what we want. We might actually care more about catching defaulters than about occasionally mislabelling non-defaulters as defaulters.

We can alleviate this issue by changing the threshold for classification. For LDA in a two-class setting, if $P(Y=k|X=x) > 0.5$ then we classify the observation $x$ into class $k$. We can reduce this threshold to assign more observations to class $k$, even though some of these observations have probabilities slightly lower than $0.5$. It's important to note that the Bayes classifier uses $0.5$ as its threshold, since this threshold minimizes the total error rate in a two-class setting. Specifically, in the context of predicting defaulters on credit card loans, we can change the condition for predicting default = Yes from $P(\text{default}|X=x) > 0.5$ to $P(\text{default}|X=x) > 0.2$.

After moving the threshold away from the one used by the Bayes classifier, the total error rate has to increase. However, the error rate we care about (misclassifying defaulters as non-defaulters) should decrease, and that is exactly what happens: LDA now misclassifies only 41% of defaulters as non-defaulters, but the overall error rate increases to 3.73% (since some non-defaulters now get classified as defaulters).
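
A sketch of threshold tuning with scikit-learn's LDA implementation, on simulated data with a rare positive class standing in for the defaulters (the class proportions and means are made up for the example):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Simulated data with a rare positive class (label 1) standing in for defaulters.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 1.0, (950, 2)), rng.normal(1.5, 1.0, (50, 2))])
y = np.array([0] * 950 + [1] * 50)

lda = LinearDiscriminantAnalysis().fit(X, y)
posterior_yes = lda.predict_proba(X)[:, 1]          # P(class = 1 | X = x)

for threshold in (0.5, 0.2):                        # default vs. lowered threshold
    pred = (posterior_yes > threshold).astype(int)
    missed = np.sum((y == 1) & (pred == 0)) / np.sum(y == 1)
    total_err = np.mean(pred != y)
    print(f"threshold {threshold}: missed positives {missed:.1%}, total error {total_err:.2%}")
```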

ROC Curve

An ROC curve has the False Positive Rate on the x-axis and the True Positive Rate on the y-axis. The ROC curve is generated from the performance of a single model evaluated at many different classification thresholds.

Some key observations about the curve (let $k$ be whatever class is positive):

  1. Each threshold produces one (false positive rate, true positive rate) point; sweeping the threshold from high to low traces out the curve.
  2. The overall performance of the classifier is summarized by the area under the curve (AUC); the closer the AUC is to 1, the better.
  3. An ideal ROC curve hugs the top-left corner, while a classifier that does no better than chance falls along the 45-degree diagonal (AUC of 0.5).

A sketch of computing the curve and its AUC follows.
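
Assuming scikit-learn is available, and reusing the same kind of simulated rare-class data as in the threshold example (all values invented for illustration):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_curve, roc_auc_score

# Simulated rare-class data, as in the threshold-tuning sketch above.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 1.0, (950, 2)), rng.normal(1.5, 1.0, (50, 2))])
y = np.array([0] * 950 + [1] * 50)

scores = LinearDiscriminantAnalysis().fit(X, y).predict_proba(X)[:, 1]
fpr, tpr, thresholds = roc_curve(y, scores)          # one (FPR, TPR) point per threshold
print(f"AUC: {roc_auc_score(y, scores):.3f}")        # 0.5 = chance, 1.0 = perfect
```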

Quadratic Discriminant Analysis

QDA is pretty much the same thing as LDA, but we relax a critical assumption made for LDA: we now allow individual classes to have their own covariance matrices. In other words, instead of $\Sigma = \Sigma_i = \Sigma_j$ for all classes $i$ and $j$, we now let each class $k$ retain its own covariance matrix $\Sigma_k$.

The discriminant function (which is really long to type - just look at Section 4.4.4 for the equation) now has quadratic and cross terms. Why? When we apply the log transformation to the Gaussian density that models our predictors, the term $-\frac{1}{2}x^T\Sigma_k^{-1}x$ in the resulting discriminant function no longer cancels across classes, because each class now has its own $\Sigma_k$. This key observation is what makes our decision boundaries quadratic, hence the Q in QDA.

A few important things about QDA:

  1. QDA estimates a separate covariance matrix for each class, i.e. $K\frac{p(p+1)}{2}$ covariance parameters in total, so it is far more flexible than LDA but has higher variance.
  2. LDA tends to be the better choice when there are relatively few training observations, or when the classes really do share a common covariance matrix; QDA tends to win when the training set is large or when the common-covariance assumption is badly violated.

A quick comparison on synthetic data is sketched below.
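
Assuming scikit-learn is available, here is a small comparison on simulated data in which the two classes have deliberately different covariance structures (all parameters made up):

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

# Simulated classes with shifted means and very different covariance structures.
rng = np.random.default_rng(4)
X0 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], 500)
X1 = rng.multivariate_normal([1.0, 1.0], [[1.0, -0.8], [-0.8, 1.0]], 500)
X = np.vstack([X0, X1])
y = np.array([0] * 500 + [1] * 500)

lda = LinearDiscriminantAnalysis().fit(X, y)
qda = QuadraticDiscriminantAnalysis().fit(X, y)   # fits a separate Sigma_k per class
print("LDA training accuracy:", lda.score(X, y))
print("QDA training accuracy:", qda.score(X, y))  # should be higher here, since the
                                                  # common-covariance assumption is violated
```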

Naive Bayes

This is a model that works surprisingly well when pp is large.

Key assumption: within each class, all predictors are independent of (and therefore uncorrelated with) each other. In other words, every $\Sigma_k$ is a diagonal matrix. This might seem like a crazy assumption to make; here's an intuitive reason why it works well for large $p$: when $p$ is large we rarely have enough data to estimate the joint relationships among the predictors, so assuming independence introduces some bias but greatly reduces variance.

Since we've assumed that the predictors are independent within each class, $X_i = x_i \mid Y=k$ is independent of $X_j = x_j \mid Y=k$, where $X_i$ and $X_j$ are individual predictors in a predictor vector $X$ and $x_i$ and $x_j$ are realizations of their respective predictors. Remember that if events $A$ and $B$ are independent, then $P(A\cap B) = P(A) \times P(B)$. Thus, $P(Y=k|X=x) \propto \pi_k\prod_{j=1}^{p}f_{kj}(x_j)$.

Finally, we find that the discriminant function follows the form:

$$\delta_k(x) \propto \log\left[\pi_k\prod_{j=1}^p f_{kj}(x_j)\right]$$
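
A hand-rolled Gaussian naive Bayes sketch in NumPy/SciPy (toy data; only per-class, per-feature means and standard deviations are estimated, so the off-diagonal covariance terms are implicitly zero):

```python
import numpy as np
from scipy.stats import norm

def fit_naive_bayes(X, y):
    """Per-class priors plus per-class, per-feature means and standard deviations."""
    classes = np.unique(y)
    pi = np.array([np.mean(y == k) for k in classes])
    mu = np.array([X[y == k].mean(axis=0) for k in classes])
    sd = np.array([X[y == k].std(axis=0) for k in classes])
    return classes, pi, mu, sd

def predict_naive_bayes(x, classes, pi, mu, sd):
    # log pi_k + sum_j log f_kj(x_j): the product of densities becomes a sum of logs.
    log_post = np.log(pi) + norm.logpdf(x, loc=mu, scale=sd).sum(axis=1)
    return classes[np.argmax(log_post)]

# Toy data: two classes, three predictors, each feature generated independently.
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0.0, 1.0, (50, 3)), rng.normal(1.0, 1.0, (50, 3))])
y = np.array([0] * 50 + [1] * 50)

params = fit_naive_bayes(X, y)
print(predict_naive_bayes(np.array([0.8, 0.9, 1.1]), *params))
```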

Real Quick: LDA vs Logistic