We generally try to find the probability that a given observation belongs to each class, $P(Y=k \mid X=x)$, and assign the observation to the class with the highest probability.

- We can use linear regression when the response is binary (has just 2 levels), encoding it as 0/1. We can also use linear regression when the response has an inherent ordering with equal spacing between levels. Beyond these restrictions, however, it's difficult to use linear regression for classification.
- Linear regression produces a line, which can produce negative outputs or outputs greater than 1 for some $x$ values. This violates the requirement that probabilities lie in $[0,1]$.
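The second point is easy to check numerically. Below is a small sketch (the data are made up for illustration) that fits an ordinary least-squares line to a 0/1 response and evaluates it outside the training range:

```python
import numpy as np

# Toy binary response regressed on x with ordinary least squares.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0,   0,   0,   1,   1,   1])   # binary (0/1) response

slope, intercept = np.polyfit(x, y, 1)       # least-squares line

# The fitted line is unbounded, so its "probabilities" can leave [0, 1]:
print(intercept + slope * (-2.0) < 0)  # True: output below 0
print(intercept + slope * 7.0 > 1)     # True: output above 1
```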

Linear regression models the predicted output given regressors. Logistic regression models the probability that the output belongs to a class given regressors.

Let the odds of an event with probability $p$ be defined as

$$\text{odds} = \frac{p}{1-p}$$

Let

$$p(X) = P(Y=1 \mid X) = \frac{e^{\beta_0 + \beta_1X}}{1 + e^{\beta_0 + \beta_1X}}$$

Note that $p(X)$ always lies in $(0,1)$, and that the odds become

$$\frac{p(X)}{1-p(X)} = e^{\beta_0 + \beta_1X}$$

Notice that taking logs gives the *log-odds* (or *logit*), which is linear in $X$:

$$\log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1X$$
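A quick numeric sketch of the logistic model (the coefficients $\beta_0 = -4$, $\beta_1 = 2$ are made up for illustration) confirms both properties: outputs stay in $(0,1)$, and the log of the odds recovers the linear function.

```python
import math

# Made-up logistic coefficients, purely illustrative.
b0, b1 = -4.0, 2.0

def p(x):
    """P(Y=1 | X=x) under the logistic model."""
    return math.exp(b0 + b1 * x) / (1 + math.exp(b0 + b1 * x))

# Probabilities stay inside (0, 1), unlike a linear model's output:
print(0 < p(-10) and p(10) < 1)  # True

# The log-odds recover the linear function b0 + b1*x:
log_odds = math.log(p(3.0) / (1 - p(3.0)))
print(round(log_odds, 6))  # b0 + b1*3 = 2.0
```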

We estimate the logistic coefficients by seeking estimates for $\beta_0$ and $\beta_1$ that maximize the likelihood function

$$\ell(\beta_0, \beta_1) = \prod_{i:y_i=1}p(x_i)\prod_{i':y_{i'}=0}\left(1 - p(x_{i'})\right)$$

We can measure the accuracy of the coefficient estimates by computing their standard errors, finding their z-statistics, and then finding their associated p-values. Large values of the z-statistics indicate evidence against the null hypothesis $H_0: \beta_1 = 0$.

Now we can extend the model to multiple predictors (*multiple logistic regression*):

$$\log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1X_1 + \ldots + \beta_pX_p$$

Keep in mind that we are still restricting each predicted probability $p(X)$ to lie in $[0,1]$.

It's possible that we now find confounding between different variables in the multiple logistic regression setting.

Important assumptions:

- All observations from a class $k$ follow a normal distribution.
- The variance $\sigma^2$ of the distribution of observations is shared across each of the $K$ classes.

Note that the density function in one dimension is

$$f_k(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{1}{2\sigma^2}(x - \mu_k)^2\right)$$

We let $\pi_k$ denote the prior probability that a randomly chosen observation comes from class $k$.

Remember that Bayes' theorem states that

$$P(Y=k \mid X=x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K}\pi_l f_l(x)}$$

Note that if we plug the normal density into Bayes' theorem, take logs, and discard terms that are constant across classes, maximizing the posterior probability is equivalent to maximizing a discriminant function.

Thus we can use the following linear equation to classify observations based on whether or not $\delta_k(x)$ is the largest among the classes:

$$\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)$$

However, in order to find estimates for the discriminant functions, we need to find estimates for $\pi_k$, $\mu_k$, and $\sigma^2$:

- $\hat\pi_k = \frac{n_k}{n}$, where $n$ is the number of observations and $n_k$ is the number of observations within class $k$.
- $\hat\mu_k = \frac{1}{n_k}\sum_{i:y_i = k}x_i$
- $\hat\sigma^2 = \frac{1}{n-K}\sum_{k=1}^{K}\sum_{i:y_i=k}(x_i - \hat\mu_k)^2$
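These estimators are easy to compute directly. Here is a sketch on made-up one-dimensional data for $K=2$ classes (data and classification are purely illustrative):

```python
import numpy as np

# Toy one-dimensional training data for K = 2 classes.
x = np.array([1.0, 1.2, 0.8, 3.0, 3.2, 2.8])
y = np.array([0,   0,   0,   1,   1,   1])
n, K = len(x), 2

pi_hat = np.array([np.mean(y == k) for k in range(K)])    # n_k / n
mu_hat = np.array([x[y == k].mean() for k in range(K)])   # per-class means
# Pooled variance: 1/(n-K) times the sum of squared within-class deviations.
sigma2_hat = sum(((x[y == k] - mu_hat[k]) ** 2).sum() for k in range(K)) / (n - K)

# Plug the estimates into the linear discriminant
# delta_k(x0) = x0 * mu_k / sigma^2 - mu_k^2 / (2 sigma^2) + log(pi_k).
def delta(x0, k):
    return x0 * mu_hat[k] / sigma2_hat - mu_hat[k] ** 2 / (2 * sigma2_hat) + np.log(pi_hat[k])

x0 = 2.5
pred = max(range(K), key=lambda k: delta(x0, k))
print(pred)  # 1 -- x0 = 2.5 is closer to the class-1 mean of 3.0
```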

We let the decision boundary be the set of points where the discriminant functions of two classes are equal, i.e. where $\delta_k(x) = \delta_l(x)$ for $k \neq l$. With $K = 2$ and $\pi_1 = \pi_2$, this is the midpoint $x = \frac{\mu_1 + \mu_2}{2}$.

Assume that $X = (X_1, X_2, \ldots, X_p)$ is drawn from a *multivariate* Gaussian distribution $N(\mu, \Sigma)$ with mean vector $\mu$ and covariance matrix $\Sigma$:

- $\mu = E[X] = (E[X_1], E[X_2], \ldots, E[X_p])$
- $\Sigma$ - Covariance extends the idea of *variance* to multiple dimensions. $\Sigma$ is a $p \times p$ matrix whose $i,j$ cell is the covariance between predictors $X_i$ and $X_j$. Remember that $Cov(X,Y) = E[(X-E[X])(Y-E[Y])]$.
- Each cell along the diagonal of the covariance matrix represents the covariance of a predictor $X_i$ with itself, which is nothing but the variance of $X_i$, since $Cov(X,X) = E[(X-E[X])(X-E[X])] = E[(X-E[X])^2] = Var[X]$.
- Let $E[X] = E[Y] = \mu_x = \mu_y = 0$ by centering the distributions of $X$ and $Y$. In this case, $Cov(X,Y) = E[XY] \approx \frac{1}{n}\sum_i^n X_iY_i$.
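The last point (centered products averaging to the covariance) can be checked numerically. This sketch uses simulated data; the 0.5 coefficient is arbitrary and built in so the true covariance is known:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)   # constructed so that Cov(X, Y) = 0.5

# Center each variable, then average the elementwise products:
xc, yc = x - x.mean(), y - y.mean()
cov_hand = (xc * yc).mean()

# Agrees with the library estimate (up to the 1/n vs 1/(n-1) factor):
print(abs(cov_hand - np.cov(x, y)[0, 1]) < 1e-3)  # True
```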


Multivariate Gaussian density:

$$f(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$$

LDA for $p > 1$ predictors assumes that observations in class $k$ are drawn from $N(\mu_k, \Sigma)$, where $\Sigma$ is again shared across all $K$ classes.

LDA Classifier: assign an observation $x$ to the class with the largest value of

$$\delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log\pi_k$$

Notice that this is linear since the only term with any dependence on $x$ is $x^T\Sigma^{-1}\mu_k$, which is linear in $x$.

*Note:* In LDA where $K = 2$, the default behavior is to assign an observation to a class when its posterior probability exceeds 50%.

The `credit` data set contains information about individuals, including whether or not those individuals defaulted on their loans.

In a *confusion matrix*, we can see the true and false positives and negatives that result from a model. The general form of a confusion matrix is:

| True Class | Predicted: No | Predicted: Yes | Total |
|---|---|---|---|
| No | True Neg. | False Pos. | N |
| Yes | False Neg. | True Pos. | P |
| Total | N\* | P\* | |

For the default status in the credit card data set that results from running LDA, we obtain the following confusion matrix:

| Predicted Default Status | True: No | True: Yes | Total |
|---|---|---|---|
| No | 9,644 | 252 | 9,896 |
| Yes | 23 | 81 | 104 |
| Total | 9,667 | 333 | 10,000 |

We see that only 3.33% of individuals in the training sample defaulted. This means that even a *null* classifier that predicts every individual never defaults will have an error rate of 3.33%. LDA on this training set gives us an error rate of 2.75%, which is *very* close to the error rate of the *null* classifier. We can also see that of the 333 individuals that defaulted, LDA missed 252 of them (that's 75.7%!).
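The rates quoted above can be recomputed directly from the table's four cells:

```python
# Cells of the LDA confusion matrix on the credit data.
tn, fp = 9_644, 23   # true No:  predicted No / predicted Yes
fn, tp = 252, 81     # true Yes: predicted No / predicted Yes

n = tn + fp + fn + tp          # 10,000 individuals
null_error = (fn + tp) / n     # "never defaults" classifier: 333 / 10,000
lda_error = (fp + fn) / n      # LDA misclassifications: 275 / 10,000
missed = fn / (fn + tp)        # fraction of defaulters LDA failed to catch

print(round(null_error, 4), round(lda_error, 4), round(missed, 3))
# 0.0333 0.0275 0.757
```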

An important note to make here is that vanilla LDA will always minimize the *total* error rate by approximating the Bayes error rate (which stems from trying to minimize the total number of misclassifications). In this case, that's not the best idea. We might actually care more about catching defaulters than incorrectly labelling people who didn't default as defaulters.

We can alleviate this issue by changing the threshold for classification. For LDA, instead of predicting `default = yes` when $P(\text{default} = \text{yes} \mid X) > 0.5$, we can lower the threshold from 0.5 and predict `default = yes` when, say, $P(\text{default} = \text{yes} \mid X) > 0.2$.

After reducing the threshold below the 50% used by the Bayes classifier, our total error rate has to increase. However, the error rate we care about more (misclassifying defaulters as non-defaulters) should decrease, and that is exactly what happens: LDA now misclassifies only 41% of defaulters as non-defaulters, while the overall error rate increases to 3.73% (since some non-defaulters now get classified as defaulters).

Check out the ROC curve for this classifier. It has the *False Positive Rate* on the x-axis and the *True Positive Rate* on the y-axis. The ROC curve is generated with respect to the performance of a single model tested at various thresholds.

Some key observations about the curve, where $k$ denotes the *positive* class:

- At the point (0,0), our threshold is 1. This means that an observation needs to have a probability of 1 to be classified to $k$. Since this will never happen, no observation is ever classified to $k$, and thus our True and False Positive Rates are 0.
- At the point (1,1), our threshold is 0. This means that every observation gets classified to $k$. Intuitively, this implies that all observations that are not in $k$ are classified to $k$ (which makes the False Positive Rate skyrocket to 1). Similarly, all observations that are truly in $k$ are classified to $k$ (which makes the True Positive Rate also equal to 1).
- The optimal classifier performs at (0,1). This means that we have 0 false positive readings and only true positive readings. Thus, ideally, we want this curve to hug the top left corner.
- Since we want the curve to hug the top left corner, models with higher area under the curve (AUC) are considered "good."
- The dashed diagonal line corresponds to a model that randomly assigns observations to $k$.
- The worst models classify observations in direct opposition to good models. In other words, if a good model classifies an observation to $k$, a bad model would deliberately classify the same observation to another class. These models have areas lower than the area under the dashed line.
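The endpoints described above can be verified with a small sketch that sweeps a threshold over made-up posterior probabilities:

```python
import numpy as np

# Made-up posterior probabilities ("scores") and true labels for class k.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])
labels = np.array([1,   1,   0,   1,   0,   1,   0,   0])
P, N = (labels == 1).sum(), (labels == 0).sum()

points = []
for t in [1.0, 0.75, 0.5, 0.25, 0.0]:
    pred = scores > t                           # classify to k above threshold t
    tpr = (pred & (labels == 1)).sum() / P      # True Positive Rate
    fpr = (pred & (labels == 0)).sum() / N      # False Positive Rate
    points.append((fpr, tpr))

print(points[0])   # threshold 1 -> (0, 0): nothing is classified to k
print(points[-1])  # threshold 0 -> (1, 1): everything is classified to k
```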

QDA is pretty much the same thing as LDA, but we relax a critical assumption made during LDA. We now allow individual classes to have their own covariance matrices. In other words, instead of a single $\Sigma$ shared by all classes, each class $k$ gets its own $\Sigma_k$.

The discriminant function (which is really long to type - just look at section 4.4.4 for the equation) now has quadratic and cross terms. Why? When we apply the log transformation to the Gaussian function that models our predictors, the class-specific $\Sigma_k$ no longer cancels, leaving a term that is *quadratic* in $x$, hence the Q in QDA.

A few important things about QDA:

- Uses class-specific covariance matrices in the discriminant functions.
- Has a training error rate $\leq$ that of the LDA model. This is because LDA does not include the quadratic terms or the cross-terms found in QDA, so QDA is strictly more flexible.
- QDA results in a quadratic (e.g. elliptical) Bayes decision boundary.
- QDA generally has higher variance and lower bias than LDA, since it is more flexible.

Naive Bayes is a model that works surprisingly well when the number of predictors $p$ is large.

Key assumption: all predictors are independent and uncorrelated with each other. In other words, within each class, the joint density factors over all *p* predictors:

$$f_k(x) = f_{k1}(x_1) \times f_{k2}(x_2) \times \ldots \times f_{kp}(x_p)$$

- Generally, correlation between predictors changes the shapes and orientations of Gaussian surfaces. However, the maximum posterior probability $P(Y=k \mid X=x)$ may not be highly dependent on the shapes and orientations of the surfaces. We care about the maximum posterior probability since we assign each observation to the class whose posterior probability is largest. Making this assumption allows for much easier computation (important since *p* is large) while producing similar results for the maximum posterior probability. Specifically, this assumption allows us to estimate only the $p$ diagonal entries of $\Sigma_k$, as opposed to all $p \times p$ entries.

Since we've assumed that the predictors are uncorrelated, we know that each class-specific covariance matrix $\Sigma_k$ is diagonal.

Finally, we find that the discriminant function follows the form (dropping constants shared across classes):

$$\delta_k(x) = \log\pi_k - \frac{1}{2}\sum_{j=1}^{p}\left[\log\sigma_{kj}^2 + \frac{(x_j - \mu_{kj})^2}{\sigma_{kj}^2}\right]$$
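A minimal Gaussian naive Bayes sketch on made-up 2-predictor data: estimate per-class, per-predictor means and variances (the diagonal of $\Sigma_k$), then maximize the discriminant.

```python
import numpy as np

# Made-up 2-predictor training data for K = 2 classes.
X = np.array([[1.0, 5.0], [1.2, 4.8], [0.8, 5.2],   # class 0
              [3.0, 1.0], [3.2, 0.8], [2.8, 1.2]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])
K = 2

pi = np.array([np.mean(y == k) for k in range(K)])
mu = np.array([X[y == k].mean(axis=0) for k in range(K)])
var = np.array([X[y == k].var(axis=0) for k in range(K)])  # diagonal entries only

def delta(x, k):
    # log pi_k - 1/2 * sum_j [log sigma_kj^2 + (x_j - mu_kj)^2 / sigma_kj^2]
    return np.log(pi[k]) - 0.5 * np.sum(np.log(var[k]) + (x - mu[k]) ** 2 / var[k])

x_new = np.array([2.9, 1.1])
pred = max(range(K), key=lambda k: delta(x_new, k))
print(pred)  # 1 -- x_new sits near the class-1 means
```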

- The log-odds in both LDA and logistic regression is linear: $\log\left(\frac{p_1}{1-p_1}\right) = \beta_0 + \beta_1x_1 + \ldots + \beta_px_p$
- In logistic regression, we estimate the parameters by maximizing the likelihood function.
- In LDA, we estimate the parameters by using the estimates $\hat{\pi}_k$, $\hat{\mu}_k$, and $\hat{\sigma}^2$.

- For an observation $x$ that truly belongs to class $k$, both models strive to make the probability that $x$ belongs to $k$ the maximum. In other words, both models strive to correctly maximize the posterior probabilities.
- Logistic regression assumes that $P(Y=k \mid X=x)$ is sigmoidal. LDA assumes that $P(X=x \mid Y=k)$ is normal.