Statistical Learning Notes | LDA 1 - Linear Discriminant Analysis
ISLR4.4 - LDA
0. LDA vs Logistic Regression
- When the classes are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable
- LDA is more stable than Logistic Regression
- If n is small and the distribution of X is approximately Normal in each of the classes,
- LDA is more accurate than Logistic Regression
- LDA is more commonly used when we have more than two response classes ($K > 2$)
- because it also provides low-dimensional views of the data
1. Bayes' Theorem for Classification
Modeling the distribution of X in each of the classes separately,
- and then using Bayes' Theorem to flip things around and obtain $\Pr(Y|X)$
Using Normal(Gaussian) distributions for each class,
- leads to Linear or Quadratic Discriminant Analysis
Bayes Theorem for LDA:
Let $\pi_k$ represent the prior probability
- that a randomly chosen observation comes from the kth class;
Let $f_k(x)=\Pr(X=x\ |\ Y=k)$ denote the density function of X for an observation that comes from the kth class.
- larger $f_k(x)$ => higher probability that an observation in the kth class has $X\approx x$
$$\Pr(Y=k\ |\ X=x)=\frac{\Pr(Y=k)\cdot\Pr(X=x\ |\ Y=k)}{\Pr(X=x)}$$ $$\Downarrow$$ $$p_k(x)=\frac{\pi_kf_k(x)}{\sum_{l=1}^K\pi_lf_l(x)}$$
$p_k(x)=\Pr(Y=k\ |\ X=x)$ is the posterior probability that an observation $X=x$ belongs to the kth class
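A minimal numeric sketch of this formula (the priors and density values below are hypothetical):

```python
import numpy as np

# Hypothetical values for K = 3 classes at a single point x
priors = np.array([0.3, 0.5, 0.2])        # pi_k: prior class probabilities
densities = np.array([0.10, 0.40, 0.05])  # f_k(x): class-conditional densities at x

# Bayes' theorem: p_k(x) = pi_k * f_k(x) / sum_l pi_l * f_l(x)
posteriors = priors * densities / np.sum(priors * densities)
print(posteriors)           # ~ [0.125, 0.833, 0.042]
print(posteriors.argmax())  # the Bayes classifier picks the class with the largest posterior
```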
Estimating $\pi_k$ is Easy if we have a random sample from the population:
- simply compute the fraction of the training observations that belong to the kth class
Estimating $f_k(x)$ is Challenging
- have to make some simplifying assumptions
- LDA, QDA and Naive Bayes are three classifiers that use different estimates of $f_k(x)$ to approximate the Bayes classifier
2. LDA for Single Predictor
The LDA Classifier results from assuming that $f_k(x)$ has the form of a Normal (Gaussian) density:
$$f_k(x)=\frac{1}{\sqrt{2\pi} \ \sigma_k}\exp{(-\frac{1}{2\sigma_k^2}(x-\mu_k)^2)}$$
- $\mu_k$ is the mean, $\sigma_k^2$ is the variance in class k
- assume that all $\sigma_k^2 = \sigma^2$ are the same
- different class-specific variance => QDA
So the posterior probability is: $$p_k(x)=\frac{\pi_k \frac{1}{\sqrt{2\pi} \sigma}e^{(-\frac{1}{2\sigma^2}(x-\mu_k)^2)}}{\sum_{l=1}^K \pi_l \frac{1}{\sqrt{2\pi} \sigma}e^{(-\frac{1}{2\sigma^2}(x-\mu_l)^2)}}$$
- The Bayes Classifier involves assigning an observation $X = x$ to the class for which $p_k(x)$ is Largest
- Simplify: taking logs and discarding terms that do not depend on k, this is equivalent to assigning the observation to the class with the Largest Discriminant Score
$$\delta_k(x)=x\cdot\frac{\mu_k}{\sigma^2}-\frac{\mu_k^2}{2\sigma^2}+\ln(\pi_k)$$
Example
If K = 2, and $\pi_1=\pi_2=0.5$, the Bayes decision boundary is at $$x = \frac{\mu_1^2-\mu_2^2}{2(\mu_1-\mu_2)}=\frac{\mu_1+\mu_2}{2}$$
- Let $\mu_1 = -1.25$, $\mu_2 = 1.25$, and $\sigma = 1$; then the Bayes classifier
- assigns class 1 if x < 0
- assigns class 2 if x > 0
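A quick numeric check of this example (a minimal sketch, assuming the stated parameters and equal priors); the discriminant scores cross exactly at x = 0:

```python
import numpy as np

def delta(x, mu, sigma2, prior):
    """Single-predictor LDA discriminant score delta_k(x)."""
    return x * mu / sigma2 - mu**2 / (2 * sigma2) + np.log(prior)

mu1, mu2, sigma2, prior = -1.25, 1.25, 1.0, 0.5

for x in (-0.5, 0.0, 0.5):
    d1, d2 = delta(x, mu1, sigma2, prior), delta(x, mu2, sigma2, prior)
    print(x, 1 if d1 > d2 else 2)  # class 1 for x < 0, class 2 for x > 0 (tie at x = 0)
```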
In real life we don't know the true distributions or their parameters; we just have the training data
- not able to calculate the Bayes Classifier
- need to estimate the parameters and approximate the optimal Bayes Classifier
Estimating the Parameters
The LDA method approximates the Bayes Classifier by using the following estimates:
$$\hat{\pi}_k=n_k/n$$
$$\hat{\mu}_k=\frac{1}{n_k}\sum_{i:y_i=k}x_i$$
$$\hat{\sigma}^2=\frac{1}{n-K}\sum_{k=1}^K\sum_{i:y_i=k}(x_i-\hat{\mu}_k)^2$$
$$=\sum_{k=1}^K\frac{n_k-1}{n-K}\cdot\hat{\sigma}_k^2$$
where the estimated variance in the kth class is
$$\hat{\sigma}_k^2=\frac{1}{n_k-1}\sum_{i:y_i=k}(x_i-\hat{\mu}_k)^2$$
- LDA estimates $\pi_k$ using the proportion of the training observations that belong to the kth class
- the estimate for $\mu_k$ is the average of all the training data from the kth class
- $\hat{\sigma}^2$ can be seen as a weighted average of the sample variances for each of the K classes.
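A minimal numpy sketch of these plug-in estimates (the tiny x, y arrays below are hypothetical training data):

```python
import numpy as np

# Hypothetical 1-D training data: predictor values and class labels in {0, 1}
x = np.array([-2.1, -1.3, -0.4, 0.9, 1.6, 2.2])
y = np.array([0, 0, 0, 1, 1, 1])

n, K = len(x), 2
n_k = np.bincount(y)                                      # observations per class

pi_hat = n_k / n                                          # class proportions
mu_hat = np.array([x[y == k].mean() for k in range(K)])  # per-class sample means
sigma2_hat = sum(((x[y == k] - mu_hat[k])**2).sum()
                 for k in range(K)) / (n - K)             # pooled (weighted) variance
print(pi_hat, mu_hat, sigma2_hat)
```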
Discriminant Functions
The LDA classifier plugs the estimates given above into the discriminant function $\delta_k(x)$ $$\hat{\delta}_k(x)=x\cdot\frac{\hat{\mu}_k}{\hat{\sigma}^2}-\frac{\hat{\mu}_k^2}{2\hat{\sigma}^2}+\ln(\hat{\pi}_k)$$
- and assigns an observation $X=x$ to the class k for which $\hat{\delta}_k(x)$ is largest
- $\hat{\delta}_k(x)$ are linear functions of x
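A sketch of this plug-in rule, using hypothetical estimates of the kind produced by the estimation step above; note that each score is linear in x:

```python
import numpy as np

def lda_classify(x, pi_hat, mu_hat, sigma2_hat):
    """Return the class with the largest estimated discriminant score."""
    scores = x * mu_hat / sigma2_hat - mu_hat**2 / (2 * sigma2_hat) + np.log(pi_hat)
    return int(np.argmax(scores))

# Hypothetical estimates (of the kind produced by the estimation step above)
pi_hat = np.array([0.5, 0.5])
mu_hat = np.array([-1.27, 1.57])
sigma2_hat = 0.57

print(lda_classify(-0.8, pi_hat, mu_hat, sigma2_hat))  # 0
print(lda_classify(1.1, pi_hat, mu_hat, sigma2_hat))   # 1
```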
Evaluation
3. LDA for Multiple Predictors
Assume that $X=(X_1,X_2,\cdots,X_p) $ in the kth class is drawn from a MultiVariate Gaussian (normal) Distribution, $X \sim N(\mu_k,\Sigma)$:
$$f_k(x)=\frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}e^{(-\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k))}$$
- $\mu_k=E(X\ |\ Y=k)$ is the class-specific mean vector of X
- $\Sigma = Cov(X)$ is the $p \times p$ covariance matrix that is common to ALL K classes
MultiVariate Gaussian Distribution assumes that each individual predictor follows a 1-dimensional normal distribution
- with some correlation between each pair of predictors
Plugging the density function for the kth class, $f_k(X=x)$, into Bayes' Theorem and performing some algebra reveals that
- Bayes Classifier assigns an observation X=x to the class for which the Discriminant function is largest: $$\delta_k(x)=x^T\Sigma^{-1}\mu_k-\frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k+\log\pi_k$$
- LDA approximates this by estimating the unknown parameters in the same way as in the 1-dimensional case
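A minimal numpy sketch of the multivariate discriminant (the mean vectors, common covariance matrix, and priors below are hypothetical placeholders; in practice they are estimated from training data, and scikit-learn's `LinearDiscriminantAnalysis` fits this same model):

```python
import numpy as np

def lda_discriminant(x, mu_k, Sigma, pi_k):
    """delta_k(x) = x^T Sigma^{-1} mu_k - 1/2 mu_k^T Sigma^{-1} mu_k + log(pi_k)."""
    Sigma_inv = np.linalg.inv(Sigma)
    return x @ Sigma_inv @ mu_k - 0.5 * mu_k @ Sigma_inv @ mu_k + np.log(pi_k)

# Hypothetical parameters for K = 2 classes and p = 2 predictors
mus = [np.array([-1.0, 0.0]), np.array([1.0, 0.5])]
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])  # covariance matrix shared by all classes
priors = [0.6, 0.4]

x = np.array([0.2, 0.1])
scores = [lda_discriminant(x, mu, Sigma, pi) for mu, pi in zip(mus, priors)]
print(np.argmax(scores))  # assign x to the class with the largest score
```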
Next:
LDA on Credit Dataset, ROC, AUC