Gaussian Discriminant Analysis and Logistic Regression

There are many ways to classify machine learning algorithms: supervised/unsupervised, regression/classification, … Personally, I prefer to distinguish between discriminative models and generative models. In this article, I will discuss the relationship between these two families, using Gaussian Discriminant Analysis and Logistic Regression as examples.

Quick review: discriminative methods model p(y \mid x) directly. In a classification task, these models search for a hyperplane (a decision boundary) separating the different classes. The majority of popular algorithms belong to this family: logistic regression, SVMs, neural networks, … On the other hand, generative methods model p(x \mid y) (and p(y) ). This means they give us a probability distribution over the features for each class in the classification problem, which gives us an idea of how the data is generated. This type of model relies heavily on Bayes' rule to combine the prior with the likelihood and derive the posterior. Some well-known examples are Naive Bayes, Gaussian Discriminant Analysis, …

[Figure] Discriminative vs Generative. Source: evolvingai.org.

There are several reasons why discriminative models are more popular among machine learning practitioners: they are more flexible, more robust, and less sensitive to incorrect modeling assumptions. Generative models, on the other hand, require us to specify the distributions of the prior and the likelihood, which can be quite challenging in many situations. However, this is also their advantage: they encode more "information" about the data than discriminative models, and thus can perform quite well with limited data if the assumptions are correct.

In this article, I will demonstrate the point above by proving that Gaussian Discriminant Analysis (GDA) leads to a posterior of the logistic regression form, and thus that Logistic Regression is the more "general" model.

For binary classification, GDA assumes that the prior on the label follows a Bernoulli distribution and that the likelihood of each class follows a multivariate Gaussian distribution with a shared covariance matrix:

y \sim \mathrm{Bernoulli}(\phi)

x \mid y=0 \sim \mathcal{N}(\mu_0, \Sigma)

x \mid y=1 \sim \mathcal{N}(\mu_1, \Sigma)

Let’s write down their mathematical formulas:

p(y) = \phi^{y} \times (1-\phi)^{1-y}

p(x \mid y=0) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_0)^T \Sigma^{-1} (x - \mu_0)\right)

p(x \mid y=1) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_1)^T \Sigma^{-1} (x - \mu_1)\right)
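To make these assumptions concrete, here is a minimal sketch in Python/NumPy (the helper names fit_gda and gaussian_density are just illustrative) of how \phi, \mu_0, \mu_1 and \Sigma would be estimated by maximum likelihood from a labelled dataset, and how the class-conditional densities above would be evaluated:

```python
import numpy as np

def fit_gda(X, y):
    """Maximum-likelihood estimates of the GDA parameters.
    X: (m, n) feature matrix, y: (m,) array of 0/1 labels."""
    phi = y.mean()                                   # Bernoulli parameter p(y = 1)
    mu0 = X[y == 0].mean(axis=0)                     # mean of class 0
    mu1 = X[y == 1].mean(axis=0)                     # mean of class 1
    # Shared covariance matrix: centre each sample with its own class mean.
    centred = X - np.where((y == 1)[:, None], mu1, mu0)
    sigma = centred.T @ centred / len(y)
    return phi, mu0, mu1, sigma

def gaussian_density(x, mu, sigma):
    """Multivariate normal density N(mu, sigma) evaluated at a single point x."""
    n = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (n / 2) * np.linalg.det(sigma) ** 0.5
    return np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff)) / norm
```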

As mentioned above, a discriminative model (here, logistic regression) tries to model p(y \mid x) directly, so what we want to prove is that:

p(y=1 \mid x) = \frac{1}{1 + exp(-\theta^T x)}

which is the sigmoid form of logistic regression, where \theta is some function of \phi, \mu_0, \mu_1 and \Sigma.
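For reference, a tiny sketch of that sigmoid posterior, with x_0 = 1 prepended to x to account for the intercept term (the function names are just illustrative):

```python
import numpy as np

def sigmoid(z):
    """The logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def logreg_posterior(theta, x):
    """p(y = 1 | x) in logistic regression; x_0 = 1 is prepended so that
    theta[0] plays the role of the intercept."""
    return sigmoid(theta @ np.concatenate(([1.0], x)))
```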

OK, let’s roll up our sleeves and do some maths:

p(y=1 \mid x)

=\frac{p(x \mid y=1) \times p(y=1)}{p(x)}

= \frac{p(x \mid y=1) \times p(y=1)}{p(x \mid y=1) \times p(y=1) +p(x \mid y=0) \times p(y=0)}

= \frac{1}{1 + \frac{p(x \mid y=0) \times p(y=0)}{p(x \mid y=1) \times p(y=1)}}

This expression looks very much like what we are looking for; let’s take a closer look at the fraction \frac{p(x \mid y=0) \times p(y=0)}{p(x \mid y=1) \times p(y=1)} :

\frac{p(x \mid y=0) \times p(y=0)}{p(x \mid y=1) \times p(y=1)}

= \exp\left(-\frac{1}{2}(x - \mu_0)^T \Sigma^{-1} (x - \mu_0) + \frac{1}{2}(x - \mu_1)^T \Sigma^{-1} (x - \mu_1)\right) \times \frac{1 - \phi}{\phi}

= \exp\left(\frac{1}{2}\left[(x - \mu_1)^T \Sigma^{-1} (x - \mu_1) - (x - \mu_0)^T \Sigma^{-1} (x - \mu_0)\right]\right) \times \exp\left(\log\frac{1-\phi}{\phi}\right)

Expanding both quadratic forms, the x^T \Sigma^{-1} x terms cancel, leaving:

= \exp\left((\mu_0 - \mu_1)^T \Sigma^{-1} x + \frac{1}{2}\left(\mu_1^T \Sigma^{-1} \mu_1 - \mu_0^T \Sigma^{-1} \mu_0\right)\right) \times \exp\left(\log\frac{1-\phi}{\phi}\right)

= \exp\left[\left(\log\frac{1-\phi}{\phi} + \frac{1}{2}\mu_1^T \Sigma^{-1} \mu_1 - \frac{1}{2}\mu_0^T \Sigma^{-1} \mu_0\right) \times x_0 + (\mu_0 - \mu_1)^T \Sigma^{-1} x\right]

In the last equation, we introduced a constant coordinate x_0 = 1 so that the whole exponent becomes an affine function of the augmented input (x_0, x), i.e. it can be written as -\theta^T x (the minus sign is absorbed into \theta). Plugging this back in, the posterior is:

p(y=1 \mid x) = \frac{1}{1 + \frac{p(x \mid y=0) \times p(y=0)}{p(x \mid y=1) \times p(y=1)}} = \frac{1}{1 + \exp\left[\left(\log\frac{1-\phi}{\phi} + \frac{1}{2}\mu_1^T \Sigma^{-1} \mu_1 - \frac{1}{2}\mu_0^T \Sigma^{-1} \mu_0\right) x_0 + (\mu_0 - \mu_1)^T \Sigma^{-1} x\right]} = \frac{1}{1 + \exp(-\theta^T x)}

And there it is: we just proved that the posterior of Gaussian Discriminant Analysis is indeed of the Logistic Regression form. Our vector \theta (obtained by negating the coefficients in the exponent above) is:

\theta = \begin{bmatrix} \log\frac{\phi}{1-\phi} + \frac{1}{2}\left(\mu_0^T \Sigma^{-1} \mu_0 - \mu_1^T \Sigma^{-1} \mu_1\right) \\ \Sigma^{-1}(\mu_1 - \mu_0) \end{bmatrix}
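As a sanity check, here is a small numerical sketch (the parameter values are made up purely for illustration) that computes the posterior once through Bayes' rule with the Gaussian densities, and once through the logistic form with the \theta above; the two numbers should coincide:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Hypothetical parameters of the generative model.
phi = 0.3
mu0 = np.array([0.0, 0.0])
mu1 = np.array([1.0, 2.0])
sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
sigma_inv = np.linalg.inv(sigma)

# theta as derived above: intercept first, then the coefficients of x.
intercept = np.log(phi / (1 - phi)) + 0.5 * (mu0 @ sigma_inv @ mu0 - mu1 @ sigma_inv @ mu1)
theta = np.concatenate(([intercept], sigma_inv @ (mu1 - mu0)))

x = rng.normal(size=2)

# Posterior via Bayes' rule (the GDA way).
p1 = multivariate_normal.pdf(x, mu1, sigma) * phi
p0 = multivariate_normal.pdf(x, mu0, sigma) * (1 - phi)
posterior_gda = p1 / (p0 + p1)

# Posterior via the logistic form with the theta above.
posterior_logistic = 1.0 / (1.0 + np.exp(-theta @ np.concatenate(([1.0], x))))

print(posterior_gda, posterior_logistic)   # the two values should match
```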

The converse is not true, though: p(y \mid x) being a logistic function does not imply that p(x \mid y) is multivariate Gaussian. This observation shows that GDA makes a much stronger assumption than Logistic Regression. In fact, we can go one step further and prove that if p(x \mid y) belongs to the Exponential Family (Gaussian, Poisson, …), the posterior p(y \mid x) is still of the logistic regression form. We now see one reason why Logistic Regression is so widely used: it is a very general, robust algorithm that remains valid under many different underlying assumptions. GDA (and generative models in general), on the other hand, makes much stronger assumptions, and thus is not ideal for data whose distribution is non-Gaussian or simply unknown. When the Gaussian assumption does hold, however, GDA can make good use of limited data, as the small sketch below suggests.
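A minimal sketch of that comparison, assuming scikit-learn is available (its LinearDiscriminantAnalysis estimator is essentially GDA with a shared covariance matrix): train both models on a deliberately small sample drawn from two Gaussian classes and compare their accuracy on a larger test set.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Simulated data that satisfies the GDA assumption: two Gaussian classes, shared covariance.
n_train = 20   # deliberately small training set per class
mu0, mu1 = np.array([0.0, 0.0]), np.array([1.5, 1.5])
X = np.vstack([rng.normal(mu0, 1.0, size=(n_train, 2)),
               rng.normal(mu1, 1.0, size=(n_train, 2))])
y = np.repeat([0, 1], n_train)

X_test = np.vstack([rng.normal(mu0, 1.0, size=(500, 2)),
                    rng.normal(mu1, 1.0, size=(500, 2))])
y_test = np.repeat([0, 1], 500)

gda = LinearDiscriminantAnalysis().fit(X, y)   # generative model (shared covariance)
logreg = LogisticRegression().fit(X, y)        # discriminative model

print("GDA/LDA accuracy:    ", gda.score(X_test, y_test))
print("Logistic regression: ", logreg.score(X_test, y_test))
```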

(The limitation of GDA, and of generative models in general, can be alleviated with a class of Bayesian Machine Learning methods that use Markov Chain Monte Carlo to sample from the posterior distribution. This is a very exciting topic that I’m really into, so I will save it for a future post.)


Linear Programming and Machine Learning

Operations Research is a field that interests me a lot and that is very close to Machine Learning and Optimization. The main idea is to find the maximum/minimum of an objective function while taking constraints into account.

One of the most basic and most widely used algorithms in Operations Research is the simplex method. For more details on this method, you can easily find articles on Google; a toy linear program solved in code is also sketched below. For those who want to dig deeper into this beautiful field, I encourage you to start with this book: Operations Research: Applications and Algorithms.
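As a small illustration, here is a toy linear program solved with SciPy's generic LP solver linprog (the numbers are made up purely for the example):

```python
from scipy.optimize import linprog

# A toy linear program: maximize 3x + 2y subject to
#   x + y  <= 4
#   x + 3y <= 6
#   x, y   >= 0
# linprog minimizes, so we negate the objective coefficients.
result = linprog(c=[-3, -2],
                 A_ub=[[1, 1], [1, 3]],
                 b_ub=[4, 6],
                 bounds=[(0, None), (0, None)])

print(result.x, -result.fun)   # optimal point and optimal objective value
```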

In my article below, I discuss a specific case that shows the link between Operations Research and Machine Learning: breast cancer diagnosis.

Breast cancer is a malignant tumour of the mammary gland. Its first sign is a lump in the breast. However, the majority of lumps are benign, so we need techniques to distinguish the benign ones from the malignant ones.

One of the most widely used techniques is fine-needle aspiration biopsy (FNA). At the time, this technique had a rather high degree of variability in accuracy (from 68% to 98%).

To improve the results of FNA, researchers proposed a method using Machine Learning and Linear Programming to automatically classify a new subject into one of the two classes (benign or malignant). Each already-diagnosed subject (malignant or not) is represented as a vector of features. Based on these subjects (what machine learning calls the "training set"), we can find a hyperplane that separates the two classes, as sketched below.
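Here is a minimal sketch of that idea, assuming SciPy is available: we look for a hyperplane w^T x + b by minimizing the total slack of the margin constraints, which is a pure linear program (no quadratic term, unlike an SVM). The data and the function name are made up for illustration.

```python
import numpy as np
from scipy.optimize import linprog

def lp_separating_hyperplane(X, y):
    """Find w, b such that y_i * (w @ x_i + b) >= 1 - s_i with slack s_i >= 0,
    minimizing the total slack. Labels y_i are in {-1, +1}."""
    m, n = X.shape
    # Decision variables, stacked as [w (n), b (1), s (m)].
    c = np.concatenate([np.zeros(n + 1), np.ones(m)])   # minimize the sum of slacks
    A_ub = np.hstack([-y[:, None] * X,                   # -y_i * x_i  (coefficients of w)
                      -y[:, None],                       # -y_i        (coefficient of b)
                      -np.eye(m)])                       # -s_i
    b_ub = -np.ones(m)
    bounds = [(None, None)] * (n + 1) + [(0, None)] * m  # w, b free; slacks non-negative
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:n], res.x[n]

# Tiny synthetic example (e.g. benign = -1, malignant = +1).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(3, 1, (30, 2))])
y = np.repeat([-1.0, 1.0], 30)
w, b = lp_separating_hyperplane(X, y)
print("predicted classes:", np.sign(X @ w + b))
```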

Voilà, enjoy : )

Application de la Programmation Linéaire pour la Classification des motifs en Machine Learning (Application of Linear Programming to Pattern Classification in Machine Learning)