# On seeking truth and what we can learn from science

In an era where fake news is the new news, everybody seems to be constantly seeking (their version of) “truth” or “fact”. It is quite ironic that while we live in a world where information is more accessible than ever, we are less “informed” than ever. Today, however, let’s take a break from all the chaos of real life and see whether we can learn something about “truth” from the scientific world.

Three years ago, I spent my summer interning at a social psychology laboratory, where I was exposed, for the very first time, to statistics and the scientific method. That was also when my naive image of “science” shattered. It was not as noble as I thought but, like real life, messy and complicated. My colleagues showed me the dark corners of psychology research, where people are willing to do whatever it takes to get a perfect p-value and have their paper published (for example, the scandal of Harvard psychology professor Marc Hauser). If you work in science, you will not be so surprised: social science, and psychology in particular, is plagued with misconduct and fraud. Reproducibility is a huge problem, as shown by the Reproducibility Project, in which 100 psychological findings were subjected to replication attempts. The results were less than a ringing endorsement of research in the field: of the expected 89 replications, only 37 were obtained, and the average size of the effects fell dramatically. At the end of the internship, I wrote an article titled “Is psychology a science?”, where I stated in my conclusion: “Psychology remains a young field in search of a solid theoretical base, but that cannot justify the lack of rigor in its research methods. Science is about truth, not about being able to publish.”

Is it?

This seemingly evident statement came back to haunt me three years later, when I came across this quote by Neil deGrasse Tyson, a scientist I respect a lot:

> The good thing about science is that it’s true whether or not you believe in it.

That was his answer to the question “What do you think about people who don’t believe in evolution?”.

If we put it in context, where he had already commented on the scientific method, this phrase becomes less troubling. However, I still find it very extreme and even misleading for laypeople, just like my statement three years ago.

For me, science is about skepticism. It is more about not being wrong than about being right. A scientific theory never aspires to be “final”; it will always be subject to additional testing.

To better understand this statement, we need to go back to the root of the scientific method: statistics.

Let’s take an example from clinical trials: suppose you want to test a new drug. You find a group of patients, give half of them the new drug and the rest a placebo. You measure the effect in each group and check whether the difference between the two groups is significant by using a hypothesis test.

This is where the famously misunderstood p-value comes into play. It is the probability, under the assumption that there is no true effect or difference, of collecting data that shows a difference equal to or greater than the one observed. For many people (including researchers), this definition is counter-intuitive because it does not do what they expect: the p-value is not a measure of effect size; it does not tell you how right you are or how big the difference is; it only expresses a level of skepticism. A small p-value simply says that we are quite surprised by the difference between the two groups, given that we were not expecting one. It is only a probability, so if someone tries enough hypotheses on the data, they will eventually get something significant (this is known as the problem of multiple comparisons, or p-hacking).
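To see how multiple comparisons manufacture “discoveries”, here is a minimal simulation (pure NumPy, with made-up numbers): we run 200 tests where the null hypothesis is true by construction, so every significant result is a false positive.

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(42)

def two_sample_p(a, b):
    """Two-sided p-value of a two-sample z-test (large-sample approximation)."""
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    z = (a.mean() - b.mean()) / se
    return erfc(abs(z) / sqrt(2))

# Both groups are drawn from the SAME distribution, so any "significant"
# difference below is a false positive by construction.
n_tests = 200
false_positives = sum(
    two_sample_p(rng.normal(size=50), rng.normal(size=50)) < 0.05
    for _ in range(n_tests)
)
print(false_positives)  # around 5% of 200 tests, i.e. roughly 10 "discoveries"
```

Run enough tests and the 5% threshold guarantees a steady supply of spurious results, which is exactly the p-hacking problem.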

The whole world of research is driven by this metric. For many journals, a p-value below 5% is the first criterion for a paper to be reviewed. However, things are more complicated than that. As mentioned earlier, the p-value is about statistical significance, not practical significance. If a researcher collects enough data, he will eventually be able to lower the p-value and “discover” something, even if its scope is so tiny that it makes no impact in real life. This is where we need to discuss effect size and, more importantly, the power of a hypothesis test. The former, as its name suggests, is the size of the difference we are measuring. The latter is the probability that a hypothesis test will yield a statistically significant outcome; it depends on the effect size and the sample size. If we have a large sample and aim at a reasonable effect size, the power of the test will be high. Conversely, if we do not have enough data but aim for a small effect size, the power will be low, which is quite logical: we cannot detect a subtle difference without enough data. We cannot just flip a coin 10 times and say that because there were 6 heads, the coin must be biased.
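The coin example can be made concrete. The sketch below (exact binomial arithmetic, toy numbers) computes the p-value for 6 heads in 10 flips, and then the power of the test when the coin truly lands heads 60% of the time, for a small and a large sample:

```python
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 6 heads in 10 flips: one-sided p-value under a fair coin -- not surprising at all
print(binom_tail(10, 6, 0.5))  # → 0.376953125

def critical_k(n, alpha=0.05):
    """Smallest k such that P(X >= k) <= alpha under a fair coin."""
    tail = 0.0
    for k in range(n, -1, -1):
        tail += comb(n, k) * 0.5**n
        if tail > alpha:
            return k + 1
    return 0

def power(n, p_true=0.6, alpha=0.05):
    """Probability of a significant result when the coin is truly biased."""
    return binom_tail(n, critical_k(n, alpha), p_true)

print(round(power(10), 3))     # tiny power with 10 flips
print(round(power(1000), 3))   # near-certain detection with 1000 flips
```

With 10 flips the test almost never detects the bias; with 1000 it almost always does, which is the sample-size dependence described above.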

In fields where experiments are costly (social science, pharmaceuticals, …), small sample sizes lead to the big problem of truth inflation (or type M error). This happens when the hypothesis test has weak power and thus cannot reliably detect any difference.

In the curve above, we see that the measured effect would need to be 9 times larger than the actual effect to reach statistical significance.

The truth inflation problem turns out to be quite “convenient” for researchers: they get a significant result with a huge effect size! This is also what journals are looking for: “groundbreaking” results (large effects in fields with little prior research). And it is not rare.
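Truth inflation is easy to reproduce in simulation. In the sketch below (all numbers made up), the true effect is small and the groups are small, so the test is underpowered; among the runs that happen to reach significance, the estimated effect is several times the true one:

```python
import numpy as np

rng = np.random.default_rng(0)
true_effect, n = 0.2, 20           # small real effect, underpowered groups
sig_effects = []

for _ in range(20_000):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(true_effect, 1.0, n)
    diff = b.mean() - a.mean()
    se = np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)
    if abs(diff / se) > 1.96:      # reaches "significance" at roughly the 5% level
        sig_effects.append(diff)

inflation = np.mean(sig_effects) / true_effect
print(round(inflation, 1))         # significant runs overestimate the effect severalfold
```

Only the lucky, exaggerated estimates clear the significance bar, so the published effect is systematically inflated.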

All this discussion is to show you that scientific methodology is not definitive. It is based on statistics, and statistics is all about uncertainty; sometimes it gets very tricky to do it the right way. But it needs to be done right. Some days it is hard, some days it is nearly impossible, but that is the way science works.

To conclude, I think that science is not solely about truth but about evaluating observations. This is where we can come back to the real world: in an era where we are drowning in data, we too need a rigorous approach to process it: cross-check information from multiple sources, be as skeptical as possible to avoid selection bias, try not to be wrong and, most importantly, be honest with oneself, because at the end of the day, truth is a subjective term.

# Gaussian Discriminant Analysis and Logistic Regression

There are many ways to classify machine learning algorithms: supervised/unsupervised, regression/classification, … I myself prefer to distinguish between discriminative models and generative models. In this article, I will discuss the relationship between these two families, using Gaussian Discriminant Analysis and Logistic Regression as examples.

Quick review: discriminative methods model $p(y \mid x)$. In a classification task, these models look for a hyperplane (a decision boundary) separating the different classes. The majority of popular algorithms belong to this family: logistic regression, SVMs, neural networks, … Generative methods, on the other hand, model $p(x \mid y)$ (and $p(y)$). This means they give us a probability distribution for each class in the classification problem, which conveys an idea of how the data is generated. This type of model relies heavily on Bayes’ rule to combine the prior and the likelihood into the posterior. Some well-known examples are Naive Bayes, Gaussian Discriminant Analysis, …

There are quite a few reasons why discriminative models are more popular among machine learning practitioners: they are more flexible, more robust, and less sensitive to incorrect modeling assumptions. Generative models, on the other hand, require us to specify distributional assumptions, which can be quite challenging in many situations. However, this is also their advantage: they carry more “information” about the data than discriminative models, and can thus perform quite well with limited data when the assumptions are correct.

In this article, I will demonstrate the point above by proving that Gaussian Discriminant Analysis (GDA) leads to Logistic Regression, and thus that Logistic Regression is more “general”.

For binary classification, GDA assumes that the prior follows a Bernoulli distribution and that the class-conditional likelihoods follow multivariate Gaussian distributions with a shared covariance matrix:

$y \sim \mathrm{Bernoulli}(\phi)$

$x \mid y=0 \sim \mathcal{N}(\mu_0, \Sigma)$

$x \mid y=1 \sim \mathcal{N}(\mu_1, \Sigma)$

Let’s write down their mathematical formulas:

$p(y) = \phi^{y} \times (1-\phi)^{1-y}$

$p(x \mid y=0) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \times \exp\left(-\frac{1}{2}(x - \mu_0)^T \Sigma^{-1} (x - \mu_0)\right)$

$p(x \mid y=1) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \times \exp\left(-\frac{1}{2}(x - \mu_1)^T \Sigma^{-1} (x - \mu_1)\right)$

As mentioned above, the discriminative model (here, logistic regression) tries to find $p(y \mid x)$, so what we want to prove is that:

$p(y=1 \mid x) = \frac{1}{1 + \exp(-\theta^T x)}$

which is the sigmoid function of logistic regression, where $\theta$ is some function of $\phi, \mu_0, \mu_1$ and $\Sigma$.

OK, let’s roll up our sleeves and do some math:

$p(y=1 \mid x)$

$=\frac{p(x \mid y=1) \times p(y=1)}{p(x)}$

$= \frac{p(x \mid y=1) \times p(y=1)}{p(x \mid y=1) \times p(y=1) +p(x \mid y=0) \times p(y=0)}$

$= \frac{1}{1 + \frac{p(x \mid y=0) \times p(y=0)}{p(x \mid y=1) \times p(y=1)}}$

This looks very much like what we are looking for. Let’s take a closer look at the fraction $\frac{p(x \mid y=0) \times p(y=0)}{p(x \mid y=1) \times p(y=1)}$:

$\frac{p(x \mid y=0) \times p(y=0)}{p(x \mid y=1) \times p(y=1)}$

$= \exp\left(-\frac{1}{2}(x - \mu_0)^T \Sigma^{-1} (x - \mu_0) + \frac{1}{2}(x - \mu_1)^T \Sigma^{-1} (x - \mu_1)\right) \times \frac{1 - \phi}{\phi}$

$= \exp\left(\frac{1}{2}\left[(x - \mu_1)^T \Sigma^{-1} (x - \mu_1) - (x - \mu_0)^T \Sigma^{-1} (x - \mu_0)\right] + \log\frac{1-\phi}{\phi}\right)$

Expanding both quadratic forms, the $x^T \Sigma^{-1} x$ terms cancel:

$= \exp\left((\mu_0 - \mu_1)^T \Sigma^{-1} x + \frac{1}{2}\mu_1^T \Sigma^{-1} \mu_1 - \frac{1}{2}\mu_0^T \Sigma^{-1} \mu_0 + \log\frac{1-\phi}{\phi}\right)$

$= \exp\left[\left(\log\frac{1-\phi}{\phi} + \frac{1}{2}\mu_1^T \Sigma^{-1} \mu_1 - \frac{1}{2}\mu_0^T \Sigma^{-1} \mu_0\right) \times x_0 + \left(\Sigma^{-1}(\mu_0 - \mu_1)\right)^T x\right]$

In the last equation, we add a constant coordinate $x_0 = 1$ so that the whole exponent can be written as a single linear form in $x$. The former equation then becomes:

$\frac{1}{1 + \frac{p(x \mid y=0) \times p(y=0)}{p(x \mid y=1) \times p(y=1)}} = \frac{1}{1 + \exp\left[\left(\log\frac{1-\phi}{\phi} + \frac{1}{2}\mu_1^T \Sigma^{-1} \mu_1 - \frac{1}{2}\mu_0^T \Sigma^{-1} \mu_0\right) \times x_0 + \left(\Sigma^{-1}(\mu_0 - \mu_1)\right)^T x\right]}$

And there it is: we have just proved that the posterior of a Gaussian Discriminant Analysis is indeed a logistic regression. Matching the exponent above with $-\theta^T x$ (note the minus sign in the sigmoid), our vector $\theta$ is:

$\theta = \begin{bmatrix} \log\frac{\phi}{1-\phi} + \frac{1}{2}\mu_0^T \Sigma^{-1} \mu_0 - \frac{1}{2}\mu_1^T \Sigma^{-1} \mu_1 \\ \Sigma^{-1}(\mu_1 - \mu_0) \end{bmatrix}$

The converse is not true, though: $p(y \mid x)$ being a logistic function does not imply that $p(x \mid y)$ is multivariate Gaussian. This observation shows that GDA makes a much stronger assumption than logistic regression. In fact, we can go one step further and prove that if $p(x \mid y)$ belongs to any member of the exponential family (Gaussian, Poisson, …), the posterior is a logistic regression. We now see one reason why logistic regression is so widely used: it is a very general, robust algorithm that works under many underlying assumptions. GDA (and generative models in general), on the other hand, makes much stronger assumptions, and is thus not ideal for non-Gaussian or some-crazy-undefined-distribution data.
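As a sanity check, the equivalence can be verified numerically. The sketch below (toy parameters, NumPy only) computes the posterior $p(y=1 \mid x)$ twice: directly via Bayes’ rule on the Gaussian densities, and via the sigmoid with $\theta$ read off from the derivation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy GDA parameters (made up for illustration)
phi = 0.4
mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

# Logistic-regression parameters implied by the derivation
w = Sigma_inv @ (mu1 - mu0)
b = np.log(phi / (1 - phi)) + 0.5 * (mu0 @ Sigma_inv @ mu0 - mu1 @ Sigma_inv @ mu1)

def posterior_bayes(x):
    """p(y=1 | x) computed directly from Bayes' rule."""
    def density(x, mu):               # the shared normalizer cancels in the ratio
        d = x - mu
        return np.exp(-0.5 * d @ Sigma_inv @ d)
    num = density(x, mu1) * phi
    return num / (num + density(x, mu0) * (1 - phi))

def posterior_sigmoid(x):
    """p(y=1 | x) computed from the derived sigmoid form."""
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

x = rng.normal(size=2)
print(np.isclose(posterior_bayes(x), posterior_sigmoid(x)))  # → True
```

The two computations agree at every point, which is exactly what the derivation promised.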

(The limitation of GDA, and of generative models more broadly, can be addressed with a class of Bayesian machine learning methods that use Markov Chain Monte Carlo to sample from the posterior distribution. This is a very exciting approach that I’m really into, so I will save it for a future post.)

# Stein’s paradox or the power of shrinkage

Today we will talk about a very interesting paradox: Stein’s paradox. The first time I heard about it, my mind was completely blown. So here we go:

Let’s play a small game: suppose there is an arbitrary distribution about which we have no information, except that it is symmetric. Each round, we are given one sample from that distribution. The rule is simple: each of us guesses, based on that sample, what the distribution’s mean is, and whoever guesses closer to the true mean gets one point. The game has many rounds, and whoever wins more rounds is the final winner.

The first time I heard about this game, I had no idea what was going on. The rule is dead simple, and the game seems completely random: there is no information whatsoever for finding the true mean. The only viable choice seems to be to take the sample itself as our guess.

However, it turns out that there is a better strategy for winning the game in the long run. And I warn you, it will sound totally ridiculous.

OK, are you ready? The strategy is to take an arbitrary point, yes, any point you like, and “pull” the sample value slightly toward it. This new value is our guess.

So if you look at the image below, you can see that by pulling the sample toward the arbitrary point, our new guess lands closer to the mean, and thus I win!

But… but… you will tell me that had I chosen the arbitrary point to the right of the given sample, I would have lost! That’s totally correct, haha!

However, let’s take a step back and look at the big picture: with the arbitrary point on the left, my strategy beats the naive approach whenever the given sample falls to the right of the true mean (the yellow zone in the image below).

Still not enough to win the game in the long run, you say? Brace yourself, here comes the magical part: I also win when the sample point falls to the left of the arbitrary point, because in that case the sample gets pulled to the right, toward the true mean. So in the long run, with my ridiculous strategy, I will win more rounds than you!

This paradox shows us the power of shrinkage: even when we shrink our estimate toward an arbitrary, completely random value, we still get a better estimator in the long run. That is why shrinkage methods are widely used in machine learning. It is just that magical!
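If you don’t believe it, the game is easy to simulate. A minimal sketch (all numbers arbitrary): each round draws a hidden mean, hands us one sample with symmetric noise, and nudges that sample a small fixed step toward a blindly chosen anchor point:

```python
import numpy as np

rng = np.random.default_rng(7)
rounds, wins, eps = 100_000, 0, 0.05     # eps: size of the small pull (arbitrary)

for _ in range(rounds):
    true_mean = rng.uniform(-10, 10)     # hidden from both players
    sample = true_mean + rng.normal()    # one draw, symmetric noise
    anchor = rng.uniform(-10, 10)        # arbitrary point, chosen blindly
    # Nudge the sample a small fixed step toward the anchor (never past it)
    step = min(eps, abs(anchor - sample))
    shrunk = sample + np.sign(anchor - sample) * step
    if abs(shrunk - true_mean) < abs(sample - true_mean):
        wins += 1

print(wins / rounds)  # reliably above 0.5
```

The pull only hurts when the sample sits between the anchor and the true mean; every other configuration is a win, which is why the win rate settles above one half.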

# [Fr] Recommender system

While cleaning my hard drive, I stumbled upon my very first Data Science project: a recommender system for a website built with Collaborative Filtering. It was fascinating and frustrating at the same time, as I struggled for several weeks trying to wrap my head around techniques like Stochastic Gradient Descent and Alternating Least Squares. Looking back, I have come quite a long way in one year. However, many exciting things are still waiting for me ahead, and I can’t wait to face my next challenge!

Recommender system: article. Enjoy!

# [Fr] Activity recognition system using sensor data

GitHub link: Human activity recognition

## 1. Introduction:

The goal of our project is to build an intelligent recommendation system for the inhabitants of a house. The system analyzes and interprets data from sensors placed in the house and combines them with additional information from outside (the weather, …) in order to offer the user suggestions for improving their daily life.

The system consists of two parts:

1. “Rule-based system”:

This is a “classical” recommendation system based on predefined rules of the form “If…, then…, else…”.
Example: if the outside temperature is above 25°C, it is not raining, and the windows are closed, then the system asks the user whether they want to open them.

As one can easily see, this system can quickly become very heavy to implement, all the more so because it is impossible to list every scenario of daily life. Nevertheless, it remains worth incorporating for specific situations where protocols or processes are fixed (fire, …).

2. “Machine learning system”:

The core of the program. We use machine learning algorithms to understand and learn the user’s behavior and to help them in their daily life. The system sends two types of messages:

—  “Prompting system”
Aimed mainly at elderly people or people with memory problems, the system learns their daily activities and habits and sends them a message (SMS, …) as a reminder in case they have forgotten an important task. The biggest technical constraints here are recognizing the activities and classifying tasks into the two categories “important” and “not important”. The second constraint matters because if the system sends the user a message every five minutes, it will place a cognitive overload on the user.

—  “Recommendation system”
By analyzing the data from the house and combining it with outside information, the system can give the user advice to improve their life. It can also detect anomalies in the house (burglary, fire, heart attack, …).

To build such a recommendation system, the crucial step is to develop an activity recognition system based on the sensor data. In the next section, we discuss the three approaches we used for recognition: the hidden Markov model, the k-nearest-neighbors algorithm, and the artificial neural network.

## 2. Activity recognition:

### 2.1 Dataset:

Our project is inspired by the article “Accurate Activity Recognition in a Home Setting” by Tim van Kasteren, Athanasios Noulas, Gwenn Englebienne and Ben Krose. We used their datasets to test the algorithms.

It comprises 14 sensors collecting data on 8 activities over 28 days, which yields 2120 sensor events and 245 corresponding activity events. Each sensor event has 4 fields: start date, end date, activity ID, and a variable indicating the sensor’s activation.

To avoid biases and false positives, we apply cross-validation to test and validate the models. The dataset was split into three parts: training set, validation set, and testing set. We use the training set to fit the model, then the validation set to calibrate and tune its parameters. Finally, we evaluate the model on the testing set to measure the error rate.

### 2.2 Hidden Markov model:

Modeling the activities with a hidden Markov model is the most classical approach. Hidden Markov models are widely used to represent sequences of observations (in our case, the sensor data). With these models, we can construct the activity sequence that best matches the observed data.

There are two main steps:

—  Estimate the parameters of the hidden Markov model by computing the frequencies of observations, transitions, …

—  Use the Viterbi algorithm to find the most probable sequence of activities.

We obtained an error rate of 30%, which is not too far from the results in the article (21%). However, the problem with this approach is that the model is not stable. Looking at the figures above, we can see that the results vary a lot between days. Moreover, the confusion matrix shows a significant false positive rate tied to state 0 (nothing). The problem comes from the dataset, or more precisely from the way the data were collected: recall that the sensors were active around the clock for 28 days, so most of the time they were in the “idle” state, with no activity. As a consequence, our dataset contains a very large proportion of state 0 (nothing), which biases the computations toward that state.
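The Viterbi decoding step described above can be sketched in a few lines. The numbers below are made up for illustration (a toy 2-state, 3-observation HMM), not the parameters estimated from our dataset:

```python
import numpy as np

# Toy HMM: 2 hidden activities, 3 possible sensor observations
# (probabilities invented for illustration)
pi = np.array([0.6, 0.4])            # initial state distribution
A = np.array([[0.7, 0.3],            # transition probabilities
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],       # emission probabilities: state x observation
              [0.1, 0.3, 0.6]])

def viterbi(obs):
    """Most probable hidden-state sequence for a list of observation indices."""
    T, n = len(obs), len(pi)
    delta = np.log(pi) + np.log(B[:, obs[0]])   # log-prob of the best path so far
    back = np.zeros((T, n), dtype=int)          # best predecessor of each state
    for t in range(1, T):
        scores = delta[:, None] + np.log(A)     # scores[i, j]: best path ending i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta.argmax())]                # backtrack from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

print(viterbi([0, 0, 2, 2]))  # → [0, 0, 1, 1]
```

Working in log-probabilities avoids numerical underflow on long sensor sequences, which matters with 28 days of events.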

### 2.3 k-Nearest Neighbors:

kNN is the simplest classification algorithm. We used kNN with the Mahalanobis distance and obtained an error rate of 20%. The most important improvement over the hidden Markov model, however, is the distribution of the errors. There is still a bias toward state 0, but the situation is much less severe, with only two activities frequently misclassified (go to bed and prepare dinner).
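A minimal sketch of kNN with the Mahalanobis distance (NumPy only; the toy data below is illustrative, not the sensor dataset):

```python
import numpy as np

def mahalanobis_knn_predict(X_train, y_train, x, k=3):
    """Majority vote among the k nearest neighbors of x, with distances
    measured in the Mahalanobis metric estimated from the training data."""
    VI = np.linalg.inv(np.cov(X_train, rowvar=False))   # inverse covariance
    d = X_train - x
    dists = np.einsum('ij,jk,ik->i', d, VI, d)          # squared Mahalanobis distances
    nearest = y_train[np.argsort(dists)[:k]]
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[counts.argmax()]

# Two toy clusters standing in for two activity classes
X = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.], [5., 6.], [6., 5.]])
y = np.array([0, 0, 0, 1, 1, 1])
print(mahalanobis_knn_predict(X, y, np.array([0.5, 0.5])))  # → 0
```

Unlike the plain Euclidean distance, the Mahalanobis metric rescales each direction by the data’s covariance, so correlated or differently scaled sensor features are compared fairly.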

### 2.4 Artificial Neural Network:

Nowadays, artificial neural networks appear everywhere in the machine learning world, so we tried one in our project. We implemented a simple feed-forward network trained by gradient backpropagation, with a sigmoid activation and one hidden layer. The number of neurons in the three layers (input, hidden, output) is respectively 14 (the number of sensors), 10, and 8 (the number of activity types). The results so far are not promising. We have already tested our implementation on another dataset and obtained good results, which suggests the implementation itself is not at fault. We think the problem lies in two points:

— The choice of hyperparameters: the number of neurons and the learning rate of the gradient algorithm. As this is the first time we have used a neural network, we are not sure how to tune these parameters well.

— The dataset is not large enough for a neural network.
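For reference, here is a minimal one-hidden-layer sigmoid network trained by gradient backpropagation, the same family of model as the 14-10-8 network above, but on toy XOR data rather than the sensor dataset (all sizes and the learning rate are hand-picked for the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy XOR problem instead of the sensor data
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = np.array([[0.], [1.], [1.], [0.]])

# One hidden layer of 8 sigmoid units (the project used 14-10-8)
W1 = rng.normal(0, 1, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 1, (8, 1)); b2 = np.zeros(1)
lr = 1.0                                    # learning rate

for _ in range(5000):
    h = sigmoid(X @ W1 + b1)                # forward pass: hidden layer
    out = sigmoid(h @ W2 + b2)              # forward pass: output
    d_out = (out - Y) * out * (1 - out)     # backprop: squared-error gradient
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0)

print(np.round(out.ravel()))                # should approach [0, 1, 1, 0]
```

Even on this four-point problem, convergence depends on the learning rate and the initialization, which illustrates the hyperparameter sensitivity mentioned above.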

### 2.5 Improvements:

For now we have an error rate of 20-30%, which is acceptable.

One idea to improve the hidden Markov model’s results is to use two models: one for the morning and one for the evening. This suggestion comes from the fact that our algorithm depends heavily on the order of the activities, which is at once the strength and the weakness of the HMM. The morning and evening routines are very different, and by separating them we would get two models with less noise and more precision.

For kNN, we can use two models as well. The advantage of the hidden Markov model is that it captures the order of the activities; we lose this information with kNN. We can, however, reintroduce it with a second model that estimates the probability that activity X is followed by activity Y.

# [Fr] Speech/Music classification system

Humans can easily discriminate speech signals from music signals by listening to a short excerpt of the audio. However, to make this processing independent of human intervention, analyzing these different types of signals is fundamental in order to extract features that allow them to be classified.

This topic has many applications, including the automatic transcription of radio broadcasts. Once implemented, such discrimination can, for example, switch the radio frequency when advertisements (speech streams) are detected, or disable a speech recognition system when a music signal is detected.

In this project, we focus on discriminating audio signals into two classes: speech or music.

This processing requires extracting, evaluating, and selecting acoustic descriptors before applying a classification method.

First, a speech signal and a music signal are analyzed in the time and frequency domains, and various acoustic descriptors are studied.

Then the two or three most relevant of these descriptors are selected to discriminate the two types of signals. Using a training dataset, a set of test signals provided by the supervisor are classified as speech or music using classification algorithms such as k-Nearest Neighbors or linear discriminant analysis.
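Two classic descriptors for this task can be sketched in a few lines of NumPy. The zero-crossing rate and the spectral centroid below are illustrative choices (the project’s own descriptor selection may differ), and the frame size and sampling rate are arbitrary:

```python
import numpy as np

def descriptors(signal, sr=16000, frame=512):
    """Per-frame zero-crossing rate and spectral centroid."""
    frames = signal[: len(signal) // frame * frame].reshape(-1, frame)
    # Zero-crossing rate: fraction of consecutive samples that change sign
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    # Spectral centroid: magnitude-weighted mean frequency of each frame
    spec = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(frame, d=1.0 / sr)
    centroid = (spec * freqs).sum(axis=1) / (spec.sum(axis=1) + 1e-12)
    return zcr, centroid

# Synthetic sanity check: white noise vs a 440 Hz tone
rng = np.random.default_rng(0)
t = np.arange(16000)
noise = rng.normal(size=16000)
tone = np.sin(2 * np.pi * 440 * t / 16000)
zcr_noise, _ = descriptors(noise)
zcr_tone, _ = descriptors(tone)
print(zcr_noise.mean() > zcr_tone.mean())  # → True: noise crosses zero far more often
```

Feature vectors like these, computed per frame, are exactly what a kNN or linear-discriminant classifier would consume downstream.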

Read the article here: Music Speech Classification

GitHub link: Music Speech Classification Project