On seeking truth and what we can learn from science

In an era where fake news is the new news, everybody seems to be constantly seeking (their version of) “truth”, or “fact”. It is quite ironic that, while we live in a world where information is more accessible than ever, we are less “informed” than ever. But today, let’s take a break from all the chaos of real life and see what we can learn about “truth” from the scientific world.

3 years ago, I spent my summer interning in a social psychology laboratory, where I was exposed, for the very first time, to statistics and the scientific method. That was also when my naive image of “science” shattered. It was not as noble as I had thought but, like real life, messy and complicated. My colleagues showed me the dark corners of psychology research, where people are willing to do whatever it takes to get a perfect p-value and have their paper published (for example, the scandal of the Harvard professor of psychology Marc Hauser). If you work in the scientific world, you are probably not surprised: social science, and psychology in particular, is plagued with misconduct and fraud. Reproducibility is a huge problem, as shown by the Reproducibility Project, in which 100 psychological findings were subjected to replication attempts. The results were less than a ringing endorsement of research in the field: of the expected 89 replications, only 37 were obtained, and the average size of the effects fell dramatically. At the end of the internship, I wrote an article titled “Is psychology a science?”, where I stated in my conclusion: “Psychology remains a young field in search of a solid theoretical base, but that cannot justify the lack of rigor in its research methods. Science is about truth and not about being able to publish.”

“Science is about truth”.

Is it?

This seemingly evident statement came back to haunt me 3 years later, when I came across this quote by Neil deGrasse Tyson, a scientist I respect a lot:

The good thing about science is that it’s true whether or not you believe in it.

That was his answer to the question “What do you think about the people who don’t believe in evolution?”.

If we put it back in its context, where he had already commented on the scientific method, the phrase becomes less troubling. However, I still find it extreme and even misleading for laypeople, just like my own statement 3 years earlier.

For me, science is about skepticism. It is more about not being wrong than about being true. A scientific theory never aspires to be “final”; it will always be subject to additional testing.

To better understand this statement, we need to go back to the root of the scientific method: statistics.

Let’s take an example from clinical trials: suppose you want to test a new drug. You find a group of patients, give half of them the new drug and the other half a placebo. You then measure the effect in each group and use a hypothesis test to check whether the difference between the two groups is significant.

This is where the famously misunderstood p-value comes into play. It is the probability, under the assumption that there is no true effect or no difference, of collecting data that shows a difference equal to or greater than what we observed. For many people (including researchers), this definition is very counter-intuitive because it does not do what they expect: the p-value is not a measure of effect size, it does not tell you how right you are or how big the difference is, it just expresses a level of skepticism. A small p-value simply says that we would be quite surprised to see such a difference between the two groups if there were truly no effect. It is only a probability, so if someone tries a lot of hypotheses on the same data, they will eventually get something significant (this is known as the problem of multiple comparisons, or p-hacking).

If you torture the data long enough, it will confess. – Ronald Coase
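To see the multiple-comparisons problem in action, here is a minimal simulation (my own sketch, with arbitrary numbers): we run 100 t-tests on data where there is truly nothing to find, and count how many come out “significant” at the 5% level.

```python
# A minimal sketch of the multiple-comparisons problem with purely random data:
# run 100 tests where nothing is going on and count the "significant" ones.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests, n = 100, 50
false_positives = 0

for _ in range(n_tests):
    group_a = rng.normal(0, 1, size=n)   # both groups come from the same distribution
    group_b = rng.normal(0, 1, size=n)
    if stats.ttest_ind(group_a, group_b).pvalue < 0.05:
        false_positives += 1

print(false_positives)  # around 5 of the 100 null comparisons come out "significant" by chance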

The whole world of research is driven by this metric. For many journals, a p-value below 5% is the first criterion for a paper to be reviewed. However, things are more complicated than that. As I mentioned earlier, the p-value is about statistical significance, not practical significance. If a researcher collects enough data, he will eventually be able to lower the p-value and “discover” something, even if the effect is so tiny that it makes no difference in real life. This is where we need to discuss the effect size and, more importantly, the power of a hypothesis test. The former, as its name suggests, is the size of the difference that we are measuring. The latter is the probability that the test will yield a statistically significant outcome when a true effect of a given size exists; it depends on the effect size and the sample size. If we have a large sample and want to measure a reasonable effect size, the power of the test will be high. Conversely, if we don’t have enough data but aim for a small effect size, the power will be low, which is quite logical: we can’t detect a subtle difference if we don’t have enough data. We can’t just toss a coin 10 times and conclude that, because there are 6 heads, the coin must be biased.
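Here is a quick numerical illustration of that coin example (a sketch; the 60% bias and the sample sizes are arbitrary choices of mine): 6 heads out of 10 is not surprising at all under a fair coin, and the power to detect a moderately biased coin only becomes decent once the sample gets large.

```python
# A quick illustration of the coin example (the 60% bias and the sample sizes
# are arbitrary choices).
import numpy as np
from scipy import stats

# 6 heads out of 10 flips: is that evidence of bias?
print(stats.binomtest(6, n=10, p=0.5).pvalue)  # ~0.75: not surprising at all for a fair coin

# Power: how often does a 5%-level test detect a coin that lands heads 60% of the time?
rng = np.random.default_rng(0)
for n in (10, 100, 1000):
    heads = rng.binomial(n, 0.6, size=2_000)
    pvals = np.array([stats.binomtest(int(h), n=n, p=0.5).pvalue for h in heads])
    print(n, (pvals < 0.05).mean())  # the power grows with the sample size
```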

In fields where experiments are costly (social sciences, pharmaceutical research, …), small sample sizes lead to a big problem of truth inflation (or type M error). This happens when the hypothesis test has weak power: it cannot reliably detect the true difference, so the only estimates that cross the significance threshold are the ones that happen to be exaggerated.

[Figure: if we run the same experiment many times, we get the distribution of the measured differences; the red area on the right marks the measurements large enough to reach statistical significance. Source: Andrew Gelman.]

In Gelman’s example above, the measured effect needs to be about 9 times greater than the actual effect to reach statistical significance.
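A small simulation makes this concrete (my own sketch, with a made-up true effect of 0.2 and 20 subjects per group): when power is low, the few estimates that do reach significance are, on average, far larger than the true effect.

```python
# A minimal simulation of truth inflation (a sketch; the true effect of 0.2 and
# the group size of 20 are made-up numbers).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect, n, n_trials = 0.2, 20, 20_000

estimates, significant = [], []
for _ in range(n_trials):
    treated = rng.normal(true_effect, 1.0, size=n)
    control = rng.normal(0.0, 1.0, size=n)
    estimates.append(treated.mean() - control.mean())
    significant.append(stats.ttest_ind(treated, control).pvalue < 0.05)

estimates, significant = np.array(estimates), np.array(significant)
print(significant.mean())             # low power: only a small fraction of trials reach significance
print(estimates[significant].mean())  # among those, the average estimate is far above the true 0.2
```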

The truth inflation problem turns out to be quite “convenient” for researchers: they get a significant result with a huge effect size! This is also what the journals are looking for: “groundbreaking” results (a large effect size in a field with little prior research). And it is not rare.

All of this is to show you that the scientific methodology is not definitive. It is based on statistics, and statistics is all about uncertainty; sometimes it gets very tricky to do it the right way. But it needs to be done right. Some days it is hard, some days it is nearly impossible, but that’s the way science works.

The real purpose of the scientific method is to make sure nature hasn’t misled you into thinking you know something you actually don’t know. – Robert M. Pirsig

To conclude, I think that science is not solely about truth, but about evaluating observations. This is where we can come back to the real world: in an era where we are drowning in data, we need an equally rigorous approach to processing it: cross-check information from multiple sources, be as skeptical as possible to avoid selection bias, try not to be wrong and, most importantly, be honest with yourself, because at the end of the day, truth is a subjective term.

Gaussian Discriminant Analysis and Logistic Regression

There are many ways to classify machine learning algorithms: supervised/unsupervised, regression/classification, … For myself, I prefer to distinguish between discriminative models and generative models. In this article, I will discuss the relationship between these two families, using Gaussian Discriminant Analysis and Logistic Regression as examples.

Quick review: discriminative methods model p(y \mid x) . In a classification task, these models search for a hyperplane (a decision boundary) separating the different classes. The majority of popular algorithms belong to this family: logistic regression, SVMs, neural networks, … On the other hand, generative methods model p(x \mid y) (and p(y) ). This means they give us a probability distribution for each class in the classification problem, which gives us an idea of how the data is generated. This type of model relies heavily on Bayes’ rule to update the prior and derive the posterior. Some well-known examples are Naive Bayes, Gaussian Discriminant Analysis, …

[Figure: Discriminative vs Generative. Source: evolvingai.org.]

There are good reasons why discriminative models are more popular among machine learning practitioners: they are more flexible, more robust and less sensitive to incorrect modeling assumptions. Generative models, on the other hand, require us to define a prior distribution, which can be quite challenging in many situations. However, this is also their advantage: they carry more “information” about the data than discriminative models, and thus can perform quite well with limited data when the assumptions are correct.

In this article, I will demonstrate the point above by proving that Gaussian Discriminant Analysis (GDA) will eventually lead to Logistic Regression, and thus Logistic Regression is more “general”.

For binary classification, GDA assumes that the prior follows a Bernoulli distribution and the likelihood follows a multivariate Gaussian distribution:

y \sim Bernoulli(\phi)

x \mid y=0 \sim N(\mu_0, \Sigma)

x \mid y=1 \sim N(\mu_1, \Sigma)

Let’s write down their mathematical formula:

p(y) = \phi^{y} \times (1-\phi)^{1-y}

p(x \mid y=0) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \times exp(-\frac{1}{2} (x - \mu_0)^T \Sigma^{-1} (x - \mu_0))

p(x \mid y=1) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \times exp(-\frac{1}{2} (x - \mu_1)^T \Sigma^{-1} (x - \mu_1))

As mentioned above, the discriminative model (here, logistic regression) tries to find p(y \mid x) , so what we want to prove is that:

p(y=1 \mid x) = \frac{1}{1 + exp(-\theta^T x)}

which is the sigmoid function of logistic regression, where \theta is some function of \phi, \mu_0, \mu_1 and \Sigma.

Ok let’s roll up our sleeves and do some maths:

p(y=1 \mid x)

=\frac{p(x \mid y=1) \times p(y=1)}{p(x)}

= \frac{p(x \mid y=1) \times p(y=1)}{p(x \mid y=1) \times p(y=1) +p(x \mid y=0) \times p(y=0)}

= \frac{1}{1 + \frac{p(x \mid y=0) \times p(y=0)}{p(x \mid y=1) \times p(y=1)}}

This expression looks very much like what we are looking for, so let’s take a closer look at the fraction \frac{p(x \mid y=0) \times p(y=0)}{p(x \mid y=1) \times p(y=1)} :

\frac{p(x \mid y=0) \times p(y=0)}{p(x \mid y=1) \times p(y=1)}

= exp(-\frac{1}{2}(x - \mu_0)^T \Sigma^{-1} (x - \mu_0) + \frac{1}{2}(x - \mu_1)^T \Sigma^{-1} (x - \mu_1)) \times \frac{1 - \phi}{\phi}

= exp(\frac{1}{2}[(x - \mu_1)^T \Sigma^{-1} (x - \mu_1) - (x - \mu_0)^T \Sigma^{-1} (x - \mu_0)]) \times exp(\log(\frac{1-\phi}{\phi}))

Expanding the two quadratic forms, the x^T \Sigma^{-1} x terms cancel and we are left with:

= exp((\mu_0 - \mu_1)^T \Sigma^{-1} x + \frac{1}{2}(\mu_1^T \Sigma^{-1} \mu_1 - \mu_0^T \Sigma^{-1} \mu_0) + \log(\frac{1-\phi}{\phi}))

= exp[ (\log(\frac{1-\phi}{\phi}) + \frac{1}{2}(\mu_1^T \Sigma^{-1} \mu_1 - \mu_0^T \Sigma^{-1} \mu_0)) \times x_0 + (\Sigma^{-1}(\mu_0 - \mu_1))^T x ]

In the last step, we append a constant coordinate x_0 = 1 to x so that the whole exponent can be written as -\theta^T x. The earlier equation then becomes:

\frac{1}{1 + \frac{p(x \mid y=0) \times p(y=0)}{p(x \mid y=1) \times p(y=1)}} = \frac{1}{1 + exp(-\theta^T x)}

And there it is: we have just shown that the posterior of a Gaussian Discriminant Analysis is indeed a logistic regression, with the vector \theta given by:

\theta = \begin{bmatrix} \log(\frac{\phi}{1-\phi}) + \frac{1}{2}(\mu_0^T \Sigma^{-1} \mu_0 - \mu_1^T \Sigma^{-1} \mu_1) \\ \Sigma^{-1}(\mu_1 - \mu_0) \end{bmatrix}

The converse is not true though: p(y \mid x) being a logistic function does not imply that p(x \mid y) is multivariate Gaussian. This observation shows that GDA makes a much stronger assumption than logistic regression. In fact, we can go one step further and prove that if p(x \mid y) belongs to any member of the exponential family (Gaussian, Poisson, …), the posterior p(y \mid x) is a logistic function. We now see one reason why logistic regression is so widely used: it is a very general, robust algorithm that works under many underlying assumptions. GDA (and generative models in general), on the other hand, makes much stronger assumptions, and thus is not ideal for non-Gaussian or some-crazy-undefined-distribution data.
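As a sanity check, here is a small numerical sketch (with arbitrary parameters of my own choosing) comparing the GDA posterior computed directly with Bayes’ rule to the sigmoid of \theta^T x derived above; the two agree.

```python
# A quick numerical check of the derivation (the parameters are arbitrary):
# the posterior from Bayes' rule should equal the sigmoid of theta^T x.
import numpy as np
from scipy.stats import multivariate_normal

phi = 0.3
mu0, mu1 = np.array([0.0, 0.0]), np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

# theta derived from the GDA parameters, as in the formula above
theta = Sigma_inv @ (mu1 - mu0)
theta0 = np.log(phi / (1 - phi)) + 0.5 * (mu0 @ Sigma_inv @ mu0 - mu1 @ Sigma_inv @ mu1)

x = np.array([0.7, -1.2])  # an arbitrary test point

# posterior computed directly with Bayes' rule
p1 = multivariate_normal.pdf(x, mu1, Sigma) * phi
p0 = multivariate_normal.pdf(x, mu0, Sigma) * (1 - phi)
posterior_bayes = p1 / (p0 + p1)

# posterior computed with the logistic form
posterior_logistic = 1.0 / (1.0 + np.exp(-(theta0 + theta @ x)))

print(posterior_bayes, posterior_logistic)  # the two numbers agree
```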

(The limitation of GDA, and of generative models in general, can be addressed with a class of Bayesian machine learning methods that use Markov Chain Monte Carlo to sample from the posterior distribution. This is a very exciting topic that I’m really into, so I will save it for a future post.)

Random thought on randomness or why people suck at long-term vision

The Law of Large Numbers is one of the foundational theorems of probability theory. It says that the average of the results obtained from a large number of trials should be close to the expected value, and will tend to get closer as more trials are performed.

[Figure: example of the Law of Large Numbers, tossing a coin.]
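A tiny simulation gives the same picture (the sample sizes below are arbitrary): the running proportion of heads in a sequence of fair coin tosses drifts toward the expected value of 0.5.

```python
# A tiny illustration of the theorem (sample sizes are arbitrary): the running
# proportion of heads in a sequence of fair coin tosses drifts toward 0.5.
import numpy as np

rng = np.random.default_rng(0)
tosses = rng.integers(0, 2, size=100_000)                  # 0 = tails, 1 = heads
running_mean = np.cumsum(tosses) / np.arange(1, tosses.size + 1)

for n in (10, 100, 10_000, 100_000):
    print(n, running_mean[n - 1])  # tends to get closer to 0.5 as n grows
```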

This theorem is very simple and intuitive. And perhaps because it is too intuitive, it becomes counter-intuitive. Why? Let’s talk about the Gambler’s Fallacy: in an event with binary outcomes, if there has been a long run of one outcome, an observer might reason that, because the two outcomes are destined to come out in a given ratio over a lengthy set of trials, the outcome that has not appeared for a while is temporarily advantaged. Upon seeing six straight occurrences of black from spins of a roulette wheel, a gambler suffering from this illusion would confidently bet on red for the next spin.

Why is it fallacious to think that sequences will self-correct for temporary departures from the expected ratio of the respective outcomes? Ignoring for a moment the statistically correct answer that each spin is independent of the others, and imagining that the gambler’s illusion is real, we can still point out many problems with that logic. For example, how long would this effect last? If we take the roulette ball and hide it for 10 years, how will it know, when unearthed, that it should have a preference for red? Obviously, the gambler’s fallacy can’t be right.

So why can’t the Law of Large Numbers be applied in the case of the Gambler’s Fallacy?

Short answer: Statistically speaking, humans are shortsighted creatures.

Long answer: people generally fail to appreciate that occasional long runs of one or the other outcome are a natural feature of random sequences. If you don’t buy it, let’s play a small game: take out a small piece of paper and write down a random-looking sequence of binary digits (0s and 1s). Once you are done, count the length of the longest run of either value. You will notice that this number is quite small. It has been demonstrated that we tend to avoid long runs: the sequences we write usually alternate back and forth too quickly between the two outcomes. This appears to be because people expect random outcomes to be representative of the process that generates them, so if the trial-by-trial expectations for the two outcomes are 50/50, we try to make the series come out almost evenly divided. People generally assume too much local regularity in their concepts of chance; in other words, people are lousy random number generators.
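If you don’t feel like playing with paper, here is a small sketch of the same game (the sequence length of 100 and the number of repetitions are my own choices): in a truly random binary sequence, long runs are the norm, not the exception.

```python
# A small check of the claim above: how long is the longest run of identical
# values in a truly random binary sequence of length 100?
import numpy as np

rng = np.random.default_rng(0)

def longest_run(bits):
    """Length of the longest stretch of consecutive identical values."""
    best, current = 1, 1
    for prev, cur in zip(bits[:-1], bits[1:]):
        current = current + 1 if cur == prev else 1
        best = max(best, current)
    return best

runs = [longest_run(rng.integers(0, 2, size=100)) for _ in range(10_000)]
print(np.mean(runs))  # typically around 6 or 7, longer than most hand-written "random" sequences
```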

So there you are: we can see that humans are, by nature, statistically detail-oriented. We don’t usually consider the big picture, but recognize only a few “remarkable” details, which then shape our view of the world. When we meet a new person, the observation of a few isolated behaviors leads directly to judgments of stable personal characteristics such as friendliness or introversion. The behavior of another person is not perceived as a potentially variable sample over time, but as a direct indicator of stable traits. This problem is usually described as “the law of small numbers”, which refers to the tendency to impute too much stability to small-sample results.

Obviously, knowing about this won’t change our nature, but at least once we acknowledge our bias, we can be more mindful of the situation and of our decisions.

Stein’s paradox or the power of shrinkage

Today we will talk about a very interesting paradox: Stein’s paradox. The first time I heard about it, my mind was completely blown. So here we go:

Let’s play a small game between us: suppose there is an arbitrary distribution about which we have no information, except that it is symmetric. In each round we are given one sample drawn from that distribution. The rule is simple: based on that sample, each of us guesses where the distribution’s mean is, and whoever is closer to the true mean gets 1 point. The game has many rounds, and whoever wins more rounds is the final winner.


The first time I heard about the game, I had no idea what was going on. The rules are dead simple, and everything seems completely random; there is no information whatsoever to locate the true mean. The only viable choice seems to be to take the sample itself as our guess.

However, it turns out that there is a better strategy to win the game in the long run. And I warn you, it will sound totally ridiculous.

OK, are you ready? The strategy is to take an arbitrary point, yes, any point that you like, and “pull” the sample value toward it. This new value will be our guess.


Suppose, for example, that my arbitrary point lies to the left of the true mean, and that the given sample falls to the right of the true mean: by pulling the sample toward my point, my new guess ends up closer to the mean, and thus I win!


But... but... you will tell me that had I chosen my arbitrary point on the other side, so that the pull pushed the sample away from the mean, I would have lost! That’s totally correct, haha!

However, let’s take a step back and look at the big picture: with my arbitrary point sitting to the left of the true mean, my strategy beats the naive approach whenever the given sample falls to the right of the true mean, which, by symmetry, already happens in half of the rounds.


But that is still not enough to win the game in the long run, you say? Brace yourself, here comes the magical part: I also win when the sample falls to the left of my arbitrary point, because in that situation the sample is pulled to the right, toward the true mean. So, in the long run, with my ridiculous strategy, I win more often than you!
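Don’t take my word for it; here is a minimal simulation of the game (the true mean, the arbitrary point and the pull size are all values I picked for illustration): the pulled guess wins well over half of the rounds.

```python
# A minimal simulation of the game (the true mean, the arbitrary point and the
# pull size are all values picked for illustration).
import numpy as np

rng = np.random.default_rng(0)
mu = 0.0       # the true mean, hidden from both players
c = -0.5       # the arbitrary point I pull toward
delta = 0.05   # how far I pull each sample toward c

samples = rng.normal(mu, 1.0, size=100_000)
naive_guess = samples                                   # your strategy: guess the sample itself
pulled_guess = samples + delta * np.sign(c - samples)   # my strategy: nudge the sample toward c

wins = np.abs(pulled_guess - mu) < np.abs(naive_guess - mu)
print(wins.mean())  # comfortably above 0.5 for these settings
```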


This paradox shows us the power of shrinkage: even if we shrink our estimate toward an arbitrary, completely random point, we still get a better estimator in the long run. That’s why shrinkage methods are widely used in machine learning. It is just that magical!

Multicollinearity in Regression Model

Multicollinearity occurs when two or more predictors in the model are correlated and provide redundant information about the response. This is a common situation in real-life regression problems and, depending on the goal of the user, it can be more or less troublesome.

If the goal is simply to predict Y from a set of X variables, then multicollinearity is not a problem. The predictions will still be accurate, and the overall R^2 (or adjusted R^2  ) quantifies how well the model predicts the Y values. If the goal is to understand how the various X variables impact Y, then multicollinearity is a big problem.

We begin with the linear regression model. The most basic one uses the Ordinary Least Squares (OLS) technique to estimate the regression coefficients. The solution of an OLS linear regression is given by:

\hat{\beta} = (X^TX)^{-1} X^T y

Note here that if X is not full rank (i.e. the predictors are not linearly independent of each other), X^TX is not invertible, and thus there is no unique solution for \hat{\beta} . This is where all the trouble begins. One problem is that the individual p-values can be misleading (a p-value can be high even though the variable is important). The second problem is that the confidence intervals on the regression coefficients will be very wide. They may even include zero, which means we can’t even be confident whether an increase in the X value is associated with an increase or a decrease in Y. Because the confidence intervals are so wide, excluding a subject (or adding a new one) can change the coefficients dramatically and may even change their signs.

The unstable p-values can lead to some misleading and confusing results. For example, with two correlated predictors we might stumble upon an extreme situation in which the F-test confirms that the model is useful for predicting y, yet the individual coefficient t-tests are non-significant, suggesting that neither of the two predictors is significantly associated with y! The explanation is quite simple. If \beta_1 and \beta_2 are the two estimated coefficients, \beta_1 is the expected change in y due to x_1 given that x_2 is already in the model, and vice versa for \beta_2 . Since x_1 and x_2 carry redundant information about y, once one of them is in the model, the other has little left to contribute. This is why the F-test indicates that at least one of the predictors is important, yet the individual t-tests indicate that the contribution of each predictor, given that the other one has already been included, is not really important. The small simulation below illustrates this mismatch.
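Here is a small sketch of that situation with made-up data (it uses statsmodels; the noise levels and sample size are arbitrary choices of mine):

```python
# A small simulation of the F-test vs t-test mismatch under strong collinearity
# (a sketch with made-up data, using statsmodels).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)    # x2 is nearly identical to x1
y = 2.0 * x1 + rng.normal(size=n)           # y truly depends on x1 (and therefore on x2)

X = sm.add_constant(np.column_stack([x1, x2]))
res = sm.OLS(y, X).fit()

print(res.f_pvalue)     # tiny: the model as a whole clearly predicts y
print(res.pvalues[1:])  # the individual t-test p-values are often large
print(res.bse[1:])      # the standard errors of both coefficients are badly inflated
```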

Statistically speaking, high multicollinearity inflates the standard errors of the predictors’ estimated coefficients (and thus decreases their reliability). Consequently, multicollinearity results in a decline in the t-statistic (because t = \frac{\hat{\beta}}{SE(\hat{\beta})}). This means that the power of the hypothesis test—the probability of correctly detecting a non-zero coefficient—is reduced by collinearity: a predictor with a small coefficient might be “masked”, even if its effect is real.

So we can see that the biggest problem with multicollinearity is that the predictor coefficients have a large variance, and thus it is very hard to interpret their effect.

Many methods have been proposed to overcome this problem. Some people suggest dropping the correlated variables. However, this is a risky approach because:

  1. As the coefficients are not stable, we cannot be sure which variables are the most suitable ones to drop; moreover, removing one variable will cause the coefficients of the other correlated variables to change in an unpredictable way.
  2. If we use step-wise regression for variable selection, we risk overfitting the dataset (in general, step-wise methods are not recommended).

Another direction is to use shrinkage methods, especially ridge regression. The solution to the ridge regression problem is given by:

\hat{\beta} = (X^TX + \lambda I)^{-1} X^T y

At the beginning we said that the OLS estimates do not always exist because X^TX is not always invertible. With ridge regression, the problem is solved: for any design matrix X, the quantity (X^TX + \lambda I) is always invertible (for \lambda > 0); thus, there is always a unique solution \hat{\beta}.

Ridge regression uses the regularization parameter \lambda to penalize large coefficients. Being a biased estimator, it trades some bias for a reduction in variance, and therefore yields more stable estimates.
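A minimal numerical sketch shows the contrast between the two solutions on a nearly collinear design (the data and the value of \lambda are invented for illustration; for simplicity, the intercept is penalized as well):

```python
# A minimal comparison of the OLS and ridge closed-form solutions on a nearly
# collinear design (the data and lambda are invented; the intercept is
# penalized too, just to keep the sketch short).
import numpy as np

rng = np.random.default_rng(2)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)          # x2 is almost a copy of x1
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + 2.0 * x1 + rng.normal(size=n)           # only x1 truly matters

lam = 1.0
XtX = X.T @ X
beta_ols = np.linalg.solve(XtX, X.T @ y)                      # ill-conditioned system
beta_ridge = np.linalg.solve(XtX + lam * np.eye(3), X.T @ y)  # well-posed for any lam > 0

print(np.linalg.cond(XtX))  # a huge condition number: X^T X is close to singular
print(beta_ols)             # the x1 and x2 coefficients are typically erratic, far from (2, 0)
print(beta_ridge)           # shrunk toward each other and much more stable
```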

Chi-squared distribution revisited

Today, while reviewing the regression techniques in the ESL book (The Elements of Statistical Learning; by the way, this book is pure gold, I highly recommend it!), I stumbled upon the chi-squared distribution. Concretely, the authors show that:

(N-p-1) \hat{\sigma}^2 \sim \sigma^2 \chi^2_{N-p-1}

a chi-squared distribution with N-p-1 degrees of freedom, with \hat{\sigma}^2 an unbiased estimate of \sigma^2 .

They use these distributional properties to form hypothesis tests and confidence intervals for the parameters \beta_j .

It has been a very long time since I last saw the chi-squared distribution, and of course my understanding of it has become quite rusty. So I think this is a good chance to revisit this important distribution.

Before talking about the chi-squared distribution, we need to review some notions. First, we have the gamma function:

For \alpha > 0, the gamma function \Gamma(\alpha) is defined by:

\Gamma(\alpha) = \int_0^\infty x^{\alpha-1}e^{-x}dx

and for any positive integer n, we have: \Gamma(n) = (n-1)!.
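A quick numerical check of that identity, using scipy’s gamma function:

```python
# A quick numerical check of the identity Gamma(n) = (n-1)!.
from math import factorial
from scipy.special import gamma

for n in (1, 2, 3, 4, 5, 10):
    print(n, gamma(n), factorial(n - 1))  # the two values agree for every n
```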

Now let:

f(x;\alpha) = \frac{x^{\alpha-1}e^{-x}}{\Gamma(\alpha)} if x \geq 0

and

f(x;\alpha) = 0 otherwise.

Then f(x;\alpha) \geq 0, and the definition of the gamma function implies that:

\int_0^\infty f(x;\alpha)dx = \frac{\Gamma(\alpha)}{\Gamma(\alpha)} = 1

Thus f(x;\alpha) satisfies the two basic properties of a probability density function.

We will now use this function to define the Gamma distribution and then the Chi-squared distribution. 

A continuous random variable X is said to have a Gamma distribution if the pdf of X is:

f(x;\alpha,\beta) = \frac{1}{\beta^\alpha \Gamma(\alpha)} x^{\alpha-1} e^{-x/\beta} for x \geq 0

and

f(x;\alpha,\beta) = 0 otherwise

where the parameters \alpha and \beta are positive. The standard Gamma distribution has \beta = 1, so the pdf of a standard gamma is the f(x;\alpha) given above.

The Gamma distribution is widely used to model the extent of degradation such as corrosion, creep, wear or survival time.

The Gamma distribution is a family of distributions. Both the exponential distribution and the chi-squared distribution are special cases of the Gamma.

As we can see, the gamma distribution takes two arguments. The first (\alpha) defines the shape: with \alpha = 1 the gamma reduces to the exponential distribution, and with \alpha = \nu/2 and \beta = 2 it becomes the chi-squared distribution with \nu degrees of freedom.

Now we will define the chi-squared distribution.

Let \nu be a positive integer. Then a random variable X is said to have a chi-squared distribution with parameter \nu if the pdf of X is the gamma density with \alpha = \nu/2 and \beta = 2. The pdf of a chi-squared rv is thus:

f(x;\nu) = \frac{1}{2^{\nu/2}\Gamma(\nu/2)}x^{\nu/2 - 1}e^{-x/2} for x \geq 0

and

f(x;\nu) = 0 otherwise

The parameter \nu is called the degrees of freedom (df) of X.

The chi-squared distribution is important because it is the basis for a number of procedures in statistical inference. The central role played by the chi-squared distribution in inference springs from its relationship to normal distributions: concretely, the chi-squared distribution with k degrees of freedom is the distribution of a sum of the squares of k independent standard normal random variables. For example, in linear regression the residuals y_i - \hat{y}_i are normally distributed, so their sum of squares (which drives the variance estimate \hat{\sigma}^2 above) follows a scaled chi-squared distribution.
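A quick simulation confirms this relationship (the choice of k = 5 and the sample size are arbitrary): summing the squares of k standard normal draws reproduces the mean, variance and quantiles of a \chi^2_k distribution.

```python
# A quick numerical check of that relationship (k and the sample size are arbitrary).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
k = 5
samples = (rng.normal(size=(100_000, k)) ** 2).sum(axis=1)  # sums of k squared standard normals

print(samples.mean(), samples.var())                           # close to k and 2k, the chi^2_k mean and variance
print(np.quantile(samples, 0.95), stats.chi2.ppf(0.95, df=k))  # empirical and theoretical 95th percentiles agree
```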

The chi-squared distribution is the foundation of chi-squared tests. There are 2 types:

  • The goodness-of-fit test.
  • The test of independence.

Perhaps we will look at these tests in detail in another post.

The prosecutor’s fallacy

Conditional probability is one of the most important concepts in statistics. The whole field of Bayesian statistics is based on it. However, it is also one of the most misunderstood concepts, one that many people misuse in real life. The prosecutor’s fallacy is a good example of the severe consequences that a wrong interpretation of conditional probability can lead to.

We begin with a simple example: we can be almost certain that if it is raining (event A), it will be cloudy (event B). In other words, the probability P(B|A) is nearly 100%. So the question now is: if it is cloudy, will it rain soon? Here we are looking for P(A|B), and we can easily see that this probability is much lower than the first one. Lesson learned: P(B|A) is not the same thing as P(A|B).

That conclusion looks ridiculously simple; any student who has taken STAT101 knows it. However, when placed in a more complicated situation like the one below, conditional probability can easily be misinterpreted.

The prosecutor’s fallacy:

Suppose that police pick up a suspect and match his or her DNA to evidence collected at a crime scene. Suppose that the likelihood of a match, purely by chance, is only 1 in 10,000. Is this also the chance that he is innocent? It’s easy to make this leap, but you shouldn’t.

Here’s why. Suppose the city in which the person lives has 500,000 adult inhabitants. Given the 1 in 10,000 likelihood of a random DNA match, you’d expect that about 50 people in the city would have DNA that also matches the sample. So the suspect is only 1 of 50 people who could have been at the crime scene. Based on the DNA evidence alone, the person is almost certainly innocent, not certainly guilty.

The generic insight is that the probability of the hypothesis given the available evidence is not equal to the probability of the evidence assuming the hypothesis is true.

In Bayesian terms, the likelihood is P(B|A), with B being the event that the person’s DNA matches and A the event that the person is innocent; P(A) is the prior and P(B) is the evidence. What we are looking for is P(A|B), the posterior probability.

To see some real-life examples where this has happened, read about the Sally Clark case in Britain (1998), the O.J. Simpson case (1995) and People v. Collins (1968).

This fallacy is seen in medicine too, especially in drug and disease testing, where many people cannot tell the difference between the probability that the test is positive given that the person is really sick, P(positive|sick), and the probability that the person is sick given that the test is positive, P(sick|positive). For example, we often hear that “the detection rate of the test is 90%”, which means that P(positive|sick) = 90%. However, what we really care about is P(sick|positive), which tells us how much a positive result can be trusted: in real life we don’t know whether the person is sick, and we need to know whether we can believe the test. We can compute it using Bayes’ theorem, and the result is usually much lower than 90%.
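Here is a small worked example of that computation (the sensitivity, specificity and prevalence are invented numbers, chosen only for illustration):

```python
# A small worked example with invented numbers: a test with 90% sensitivity and
# 95% specificity for a disease that affects 1% of the population.
sensitivity = 0.90   # P(positive | sick)
specificity = 0.95   # P(negative | healthy)
prevalence = 0.01    # P(sick)

p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
p_sick_given_positive = sensitivity * prevalence / p_positive
print(p_sick_given_positive)  # ~0.15, far below the 90% many people would guess
```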

Frequentist vs Bayesian: the case of the Confidence Interval

There are 2 kinds of people in the (statistics) world: frequentists and bayesians.

Frequentist statistics is the classical branch; it is what we learn in an elementary statistics course at school. For a frequentist, all parameters are constants, and in order to estimate their values we need to consider not only the data we possess but also hypothetical data, so that we don’t miss any cases. In other words, frequentists focus on repetition, hence their name.

A Bayesian, on the other hand, believes that everything, including the parameters, is a random variable, which means that a parameter is represented by a distribution, not a single value. This distribution is updated continuously as data come in. Only in Bayesian statistics can we write P(H|D) (the probability of a hypothesis given the data), because for a frequentist, a parameter is a constant, and a constant doesn’t have a distribution.

Back to the confidence interval: its definition in frequentist statistics is quite tricky. Many people misinterpret a 95% frequentist CI as meaning that there is a 95% chance that the parameter lies in that interval. Under a frequentist viewpoint, this makes no sense! As we pointed out earlier, for a frequentist a parameter is a constant, so it is either contained in that interval or it isn’t; it is either 0 or 1, with no uncertainty and no variation. So when we say “95% frequentist CI”, the 95% does not refer to the data we actually have; it refers to hypothetical repetition: if we repeat the same procedure over and over again, in 95% of the cases the interval will contain the parameter. It says nothing about the specific problem we are working on. The Bayesian approach, on the other hand, is more focused on the problem at hand, and its interpretation is usually closer to our intuition: a 95% Bayesian interval (a credible interval) means that there is a 95% chance that the parameter falls inside it.
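Here is a small simulation of that frequentist reading (the true mean, the standard deviation and the sample size are arbitrary choices): if we repeat the experiment many times and build a 95% t-interval each time, about 95% of those intervals contain the true mean.

```python
# A small simulation of the frequentist reading of a 95% confidence interval
# (the true mean, spread and sample size are arbitrary choices).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_mean, sigma, n, n_repeats = 10.0, 2.0, 30, 10_000
covered = 0

for _ in range(n_repeats):
    sample = rng.normal(true_mean, sigma, size=n)
    half_width = stats.t.ppf(0.975, df=n - 1) * sample.std(ddof=1) / np.sqrt(n)
    if abs(sample.mean() - true_mean) <= half_width:
        covered += 1

print(covered / n_repeats)  # close to 0.95: this long-run coverage is what the "95%" refers to
```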

Hypothesis testing in Bayesian statistics

In Bayesian statistics, the data D are in favor of the hypothesis H if p(H|D) > p(H), which means that the posterior is greater than the prior. This is true if p(D|H) > p(D|~H), i.e. if p(D|H) / p(D|~H) > 1. The latter is called the likelihood ratio, or the Bayes factor; it measures the strength of the evidence.

We cannot use the data to form the hypotheses, because that process would be very biased: almost any data would then count as evidence in favor of our hypotheses. Therefore, we need to choose the hypotheses before we see the data. We can also have a suite of hypotheses, but in that case we average over the suite.

Speaking of hypotheses (and priors), they are quite subjective. Take for example an experiment with a coin, where we get 140 heads and 110 tails after 250 tosses. The question now is: do these data give evidence that the coin is biased rather than fair? The conclusion, as we will see, depends a lot on our definition of “biased”.

We start with the simplest case. Letting x be the probability of heads (expressed in percent, from 0 to 100), we define the hypotheses:

  • H0: the coin is unbiased; x = 50 with probability 100%.
  • H1: the coin is biased; x is uniformly distributed over all other values.

Here the language can be a little confusing, so let me comment: in this simple case, we consider a coin unbiased only if x is exactly 50; in every other case we consider it biased, and its level of bias is uniformly distributed, which means the odds of having a 90%-biased coin are the same as those of having a 60%-biased one. Under these hypotheses, the likelihood ratio comes out at about 0.47, which favors the unbiased hypothesis H0 (the sketch below reproduces this number).
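Here is a short sketch that reproduces this Bayes factor (assuming, as above, 140 heads and 110 tails, H0: a fair coin, and H1: a uniform prior on the probability of heads):

```python
# A short sketch reproducing the Bayes factor above, assuming H0: x = 50% exactly
# and H1: x uniform over all values.
import numpy as np
from scipy.special import betaln, gammaln

heads, tails = 140, 110
n = heads + tails

log_binom_coef = gammaln(n + 1) - gammaln(heads + 1) - gammaln(tails + 1)

# p(D | H0): binomial likelihood with p = 0.5
log_p_d_h0 = log_binom_coef + n * np.log(0.5)

# p(D | H1): the binomial likelihood averaged over a uniform prior on p,
# which integrates to the Beta function B(heads + 1, tails + 1)
log_p_d_h1 = log_binom_coef + betaln(heads + 1, tails + 1)

print(np.exp(log_p_d_h1 - log_p_d_h0))  # about 0.47, slightly favoring the fair coin
```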

We can see that this is not the way we usually think of a biased coin in reality. Intuitively, we know that it is very difficult, or even impossible, to find an extremely biased coin (>80%), while 60%-biased ones are much more plausible. In other words, x is not uniformly distributed. Therefore, a more logical way to define H1 is to give more prior weight to mild biases than to extreme ones.

With such a prior, the likelihood ratio is about 0.83: still in favor of H0, but already weaker than before.

If we go further and decide that any bias below 25% or above 75% is impossible, the prior becomes concentrated on moderate biases only.

And boom, the likelihood ratio jumps up to 1.12, which now (weakly) favors the biased hypothesis.

To summarize, the Bayes factor depends on the definition of the priors, i.e. of the hypotheses. In our example, the evidence is weak either way (Bayes factors between 0.5 and 1.5).

Normality dilemma – should I test it or not ?

This morning I came across an article on normality testing, a problem I had thought about a lot while doing a project for my statistics class last year. I did an analysis of airplane accidents; more precisely, I compared the numbers of fatalities between two periods, 1994-2001 and 2001-2009 (I was trying to find a difference in safety before and after September 11). The distribution is shown below:

[Figure: density plot of the number of fatalities.]

As we can see, the density plot is completely skewed, and normality is off the table. With a small sample, this would mean that we can’t use some parametric tests (Student’s t-test, …) that rely on the normality assumption. For large samples, thanks to the Central Limit Theorem, we can ignore this condition. So the question that popped into my head was: is normality testing really helpful? Especially with large samples?

The article (Is normality testing ‘essentially useless’?) is a question on CrossValidated. The author quotes a colleague’s argument:

We usually apply normality tests to the results of processes that, under the null, generate random variables that are only asymptotically or nearly normal (with the ‘asymptotically’ part dependent on some quantity which we cannot make large); In the era of cheap memory, big data, and fast processors, normality tests should always reject the null of normal distribution for large (though not insanely large) samples. And so, perversely, normality tests should only be used for small samples, when they presumably have lower power and less control over type I rate.

This is also my thought on normality tests. The answers mostly back up this argument. One answer explained in more detail the true purpose of a normality test:

The question normality tests answer: Is there convincing evidence of any deviation from the Gaussian ideal ? With moderately large real data sets, the answer is almost always yes.

The question scientists often expect the normality test to answer: Do the data deviate enough from the Gaussian ideal to “forbid” use of a test that assumes a Gaussian distribution? Scientists often want the normality test to be the referee that decides when to abandon conventional (ANOVA, etc.) tests and instead analyze transformed data or use a rank-based nonparametric test or a resampling or bootstrap approach. For this purpose, normality tests are not very useful.

So there it is, the ugly truth: if we want to know whether a parametric test that assumes normality can be applied, normality testing is not the way to go. The problem now is: what should we do to answer that question? Some answers suggested a “see and try” approach: inspect the normality of the sample visually. However, in some cases this can be difficult and time-consuming…
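To make the quoted argument concrete, here is a small sketch (the distribution and sample sizes are my own choices): a sample from a t-distribution with 50 degrees of freedom is practically indistinguishable from a Gaussian, yet with 100,000 observations a normality test flags it without hesitation.

```python
# A small sketch of the quoted argument: a t-distribution with 50 degrees of
# freedom is practically Gaussian, yet a normality test rejects it once the
# sample is large enough.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.standard_t(df=50, size=100_000)

print(stats.normaltest(data).pvalue)        # typically tiny: "not normal"
print(stats.normaltest(data[:100]).pvalue)  # on a small subsample, typically no rejection
```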