Today we will talk about a very interesting paradox: the Stein’s paradox. The first time I heard about it, my mind was completely blown. So here we go:

Let’s play a small game between us: supposing that there is an arbitrary distribution that we have no information about, except that it is symmetric. Now we are given a sample of that distribution. The rule is simple, for each round, each one of us will, based on the given sample, guess what the distribution’s mean is, and whoever has the estimated point closer to the true mean get 1 point. The game has many rounds, and who wins more rounds will be the final winner.

The first time I heard about the game, I have no idea what is going on. The rule is dead simple, and it seems completely random, there is no information whatsoever to find the true mean. The only viable choice is to take the sample as our guess.

However, it turns out that there is a better strategy to win the game in a long run. And I warn you, it will sound totally ridiculous.

Ok you’re ready ? The strategy is to take an arbitrary point, yes, any point that you like, and “pull” the the sample value toward it. The new value will be our new guess.

So if you look at the image below, you can see that by pulling the given point, our new guess is closer to the mean, and thus I win!

But..but…you will tell me that had I chosen the arbitrary point on the right of the given point, I would have lost! That’s totally correct ahah!

However, let’s take a step back and look at the big picture: given the position of the arbitrary point, my strategy will beat the naive approach if the given sample is on the right of the true mean (yellow zone in the image bellow).

But that is still not enough to win the game in the long run you say ? Brace yourself, here comes the magical part: I will win too if the sample point is on the left of the arbitrary point, because in that situation, the sample will be pulled toward its right, and is thus closer to the true mean. So in long run, with my ridiculous strategy, I will win more time than you!

This paradox shows us the power of shrinkage: even if we shrink our “estimation” with an arbitrary, completely random value, we will still have a better estimation in the long run. That’s why shrinkage method is widely used in machine learning. It is just that magical!