[latexpage]
An interesting paper in PNAS from a few years back that I missed, “Strong profiling is not mathematically optimal for discovering rare malfeasors” by William H. Press.
Suppose you have a large population of people and a single bad guy who you want to catch. You could look through the entire population to find the bad guy, or you could set up checkpoints (err I mean airline security screening areas) to look for people, sampling only some small fraction of the population that goes through the checkpoint. Now, if you don’t know anything about the population you’re sampling, you might as well just sample them randomly until you find your baddie, since you don’t have any other information that could help you.
But suppose that you are able to come up with some prior probabilities for different people to be the bad guy. I.e. you’ve developed a model for what makes someone more likely to be a bad guy, and further assume that this model is really pretty accurate. To each person you can assign a probability $p_i$ that the person is indeed the bad guy. Now you could continue to sample uniformly, but you have this great model that you want to use to help catch the bad guy. What do you do?
It turns out that the wrong thing to do is to sample in proportion to the probability $p_i$. To figure out the correct strategy, suppose that you sample from the population with probability $q_i$. Then if $k$ is your man, the probability that you get him in one sample is $q_k$. Or, another way to say it is that the mean number of screenings you’ll need to find the baddie is $1/q_k$. Assuming your model is correct, the mean number of people you will have to sample is $sum_i p_i/q_i$. So now to calculate the optimum we need to minimize this expression for $q_i$ subject to the constraint that $sum_i q_i =1$.
To calculate this optimum, you use a Lagrange multiplier
$${partial over partial q_j} left[ sum_i {p_i over q_i} + lambda (sum_i q_i -1)right]=0$$
or
$$ – {p_j over q_j^2} + lambda = 0$$
Which, in order to satisfy our contraints (and also positive probabilities) gives us the answer for the optimum of
$$ q_j = { sqrt{p_j} over sum_i sqrt{p_i}}$$
Or, in other words, you should sample proportional to the square root of the probabilities. Pretty cool, a nice easy, yet surprising answer.
Even more awesome is that we got some square roots of probabilities in there. Quantum probability amplitudes are, of course, like square roots of probabilities. Now if only we could massage this into insight into quantum theory. Do it. Or the terrorist win.
There’s a more direct geometric intuition for this result. There’s a natural map that takes points in the simplex to points on the positive orthant of the sphere (one dimension higher) by taking the square root of each coordinate. Now, the resulting value of the q corresponds to taking the ray from the origin of the sphere through the mapped point for p, and seeing where it cuts the simplex. Essentially, you’re projecting from the sphere back down to the simplex, finding the “nearest” point. This sort of makes sense in the context of information geometry.
Which is not to say that there isn’t a quantum interpretation.
Thanks Suresh. I think the quantum connection is actually to some old work of Bill Wootters, but I can’t seem to find the relevant paper right now.