As part of my plan to get back into blogging, here's a repost from 2011 discussing a classic statistical problem. I followed it up a year later, but will post that next week.
The German tank problem is a fav of mine. The Wikipedia page on it is a little long-winded, but I think it can be looked at a lot faster with a little numerical mucking about.
The problem is quite simple. The enemy are producing tanks, and each has a sequential serial number (for simplicity, let's assume that the numbers are reset every month). You encounter this scene on the battlefield;
and we see that this is tank number, say, 15 of a particular month's production. How many tanks were produced in that month? Can we even answer the question?
This is the problem that faced the Allies in WWII; they really wanted to know how many panzers were out there. Intelligence officers were reporting production of more than 1,000 tanks per month, but based on statistics, the predicted number was significantly fewer than that, in the hundreds. After the war, the numbers were checked against records and the statistical answer was amazingly correct (read the Wikipedia page for more details).
But let's see how we can calculate this. We'll adopt the Bayesian approach (because that's the correct way to do it :). So, let's assume that the number of tanks actually made is a number N, and let's assume that we guess that the maximum number of tanks that could possibly be made is M (we'll insert some real number in here soon).
On the battlefield, we find a tank with a serial number, A. What is your estimate of the number of tanks made in that month (let's call this X)? We want to make a probability distribution, and where this peaks is our best estimate for the number of tanks built.
Clearly, the minimum number of tanks is A (because that's the serial number you have). What about the rest of the probability distribution? If you think about it, if the total number of tanks is X, then the probability of randomly selecting tank A is simply 1/X. So the probability distribution is zero below A, falls off as 1/X from A up to M, and looks like this;
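If you want to play along numerically, here is a minimal sketch of that single-tank curve. It assumes a flat prior over 1..M (my reading of "the maximum we think they could produce"), and the function name and the values 217 and 1000 are just illustrative choices.

```python
def single_tank_distribution(serial, max_tanks):
    """Normalised distribution over the total X given one observed serial number."""
    # Assumed flat prior over 1..max_tanks; the likelihood of seeing this
    # serial is 1/X when X >= serial and 0 otherwise.
    weights = [1.0 / x if x >= serial else 0.0 for x in range(1, max_tanks + 1)]
    total = sum(weights)
    return [w / total for w in weights]  # index i holds P(X = i + 1)


dist = single_tank_distribution(serial=217, max_tanks=1000)

# The curve peaks at the smallest total that is still possible.
peak = max(range(len(dist)), key=dist.__getitem__) + 1
print("most likely total:", peak)

# A single tank leaves a fair amount of probability out in the tail.
print("P(X >= 900):", round(sum(dist[899:]), 3))
```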
This is the case where the actual number of tanks produced was 274, and the maximum we think they could produce is 1000, and the serial number of the one tank found was 217. So the most likely number of tanks is 217, but there is still a lot of probability that there could be 900 or 1000.
Now for the cool part. You hear a report that another tank has been knocked out, this time serial number 91. You might think that tells you nothing new, as you know the minimum number is 217, but 91 has a similar probability distribution to 217, and to get the resultant distribution for the total number of tanks you multiply these together.
I've brushed over some of the key Bayesian words and concepts here, but this is basically what it boils down to; we get more evidence and we update our beliefs. So, what's the result of now finding tank 91? The result is the red curve below.
Notice that the most likely number of tanks is still 217, but knowing 91 as well has really started to suppress the numbers up near 1000.
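The "multiply these together" step is easy to sketch. This is only an illustration, again assuming a flat prior up to M = 1000; `combined_distribution` is a name I made up for it.

```python
def combined_distribution(serials, max_tanks):
    """Multiply the single-tank curves for each serial, then renormalise."""
    weights = []
    for x in range(1, max_tanks + 1):
        w = 1.0
        for s in serials:
            # Each tank contributes 1/x if a total of x is possible, else 0.
            w *= 1.0 / x if x >= s else 0.0
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]  # index i holds P(X = i + 1)


one = combined_distribution([217], 1000)
two = combined_distribution([217, 91], 1000)

peak_two = max(range(len(two)), key=two.__getitem__) + 1
print("peak after two tanks:", peak_two)

# Compare how much probability sits at 900 or more before and after tank 91.
print("tail with one tank: ", round(sum(one[899:]), 3))
print("tail with two tanks:", round(sum(two[899:]), 3))
```

Tank 91 doesn't move the peak, but it does make the curve fall off as 1/X² instead of 1/X, which is what squeezes the tail.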
Reports come in that three more tanks have been knocked out, 256, 248 and 61. What does the resultant distribution look like?
Again, each of the blue curves is the probability distribution for each tank, whereas the red is the total. Notice that the peak is now at 256, and the chance of more than 600 tanks being produced per month is pretty small, and 1000 is negligible.
Reports come in of 5 more tanks, numbers 250, 172, 189, 29 and 170. What's the distribution now?
For clarity, I've left out the blue curves, but you can see that with just 10 tanks, we know the number produced is at least 256, but quite probably less than 400.
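To put a rough number on "quite probably", here is a sketch using the ten serial numbers above. Multiplying k of the 1/X curves gives something proportional to X^(-k) above the largest serial seen and zero below it, so that shortcut is used directly. The flat prior up to M = 1000 and the helper `prob_below` are my assumptions.

```python
serials = [217, 91, 256, 248, 61, 250, 172, 189, 29, 170]
max_tanks = 1000  # assumed upper bound M

largest, count = max(serials), len(serials)

# Product of `count` single-tank curves: x**-count for x >= largest, else 0.
weights = [float(x) ** -count if x >= largest else 0.0
           for x in range(1, max_tanks + 1)]
total = sum(weights)
posterior = [w / total for w in weights]  # index i holds P(X = i + 1)


def prob_below(limit):
    """Probability that the total is strictly less than `limit`."""
    return sum(posterior[: limit - 1])


peak = max(range(len(posterior)), key=posterior.__getitem__) + 1
print("peak:", peak)
print("P(X < 400):", round(prob_below(400), 3))
```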
We can continue to play this game, and with 25 tanks knocked out, we get
Notice that I've changed the scale on the x-axis. We can be quite confident that fewer than 300 were made.
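The 25 serial numbers aren't listed here, so if you want to try this yourself, one option is to simulate them. The sketch below draws 25 distinct serials at random from a true production of 274; the random draw, the seed and the flat prior up to 1000 are all assumptions, and the numbers it prints are simulated rather than the ones behind the plot.

```python
import random

true_total, max_tanks, knocked_out = 274, 1000, 25

rng = random.Random(0)  # arbitrary seed, only so the run is repeatable
serials = rng.sample(range(1, true_total + 1), knocked_out)

largest = max(serials)

# Same shortcut as before: the product of the curves is x**-k above the
# largest serial seen and zero below it.
weights = [float(x) ** -knocked_out if x >= largest else 0.0
           for x in range(1, max_tanks + 1)]
total = sum(weights)
posterior = [w / total for w in weights]  # index i holds P(X = i + 1)

peak = max(range(len(posterior)), key=posterior.__getitem__) + 1
print("largest serial seen:", largest)
print("peak of distribution:", peak)
print("P(X < 300):", round(sum(posterior[:299]), 3))
```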
Now I think that is cool. And that's how information should be used.