## Monday, March 15, 2010

### Example 7.27: probability question reconsidered

In Example 7.26, we considered a problem from the xkcd blog:

> Suppose I choose two (different) real numbers, by any process I choose. Then I select one at random (p = 0.5) to show Nick. Nick must guess whether the other is smaller or larger. Being right 50% of the time is easy. Can he do better?

Randall Munroe offers a solution, which we tested in the previous example. However, his logic may not be obvious to all readers. Instead, Martin Kulldorff, author of the SaTScan software for cluster detection, offers the following intuition. Suppose Nick chooses a number k. If the first number is bigger than k, he'll guess that the hidden number is smaller, and vice versa. Let's condition on where the two numbers fall relative to k. If both numbers are bigger than k, Nick will be right 50% of the time. If both are smaller than k, he'll also be right 50% of the time. But whenever the two numbers straddle k, he'll be right 100% of the time. So marginally, Nick will be right more than 50% of the time, as long as the two numbers have some chance of straddling k (recall that they cannot be identical). This strategy will sometimes work if I don't know Nick is using it and he makes a lucky choice of k. But even if I know Nick's strategy, he can still achieve a success rate above 50% by simply choosing k at random.
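Kulldorff's conditioning argument is easy to verify by simulation. The following sketch (in Python, for illustration only; it is not part of the book's code) draws both numbers uniformly from (0, 1) and fixes k = 0.5, so the three cases can be tallied directly:

```python
import random

# Sketch of the fixed-k argument (illustration only, not the book's code).
# Both secret numbers are Uniform(0,1) and Nick fixes k = 0.5.
random.seed(1)
k = 0.5
n = 100000
correct = straddle = straddle_correct = 0
for _ in range(n):
    a, b = random.random(), random.random()
    shown, hidden = (a, b) if random.random() < 0.5 else (b, a)
    guess_lower = shown > k               # shown > k: guess hidden is smaller
    right = (hidden < shown) == guess_lower
    correct += right
    if min(a, b) < k < max(a, b):         # the two numbers straddle k
        straddle += 1
        straddle_correct += right
print(correct / n)                  # well above 0.5 (around 0.75 here)
print(straddle_correct / straddle)  # 1.0: always right when k is straddled
```

The overall rate near 75% is special to this setup: with Uniform(0,1) numbers and k = 0.5, the straddle happens about half the time, and the guess is always right in that case.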

Munroe's strategy is the same as the one outlined here, if we choose

k = log(u/(1-u))

where u is a Uniform(0,1) random variate. In practice, though, this has a weakness similar to Munroe's approach: k rarely (only .000000002 of the time) exceeds about 20 in absolute value. A more effective strategy would be to choose k from a distribution whose values are spread more widely across the reals. In the following code we use a Cauchy distribution (see Section 1.10.1), spreading it out even further by multiplying each pseudo-random number by 1000.
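The tail claim can be checked analytically: k = log(u/(1-u)) is a standard logistic variate, with CDF F(x) = 1/(1 + exp(-x)). A quick Python check (an illustration, not from the book) confirms the order of magnitude:

```python
import math

# k = log(u/(1-u)) has the standard logistic distribution,
# with CDF F(x) = 1 / (1 + exp(-x)).
p_one_tail = 1.0 / (1.0 + math.exp(20))   # P(k > 20)
p_both_tails = 2 * p_one_tail             # P(|k| > 20), by symmetry
print(p_one_tail)     # roughly 2.1e-09
print(p_both_tails)   # roughly 4.1e-09
```

So a value of k beyond 20 in absolute value turns up only a few times in a billion draws, which is why the logistic choice of k almost never lands between two large numbers.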

SAS

The code is similar to that shown in the previous example, but does not require computing a function of the observed value.

```sas
data test2;
  do i = 1 to 100000;
    real1 = uniform(0) * 1000 + 1000;
    real2 = uniform(0) * 1000 + 1000;
    if uniform(0) gt .5 then do;
      env1 = real1;
      env2 = real2;
    end;
    else do;
      env1 = real2;
      env2 = real1;
    end;
    if (env1 gt (rand("CAUCHY") * 1000)) then guess = "l";
    else guess = "h";
    correct = (((env1 < env2) and (guess eq "h")) or
               ((env1 > env2) and (guess eq "l")));
    output;
  end;
run;

proc freq data = test2;
  tables correct / binomial (level='1');
run;
```

```
The FREQ Procedure

                                    Cumulative    Cumulative
correct    Frequency     Percent     Frequency      Percent
------------------------------------------------------------
      0       48315       48.32         48315        48.32
      1       51685       51.69        100000       100.00

Proportion                0.5169
ASE                       0.0016
95% Lower Conf Limit      0.5138
95% Upper Conf Limit      0.5199

Exact Conf Limits
95% Lower Conf Limit      0.5137
95% Upper Conf Limit      0.5200

ASE under H0              0.0016
Z                        10.6569
One-sided Pr >  Z         <.0001
Two-sided Pr > |Z|        <.0001

Sample Size = 100000
```

R

```r
mk1 = function(n) {
  real1 = runif(n) * 1000 + 1000
  real2 = runif(n) * 1000 + 1000
  pickenv = (runif(n) < .5)
  env1 = ifelse(pickenv, real1, real2)
  env2 = ifelse(!pickenv, real1, real2)
  k = rcauchy(n) * 1000
  guess = ifelse(env1 > k, "lower", "higher")
  correct = ((env1 < env2) & (guess == "higher")) |
    ((env1 > env2) & (guess == "lower"))
  return(correct)
}
```

```
> binom.test(sum(mk1(100000)), 100000)

        Exact binomial test

data:  sum(mk1(1e+05)) and 1e+05
number of successes = 51764, number of trials = 1e+05, p-value < 2.2e-16
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.5145375 0.5207415
sample estimates:
probability of success
               0.51764
```

This approach is so much better than the other that we can detect the effect in only 100,000 trials!
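The advantage of the scaled Cauchy over the logistic choice of k can also be seen head to head. The following sketch (in Python, for illustration only; it mirrors the setup of the SAS and R code, with the secret numbers uniform on (1000, 2000)) compares the two:

```python
import math
import random

# Head-to-head comparison (illustration only): success rates when k is a
# standard logistic draw vs. a Cauchy draw scaled by 1000. The two secret
# numbers are Uniform(1000, 2000), as in the SAS and R code above.
def success_rate(n, draw_k):
    correct = 0
    for _ in range(n):
        a, b = random.uniform(1000, 2000), random.uniform(1000, 2000)
        shown, hidden = (a, b) if random.random() < 0.5 else (b, a)
        guess_lower = shown > draw_k()
        correct += (hidden < shown) == guess_lower
    return correct / n

def logistic_k():
    u = random.random()
    return math.log(u / (1 - u))    # |k| > 20 almost never happens

def cauchy_k():
    # inverse-CDF draw from a Cauchy distribution with scale 1000
    return 1000 * math.tan(math.pi * (random.random() - 0.5))

random.seed(2)
n = 100000
p_logistic = success_rate(n, logistic_k)
p_cauchy = success_rate(n, cauchy_k)
print(p_logistic)   # indistinguishable from 0.5
print(p_cauchy)     # about 0.51 to 0.52
```

The logistic k essentially never falls between two numbers in the thousands, so that strategy degenerates to coin-flipping here, while the scaled Cauchy retains a detectable edge.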