SAS and R: multinomial observations

Showing posts with label multinomial observations. Show all posts

Wednesday, January 20, 2010

Example 7.23: the Monty Hall problem

The Monty Hall problem illustrates a simple setting where intuition often leads to a solution different from formal reasoning. The situation is based on the game show Let's Make a Deal. First, Monty puts a prize behind one of three doors. Then the player chooses a door. Next, (without moving the prize) Monty opens an unselected door, revealing that the prize is not behind it. The player may then switch to the other non-selected door. Should the player switch?

Many people see that there are now two doors to choose between, and feel that since Monty can always open a non-prize door, there's still equal probability for each door. If that were the case, the player might as well keep the original door.

This approach is so attractive that when Marilyn Vos Savant asserted that the player should switch (in her Parade magazine column), there were reportedly 10,000 letters asserting she was wrong. There was also an article published in The American Statistician. As reported on the Wikipedia page, out of 228 subjects in one study, only 13% chose to switch (Granberg and Brown, Personality and Social Psychology Bulletin 21(7): 711-729). In her book, The Power of Logical Thinking, vos Savant quotes cognitive psychologist Massimo Piattelli-Palmarini as saying "... no other statistical puzzle comes so close to fooling all the people all the time" and "that even Nobel physicists systematically give the wrong answer, and that they insist on it, and they are ready to berate in print those who propose the right answer."

One correct intuitive route is to observe that Monty's door is fixed. The probability that the player has the right door is 1/3 before Monty opens the non-prize door, and remains 1/3 after that door is open. This means that the probability the prize is behind one of the other doors is 2/3, both before and after Monty opens the non-prize door. After Monty opens the non-prize door, the player gets a 2/3 chance of winning by switching to the remaining door. If the player wants to win, they should switch doors.

The excellent Wikipedia entry referenced above provides additional intuitive tools, as well as variants and history.

One way to prove to yourself that switch is best is through simulation. In fact, even deciding how to code the problem may be enough to convince yourself to switch.

SAS
In SAS, we assign the prize to a door, then make an initial guess. If the guess was right, Monty can open either door. We'll switch to the other door. Rather than have Monty choose a door, we'll choose one, under the assumption that Monty opened the other one. We do this with a do until loop (section 1.11.1). If our initial guess was wrong, Monty will open the only remaining non-prize door, and when we switch we'll be choosing the prize door.


data mh;
do i = 1 to 100000;
  prize = rand("TABLE",.333,.333);
* Monty puts the prize behind a random door;
  initial_guess = rand("TABLE",.333,.333);
* We make a random initial guess;
* if the initial guess is right, Monte can 
                  open either of the others;
* which means that player can switch to either 
                  of the other doors;
  if initial_guess eq prize then do;
    new_guess = initial_guess;
    * choose a door until it's different from
                  the initial guess-- that's
                  the door we switch to;
    do until (new_guess ne initial_guess);
      new_guess = rand("TABLE",.333,.333);
   end;
    end;
* If the initial guess was wrong, Monte can rule 
          out only one of the other doors;
* which means that we must switch to the right door;
  if initial_guess ne prize then new_guess = prize;
  output;
  end;
run;

To see what happened, we'll summarize the resulting data set. If our initial guess was right, we'll win by keeping it. If switching leads to a win, the new guess is where the prize is. We create new variables to indicate winning using a logical test (as in section 1.4.9).


data mh2;
set mh;
win_by_keep = (initial_guess eq prize);
win_by_switch = (new_guess eq prize);
run;

proc means data = mh2 mean;
var win_by_keep win_by_switch;
run;


The MEANS Procedure

Variable                 Mean
-----------------------------
win_by_keep         0.3332700
win_by_switch       0.6667300
-----------------------------

Note that these two values sum to exactly one. If we chose the prize initially, we always win by keeping it, and if we did not choose the prize initially, we always win by switching. In any event, the simulation supports the conclusion that we should switch.

R

In R, we write two functions. In one, Monty opens a door, choosing at random among the non-chosen doors if the initial choice was correct, or choosing the one non-selected non-prize door if the initial choice was wrong. The other function returns the door chosen by swapping. We use the sample() function (section 1.5.2) to randomly pick one value. We then use these functions on each trial with the apply() statement (section B.5.3).


numsim = 100000
doors = 1:3
opendoor = function(x) { 
 if (x[1]==x[2])
    return(sample(doors[-c(x[1])], 1)) 
 else return(doors[-c(x[1],x[2])])
}
swapdoor = function(x) { return(doors[-c(x[1], x[2])]) }
winner = sample(doors, numsim, replace=TRUE)
choice = sample(doors, numsim, replace=TRUE)
open = apply(cbind(winner, choice), 1, opendoor)
newchoice = apply(cbind(open, choice), 1, swapdoor)

To display the results, we use the cat() statement to show how text and variables can be integrated.


> cat("without switching, won ",
+ round(sum(winner==choice)/numsim*100,1)," 
+ percent of the time.\n", sep="")
without switching, won 33.2 percent of the time.
> cat("with switching, won ",
+ round(sum(winner==newchoice)/numsim*100,1)," 
+ percent of the time.\n", sep="")
with switching, won 66.8 percent of the time.

Friday, January 8, 2010

Example 7.21: Write a function to simulate categorical data

In example 7.20, we showed how to simulate categorical data. But we might anticipate needing to do that frequently. If a SAS function weren't built in and an equivalent R function not available in a package, we could build them from scratch.

SAS

The SAS code is particularly tortured, since we must parse the parameter string to extract the table probabilities. We do this with the count and find (section 1.4.6) and substr (section 1.4.3) functions. We also turn the characters into numbers (section 1.4.2). The SAS Macro (section A.8.1) runs a data step, so it cannot be called from within another data step, as would usually be desirable. Instead, you would have to generate the data and then merge it into any additional data that was required. On the whole, it might make more sense to treat every application as a special case and hand-code each one, rather than to use the macro, were a built-in function not available.


%macro rtable(outdata=, outvar=, n=1, p=);
data &outdata (keep=&outvar);
/* In the first section of code, we determine the number
   of categories and set up an array to hold the 
   probabilities for each category.  A maximum number of 
   catagories must be provided; here we use 100. */
nprobs = count(&p,",");
lengthprobs = length(&p);
array probs [100] _temporary_;

/* In the next section we parse the parameter string to 
   extract the probabilities for each category. */
lastcomma = 0;
lastprob = 0;
do prob = 1 to nprobs;
 nextcomma = find(&p,",",lastcomma +1);
 probs[prob] = lastprob + 
     substr(&p,lastcomma+1, (nextcomma-lastcomma -1));
 lastcomma = nextcomma;
 lastprob = probs[prob];
 end;
/* Finally, we generate a uniform random variate and
   use it to find the appropriate category. */
do i = 1 to &n;
 __x = uniform(0);
 &outvar = 1;
 do j = 1 to nprobs;
   &outvar = &outvar + (__x gt probs[j]);
       end;
 output;
 end;
run;
%mend rtable;

We've left this code so that an extra comma must be included in the probability parameter string. This could be fixed with an additional line or two in the data step.

We can then run the macro and test the resulting data using proc freq.


%rtable(outdata=test2,outvar= mycat, n = 10000, 
                                   p = ".1,.2,.3,");

proc freq data = test2; tables mycat; run;

                                Cumulative  Cumulative
mycat    Frequency   Percent    Frequency     Percent
------------------------------------------------------
  1        1032       10.32        1032        10.32
  2        2013       20.13        3045        30.45
  3        2956       29.56        6001        60.01
  4        3999       39.99       10000       100.00

R

In contrast, the R code to write a function (section B.5.2) is trivial, using the cut() function as suggested by Douglas Rivers, discussed in section 1.4.10.


rtable1 <- function(cuts,n) {
   output <- cut(runif(n), cuts, labels=FALSE)
   return(output)
}

In fact, the function hardly seems worth writing. To use it, you need to include 0 and 1 as the first and last elements of the cuts vector. Results can be examined using the table() function (section 2.2.1)


> mycat1 <- rtable1(c(0,.1,.3,.6,1),10000)
> summary(factor(mycat1))
   1    2    3    4 
1058 1989 2978 3975

To add a little bit of programming to it, suppose we want to list the probabilities per category, rather than the cutpoints. We'll omit the final category probability, assuming the input vector sums to less than 1 and has 1 fewer elements than the desired number of categories.


rtable2 <- function(p, n) {
   cuts <- numeric(length(p) + 2)
   cuts[1] <- 0
   cuts[length(p) + 2] <- 1
   for (k in 1:length(p))  {
      cuts[k+1] <- sum(p[1:k])
   }
   output <- cut(runif(n), cuts, labels=FALSE)
   return(output)
}

The results are similar to the simpler function. We can see the results tabulated without explicitly making the result a factor by using the table() function (section 2.2).


> mycat2 <- rtable2(c(.1, .2, .3), 10000)
> table(mycat2)
mycat2
   1    2    3    4 
 961 1986 3058 3995

Monday, January 4, 2010

Example 7.20: Simulate categorical data

Both SAS and R provide means of simulating categorical data (see section 1.10.4). Alternatively, it is trivial to write code to do this directly. In this entry, we show how to do it once. In a future entry, we'll demonstrate writing a SAS Macro (section A.8.1) and a function in R (section B.5.2) to do it repeatedly.

SAS


data test;
p1 = .1; p2 = .2; p3 = .3;
do i = 1 to 10000;
  x = uniform(0);
  mycat = (x ge 0) + (x gt p1) + (x gt p1 + p2) 
                                + (x gt p1 + p2 + p3);
  output;
  end;
run;

Here the parenthetical logical tests in the mycat = line resolve to 1 if the test is true and 0 otherwise, as discussed in section 1.4.9.
The (x ge 0) makes the categories range from 1 to 4, rather than 0 to 3.

The results can be assessed using proc freq:


proc freq data=test; tables mycat; run;

                                  Cumulative    Cumulative
mycat    Frequency     Percent     Frequency      Percent
----------------------------------------------------------
    1         947        9.47           947         9.47
    2        2061       20.61          3008        30.08
    3        3039       30.39          6047        60.47
    4        3953       39.53         10000       100.00

R

In contrast, the R syntax to get the results is rather dense.


p <- c(.1,.2,.3)
x <- runif(10000)
mycat <- numeric(10000)
for (i in 0:length(p)) { 
    mycat <- mycat + (x >= sum(p[0:i])) 
    }

We can display the results using the summary() function.


summary(factor(mycat))
  1    2    3    4
 990 2047 2978 3985

Monday, December 14, 2009

Example 7.18: Displaying missing value categories in a table

When displaying contingency tables (section 2.3.1), there are times when it is useful to either show or hide the missing data category. Both SAS and the typical R command default to displaying the table only for observations where both factors are observed.

In this example, we generate some multinomial data (section 1.10.4) and then produce tables with and without missing data categories.

SAS

Generate the multinomial data, uniform data, and use the latter to censor the former:


data blog;
do i = 1 to 300;
   x = rand("TABLE",.3,.4);
   y = rand("TABLE",.3,.4);
   if uniform(0) gt .8 then x = .;
   if uniform(0) gt .8 then y = .;
   output;
   end;
run;

Print the default table with only complete data. Note the options used to reduce output, as in section 4.6.9.


proc freq data=blog;
   tables y*x / norow nocol;
run;

This produces:


Table of y by x

y         x

Frequency|
Percent  |       1|       2|       3|  Total
---------+--------+--------+--------+
       1 |     16 |     13 |     18 |     47
         |   8.16 |   6.63 |   9.18 |  23.98
---------+--------+--------+--------+
       2 |     18 |     32 |     22 |     72
         |   9.18 |  16.33 |  11.22 |  36.73
---------+--------+--------+--------+
       3 |     28 |     31 |     18 |     77
         |  14.29 |  15.82 |   9.18 |  39.29
---------+--------+--------+--------+
Total          62       76       58      196
            31.63    38.78    29.59   100.00

Frequency Missing = 104

The missing categories are included through the missprint option.


proc freq data = blog;
   tables y*x / norow nocol missprint;
run;

This produces:


Table of y by x

y         x

Frequency|
Percent  |       .|       1|       2|       3|  Total
---------+--------+--------+--------+--------+
       . |     12 |     12 |     20 |     14 |      .
         |      . |      . |      . |      . |      .
---------+--------+--------+--------+--------+
       1 |     10 |     16 |     13 |     18 |     47
         |      . |   8.16 |   6.63 |   9.18 |  23.98
---------+--------+--------+--------+--------+
       2 |     17 |     18 |     32 |     22 |     72
         |      . |   9.18 |  16.33 |  11.22 |  36.73
---------+--------+--------+--------+--------+
       3 |     19 |     28 |     31 |     18 |     77
         |      . |  14.29 |  15.82 |   9.18 |  39.29
---------+--------+--------+--------+--------+
Total           .       62       76       58      196
                .    31.63    38.78    29.59   100.00
Frequency Missing = 104

Note that if there are no missing values, SAS will not print the rows and columns headed with a '.' which is analogous to the "ifany" option in R shown below.

R

First, generate the data:


library(Hmisc)
x <- rMultinom(matrix(c(.3,.3,.4),1,3),300)
y <- rMultinom(matrix(c(.3,.3,.4),1,3),300)

Then, generate some random Uniforms to censor some of the observed data:


censprobx <- runif(300)
censproby <- runif(300)

Censor the data:


x[censprobx > .8] <- NA
y[censproby > .8] <- NA

Produce the default table (omits any missing data):


table(y,x)

   x
y    1  2  3
  1 18 18 29
  2 17 21 22
  3 20 30 40

Make the table which includes the missing category:


table(y, x, useNA="ifany")

      x
y       1  2  3   NA
  1    18 18 29    9
  2    17 21 22   17
  3    20 30 40   17
  NA   14  5 14    9

The useNA option also allows the values "no" and "always". The value "no" corresponds to the default behavior in R or SAS, while the "always" option is not available in SAS. SAS, however, shows the total number missing in any case.

Reviews (from the first edition)

"By placing the R and SAS solutions together and by covering a vast array of tasks in one book, Kleinman and Horton have added surprising value and searchability to the information in their book. … a home run, and it is a book I am grateful to have sitting, dust-free, on my shelf."
—Robert Alan Greevy, Jr, Teaching of Statistics in the Health Sciences

"I use SAS and R on a daily basis. Each has strengths and weaknesses, and using both of them gives the advantage of being able to do almost anything when it comes to data manipulation, analysis, and graphics. If you use both SAS and R on a regular basis, get this book. If you know one of the packages and are learning the other, you may need more than this book, but get this book, too. "

Charles Heckler, University of Rochester, Technometrics

"Excellent cross-referencing to other topics and end-of-chapter worked examples on the ‘Health evaluation and linkage to primary care’ data set are given with each topic. … users who are proficient in either of the software packages but with the need to use the other will find this book useful."
—Frances Denny, Journal of the Royal Statistical Society, Series A

About the authors

Nicholas Horton is a Professor of Statistics at Amherst College. He is a biostatistician with expertise in missing data methods, longitudinal regression, statistical computing and statistical education. Nick's home page; Nick's Google Scholar author page

Ken Kleinman is an Associate Professor with the Department of Biostatistics and Epidemiology at the University of Massachusetts, Amherst. He is a consulting biostatistician with expertise in group-randomized trials and disease surveillance; he also offers R training courses. Ken's home page; Ken's Google Scholar author page.

SAS and R

Catalogs of posts