## Monday, June 18, 2012

### Example 9.35: Discrete randomization and formatted output

A colleague asked for help with randomly choosing a kid within a family. This is for a trial in which families are recruited at well-child visits, but in each family only one of the children having a well-child visit that day can be in the study. The idea is that after recruiting the family, the research assistant needs to choose one child, but if they make that choice themselves, the children are unlikely to be representative. Instead, we'll allow them to make a random decision through an easily used slip that can be put into sealed envelopes. The envisioned process is that the RA will recruit the family, determine the number of eligible children, then open the envelope to find out which child was randomly selected.

One thought here would be to generate separate stacks of envelopes for each given family size, and have the research assistant open an envelope from the appropriate stack. However, this could be logistically challenging, especially since the RAs will spend weeks away from the home office. Instead, we'll include all plausible family sizes on each slip of paper. It seems unlikely that more than 5 children in a family will have well-child visits on the same day.

SAS
We'll use the SAS example to demonstrate using SAS macros to write SAS code, as well as showing a plausible use for SAS formats (section 1.4.12) and making use of proc print.
```
/* the following macro will write out equal probabilities for selecting
each integer between 1 and the argument, in the format needed for the
rand function.  E.g., if the argument is 3, it will write out
1/3,1/3,1/3
*/
%macro tbls(n);
%do i = 1 %to &n;
1/&n %if &i < &n %then ,
%end;
%mend tbls;

/* then we can use the %tbls macro to create the randomization
via rand("TABLE") (section 1.10.4). */
data kids;
do family = 1 to 10000;
  nkids = 2; chosen = rand("TABLE",%tbls(2)); output;
  nkids = 3; chosen = rand("TABLE",%tbls(3)); output;
  nkids = 4; chosen = rand("TABLE",%tbls(4)); output;
  nkids = 5; chosen = rand("TABLE",%tbls(5)); output;
end;
run;

/* check randomization */
proc freq data = kids;
table nkids * chosen / nocol nopercent;
run;
```

(The row totals of 100000 show that the table below came from a run with 100000 families, rather than the 10000 in the code above.)

```
    nkids     chosen

   Frequency|
   Row Pct  |       1|       2|       3|       4|       5|  Total
   ---------+--------+--------+--------+--------+--------+
          2 |  50256 |  49744 |      0 |      0 |      0 | 100000
            |  50.26 |  49.74 |   0.00 |   0.00 |   0.00 |
   ---------+--------+--------+--------+--------+--------+
          3 |  33429 |  33292 |  33279 |      0 |      0 | 100000
            |  33.43 |  33.29 |  33.28 |   0.00 |   0.00 |
   ---------+--------+--------+--------+--------+--------+
          4 |  25039 |  24839 |  25245 |  24877 |      0 | 100000
            |  25.04 |  24.84 |  25.25 |  24.88 |   0.00 |
   ---------+--------+--------+--------+--------+--------+
          5 |  19930 |  20074 |  20188 |  20036 |  19772 | 100000
            |  19.93 |  20.07 |  20.19 |  20.04 |  19.77 |
   ---------+--------+--------+--------+--------+--------+
   Total      128654   127949    78712    44913    19772   400000
```

Looks pretty good. Now we need to make the output usable by the research assistants, by formatting the results into English. We'll use the same format for each number of kids. This saves some keystrokes now, but may cause the RAs some confusion: it means that we might refer to the "4th oldest" of 4 children, rather than the "youngest". We could fix this by using a different format for each number of children, analogous to the R version below.
```
proc format;
value chosen
1 = "oldest"
2 = '2nd oldest'
3 = '3rd oldest'
4 = '4th oldest'
5 = '5th oldest';
run;

/* now, make a text variable that concatenates (section 1.4.5) the
variables and some explanatory text */
data k2;
set kids;
if nkids eq 2 then
  t1 = "If there are " || strip(nkids) ||" children then choose the " ||
       strip(put(chosen,chosen.)) || " child.";
else
  t1 = "             " || strip(nkids) ||" ________________________ " ||
       strip(put(chosen,chosen.));
run;

/* then we print.  Notice the options to print in plain text, shorten the
page length and width, and remove the date and page number from the SAS
output, as well as in the proc print statement to remove the observation
number and show the line number, with a few other tricks.
obs = 12 prints 3 envelopes x 4 family sizes, matching the output below */
options nonumber nodate ps = 60 ls = 68;
OPTIONS FORMCHAR="|----|+|---+=|-/\<>*";
proc print data = k2 (obs = 12) noobs label sumlabel;
by family;
var t1;
label t1 = '00'x family = "Envelope";
run;
```

```
---------------------------- Envelope=1 ----------------------------

     If there are 2 children then choose the 2nd oldest child.
                  3 ________________________ 3rd oldest
                  4 ________________________ 4th oldest
                  5 ________________________ 5th oldest

---------------------------- Envelope=2 ----------------------------

     If there are 2 children then choose the 2nd oldest child.
                  3 ________________________ oldest
                  4 ________________________ oldest
                  5 ________________________ 3rd oldest

---------------------------- Envelope=3 ----------------------------

     If there are 2 children then choose the 2nd oldest child.
                  3 ________________________ 2nd oldest
                  4 ________________________ 3rd oldest
                  5 ________________________ 2nd oldest
```

R
For R, we leave some trial code in place, to demonstrate how one might discover, test, and build R code in this setting. Most results have been omitted.
```
sample(5, size = 1)
# choose a (discrete uniform) random integer between 1 and 5

apply(matrix(2:5),1,sample,size=1)
# choose a random integer between 1 and 2, then between 1 and 3, etc.,
# using apply() to repeat the call to sample() with different maximum number
# apply() needs a matrix or array input
# result of this is the raw data needed for one family

replicate(3,apply(matrix(2:5),1,sample,size=1))
# replicate() is in the apply() family and just repeats the
# function n times

     [,1] [,2] [,3]
[1,]    2    1    2
[2,]    2    1    2
[3,]    2    2    2
[4,]    3    5    4
```
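The `apply(matrix(2:5), ...)` idiom works, but the same draws can be sketched without the matrix wrapper. The version below is only an illustration: `draw_family` is a hypothetical helper name that does not appear in the code above, and the seed is arbitrary.

```r
# Hypothetical helper: one envelope's worth of draws for family sizes 2..max_kids
draw_family <- function(max_kids = 5) {
  # sample.int(k, size = 1) picks one integer uniformly from 1..k
  vapply(2:max_kids, function(k) sample.int(k, size = 1), integer(1))
}

set.seed(935)                         # arbitrary seed, only for reproducibility
draws <- replicate(3, draw_family())  # 4 x 3 matrix: rows are family sizes 2..5
draws
```

Using `vapply()` with `integer(1)` makes the expected return type explicit, so a mistake in the inner function fails loudly rather than silently changing the shape of the result.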

Now we have the raw data for the envelopes. Before formatting it for printing, let's check it to make sure it works correctly.
```
test=replicate(100000, apply(matrix(2:5), 1, sample, size=1))
apply(test, 1, summary)
        [,1] [,2]  [,3]  [,4]
Min.     1.0    1 1.000 1.000
1st Qu.  1.0    1 1.000 2.000
Median   1.0    2 2.000 3.000
Mean     1.5    2 2.492 3.003
3rd Qu.  2.0    3 3.000 4.000
Max.     2.0    3 4.000 5.000

# this is not so helpful-- need the count or percent for each number
# this would be the default if the data were factors, but they aren't
# check to see if we can trick summary() into treating these integers
# as if they were factors
methods(summary)
# yes, there's a summary() method for factors-- let's apply it
# there's also apply(test,1,table) which might be better, if you remember it
apply(test, 1, summary.factor)
[[1]]

    1     2
50025 49975

[[2]]

    1     2     3
33329 33366 33305

[[3]]

    1     2     3     4
25231 25134 24849 24786

[[4]]

    1     2     3     4     5
19836 20068 20065 20022 20009

# apply(test,1,table) will give similar results, if you remember it
```
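As an aside, the same check can be expressed as proportions, with a rough chi-square goodness-of-fit test for each family size. This is a sketch under assumptions of my own: the helper `count_row`, the seed, and the smaller simulation size are not part of the code above, and `chisq.test()` is used here with its default of equal expected probabilities.

```r
set.seed(935)   # arbitrary seed, only for reproducibility
sizes <- 2:5
n <- 10000      # smaller than the run above, to keep the sketch quick

# rows = family sizes 2..5, columns = simulated envelopes
test <- replicate(n, vapply(sizes, function(k) sample.int(k, size = 1), integer(1)))

# Hypothetical helper: counts for row i, keeping every level 1..k
count_row <- function(i) table(factor(test[i, ], levels = seq_len(sizes[i])))

counts <- lapply(seq_along(sizes), count_row)
props  <- lapply(counts, function(x) round(prop.table(x), 3))

# chisq.test() on a one-way table tests against equal probabilities by default
pvals <- vapply(counts, function(x) chisq.test(x)$p.value, numeric(1))

props
pvals
```

Small p-values would suggest a problem with the randomization; with a correct sampler they should look unremarkable in most runs.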

Well, that's not too pretty, but it's clear that the randomization is working. Now it's time to work on formatting the output.
```
mylist=replicate(5, apply(matrix(2:5), 1, sample, size=1))
# brief example data set

# We'll need to use some formatted values (section 1.4.12), as in SAS.
# Here, we'll make new value labels for each number of children,
# which will make the output easier to read.  We add in an envelope
# number and wrap it all into a data frame.
df = data.frame(envelope = 1:5,
  twokids=factor(mylist[1,],1:2,labels=c("youngest","oldest")),
  threekids=factor(mylist[2,],1:3,labels=c("youngest", "middle", "oldest")),
  fourkids=factor(mylist[3,],1:4,labels=c("youngest", "second youngest",
       "second oldest", "oldest")),
  fivekids=factor(mylist[4,],1:5,labels=c("youngest", "second youngest",
       "middle", "second oldest", "oldest")))

# now we need a function to take a row of the data frame and make a single slip
# the paste() function (section 1.4.5) puts together the fixed and variable
# content of each row, while the cat() function will print it without quotes
slip = function(kidvec) {
  cat(paste("------------- Envelope", kidvec[1], "------------------"))
  cat(paste("\nIf there are", 2:5, " children, select the", kidvec[2:5],"child"))
  cat("\n \n \n")
}

# test it on one row
slip(df[1,])

# looks good-- now we can apply() it to each row of the data frame
apply(df, 1, slip)

------------- Envelope 1 ------------------
If there are 2  children, select the youngest child
If there are 3  children, select the youngest child
If there are 4  children, select the second youngest child
If there are 5  children, select the youngest child


------------- Envelope 2 ------------------
If there are 2  children, select the youngest child
If there are 3  children, select the youngest child
If there are 4  children, select the second oldest child
If there are 5  children, select the middle child


------------- Envelope 3 ------------------
If there are 2  children, select the youngest child
If there are 3  children, select the youngest child
If there are 4  children, select the youngest child
If there are 5  children, select the second youngest child

# and so forth

# finally, we can save the result in a file with
# capture.output()
capture.output(apply(df,1,slip), file="testslip.txt")
```
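One alternative, sketched here under assumptions rather than taken from the code above, is to build each slip as a character vector and print it with `writeLines()`, which avoids passing through a data frame and `cat()`. The names `labels` and `make_slip` are hypothetical, and the seed is arbitrary; the label order follows the factor labels used above (1 = youngest).

```r
# Labels for families of 2, 3, 4 and 5 children, in the order used above
labels <- list(
  c("youngest", "oldest"),
  c("youngest", "middle", "oldest"),
  c("youngest", "second youngest", "second oldest", "oldest"),
  c("youngest", "second youngest", "middle", "second oldest", "oldest")
)

# Hypothetical helper: the text of one slip, given the envelope number and
# the four draws (for 2, 3, 4 and 5 children, in that order)
make_slip <- function(envelope, draws) {
  chosen <- mapply(function(lab, d) lab[d], labels, draws)
  c(sprintf("------------- Envelope %d ------------------", envelope),
    sprintf("If there are %d children, select the %s child", 2:5, chosen),
    "")   # blank line between slips
}

set.seed(935)   # arbitrary seed, only for reproducibility
draws <- replicate(5, vapply(2:5, function(k) sample.int(k, size = 1), integer(1)))
slips <- unlist(lapply(seq_len(ncol(draws)), function(j) make_slip(j, draws[, j])))

writeLines(slips)   # writeLines(slips, "testslip.txt") would save them instead
```

Because `make_slip()` returns text rather than printing it, the slips can be inspected or tested before anything is written to a file.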