SAS and R: June 2012

Monday, June 25, 2012

Example 9.36: Levene's test for equal variances

The assumption of equal variances among the groups in analysis of variance is an expression of the assumption of homoscedasticity for linear models more generally. For ANOVA, this assumption can be tested via Levene's test. The test is a function of the residuals and means within each group, though various modifications are used, including the Brown-Forsythe test. This uses the medians within group, rather than the mean, and is recommended when normality may be suspect.

We illustrate using the HELP data set available here, modeling depressive symptoms (assessed via CESD) as a function of abused substance.

SAS
In SAS, the tests are available as an option to the means statement in proc glm


data help;
set "C:\book\help.sas7bdat";
run;

proc glm data = help;
class substance;
model cesd = substance;
means substance / hovtest=levene(type=abs) hovtest=bf;
run;

The two requested tests are a version of Levene's test that is produced in R, below, and the aforementioned Brown-Forsythe test. The relevant results are shown below.


           Levene's Test for Homogeneity of CESD Variance
           ANOVA of Absolute Deviations from Group Means

                           Sum of        Mean
  Source           DF     Squares      Square    F Value    Pr > F

  SUBSTANCE         2       272.4       136.2       2.61    0.0747
  Error           450     23480.7     52.1793


     Brown and Forsythe's Test for Homogeneity of CESD Variance
          ANOVA of Absolute Deviations from Group Medians

                           Sum of        Mean
  Source           DF     Squares      Square    F Value    Pr > F

  SUBSTANCE         2       266.0       133.0       2.46    0.0864
  Error           450     24310.9     54.0243

There's some suggestion of a lack of homoscedasticity; it might be wise to consider methods robust to violations of this assumption.

R
In R, the test can be found in the levene.test() function in the lawstat package.


help = read.csv("http://www.math.smith.edu/r/data/help.csv")
library(lawstat)
with(help, levene.test(cesd, as.factor(substance), location="mean"))

 classical Levene's test based on the absolute deviations from the mean 
          ( none not applied because the location is not set to median )

data:  cesd 
Test Statistic = 2.6099, p-value = 0.07465

with(help, levene.test(cesd, as.factor(substance),location="median"))

 modified robust Brown-Forsythe Levene-type test based on the absolute 
           deviations from the median

data:  cesd 
Test Statistic = 2.462, p-value = 0.08641

An unrelated note about aggregators:We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

Monday, June 18, 2012

Example 9.35: Discrete randomization and formatted output

A colleague asked for help with randomly choosing a kid within a family. This is for a trial in which families are recruited at well-child visits, but in each family only one of the children having a well-child visit that day can be in the study. The idea is that after recruiting the family, the research assistant needs to choose one child, but if they make that choice themselves, the children are unlikely to be representative. Instead, we'll allow them to make a random decision through an easily used slip that can be put into sealed envelopes. The envisioned process is that the RA will recruit the family, determine the number of eligible children, then open the envelope to find out which child was randomly selected.

One thought here would be to generate separate stacks of envelopes for each given family size, and have the research assistant open an envelope from the appropriate stack. However, this could be logistically challenging, especially since the RAs will spend weeks away from the home office. Instead, we'll include all plausible family sizes on each slip of paper. It seems unlikely that more than 5 children in a family will have well-child visits on the same day.

SAS
We'll use the SAS example to demonstrate using SAS macros to write SAS code, as well as showing a plausible use for SAS formats (section 1.4.12) and making use of proc print.


/* the following macro will write out equal probabilities for selecting 
each integer between 1 and the argument, in the format needed for the 
rand function.  E.g., if the argument is 3, 
it will write out
1/3,1/3,1/3
*/

%macro tbls(n);
%do i = 1 %to &n;
1/&n %if &i < &n %then ,
%end;
%mend tbls;

/* then we can use the %tbls macro to create the randomization
via rand("TABLE") (section 1.10.4). */ 
data kids;
do family = 1 to 10000;
  nkids = 2; chosen = rand("TABLE",%tbls(2)); output;
  nkids = 3; chosen = rand("TABLE",%tbls(3)); output;
  nkids = 4; chosen = rand("TABLE",%tbls(4)); output;
  nkids = 5; chosen = rand("TABLE",%tbls(5)); output;
end;
run;

/* check randomization */
proc freq data = kids;
table nkids * chosen / nocol nopercent;
run; 

   nkids     chosen

   Frequency|
   Row Pct  |       1|       2|       3|       4|       5|  Total
   ---------+--------+--------+--------+--------+--------+
          2 |  50256 |  49744 |      0 |      0 |      0 | 100000
            |  50.26 |  49.74 |   0.00 |   0.00 |   0.00 |
   ---------+--------+--------+--------+--------+--------+
          3 |  33429 |  33292 |  33279 |      0 |      0 | 100000
            |  33.43 |  33.29 |  33.28 |   0.00 |   0.00 |
   ---------+--------+--------+--------+--------+--------+
          4 |  25039 |  24839 |  25245 |  24877 |      0 | 100000
            |  25.04 |  24.84 |  25.25 |  24.88 |   0.00 |
   ---------+--------+--------+--------+--------+--------+
          5 |  19930 |  20074 |  20188 |  20036 |  19772 | 100000
            |  19.93 |  20.07 |  20.19 |  20.04 |  19.77 |
   ---------+--------+--------+--------+--------+--------+
   Total      128654   127949    78712    44913    19772   400000

Looks pretty good. Now we need to make the output usable to the research assistants, by formatting the results into English. We'll use the same format for each number of kids. This saves some keystrokes now, but may possibly cause the RAs some confusion-- it means that we might refer to the "4th oldest" of 4 children, rather than the "youngest". We could fix this using a different format for each number of children, analogous to the R version below.


proc format;
value chosen
1 = "oldest"
2 = '2nd oldest'
3 = '3rd oldest'
4 = '4th oldest'
5 = '5th oldest';
run;
 
/* now, make a text variable the concatenates (section 1.4.5) the variables 
and some explanatory text */
data k2;
set kids;
if nkids eq 2 then
  t1 = "If there are " || strip(nkids) ||" children then choose the " ||
       strip(put(chosen,chosen.)) || " child.";
else
  t1 = "             " || strip(nkids) ||" ________________________ " ||
       strip(put(chosen,chosen.));
run;

/* then we print.  Notice the options to print in plain text, shorten the 
page length and width, and remove the date and page number from the SAS output, as
well as in the proc print statement to remove the observation number and
show the line number, with a few other tricks */
options nonumber nodate ps = 60 ls = 68;
OPTIONS FORMCHAR="|----|+|---+=|-/\<>*";
proc print data = k2 (obs = 3) noobs label sumlabel;
by family;
var t1;
label t1 = '00'x family = "Envelope";
run;

---------------------------- Envelope=1 ----------------------------



     If there are 2 children then choose the 2nd oldest child.
                  3 ________________________ 3rd oldest
                  4 ________________________ 4th oldest
                  5 ________________________ 5th oldest


---------------------------- Envelope=2 ----------------------------



     If there are 2 children then choose the 2nd oldest child.
                  3 ________________________ oldest
                  4 ________________________ oldest
                  5 ________________________ 3rd oldest


---------------------------- Envelope=3 ----------------------------



     If there are 2 children then choose the 2nd oldest child.
                  3 ________________________ 2nd oldest
                  4 ________________________ 3rd oldest
                  5 ________________________ 2nd oldest

R
For R, we leave some trial code in place, to demonstrate how one might discover, test, and build R code in this setting. Most results have been omitted.


sample(5, size = 1)   
# choose a (discrete uniform) random integer between 1 and 5

apply(matrix(2:5),1,sample,size=1)   
# choose a random integer between 1 and 2, then between 1 and 3, etc., 
# using apply() to repeat the call to sample() with different maximum number
# apply() needs a matrix or array input
# result of this is the raw data needed for one family

replicate(3,apply(matrix(2:5),1,sample,size=1))
# replicate() is in the apply() family and just repeats the 
# function n times

     [,1] [,2] [,3]
[1,]    2    1    2
[2,]    2    1    2
[3,]    2    2    2
[4,]    3    5    4

Now we have the raw data for the envelopes. Before formatting it for printing, let's check it to make sure it works correctly.


test=replicate(100000, apply(matrix(2:5), 1, sample, size=1))
apply(test, 1, summary)
        [,1] [,2]  [,3]  [,4]
Min.     1.0    1 1.000 1.000
1st Qu.  1.0    1 1.000 2.000
Median   1.0    2 2.000 3.000
Mean     1.5    2 2.492 3.003
3rd Qu.  2.0    3 3.000 4.000
Max.     2.0    3 4.000 5.000
# this is not so helpful-- need the count or percent for each number
# this would be the default if the data were factors, but they aren't
# check to see if we can trick summary() into treating these integers
# as if they were factors
methods(summary)
# yes, there's a summary() method for factors-- let's apply it
# there's also apply(test,1,table) which might be better, if you remember it
apply(test, 1, summary.factor)
[[1]]
    1     2 
50025 49975 

[[2]]
    1     2     3 
33329 33366 33305 

[[3]]
    1     2     3     4 
25231 25134 24849 24786 

[[4]]
    1     2     3     4     5 
19836 20068 20065 20022 20009 
# apply(test,1,table) will give similar results, if you remember it

Well, that's not too pretty, but it's clear that the randomization is working. Now it's time to work on formatting the output.


mylist=replicate(5, apply(matrix(2:5), 1, sample, size=1))
# brief example data set

# We'll need to use some formatted values (section 1.14.12), as in SAS. 
# Here, we'll make new value labels for each number of children,
# which will make the output easier to read.  We add in an envelope 
# number and wrap it all into a data frame.
df = data.frame(envelope = 1:5,
   twokids=factor(mylist[1,],1:2,labels=c("youngest","oldest")),
  threekids=factor(mylist[2,],1:3,labels=c("youngest", "middle", "oldest")),
  fourkids=factor(mylist[3,],1:4,labels=c("youngest", "second youngest", 
      "second oldest", "oldest")),
  fivekids=factor(mylist[4,],1:5,labels=c("youngest", "second youngest", 
      "middle", "second oldest", "oldest"))
)

# now we need a function to take a row of the data frame and make a single slip
# the paste() function (section 1.4.5) puts together the fixed and variable 
# content of each row, while the cat() function will print it without quotes
slip = function(kidvec) {
  cat(paste("------------- Envelope", kidvec[1], "------------------"))
  cat(paste("\nIf there are", 2:5, " children, select the", kidvec[2:5],"child"))
  cat("\n \n \n")
}

# test it on one row
slip(df[1,])

# looks good-- now we can apply() it to each row of the data frame
apply(df, 1, slip)

------------- Envelope 1 ------------------
If there are 2  children, select the youngest child 
If there are 3  children, select the youngest child 
If there are 4  children, select the second youngest child 
If there are 5  children, select the youngest child
 
 
------------- Envelope 2 ------------------
If there are 2  children, select the youngest child 
If there are 3  children, select the youngest child 
If there are 4  children, select the second oldest child 
If there are 5  children, select the middle child


------------- Envelope 3 ------------------
If there are 2  children, select the youngest child 
If there are 3  children, select the youngest child 
If there are 4  children, select the youngest child 
If there are 5  children, select the second youngest child

# and so forth

# finally, we can save the result in a file with
# capture.output()
capture.output(apply(df,1,slip), file="testslip.txt")

An unrelated note about aggregators:
We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

Tuesday, June 5, 2012

Example 9.34: Bland-Altman type plot

The Bland-Altman plot is a visual aid for assessing differences between two ways of measuring something. For example, one might compare two scales this way, or two devices for measuring particulate matter.

The plot simply displays the difference between the measures against their average. Rather than a statistical test, it is intended to demonstrate both typical differences between the measures and any patterns such differences may take. The utility of the plot, as compared with linear regression or sample correlation is that the plot is not affected by the range, while the sample correlation will typically increase with the range. In contrast, linear regression shows the strength of the linear association but not how closely the two measures agree. The Bland-Altman plot allows the user to focus on differences between the measures, perhaps focusing on the clinical relevance of these differences.

A peer reviewer recently asked a colleague to consider a Bland-Altman plot for two methods of assessing fatness: the familiar BMI (kg/m^2) and the actual fat mass measured by a sophisticated DXA machine. These are obviously not measures of the same thing, so a Bland-Altman plot is not exactly appropriate. But since the BMI is so simple to calculate and the DXA machine is so expensive, it would be nice if the BMI could be substituted for DXA fat mass.

For this purpose, we'll generate a modified Bland-Altman plot in which each measure is first standardized to have mean 0 and standard deviation 1. The resulting plot should be assessed for pattern as usual, but typical differences must be considered on the standardized scale-- that is, differences of a unit should be considered large, and good agreement might require typical differences of 0.2 or less.

SAS
Since this is a job we might want to repeat, we'll build a SAS macro to do it. This will also demonstrate some useful features. The macro accepts a data set name and the names of two variables as input. We'll comment on interesting features in code comments. If you're an R coder, note that SAS macro variables are merely text, not objects. We have to manually assign "values" (i.e., numbers represented as text strings) to newly created macro variables.


%macro baplot(datain=,x=x,y=y);

/* proc standard  standardizes the variables and saves the results in the
   same variable names in the output data set.  This means we can continue 
   using the input variable names throughout. */
proc standard data = &datain out=ba_std mean=0 std=1;
var &x &y;
run;

/* calculate differences and averages */
data ba;
set ba_std;
bamean = (&x + &y)/2;;
badiff = &y-&x;
run;

ods output summary=basumm;
ods select none;
proc means data = ba mean std;
var badiff;
run;
ods select all;

/* In the following, we take values calculated from a data set for the 
   confidence limits and store them in macro variables.  That's the 
   only way to use them later in code.
   The syntax is: call symput('varname', value);
   Note that 'bias' is purely nominal, as the standardization means that 
   the mean difference is 0. */
data lines;
set basumm;
call symput('bias',badiff_mean);
call symput('hici',badiff_mean+(1.96 * badiff_stddev));
call symput('loci',badiff_mean-(1.96 * badiff_stddev));
run;

/* We use the macro variables just created in the vref= option below;
   vref draws reference line(s) on the vertical axis.  lvref specifies
   a line type. */
symbol1 i = none v = dot h = .5;
title "Bland-Altman type plot of &x and &y";
title2 "&x and &y standardized";
proc gplot data=ba;
plot badiff * bamean / vref = &bias &hici &loci lvref=3;
label badiff = "difference" bamean="mean";
run;
%mend baplot;

Here is a fake sample data set, with the plot resulting from the macro shown above. An analysis would suggest that despite the correlation of 0.59 and p-value for the linear association < .0001, that these two measures don't agree too well.


data fake;
do i = 1 to 50;
/* the "42" in the code below sets the seed for the pseudo-RNG
   for this and later calls.  See section 1.10.9. */
  x = normal(42);
  y = x + normal(0);
  output;
  end;
run;

%baplot(datain=fake, x=x, y=y);

R
Paralleling SAS, we'll write a small function to draw the plot, annotating within to highlight some details. If you're primarily a SAS coder, note the syntax needed to find the name of an object submitted to a function. In contrast, assigning values to new objects created with the function is entirely natural. The resulting plot is shown below.


# set seed, for replicability
set.seed(42)
x = rnorm(50)
y = x + rnorm(50) 

baplot = function(x,y){
  xstd = (x - mean(x))/sd(x)
  ystd = (y - mean(y))/sd(y)
  
  bamean = (xstd+ystd)/2
  badiff = (ystd-xstd)
  
  plot(badiff~bamean, pch=20, xlab="mean", ylab="difference")
# in the following, the deparse(substitute(varname)) is what retrieves the
# name of the argument as data
  title(main=paste("Bland-Altman plot of x and y\n",
    deparse(substitute(x)), "and", deparse(substitute(y)),
    "standardized"), adj=".5")
#construct the reference lines on the fly: no need to save the values in new 
# variable names
  abline(h = c(mean(badiff), mean(badiff)+1.96 * sd(badiff),
    mean(badiff)-1.96 * sd(badiff)), lty=2)
} 

baplot(x,y)

Reviews (from the first edition)

"By placing the R and SAS solutions together and by covering a vast array of tasks in one book, Kleinman and Horton have added surprising value and searchability to the information in their book. … a home run, and it is a book I am grateful to have sitting, dust-free, on my shelf."
—Robert Alan Greevy, Jr, Teaching of Statistics in the Health Sciences

"I use SAS and R on a daily basis. Each has strengths and weaknesses, and using both of them gives the advantage of being able to do almost anything when it comes to data manipulation, analysis, and graphics. If you use both SAS and R on a regular basis, get this book. If you know one of the packages and are learning the other, you may need more than this book, but get this book, too. "

Charles Heckler, University of Rochester, Technometrics

"Excellent cross-referencing to other topics and end-of-chapter worked examples on the ‘Health evaluation and linkage to primary care’ data set are given with each topic. … users who are proficient in either of the software packages but with the need to use the other will find this book useful."
—Frances Denny, Journal of the Royal Statistical Society, Series A

About the authors

Nicholas Horton is a Professor of Statistics at Amherst College. He is a biostatistician with expertise in missing data methods, longitudinal regression, statistical computing and statistical education. Nick's home page; Nick's Google Scholar author page

Ken Kleinman is an Associate Professor with the Department of Biostatistics and Epidemiology at the University of Massachusetts, Amherst. He is a consulting biostatistician with expertise in group-randomized trials and disease surveillance; he also offers R training courses. Ken's home page; Ken's Google Scholar author page.

SAS and R

Catalogs of posts