Tuesday, November 29, 2011

Example 9.16: Small multiples



Small multiples are one of the great ideas of graphics visionary Edward Tufte (e.g., in Envisioning Information). Briefly, the idea is that if many variations on a theme are presented, differences quickly become apparent. Today we offer general guidance on creating figures with small multiples.

As an example, we'll show graphics for the popularity, salary, and unemployment rates for college majors. This data was discussed here where a scatterplot graphic was presented. We draw on data and code presented there as well. The scatterplot does not show the unemployment rate, and the width of the field names and arbitrarily sized text makes it difficult to determine the popularity ranking. In contrast, the small multiples plot, demonstrated above, makes each of these features easy to see. (Click on the picture for a clearer image.)

R

The graphics options in R, particularly par("mfrow") or par("mfcol"), are well-suited to small multiples. The main tip here is to minimize the space reserved for margins and titles. In the example below, we do this with the mar, mgp, and oma settings. We'll begin by setting up the data in a process that relies heavily on the code here. (Note that the zip file referred to includes data already in the R format-- since our point today is not data management, we don't replicate the process used to make this out of the raw data.)

clean = function(x) as.numeric(gsub("\\$|, ", "", x))
clean2 = function(x) as.numeric(gsub("%", "", x))
load("table.rda")
X[,2] = clean2(X[,2])
for (i in 3:5) X[,i] = clean(X[,i])


X$cols = NA
X$cols[grep("BUSI|ACC|FINAN",X[,1])] = 1
X$cols[grep("ENGINEERING",X[,1])] = 2
X$cols[grep("STAT|COMPU",X[,1])] = 3
X$cols[grep("BIOL|CHEMI|PHYSICS|MATHEM",X[,1])] = 4
X$cols[grep("ENGLISH|HISTORY|LANG|FINE|MUSIC|PHILOS|DRAMA|LIBERAL|ARCH|THEO",X[,1])] = 5
X$cols[grep("SOCIO|PSYCH|POLI|ECONO|JOURNAL",X[,1])] = 6
X$cols[grep("COMMUN|MARKET|COMMER|MULTI|MASS|ADVERT",X[,1])] = 7
X$cols[grep("NURS|CRIM|EDU|PHYSICAL|FAMI|SOCIAL|TREAT|HOSP|HUMAN",X[,1])] = 8

This removes some funny characters and groups the fields together in a coherent manner. Then we write a function to set the par() values we need to change, and plot a pie for each row of the data set. Here a for loop is used-- we're not aware of a vectorized version of the pie() function. Colors for each pie are assigned via the colors.

sortx = X[order(X$Popularity),]

smajors = function(mt) {
old.par = par(no.readonly=TRUE)
on.exit(par(old.par))
nrows = sqrt(nrow(mt)) + (ceiling(sqrt(nrow(mt))) != sqrt(nrow(mt)))
par(mfrow=c(nrows,nrows), mar=c(1,0,0,0), oma=c(0,0,0,0), mgp=c(0,0,0))
colors = c("blue", "purple", "purple", "gray", "orange", "gray", "red", "black")
for (i in 1:nrow(mt)) {
pie(c(mt[i,2], 100 - mt[i,2]), labels=NA, radius=mt[i,4]/max(mt[,4]),
col = c("white",colors[mt[i,7]]))
mtext(substr(mt[i,1],1,19), side=1, cex=.8)
}
}

smajors(sortx[1:49,])

The resulting plot is shown above. There may be too many colors, though statistics and computing were assigned the same color as engineering. We can easily read from the plot that computing, statistics, and engineering (purple) are the largest circles, and thus the best paying. Similarly, the humanities (orange) are the worst paying. They are also not terribly popular-- the first orange pie appears in the second row. The "professions" (nursing, teaching, policing, therapy) don't pay well but are fairly popular. Most pies have roughly equal unemployment, though nursing and the professions generally are notable for near full employment. All in all, the rank, salary, unemployment percentage, and field are all clearer than in the scatterplot.

SAS

In SAS, one can use the greplay procedure to reproduce images in miniature. One has to define the size and shape allotted for each replayed plot, in a stored "template." This allows enormous control, but at the cost of some complexity. For example, one can create a scatterplot matrix using proc gplot instead of proc sgplot, as in this implementation by Michael Friendly. If you can generate your multiple images with a by statement, the coding for this is not too painful. However, in this case, it was not obvious how to change the color and radius for each pie using a by statement in proc gchart which includes pie charts, and would thus have been the obvious choice. In such cases, it may be easier to plot the figures directly using an annotate data set.

However, having demonstrated the use of annotate previously (e.g., Example 8.13), we show here an application using greplay, though the coding is a little bit involved. In outline, we use a macro to call for a pie to be made from each observation of the original data set. Then we use a template-making macro found here to generate the 7X7 grid template. Finally, we replay the pies into the grid.

/* read the data-- note that text file edited to remove spaces in
variable names */
proc import datafile =
'c:\sas-r dictionary\p\book\web\blog\majors\majors\table.txt'
out = majors;
getnames = yes;
run;

/* set up the field categories and colors */
data m2;
set majors;
cols=" "; /* make a missing character variable to hold them */
mf = majorfield; /* just a copy to save keystrokes */
/* the find() function is discussed in section 1.4.6 */
if sum(find(mf,"BUSI"), find(mf,"ACC"), find(mf,"FINAN")) ne 0 then cols = "blue";
if find(mf,"ENGINEERING") ne 0 the cols = "purple";
if sum(find(mf,"STAT"), find(mf,"COMPU"), find(mf,"SYSTEMS")) ne 0 then cols = "purple";
if sum(find(mf,"BIOL"), find(mf,"CHEMI"), find(mf,"PHYSICS"), find(mf,"MATH")) ne 0 then cols = "gray";
if sum(find(mf,"ENGL"), find(mf,"HIST"), find(mf,"FRENCH"), find(mf,"FINE"),
find(mf,"MUSIC"), find(mf,"PHIL"), find(mf,"DRAMA"), find(mf,"LIBERAL"),
find(mf,"ARCH"), find(mf,"THEO")) ne 0 then cols = "orange";
if sum(find(mf,"SOCI"), find(mf,"PSYCH"), find(mf,"POLI"), find(mf,"ECON"),
find(mf,"JOURN"), find(mf,"LIBERAL")) ne 0 then cols = "gray";
if sum(find(mf,"COMMU"), find(mf,"MARKET"), find(mf,"COMMER"), find(mf,"MULTI"),
find(mf,"MASS"), find(mf,"ADVERT")) ne 0 then cols = "red";
if sum(find(mf,"NURS"), find(mf,"CRIM"), find(mf,"EDU"), find(mf,"PHYSICAL"),
find(mf,"FAMI"), find(mf,"SOCIAL"), find(mf,"TREAT"),
find(mf,"HOSP"), find(mf,"HUMAN")) ne 0 then cols = "black";
drop MF; /* get rid of that keystroke-saver */
run;

/* order by popularity */
proc sort data = m2; by popularity; run;

The next macro takes a line number as input. A data step then reads that line from the real data set and uses call symput (section A.8.2) to extract as macro variables the median earnings used for the radius, the color, and the major name. It then produces two rows of data-- one with the unemployed percent and the other with the employed percent. We need this for the proc gchart input, as shown in the second half of the macro.

%macro smpie(obs);
data t1;
set m2 (firstobs = &obs obs = &obs);
call symput('rm', medianearnings);
call symput('color', cols);
call symput('name', strip(substr(majorfield,1,19)));
employed = "No"; percent = unemploymentpercent; output;
employed = "Yes"; percent = 100 - unemploymentpercent; output;
run;

pattern1 color= white v=solid;
pattern2 color= &color v=solid; /* only pattern2 should be needed, I think, but */
pattern3 color= &color v=solid; /* sometimes SAS required pattern3, in my trials */
title h=7pct "&name";
proc gchart data = t1 gout = kkpies;
pie percent / group = majorfield sumvar = percent value = none
noheading nogroupheading radius = %sysevalf((&rm * 45)/ 105000)
name = "PIE&obs" ;
run;quit;
%mend smpie;

Of particular note in the forgoing are the gout and name options. The former specifies a location where output plots should be stored. The latter assigns a name to this particular plot. In addition, the %sysevalf macro function is needed to perform mathematical functions on the radius variable.

Next, we write and call another macro to call the first repeatedly. Making the image small to begin with (using the goptions statement [section 5.3]) reduces quality loss when replaying as small multiples.

%macro pies;
goptions hsize=1in vsize=1in;
%do i = 1 %to 49;
%smpie(&i);
%end;
%mend;

%pies;

Finally, we can make the template to replay the images, and replay them, both using proc greplay.

/* key elements of the %makegridtemplate macro:
how many rows and columns (down and across).
name for the template.
Note that the macro is called *inside* the proc greplay statement,
which allows the user access to pro greplay statment options */
proc greplay nofs tc=work.templates;
%makeGridTemplate (across=7, down=7, ordering=LRTB,
hgap=0, vgap=0, gapT=0, gapL=0, gapr=0, name=sevensq,bordercolor=white)
quit;

/* this macro just types out text for us the sequence 1:pie1 2:pie2 ... 49:pie49 */
/* we need that to replay the figures in proc greplay */
%macro pielist;
%do i = 1 %to 49; &i:pie&i %end;;
%mend;

filename pies "c:\temp\pies2.png";
goptions hsize=7in vsize=7in gsfname=pies device=png;

/* now ready to replay the plots
The proc greplay options say what template to use and where to find it,
and where to find the input and place the output */
/* The treplay statement plays the old plots to the locations specified
in the template
proc greplay template=sevensq tc=work.templates nofs gout=kkpies igout=kkpies;
treplay %pielist;
run;
quit;

The net outcome of this is shown below. The image is pretty disappointing-- the circles are not round,and the text is pretty blurry. However, the message is as clear as in the prettier R version.

Tuesday, November 22, 2011

Example 9.15: Bar chart with error bars ("Dynamite plot")



The "dynamite plot", a bar chart plotting the a mean with a error bar, is one of the most reviled types of image among statisticians. Reasons to dislike them are numerous, and are nicely summarized here. (Edward Tufte also suggests they be avoided.)

Nonetheless, as consulting statisticians, we're often required to meet the needs of our collaborators, or of the reviewers who will judge their submissions. If you need to do it, here's how. We demonstrate with the HELP data set.

SAS

The plot can be created easily with proc gchart (section 5.1.3).

proc gchart data = "c:\book\help.sas7bdat";
vbar substance /
group=female
type=mean
sumvar=sexrisk
errorbar=top;
run;

The syntax requests the basic chart variable be substance, in groups defined by female, where the mean of sexrisk is plotted. The errorbar option can be used to make the full CI or just the top lines; the default is a 95% standard error of the mean, but that can also be adjusted. The result is shown below.


R

There are several easily googlable examples of how to do a basic dynamite plot in R, including one in example(barchart). However, a grouped example with a nice legend is a bit harder to find. The bargraph.CI() function in the sciplot package fits the bill.

library(sciplot)
bp = with(ds, bargraph.CI(x.factor=female, group=substance, response=sexrisk,
lc=FALSE, xlab="Female",
legend=TRUE, x.leg=3.3, cex.leg=1.3, cex.names=1.5, cex.lab = 1.5,
ci.fun=function(x) {c(mean(x) - 1.96*se(x), mean(x) + 1.96*se(x))}))

The results are shown at the top. The first row of options describe the variables to be used; the default plot statistic is the mean. The second and third rows suppress the bottom of the confidence interval and customize the location, label, and font size in the legend and the x-axis label. The only tricky bit is the last row, which provides the 95% CI limits to be plotted, as opposed to the default 1 standard error.

Tuesday, November 15, 2011

Example 9.14: confidence intervals for logistic regression models

Recently a student asked about the difference between confint() and confint.default() functions, both available in the MASS library to calculate confidence intervals from logistic regression models. The following example demonstrates that they yield different results.

R

ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
library(MASS)
glmmod = glm(homeless ~ age + female, binomial, data=ds)

> summary(glmmod)
Call:
glm(formula = homeless ~ age + female, family = binomial, data = ds)

Deviance Residuals:
Min 1Q Median 3Q Max
-1.3600 -1.1231 -0.9185 1.2020 1.5466

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.89262 0.45366 -1.968 0.0491 *
age 0.02386 0.01242 1.921 0.0548 .
female -0.49198 0.22822 -2.156 0.0311 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 625.28 on 452 degrees of freedom
Residual deviance: 617.19 on 450 degrees of freedom
AIC: 623.19

Number of Fisher Scoring iterations: 4

> exp(confint(glmmod))
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 0.1669932 0.9920023
age 0.9996431 1.0496390
female 0.3885283 0.9522567
> library(MASS)
> exp(confint.default(glmmod))
2.5 % 97.5 %
(Intercept) 0.1683396 0.9965331
age 0.9995114 1.0493877
female 0.3909104 0.9563045

Why are they different? Which one is correct?

SAS

Fortunately the detailed documentation in SAS can help resolve this. The logistic procedure (section 4.1.1) offers the clodds option to the model statement. Setting this option to both produces two sets of CL, based on the Wald test and on the profile-likelihood approach. (Venzon, D. J. and Moolgavkar, S. H. (1988), “A Method for Computing Profile-Likelihood Based Confidence Intervals,” Applied Statistics, 37, 87–94.)

ods output cloddswald = waldcl cloddspl = plcl;
proc logistic data = "c:\book\help.sas7bdat" plots=none;
class female (param=ref ref='0');
model homeless(event='1') = age female / clodds = both;
run;

Odds Ratio Estimates and Profile-Likelihood Confidence Intervals

Effect Unit Estimate 95% Confidence Limits

AGE 1.0000 1.024 1.000 1.050
FEMALE 1 vs 0 1.0000 0.611 0.389 0.952


Odds Ratio Estimates and Wald Confidence Intervals

Effect Unit Estimate 95% Confidence Limits

AGE 1.0000 1.024 1.000 1.049
FEMALE 1 vs 0 1.0000 0.611 0.391 0.956



Unfortunately, the default precision of the printout isn't quite sufficient to identify whether this distinction aligns with the differences seen in the two R methods. We get around this by using the ODS system to save the output as data sets (section A.7.1). Then we can print the data sets, removing the default rounding formats to find all of the available precision.

title "Wald CL";
proc print data=waldcl; format _all_; run;
title "PL CL";
proc print data=plcl; format _all_; run;

Wald CL
Odds
Obs Effect Unit RatioEst LowerCL UpperCL

1 AGE 1 1.02415 0.99951 1.04939
2 FEMALE 1 vs 0 1 0.61143 0.39092 0.95633


PL CL
Odds
Obs Effect Unit RatioEst LowerCL UpperCL

1 AGE 1 1.02415 0.99964 1.04964
2 FEMALE 1 vs 0 1 0.61143 0.38853 0.95226

With this added precision, we can see that the confint.default() function in the MASS library generates the Wald confidence limits, while the confint() function produces the profile-likelihood limits. This also explains the confint() comment "Waiting for profiling to be done..." Thus neither CI from the MASS library is incorrect, though the profile-likelihood method is thought to be superior, especially for small sample sizes. Little practical difference is seen here.

Tuesday, November 8, 2011

Example 9.13: Negative binomial regression with proc mcmc


In practice, data that derive from counts rarely seem to be fit well by a Poisson model; one more flexible alternative is a negative binomial model. In this SAS-only entry, we discuss how proc mcmc can be used for estimation. An overview of support for Bayesian methods in R can be found in the Bayesian Task View.

SAS

As noted in example 8.30, the SAS rand function lacks the option to input the mean directly, instead using the basic parameters of the probability of success and the number of successes k. (Though note the negative binomial has several formulations, which can cause problems when using multiple software systems.) As developed in that example, we use the the proc fcmp function to instead work with the mean.

proc fcmp outlib=sasuser.funcs.test;
function poismean_nb(mean, size);
return(size/(mean+size));
endsub;
run;

options cmplib=sasuser.funcs;
run;

With that preparation out of the way, we simulate some data--here an intercept of 0 and a slope of 1.

data test;
do i = 1 to 10000;
x = normal(0);
mu = exp(0 + x);
k = 2;
y = rand("NEGBINOMIAL", poismean_nb(mu, k),k);
output;
end;
run;

The proc mcmc code presents a slight difficulty: the k successes before the random number of failures ought to be an integer, and proc mcmc appears to lack an integer-valued distribution. The model will run with continuous values of k, but its behavior is strange. Instead, we put a prior on a new parameter, kstar and take k as the rounded value (section 1.8.4) of kstar; since the values must be > 0, we also add 1 to the rounded value.

proc mcmc data=test nmc=1000 thin=1 seed=10061966;
parms beta0 1 beta1 1 kstar 10;

prior b: ~ normal(0, var = 10000);
prior kstar ~ igamma(.01, scale=0.01);

k=round(kstar+1, 1);
mu = exp(beta0 + beta1 * x);

model y ~ negbin(k, poismean_nb(mu, k));
run;

The way the kstar and k business works is that SAS actually processes the programming statements in each iteration of the chain. Posterior summaries just below, sample diagnostic plot above.

Posterior Summaries

Parameter N Mean Standard Percentiles
Deviation 25% 50% 75%
beta0 10000 0.00712 0.0131 -0.00171 0.00721 0.0156
beta1 10000 0.9818 0.0128 0.9732 0.9814 0.9905
kstar 10000 0.9648 0.2855 0.7112 0.9481 1.1974

Posterior Intervals
Parameter Alpha Equal-Tail Interval HPD Interval
beta0 0.050 -0.0195 0.0321 -0.0182 0.0328
beta1 0.050 0.9569 1.0074 0.9562 1.0063
kstar 0.050 0.5208 1.4709 0.5001 1.4348

If a simple model like the one shown here is all you need, proc genmod's bayes statement can work for you. But the formulation demonstrated above would be useful for a generalized linear mixed model, for example.