Showing posts with label mosaic plot. Show all posts

Tuesday, December 6, 2011

Example 9.17: (much) better pairs plots


Pairs plots (section 5.1.17) are a useful way of displaying the pairwise relations between variables in a dataset. But the default display is unsatisfactory when the variables aren't all continuous. In this entry, we discuss ways to improve these displays that have been proposed by John Emerson, Walton Green, Barret Schloerke, Dianne Cook, Heike Hofmann, and Hadley Wickham in a manuscript under review entitled "The Generalized Pairs Plot".

Implementations of the methods in the paper are available in the gpairs and GGally packages; here we use the latter, which is based on the grammar of graphics and the ggplot2 package. This is an R-only entry: we are unaware of efforts to replicate this approach in SAS.

New users may find it easier to break the process down into steps, rather than to do everything at once, as the R language allows. One way to do that is to make a smaller version of a dataset, with just the analysis variables included. Here we use the HELP data set and choose two categorical variables (gender and housing status) and two continuous ones (the number of drinks per day and a measure of depressive symptoms). Once this new subset is created, the call to ggpairs() is straightforward.

R

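library(GGally)
# read the HELP data and recode the 0/1 indicators as labeled factors
ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
ds$sex = as.factor(ifelse(ds$female==1, "female", "male"))
ds$housing = as.factor(ifelse(ds$homeless==1, "homeless", "housed"))
# keep only the four analysis variables
smallds = subset(ds, select=c("housing", "sex", "i1", "cesd"))
# densities for continuous and bar charts for categorical variables on the diagonal
ggpairs(smallds, diag=list(continuous="density", discrete="bar"), axisLabels="show")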

For users more comfortable with R, the ggpairs function allows you to select variables to include, via its columns option. The following line produces a plot identical to the above, without the subset().

ggpairs(ds, columns=c("housing", "sex", "i1", "cesd"),
    diag=list(continuous="density", discrete="bar"), axisLabels="show")

Various options are available for the diagonal elements of the plot matrix, and the off-diagonals can be controlled with the upper and lower options. The example(ggpairs) command is very helpful for visualizing some of the possibilities.
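As a rough sketch of how the upper and lower options might be used, the code below draws a small pairs plot from R's built-in iris data, which stands in for the HELP data here so that the example runs without a download. The option values "cor", "box", "points", and "facethist" are the names we understand GGally to accept; they may differ between package versions, so treat them as assumptions and check ?ggpairs.

```r
# Assumption: iris (one factor, several continuous variables) stands in for the HELP data
vars = c("Species", "Sepal.Length", "Petal.Length")

# Above the diagonal: correlations for continuous pairs, boxplots for mixed pairs
upper_opts = list(continuous = "cor", combo = "box")
# Below the diagonal: scatterplots for continuous pairs, faceted histograms for mixed pairs
lower_opts = list(continuous = "points", combo = "facethist")

# Only draw the plot if GGally is installed
if (requireNamespace("GGally", quietly = TRUE)) {
  library(GGally)
  p = ggpairs(iris, columns = vars, upper = upper_opts, lower = lower_opts)
  print(p)
}
```

Keeping the option lists in named objects makes it easy to swap a single panel type and redraw.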

Monday, July 18, 2011

Example 9.3: augmented display of contingency table


SAS and R often provide different levels of detail in their output. This is particularly true for the descriptive analysis of contingency tables, where SAS makes it easy to display tables with additional quantities (such as the expected cell count).

The mosaic package has added functionality to calculate these quantities in R. We demonstrate using an example from the HELP dataset.

R

ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
library(mosaic)
ds$gender = ifelse(ds$female==1, "female", "male")
ds$homeless = ifelse(ds$homeless==1, "homeless", "housed")
tab = xtabs(~ gender + homeless, data=ds)
> tab
        homeless
gender   homeless housed
  female       40     67
  male        169    177
> xchisq.test(tab)

Pearson's Chi-squared test with Yates' continuity correction

data: tab
X-squared = 3.8708, df = 1, p-value = 0.04913

40.00 67.00
( 49.37) ( 57.63)
[1.78] [1.52]
<-1.33> < 1.23>

169.00 177.00
(159.63) (186.37)
[0.55] [0.47]
< 0.74> <-0.69>

key:
    observed
    (expected)
    [contribution to X-squared]
    <residual>


We see that there is a borderline statistically significant association between gender and homeless status in the HELP study. There are fewer homeless females than would be expected if the two variables were independent, and more homeless males.
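The quantities in the augmented table can also be checked by hand. The following is a minimal base R sketch, assuming only the four cell counts printed above (retyped here so that the code runs without the data file); the object names are ours.

```r
# Cell counts copied from the table printed above
tab = matrix(c(40, 169, 67, 177), nrow = 2,
             dimnames = list(gender = c("female", "male"),
                             homeless = c("homeless", "housed")))

# Expected count = (row total * column total) / grand total
expected = outer(rowSums(tab), colSums(tab)) / sum(tab)

# Pearson residual, and each cell's contribution to the uncorrected X-squared
resid = (tab - expected) / sqrt(expected)
contrib = resid^2

round(expected, 2)
round(contrib, 2)
round(resid, 2)

# Yates-corrected statistic, which is what is reported for a 2x2 table
yates = sum((abs(tab - expected) - 0.5)^2 / expected)
round(yates, 4)
```

Note that the bracketed contributions are squared residuals without the continuity correction, so they sum to slightly more than the corrected X-squared shown in the test output.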

Another idea is to use graphical depictions of the association in this table. One approach is a mosaic plot (note: no relation to Project MOSAIC and the mosaic package). A mosaic plot starts as a square with area equal to one. It is divided into columns whose widths reflect the proportion of observations at each value of the column variable (in this case, gender). Then each column is split along its height based on the conditional probability of the other variable within that category.
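The geometry just described can be worked out numerically. This is a small base R sketch, again assuming the cell counts from the table above; the names widths, heights, and areas are ours and are not part of mosaicplot().

```r
# Cell counts copied from the table printed above
tab = matrix(c(40, 169, 67, 177), nrow = 2,
             dimnames = list(gender = c("female", "male"),
                             homeless = c("homeless", "housed")))

# Column widths: the marginal proportion in each gender
widths = rowSums(tab) / sum(tab)

# Split within each column: housing status conditional on gender
heights = prop.table(tab, margin = 1)

round(widths, 3)
round(heights, 3)

# Tile area = width * height, which recovers the joint proportion in each cell
areas = heights * widths
round(areas, 3)
```

If gender and housing status were independent, the two rows of heights would be identical and the dividing line would sit at the same level in both columns.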

Another graphical display of a table is the association plot. In an association plot, there is also a box for each cell of the table. The area of the box is proportional to the difference between the observed and expected (assuming no association) frequencies. In a typical presentation, excess observed counts are black and above the line, while deficient counts are red and below the line.
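To see why the box area tracks the difference between observed and expected counts, here is a short base R sketch. It assumes the construction given in the assocplot() help page, in which each box has signed height proportional to the Pearson residual and width proportional to the square root of the expected count; the cell counts are retyped from the table above.

```r
# Cell counts copied from the table printed above
tab = matrix(c(40, 169, 67, 177), nrow = 2,
             dimnames = list(gender = c("female", "male"),
                             homeless = c("homeless", "housed")))

# Expected counts under independence
expected = outer(rowSums(tab), colSums(tab)) / sum(tab)

# Signed box height (Pearson residual) and box width
height = (tab - expected) / sqrt(expected)
width = sqrt(expected)

# Height times width simplifies to observed minus expected
area = height * width
round(area, 2)
```

Negative areas correspond to boxes drawn below the line, as for the female homeless cell.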

Above, we show the mosaic plot (on the left) and association plot (on the right). Both of these displays demonstrate that there is an association. The mosaic plot indicates that only about a quarter of the sample is female (indicated by the width of the columns), and that homelessness is present in about half the subjects (area shaded in light grey). The slight association demonstrated is that there are fewer homeless women than expected (since the horizontal line moves down between the first and second column). Similarly, for the association plot we note that the observed cell count is less than the expected (indicated in red with values below the line) for the female homeless group.

par(mfrow=c(1,2))
mosaicplot(tab, color=TRUE, main="mosaic plot")
assocplot(tab)
title("association plot")


SAS
As in Example 8.32, we find SAS macros for mosaic plots among the contributions of Michael Friendly. In this complex case, they are somewhat more difficult to access than others. The code for the plots themselves can be downloaded here, while it's useful to also run a wrapper macro. After downloading the files, the following code can be used to make the figure below.

title 'Install mosaic modules';
* location of the zipped files;
filename mosaic 'c:\ken\sasmacros\mosaics';
* storage location of compiled macros;
libname mosaic 'c:\ken\sasmacros\mosaics';

* Code to read in, compile and store the macros;
proc iml ;
reset storage=mosaic.mosaic;
%include mosaic(mosaics) ;
store module=_all_;
show storage;
quit;

* Prep: create the table, save the cell counts;
proc freq data = "c:\book\help.sas7bdat";
tables homeless * female / out=outhelp;
run;

* Read in the wrapper macro;
%include "c:\ken\sasmacros\mosaics\mosaic.sas";

* Make the plot;
%mosaic(data=outhelp,var = female homeless,
sort=homeless descending female, space = 1 1);

The sort and space options make the results more similar to those shown for mosaicplot(). In this version, the colors reflect the signs of the residuals.