Showing posts with label circles. Show all posts

Tuesday, September 14, 2010

Example 8.5: bubble plots part 3



An anonymous commenter expressed a desire to see how one might use SAS to draw a bubble plot with bubbles in three colors, corresponding to a fourth variable in the data set. (x, y, z for bubble size, and the category variable.) In previous entries we discussed bubble plots and showed how to make the bubbles print in two colors depending on a fourth dichotomous variable.

The SAS approach shown there cannot be extended to a fourth variable with many values, so we show here a different approach to generating this output. The R version below represents a trivial extension of the code demonstrated earlier.

SAS

We'll start by making some data-- 20 observations in each of 3 categories.

data testbubbles;
  do cat = 1 to 3;
    do i = 1 to 20;
      abscissa = normal(0);
      ordinate = normal(0);
      z = uniform(0);
      output;
    end;
  end;
run;

Our approach will be to make an annotate data set using the annotate macros (section 5.2). The %slice macro easily draws filled circles. Full details on the parameters it needs are in the on-line help: SAS Products; SAS/GRAPH; The Annotate Facility; Annotate Dictionary. Here we note that the 5th parameter is the radius of the circle, chosen here as an arbitrary function of z that makes pleasingly sized circles. Other parameters control color density, arc, and starting angle, and could be used to represent additional variables.

%annomac;
data annobub1;
set testbubbles;
%system(2,2,3);
%slice(abscissa, ordinate, 0, 360, sqrt(3*z), green, ps, 0);
run;

Unfortunately, due to a quirk of the macro facility, I don't think the color can be changed conditionally in the preceding step. Instead, we need a new data step to do this.

data annobub2;
set annobub1;
if cat=2 then color="red";
if cat=3 then color="blue";
run;

Now we're ready to plot. We use the symbol (section 5.2.2) statement to tell proc gplot not to plot the data, add the annotate data set, and suppress the legend, as the default legend will not look correct here. An appropriate legend could be generated with a legend statement.

symbol1 i=none r=3;
proc gplot data=testbubbles;
plot ordinate * abscissa = cat / annotate = annobub2 nolegend;
run;
quit;

The resulting plot is shown above. Improved axes are demonstrated throughout the book and in many previous blog posts.

R

The R approach merely requires passing three colors to the bg option in the symbols() function. To mimic SAS, we'll start by defining some data, then generate the vector of colors needed.

cat = rep(c(1, 2, 3), each=20)
abscissa = rnorm(60)
ordinate = rnorm(60)
z = runif(60)
plotcolor = ifelse(cat==1, "green", ifelse(cat==2, "red", "blue"))

The nested calls to the ifelse function (section 1.11.2) allow vectorized conditional tests with more than two possibilities. Another option would be to use a for loop (section 1.11.1) but this would be avoiding one of the strengths of R. In this example, I suppose I could have defined the cat vector with the color values as well, and saved some keystrokes.
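As a minimal sketch of another vectorized alternative to nesting ifelse() calls, the category codes can be used directly as an index into a vector of colors. The name palette3 is hypothetical, and the approach assumes the categories are coded 1, 2, 3 as in the data above.

```r
# Category codes as in the example above
cat = rep(c(1, 2, 3), each=20)

# Nested ifelse(), as in the post
plotcolor = ifelse(cat==1, "green", ifelse(cat==2, "red", "blue"))

# Alternative: look up each color by position (palette3 is a hypothetical name)
palette3 = c("green", "red", "blue")
plotcolor2 = palette3[cat]

# Both approaches give the same vector of 60 colors
same = identical(plotcolor, plotcolor2)
```

The lookup extends to more categories by adding entries to the color vector, without another level of nesting.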

With the data generated and the color vector prepared, we need only call the symbols() function.

symbols(abscissa, ordinate, circles=z, inches=1/5, bg=plotcolor)

The resulting plot is shown below.
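Neither version above draws a legend. In R, one hedged possibility is the legend() function with filled plotting symbols. The sketch below regenerates the simulated data so that it is self-contained; the seed, the legend position, and the labels are arbitrary assumptions.

```r
set.seed(42)   # arbitrary seed, only so the sketch is reproducible
cat = rep(c(1, 2, 3), each=20)
abscissa = rnorm(60)
ordinate = rnorm(60)
z = runif(60)
catcolors = c("green", "red", "blue")   # one color per category
plotcolor = catcolors[cat]

symbols(abscissa, ordinate, circles=z, inches=1/5, bg=plotcolor)

# pch=21 is a circle whose fill color is set with pt.bg
legend("topright", legend=paste("cat", 1:3), pch=21,
   pt.bg=catcolors, title="Category")
```

The legend identifies the colors only; it does not convey the bubble size scale.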

Saturday, March 27, 2010

Example 7.29: Bubble plots colored by a fourth variable

In Example 7.28, we generated a bubble plot showing the relationship among CESD, age, and number of drinks, for women. An anonymous commenter asked whether it would be possible to color the circles according to gender. In the comments, we showed simple code for this in R and hinted at a SAS solution for two colors. Here we show in detail what the SAS code would look like, and revisit the R code.


SAS

For SAS, we have to make two separate variables-- one with the CESD for the females, and another for the males. For the other gender, these gender-specific variables will have missing values. We'll do this using conditioning (section 1.11.2).


libname k "c:\book";

data twocolors;
set k.help;
if female eq 1 then femalecesd = cesd;
else malecesd = cesd;
run;


Now we can use the bubble2 statement (close kin of the plot2 statement, section 5.1.2) to add both gender-specific variables to the plot. While we're at it, we relabel the left y-axis to no longer be gender specific and specify that the right y-axis is not to be labeled.


proc gplot data = twocolors;
  bubble malecesd*age=i1 / bscale = radius bsize = 200
    bcolor = blue bfill = solid;
  bubble2 femalecesd*age=i1 / bscale = radius bsize = 200
    bcolor = pink bfill = solid noaxis;
  label malecesd="CESD";
run;


As in the previous bubble plot example, the scale is manipulated arbitrarily so that the SAS and R figures are similar.

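We're somewhat fortunate here that the ranges of the two gender-specific CESD scores are similar: the bubble2 statement plots against the right-hand axis, which is scaled separately from the left, so very different ranges would leave the two sets of bubbles without a common vertical scale.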

R

In the comments for Example 7.28, we suggested the following simple R code.


load(url("http://www.math.smith.edu/sasr/datasets/savedfile"))
femalealc = subset(ds, female==1 & substance=="alcohol")
malealc = subset(ds, female==0 & substance=="alcohol")
with(malealc, symbols(age, cesd, circles=i1,
inches=1/5, bg="blue"))
with(femalealc, symbols(age, cesd, circles=i1,
inches=1/5, bg="pink", add=TRUE))


While this does generate a plot, it could be misleading, in that the scale of the circle sizes is relative to the largest value within each symbols() call. While this could be desirable, it's more likely that we'd like a single scale for the circles. R code for this can be written in a single statement:
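To illustrate the scaling issue, here is a sketch of one way to keep two separate symbols() calls on a common scale: shrink the inches value for each group by the ratio of that group's largest value to the overall largest value. It uses simulated data rather than the HELP data set, and the names base_inches, inches_f, and inches_m are hypothetical.

```r
set.seed(2)   # arbitrary seed for a reproducible sketch
# Simulated stand-ins for the HELP variables (assumption, not the real data)
female = rep(c(1, 0), each=20)
age = runif(40, 20, 60)
cesd = runif(40, 0, 60)
i1 = c(runif(20, 0, 10), runif(20, 0, 40))   # second group has larger values

base_inches = 1/5
# Size of each group's largest circle, relative to the overall largest value
inches_f = base_inches * max(i1[female==1]) / max(i1)
inches_m = base_inches * max(i1[female==0]) / max(i1)

symbols(age[female==0], cesd[female==0], circles=i1[female==0],
   inches=inches_m, bg="blue", xlim=range(age), ylim=range(cesd),
   xlab="age", ylab="cesd")
symbols(age[female==1], cesd[female==1], circles=i1[female==1],
   inches=inches_f, bg="pink", add=TRUE)
```

With these adjusted values, a given number of drinks maps to the same circle size in both calls, which is what a single call with a vector of colors achieves more directly.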


load(url("http://www.math.smith.edu/sasr/datasets/savedfile"))
attach(ds)
symbols(age, cesd, circles=i1,inches=1/5,
bg=ifelse(female==1,"pink","blue"))


Here the ifelse() function (section 1.11.2) generates a different circle fill color depending on the value of female.

The resulting plots are shown below.


Monday, March 22, 2010

Example 7.28: Bubble plots

A bubble plot is a means of displaying 3 variables in a scatterplot. The z dimension is presented in the size of the plot symbol, typically a circle. The area or radius of the circle plotted is proportional to the value of the third variable. This can be a very effective data presentation method. For example, consider Andrew Gelman's recent re-presentation of data on health expenditure, survival, and annual number of doctor visits per person. On the other hand, Edward Tufte suggests that such representations are ambiguous, in that it is often unclear whether the area, radius, or height reflects the third variable. In addition, he reports that humans tend not to be good judges of relative area.

However, other means of presenting three dimensions on a flat screen or piece of paper often rely on visual cues regarding perspective, which some find difficult to judge.
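The difference between the radius and area encodings can be sketched in R by plotting the same simulated values twice, once with the radius proportional to z and once with the radius proportional to sqrt(z), so that the area is proportional to z. The data, the seed, and the names area_ratio_radius and area_ratio_area are assumptions for illustration only.

```r
set.seed(1)   # arbitrary seed for a reproducible sketch
x = runif(15)
y = runif(15)
z = runif(15, 1, 10)   # simulated third variable

op = par(mfrow=c(1, 2))
# Radius proportional to z: plotted area grows with z squared
symbols(x, y, circles=z, inches=1/5, bg="blue", main="radius ~ z")
# Radius proportional to sqrt(z): plotted area grows linearly with z
symbols(x, y, circles=sqrt(z), inches=1/5, bg="blue", main="area ~ z")
par(op)

# Ratio of largest to smallest plotted area under each encoding
area_ratio_radius = (max(z) / min(z))^2
area_ratio_area = max(z) / min(z)
```

Under the radius encoding the largest bubble dominates far more than the underlying values would suggest, which is one source of the ambiguity described above.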

Here we demonstrate SAS and R bubble plots using the HELP data set used in our book. We show a plot of depression by age, with bubble size proportional to the average number of drinks per day. To make the plot a little easier to read, we show this only for female alcohol abusers.

SAS

In SAS, we can use the bubble statement in proc gplot. We demonstrate here the use of the where data set option (section 1.5.1) for subsetting, which allows us to avoid using any data steps. SAS allows the circle area or radius to be proportional to the third variable; we choose the radius for compatibility with R. We alter the size of the circles for the same reason. We also demonstrate options for coloring in the filled circles.


libname k "c:\book";

proc gplot data = k.help (where=((female eq 1)
and (substance eq "alcohol")));
bubble cesd*age=i1 / bscale = radius bsize=60
bcolor=blue bfill=solid;
run;



R

In R, we can use the symbols() function for the plot. Here we also demonstrate reading in data previously saved in native R format (section 1.1.1), as well as the subset() function and the with() function (the latter appears in section 1.3.1). The inches option is an arbitrary scale factor. We note that the symbols() function has a great deal of additional capability-- it can substitute squares for circles for plotting the third variable, and add additional dimensions with rectangles or stars. Proportions can be displayed with thermometers, and boxplots can also be displayed.
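As a brief sketch of two of the other symbol types mentioned, the code below draws squares sized by a third variable and thermometers filled to a given proportion. It uses simulated data; the variable names, the seed, and the fixed thermometer width and height are arbitrary assumptions.

```r
set.seed(3)   # arbitrary seed for a reproducible sketch
n = 12
x = runif(n)
y = runif(n)
z = runif(n, 1, 5)   # simulated third variable
p = runif(n)         # simulated proportions between 0 and 1

op = par(mfrow=c(1, 2))
# Squares: the vector gives the side length of each square
symbols(x, y, squares=z, inches=1/5, bg="blue", main="squares")

# Thermometers: columns are width, height, and the proportion filled
therm = cbind(width=0.3, height=1, proportion=p)
symbols(x, y, thermometers=therm, inches=1/4, fg="blue",
   main="thermometers")
par(op)
```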


load(url("http://www.math.smith.edu/sasr/datasets/savedfile"))
femalealc = subset(ds, female==1 & substance=="alcohol")
with(femalealc, symbols(age, cesd, circles=i1,
inches=1/5, bg="blue"))


The results are shown below. It appears that younger women with more depressive symptoms tend to report more drinking.