Showing posts with label legend. Show all posts

Tuesday, September 14, 2010

Example 8.5: bubble plots part 3



An anonymous commenter expressed a desire to see how one might use SAS to draw a bubble plot with bubbles in three colors, corresponding to a fourth variable in the data set (x, y, z for bubble size, and the category variable). In previous entries we discussed bubble plots and showed how to make the bubbles print in two colors depending on a fourth, dichotomous variable.

The SAS approach shown there cannot be extended to fourth variables with many values, so we show here a different approach to generating this output. The R version below represents a trivial extension of the code demonstrated earlier.

SAS

We'll start by making some data-- 20 observations in each of 3 categories.

data testbubbles;
do cat = 1 to 3;
do i = 1 to 20;
abscissa = normal(0);
ordinate = normal(0);
z = uniform(0);
output;
end;
end;
run;

Our approach will be to make an annotate data set using the annotate macros (section 5.2). The %slice macro easily draws filled circles. Check its documentation for full details on the parameters it needs in the on-line help: SAS Products; SAS/GRAPH; The Annotate Facility; Annotate Dictionary. Here we note that the 5th parameter is the radius of the circle, chosen here as an arbitrary function of z that makes pleasingly sized circles. Other parameters reflect color density, arc, and starting angle, which could be used to represent additional variables.

%annomac;
data annobub1;
set testbubbles;
%system(2,2,3);
%slice(abscissa, ordinate, 0, 360, sqrt(3*z), green, ps, 0);
run;

Unfortunately, due to a quirk of the macro facility, I don't think the color can be changed conditionally in the preceding step. Instead, we need a new data step to do this.

data annobub2;
set annobub1;
if cat=2 then color="red";
if cat=3 then color="blue";
run;

Now we're ready to plot. We use the symbol (section 5.2.2) statement to tell proc gplot not to plot the data, add the annotate data set, and suppress the legend, as the default legend will not look correct here. An appropriate legend could be generated with a legend statement.

symbol1 i=none r=3;
proc gplot data=testbubbles;
plot ordinate * abscissa = cat / annotate = annobub2 nolegend;
run;
quit;

The resulting plot is shown above. Improved axes are demonstrated throughout the book and in many previous blog posts.

R

The R approach merely requires passing three colors to the bg option in the symbols() function. To mimic SAS, we'll start by defining some data, then generate the vector of colors needed.

cat = rep(c(1, 2, 3), each=20)
abscissa = rnorm(60)
ordinate = rnorm(60)
z = runif(60)
plotcolor = ifelse(cat==1, "green", ifelse(cat==2, "red", "blue"))

The nested calls to the ifelse function (section 1.11.2) allow vectorized conditional tests with more than two possibilities. Another option would be to use a for loop (section 1.11.1) but this would be avoiding one of the strengths of R. In this example, I suppose I could have defined the cat vector with the color values as well, and saved some keystrokes.
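As a minimal sketch of the keystroke-saving idea, and assuming the categories are coded as the integers 1 to 3 as they are here, the color vector can also be built by indexing a small lookup vector. The names lookup and nested are hypothetical and used only for this comparison.

```r
# Same category vector as in the post
cat <- rep(c(1, 2, 3), each = 20)

# The nested ifelse() approach from the post, kept for comparison
nested <- ifelse(cat == 1, "green", ifelse(cat == 2, "red", "blue"))

# Alternative: index a lookup vector by the category code
# (assumes the codes are exactly 1, 2, 3)
lookup <- c("green", "red", "blue")
plotcolor <- lookup[cat]

# Check that both approaches give the same colors
identical(plotcolor, nested)
```

Indexing scales more gracefully than nesting when the fourth variable has many values, since only the lookup vector grows.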

With the data generated and the color vector prepared, we need only call the symbols() function.

# x (abscissa) comes first, then y (ordinate), matching the SAS plot
symbols(abscissa, ordinate, circles=z, inches=1/5, bg=plotcolor)

The resulting plot is shown below.
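The SAS section notes that an appropriate legend could be generated. In R, one way to add a legend to the bubble plot might look like the sketch below. The seed, the legend position, the labels, and the temporary PDF device are all assumptions made only so the example runs on its own.

```r
set.seed(42)  # hypothetical seed, only to make the sketch reproducible
cat <- rep(c(1, 2, 3), each = 20)
abscissa <- rnorm(60)
ordinate <- rnorm(60)
z <- runif(60)
plotcolor <- ifelse(cat == 1, "green", ifelse(cat == 2, "red", "blue"))

pdf(file = tempfile(fileext = ".pdf"))  # throwaway device for the sketch
symbols(abscissa, ordinate, circles = z, inches = 1/5, bg = plotcolor)
# pch = 21 is a filled circle; pt.bg supplies the fill color per category
lg <- legend("topright", legend = paste("cat", 1:3), pch = 21,
             pt.bg = c("green", "red", "blue"), pt.cex = 1.5,
             title = "Category")
dev.off()
```

The legend describes color only; bubble size (z) is not represented in it.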

Monday, April 19, 2010

Example 7.33: Specifying fonts in graphics

For interactive data analysis, the default fonts used by SAS and R are acceptable, if not beautiful. However, for publication, it may be important to manipulate the fonts. For example, it would be desirable for the fonts in legends, axis labels, or other text printed in plots to approximate the typeface used in the rest of the text. Credit where it's due department: this blog entry is inspired by this blog post by Yihui Xie.

As an example plot, we'll revisit Figure 2.2 from the book, in which we plot MCS by CESD, with plot symbol showing substance abused.

SAS

In SAS, we'll focus on the "traditional" graphics most likely to be used for scientific publication. ODS graphics and the new SG procedures are currently difficult to customize.

There are several fonts supplied with the standard installation. However, any TrueType or Adobe Type 1 font on a local disk can be used. Making the fonts available requires a one-time use of proc fontreg.


proc fontreg mode=all;
fontpath "c:/windows/fonts";
run;


In the above, the fontpath is the location in the operating system of the .ttf (TrueType) and/or .pfa/.pfb (Type 1) files.

Any font in that directory can then be used in SAS graphics, regardless of the format of the image file. Note, however, that running your code on a different computer may alter the appearance of the graphic, since the fonts may not be available there. If you anticipate needing to share the code, you can stick with the fonts SAS supplies, which are described in the online documentation: SAS Products; SAS/GRAPH; Concepts; Fonts.

The simplest way to make all text default to a desired font is to use the goptions statement (sections under 5.3).

 
goptions ftext="Comic Sans MS";


In Windows, the name of the font to put inside the quotation marks is displayed in the Explorer. If bold, italic, or bold italic is available, it can be requested by appending /bo, /it, or /bo/it to the font name, as demonstrated below.

Many statements in SAS/GRAPH accept a font= option, and these can be used to override the default font specified in the goptions statement.

In the example below, we use several different fonts to demonstrate how different statements specify fonts. In fact, the only plot elements in the assigned default Rockwell typeface are the y axis label and numbers and the labels of the symbols in the legend.


filename myurl
url 'http://www.math.smith.edu/sasr/datasets/help.csv'
lrecl=704;

proc import datafile=myurl out=ds dbms=dlm;
delimiter=',';
getnames=yes;
run;

goptions ftext = "Rockwell";

title font="Lucida Handwriting" "MCS by CESD with fonts";

legend1
mode=reserve position=(bottom center outside) across=3
label = (font= "Elephant" h=2 "Substance");

axis1 label=(font="Goudy Old Style" h=2 "The CESD axis")
value = ( h = 2 font= "Comic Sans MS") minor=none;

symbol1 font="Comic Sans MS" v='A' h=.7 c=black;
symbol2 font="Agency FB/bo" v='C' h=.7 c=black;
symbol3 font="Franklin Gothic Book/bo/it" v='H' h=.7;
proc gplot data=ds;
where female=1;
plot mcs*cesd=substance / legend=legend1 haxis=axis1;
run; quit;


The results, shown below, demonstrate the comic effects of reserving too much of the plot space for labels.


R

In R, the available fonts and the ways to use them vary by device. TrueType fonts can be displayed easily for the windows() device, and can similarly be used in publication graphics through the win.metafile() device (section 5.4.5).


windowsFonts(CS = windowsFont("Comic Sans MS"))
windows()
par(family="CS")


The windowsFonts() function would be called for each font to be included.

Unfortunately, the pdf() and postscript() devices most likely to be useful for publication using LaTeX do not appear to be able to read TrueType fonts, only Adobe Type 1 fonts. Some fonts can be purchased or downloaded for free in this format. If any reader has had success in using external fonts for these devices, I hope they'll provide code or links in the comments. There are some resources for converting TrueType to Type 1 freely available for *nix operating systems. However, for technical reasons, these conversions don't usually offer satisfying results.

Fortunately, R comes with several fonts for these devices. Their names can be easily displayed:


> names(pdfFonts())
[1] "serif" "sans"
[3] "mono" "AvantGarde"
[5] "Bookman" "Courier"
[7] "Helvetica" "Helvetica-Narrow"
[9] "NewCenturySchoolbook" "Palatino"
[11] "Times" "URWGothic"
[13] "URWBookman" "NimbusMon"
[15] "NimbusSan" "URWHelvetica"
[17] "NimbusSanCond" "CenturySch"
[19] "URWPalladio" "NimbusRom"
[21] "URWTimes" "Japan1"
[23] "Japan1HeiMin" "Japan1GothicBBB"
[25] "Japan1Ryumin" "Korea1"
[27] "Korea1deb" "CNS1"
[29] "GB1"


As with SAS, R offers both a way to change the default font (with the par() function) and fine control of individual elements in specific function calls. R stores fonts in families, with names as listed above, and selects the face within a family through the (confusingly named) font parameter, an integer where 1 corresponds to plain text (the default), 2 to bold face, 3 to italic, 4 to bold italic, and 5 to the symbol font. Not all faces are necessarily available for all font families.
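To make the family and font distinction concrete, here is a small sketch that writes one line of text in each of faces 1 through 4 of a single family, then overrides the family for one call. The output file is a hypothetical temporary file, and the coordinates are arbitrary.

```r
outfile <- tempfile(fileext = ".pdf")  # hypothetical throwaway file
pdf(file = outfile)
par(family = "Palatino")               # default family for everything below

# Empty canvas with no axes or labels
plot(0:5, 0:5, type = "n", xlab = "", ylab = "", axes = FALSE)

# One line of text per face: the font integer picks the face
faces <- c("plain", "bold", "italic", "bold italic")
for (i in seq_along(faces)) {
  text(2.5, i, paste0("font = ", i, " (", faces[i], ")"), font = i)
}

# A single call can override the default family as well as the face
text(2.5, 4.7, "Courier, bold", family = "Courier", font = 2)
dev.off()
```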



pdf(file="c:/temp/test1.pdf")
par(family="Palatino")
plot(cesd[female==1], mcs[female==1], type="n", bty="n", ylab="MCS",
xlab = '')
text(cesd[female==1&substance=="alcohol"],
mcs[female==1&substance=="alcohol"],"A", family="AvantGarde", font=2)
text(cesd[female==1&substance=="cocaine"],
mcs[female==1&substance=="cocaine"],"C", family="serif")
text(cesd[female==1&substance=="heroin"],
mcs[female==1&substance=="heroin"],"H", family="Courier", font=4)
title(xlab="This is the CESD axis", family="NewCenturySchoolbook", cex.lab=2)
title(family="Helvetica", font.main=3, cex.main=3, "MCS by CESD with fonts")
dev.off()




Similar to the SAS example, the only characters in the default Palatino font are the y axis label and the numerals.

Replicating a SAS legend appearing below the plot would be more difficult in R, as would replicating the default SAS legend that shows different plotted font characters.
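One possible route to a legend below the plot, offered only as a sketch, is to enlarge the bottom margin and let legend() draw outside the plot region with xpd=TRUE and a negative inset. The data here are simulated stand-ins (not the HELP data set), and the margin and inset values are guesses that would likely need tuning for a real figure.

```r
set.seed(1)  # hypothetical seed and simulated stand-in data
n <- 30
cesd <- runif(n, 0, 60)
mcs <- 60 - 0.6 * cesd + rnorm(n, sd = 8)
substance <- sample(c("alcohol", "cocaine", "heroin"), n, replace = TRUE)

pdf(file = tempfile(fileext = ".pdf"))       # throwaway device
par(family = "Palatino",
    mar = c(8, 4, 4, 2) + 0.1)               # extra room at the bottom
plot(cesd, mcs, type = "n", xlab = "CESD", ylab = "MCS")
# Plot the first letter of each substance as the symbol
text(cesd, mcs, substr(toupper(substance), 1, 1))
usr <- par("usr")                             # plot-region limits, for reference
# A negative vertical inset pushes the legend below the plot region;
# xpd = TRUE allows drawing in the margin
lg <- legend("bottom", inset = c(0, -0.3), xpd = TRUE, horiz = TRUE,
             bty = "n", title = "Substance",
             legend = c("A = alcohol", "C = cocaine", "H = heroin"))
dev.off()
```

Because the plotted symbols are letters, the legend text carries the key itself rather than reproducing the plotted characters as SAS does.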

Tuesday, October 13, 2009

Example 7.15: A more complex sales graphic

The plot of Amazon sales rank over time generated in example 7.14 leaves questions. From a software perspective, we'd like to make the plot prettier; we can also embellish the plot to inform our interpretation of how the rank is calculated.

For the latter purpose, we'll create an indicator of whether the rank was recorded in nighttime (eastern US time) or not. Then we'll color the nighttime ranks differently than the daytime ranks.


SAS
In SAS, we use the timepart function to extract the time of day from the salestime variable which holds the date and time. This is a value measured in seconds since midnight, and we use some conditional logic (section 1.4.11) to identify hours before 8 AM or after 6 PM.


data sales2;
set sales;
if timepart(salestime) lt (8 * 60 * 60) or
timepart(salestime) gt (18 * 60 * 60) then night=1;
else night = 0;
run;



Then we can make the plot. We use the axis statement (sections 5.3.7 and 5.3.8) to specify the axis ranges, rotate some labels and headers, and remove the minor tick marks. Note that since the x-axis is a date-time variable, we have to specify the axis range using date-time data, here read in using formats (section A.6.4) and request labels every three days by requesting an interval of three days' worth of seconds. We also use the symbol statement (section 5.2.2, 5.3.11) to specify shapes and colors for the plotted points.


axis1 order = ("09AUG2009/12:00:00"dt to
"27AUG2009/12:00:00"dt by 259200)
minor = none;
axis2 order=(30000 to 290000 by 130000) label=(angle=90)
value=(angle=90) minor=none;
symbol1 i=none v=dot c=red h=.3;
symbol2 i=none v=dot c=black h=.3;


Finally, we request the plot using proc gplot. The a*b=c syntax (as in section 5.6.2) will result in different symbols for each value of the new night variable, and the symbols we just defined will be used. The haxis and vaxis options are used to associate each axis with the axis definitions specified in the axis statements.


proc gplot data=sales2;
plot rank*salestime=night / haxis=axis1 vaxis=axis2;
format salestime dtdate5.;
run; quit;


R
In R, we make a new variable reflecting the date-time at the midnight before we started collecting data. We then coerce the time values to numeric values using the as.numeric() function (section 1.4.2), while subtracting that midnight value. Next, we mod by 24 (using the %% operator, section B.4.3) and lastly round to the integer value (section 1.8.4) to get the hour of measurement. There's probably a more elegant way of doing this in R, but this works.


midnight <- as.POSIXlt("2009-08-09 00:00:00 EDT")
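# difftime() does the subtraction; fixing units to hours keeps %% 24 meaningful
timeofday <- round(as.numeric(difftime(timeval, midnight, units="hours"))%%24,0)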


Next, we prepare for making a nighttime indicator by initializing a vector with 0. Then we assign a value of 1 when the corresponding element of the hour of measurement vector has a value in the correct range.


night <- rep(0,length(timeofday))
night[timeofday < 8 | timeofday > 18] <- 1
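A possibly tidier alternative, sketched here under the assumption that timeval is a POSIXct vector, is to read the clock components from a POSIXlt version of the times and compute seconds since midnight, mirroring the SAS timepart() logic. The example times and the time zone name below are hypothetical. Unlike the rounding approach above, this compares exact clock times, so the two can disagree for observations close to 8 AM or 6 PM.

```r
# Hypothetical example times; in the post, timeval holds the recorded date-times
timeval <- as.POSIXct(c("2009-08-10 03:15:00", "2009-08-10 12:00:00",
                        "2009-08-10 18:30:00", "2009-08-11 07:59:59"),
                      tz = "America/New_York")

# POSIXlt exposes the clock components directly
lt <- as.POSIXlt(timeval)
secs <- lt$hour * 3600 + lt$min * 60 + lt$sec  # seconds since midnight

# Night: before 8 AM or after 6 PM, as in the SAS data step
night <- as.integer(secs < 8 * 3600 | secs > 18 * 3600)
night
```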


Finally, we're ready to make the plot. We begin by setting up the axes, using the type="n" option (section 5.1.1) to prevent any data being plotted. Next, we plot the nighttime ranks by conditioning the plot vector on the value of the nighttime indicator vector; we then repeat for the daytime values, additionally specifying a color for these points. Lastly, we add a legend to the plot (section 5.2.14).


plot(timeval, rank, type="n")
points(timeval[night==1], rank[night==1], pch=20)
points(timeval[night==0], rank[night==0], pch=20,
col="red")
legend(as.POSIXlt("2009-08-22 00:00:00 EDT"), 250000,
legend=c("day", "night"), col=c("red", "black"),
pch=c(20, 20))




Interpretation: It appears that Amazon's ranking function adjusts for the pre-dawn hours, most likely to reflect a general lack of activity. (Note that these ranks are from Amazon.com. In all likelihood, Amazon.co.uk and other local Amazon sites adjust for local time similarly.) Perhaps some recency in sales allows a decline in rank for some books during these hours? In addition, we see that most sales of this book, as inferred from the discontinuous drops (improvement) in rank, tend to happen near the beginning of the day, or mid-day, rather than at night.