Tuesday, October 13, 2009

Example 7.15: A more complex sales graphic

The plot of Amazon sales rank over time generated in example 7.14 leaves some questions unanswered. From a software perspective, we'd like to make the plot prettier; we can also embellish it to inform our interpretation of how the rank is calculated.

For the latter purpose, we'll create an indicator of whether the rank was recorded at night (Eastern US time) or not. Then we'll color the nighttime ranks differently from the daytime ranks.


SAS
In SAS, we use the timepart function to extract the time of day from the salestime variable, which holds the date and time. The result is measured in seconds since midnight, and we use some conditional logic (section 1.4.11) to identify times before 8 AM or after 6 PM.


data sales2;
  set sales;
  if timepart(salestime) lt (8 * 60 * 60) or
     timepart(salestime) gt (18 * 60 * 60) then night = 1;
  else night = 0;
run;



Then we can make the plot. We use the axis statement (sections 5.3.7 and 5.3.8) to specify the axis ranges, rotate some labels and headers, and remove the minor tick marks. Since the x-axis is a date-time variable, we have to specify the axis range using date-time data, here read in using formats (section A.6.4). We request labels every three days by specifying an interval of three days' worth of seconds (3 * 24 * 60 * 60 = 259200). We also use the symbol statement (sections 5.2.2 and 5.3.11) to specify shapes and colors for the plotted points.


axis1 order = ("09AUG2009/12:00:00"dt to
               "27AUG2009/12:00:00"dt by 259200)
      minor = none;
axis2 order = (30000 to 290000 by 130000) label=(angle=90)
      value=(angle=90) minor=none;
symbol1 i=none v=dot c=red h=.3;
symbol2 i=none v=dot c=black h=.3;


Finally, we request the plot using proc gplot. The a*b=c syntax (as in section 5.6.2) will result in different symbols for each value of the new night variable, and the symbols we just defined will be used. The haxis and vaxis options are used to associate each axis with the axis definitions specified in the axis statements.


proc gplot data=sales2;
  plot rank*salestime=night / haxis=axis1 vaxis=axis2;
  format salestime dtdate5.;
run; quit;


R
In R, we make a new variable holding the date-time of the midnight before we started collecting data. We then subtract that midnight value from the time values and coerce the difference to numeric values using the as.numeric() function (section 1.4.2). Since the difference is measured in hours, we next take it modulo 24 (using the %% operator, section B.4.3) and lastly round to the nearest integer (section 1.8.4) to get the hour of measurement. There's probably a more elegant way of doing this in R, but this works.


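midnight <- as.POSIXlt("2009-08-09 00:00:00 EDT")
# request hours explicitly: the default units of a time difference vary
hours <- as.numeric(difftime(timeval, midnight, units="hours"))
timeofday <- round(hours %% 24, 0)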
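
As one possibly more direct route to the hour of measurement, the sketch below reads the hour component from a POSIXlt object. The timestamps and the name hour_of_day are hypothetical stand-ins, not the data used in this example. Note that this gives the clock hour (truncated), not the rounded value computed above, so the two approaches can differ by one for times in the second half of an hour.

```r
# hypothetical timestamps standing in for the timeval vector
timeval <- as.POSIXct(c("2009-08-09 07:30:00", "2009-08-10 19:15:00"),
                      tz="America/New_York")

# the hour component of a POSIXlt object is the clock hour, 0 through 23
hour_of_day <- as.POSIXlt(timeval)$hour
print(hour_of_day)
```

Because the hour is read in the time zone attached to the data, this avoids the need for a reference midnight.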


Next, we prepare to make a nighttime indicator by initializing a vector of zeros. Then we assign a value of 1 wherever the corresponding element of the hour of measurement vector falls in the nighttime range.


night <- rep(0,length(timeofday))
night[timeofday < 8 | timeofday > 18] <- 1
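
The same indicator can also be built in a single step by coercing the logical comparison to numeric. This is a minimal sketch using a short, made-up vector of hours in place of the real timeofday values.

```r
# hypothetical hours of measurement, for illustration only
timeofday <- c(3, 8, 12, 18, 22)

# TRUE/FALSE becomes 1/0: night is before 8 AM or after 6 PM
night <- as.numeric(timeofday < 8 | timeofday > 18)
print(night)
```

Hours of exactly 8 and 18 count as daytime here, matching the strict inequalities used above.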


Finally, we're ready to make the plot. We begin by setting up the axes, using the type="n" option (section 5.1.1) to prevent any data being plotted. Next, we plot the nighttime ranks by subsetting the plot vectors on the value of the nighttime indicator vector; we then repeat for the daytime values, additionally specifying a color for these points. Lastly, we add a legend to the plot (section 5.2.14).


plot(timeval, rank, type="n")
points(timeval[night==1], rank[night==1], pch=20)
points(timeval[night==0], rank[night==0], pch=20,
  col="red")
# daytime points are red and nighttime points are black
legend(as.POSIXlt("2009-08-22 00:00:00 EDT"), 250000,
  legend=c("day", "night"), col=c("red", "black"),
  pch=c(20, 20))
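
An alternative that may be less error-prone is to build a vector of colors from the indicator and pass it to a single plot() call. The sketch below uses invented hourly data in place of the real timeval and rank vectors, and places the legend by keyword rather than by coordinates; all values here are assumptions for illustration.

```r
# hypothetical data: 48 hourly readings starting at noon
timeval <- as.POSIXct("2009-08-09 12:00:00", tz="UTC") + 3600 * (0:47)
rank <- seq(60000, 250000, length.out=48)

# nighttime indicator from the clock hour
hour <- as.POSIXlt(timeval)$hour
night <- as.numeric(hour < 8 | hour > 18)

# one color per point: black at night, red during the day
cols <- ifelse(night == 1, "black", "red")
plot(timeval, rank, pch=20, col=cols)
legend("topright", legend=c("day", "night"),
       col=c("red", "black"), pch=20)
```

Keeping the color rule in one ifelse() call makes it easier to check that the legend labels agree with the plotted colors.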




Interpretation: It appears that Amazon's ranking function adjusts for the pre-dawn hours, most likely to reflect a general lack of activity. (Note that these ranks are from Amazon.com. In all likelihood, Amazon.co.uk and other local Amazon sites adjust for local time similarly.) Perhaps some recency in sales allows a decline in rank for some books during these hours? In addition, we see that most sales of this book, as inferred from the discontinuous drops (improvement) in rank, tend to happen near the beginning of the day, or mid-day, rather than at night.

4 comments:

Anonymous said...

Your SAS code implies that red dots are the "day" values and black dots are the "night" ones. But that does not seem to be reflected in the graph itself (SAS and R graphs should have colors reversed according to your codes, but that is not the case). ???

Anonymous said...

Also, here's how you can have the legend in SAS to be as pretty as the one in R (and the graphs to look almost identical).

data sales2;
set sales;
if timepart(salestime) lt (8 * 60 * 60) or
timepart(salestime) gt (18 * 60 * 60)
then t=1;
else t = 0;
if t = 0 then day = rank;
else if t = 1 then night = rank;
run;
axis1 order = ("09AUG2009/12:00:00"dt to
"27AUG2009/12:00:00"dt by 259200)
label=none minor=none;
axis2 order=(50000 to 250000 by 50000) label=(angle=90 "rank") value=(angle=90) minor=none;
symbol1 i=none v=dot c=black h=.3;
symbol2 i=none v=dot c=red h=.3;
legend1 across=1 label=none mode=protect position=(inside top right) frame;
proc gplot data=sales2;
plot (day night)*salestime /haxis=axis1 vaxis=axis2 legend=legend1 overlay;
format salestime dtdate5.;
run; quit;

Anonymous said...

On second thought, though, your code is better - just needs small tweaking:

data sales2;
set sales;
if timepart(salestime) lt (8 * 60 * 60) or
timepart(salestime) gt (18 * 60 * 60)
then t="night";
else t="day ";
run;

axis1 order = ("09AUG2009/12:00:00"dt to
"27AUG2009/12:00:00"dt by 259200)
label=none minor=none;
axis2 order=(50000 to 250000 by 50000) label=(angle=90 "rank") value=(angle=90) minor=none;
symbol1 i=none v=dot c=black h=.3;
symbol2 i=none v=dot c=red h=.3;
legend1 across=1 label=none mode=protect position=(inside top right) frame;

proc gplot data=sales2;
plot rank*salestime=t /haxis=axis1 vaxis=axis2 legend=legend1;
format salestime dtdate5.;
run; quit;

Ken Kleinman said...

Thanks for being interested, Constantine. We show examples using the legend statement in several places in the book, and I wanted to demonstrate the default legend here. Try the "offset" options in the legend statement to position the legend exactly where you like. It's not quite as easy as positioning a legend in R, but you can get the result to be the same.

I think the problem with the plots is just the labeling of the R legend, no? The need to make them manually runs this kind of risk, but does give you somewhat more direct control of the legend than you get in SAS.