SAS and R: Amazon Sales rank


Tuesday, October 13, 2009

Example 7.15: A more complex sales graphic

The plot of Amazon sales rank over time generated in example 7.14 leaves some questions unanswered. From a software perspective, we'd like to make the plot prettier; we can also embellish it to inform our interpretation of how the rank is calculated.

For the latter purpose, we'll create an indicator of whether the rank was recorded at night (eastern US time) or not. Then we'll color the nighttime ranks differently than the daytime ranks.

SAS
In SAS, we use the timepart function to extract the time of day from the salestime variable, which holds the date and time. The result is measured in seconds since midnight, so we use some conditional logic (section 1.4.11) to identify times before 8 AM or after 6 PM.


data sales2;
set sales;
  if timepart(salestime) lt (8 * 60 * 60) or
     timepart(salestime) gt (18 * 60 * 60) then night=1;
    else night = 0;
run;

Then we can make the plot. We use the axis statement (sections 5.3.7 and 5.3.8) to specify the axis ranges, rotate some labels and headers, and remove the minor tick marks. Note that since the x-axis is a date-time variable, we have to specify the axis range using date-time data, here read in using formats (section A.6.4), and we request labels every three days by specifying an interval of three days' worth of seconds (3 * 24 * 60 * 60 = 259200). We also use the symbol statement (sections 5.2.2 and 5.3.11) to specify shapes and colors for the plotted points.


axis1 order = ("09AUG2009/12:00:00"dt to 
   "27AUG2009/12:00:00"dt by 259200) 
   minor = none;
axis2 order=(30000 to 290000 by 130000) label=(angle=90) 
   value=(angle=90) minor=none;
symbol1 i=none v=dot c=red h=.3;
symbol2 i=none v=dot c=black h=.3;

Finally, we request the plot using proc gplot. The a*b=c syntax (as in section 5.6.2) will result in different symbols for each value of the new night variable, and the symbols we just defined will be used. The haxis and vaxis options are used to associate each axis with the axis definitions specified in the axis statements.


proc gplot data=sales2;
   plot rank*salestime=night / haxis=axis1 vaxis=axis2;
   format salestime dtdate5.;
run; quit;

R
In R, we make a new variable reflecting the date-time at the midnight before we started collecting data. We then coerce the time values to numeric values using the as.numeric() function (section 1.4.2), while subtracting that midnight value. Next, we mod by 24 (using the %% operator, section B.4.3) and lastly round to the integer value (section 1.8.4) to get the hour of measurement. There's probably a more elegant way of doing this in R, but this works.


midnight <- as.POSIXlt("2009-08-09 00:00:00 EDT")
# request hours explicitly so the units do not depend on the data
timeofday <- round(as.numeric(difftime(timeval, midnight,
   units="hours")) %% 24, 0)
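
One possibly tidier alternative to the calculation above, offered as a sketch rather than a drop-in replacement: converting a date-time to POSIXlt exposes the clock hour directly through its hour component. The timeval values below are made-up stand-ins for the real data, and the US Eastern time zone is an assumption. Note that this truncates to the clock hour instead of rounding to the nearest hour, so a reading at 7:45 is assigned hour 7 here.

```r
# Hypothetical stand-in for the timeval vector (assumed US Eastern time)
timeval <- as.POSIXct(c("2009-08-09 07:15:00", "2009-08-09 13:40:00",
                        "2009-08-10 22:05:00"), tz = "America/New_York")

# The hour component of a POSIXlt object is the clock hour (0-23)
hourofday <- as.POSIXlt(timeval)$hour
hourofday
```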

Next, we prepare to make a nighttime indicator by initializing a vector of zeros. Then we assign a value of 1 when the corresponding element of the hour of measurement vector has a value in the correct range.


night <- rep(0,length(timeofday))
night[timeofday < 8 | timeofday > 18] <- 1
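
The same indicator can also be built in a single step, since a logical comparison in R coerces to 0/1. A minimal sketch, using made-up hours in place of the real timeofday vector:

```r
# Hypothetical hours of measurement (0-23)
timeofday <- c(2, 7, 8, 12, 18, 19, 23)

# TRUE/FALSE becomes 1/0 under as.numeric()
night <- as.numeric(timeofday < 8 | timeofday > 18)
night
```

With integer hours, the boundary values 8 and 18 both count as daytime under this rule.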

Finally, we're ready to make the plot. We begin by setting up the axes, using the type="n" option (section 5.1.1) to prevent any data being plotted. Next, we plot the nighttime ranks by conditioning the plot vector on the value of the nighttime indicator vector; we then repeat for the daytime values, additionally specifying a color for these points. Lastly, we add a legend to the plot (section 5.2.14).


plot(timeval, rank, type="n")
points(timeval[night==1], rank[night==1], pch=20)
points(timeval[night==0], rank[night==0], pch=20, 
   col="red")
legend(as.POSIXlt("2009-08-22 00:00:00 EDT"), 250000,
   legend=c("day", "night"), col=c("red", "black"), 
   pch=c(20, 20))
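
As a sketch of an alternative, the two points() calls can be collapsed into a single plot() call by passing a vector of colors, and the three-day tick spacing used in the SAS axis statement can be approximated with axis.POSIXct(). The data here are simulated purely for illustration (they are not the Amazon ranks), and the ptcol and ticks names are our own.

```r
# Simulated stand-in data (not the real sales ranks)
set.seed(1)
timeval <- seq(as.POSIXct("2009-08-09 12:00:00", tz = "America/New_York"),
               by = "hour", length.out = 24 * 18)
rank <- round(runif(length(timeval), 30000, 290000))
hr <- as.POSIXlt(timeval)$hour
night <- as.numeric(hr < 8 | hr > 18)

# One color per point: black for night, red for day
ptcol <- ifelse(night == 1, "black", "red")

# Suppress the default x-axis, then add ticks every three days
plot(timeval, rank, pch = 20, col = ptcol, xaxt = "n")
ticks <- seq(min(timeval), max(timeval), by = "3 days")
axis.POSIXct(1, at = ticks, format = "%d%b")
legend("topright", legend = c("day", "night"),
       col = c("red", "black"), pch = 20)
```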

Interpretation: It appears that Amazon's ranking function adjusts for the pre-dawn hours, most likely to reflect a general lack of activity. (Note that these ranks are from Amazon.com. In all likelihood, Amazon.co.uk and other local Amazon sites adjust for local time similarly.) Perhaps some recency in sales allows a decline in rank for some books during these hours? In addition, we see that most sales of this book, as inferred from the discontinuous drops (improvement) in rank, tend to happen near the beginning of the day, or mid-day, rather than at night.

Tuesday, September 29, 2009

Example 7.14: A simple graphic of sales

In this example, we show a simple plot of the sales rank data read in as shown in example 7.13.

SAS
In SAS, we use the symbol statement (section 5.3) to request small (with the h option) dots (with the v option), and that the dots not be connected (with the i option). (See sections 5.2.2 and 5.3.9 for more details.)
We request a scatter plot with the gplot procedure (section 5.1.1), and tell SAS how to display the date/time values using the format statement (section A.6.4).


symbol1 v=dot i=none h=.2;
proc gplot data=sales;
   plot rank*salestime;
   format salestime datetime7.;
run;

Note in the results that the default SAS behavior is to use round and regular axis tick marks, in this case wasting a great deal of space on both axes. Similarly, the orientation of the y-axis labels uses up much space as well. We'll fix this behavior in a later entry.

R
In contrast, the R code is both simpler and more attractive by default; the plot() function (section 5.1.1) has a default treatment for date/time-formatted variables. Our only modification to the defaults is to request dots (with the pch option, section 5.2.2) instead of the default open circles.


plot(timeval, rank, pch=20)

Interpretation: As noted in a previous entry, information about the Amazon sales rank is relatively scant. Amazon considers the number of books it sells to be a competitive secret, and it discloses little information about how the rank is calculated. However, by examining the plot above, we can make some deductions. First, sales ranks are updated at least hourly. Second, there appears to be some adjustment for time of day. This would explain the smooth changes in direction during, for example, the first series of ~20 observations. In contrast, notable discontinuous changes in rank apparently signify sales; for a book with small circulation, such as this one, we can assume that most hours contain only one sale.
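
To make the idea of "discontinuous changes" concrete, one rough approach is to flag observations where the rank improves (drops) by more than some cutoff relative to the previous reading. The ranks and the cutoff of 10000 below are invented for illustration; how Amazon actually computes the rank is not public.

```r
# Made-up hourly ranks: slow drift upward, interrupted by sudden drops
rank <- c(150000, 152000, 154500, 60000, 63000, 66500, 70000, 35000, 37000)

# Change from one reading to the next (negative = rank improved)
change <- diff(rank)

# Hypothetical cutoff: a drop of more than 10000 is treated as a likely sale
likelysale <- which(change < -10000) + 1
likelysale
```

Here the fourth and eighth readings are flagged, since each follows a sharp improvement in rank.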

Monday, July 20, 2009

Example 7.6: Find Amazon sales rank for a book

In honor of Amazon's official release date for the book, we offer this blog entry.

Both SAS and R can be used to find the Amazon Sales Rank for a book by downloading the desired web page and ferreting out the appropriate line. This code is likely to break if Amazon’s page format is changed (but it worked as of October, 2010). [Note: as of spring 2010 Amazon changed the format for their webpages, and the appropriate text to search for changed from "Amazon.com Sales Rank" to "Amazon Bestsellers Rank". We've updated the blog code with this string. As of October 9, 2010 they added a number of blank lines to the web page, which we also now address.]

In this example, we find the sales rank for our book. Some interesting information about interpreting the rank can be found here or here.

Both SAS and R code below rely on section 1.1.3, "Reading more complex text files." Note that in the displayed SAS and R code, the long URL has been broken onto several lines, while it would have to be entered on a single line to run correctly.

In SAS, we assign the URL an internal name (section 1.1.6), then input the file using a data step. We exclude all the lines which don't contain the sales rank, using the count function (section 1.4.6). We then extract the number using the substr function (section 1.4.3), with the find function (section 1.4.6) employed to locate the number within the line. The last step is to turn the extracted text (which contains a comma) into a numeric variable.

SAS


filename amazon url "http://www.amazon.com/
         SAS-Management-Statistical-Analysis-Graphics/
         dp/1420070576/ref=sr_1_1?ie=UTF8&s=books
         &qid=1242233418&sr=8-1";

data test;
infile amazon truncover;
input @1 line $256.;
   if count(line, "Amazon Bestsellers Rank") ne 0;
   rankchar = substr(line, find(line, "#")+1, 
        find(line, "in Books") - find(line, "#") - 2);
   rank = input(rankchar, comma9.);
run;

proc print data=test noobs; 
   var rank;
run;


R


# grab contents of web page
urlcontents <- readLines("http://www.amazon.com/
           SAS-Management-Statistical-Analysis-Graphics/
           dp/1420070576/ref=sr_1_1?ie=UTF8&s=books
           &qid=1242233418&sr=8-1")
# find line with sales rank
linenum <- suppressWarnings(grep("Amazon Bestsellers Rank:",
           urlcontents))

newline <- linenum + 1    # work around October 2010 blank lines
while (urlcontents[newline] == "") {
   newline <- newline + 1
}

# split line into multiple elements
linevals <- strsplit(urlcontents[newline], ' ')[[1]]

# find element with sales rank number
entry <- grep("#", linevals)   
# snag that entry
charrank <- linevals[entry]
# kill '#' at start
charrank <- substr(charrank, 2, nchar(charrank))  
# remove commas
charrank <- gsub(',','', charrank) 
# turn it into a numeric object
salesrank <- as.numeric(charrank)
cat("salesrank=",salesrank,"\n")

The resulting output (on July 16, 2009) is

SAS


                      rank
       
                      23476


R


salesrank= 23467
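
The split-and-search steps in the R code for this example can also be done with a regular expression. This is a sketch under the assumption that the rank appears as a "#" followed by digits and commas; the sample line is a made-up stand-in, not text copied from Amazon's page, and rankline is our own name.

```r
# Hypothetical example of the line containing the rank
rankline <- "#23,467 in Books (See Top 100 in Books)"

# Pull out the first '#' followed by digits and commas
m <- regmatches(rankline, regexpr("#[0-9,]+", rankline))

# Drop the '#' and commas, then convert to numeric
salesrank <- as.numeric(gsub("[#,]", "", m))
cat("salesrank=", salesrank, "\n")
```

Like the original code, this is likely to break if the page format changes.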

About the authors

Nicholas Horton is a Professor of Statistics at Amherst College. He is a biostatistician with expertise in missing data methods, longitudinal regression, statistical computing and statistical education.

Ken Kleinman is an Associate Professor with the Department of Biostatistics and Epidemiology at the University of Massachusetts, Amherst. He is a consulting biostatistician with expertise in group-randomized trials and disease surveillance; he also offers R training courses.
