SAS and R: as.numeric()

Monday, October 8, 2012

Example 10.5: Convert a character-valued categorical variable to numeric

In some settings it may be necessary to recode a categorical variable with character values into a variable with numeric values. For example, the matching macro we discussed in example 7.35 will only match on numeric variables. One way to convert character variables to numeric values is to determine which values exist, then write a possibly long series of conditional tests to assign numbers to the values. Surely there's a better way?

SAS
In SAS, Rick Wicklin offers an IML solution and links to a macro with the same function. But if you're not an IML coder, and you don't want to investigate a macro solution, it's simple enough to do with data steps. We'll begin by making some fake data.

data test;
  do i = 1 to 100;
  cat = "meow";
  if i gt 30 then cat = "Purr";
  if i gt 70 then cat = "Hiss";
  output;
  end;
run;

To make the new variable, we'll just sort (section 1.5.6) the data on the categorical variable we want to convert, then use the set ds; by x; syntax to keep track of when a new value is encountered in the data. It's hard to believe that we've never demonstrated this useful syntax before-- perhaps we just can't find it today. The set ds; by x; syntax makes new temporary variables first.x and last.x that are equal to 1 for the first and last observations of each new level of x, respectively, and 0 otherwise. When we find a new value, we'll increase a counter by 1; the counter is our new numeric-valued variable.

proc sort data = test; by cat; run;

data catize;
set test;
by cat;
retain catnum 0;
if first.cat then catnum = catnum + 1;
run;

/* check the result */
proc freq data = catize;
tables cat * catnum;
run;

The table also shows the recoding values.

                             Table of cat by catnum

                  cat       catnum

                  Frequency|
                  Percent  |
                  Row Pct  |
                  Col Pct  |       1|       2|       3|  Total
                  ---------+--------+--------+--------+
                  Hiss     |     30 |      0 |      0 |     30
                           |  30.00 |   0.00 |   0.00 |  30.00
                           | 100.00 |   0.00 |   0.00 |
                           | 100.00 |   0.00 |   0.00 |
                  ---------+--------+--------+--------+
                  Purr     |      0 |     40 |      0 |     40
                           |   0.00 |  40.00 |   0.00 |  40.00
                           |   0.00 | 100.00 |   0.00 |
                           |   0.00 | 100.00 |   0.00 |
                  ---------+--------+--------+--------+
                  meow     |      0 |      0 |     30 |     30
                           |   0.00 |   0.00 |  30.00 |  30.00
                           |   0.00 |   0.00 | 100.00 |
                           |   0.00 |   0.00 | 100.00 |
                  ---------+--------+--------+--------+
                  Total          30       40       30      100
                              30.00    40.00    30.00   100.00
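For readers who think in R, the sort-then-count logic of the data step can be sketched as follows. This only illustrates the first.cat idea and is not the approach used in the R section below. The names `sorted`, `first`, and `catnum` are ours, and `method = "radix"` is chosen on the assumption that we want SAS-like byte-order sorting (uppercase before lowercase) regardless of locale.

```r
# Fake data matching the SAS example: 30 meow, 40 Purr, 30 Hiss
cat <- c(rep("meow", 30), rep("Purr", 40), rep("Hiss", 30))

# Sort first, as proc sort does; radix sorting uses C-locale order
sorted <- sort(cat, method = "radix")

# TRUE on the first observation of each level, like first.cat
first <- !duplicated(sorted)

# Running count of new levels, like the retained counter
catnum <- cumsum(first)

table(sorted, catnum)
```

The cumulative sum plays the role of the retained counter: it goes up by one each time a new level starts.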

R
We begin by making the data. To convert to numbers, we use the labels option to the factor() function, feeding it the sequence of integers from 1 to however many distinct values there are. Note that we find that count by calling factor() a second time. There's probably a better way of doing this, but it's a little bit amusing to code it this way. Then we have numbers, but they're stored as a factor. We can get them out with a call to as.numeric().

cat = c(rep("meow",30),rep("Hiss",30), rep("Purr", 40))
catn1 = factor(cat, labels=(1:length(levels(factor(cat)))))
catn = as.numeric(catn1)
table(catn,cat)

    cat
catn Hiss meow Purr
   1   30    0    0
   2    0   30    0
   3    0    0   40

There's a warning in the documentation for factor() that the levels are sorted in a locale-specific fashion, so the table should be used to establish how the codes were assigned. For the record, the use cases for this kind of recoding in R may be more strained than the SAS example given above.
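If the coding needs to be reproducible across machines, one option is to state the level order explicitly rather than rely on the default sort. A minimal sketch, assuming we want the same coding as the SAS table above (the vector `lev` is our own hypothetical choice):

```r
cat <- c(rep("meow", 30), rep("Hiss", 30), rep("Purr", 40))

# Assumed coding order (hypothetical choice, matching the SAS table above)
lev <- c("Hiss", "Purr", "meow")

# With explicit levels, the codes no longer depend on the locale's sort order
catn <- as.numeric(factor(cat, levels = lev))

table(catn, cat)
```

Here the codes follow the positions in `lev`, so "Hiss" is 1, "Purr" is 2, and "meow" is 3 whatever the locale.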

An unrelated note about aggregators: We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

Tuesday, April 19, 2011

Example 8.35: Grab true (not pseudo) random numbers; passing API URLs to functions or macros

Usually, we're content to use a pseudo-random number generator. But sometimes we may want numbers that are actually random-- an example might be for randomizing treatment status in a randomized controlled trial.

The site Random.org provides truly random numbers based on radio static. For long simulations, its quota system may prevent its use. But for small to moderate needs, it can be used to provide truly random numbers. In addition, you can purchase larger quotas if need be.

The site provides APIs for several types of information. We'll write functions to use these to pull vectors of uniform (0,1) random numbers (of 10^(-9) precision) and to check the quota. To generate random variates from other distributions, you can use the inverse probability integral transform (section 1.10.8).
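As a small illustration of the inverse probability integral transform, the sketch below applies quantile functions to a few uniform values. The vector `u` is a made-up stand-in for numbers that would come from the API, and the exponential rate is chosen only for illustration.

```r
# Stand-in uniform(0,1) values; in practice these would come from the API
u <- c(0.1, 0.5, 0.9)

# Applying a quantile function to uniforms gives draws from that distribution
z <- qnorm(u)            # standard normal
x <- qexp(u, rate = 2)   # exponential with rate 2 (rate chosen for illustration)

z
x
```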

The coding challenge here comes in integrating quotation marks and special characters with function and macro calls.

SAS
In SAS, the challenging bit is to pass the desired number of random numbers off to the API, through the macro system. This is hard because the API includes the special characters ?, ", and especially &. The ampersand is used by the macro system to denote the start of a macro variable, and is used in APIs to indicate that an additional parameter follows.

To avoid processing these characters as part of the macro syntax, we have to enclose them within the macro quoting function %nrstr. We use this approach twice, for the fixed pieces of the API, and between them insert the macro variable that contains the number of random numbers desired. Also note that the sequence %" is used to produce the quotation mark. Then, to unmask the resulting character string and use it as intended, we %unquote it. Note that the line break shown in the filename statement must be removed for the code to work.

Finally, we read data from the URL (section 1.1.6) and transform the data to take values between 0 and 1.


%macro rands (outds=ds, nrands=);
filename randsite url %unquote(%nrstr(%"http://www.random.org/integers/?num=)
 &nrands%nrstr(&min=0&max=1000000000&col=1&base=10&format=plain&rnd=new%"));
proc import datafile=randsite out = &outds dbms = dlm replace; 
  getnames = no; 
run;

data &outds;
set &outds;
  var1 = var1 / 1000000000;
run;
%mend rands;

/* an example macro call */
%rands(nrands=25, outds=myrs);

The companion macro to find the quota is slightly simpler, since we don't need to insert the number of random numbers in the middle of the URL. Here, we show the quota in the SAS log; the file print syntax, shown in Example 8.34, can be used to send it to the output instead.


%macro quotacheck;
filename randsite url %unquote(%nrstr(%"http://www.random.org/quota/?format=plain%"));
proc import datafile=randsite out = __qc dbms = dlm replace; 
getnames = no; 
run;

data _null_;
set __qc;
put "Remaining quota is " var1 "bits";
run;
%mend quotacheck;

/* an example macro call */
%quotacheck;

R

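Two R functions are shown below. The problem isn't as difficult as in SAS: paste() assembles the URL as a single character string. We also wrap the result in the as.character() function (section 1.4.1), though since paste() already returns a character string this is a safeguard rather than a requirement.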


truerand = function(numrand) {
  read.table(as.character(paste("http://www.random.org/integers/?num=", 
  numrand, "&min=0&max=1000000000&col=1&base=10&format=plain&rnd=new", 
  sep="")))/1000000000
}

quotacheck = function() {
  line = as.numeric(readLines("http://www.random.org/quota/?format=plain"))
  return(line)
}
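To see how the URL is assembled without contacting the site, the string-building step can be pulled out on its own. This is a sketch only: `randurl` is a hypothetical helper name, and the address is copied from the function above as-is.

```r
# Hypothetical helper: build the request URL without contacting the site
randurl <- function(numrand) {
  paste0("http://www.random.org/integers/?num=", numrand,
         "&min=0&max=1000000000&col=1&base=10&format=plain&rnd=new")
}

u <- randurl(25)
is.character(u)   # TRUE: paste0() already returns a character string
u
```

Printing the URL before passing it to read.table() is a cheap way to check that the quotation marks and ampersands ended up where intended.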

Tuesday, October 13, 2009

Example 7.15: A more complex sales graphic

The plot of Amazon sales rank over time generated in example 7.14 leaves questions. From a software perspective, we'd like to make the plot prettier; from a substantive perspective, we can embellish the plot to inform our interpretation of how the rank is calculated.

For the latter purpose, we'll create an indicator of whether the rank was recorded in nighttime (eastern US time) or not. Then we'll color the nighttime ranks differently than the daytime ranks.

SAS
In SAS, we use the timepart function to extract the time of day from the salestime variable which holds the date and time. This is a value measured in seconds since midnight, and we use some conditional logic (section 1.4.11) to identify hours before 8 AM or after 6 PM.


data sales2;
set sales;
  if timepart(salestime) lt (8 * 60 * 60) or
     timepart(salestime) gt (18 * 60 * 60) then night=1;
    else night = 0;
run;

Then we can make the plot. We use the axis statement (sections 5.3.7 and 5.3.8) to specify the axis ranges, rotate some labels and headers, and remove the minor tick marks. Note that since the x-axis is a date-time variable, we have to specify the axis range using date-time data, here read in using formats (section A.6.4), and request labels every three days by requesting an interval of three days' worth of seconds (3 * 24 * 60 * 60 = 259200). We also use the symbol statement (sections 5.2.2 and 5.3.11) to specify shapes and colors for the plotted points.


axis1 order = ("09AUG2009/12:00:00"dt to 
   "27AUG2009/12:00:00"dt by 259200) 
   minor = none;
axis2 order=(30000 to 290000 by 130000) label=(angle=90) 
   value=(angle=90) minor=none;
symbol1 i=none v=dot c=red h=.3;
symbol2 i=none v=dot c=black h=.3;

Finally, we request the plot using proc gplot. The a*b=c syntax (as in section 5.6.2) will result in different symbols for each value of the new night variable, and the symbols we just defined will be used. The haxis and vaxis options are used to associate each axis with the axis definitions specified in the axis statements.


proc gplot data=sales2;
   plot rank*salestime=night / haxis=axis1 vaxis=axis2;
   format salestime dtdate5.;
run; quit;

R
In R, we make a new variable reflecting the date-time at the midnight before we started collecting data. We then coerce the time values to numeric values using the as.numeric() function (section 1.4.2), while subtracting that midnight value. Next, we mod by 24 (using the %% operator, section B.4.3) and lastly round to the integer value (section 1.8.4) to get the hour of measurement. There's probably a more elegant way of doing this in R, but this works.


midnight <- as.POSIXlt("2009-08-09 00:00:00 EDT")
timeofday <- round(as.numeric(timeval-midnight)%%24,0)

Next, we prepare for making a nighttime indicator by initializing a vector with 0. Then we assign a value of 1 when the corresponding element of the hour of measurement vector has a value in the correct range.


night <- rep(0,length(timeofday))
night[timeofday < 8 | timeofday > 18] <- 1
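One caution about the subtraction approach: subtracting date-times in R returns a difftime whose units are chosen automatically, so the mod-24 step relies on those units being hours. A possible alternative, sketched here with made-up time stamps standing in for timeval, is to read the hour component directly. Note that this variant truncates to the hour rather than rounding, and the time zone name is our assumption.

```r
# Made-up time stamps standing in for timeval (assumed to be eastern US time)
timeval <- as.POSIXct(c("2009-08-09 03:15:00", "2009-08-10 12:40:00",
                        "2009-08-12 21:05:00"), tz = "America/New_York")

# The hour component of each time stamp, 0 through 23
hour <- as.POSIXlt(timeval)$hour

# Night: before 8 AM, or 6 PM and later
night <- as.numeric(hour < 8 | hour >= 18)

data.frame(timeval, hour, night)
```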

Finally, we're ready to make the plot. We begin by setting up the axes, using the type="n" option (section 5.1.1) to prevent any data being plotted. Next, we plot the nighttime ranks by conditioning the plot vector on the value of the nighttime indicator vector; we then repeat for the daytime values, additionally specifying a color for these points. Lastly, we add a legend to the plot (section 5.2.14).


plot(timeval, rank, type="n")
points(timeval[night==1], rank[night==1], pch=20)
points(timeval[night==0], rank[night==0], pch=20, 
   col="red")
# legend colors match the points() calls above: day in red, night in black
legend(as.POSIXlt("2009-08-22 00:00:00 EDT"), 250000,
   legend=c("day", "night"), col=c("red", "black"), 
   pch=c(20, 20))

Interpretation: It appears that Amazon's ranking function adjusts for the pre-dawn hours, most likely to reflect a general lack of activity. (Note that these ranks are from Amazon.com. In all likelihood, Amazon.co.uk and other local Amazon sites adjust for local time similarly.) Perhaps some recency in sales allows a decline in rank for some books during these hours? In addition, we see that most sales of this book, as inferred from the discontinuous drops (improvement) in rank, tend to happen near the beginning of the day, or mid-day, rather than at night.

About the authors

Nicholas Horton is a Professor of Statistics at Amherst College. He is a biostatistician with expertise in missing data methods, longitudinal regression, statistical computing and statistical education. Nick's home page; Nick's Google Scholar author page

Ken Kleinman is an Associate Professor with the Department of Biostatistics and Epidemiology at the University of Massachusetts, Amherst. He is a consulting biostatistician with expertise in group-randomized trials and disease surveillance; he also offers R training courses. Ken's home page; Ken's Google Scholar author page.
