
Monday, January 5, 2015

Example 2015.1: Time to refinance?

In the US, it's typical to borrow a fairly substantial portion of the cost of a new house from a bank. The cost of these loans, the mortgage rate, varies over time depending on what the financial wizards see in their crystal balls. This means that when mortgage rates go down, the cost of living in your own house magically decreases: you take out a new loan at the lower rate, pay off your old loan with it, and then only have to pay off the new loan at the lower rate. You can find mortgage calculators on the web easily enough--if you don't mind them collecting your data and bombarding you with ads once their cookies start tracking you.

Instead, you can use SAS or R to calculate what you might pay for a new loan with various posted rates. There are some sophisticated tools available for either package if you're interested in the remaining principal or the proportion of each payment that's principal. Here, we just want to check the monthly payment.

R
We'll begin by writing a little function to calculate the monthly payment from the principal, interest rate (in per cent), and term (in years) of the loan. This is basic stuff, but the code here is adapted from a function written by Thomas Girke of UC Riverside.
mortgage <- function(principal=300000, rate=3.875, term=30) { 
  J <- rate/(12 * 100)              # monthly interest rate, from the annual rate in percent
  N <- 12 * term                    # number of monthly payments
  M <- principal*J/(1-(1+J)^(-N))   # standard fixed-rate amortization formula
  monthPay <<- M                    # also stash the payment in the global environment
  return(monthPay)
}
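For instance, a quick sanity check on a single loan (our illustration; the figures match the third offer in the table below):
mortgage(principal=275000, rate=3.5, term=30)
# about 1234.87 per month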
To compare the monthly costs for a series of loans offered by a local bank, we'll input the bank's loans as a data frame. To save typing, we'll use the rep() function to generate the term of the loan and the points.
offers = data.frame(
  principal = rep(275000, times=9),
  term = rep(c(30,20,15), each=3), 
  points = rep(c(0,1,2), times=3),
  rate = c(3.875, 3.75, 3.5, 3.625, 3.5, 3.375, 3, 2.875, 2.75))

> offers

  principal term points  rate
1    275000   30      0 3.875
2    275000   30      1 3.750
3    275000   30      2 3.500
4    275000   20      0 3.625
5    275000   20      1 3.500
6    275000   20      2 3.375
7    275000   15      0 3.000
8    275000   15      1 2.875
9    275000   15      2 2.750
(Points are an up-front cost a borrower can pay to lower the mortgage rate for the loan.) With the data and function in hand, it's easy to add the monthly cost to the data frame:
offers$monthly = with(offers, mortgage(rate=rate, term=term, principal=principal))

> offers

  principal term points  rate  monthly
1    275000   30      0 3.875 1293.152
2    275000   30      1 3.750 1273.568
3    275000   30      2 3.500 1234.873
4    275000   20      0 3.625 1612.610
5    275000   20      1 3.500 1594.889
6    275000   20      2 3.375 1577.282
7    275000   15      0 3.000 1899.100
8    275000   15      1 2.875 1882.611
9    275000   15      2 2.750 1866.210
In theory, each of these costs is fair, and the borrower should choose based on the monthly cost they can afford, as well as whether they see more value in keeping money in hand to spend on a better quality of life, to invest in savings, or to pay off the house sooner. Financial professionals often discuss the total dollars spent, or the total spent on interest vs. principal, as well.
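
Those totals are easy to tack onto the offers data frame; the sketch below is our addition (not one of the bank's figures) and assumes the usual convention that one point costs one percent of the principal, paid up front.
offers$total = with(offers, monthly * 12 * term)        # total paid over the life of the loan
offers$interest = with(offers, total - principal)       # portion of that total that is interest
offers$upfront = with(offers, principal * points / 100) # up-front cost of the points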

SAS
The SAS/ETS package provides the LOAN procedure, which can calculate the detailed analyses mentioned above. For simple calculations like this one, we can use the mort function in the data step. It will find and return the missing one of the four parameters-- principal, payment, rate, and term. To enter the data in a manner similar to R, we'll use array statements and do loops.
data t;
  principal = 275000;
  array te [3] (30, 20, 15);
  array po [3] (0, 1, 2);
  array ra [9] (.03875, .0375, .035, .03625, .035,
                .03375, .03, .02875, .0275);
  do i = 1 to 3;
    do j = 1 to 3;
      term = te[i];
      points = po[j];
      rate = ra[3 * (i-1) + j];
      monthly = mort(principal, ., rate/12, term*12);
      output;
    end;
  end;
run;

proc print noobs data = t;
  var principal term points rate monthly;
run;

principal    term    points      rate     monthly

  275000      30        0      0.03875    1293.15
  275000      30        1      0.03750    1273.57
  275000      30        2      0.03500    1234.87
  275000      20        0      0.03625    1612.61
  275000      20        1      0.03500    1594.89
  275000      20        2      0.03375    1577.28
  275000      15        0      0.03000    1899.10
  275000      15        1      0.02875    1882.61
  275000      15        2      0.02750    1866.21
An unrelated note about aggregators: We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission. If you read this on another aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page other than as noted above, the aggregator is violating the terms by which we publish our work.

Monday, May 14, 2012

Example 9.31: Exploring multiple testing procedures

In example 9.30 we explored the effects of adjusting for multiple testing using the Bonferroni and Benjamini-Hochberg (or false discovery rate, FDR) procedures. At the time we claimed that it would probably be inappropriate to extract the adjusted p-values produced by the FDR method from their context. In this entry we attempt to explain our misgivings about this practice.

The FDR procedure is described in Benjamini and Hochberg (JRSSB, 1995) as a "step-down" procedure. Put simply, the procedure has the following steps:

0. Choose the familywise alpha.
1. Rank-order the unadjusted p-values.
2. Beginning with the Mth (largest) of the ordered p-values p(m):
2a. if p(m) < alpha*(m/M), then reject all tests 1 ... m;
2b. if not, set m = m-1.
3. Repeat steps 2a and 2b until the condition in 2a is met
or p(1) > alpha/M, where M is the number of tests.
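
To make this concrete, here is a minimal R sketch of the rejection procedure (our addition, not from the original entry); we use the conventional non-strict comparison p(m) <= alpha*(m/M), which agrees with p.adjust(). The function name is ours, and the example p-values are the ones used later in this entry.
bhReject = function(pvals, alpha = 0.05) {
  M = length(pvals)
  ord = order(pvals)                    # step 1: rank-order the p-values
  psorted = pvals[ord]
  m = M                                 # step 2: start from the largest ordered p-value
  while (m > 0 && psorted[m] > alpha * m/M) {
    m = m - 1                           # step 2b: step down while the condition fails
  }
  reject = rep(FALSE, M)
  if (m > 0) reject[ord[1:m]] = TRUE    # step 2a: reject ordered tests 1 ... m
  reject                                # TRUE/FALSE, in the original order of pvals
}
bhReject(c(rep(.001, 5), .03, .035, .04, .05, .05))  # rejects all ten tests at alpha = 0.05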

The "adjusted p-value" based on this procedure is the smallest familywise alpha under which the current test would have been rejected. To calculate this, we can modify the routine above:

1. Rank-order the unadjusted p-values.
2. For the ordered p-values p(m), m = M down to 1:
2a. candidate ap(m) = p(m) * (M/m)
2b. if candidate ap(m) > ap(m+1), then ap(m) = ap(m+1)
2c. else ap(m) = candidate ap(m)

where ap(m) refers to the adjusted p-value corresponding to the mth ordered unadjusted p-value. It's interesting to note that the adjusted p-value for the Mth ordered test is the same as the unadjusted p-value, while the candidate adjusted p-value for the smallest test is the Bonferroni adjusted p-value. The primary difficulty with taking these p-values (as opposed to the test results) out of context is captured in steps 2b and 2c. They imply that the p-value for a given test may be lowered by other observed p-values in the family of tests. It's also true that the adjusted p-value depends on the number of tests included in the family, but this seems somewhat less troubling.

To examine the impact of the procedure on the adjusted p-values for the individual tests, we'll compare the candidate ap(m) from step 2a against the actual ap(m). Our sense is that to the degree these are different, the adjusted p-value should not be extracted from the context of the observed family of tests.


SAS
Our SAS code relies heavily on the array statement (section 1.11.5). We loop through the p-values from largest to smallest, calculating the candidate fdr p-value as above, before arriving at the final adjusted p-value. To compare the values conveniently, we make a new data set with two copies of the original data set, renaming first the candidate and then the adjusted p-values to have the same names. The in = data set option creates a temporary variable which identifies which data set an observation was read from; here it denotes which version of the same data set (and which set of p-values) was used.

data fdr;
  array pvals [10] pval1 - pval10
    (.001 .001 .001 .001 .001 .03 .035 .04 .05 .05);
  array cfdrpvals [10] cfdr1 - cfdr10;
  array fdrpvals [10] fdr1 - fdr10;
  fdrpvals[10] = pvals[10];
  do i = 9 to 1 by -1;
    cfdrpvals[i] = pvals[i] * 10/i;
    if cfdrpvals[i] > fdrpvals[i+1] then fdrpvals[i] = fdrpvals[i+1];
    else fdrpvals[i] = cfdrpvals[i];
  end;
run;

data compare;
  set fdr (in = cfdr rename = (cfdr1=c1 cfdr2=c2 cfdr3=c3 cfdr4=c4
             cfdr5=c5 cfdr6=c6 cfdr7=c7 cfdr8=c8 cfdr9=c9))
      fdr (in = fdr rename = (fdr1=c1 fdr2=c2 fdr3=c3 fdr4=c4 fdr5=c5
             fdr6=c6 fdr7=c7 fdr8=c8 fdr9=c9));
  if cfdr then adjustment = "Candidate fdr";
  if fdr then adjustment = "Final fdr";
run;

proc print data = compare; var adjustment c1-c9; run;

adjustment        c1      c2      c3      c4     c5    c6    c7    c8     c9

Candidate fdr   0.010    .005   .0033   .0025   .002   .05   .05   .05   .055
Final fdr       0.002    .002   .0020   .0020   .002   .05   .05   .05   .050

(We omit the last p-value because the adjustment does not affect it.) The result shows that for many of the tests in this family, a substantially smaller p-value is obtained with the final FDR p-value than the candidate. To this degree, the FDR p-value is dependent on the observed values of the p-values in the tests in the family, and ought not to be removed from the context of these other tests. We would recommend caution in displaying the FDR p-values in such settings, given readers' propensity to use them as if they were ordinary p-values, safely adjusted for multiple testing.

R
Comparison of the R and SAS code may make SAS programmers weep. The candidate values are easily calculated, and can be presented with the final p-values in one step using the p.adjust() function. Three lines of code, albeit incorporating multiple functions in each line. (And it could sensibly be done in two, calculating the candidate p-values within the rbind() function call.) Note especially the line calculating the candidate p-values, in which vectorization allows a for loop to be avoided in a very natural fashion.

fakeps = c(rep(.2, 5), 6, 7, 8, 10, 10)/200
cfdr = fakeps * 10/(1:10)
rbind(cfdr, fdr=p.adjust(fakeps, "fdr"))[,1:9]

      [,1]  [,2]   [,3]   [,4]  [,5] [,6] [,7] [,8]   [,9]
cfdr 0.010 0.005 0.0033 0.0025 0.002 0.05 0.05 0.05 0.0556
fdr  0.002 0.002 0.0020 0.0020 0.002 0.05 0.05 0.05 0.0500



Monday, April 23, 2012

Example 9.28: creating datasets from tables

R
There are often times when it is useful to create an individual level dataset from aggregated data (such as a table). While this can be done using the expand.table() function within the epitools package, it is also straightforward to do directly within R.

Imagine that instead of the individual level data, we had only the 2x2 table for the association between homeless status and gender within the HELP RCT:

> HELPrct = read.csv("http://www.math.smith.edu/r/data/help.csv")
> xtabs(~ homeless + female, data=HELPrct)
        female
homeless   0   1
       0 177  67
       1 169  40

We can use this information to create an analytic dataset using just the four rows of a new dataset:

> female = c(0, 1, 0, 1)
> homeless = c(1, 1, 0, 0)
> count = c(169, 40, 177, 67)
> ds=data.frame(cbind(female, homeless, count))
> ds
  female homeless count
1      0        1   169
2      1        1    40
3      0        0   177
4      1        0    67

Next we use the rep() function to generate a vector of indices to repeat. The index object repeats each row number count times.

> index = rep(seq_len(nrow(ds)), times=ds$count)
> newds = ds[index,]
> newds$count = NULL
> xtabs(~ homeless + female, data=newds)
        female
homeless   0   1
       0 177  67
       1 169  40

The resulting data set is identical to the summarized input data set.
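
For comparison, the expand.table() function from the epitools package mentioned above should produce the same expansion directly from the table; this is a hedged sketch that assumes epitools is installed and that expand.table() accepts the table returned by xtabs().
library(epitools)
longds = expand.table(xtabs(~ homeless + female, data=HELPrct))  # one row per subject
xtabs(~ homeless + female, data=longds)                          # reproduces the 2x2 table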

SAS
Many SAS procedures offer a weight statement (specified within the proc) naming a variable, say varname, that effectively counts each observation varname times. So, for example, we can make a data set such as that shown above, then use, e.g., proc freq to produce a table.

data ds;
female = 0; homeless = 1; count = 169; output;
female = 1; homeless = 1; count = 40; output;
female = 0; homeless = 0; count = 177; output;
female = 1; homeless = 0; count = 67; output;
run;

proc freq data = ds;
table homeless * female;
weight count;
run;
homeless     female

Frequency|
Percent  |
Row Pct  |
Col Pct  |       0|       1|  Total
---------+--------+--------+
       0 |    177 |     67 |    244
         |  39.07 |  14.79 |  53.86
         |  72.54 |  27.46 |
         |  51.16 |  62.62 |
---------+--------+--------+
       1 |    169 |     40 |    209
         |  37.31 |   8.83 |  46.14
         |  80.86 |  19.14 |
         |  48.84 |  37.38 |
---------+--------+--------+
Total         346      107      453
            76.38    23.62   100.00


However, some procedures lack this option, and/or it may be difficult to arrange your data appropriately to take advantage of it. In such cases, it's useful to be able to expand the data manually, as we show for R above. We demonstrate this below, assuming the count variable can be constructed. The explicit output statement puts a line into the newds data set count times.

data newds;
  set ds;
  do i = 1 to count;
    output;
  end;
run;

proc freq data = newds;
table homeless * female;
run;
homeless     female

Frequency|
Percent  |
Row Pct  |
Col Pct  |       0|       1|  Total
---------+--------+--------+
       0 |    177 |     67 |    244
         |  39.07 |  14.79 |  53.86
         |  72.54 |  27.46 |
         |  51.16 |  62.62 |
---------+--------+--------+
       1 |    169 |     40 |    209
         |  37.31 |   8.83 |  46.14
         |  80.86 |  19.14 |
         |  48.84 |  37.38 |
---------+--------+--------+
Total         346      107      453
            76.38    23.62   100.00






Tuesday, March 13, 2012

Example 9.23: Demonstrating proportional hazards


A colleague recently asked for a slide suitable for explaining proportional hazards. In particular, she was concerned that her audience not focus on the time to event or the probability of the event. An initial thought was to display the cumulative hazards, which maintain a constant ratio if the model is true. But the colleague's audience might get distracted both by the language (what's a "hazard"?) and by the fact that the cumulative hazard doesn't have a readily interpretable scale. The failure curve--the probability of failure by time t, plotted over time--might be a bit more accessible.

Rather than just draw some curves, we simulated data, based on the code we demonstrated previously. In this case, there's no need for any interesting censoring, but a more interesting survival curve seems worthwhile.

SAS
The more interesting curve is introduced by manually accelerating and slowing down the Weibull survival time demonstrated in the previous approach. We also trichotomized one of the exposures to match the colleague's study, and censored all values greater than 10 to keep focus where the underlying hazard was interesting.

data simcox;
  beta1 = .2;
  beta2 = log(1.25);
  lambdat = 20;   * baseline hazard;
  do i = 1 to 10000;
    x1 = normal(45);
    x2 = (normal(0) gt -1) + (normal(0) gt 1);
    linpred = -beta1*x1 - beta2*x2;
    * time of event;
    t = rand("WEIBULL", 1, lambdaT * exp(linpred));
    if t gt 5 then t = 5 + (t-5)/10;
    if t gt 7 then t = 7 + (t-7) * 20;
    censored = (t > 10);
    output;
  end;
run;

The phreg procedure will fit the model and produce nice plots of the survival function and the cumulative hazard. But to generate useful versions of these, you need to make an additional data set containing the covariate values you want plots for. We set the other covariate's value to 0. You can also include an id variable with some descriptive text.

data covars;
x1 = 0; x2 = 2; id = "Arm 3"; output;
x1 = 0; x2 = 1; id = "Arm 2"; output;
x1 = 0; x2 = 0; id = "Arm 1"; output;
run;

proc phreg data = simcox plot(overlay)=cumhaz;
  class x2 (ref = "0");
  baseline covariates = covars out = kkout cumhaz = cumhaz
    survival = survival / rowid = id;
  model t*censored(1) = x1 x2;
run;

The cumulative hazard plot generated by the plot option is shown below, demonstrating the correct relative hazard of 1.25. The related survival curve could be generated with plot = s .

To get the desired plot of the failure times, use the out = option to the baseline statement. This generates a data set with the listed statistics (here the cumulative hazard and the survival probability, across time). Then we can produce a plot using the gplot procedure, after generating the failure probability.

data kk2;
  set kkout;
  iprob = 1 - survival;
run;

goptions reset=all;
legend1 label = none value=(h=2);
axis1 order = (0 to 1 by .25) minor = none value=(h=2)
      label = (a = 90 h = 3 "Probability of infection");
axis2 order = (0 to 10 by 2) minor = none value=(h=2)
      label = (h=3 "Attributable time");
symbol1 i = sm51s v = none w = 3 r = 3;
proc gplot data = kk2;
  plot iprob * t = id / vaxis = axis1 haxis = axis2 legend=legend1;
run; quit;

The result is shown at the top. Note the use of the h= option in various places in the axis and legend statements to make the fonts more visible when the graph is shrunk to fit onto a slide. The smoothing spline plotted through the data with the smXXs interpolation makes a nice shape out of the underlying abrupt changes in the hazard. The symbol, legend, and axis statements are discussed in chapter 6.

R
As in SAS, we begin by simulating the data. Note that we use the simple categorical variable simulator mentioned in the comments for example 7.20.

n = 10000
beta1 = .2
beta2 = log(1.25)
lambdaT = 20

x1 = rnorm(n,0)
x2 = sample(0:2,n,rep=TRUE,prob=c(1/3,1/3,1/3))
# true event time
T = rweibull(n, shape=1, scale=lambdaT*exp(-beta1*x1 - beta2*x2))
T[T>5] = 5 + (T[T>5] -5)/10
T[T>7] = 7 + (T[T>7] -7) * 20
event = T < rep(10,n)

Now we can fit the model using the coxph() function from the survival package. There is a default method for plot()ing the "survfit" objects resulting from several package functions. However, it shows the typical survival plot. To show the failure probability instead, we'll manually take the complement of the survival probability, but still take advantage of the default method. Note the use of the xmax option to limit the x-axis. The results (below) are somewhat bland, and it's unclear whether the lines can be colored differently or their widths increased. They are also as angular as the cumulative hazards shown in the SAS implementation.

library(survival)
plotph = coxph(Surv(T, event) ~ x1 * strata(x2),
               method="breslow")
summary(plotph)
sp = survfit(plotph)
sp$surv = 1 - sp$surv
plot(sp, xmax=10)


Consequently, as is so often the case, presentation graphics require more manual fiddling. We'll begin by extracting the data we need from the "survfit" object. We'll take just the failure times and probabilities, as well as the name of the stratum to which each observation belongs, limiting the data to time < 10. The last of these lines uses the names() function to pull the names of the strata, repeating each an appropriate number of times with the useful rep() function.

sp = survfit(plotph)
failtimes = sp$time[sp$time <10]
failprobs = 1 - sp$surv[sp$time <10]
failcats = c(rep(names(sp$strata),times=sp$n))[sp$time <10]

All that remains is plotting the data, which is not dissimilar to many examples in the book and in this blog. There's likely some way to produce these three lines with a little less typing, but knowing how to do it from scratch gives you the most flexibility. It proved difficult to get the desired amount of smoothness from the loess(), lowess(), or supsmu() functions, but smooth.spline() served admirably. The code below demonstrates increasing the axis and tick label sizes.

plot(failprobs ~ failtimes, type="n", ylim=c(0,1), cex.lab=2, cex.axis=1.5,
     ylab="Probability of infection", xlab="Attributable time")
lines(smooth.spline(y=failprobs[failcats == "x2=0"],
                    x=failtimes[failcats == "x2=0"], all.knots=TRUE, spar=1.8),
      col="blue", lwd=3)
lines(smooth.spline(y=failprobs[failcats == "x2=1"],
                    x=failtimes[failcats == "x2=1"], all.knots=TRUE, spar=1.8),
      col="red", lwd=3)
lines(smooth.spline(y=failprobs[failcats == "x2=2"],
                    x=failtimes[failcats == "x2=2"], all.knots=TRUE, spar=1.8),
      col="green", lwd=3)
legend(x=7, y=0.4, legend=c("Arm 1", "Arm 2", "Arm 3"),
       col=c("blue","red","green"), lty=c(1,1,1), lwd=c(3,3,3))