SAS and R: Example 2014.12: Changing repeated measures data from wide to narrow format

Tuesday, October 21, 2014

Example 2014.12: Changing repeated measures data from wide to narrow format

Data with repeated measures often come to us in the "wide" format, as shown below for the HELP data set we use in our book. Here we show just an ID, the CESD depression measure from four follow-up assessments, plus the baseline CESD.

Obs    ID    CESD1    CESD2    CESD3    CESD4    CESD

  1     1       7        .        8        5      49
  2     2      11        .        .        .      30
  3     3      14        .        .       49      39
  4     4      44        .        .       20      15
  5     5      26       27       15       28      39
...

Frequently for data analysis we need to convert the data to the "long" format, with a single column for the repeated time-varying CESD measures and column indicating the time of measurement. This is needed, for example, in SAS proc mixed or in the lme4 package in R. The data should look something like this:

 Obs    ID    time    CESD    cesd_tv

   1     1     1       49         7
   2     1     2       49         .
   3     1     3       49         8
   4     1     4       49         5
...

In section 2.3.7 (2nd Edition) we discuss this problem, and we provide an example in section 7.10.9. Today we're adding a blog post to demonstrate some handy features in SAS and how the problem can be approached using plain R and, alternatively, using the new-ish R packages dplyr and tidyr, contributed by Hadley Wickham.

R
We'll begin by making a narrower data frame with just the columns noted above. We use the select() function from the dplyr package to do this; the syntax is simply to provide the the name of the input data frame as the first argument and then the names of the columns to be included in the output data frame. We use this function instead of the similar base R function subset(..., select=) because of dplyr's useful starts_with() function. This operates on the column names as character vectors in a hopefully obvious way.

load("c:/book/savedfile")

library(dplyr)
wide = select(ds, id, starts_with("cesd"))

Now we'll convert to the long format. The standard R approach is to use the reshape() function. The documentation for this is a bit of a slog, and the function can generate error messages that are not so helpful. But for simple problems like this one, it works well.

long = reshape(wide, varying = c("cesd1", "cesd2", "cesd3", "cesd4"),
               v.names = "cesd_tv",
               idvar = c("id", "cesd"), direction="long")
long[long$id == 1,]

       id cesd time cesd_tv
1.49.1  1   49    1       7
1.49.2  1   49    2      NA
1.49.3  1   49    3       8
1.49.4  1   49    4       5

In the preceding, the varying parameter is a list of columns which vary over time, while the id.var columns appear at each time. The v.names parameter is the name of the column which will hold the values of the varying variables.

Another option would be to use base R knowledge to separate, rename, and then recombine the data as follows. The main hassle here is renaming the columns in each separate data frame so that they can be combined later.

c1 = subset(wide, select= c(id, cesd, cesd1))
c1$time = 1
names(c1)[3] = "cesd_tv"

c2 = subset(wide, select= c(id, cesd, cesd2))
c2$time = 2
names(c2)[3] = "cesd_tv"

c3 = subset(wide, select= c(id, cesd, cesd3))
c3$time = 3
names(c3)[3] = "cesd_tv"

c4 = subset(wide, select= c(id, cesd, cesd4))
c4$time = 4
names(c4)[3] = "cesd_tv"

long = rbind(c1,c2,c3,c4)
long[long$id==1,]

     id cesd cesd_tv time
1     1   49       7    1
454   1   49      NA    2
907   1   49       8    3
1360  1   49       5    4

This is cumbersome, but effective.

More interesting is to use the tools provided by dplyr and tidyr.

library(tidyr)
gather(wide, key=names, value=cesd_tv, cesd1,cesd2,cesd3,cesd4) %>%
mutate(time = as.numeric(substr(names,5,5))) %>%
arrange(id,time) -> long

head(long)

  id cesd names cesd_tv time
1  1   49 cesd1       7    1
2  1   49 cesd2      NA    2
3  1   49 cesd3       8    3
4  1   49 cesd4       5    4
5  2   30 cesd1      11    1
6  2   30 cesd2      NA    2

The gather() function takes a data frame (the first argument) and returns new columns named in the key and value parameter. The contents of the columns are the names (in the key) and the values (in the value) of the former columns listed. The result is a new data frame with a row for every column in the original data frame, for every row in the original data frame. Any columns not named are repeated in the output data frame. The mutate function is like the R base function transform() but has some additional features and may be faster in some settings. Finally, the arrange() function is a much more convenient sorting facility than is available in standard R. The input is a data frame and a list of columns to sort by, and the output is a sorted data frame. This saves us having to select out a subject to display

The %>% operator is a "pipe" or "chain" operator that may be familiar if you're a *nix user. It feeds the result of the last function into the next function as the first argument. This can cut down on the use of nested parentheses and may make reading R code easier for some folks. The effect of the piping is that the mutate() function should be read as taking the result of the gather() as its input data frame, and sending its output data frame into the arrange() function. For Ken, the right assignment arrow (-> long) makes sense as a way to finish off this set of piping rules, but Nick and many R users would prefer to write this as long = gather... or long <- gather.. , etc.

SAS
In SAS, we'll make the narrow data set using the keep statement in the data step, demonstrating meanwhile the convenient colon operator, that performs the same function provided by starts_with() in dplyr.

data all;
set "c:/book/help.sas7bdat";
run;

data wide;
set all;
keep id cesd:;
run;

The simpler way to make the desired data set is with the transpose procedure. Here the by statement forces the variables listed in that statement not to be transposed. The notsorted options save us having to actually sort the variables. Otherwise the procedure works like gather(): each transposed variable becomes a row in the output data set for every observation in the input data set. SAS uses standard variable names for gather()'s key (SAS: _NAME_)and value (SAS: COL1) though these can be changed.

proc transpose data = wide out = long_a;
by notsorted id notsorted cesd;
run;

data long;
set long_a;
time = substr(_name_, 5);
rename col1=cesd_tv;
run;

proc print data = long;
where id eq 1;
var id time cesd cesd_tv; 
run;

 Obs    ID    time    CESD    cesd_tv

   1     1     1       49         7
   2     1     2       49         .
   3     1     3       49         8
   4     1     4       49         5

As with R, it's trivial, though somewhat cumbersome, to generate this effect using basic coding.

data long;
set wide;
time = 1; cesd_tv = cesd1; output;
time = 2; cesd_tv = cesd2; output;
time = 3; cesd_tv = cesd3; output;
time = 4; cesd_tv = cesd4; output;
run;

proc print data = long;
where id eq 1;
var id time cesd cesd_tv; 
run;

 Obs    ID    time    CESD    cesd_tv

   1     1     1       49         7
   2     1     2       49         .
   3     1     3       49         8
   4     1     4       49         5

An unrelated note about aggregators: We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

4 comments:

Ken said...: More elegant is:

data long;
set wide;
array cesd_array {4} cesd1-cesd4;
do time=1 to 4;
cesd_tv=cesd_array{time};
output;
end;
run;

Helps a lot if you have data over say 12 visits and transpose isn't suitable. The nice thing about the loop is that you can add extra processing which with proc transpose requires a data step anyway and usually keeping track of id so it is just easier to use a loop.; October 22, 2014 at 1:19 AM
Ananda said...: I've written a package called "splitstackshape" that also provides a convenient way to do this reshaping. Hopefully Blogger doesn't mangle the code too much :-)

library(splitstackshape)
merged.stack(helpdata, id.vars = c("id", "cesd"),
var.stubs = "cesd\\d", sep = "cesd",
keep.all = FALSE)

The `keep.all = FALSE` says to not return all of the other columns of the dataset, but only those specified by the "id.vars" and those captured matching the "var.stubs".; October 22, 2014 at 2:28 AM
Ken Kleinman said...: The SAS approach that the other Ken posted above is demonstrated in our book in example 7.10.9; this is slightly less typing-intensive way of doing the job manually.

There are several alternative R approaches available. In addition to the package that Ananda contributed, for example, users might consider reshape2. This multitude of methods suggests that none has yet become standard. It's possible that the speed of the dplyr package means it will become that standard.; October 22, 2014 at 8:46 AM
Chris Andrews said...: RE: "We use this function instead of the similar base R function subset(..., select=) because of dplyr's useful starts_with() function."

This 'starts with' functionality can be implemented with grepl in base.

subset(ds, select=grepl("^cesd*", names(ds))); October 23, 2014 at 12:20 PM

Reviews (from the first edition)

"By placing the R and SAS solutions together and by covering a vast array of tasks in one book, Kleinman and Horton have added surprising value and searchability to the information in their book. … a home run, and it is a book I am grateful to have sitting, dust-free, on my shelf."
—Robert Alan Greevy, Jr, Teaching of Statistics in the Health Sciences

"I use SAS and R on a daily basis. Each has strengths and weaknesses, and using both of them gives the advantage of being able to do almost anything when it comes to data manipulation, analysis, and graphics. If you use both SAS and R on a regular basis, get this book. If you know one of the packages and are learning the other, you may need more than this book, but get this book, too. "

Charles Heckler, University of Rochester, Technometrics

"Excellent cross-referencing to other topics and end-of-chapter worked examples on the ‘Health evaluation and linkage to primary care’ data set are given with each topic. … users who are proficient in either of the software packages but with the need to use the other will find this book useful."
—Frances Denny, Journal of the Royal Statistical Society, Series A

About the authors

Nicholas Horton is a Professor of Statistics at Amherst College. He is a biostatistician with expertise in missing data methods, longitudinal regression, statistical computing and statistical education. Nick's home page; Nick's Google Scholar author page

Ken Kleinman is an Associate Professor with the Department of Biostatistics and Epidemiology at the University of Massachusetts, Amherst. He is a consulting biostatistician with expertise in group-randomized trials and disease surveillance; he also offers R training courses. Ken's home page; Ken's Google Scholar author page.

SAS and R

Catalogs of posts

Tuesday, October 21, 2014

Example 2014.12: Changing repeated measures data from wide to narrow format

4 comments:

About SAS and R

Topics discussed