SAS and R: Example 8.26: reading data with variable number of words in a field

Tuesday, February 22, 2011

Example 8.26: reading data with variable number of words in a field

A student came with a question about how to snag data from a PDF report for analysis. Once she'd copied things her text file looked like:


1 Las Vegas, NV --- 53.3 --- --- 1
2 Sacramento, CA --- 42.3 --- --- 2
3 Miami, FL --- 41.8 --- --- 3
4 Tucson, AZ --- 41.7 --- --- 4
5 Cleveland, OH --- 38.3 --- --- 5
6 Cincinnati, OH 15 36.4 --- --- 6
7 Colorado Springs, CO --- 36.1 --- --- 7
8 Memphis, TN --- 35.3 --- --- 8
8 New Orleans, LA --- 35.3 --- --- 8
10 Mesa, AZ --- 34.7 --- --- 10
11 Baltimore, MD --- 33.2 --- --- 11
12 Philadelphia, PA --- 31.7 --- --- 12
13 Salt Lake City, UT --- 31.9 17 --- 13

Here the --- means a missing value, and there's some complexity induced because some cities are made up of multiple words (so the number of spaced delimited fields varies). Unfortunately, she had hundreds of such datasets to process.

While section 1.1.3 (p. 3) of the book describes reading more complex files, neither it nor the entry on finding the Amazon sales rank are directly relevant here.

R

In R, we first craft a function that processes a line and converts each field other than the city name into a numeric variable. The function works backwards to snag the appropriate elements, then calculates what is left over to stash in the city variable.


readellaline = function (thisline) {
  thislen = length(thisline)
  id = as.numeric(thisline[1])
  v1 = as.numeric(thisline[thislen-4])
  v2 = as.numeric(thisline[thislen-3])
  v3 = as.numeric(thisline[thislen-2])
  v4 = as.numeric(thisline[thislen-1])
  v5 = as.numeric(thisline[thislen])
  city = paste(thisline[2:(thislen-5)], collapse=" ")
  return(list(id=id,city=city,v1=v1,v2=v2,v3=v3,v4=v4,v5=v5))
}

However, before this function can work, it needs each line to be converted into a character vector containing each "word" (character strings divided by spaces) as a separate element. We'll do this by first reading each line, then split()ting it into words. This results in a list object, where the items in the list are the vectors of words. Then we can call the readellaline() function for each vector using an invocation of sapply() (section 1.3.2, p. 12), which avoids the need for a for loop. The resulting object can be transposed then coerced into a dataframe.


# read the input file
file = readLines("http://www.math.smith.edu/r/data/ella.txt")    
split = strsplit(file, " ")   # split up fields for each line  
processed = as.data.frame(t(sapply(split, readellaline)))
processed

This generates the following output:


   id                 city v1   v2 v3 v4 v5
1   1        Las Vegas, NV NA 53.3 NA NA  1
2   2       Sacramento, CA NA 42.3 NA NA  2
3   3            Miami, FL NA 41.8 NA NA  3
4   4           Tucson, AZ NA 41.7 NA NA  4
5   5        Cleveland, OH NA 38.3 NA NA  5
6   6       Cincinnati, OH 15 36.4 NA NA  6
7   7 Colorado Springs, CO NA 36.1 NA NA  7
8   8           Memphis,TN NA 35.3 NA NA  8
9   8      New Orleans, LA NA 35.3 NA NA  8
10 10             Mesa, AZ NA 34.7 NA NA 10
11 11        Baltimore, MD NA 33.2 NA NA 11
12 12     Philadelphia, PA NA 31.7 NA NA 12
13 13   Salt Lake City, UT NA 31.9 17 NA 13

SAS
The input tools in SAS can accommodate this data directly, without resorting to reading the data once and then processing it. However, two separate special tools are needed. These are: 1) the dlm option to the infile statement, to make both commas and spaces be treated as field delimiters, and 2) the & format modifier, which allows spaces within a variable that's being read in.


data ella4;
  infile "c:\book\ella.txt" dlm=", ";
  input id city & $20. state $2. v1 - v5;
run;

proc print data=ella; run;

In effect, the forgoing input statement instructs SAS that the field following id is to be named city, that it may have spaces in it, and that it is a character variable with a length of up to 20 characters; it will read into that variables for 20 spaces, or until the next non-space delimiter. The dlm option means that a comma is a delimiter.

Examining the log resulting from running the preceding code will reveal many notes regarding "invalid data", corresponding to the dashes. However, these result (correctly) in missing data codes, so they can safely be ignored.


Obs   id   city               state   v1    v2    v3   v4   v5

  1    1   Las Vegas           NV      .   53.3    .    .    1
  2    2   Sacramento          CA      .   42.3    .    .    2
  3    3   Miami               FL      .   41.8    .    .    3
  4    4   Tucson              AZ      .   41.7    .    .    4
  5    5   Cleveland           OH      .   38.3    .    .    5
  6    6   Cincinnati          OH     15   36.4    .    .    6
  7    7   Colorado Springs    CO      .   36.1    .    .    7
  8    8   Memphis             TN      .   35.3    .    .    8
  9    8   New Orleans         LA      .   35.3    .    .    8
 10   10   Mesa                AZ      .   34.7    .    .   10
 11   11   Baltimore           MD      .   33.2    .    .   11
 12   12   Philadelphia        PA      .   31.7    .    .   12
 13   13   Salt Lake City      UT      .   31.9   17    .   13

1 comment:

Anonymous said...: file <- readLines("http://www.math.smith.edu/r/data/ella.txt")
file <- gsub("^([0-9]* )(.*),( .*)$","\\1'\\2'\\3",file)
tc <- textConnection(file)
processed <- read.table(tc,sep=" ",na.string="---")
close(tc); February 23, 2011 at 2:38 PM

Post a Comment

Reviews (from the first edition)

"By placing the R and SAS solutions together and by covering a vast array of tasks in one book, Kleinman and Horton have added surprising value and searchability to the information in their book. … a home run, and it is a book I am grateful to have sitting, dust-free, on my shelf."
—Robert Alan Greevy, Jr, Teaching of Statistics in the Health Sciences

"I use SAS and R on a daily basis. Each has strengths and weaknesses, and using both of them gives the advantage of being able to do almost anything when it comes to data manipulation, analysis, and graphics. If you use both SAS and R on a regular basis, get this book. If you know one of the packages and are learning the other, you may need more than this book, but get this book, too. "

Charles Heckler, University of Rochester, Technometrics

"Excellent cross-referencing to other topics and end-of-chapter worked examples on the ‘Health evaluation and linkage to primary care’ data set are given with each topic. … users who are proficient in either of the software packages but with the need to use the other will find this book useful."
—Frances Denny, Journal of the Royal Statistical Society, Series A

About the authors

Nicholas Horton is a Professor of Statistics at Amherst College. He is a biostatistician with expertise in missing data methods, longitudinal regression, statistical computing and statistical education. Nick's home page; Nick's Google Scholar author page

Ken Kleinman is an Associate Professor with the Department of Biostatistics and Epidemiology at the University of Massachusetts, Amherst. He is a consulting biostatistician with expertise in group-randomized trials and disease surveillance; he also offers R training courses. Ken's home page; Ken's Google Scholar author page.

SAS and R

Catalogs of posts

Tuesday, February 22, 2011

Example 8.26: reading data with variable number of words in a field

1 comment:

About SAS and R

Topics discussed