## Tuesday, February 22, 2011

### Example 8.26: reading data with variable number of words in a field

A student came with a question about how to snag data from a PDF report for analysis. Once she'd copied things her text file looked like:
`1 Las Vegas, NV --- 53.3 --- --- 12 Sacramento, CA --- 42.3 --- --- 23 Miami, FL --- 41.8 --- --- 34 Tucson, AZ --- 41.7 --- --- 45 Cleveland, OH --- 38.3 --- --- 56 Cincinnati, OH 15 36.4 --- --- 67 Colorado Springs, CO --- 36.1 --- --- 78 Memphis, TN --- 35.3 --- --- 88 New Orleans, LA --- 35.3 --- --- 810 Mesa, AZ --- 34.7 --- --- 1011 Baltimore, MD --- 33.2 --- --- 1112 Philadelphia, PA --- 31.7 --- --- 1213 Salt Lake City, UT --- 31.9 17 --- 13`

Here the --- means a missing value, and there's some complexity induced because some cities are made up of multiple words (so the number of spaced delimited fields varies). Unfortunately, she had hundreds of such datasets to process.

While section 1.1.3 (p. 3) of the book describes reading more complex files, neither it nor the entry on finding the Amazon sales rank are directly relevant here.

R

In R, we first craft a function that processes a line and converts each field other than the city name into a numeric variable. The function works backwards to snag the appropriate elements, then calculates what is left over to stash in the city variable.
`readellaline = function (thisline) {  thislen = length(thisline)  id = as.numeric(thisline[1])  v1 = as.numeric(thisline[thislen-4])  v2 = as.numeric(thisline[thislen-3])  v3 = as.numeric(thisline[thislen-2])  v4 = as.numeric(thisline[thislen-1])  v5 = as.numeric(thisline[thislen])  city = paste(thisline[2:(thislen-5)], collapse=" ")  return(list(id=id,city=city,v1=v1,v2=v2,v3=v3,v4=v4,v5=v5))}`

However, before this function can work, it needs each line to be converted into a character vector containing each "word" (character strings divided by spaces) as a separate element. We'll do this by first reading each line, then split()ting it into words. This results in a list object, where the items in the list are the vectors of words. Then we can call the readellaline() function for each vector using an invocation of sapply() (section 1.3.2, p. 12), which avoids the need for a for loop. The resulting object can be transposed then coerced into a dataframe.
`# read the input filefile = readLines("http://www.math.smith.edu/r/data/ella.txt")    split = strsplit(file, " ")   # split up fields for each line  processed = as.data.frame(t(sapply(split, readellaline)))processed`

This generates the following output:
`   id                 city v1   v2 v3 v4 v51   1        Las Vegas, NV NA 53.3 NA NA  12   2       Sacramento, CA NA 42.3 NA NA  23   3            Miami, FL NA 41.8 NA NA  34   4           Tucson, AZ NA 41.7 NA NA  45   5        Cleveland, OH NA 38.3 NA NA  56   6       Cincinnati, OH 15 36.4 NA NA  67   7 Colorado Springs, CO NA 36.1 NA NA  78   8           Memphis,TN NA 35.3 NA NA  89   8      New Orleans, LA NA 35.3 NA NA  810 10             Mesa, AZ NA 34.7 NA NA 1011 11        Baltimore, MD NA 33.2 NA NA 1112 12     Philadelphia, PA NA 31.7 NA NA 1213 13   Salt Lake City, UT NA 31.9 17 NA 13`

SAS
The input tools in SAS can accommodate this data directly, without resorting to reading the data once and then processing it. However, two separate special tools are needed. These are: 1) the dlm option to the infile statement, to make both commas and spaces be treated as field delimiters, and 2) the & format modifier, which allows spaces within a variable that's being read in.
`data ella4;  infile "c:\book\ella.txt" dlm=", ";  input id city & \$20. state \$2. v1 - v5;run;proc print data=ella; run;`

In effect, the forgoing input statement instructs SAS that the field following id is to be named city, that it may have spaces in it, and that it is a character variable with a length of up to 20 characters; it will read into that variables for 20 spaces, or until the next non-space delimiter. The dlm option means that a comma is a delimiter.

Examining the log resulting from running the preceding code will reveal many notes regarding "invalid data", corresponding to the dashes. However, these result (correctly) in missing data codes, so they can safely be ignored.
`Obs   id   city               state   v1    v2    v3   v4   v5  1    1   Las Vegas           NV      .   53.3    .    .    1  2    2   Sacramento          CA      .   42.3    .    .    2  3    3   Miami               FL      .   41.8    .    .    3  4    4   Tucson              AZ      .   41.7    .    .    4  5    5   Cleveland           OH      .   38.3    .    .    5  6    6   Cincinnati          OH     15   36.4    .    .    6  7    7   Colorado Springs    CO      .   36.1    .    .    7  8    8   Memphis             TN      .   35.3    .    .    8  9    8   New Orleans         LA      .   35.3    .    .    8 10   10   Mesa                AZ      .   34.7    .    .   10 11   11   Baltimore           MD      .   33.2    .    .   11 12   12   Philadelphia        PA      .   31.7    .    .   12 13   13   Salt Lake City      UT      .   31.9   17    .   13`

#### 1 comment:

Anonymous said...