Tuesday, February 22, 2011

Example 8.26: reading data with variable number of words in a field

A student came with a question about how to snag data from a PDF report for analysis. Once she'd copied things her text file looked like:

1 Las Vegas, NV --- 53.3 --- --- 1
2 Sacramento, CA --- 42.3 --- --- 2
3 Miami, FL --- 41.8 --- --- 3
4 Tucson, AZ --- 41.7 --- --- 4
5 Cleveland, OH --- 38.3 --- --- 5
6 Cincinnati, OH 15 36.4 --- --- 6
7 Colorado Springs, CO --- 36.1 --- --- 7
8 Memphis, TN --- 35.3 --- --- 8
8 New Orleans, LA --- 35.3 --- --- 8
10 Mesa, AZ --- 34.7 --- --- 10
11 Baltimore, MD --- 33.2 --- --- 11
12 Philadelphia, PA --- 31.7 --- --- 12
13 Salt Lake City, UT --- 31.9 17 --- 13

Here the --- means a missing value, and there's some complexity induced because some cities are made up of multiple words (so the number of spaced delimited fields varies). Unfortunately, she had hundreds of such datasets to process.

While section 1.1.3 (p. 3) of the book describes reading more complex files, neither it nor the entry on finding the Amazon sales rank are directly relevant here.

R

In R, we first craft a function that processes a line and converts each field other than the city name into a numeric variable. The function works backwards to snag the appropriate elements, then calculates what is left over to stash in the city variable.

readellaline = function (thisline) {
thislen = length(thisline)
id = as.numeric(thisline[1])
v1 = as.numeric(thisline[thislen-4])
v2 = as.numeric(thisline[thislen-3])
v3 = as.numeric(thisline[thislen-2])
v4 = as.numeric(thisline[thislen-1])
v5 = as.numeric(thisline[thislen])
city = paste(thisline[2:(thislen-5)], collapse=" ")
return(list(id=id,city=city,v1=v1,v2=v2,v3=v3,v4=v4,v5=v5))
}

However, before this function can work, it needs each line to be converted into a character vector containing each "word" (character strings divided by spaces) as a separate element. We'll do this by first reading each line, then split()ting it into words. This results in a list object, where the items in the list are the vectors of words. Then we can call the readellaline() function for each vector using an invocation of sapply() (section 1.3.2, p. 12), which avoids the need for a for loop. The resulting object can be transposed then coerced into a dataframe.

# read the input file
file = readLines("http://www.math.smith.edu/r/data/ella.txt")
split = strsplit(file, " ") # split up fields for each line
processed = as.data.frame(t(sapply(split, readellaline)))
processed

This generates the following output:

id city v1 v2 v3 v4 v5
1 1 Las Vegas, NV NA 53.3 NA NA 1
2 2 Sacramento, CA NA 42.3 NA NA 2
3 3 Miami, FL NA 41.8 NA NA 3
4 4 Tucson, AZ NA 41.7 NA NA 4
5 5 Cleveland, OH NA 38.3 NA NA 5
6 6 Cincinnati, OH 15 36.4 NA NA 6
7 7 Colorado Springs, CO NA 36.1 NA NA 7
8 8 Memphis,TN NA 35.3 NA NA 8
9 8 New Orleans, LA NA 35.3 NA NA 8
10 10 Mesa, AZ NA 34.7 NA NA 10
11 11 Baltimore, MD NA 33.2 NA NA 11
12 12 Philadelphia, PA NA 31.7 NA NA 12
13 13 Salt Lake City, UT NA 31.9 17 NA 13


SAS
The input tools in SAS can accommodate this data directly, without resorting to reading the data once and then processing it. However, two separate special tools are needed. These are: 1) the dlm option to the infile statement, to make both commas and spaces be treated as field delimiters, and 2) the & format modifier, which allows spaces within a variable that's being read in.

data ella4;
infile "c:\book\ella.txt" dlm=", ";
input id city & $20. state $2. v1 - v5;
run;

proc print data=ella; run;

In effect, the forgoing input statement instructs SAS that the field following id is to be named city, that it may have spaces in it, and that it is a character variable with a length of up to 20 characters; it will read into that variables for 20 spaces, or until the next non-space delimiter. The dlm option means that a comma is a delimiter.

Examining the log resulting from running the preceding code will reveal many notes regarding "invalid data", corresponding to the dashes. However, these result (correctly) in missing data codes, so they can safely be ignored.

Obs id city state v1 v2 v3 v4 v5

1 1 Las Vegas NV . 53.3 . . 1
2 2 Sacramento CA . 42.3 . . 2
3 3 Miami FL . 41.8 . . 3
4 4 Tucson AZ . 41.7 . . 4
5 5 Cleveland OH . 38.3 . . 5
6 6 Cincinnati OH 15 36.4 . . 6
7 7 Colorado Springs CO . 36.1 . . 7
8 8 Memphis TN . 35.3 . . 8
9 8 New Orleans LA . 35.3 . . 8
10 10 Mesa AZ . 34.7 . . 10
11 11 Baltimore MD . 33.2 . . 11
12 12 Philadelphia PA . 31.7 . . 12
13 13 Salt Lake City UT . 31.9 17 . 13

1 comment:

wojteksobala said...

file <- readLines("http://www.math.smith.edu/r/data/ella.txt")
file <- gsub("^([0-9]* )(.*),( .*)$","\\1'\\2'\\3",file)
tc <- textConnection(file)
processed <- read.table(tc,sep=" ",na.string="---")
close(tc)