Monday, September 10, 2012

Example 10.1: Read a file byte by byte



More and more makers of electronic devices use standard storage media to record data. Sometimes this is central to the device's function, as in a camera, so that the data must be easy to recover. Other times, it's effectively incidental, and the device maker may not provide easy access to the stored data.

For example, I recently was prescribed a constant positive air pressure (CPAP) machine for sleep apnea. My machine, made by Philips, records data onto a SD card. The card's file system is readable, but Philips provides neither software to read the files, nor a data dictionary to explain what they contain. (I believe Philips sells software that does read it, for ludicrous prices, to physicians who prescribe their machines. Nice racket.)

If you open the files on the card as ASCII files, you get a bunch of gobbledygook. But the data they contain is mine, in the most fundamental sense! I want to be able to read it. Fortunately, some folks have done a fair amount of work to reverse engineer the files. Through them, I was able to find some guidance for data from a related machine. Now I know what's in the file, more or less: a header of 25 bytes, 1200 bytes of data-- representing 4 minutes of recording--, and a one-byte footer, repeated ad nauseum (ad somnum? somno contingit?).

Today we show how to read bytes stored in a file as signed integers. For this file, (download,) we also trim out the header and footer, and make a simple line plot of the recorded data, which appear to be some function of the variable pressure with which the CPAP machine outputs air. Next time we'll make a more useful plot.

SAS
In SAS, we can use the infile statement to read in the data (section 1.1.2).


data test;
infile "c:\ken\cpap\0000000007.005" recfm=n;
input byte ib1. @@;
run;

The recfm=n option tells SAS (for Windows, may differ in other OS) to read the file in binary. The ib1. informat tells SAS to read the bytes in the native format. (We cover reading in various formats in section 1.1.3, A.6.4, and several examples.) The @@ tells SAS to hold this line of input, rather than skipping to a new line, when the data is read. (See Example 8.1) . SAS will read bytes until there are no more to read.

Now I have a file with 128,680 observations, each being a signed integer. Some of these are actually nonsense, since the header and footer contain data stored in a variety of formats. To get rid of the header, we'll use the _n_ implied variable (section 1.4.15) which is effectively the line number, in conjunction with the mod function. While we're processing the data set anyhow, we'll also figure out the total elapsed time, which will be useful for plotting.

data cycles;
set test;
if mod(_n_,1226) ge 25 and mod(_n_,1226) lt 1225;
/* otherwise it's a header or the footer */
time_min = (4 * int(_n_/1226)) + (mod(_n_, 1226) - 24 )/(300);
/* 4 minutes for each header-data-footer block +
number of measurements in this data block / 300
(measurements per minute).
run;

Now it's easy to plot the data-- a simple connected line plot across time makes sense, and can be made using the symbol statement with the i=j (j for join) syntax. The result is shown above.

symbol i = j v = none c = black;
proc gplot data = cycles;
where time_min le 4;
plot byte * time_min;
run;
quit;


R
In R, we'll use the readBin() function to actually read the file, but we need to do a little prep, first. The readBin() function requires we input the number of data elements to read. An overestimate is OK, but we can easily find the exact length of the file using the file.info function; the resulting object has a size constituent with the number of bytes. We'll also need a "connection" to the file, which is established in a call to the file() function.

finfo = file.info("0000000007.005")
toread= file("0000000007.005", "rb")
alldata = readBin(toread, integer(), size=1, n = finfo$size, endian="little")

The size option is the length of the elements, in bytes, and the endian option helps describe how the bytes should be read.

Analogous to SAS, the alldata vector has 128,680 integers. All that remains is to remove the headers and footers. We'll do that by making a logical test with the %% operator, saving the result as a vector, and selecting out the data from among the headers and footers using this logical. All that then remains is to plot the data-- we replicate the SAS plot of the first 1200 observations (4 minutes).

keep = 1:finfo$size %% 1226 > 24 & 1:finfo$size %% 1226 < 1225
cycles = alldata[keep]
# cycles gets only the elements of alldata when the corresponding
# element of keep is TRUE

plot(1:1200, cycles[1:1200], type = "l")

The result is shown below. The first four minutes of my night's sleep were apparently characterized by generally lengthening breaths that became increasingly shallow.




An unrelated note about aggregatorsWe love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

1 comment:

Anonymous said...

Wonderful, absolutely wonderful. Keep the CPAP hacks coming!