Wednesday, May 11, 2011

Example 8.37: Read sheets from an excel file

Microsoft Excel is an awkward tool for data analysis. However, it is a reasonable environment for recording and transfering data. In our consulting practice, people frequently send us data in .xls (from Excel 97-2003) or .xlsx (from Excel 2007 or 2010) formatted files.

In order to use the data in statistical software, you have to get it out of Excel. While Excel does provide some tools for exporting data, these are not easily replicable, since they rely on menu choices through the GUI. A better approach is to read the file directly from within the statistical software.

An additional complication is that a single file may contain several sheets, each of which may have unique columns and rows. While importing from Excel into SAS is shown in section 1.1.5, we don't discuss reading from specific sheets or show how to read an Excel file in R.

SAS

In SAS, it's possible to use the "Import Data" wizard to gain access via the GUI (File; Import Data; etc.) but this is no better than using the GUI in Excel. However, all the wizard does is compose a proc import to read from the file. This turns out to be important, because the documentation for using proc import for Excel files is hard to find.

The documentation is buried in the on-line help at SAS Products; SAS/ACCESS; SAS/ACCESS 9.2 for PC Files: Reference; File Format-Specific Reference; Microsoft Excel Workbook Files. This is not the same material found through SAS Products; SAS Procedures; Proc Import. The code below was derived by running the wizard and using its option to save the resulting commands. The help.xlsx file can be downloaded from the book website; SAS currently cannot read an Excel file in directly from a URL-- you must download the file manually and read it locally.

PROC IMPORT OUT= WORK.test
DATAFILE= "C:\temp\help.xlsx"
DBMS=EXCEL REPLACE;
RANGE="help.csv";
GETNAMES=YES;
MIXED=NO;
SCANTEXT=YES;
USEDATE=YES;
SCANTIME=YES;
RUN;

The range option can be used to specify the desired sheet. This means that you must know the name of the sheet you want to import and type it in. The trailing "$" can be used to read in just a specific range of cells, but with no additional information it implies the full sheet. Since there is only one sheet in the helpdata.xlsx file, it's not required here.

R

The foreign package reads data from many file types. However, the .xls and .xlsx formats are not among them, as far as we know. Fortunately, the files can be read with the gdata package, using the read.xls() function. This will read files in either the .xls or the .xlsx format.

library(gdata)
ffdata = read.xls("http://www.math.smith.edu/r/data/help.xlsx",
sheet=1)

In this implementation, you specify the sheet by number, rather than name. This may be less precise than using the full name, but it does spare some tedious typing.

8 comments:

Harlan said...

Interestingly, the gdata read.xls() function is actually just calling some Perl that does the actual parsing of the XLS file.

Another option is to upload the Excel file to Google Docs, then pull down the CSV view of the spreadsheet directly from R.

Anonymous said...

In SAS, why not just use:
libname source "C:\temp\help.xlsx" access=readonly ;

then you can reference the sheet directly from any proc/data step any time you want with "source.'sheetname$'N".

Gabor said...

Although the read.xls documentation in gdata says that the sheet argument is a sheet number it does work if its a sheet name as well.

Andy said...

Its a little more complicated to use, but XLConnect provides the ability to read and write from / to Excel files.

http://cran.r-project.org/web/packages/XLConnect/index.html

For the record. I do not work for the developers. I am however, a big fan of their product.

Ken Kleinman said...

Thanks for the great comments, folks!

Anonymous said...

Another option for .xls & .xlsx files in R is the RODBC package:

library(RODBC)
xls <- odbcConnectExcel2007("http://www.math.smith.edu/r/data/help.xlsx")
ffdata <- sqlFetch(xls,"sheet.name")

The possible advantage of RODBC is that it is also the way to grab data from Access and true SQL databases; I avoid .xls files if at all possible in favor of smaller, cleaner .csv.

I use and recommend your R and Data Management book in webinars on R for natural resources.

Ken Kleinman said...

Thanks for mentioning us, Anonymous!

Anonymous said...

the excel engine does not allow options, such as "get names" for designating (or not) the first row as a variable name or the "mixed values" option. and Import will allow for the selection of a range, which can be the entire sheet or a range of cells, in addition to an option "sheet" that selects the sheet by name.