
Friday, April 25, 2014

Example 2014.5: Simple mean imputation

We're both users of multiple imputation for missing data. We believe it is the most practical principled method for incorporating the most information into data analysis. In fact, one of our more successful collaborations is a review of software for multiple imputation.

But, for me at least, there are times when a simpler form of imputation may be useful. For example, it may be desirable to calculate the mean of the observed values and substitute it for any missing values. Typically it would be unwise to attempt to use a data set completed in this way for formal inference, but it could be convenient under deadline pressure or for a very informal overview of the data.

Nick disagrees. He finds it hard to imagine any setting in which he would ever use such a primitive approach. He passes on to the reader the sage advice he received in graduate school: that making up data in such an ad-hoc fashion might be construed as dereliction or even misconduct. Use of single imputation approaches (which yield bias in many settings and attenuate estimates of variance) seems hard to justify in 2014. But one of the hallmarks of our partnership is that we can agree to disagree on an absolute ban, while jointly advising the reader to proceed with great caution.

SAS
In SAS, it would be possible to approach this using proc means to find the means and then add them back into the data set in a data step. But there is a simpler way, using proc standard.
proc standard data=indata out=outdata replace; 
run;
This will replace the values of all missing numeric variables in the indata data set with the mean of the observed values, and save the result in a new data set, outdata. To restrict the operation to specific variables, use a var statement.

R
There may be a function designed to do this in R, but it's simple enough using the features of the language. We provide an option using the bracket ([) extractor operator and another using the ifelse() function. The latter may be more approachable for those less familiar with R.
df = data.frame(x = 1:20, y = c(1:10,rep(NA,10)))
df$y[is.na(df$y)] = mean(df$y, na.rm=TRUE)
# alternative
df = transform(df, y = ifelse(is.na(y), mean(y, na.rm=TRUE), y))
 
In the first example, we identify the elements of y that are NA and replace them with the mean. In the second, we test each element of y: if it is NA, we replace it with the mean; otherwise we keep the original value.
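The same idea translates directly to other languages; here is a minimal Python sketch (our illustration, not part of the original R/SAS comparison) that mirrors the ifelse() version:

```python
# Mean-impute missing values in a list, mirroring the R ifelse() idiom.
def mean_impute(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = sum(observed) / len(observed)
    return [m if v is None else v for v in values]

# Same shape as the R example: y = c(1:10, rep(NA, 10))
y = list(range(1, 11)) + [None] * 10
y_imputed = mean_impute(y)
print(y_imputed[10])  # each missing entry becomes mean(1:10) = 5.5
```

As in the R version, every missing entry is replaced by the single observed mean, so the variance of the completed variable is artificially deflated; hence the caution above.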

An unrelated note about aggregators: We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

Monday, May 21, 2012

Example 9.32: Multiple testing simulation

In examples 9.30 and 9.31 we explored corrections for multiple testing and then extracting p-values adjusted by the Benjamini and Hochberg (or FDR) procedure. In this post we'll develop a simulation to explore the impact of "strong" and "weak" control of the family-wise error rate offered in multiple comparison corrections. Loosely put, weak control procedures may fail when some of the null hypotheses are actually false, in that the remaining (true) nulls may be rejected more than the nominal proportion of times.

For our simulation, we'll develop flexible code to generate some p-values from false nulls and others from true nulls. We'll assume that the true nulls have p-values distributed uniform (0,1); the false nulls will have p-values distributed uniform with a user-determined maximum. We'll also allow the number of tests overall and the number of false nulls to be set.
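To make the setup concrete, here is a language-neutral sketch of the same simulation in Python (our illustration; the Benjamini-Hochberg adjustment is implemented by hand rather than taken from any package, so treat it as a sketch of the p.adjust logic):

```python
import random

def bh_adjust(pvals):
    """Benjamini-Hochberg (FDR) adjusted p-values, as in R's p.adjust(p, "fdr")."""
    n = len(pvals)
    # Walk from the largest p-value down, carrying the running minimum
    # of p * n / rank so the adjusted values stay monotone.
    order = sorted(range(n), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for k, i in enumerate(order):
        rank = n - k  # rank of pvals[i] among the sorted p-values
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

def sim_family(ntests=20, nfalse=10, howfalse=0.001):
    """One family: nfalse false nulls, uniform (0, howfalse); the rest uniform (0, 1)."""
    p = ([random.random() * howfalse for _ in range(nfalse)] +
         [random.random() for _ in range(ntests - nfalse)])
    adjusted = bh_adjust(p)
    return sum(a < 0.05 for a in adjusted) / ntests  # proportion rejected

random.seed(42)
props = [sim_family() for _ in range(1000)]
```

With these defaults the 10 false nulls are essentially always rejected, so any prop above 0.5 signals rejection of one or more true nulls, which is exactly what the weak/strong control comparison is probing.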

SAS
In SAS, a macro does the job. It accepts the user parameters described above, then generates false and true nulls for each desired simulation. With the data created, we can use proc multtest to apply the FDR procedure, with the ODS system saving the results. Note how the by statement allows us to replicate the analysis for each simulated set of p-values without creating a separate data set for each one. (Also note that we do not use proc sort before that by statement-- this can be risky, but works fine here.)

%macro fdr(nsims=1, ntests = 20, nfalse=10, howfalse=.01);
ods select none;
data test;
  do sim = 1 to &nsims;
    do i = 1 to &ntests;
      raw_p = uniform(0) *
        ( ((i le &nfalse) * &howfalse ) + ((i gt &nfalse) * 1 ) );
      output;
    end;
  end;
run;

ods output pvalues = __pv;
proc multtest inpvalues=test fdr;
  by sim;
run;

With the results in hand, (still within the macro) we need to do some massaging to make the results usable. First we'll recode the rejections (assuming a 0.05 alpha level) so that non-rejections are 0 and rejections are 1/number of tests. That way we can just sum across the results to get the proportion of rejections. Next, we transform the data to get each simulation in a row (section 1.5.4). (The data output from proc multtest has nsims*ntests rows. After transposing, there are nsims rows.) Finally, we can sum across the rows to get the proportion of tests rejected in each simulated family of tests. The results are shown in a table made with proc freq.

data __pv1;
  set __pv;
  if falsediscoveryrate lt 0.05 then fdrprop = 1/&ntests;
  else fdrprop = 0;
run;

proc transpose data = __pv1 (keep = sim fdrprop) out = pvals_a;
  by sim;
run;

data pvals;
  set pvals_a;
  prop = sum(of col1 - col&ntests);
run;
ods select all;

proc freq data = pvals;
  tables prop;
run;
%mend fdr;

%fdr(nsims = 1000, ntests = 20, nfalse = 10, howfalse=.001);

                                     Cumulative    Cumulative
 prop    Frequency     Percent      Frequency       Percent
 -------------------------------------------------------------
  0.5         758       75.80            758         75.80
 0.55         210       21.00            968         96.80
  0.6          27        2.70            995         99.50
 0.65           5        0.50           1000        100.00

So one or more true nulls were rejected in about 24% of the simulated families, which seems like a lot. Multiple comparison procedures with "strong" control of the familywise error rate will reject any true nulls only 5% of the time. Building this simulation as a macro facilitates exploring the effects of the multiple comparison procedures in a variety of settings.
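The 24% figure can be read directly off the table: it is the total percent of families with prop above 0.5, i.e., families in which more than the 10 false nulls were rejected. A quick arithmetic check (our addition, in Python):

```python
# Percent column from the proc freq table above, keyed by prop.
percents = {0.5: 75.80, 0.55: 21.00, 0.6: 2.70, 0.65: 0.50}

# Families with prop > 0.5 rejected at least one true null.
excess = sum(pct for prop, pct in percents.items() if prop > 0.5)
print(round(excess, 1))  # 24.2
```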

R
As in example 9.31, the R code is rather simpler, though perhaps a bit opaque. To make the p-values, we generate them first for all of the tests with false nulls, then for all of the tests with true nulls. The matrix() function reads these in by column by default, meaning that the first nfalse columns get the nsims*nfalse false-null p-values. The apply() function generates the FDR p-values for each row of the data set. The t() function just transposes the resulting matrix so that we get back a row for each simulation. As in the SAS version, we'll count each rejection as 1/ntests and each non-rejection as 0; we do this with the ifelse() function. Then we sum across the simulations with another call to apply() and show the results with a simple table.

checkfdr = function(nsims=1, ntests=100, nfalse=0, howfalse=0.001) {
  raw_p = matrix(c(runif(nfalse * nsims) * howfalse,
                   runif((ntests - nfalse) * nsims)), nrow=nsims)
  fdr = t(apply(raw_p, 1, p.adjust, "fdr"))
  reject = ifelse(fdr < .05, 1/ntests, 0)
  prop = apply(reject, 1, sum)
  prop.table(table(prop))
}

> checkfdr(nsims=1000, ntests=20, nfalse=10, howfalse=.001)
prop
0.5 0.55 0.6 0.65
0.755 0.210 0.032 0.003

The results are reassuringly similar to those from SAS. In this R code, it's particularly simple to try a different test-- just replace "fdr" in the p.adjust() call. Here's the result with the Hochberg test, which has strong control.

checkhoch = function(nsims=1, ntests=100, nfalse=0, howfalse=0.001) {
  pvals = matrix(c(runif(nfalse * nsims) * howfalse,
                   runif((ntests - nfalse) * nsims)), nrow=nsims)
  hochberg = t(apply(pvals, 1, p.adjust, "hochberg"))
  reject = ifelse(hochberg < .05, 1/ntests, 0)
  prop = apply(reject, 1, sum)
  prop.table(table(prop))
}

> checkhoch(nsims=1000, ntests=20, nfalse=10, howfalse=.001)
prop
0.5 0.55 0.6
0.951 0.046 0.003

With this procedure one or more of the true nulls is rejected an appropriate 4.9% of the time. For the most part, we feel more comfortable using multiple testing procedures with "strong control".



Monday, April 30, 2012

Example 9.29: the perils of for loops

A recent exchange on the R-sig-teaching list featured a discussion of how best to teach new students R. The initial post included an exercise to write a function that, given n, will draw n rows of a triangle made up of "*", noting that for a beginner this may require two for loops. For example, in pseudo-code:

for i = 1 to n
for j = 1 to i
print "*"

Unfortunately, as several folks (including Richard M. Heiberger and R. Michael Weylandt) noted, for loops in general are not the best way to take full advantage of R. In this entry, we review two solutions they proposed which fit within the R philosophy.

Richard's solution uses the outer() function to generate a 5x5 matrix of logical values indicating whether the row number is greater than or equal to the column number. Next the ifelse() function is used to replace TRUE with "*" and FALSE with a space.

> ifelse(outer(1:5, 1:5, `>=`), "*", " ")
[,1] [,2] [,3] [,4] [,5]
[1,] "*" " " " " " " " "
[2,] "*" "*" " " " " " "
[3,] "*" "*" "*" " " " "
[4,] "*" "*" "*" "*" " "
[5,] "*" "*" "*" "*" "*"
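The outer() trick is not R-specific; the same row/column comparison can be sketched in Python with a nested comprehension (our illustration):

```python
# Build the triangle by comparing row and column indices,
# as outer(1:5, 1:5, `>=`) does in the R solution above.
n = 5
tri = [["*" if col <= row else " " for col in range(1, n + 1)]
       for row in range(1, n + 1)]
for row in tri:
    print(" ".join(row))
```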

Michael's solution uses the lapply() function to call a function repeatedly for different values of n. This returns a list rather than a matrix, but accomplishes the same task.

> lapply(1:5, function(x) cat(rep("*", x), "\n"))
*
* *
* * *
* * * *
* * * * *

While this exercise is of little practical value, it does illustrate some important points, and provides a far more efficient as well as elegant way of accomplishing the task. For those interested in more, another resource is the R Inferno project of Patrick Burns.

SAS
We demonstrate a SAS data step solution mainly to call out some useful features and cautions. In all likelihood a proc iml matrix-based solution would be more elegant.

data test;
  array star [5] $ star1 - star5;
  do i = 1 to 5;
    star[i] = "*";
    output;
  end;
run;

proc print noobs; var star1 - star5; run;

star1 star2 star3 star4 star5

*
* *
* * *
* * * *
* * * * *

In particular, note the $ in the array statement, which allows the variables to contain characters; by default variables created by an array statement are numeric. In addition, note the reference to a sequentially suffixed list of variables using the single hyphen shortcut; this would help in generalizing to n rows. Finally, note that we were able to avoid a second do loop (SAS' primary iterative looping syntax) mainly by luck-- the most recently generated value of a variable is saved by default. This can cause trouble, in general, but here it keeps all the previous "*"s when moving on to the next row.





Monday, June 13, 2011

Example 8.40: Side-by-side histograms


It's often useful to compare histograms for some key variable, stratified by levels of some other variable. There are several ways to display something like this. The simplest may be to plot the two histograms in separate panels.

SAS
In SAS, the most direct and generalizable approach is through the sgpanel procedure.


proc sgpanel data = 'c:\book\help.sas7bdat';
  panelby female;
  histogram cesd;
run;

The results are shown above.


R
In R, the lattice package provides a similarly direct approach.

ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
ds$gender = ifelse(ds$female==1, "female", "male")
library(lattice)
histogram(~ cesd | gender, data=ds)

The results are shown below.

Tuesday, January 18, 2011

Example 8.21: latent class analysis

Latent class analysis is a technique used to classify observations based on patterns of categorical responses. Collins and Lanza's book, "Latent Class and Latent Transition Analysis," provides a readable introduction, while the UCLA ATS center has an online statistical computing seminar on the topic.

We consider an example analysis from the HELP dataset, where we wish to classify subjects based on their observed (manifest) status on the following variables: 1) on street or in shelter in past 180 days [homeless], 2) CESD score above 20, 3) received substance abuse treatment [satreat], or 4) linked to primary care [linkstatus]. We arbitrarily specify a three class solution.

SAS
Support for this method in SAS is available through the proc lca and proc lta add-on routines created and distributed by the Methodology Center at Penn State University. While it's customary in R to use researcher-written routines, it's less so for SAS; the machinery which allows independently written procs thus has the potential to mislead users. It bears explicitly stating that third-party procs probably don't have the same level of robustness or support as those distributed by SAS Institute.

The proc lca code assumes that the data exist in the dataset ds. The current coding of 0's and 1's needs to be changed to 1's and 2's.

data ds_0; set "c:\book\help.sas7bdat"; run;

data ds; set ds_0;
  homeless = homeless + 1;
  cesdcut = (cesd > 20) + 1;
  satreat = satreat + 1;
  linkstatus = linkstatus + 1;
run;

The call to the LCA procedure specifies the number of classes, the variables to include, the number of categories per variable, and information about the starting values and random starts. It's highly recommended to run a "large" number of random starts to ensure that the true maximum likelihood estimate is reached (the 20 we used is likely too few for more complex models).

proc lca data=ds;
  title '3 class model';
  nclass 3;
  items homeless cesdcut satreat linkstatus;
  categories 2 2 2 2;
  seed 42;
  nstarts 20;
run;

The output begins with diagnostic information, and indicates that 40% of the seeds were associated with the best fitting model.

Data Summary, Model Information, and Fit Statistics (EM
Algorithm)

Number of subjects in dataset: 431
Number of subjects in analysis: 431

Number of measurement items: 4
Response categories per item: 2 2 2 2
Number of groups in the data: 1
Number of latent classes: 3
Rho starting values were randomly generated (seed = 42).

No parameter restrictions were specified (freely estimated).

Seed selected for best fitted model: 1486228051
Percentage of seeds associated with best fitted model: 40.00%

The model converged in 3241 iterations.

Maximum number of iterations: 5000
Convergence method: maximum absolute deviation (MAD)
Convergence criterion: 0.000001000

A number of fit statistics are provided to help with model comparison (e.g. number of classes, constraints in more complex models).

=============================================
Fit statistics:
=============================================
Log-likelihood: -1032.48
G-squared: 1.22
AIC: 29.22
BIC: 86.15
CAIC: 100.15
Adjusted BIC: 41.72
Entropy R-sqd.: 0.94
Degrees of freedom: 1

The results indicate that 22% of subjects are in class 1, just 8% in class 2, and 70% in class 3.

Parameter Estimates
Gamma estimates (class membership probabilities):
Class: 1 2 3
0.2163 0.0785 0.7052

The next set of output describes the classes. The prevalence for each level of each variable is described for each class. The last response category is redundant (equal to 1 minus the sum of the other probabilities).

Rho estimates (item response probabilities):
Response category 1:
Class: 1 2 3
homeless : 0.2703 1.0000 0.5625
cesdcut : 0.1154 0.4214 0.1678
satreat : 0.0004 0.0000 1.0000
linkstatus : 0.6029 1.0000 0.5855

Response category 2:
Class: 1 2 3
homeless : 0.7297 0.0000 0.4375
cesdcut : 0.8846 0.5786 0.8322
satreat : 0.9996 1.0000 0.0000
linkstatus : 0.3971 0.0000 0.4145
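As a sanity check on the redundancy noted above, the two response-category probabilities for each item must sum to one. Here are the class 1 values copied from the output, verified in Python (our addition):

```python
# Rho estimates for class 1, copied from the proc lca output above.
category1 = {"homeless": 0.2703, "cesdcut": 0.1154,
             "satreat": 0.0004, "linkstatus": 0.6029}
category2 = {"homeless": 0.7297, "cesdcut": 0.8846,
             "satreat": 0.9996, "linkstatus": 0.3971}

for item in category1:
    total = category1[item] + category2[item]
    assert abs(total - 1.0) < 1e-6  # category 2 is 1 minus category 1
```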

Members of class 1 were primarily homeless subjects with a larger proportion of high scores on the CESD, with substance abuse treatment history, and 40% of whom linked to primary care. Class 2 (the smallest group) was comprised of non-homeless subjects with lower CESD scores, substance abuse treatment, but no linkage. Class 3 was 44% homeless, had high levels of CESD, did not report substance abuse treatment, and 41% linked to primary care.

R

We begin by reading in the data. Then we use the within() function (section 1.3.1) to add the dichotomized CESD variable to the data frame.

ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
ds = within(ds, cesdcut <- ifelse(cesd > 20, 1, 0))


The poLCA package supports estimation of latent class models in R. The poLCA() function, like proc lca, can incorporate polytomous categorical variables; also like proc lca, it requires the variables to be coded starting with positive integers. We specify 10 repetitions (with random starting values).

library(poLCA)
res2 = poLCA(cbind(homeless=homeless+1,
                   cesdcut=cesdcut+1, satreat=satreat+1,
                   linkstatus=linkstatus+1) ~ 1,
             maxiter=50000, nclass=3,
             nrep=10, data=ds)

This generates the following output:

Model 1: llik = -1032.889 ... best llik = -1032.889
Model 2: llik = -1032.889 ... best llik = -1032.889
Model 3: llik = -1032.484 ... best llik = -1032.484
Model 4: llik = -1032.889 ... best llik = -1032.484
Model 5: llik = -1032.889 ... best llik = -1032.484
Model 6: llik = -1032.484 ... best llik = -1032.484
Model 7: llik = -1032.484 ... best llik = -1032.484
Model 8: llik = -1032.889 ... best llik = -1032.484
Model 9: llik = -1032.889 ... best llik = -1032.484
Model 10: llik = -1032.889 ... best llik = -1032.484
Conditional item response (column) probabilities,
by outcome variable, for each class (row)

$homeless
Pr(1) Pr(2)
class 1: 0.2703 0.7297
class 2: 1.0000 0.0000
class 3: 0.5625 0.4375

$cesdcut
Pr(1) Pr(2)
class 1: 0.1154 0.8846
class 2: 0.4213 0.5787
class 3: 0.1678 0.8322

$satreat
Pr(1) Pr(2)
class 1: 0 1
class 2: 0 1
class 3: 1 0

$linkstatus
Pr(1) Pr(2)
class 1: 0.6029 0.3971
class 2: 1.0000 0.0000
class 3: 0.5855 0.4145

Estimated class population shares
0.2162 0.0785 0.7053

Predicted class memberships (by modal posterior prob.)
0.181 0.1137 0.7053

=========================================================
Fit for 3 latent classes:
=========================================================
number of observations: 431
number of estimated parameters: 14
residual degrees of freedom: 1
maximum log-likelihood: -1032.484

AIC(3): 2092.967
BIC(3): 2149.893
G^2(3): 1.221830 (Likelihood ratio/deviance statistic)
X^2(3): 1.233247 (Chi-square goodness of fit)

The results are consistent with those found in proc lca. We note that, also similar to proc lca, the global maximum likelihood estimate was reached only 3 times out of 10; this can be discerned by examining the log-likelihoods of the 10 model fits. It's always a good idea to run a large number of replications to ensure that the global maximum likelihood estimate has been reached.
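The 3-out-of-10 count can be confirmed by tallying the log-likelihoods printed for the 10 replications (values copied from the poLCA output above; a trivial Python check of our own):

```python
# Log-likelihoods from the 10 poLCA replications above.
lliks = [-1032.889, -1032.889, -1032.484, -1032.889, -1032.889,
         -1032.484, -1032.484, -1032.889, -1032.889, -1032.889]

best = max(lliks)
print(best, lliks.count(best))  # -1032.484 reached in 3 of 10 replications
```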