Showing posts with label lurking variables. Show all posts
Showing posts with label lurking variables. Show all posts

Tuesday, February 7, 2012

Example 9.20: visualizing Simpson's paradox


Simpson's paradox is always amazing to explain to students. What's bad for one group, and bad for another group is good for everyone, if you just collapse over the grouping variable. Unlike many mathematical paradoxes, this arises in a number of real-world settings.

In this entry, we consider visualizing Simpson's paradox, using data from a study of average SAT scores, average teacher salaries and percentage of students taking the SATs at the state level (published by Deborah Lynn Guber, "Getting what you pay for: the debate over equity in public school expenditures" (1999), Journal of Statistics Education 7(2)).

R
The relevant data are available within the mosaic package.

> library(mosaic); trellis.par.set(theme=col.mosaic())
> head(SAT)
state expend ratio salary frac verbal math sat
1 Alabama 4.405 17.2 31.144 8 491 538 1029
2 Alaska 8.963 17.6 47.951 47 445 489 934
3 Arizona 4.778 19.3 32.175 27 448 496 944
4 Arkansas 4.459 17.1 28.934 6 482 523 1005
5 California 4.992 24.0 41.078 45 417 485 902
6 Colorado 5.443 18.4 34.571 29 462 518 980

The paradox manifests itself here if we ignore the fraction of students taking the SAT and instead consider the bivariate relationship between average teacher salary and average SAT score:

> lm(sat ~ salary, data=SAT)

Coefficients:
(Intercept) salary
1158.86 -5.54

Unfortunately, the relationship here appears to be negative: as salary increases,
average SAT score is predicted to decrease.

Luckily for the educators in the audience, once the confounding (aka lurking) variable is accounted for, we see a statistically significant positive relationship:

> summary(lm(sat ~ salary + frac, data=SAT))

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 987.9005 31.8775 30.991 <2e-16 ***
salary 2.1804 1.0291 2.119 0.0394 *
frac -2.7787 0.2285 -12.163 4e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 33.69 on 47 degrees of freedom
Multiple R-squared: 0.8056, Adjusted R-squared: 0.7973
F-statistic: 97.36 on 2 and 47 DF, p-value: < 2.2e-16

That's all well and good, but how to explain what is going on? We can start with a scatterplot, but unless the state names are plotted then it's hard to see what's going on (imagine the plot at the top of this entry without the text labels or the color shading).

It's straightforward to use the panel functionality within the lattice package to create a more useful plot, where states names are plotted rather than points, and different colors are used to represent the low (red), medium (blue) and high (green) values for the fraction of students taking the SAT.

SAT$fracgrp = cut(SAT$frac, breaks=c(0, 22, 49, 81),
labels=c("low", "medium", "high"))
SAT$fraccount = 1 + as.numeric(SAT$fracgrp=="medium") +
2*as.numeric(SAT$fracgrp=="high")
panel.labels = function(x, y, col='black', labels='x',...)
{ panel.text(x,y,labels,col=col,...)}
xyplot(sat ~salary, data=SAT, groups=fracgrp,
cex=0.6, panel=panel.labels, labels=SAT$state,
col=c('red','blue','green')[SAT$fraccount])

We see that within each group, the slope is positive (or at least non-negative), while overall there is a negative slope. Indeed, we see all of the hallmarks of a serious confounder in the correlation matrix for the n=50 observations:

> with(SAT, cor(cbind(sat, frac, salary)))
sat frac salary
sat 1.0000000 -0.8871187 -0.4398834
frac -0.8871187 1.0000000 0.6167799
salary -0.4398834 0.6167799 1.0000000

There's a strong negative association between SAT (Y) and fraction (Z), as well as a strong positive association between salary (X) and fraction (Z).

SAS

We begin by getting the data out of R. This is a snap, thanks to the proc_r macro we discussed here. Just read the macro in, tell it to grab the sat object on its way back from R, then write the R code to read in the data set. This technique would be great for getting data from the histdata package, discussed here, into SAS.

%include "C:\ken\sasmacros\Proc_R.sas";
%Proc_R (SAS2R =, R2SAS = sat);
Cards4;

mynorm = rnorm(10)
mynorm
library(mosaic)
head(SAT)
sat = SAT

;;;;
%Quit;

We'll skip the regression (best accomplished in proc glm) and skip to the helpful plot. We'll need to categorize the fraction taking the test. A natural way to do this in SAS would be to use formats, but we prefer to generate actual data as a more transparent process. To make the plot, we'll use the sgplot procedure. Typically this allows less control than the gplot procedure but the results in this case are quick and attractive. The group= and datalabel= options add the colors and legend, and the state names, respectively.

data sat2; set sat;
if frac le 22 then fracgrp = "Low ";
else if frac le 49 then fracgrp = "Medium";
else if frac gt 49 then fracgrp ="High";
run;

proc sgplot data=sat2;
scatter x = salary y = sat / group=fracgrp datalabel=state;
run; quit;

Monday, April 26, 2010

Example 7.34: Propensity scores and causal inference from observational studies

Propensity scores can be used to help make causal interpretation of observational data more plausible, by adjusting for other factors that may responsible for differences between groups. Heuristically, we estimate the probability of exposure, rather than randomize exposure, as we'd ideally prefer to do. The estimated probability of exposure is the propensity score. If our estimation of the propensity score incorporates the reasons why people self-select to exposure status, then two individuals with equal propensity score are equally likely to be exposed, and we can interpret them as being randomly assigned to exposure. This process is not unlike ordinary regression adjustment for potential confounders, but uses fewer degrees of freedom and can incorporate more variables.

As an example, we consider the HELP data used extensively for examples in our book. Does homelessness affect physical health, as measured by the PCS score from the SF-36?

First, we consider modeling this relationship directly. This analysis only answers the question of whether homelessness is associated with poorer physical health.

Then we create a propensity score by estimating a logistic regression to predict homelessness using age, gender, number of drinks, and mental health composite score. Finally, we include the propensity score in the model predicting PCS from homelessness. If we accept that these propensity predictors fully account for the probability of homelessness, and there is an association between homelessness and PCS in the model adjusting for propensity, and the directionality of the association flows from homelessness to PCS, then we can conclude that homelessness causes differences in PCS.

We note here that this conclusion relies on other untestable assumptions, including linearity in the relationship between the propensity and PCS. Many users of propensity scores prefer to fit models within strata of the propensity score, or to match on propensity score, rather than use the regression adjustment we present in this entry. In a future entry we'll demonstrate the use of matching.

In a departure from our usual practice, we show only pieces of the output below.


SAS

We being by reading in the data and fitting the model. This is effectively a t-test (section 2.4.1), but we use proc glm to more easily compare with the adjusted results.


proc glm data="c:\book\help";
model pcs = homeless/solution;
run;

Standard
Parameter Estimate Error t Value Pr > |t|

Intercept 49.00082904 0.68801845 71.22 <.0001
HOMELESS -2.06404896 1.01292210 -2.04 0.0422


It would appear that homeless patients are in worse health than the others.

We next use proc logistic to estimate the propensity to homelessness, using the output statement to save the predicted probabilities. We omit the output here; it could be excluded in practice using the ODS exclude all statement.


proc logistic data="c:\book\help" desc;
model homeless = age female i1 mcs;
output out=propen pred=propensity;
run;


It's important to make sure that there is a reasonable amount of overlap in the propensity scores between the two groups. Otherwise, we'd be extrapolating outside the range of the data when we adjust.


proc means data=propen;
class homeless;
var propensity;
run;
N
HOMELESS Obs Mean Minimum Maximum
---------------------------------------------------------------
0 244 0.4296704 0.2136791 0.7876000

1 209 0.4983750 0.2635031 0.9642827
---------------------------------------------------------------


The mean propensity to homelessness is larger in the homeless group. If this were not the case, we might be concerned the the logistic model is too poor a predictor of homelessness to generate an effective propensity score. However, the maximum propensity among the homeless is 20% larger than the largest propensity in the non-homeless group. This suggests that a further review of the propensities would be wise. To check them, we'll generate histograms for each group using the proc univariate (section 5.1.4).


proc univariate data=propen;
class homeless;
var propensity;
histogram propensity;
run;




The resulting histograms suggest a some risk of extrapolation. In our model, we'll remove subjects with propensities greater than 0.8.


proc glm data=propen;
where propensity lt .8;
model pcs = homeless propensity/solution;
run;
Standard
Parameter Estimate Error t Value Pr > |t|

Intercept 54.19944991 1.98264608 27.34 <.0001
HOMELESS -1.19612942 1.03892589 -1.15 0.2502
propensity -12.09909082 4.33385405 -2.79 0.0055


After the adjustment, we see a much smaller difference in the physical health of homeless and non-homeless subjects, and find no significant evidence of an association.



R


ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
attach(ds)
summary(lm(pcs ~ homeless))

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 49.001 0.688 71.220 <2e-16 ***
homeless -2.064 1.013 -2.038 0.0422 *


We use the glm() function to fit the logistic
regression model (section 4.1.1). A formula object is used to specify the model. The predicted probabilities that we'll use as propensity scores are in the fitted element of the output object.


form = formula(homeless ~ age + female + i1 + mcs)
glm1 = glm(form, family=binomial)
X = glm1$fitted


As in the SAS development, we check the resulting values. Here we use the fivenum() function (section 2.1.4) with the tapply() function (section 2.1.2) to get the results for each level of homelessness.


> tapply(X,homeless, FUN=fivenum)
$`0`
398 97 378 69 438
0.2136787 0.3464170 0.4040223 0.4984242 0.7876015

$`1`
16 18 262 293 286
0.2635026 0.4002759 0.4739152 0.5768015 0.9642833


Finding the same troubling evidence of non-overlap, we fit a histogram for each group. We do this manually, setting up two output areas with the par() function (section 5.3.6) and conditioning to use data from each homeless value in two calls to the hist() function (section 5.1.4).


par(mfrow=c(2,1))
hist(X[homeless==0], xlim=c(0.2,1))
hist(X[homeless==1], xlim=c(0.2,1))




As noted above, we'll exclude subjects with propensity greater than 0.8. This is done with the subset option to the lm() function (as in section 3.7.4).


summary(lm(pcs ~ homeless + X,subset=(X < .8)))

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 54.199 1.983 27.337 < 2e-16 ***
homeless -1.196 1.039 -1.151 0.25023
X -12.099 4.334 -2.792 0.00547 **