
Thursday, June 12, 2014

Example 2014.6: Comparing medians and the Wilcoxon rank-sum test

A colleague recently contacted us with the following question: "My outcome is skewed-- how can I compare medians across multiple categories?"

What they were asking for was a generalization of the Wilcoxon rank-sum test (also known as the Mann-Whitney-Wilcoxon test, among other monikers) to more than two groups. For the record, the answer is that the Kruskal-Wallis test is to the Wilcoxon what the one-way ANOVA is to the t-test: its generalization to more than two groups.

But this question is based on a false premise: that the Wilcoxon rank-sum test compares medians. The premise rests on a misunderstanding of the null hypothesis of the test. The actual null hypothesis is that a random value from one population has a 50% probability of exceeding a random value from the other population. The practical value of this hypothesis is hard to see, and thus in many places, including textbooks, it is presented as "the two populations have equal medians". The actual null hypothesis can be expressed as this equal-medians hypothesis, but only under the additional assumption that the shapes of the distributions are identical in the two groups.

In other words, our interpretation of the test as comparing the medians of the distributions requires the location-shift-only alternative to be the case. Since this is rarely true, and never assessed, we should probably use extreme caution in using, and especially in interpreting, the Wilcoxon rank-sum test.

To demonstrate this issue, we present a simple simulation with two samples that have equal medians but very different shapes. In one group, the values are exponential with mean 1 and therefore median log 2; in the other, they are normal with mean (and median) log 2 and variance 1. We generate 10,000 observations per group and show that the Wilcoxon rank-sum test rejects the null hypothesis. If we incorrectly interpret the null as applying to the medians, we will be misled. If our interest actually centered on the medians for some reason, an appropriate test that is not sensitive to the shapes of the distributions can be found in quantile regression. Another, of course, is the median test. We show that the quantile regression does not reject the null hypothesis of equal medians, even with this large sample size.

We leave aside the deeper questions of whether comparing medians is a useful substitute for comparing means, or whether means should not be compared when distributions are skewed.
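The quantity in the actual null hypothesis is easy to estimate directly by simulation. The following standard-library Python sketch (ours, not from the original post) uses the same two distributions as the example below: the medians agree, but P(X > Y) does not equal one half.

```python
import math
import random

random.seed(42)
n = 200_000

# Group 1: exponential with mean 1, hence median log 2.
x = [random.expovariate(1.0) for _ in range(n)]
# Group 2: normal with mean (and median) log 2 and sd 1.
y = [random.gauss(math.log(2), 1.0) for _ in range(n)]

# The medians agree...
med_x = sorted(x)[n // 2]
med_y = sorted(y)[n // 2]
print(round(med_x, 3), round(med_y, 3))   # both near log 2 = 0.693

# ...but the quantity in the Wilcoxon null, P(X > Y), is not 1/2.
p = sum(xi > yi for xi, yi in zip(x, y)) / n
print(round(p, 3))                        # noticeably above 0.5
```

With equal medians, the estimated P(X > Y) still comes out well above one half, and this stochastic ordering, not a difference in medians, is what the rank-sum test detects.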

R
In R, we simulate two separate vectors of data, then feed them directly to the wilcox.test() function (section 2.4.2).
y1 = rexp(10000)
y2 = rnorm(10000) + log(2)

wilcox.test(y1,y2)
This shows a very small p-value, reflecting not that the medians are unequal but that one of these distributions generally produces larger values than the other. Hint: it might be the one with the long right tail and no negative values.
 Wilcoxon rank sum test with continuity correction

data:  y1 and y2
W = 55318328, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0
To set up the quantile regression, we put the observations into a single vector and make a group indicator vector. Then we load the quantreg package and use the rq() function (section 4.4.4) to fit the median regression.
y = c(y1,y2)
c = rep(1:2, each = 10000)

library(quantreg)
summary(rq(y~as.factor(c)))
This shows we would fail to reject the null of equal medians.
Call: rq(formula = y ~ as.factor(c))

tau: [1] 0.5

Coefficients:
              Value    Std. Error t value  Pr(>|t|)
(Intercept)    0.68840  0.01067   64.53047  0.00000
as.factor(c)2 -0.00167  0.01692   -0.09844  0.92159
Warning message:
In rq.fit.br(c, y, tau = tau, ...) : Solution may be nonunique
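The median test mentioned above (Mood's median test) dichotomizes all observations at the pooled median and applies a chi-square test to the resulting 2x2 table. A minimal standard-library Python sketch of the idea, using the same two distributions as the simulation (variable names are ours, not from the post):

```python
import math
import random

random.seed(1)
n = 10_000

# Same setup as above: equal medians, different shapes.
y1 = [random.expovariate(1.0) for _ in range(n)]
y2 = [random.gauss(math.log(2), 1.0) for _ in range(n)]

# Mood's median test: count, within each group, the values above
# the pooled ("grand") median, then form the usual 2x2 chi-square.
pooled = sorted(y1 + y2)
grand_median = pooled[len(pooled) // 2]
a = sum(v > grand_median for v in y1)   # group 1, above
b = sum(v > grand_median for v in y2)   # group 2, above

T = a + b                    # total above the grand median
e_above = T / 2              # expected per group (equal group sizes)
e_below = n - T / 2
chi2 = ((a - e_above) ** 2 / e_above + (b - e_above) ** 2 / e_above
        + (n - a - e_below) ** 2 / e_below + (n - b - e_below) ** 2 / e_below)
print(round(chi2, 2))        # compare to 3.84, the 5% critical value, 1 df
```

Because the true medians are equal here, this statistic will typically fall below the critical value, in contrast to the Wilcoxon result above.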


SAS
In SAS, we make a data set with a group indicator and use it to generate data conditionally.
data wtest;
do i = 1 to 20000;
  c = (i gt 10000);
  if c eq 0 then y = ranexp(0);
    else y = normal(0) + log(2);
  output;
  end;
run;
The Wilcoxon rank-sum test is in proc npar1way (section 2.4.2).
proc npar1way data = wtest wilcoxon;
class c;
var y;
run;
As with R, we find a very small p-value.
                       The NPAR1WAY Procedure

             Wilcoxon Scores (Rank Sums) for Variable y
                      Classified by Variable c

                     Sum of     Expected      Std Dev         Mean
  c          N       Scores     Under H0     Under H0        Score

  0      10000    106061416    100005000   408258.497   10606.1416
  1      10000     93948584    100005000   408258.497    9394.8584


                      Wilcoxon Two-Sample Test

                Statistic             106061416.0000

                Normal Approximation
                Z                            14.8348
                One-Sided Pr >  Z             <.0001
                Two-Sided Pr > |Z|            <.0001

                t Approximation
                One-Sided Pr >  Z             <.0001
                Two-Sided Pr > |Z|            <.0001

             Z includes a continuity correction of 0.5.


                        Kruskal-Wallis Test

                Chi-Square                  220.0700
                DF                                 1
                Pr > Chi-Square               <.0001

The median regression is in proc quantreg (section 4.4.4). As in R, we fail to reject the null of equal medians.
proc quantreg data = wtest;
class c;
model y = c;
run;
                       The QUANTREG Procedure

                  Quantile and Objective Function

              Predicted Value at Mean          0.6909


                        Parameter Estimates

                            Standard    95% Confidence
    Parameter   DF Estimate    Error        Limits       t Value

    Intercept    1   0.6909   0.0125    0.6663    0.7155   55.06
    c         0  1   0.0165   0.0160   -0.0148    0.0479    1.03
    c         1  0   0.0000   0.0000    0.0000    0.0000     .

                        Parameter Estimates

                        Parameter   Pr > |t|

                        Intercept     <.0001
                        c         0   0.3016
                        c         1    .



An unrelated note about aggregators: We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

Saturday, October 24, 2009

Example 7.16: assess robustness of permutation test to violations of exchangeability assumption

Permutation tests (section 2.4.3) are a form of resampling based inference that can be used to compare two groups. A simple univariate two-group permutation test requires that the group labels for the observations are exchangeable under the null hypothesis of equal distributions, but allows relaxation of specific distributional assumptions required by parametric procedures such as the t-test (section 2.4.1).

Here we demonstrate how to carry out a small simulation study of the robustness of the Type I error rate of a permutation test under two scenarios that violate exchangeability by having different shapes under the null hypothesis. One might encounter such a situation in practice in a study of growth in which unequal follow-up time is allotted to the groups. Our study is analogous to assessing the t-test assuming equal variances, when the variances are known to be unequal and/or the distributions non-normal under the null.

Scenario I: group 1 is distributed as a Normal random variable (section 1.10.5) with mean 0 and standard deviation 1, while group 2 is distributed as a Normal random variable with mean 0 and standard deviation 6. Here the distributions are both symmetric, but have markedly different variances.

Scenario II: group 1 is distributed as a Normal random variable with mean 6 and standard deviation 1, while group 2 is distributed as an exponential random variable (section 1.10.7) with mean 6 and standard deviation 6. Here the distributions have markedly different skewness, as well as variance.

We use an alpha level of 0.05, set the sample size in each group to 10, and generate 10,000 simulated data sets. Instead of the exact permutation test, we use the asymptotically equivalent Monte Carlo hypothesis test (Dwass, 1957), generating 1,000 samples from the permutation distribution for each simulation (thank heavens for fast computers!).
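The Monte Carlo permutation test itself is simple enough to build from scratch. Here is a standard-library Python sketch (ours, not from the post), using the absolute difference in group means as the test statistic rather than the standardized linear statistic that coin computes:

```python
import random

def perm_test(y1, y2, numperm=1000):
    """Monte Carlo permutation p-value for |difference in group means|."""
    n1 = len(y1)
    pooled = list(y1) + list(y2)
    observed = abs(sum(y1) / n1 - sum(y2) / len(y2))
    hits = 0
    for _ in range(numperm):
        # each shuffle gives one draw from the permutation distribution
        random.shuffle(pooled)
        g1, g2 = pooled[:n1], pooled[n1:]
        if abs(sum(g1) / len(g1) - sum(g2) / len(g2)) >= observed:
            hits += 1
    # the +1 keeps the Monte Carlo p-value strictly positive
    return (hits + 1) / (numperm + 1)

random.seed(2)
n = 10
y1 = [random.gauss(0, 1) for _ in range(n)]   # scenario I, group 1
y2 = [random.gauss(0, 6) for _ in range(n)]   # scenario I, group 2
print(perm_test(y1, y2))                      # one simulated p-value
```

Each call returns one simulated p-value; repeating it over many simulated data sets and counting rejections at 0.05 reproduces the study design described above.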

R
In R, we set up some constants and a vector for results that we will use. The group vector can be generated once, with the outcome generated inside the loop. The oneway_test() function from the coin library is used to run the permutation test, and return the p-value (the first element in the list provided by pvalue()). Finally, the prop.test() function (section 2.1.9) is used to calculate the 95% confidence interval for the level of the test.

For scenario I, the code is:

library(coin)

numsim <- 10000
numperm <- 1000
n <- 10
res <- numeric(numsim)
x <- c(rep(0, n), rep(1, n))
for (i in 1:numsim) {
  y1 <- rnorm(n, 0, 1)
  y2 <- rnorm(n, 0, 6)
  y <- c(y1, y2)
  res[i] <- pvalue(oneway_test(y ~ as.factor(x),
    distribution=approximate(numperm)))[[1]]
}


The desired results can be generated as in section 2.1.9 by summarizing the res vector:

prop.test(sum(res<=0.05), numsim)


The result follows:

1-sample proportions test with continuity correction

data: sum(res <= 0.05) out of numsim, null probability 0.5
X-squared = 7574.221, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.06009200 0.06984569
sample estimates:
p
0.0648


We see that the permutation test is somewhat anti-conservative given such an extreme difference in variances.
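The interval reported by prop.test() is based on the Wilson score method (R adds a continuity correction). As a cross-check, the plain Wilson interval is easy to compute by hand; a standard-library Python sketch (the function name is ours), applied to the 648 rejections out of 10,000 above:

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Plain Wilson score interval, without continuity correction."""
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half

lo, hi = wilson_ci(648, 10000)
print(round(lo, 4), round(hi, 4))   # roughly (0.0601, 0.0698)
```

The limits differ only slightly from R's continuity-corrected interval, and the interval excludes 0.05, which is how we conclude the test is anti-conservative here.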

For scenario II, the code is the same with the exception of:

y1 <- rnorm(n, 6, 1)
y2 <- rexp(n, 1/6)


This yields the following results:

prop.test(sum(res<=0.05), numsim)

1-sample proportions test with continuity correction

data: sum(res <= 0.05) out of numsim, null probability 0.5
X-squared = 5973.744, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.1073820 0.1199172
sample estimates:
p
0.1135


The nominal Type I error rate of 0.05 is not preserved in this situation.


SAS
In SAS, we begin by making data for scenario I, using the looping tools described in section 1.11.1.


data test;
do numsim = 1 to 10000;
do group = 0,1;
do i = 1 to 10;
if group = 0 then y = rand('NORMAL',0,1);
else y = rand('NORMAL',0,6);
output;
end;
end;
end;
run;


Next, we perform the permutation test for equal means as shown in section 2.4.3. The by statement repeats the test for each value of numsim, while the ods statements, as demonstrated throughout the examples in the book, suppress copious output and save the relevant results. The name of the output table was found using the ods trace on statement, as discussed in section A.7.1.


ods select none;
ods output datascoresmc = kkout;
proc npar1way data = test;
by numsim;
class group;
var y;
exact scores = data / mc n = 1000;
run;
ods select all;


Next, we make a new data set from the saved output, keeping just the rows we need and generating a new variable that records whether the null hypothesis was rejected. The names and values of the variables were found using a proc print statement, not shown.


data props;
set kkout;
where (name1 = "MCP2_DATA");
reject = (nvalue1 lt .05);
run;

The estimated rejection probability in this setting is found as discussed in section 2.1.9:


proc freq data = props;
tables reject/binomial(level='1');
run;


The resulting output follows:

Cumulative Cumulative
reject Frequency Percent Frequency Percent
___________________________________________________________
0 9339 93.39 9339 93.39
1 661 6.61 10000 100.00

Proportion 0.0661
ASE 0.0025
95% Lower Conf Limit 0.0612
95% Upper Conf Limit 0.0710

Exact Conf Limits
95% Lower Conf Limit 0.0613
95% Upper Conf Limit 0.0711

ASE under H0 0.0050
Z -86.7800
One-sided Pr < Z <.0001
Two-sided Pr > |Z| <.0001

Sample Size = 10000

These results agree substantially with the results from R.

To produce scenario II, we replace

if group = 0 then y = rand('NORMAL',0,1);
else y = rand('NORMAL',0,6);

with

if group = 0 then y = rand('NORMAL',6,1);
else y = rand('EXPONENTIAL') * 6;

with the following results:

Cumulative Cumulative
reject Frequency Percent Frequency Percent
___________________________________________________________
0 8877 88.77 8877 88.77
1 1123 11.23 10000 100.00

Proportion 0.1123
ASE 0.0032
95% Lower Conf Limit 0.1061
95% Upper Conf Limit 0.1185

Exact Conf Limits
95% Lower Conf Limit 0.1062
95% Upper Conf Limit 0.1187

ASE under H0 0.0050
Z -77.5400
One-sided Pr < Z <.0001
Two-sided Pr > |Z| <.0001

Sample Size = 10000


Again, these results are comfortably similar to those found in R and demonstrate that this extreme violation of exchangeability results in a noticeable anti-conservative bias.

However, note that with additional observations, the anti-conservative bias diminishes considerably. With 100 observations per group (replacing do i = 1 to 10 with do i = 1 to 100), the following results are generated:

Cumulative Cumulative
reject Frequency Percent Frequency Percent
___________________________________________________________
0 9452 94.52 9452 94.52
1 548 5.48 10000 100.00

Proportion 0.0548
ASE 0.0023
95% Lower Conf Limit 0.0503
95% Upper Conf Limit 0.0593

Exact Conf Limits
95% Lower Conf Limit 0.0504
95% Upper Conf Limit 0.0594

ASE under H0 0.0050
Z -89.0400
One-sided Pr < Z <.0001
Two-sided Pr > |Z| <.0001

Sample Size = 10000

revealing a negligible bias.

Our conclusion: while the permutation test is somewhat robust to violations of exchangeability, when shapes and variances differ radically the results may be anti-conservatively biased.

It bears noting here that if there were no reason to expect different distributional shapes under the null, but we just happened to observe the large differences seen here, we would interpret these results as demonstrating the poor power of a permutation test based on the means to detect differences in shape alone.


Dwass M. Modified randomization tests for nonparametric hypotheses. Annals of Mathematical Statistics, 28:181-187, 1957.