Monday, May 17, 2010

Example 7.37: calculation of Hotelling's T^2

Hotelling's T^2 is a multivariate statistic used to compare two groups, where multiple outcomes are observed for each subject.

Here we demonstrate how to calculate Hotelling's T^2 using R and SAS, and test the code using a simulation study then apply it to data from the HELP study.


We utilize an approach suggested by Peter Mandeville in a posting to the R-help mailing list.

hotelling = function(y1, y2) {
# check the appropriate dimension
k = ncol(y1)
k2 = ncol(y2)
if (k2!=k)
stop("input matrices must have the same number of columns: y1 has ",
k, " while y2 has ", k2)

# calculate sample size and observed means
n1 = nrow(y1)
n2 = nrow(y2)
ybar1= apply(y1, 2, mean); ybar2= apply(y2, 2, mean)
diffbar = ybar1-ybar2

# calculate the variance of the difference in means
v = ((n1-1)*var(y1)+ (n2-1)*var(y2)) /(n1+n2-2)

# calculate the test statistic and associated quantities
t2 = n1*n2*diffbar%*%solve(v)%*%diffbar/(n1+n2)
f = (n1+n2-k-1)*t2/((n1+n2-2)*k)
pvalue = 1-pf(f, k, n1+n2-k-1)

# return the list of results
return(list(pvalue=pvalue, f=f, t2=t2, diff=diffbar))

We begin by ensuring that the two input matrices have the same number of columns, then calculate quantities needed for the calculation. The function returns a list with a number of named objects.

To test the function, we ran a simulation study to ensure that under the null hypothesis the function rejected approximately 5% of the time.

# do a simple simulation study to ensure Type I error rate is preserved
numoutcomes = 5
mu1 = rep(0, numoutcomes); mu2 = rep(0, numoutcomes)
rho = .4
# compound symmetry covariance matrix
Sigma = rho*matrix(rep(1, numoutcomes^2), numoutcomes) +
diag(rep(1-rho, numoutcomes))
numsim = 10000
reject = numeric(numsim)
for (i in 1:numsim) {
grp1 = mvrnorm(100, mu1, Sigma)
grp2 = mvrnorm(150, mu2, Sigma)
reject[i] = hotelling(grp1, grp2)$pvalue<0.05

This yields appropriate Type I error rate (just under 5%).

> table(reject)
0 1
9537 463

We can also undertake a comparison of non-homeless and homeless subjects in the HELP dataset, comparing individual observations on the PCS (physical component score) and MCS (mental component scores) of the SF-36 and the CESD measure of depressive symptoms.

ds = read.csv("")
smallds = with(ds, cbind(pcs, mcs, cesd))
grp1 = subset(smallds, ds$homeless==0)
grp2 = subset(smallds, ds$homeless==1)
res = hotelling(grp1, grp2)

This saves the results in the object res, which we can display.

> names(res)
[1] "pvalue" "f" "t2" "diff"
> res
[1,] 0.1082217

[1,] 2.035024

[1,] 6.132267

pcs mcs cesd
2.064049 1.755975 -2.183760

While there are differences in the mean PCS, MCS, and CESD scores between
the homeless and non-homeless subjects, we do not have sufficient evidence to reject the null hypothesis that they have equivalent means back in the populations (p=0.11).


In SAS, we can easily make the two-group multivariate comparison using the MANOVA statement in proc glm. This hides the actual value of the statistic, but generates the desired p-value. As usual, we use the ODS system to reduce SAS output.

ods select multstat;
proc glm data="c:\book\help";
class homeless;
model pcs mcs cesd = homeless;
manova h=homeless;

SAS prints four valid tests, which result in equal values in this case.

MANOVA Test Criteria and Exact F Statistics for
the Hypothesis of No Overall HOMELESS Effect
H = Type III SSCP Matrix for HOMELESS
E = Error SSCP Matrix

S=1 M=0.5 N=223.5

Statistic Value F Value Num DF Den DF Pr > F

Wilks' Lambda 0.98658536 2.04 3 449 0.1082
Pillai's Trace 0.01341464 2.04 3 449 0.1082
Hotelling-Lawley Trace 0.01359704 2.04 3 449 0.1082
Roy's Greatest Root 0.01359704 2.04 3 449 0.1082

Here we also demonstrate using SAS proc iml (section 1.9) to calculate the statistic and perform the test from scratch. This roughly parallels the R development above, though we omit the test for equal number of variables.

*get the help data;
data help;
set "c:\book\help";

proc iml;
* build a function;
start hotelling;
n1 = nrow(h1);
n0 = nrow(h0);
onesum = j(n1,1,1); *make an n1*1 matrix of 1s;
zerosum = j(n0,1,1);
meanh1 = h1`*onesum/n1;
meanh0 = h0`*zerosum/n0;
i1 = i(n1); * make identity matrix;
i0 = i(n0);
vc1 = h1`*(i1-onesum*onesum`/n1)*h1/(n1-1.0);
* make variance-covariance matrix;
vc0 = h0`*(i0-zerosum*zerosum`/n0)*h0/(n0-1.0);
pooledvc = ((n1-1.0)*vc1+(n0-1.0)*vc0)/(n1+n0-2.0);
meandiff = meanh1 - meanh0;
t2 = meandiff`* inv(pooledvc) *meandiff * (n1 * n0)/(n1 + n0);
nvars = ncol(h1);
fcrit = finv(.95, nvars, df2);
print meandiff pooledvc t2 f nvars df2 p fcrit;

*use the function with the HELP data;
use help;
read all var {cesd pcs mcs} where (homeless = 1) into h1;
read all var {cesd pcs mcs} where (homeless = 0) into h0;
run hotelling;

Here's the output:

meandiff pooledvc t2 f nvars

2.1837595 155.76862 -38.46634 -108.8555 6.1322669 2.0350243 3
-2.064049 -38.46634 115.50213 14.423897
-1.755975 -108.8555 14.423897 164.44444

df2 p fcrit

449 0.1082217 2.6247689


Phillip Good said...

To obtain an exact p-value, compute the permutation distribution of Hotelling's T2. The procedure is described in the Practitioner's Guide to Resampling Methods.

Anonymous said...

i need data on hotelling t-test so that i can run it on my own pls.