Here we demonstrate how to calculate Hotelling's T^2 using R and SAS, and test the code using a simulation study then apply it to data from the HELP study.

**R**

We utilize an approach suggested by Peter Mandeville in a posting to the R-help mailing list.

hotelling = function(y1, y2) {

# check the appropriate dimension

k = ncol(y1)

k2 = ncol(y2)

if (k2!=k)

stop("input matrices must have the same number of columns: y1 has ",

k, " while y2 has ", k2)

# calculate sample size and observed means

n1 = nrow(y1)

n2 = nrow(y2)

ybar1= apply(y1, 2, mean); ybar2= apply(y2, 2, mean)

diffbar = ybar1-ybar2

# calculate the variance of the difference in means

v = ((n1-1)*var(y1)+ (n2-1)*var(y2)) /(n1+n2-2)

# calculate the test statistic and associated quantities

t2 = n1*n2*diffbar%*%solve(v)%*%diffbar/(n1+n2)

f = (n1+n2-k-1)*t2/((n1+n2-2)*k)

pvalue = 1-pf(f, k, n1+n2-k-1)

# return the list of results

return(list(pvalue=pvalue, f=f, t2=t2, diff=diffbar))

}

We begin by ensuring that the two input matrices have the same number of columns, then calculate quantities needed for the calculation. The function returns a list with a number of named objects.

To test the function, we ran a simulation study to ensure that under the null hypothesis the function rejected approximately 5% of the time.

# do a simple simulation study to ensure Type I error rate is preserved

library(MASS)

numoutcomes = 5

mu1 = rep(0, numoutcomes); mu2 = rep(0, numoutcomes)

rho = .4

# compound symmetry covariance matrix

Sigma = rho*matrix(rep(1, numoutcomes^2), numoutcomes) +

diag(rep(1-rho, numoutcomes))

numsim = 10000

reject = numeric(numsim)

for (i in 1:numsim) {

grp1 = mvrnorm(100, mu1, Sigma)

grp2 = mvrnorm(150, mu2, Sigma)

reject[i] = hotelling(grp1, grp2)$pvalue<0.05

}

This yields appropriate Type I error rate (just under 5%).

> table(reject)

reject

0 1

9537 463

We can also undertake a comparison of non-homeless and homeless subjects in the HELP dataset, comparing individual observations on the PCS (physical component score) and MCS (mental component scores) of the SF-36 and the CESD measure of depressive symptoms.

ds = read.csv("http://www.math.smith.edu/sasr/datasets/help.csv")

smallds = with(ds, cbind(pcs, mcs, cesd))

grp1 = subset(smallds, ds$homeless==0)

grp2 = subset(smallds, ds$homeless==1)

res = hotelling(grp1, grp2)

This saves the results in the object

`res`, which we can display.

> names(res)

[1] "pvalue" "f" "t2" "diff"

> res

$pvalue

[,1]

[1,] 0.1082217

$f

[,1]

[1,] 2.035024

$t2

[,1]

[1,] 6.132267

$diff

pcs mcs cesd

2.064049 1.755975 -2.183760

While there are differences in the mean PCS, MCS, and CESD scores between

the homeless and non-homeless subjects, we do not have sufficient evidence to reject the null hypothesis that they have equivalent means back in the populations (p=0.11).

**SAS**

In SAS, we can easily make the two-group multivariate comparison using the MANOVA statement in

`proc glm`. This hides the actual value of the statistic, but generates the desired p-value. As usual, we use the

`ODS`system to reduce SAS output.

ods select multstat;

proc glm data="c:\book\help";

class homeless;

model pcs mcs cesd = homeless;

manova h=homeless;

run;

quit;

SAS prints four valid tests, which result in equal values in this case.

MANOVA Test Criteria and Exact F Statistics for

the Hypothesis of No Overall HOMELESS Effect

H = Type III SSCP Matrix for HOMELESS

E = Error SSCP Matrix

S=1 M=0.5 N=223.5

Statistic Value F Value Num DF Den DF Pr > F

Wilks' Lambda 0.98658536 2.04 3 449 0.1082

Pillai's Trace 0.01341464 2.04 3 449 0.1082

Hotelling-Lawley Trace 0.01359704 2.04 3 449 0.1082

Roy's Greatest Root 0.01359704 2.04 3 449 0.1082

Here we also demonstrate using SAS

`proc iml`(section 1.9) to calculate the statistic and perform the test from scratch. This roughly parallels the R development above, though we omit the test for equal number of variables.

*get the help data;

data help;

set "c:\book\help";

run;

proc iml;

* build a function;

start hotelling;

n1 = nrow(h1);

n0 = nrow(h0);

onesum = j(n1,1,1); *make an n1*1 matrix of 1s;

zerosum = j(n0,1,1);

meanh1 = h1`*onesum/n1;

meanh0 = h0`*zerosum/n0;

i1 = i(n1); * make identity matrix;

i0 = i(n0);

vc1 = h1`*(i1-onesum*onesum`/n1)*h1/(n1-1.0);

* make variance-covariance matrix;

vc0 = h0`*(i0-zerosum*zerosum`/n0)*h0/(n0-1.0);

pooledvc = ((n1-1.0)*vc1+(n0-1.0)*vc0)/(n1+n0-2.0);

meandiff = meanh1 - meanh0;

t2 = meandiff`* inv(pooledvc) *meandiff * (n1 * n0)/(n1 + n0);

nvars = ncol(h1);

f=((n1+n0-nvars-1)/(nvars*(n1+n0-2)))*t2;

df2=n1+n0-nvars-1;

p=1-probf(f,nvars,df2);

fcrit = finv(.95, nvars, df2);

print meandiff pooledvc t2 f nvars df2 p fcrit;

finish;

*use the function with the HELP data;

use help;

read all var {cesd pcs mcs} where (homeless = 1) into h1;

read all var {cesd pcs mcs} where (homeless = 0) into h0;

run hotelling;

quit;

Here's the output:

meandiff pooledvc t2 f nvars

2.1837595 155.76862 -38.46634 -108.8555 6.1322669 2.0350243 3

-2.064049 -38.46634 115.50213 14.423897

-1.755975 -108.8555 14.423897 164.44444

df2 p fcrit

449 0.1082217 2.6247689

## 2 comments:

To obtain an exact p-value, compute the permutation distribution of Hotelling's T2. The procedure is described in the Practitioner's Guide to Resampling Methods.

i need data on hotelling t-test so that i can run it on my own pls. kingsley4all2005@yahoo.co.uk

Post a Comment