## Monday, May 17, 2010

### Example 7.37: calculation of Hotelling's T^2

Hotelling's T^2 is a multivariate statistic used to compare two groups, where multiple outcomes are observed for each subject.

Here we demonstrate how to calculate Hotelling's T^2 using R and SAS, and test the code using a simulation study then apply it to data from the HELP study.

R

We utilize an approach suggested by Peter Mandeville in a posting to the R-help mailing list.

`hotelling = function(y1, y2) {  # check the appropriate dimension  k = ncol(y1)  k2 = ncol(y2)  if (k2!=k)     stop("input matrices must have the same number of columns: y1 has ",      k, " while y2 has ", k2)       # calculate sample size and observed means  n1 = nrow(y1)  n2 = nrow(y2)     ybar1= apply(y1, 2, mean); ybar2= apply(y2, 2, mean)  diffbar = ybar1-ybar2     # calculate the variance of the difference in means  v = ((n1-1)*var(y1)+ (n2-1)*var(y2)) /(n1+n2-2)     # calculate the test statistic and associated quantities  t2 = n1*n2*diffbar%*%solve(v)%*%diffbar/(n1+n2)  f = (n1+n2-k-1)*t2/((n1+n2-2)*k)  pvalue = 1-pf(f, k, n1+n2-k-1)     # return the list of results  return(list(pvalue=pvalue, f=f, t2=t2, diff=diffbar))}`

We begin by ensuring that the two input matrices have the same number of columns, then calculate quantities needed for the calculation. The function returns a list with a number of named objects.

To test the function, we ran a simulation study to ensure that under the null hypothesis the function rejected approximately 5% of the time.

`# do a simple simulation study to ensure Type I error rate is preservedlibrary(MASS)numoutcomes = 5mu1 = rep(0, numoutcomes); mu2 = rep(0, numoutcomes)rho = .4# compound symmetry covariance matrixSigma = rho*matrix(rep(1, numoutcomes^2), numoutcomes) +   diag(rep(1-rho, numoutcomes)) numsim = 10000reject = numeric(numsim) for (i in 1:numsim) {  grp1 = mvrnorm(100, mu1, Sigma)  grp2 = mvrnorm(150, mu2, Sigma)  reject[i] = hotelling(grp1, grp2)\$pvalue<0.05}`

This yields appropriate Type I error rate (just under 5%).

`> table(reject)reject   0    1 9537  463 `

We can also undertake a comparison of non-homeless and homeless subjects in the HELP dataset, comparing individual observations on the PCS (physical component score) and MCS (mental component scores) of the SF-36 and the CESD measure of depressive symptoms.

`ds = read.csv("http://www.math.smith.edu/sasr/datasets/help.csv")smallds = with(ds, cbind(pcs, mcs, cesd))grp1 = subset(smallds, ds\$homeless==0)grp2 = subset(smallds, ds\$homeless==1)res = hotelling(grp1, grp2)`

This saves the results in the object res, which we can display.

`> names(res) "pvalue" "f"      "t2"     "diff"  > res\$pvalue          [,1][1,] 0.1082217\$f         [,1][1,] 2.035024\$t2         [,1][1,] 6.132267\$diff      pcs       mcs      cesd  2.064049  1.755975 -2.183760 `

While there are differences in the mean PCS, MCS, and CESD scores between
the homeless and non-homeless subjects, we do not have sufficient evidence to reject the null hypothesis that they have equivalent means back in the populations (p=0.11).

SAS

In SAS, we can easily make the two-group multivariate comparison using the MANOVA statement in proc glm. This hides the actual value of the statistic, but generates the desired p-value. As usual, we use the ODS system to reduce SAS output.

`ods select multstat;proc glm data="c:\book\help";  class homeless;  model pcs mcs cesd = homeless;  manova h=homeless;run;quit;`

SAS prints four valid tests, which result in equal values in this case.

`            MANOVA Test Criteria and Exact F Statistics for              the Hypothesis of No Overall HOMELESS Effect                 H = Type III SSCP Matrix for HOMELESS                         E = Error SSCP Matrix                        S=1    M=0.5    N=223.5 Statistic                      Value  F Value  Num DF  Den DF  Pr > F Wilks' Lambda             0.98658536     2.04       3     449  0.1082 Pillai's Trace            0.01341464     2.04       3     449  0.1082 Hotelling-Lawley Trace    0.01359704     2.04       3     449  0.1082 Roy's Greatest Root       0.01359704     2.04       3     449  0.1082`

Here we also demonstrate using SAS proc iml (section 1.9) to calculate the statistic and perform the test from scratch. This roughly parallels the R development above, though we omit the test for equal number of variables.

`*get the help data;data help;set "c:\book\help";run;proc iml;* build a function;start hotelling;  n1 = nrow(h1);  n0 = nrow(h0);  onesum = j(n1,1,1);  *make an n1*1 matrix of 1s;  zerosum = j(n0,1,1);    meanh1 = h1`*onesum/n1;  meanh0 = h0`*zerosum/n0;  i1 = i(n1);          * make identity matrix;  i0 = i(n0);           vc1 =  h1`*(i1-onesum*onesum`/n1)*h1/(n1-1.0);                       * make variance-covariance matrix;  vc0 =  h0`*(i0-zerosum*zerosum`/n0)*h0/(n0-1.0);  pooledvc = ((n1-1.0)*vc1+(n0-1.0)*vc0)/(n1+n0-2.0);  meandiff = meanh1 - meanh0;  t2 = meandiff`* inv(pooledvc) *meandiff * (n1 * n0)/(n1 + n0);  nvars = ncol(h1);  f=((n1+n0-nvars-1)/(nvars*(n1+n0-2)))*t2;  df2=n1+n0-nvars-1;  p=1-probf(f,nvars,df2);  fcrit = finv(.95, nvars, df2);  print meandiff pooledvc t2 f nvars df2 p fcrit;finish;*use the function with the HELP data;use help;  read all var {cesd pcs mcs} where (homeless = 1) into h1;  read all var {cesd pcs mcs} where (homeless = 0) into h0;  run hotelling;quit;`

Here's the output:

` meandiff  pooledvc                            t2         f     nvars 2.1837595 155.76862 -38.46634 -108.8555 6.1322669 2.0350243         3 -2.064049 -38.46634 115.50213 14.423897 -1.755975 -108.8555 14.423897 164.44444                           df2         p     fcrit                           449 0.1082217 2.6247689`