Monday, December 14, 2009

Example 7.18: Displaying missing value categories in a table

When displaying contingency tables (section 2.3.1), there are times when it is useful to either show or hide the missing data category. Both SAS and the typical R command default to displaying the table only for observations where both factors are observed.

In this example, we generate some multinomial data (section 1.10.4) and then produce tables with and without missing data categories.

SAS

Generate the multinomial data, uniform data, and use the latter to censor the former:

data blog;
do i = 1 to 300;
x = rand("TABLE",.3,.4);
y = rand("TABLE",.3,.4);
if uniform(0) gt .8 then x = .;
if uniform(0) gt .8 then y = .;
output;
end;
run;


Print the default table with only complete data. Note the options used to reduce output, as in section 4.6.9.

proc freq data=blog;
tables y*x / norow nocol;
run;


This produces:

Table of y by x

y x

Frequency|
Percent | 1| 2| 3| Total
---------+--------+--------+--------+
1 | 16 | 13 | 18 | 47
| 8.16 | 6.63 | 9.18 | 23.98
---------+--------+--------+--------+
2 | 18 | 32 | 22 | 72
| 9.18 | 16.33 | 11.22 | 36.73
---------+--------+--------+--------+
3 | 28 | 31 | 18 | 77
| 14.29 | 15.82 | 9.18 | 39.29
---------+--------+--------+--------+
Total 62 76 58 196
31.63 38.78 29.59 100.00

Frequency Missing = 104


The missing categories are included through the missprint option.

proc freq data = blog;
tables y*x / norow nocol missprint;
run;


This produces:

Table of y by x

y x

Frequency|
Percent | .| 1| 2| 3| Total
---------+--------+--------+--------+--------+
. | 12 | 12 | 20 | 14 | .
| . | . | . | . | .
---------+--------+--------+--------+--------+
1 | 10 | 16 | 13 | 18 | 47
| . | 8.16 | 6.63 | 9.18 | 23.98
---------+--------+--------+--------+--------+
2 | 17 | 18 | 32 | 22 | 72
| . | 9.18 | 16.33 | 11.22 | 36.73
---------+--------+--------+--------+--------+
3 | 19 | 28 | 31 | 18 | 77
| . | 14.29 | 15.82 | 9.18 | 39.29
---------+--------+--------+--------+--------+
Total . 62 76 58 196
. 31.63 38.78 29.59 100.00
Frequency Missing = 104


Note that if there are no missing values, SAS will not print the rows and columns headed with a '.' which is analogous to the "ifany" option in R shown below.

R

First, generate the data:

library(Hmisc)
x <- rMultinom(matrix(c(.3,.3,.4),1,3),300)
y <- rMultinom(matrix(c(.3,.3,.4),1,3),300)


Then, generate some random Uniforms to censor some of the observed data:

censprobx <- runif(300)
censproby <- runif(300)


Censor the data:

x[censprobx > .8] <- NA
y[censproby > .8] <- NA


Produce the default table (omits any missing data):

table(y,x)

x
y 1 2 3
1 18 18 29
2 17 21 22
3 20 30 40


Make the table which includes the missing category:

table(y, x, useNA="ifany")

x
y 1 2 3 NA
1 18 18 29 9
2 17 21 22 17
3 20 30 40 17
NA 14 5 14 9


The useNA option also allows the values "no" and "always". The value "no" corresponds to the default behavior in R or SAS, while the "always" option is not available in SAS. SAS, however, shows the total number missing in any case.

Thursday, November 12, 2009

Example 7.17: The Smith College diploma problem

Smith College is a residential women's liberal arts college in Northampton, MA that is steeped in tradition. One such tradition is to give each student at graduation a diploma at random (or more accurately, in a haphazard fashion). At the end of the ceremony, a diploma circle is formed, and students pass the diplomas that they receive to the person next to them, and step out once they've received their diploma.

This problem, also known as the "hat-check" problem, is featured in Fred Mosteller's "Fifty Challenging Problems in Probability (with solutions)". Variants provide great fodder for probability courses.

The analytic (closed-form) solution for the expected number of students who receive their diplomas when they first receive one is very straightforward. Let X_i be the event that the ith student receives their diploma. E[X_i] = 1/n for all i, since the diplomas are uniformly distributed. If T is defined as the sum of all of the events X_1 through X_n, E[T] = n * 1/n = 1 by the rules of expectations. It is sometimes surprising to students that this result does not depend on n. The variance is trickier, since the outcomes are not independent (if the first student receives their diploma, the probability that the others will increases ever so slightly).

For students, the use of empirical (simulation-based) problem solving is increasingly common as a means to complement and enhance analytic (closed-form) solutions. Here we illustrate how to simulate the expected number of students who receive their diploma as well as the standard deviation of that quantity. We assume that n=650.

R
In R, we set up some constants and a vector for results that we will use. The list of students vector can be generated once, with the permuted vector of diplomas generated inside the loop. The == operator (section B.4.2) is used to compare each of the elements of the vectors. Finally, the

numsim <- 100000
n <- 650
res <- numeric(numsim)
students <- 1:n
for (i in 1:numsim) {
diploma <- sample(students, 650)
res[i] = sum(students==diploma)
}


This generates the output:

> table(res)
0 1 2 3 4 5 6 7 8
36568 36866 18545 6085 1590 295 40 9 2
> mean(res)
[1] 1.00365
> sd(res)
[1] 0.9995232

The expected value and standard deviation of the number of students who receive their diplomas on the first try are both 1.


SAS

In SAS we prepare by making a data set with all 650 * 100,000 diplomas by looping through the 100,000 simulations, creating 650 diplomas each time. We also assign a random number to each diploma. Then we sort on the random value within each simulation.

data dips;
do sim = 1 to 100000;
do diploma = 1 to 650;
randorder = uniform(0);
output;
end;
end;
run;

proc sort data=dips;
by sim randorder;
run;

Next, we create a student identifier based on position in the data set and mark whether the reordered diploma matches the student ID. Finally, we count how many times the diploma matches the student using proc summary (section 2.1). It would also be relatively easy to count this manually within the data step.

data match;
set dips;
student = mod(_n_, 650) + 1;
match = (student eq diploma);
run;

proc summary data=match;
by sim;
var match;
output out=summatch sum=totalmatches;
run;

We can generate a table of the results using proc freq (section 2.3), and the mean and standard deviation using proc means.


proc freq data=summatch;
tables totalmatches / nocum;
run;


totalmatches Frequency Percent
0 36726 36.73
1 36734 36.73
2 18359 18.36
3 6241 6.24
4 1578 1.58
5 301 0.30
6 57 0.06
7 4 0.00


proc means data=summatch mean std;
var totalmatches;
run;


Analysis Variable : totalmatches
Mean Std Dev
1.0036200 1.0031734

Saturday, October 24, 2009

Example 7.16: assess robustness of permutation test to violations of exchangeability assumption

Permutation tests (section 2.4.3) are a form of resampling based inference that can be used to compare two groups. A simple univariate two-group permutation test requires that the group labels for the observations are exchangeable under the null hypothesis of equal distributions, but allows relaxation of specific distributional assumptions required by parametric procedures such as the t-test (section 2.4.1).

Here we demonstrate how to carry out a small simulation study of the robustness of the Type I error rate of a permutation test under two scenarios that violate exchangeability by having different shapes under the null hypothesis. One might encounter such a situation in practice in a study of growth in which unequal follow-up time is allotted to the groups. Our study is analogous to assessing the t-test assuming equal variances, when the variances are known to be unequal and/or the distributions non-normal under the null.

Scenario I: group 1 is distributed as a Normal random variable (section 1.10.5) with mean 0 and standard deviation 1, while group 2 is distributed as a Normal random variable with mean 0 and standard deviation 6. Here the distributions are both symmetric, but have markedly different variances.

Scenario II: group 1 is distributed as a Normal random variable with mean 6 and standard deviation 1, while group 2 is distributed as an exponential random variable (section 1.10.7) with mean 6 and standard deviation 6. Here the distributions have markedly different skewness, as well as variance.

We use an alpha level of 0.05, set the sample size in each group to 10, and generate 10,000 simulated data sets. Instead of the exact permutation test, we use the asymptotically equivalent Monte Carlo Hypothesis Test (Dwass, 1957), generating 1,000 samples from the permutation distribution for each simulation (thank heavens for fast computers!)

R
In R, we set up some constants and a vector for results that we will use. The group vector can be generated once, with the outcome generated inside the loop. The oneway_test() function from the coin library is used to run the permutation test, and return the p-value (the first element in the list provided by pvalue()). Finally, the prop.test() function (section 2.1.9) is used to calculate the 95% confidence interval for the level of the test.

For scenario I, the code is:

numsim <- 10000
numperm <- 1000
n <- 10
res <- numeric(numsim)
x <- c(rep(0, n), rep(1, n))
for (i in 1:numsim) {
y1 <- rnorm(n, 0, 1)
y2 <- rnorm(n, 0, 6)
y <- c(y1, y2)
res[i] <- pvalue(oneway_test(y ~ as.factor(x),
distribution=approximate(numperm)))[[1]]
}


The desired results can be generated as in section 2.1.9 by summarizing the res vector:

prop.test(sum(res<=0.05), numsim)


The result follows:

1-sample proportions test with continuity correction

data: sum(res <= 0.05) out of numsim, null probability 0.5
X-squared = 7574.221, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.06009200 0.06984569
sample estimates:
p
0.0648


We see that the permutation test is somewhat anti-conservative given such an extreme difference in variances.

For scenario II, the code is the same with the exception of:

y1 <- rnorm(n, 6, 1)
y2 <- rexp(n, 1/6)


The yields the following results:

> prop.test(sum(res<=0.05), numsim)

1-sample proportions test with continuity correction

data: sum(res <= 0.05) out of numsim, null probability 0.5
X-squared = 5973.744, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.1073820 0.1199172
sample estimates:
p
0.1135


The nominal Type I error rate of 0.05 is not preserved in this situation.


SAS
In SAS, we begin by making data for scenario 1, using the looping tools described in section 1.11.1.


data test;
do numsim = 1 to 10000;
do group = 0,1;
do i = 1 to 10;
if group = 0 then y = rand('NORMAL',0,1);
else y = rand('NORMAL',0,6);
output;
end;
end;
end;
run;


Next, we perform the permutation test for equal means as shown in section 2.4.3. The by statement repeats the test for each simnum, while the ods statements, as demonstrated throughout the examples in the book, prevent copious output and save relevant results. This was found using the ods trace on statement, as discussed in section A.7.1.


ods select none;
ods output datascoresmc = kkout;
proc npar1way data = test;
by numsim;
class group;
var y;
exact scores = data / mc n = 1000;
run;
ods select all;


Next, we make a new data set from the saved output, saving just the lines we need, and generating a new variable which records whether the test was rejected or not. The names and values of the variables were found using a proc print statement, which is not shown.


data props;
set kkout;
where (name1 = "MCP2_DATA");
reject = (nvalue1 lt .05);
run;

The estimated rejection probability in this setting is found as discussed in section 2.1.9:


proc freq data = props;
tables reject/binomial(level='1');
run;


The resulting output follows:

Cumulative Cumulative
reject Frequency Percent Frequency Percent
___________________________________________________________
0 9339 93.39 9339 93.39
1 661 6.61 10000 100.00

Proportion 0.0661
ASE 0.0025
95% Lower Conf Limit 0.0612
95% Upper Conf Limit 0.0710

Exact Conf Limits
95% Lower Conf Limit 0.0613
95% Upper Conf Limit 0.0711

ASE under H0 0.0050
Z -86.7800
One-sided Pr < Z <.0001
Two-sided Pr > |Z| <.0001

Sample Size = 10000

These results agree substantially with the results from R.

To produce scenario 2, we replace

if group = 0 then y = rand('NORMAL',0,1);
else y = rand('NORMAL',0,6);

with

if group = 0 then y = rand('NORMAL',6,1);
else y = rand('EXPONENTIAL') * 6;

with the following results:

Cumulative Cumulative
reject Frequency Percent Frequency Percent
___________________________________________________________
0 8877 88.77 8877 88.77
1 1123 11.23 10000 100.00

Proportion 0.1123
ASE 0.0032
95% Lower Conf Limit 0.1061
95% Upper Conf Limit 0.1185

Exact Conf Limits
95% Lower Conf Limit 0.1062
95% Upper Conf Limit 0.1187

ASE under H0 0.0050
Z -77.5400
One-sided Pr < Z <.0001
Two-sided Pr > |Z| <.0001

Sample Size = 10000


Again, these results are comfortably similar to those found in R and demonstrate that this extreme violation of the null results in a noticeable anti-conservative bias.

However, note that with additional observations, the anti-conservative bias diminishes considerably. With 100 observations per group, (replacing do i = 1 to 10 with do i = 1 to 100) the following results are generated:

Cumulative Cumulative
reject Frequency Percent Frequency Percent
___________________________________________________________
0 9452 94.52 9452 94.52
1 548 5.48 10000 100.00

Proportion 0.0548
ASE 0.0023
95% Lower Conf Limit 0.0503
95% Upper Conf Limit 0.0593

Exact Conf Limits
95% Lower Conf Limit 0.0504
95% Upper Conf Limit 0.0594

ASE under H0 0.0050
Z -89.0400
One-sided Pr < Z <.0001
Two-sided Pr > |Z| <.0001

Sample Size = 10000

revealing a negligible bias.

Our conclusion: while the permutation test is somewhat robust to violations of exchangeability, when shapes and variances differ radically the results may be anti-conservatively biased.

It bears noting here that if there were no reason to expect different distributional shapes under the null, but we just happened to observed the large differences seen here, we would interpret these results as demonstrating the poor power of a permutation test based on the means to detect differences in shape alone.


Dwass M. Modified randomization tests for nonparametric hypotheses. Annals of Mathematical Statistics, 28:181-187, 1957.

Tuesday, October 13, 2009

Example 7.15: A more complex sales graphic

The plot of Amazon sales rank over time generated in example 7.14 leaves questions. From a software perspective, we'd like to make the plot prettier, while we can embellish the plot to inform our interpretation about how the rank is calculated.

For the latter purpose, we'll create an indicator of whether the rank was recorded in nighttime (eastern US time) or not. Then we'll color the nighttime ranks differently than the daytime ranks.


SAS
In SAS, we use the timepart function to extract the time of day from the salestime variable which holds the date and time. This is a value measured in seconds since midnight, and we use some conditional logic (section 1.4.11) to identify hours before 8 AM or after 6 PM.


data sales2;
set sales;
if timepart(salestime) lt (8 * 60 * 60) or
timepart(salestime) gt (18 * 60 * 60) then night=1;
else night = 0;
run;



Then we can make the plot. We use the axis statement (sections 5.3.7 and 5.3.8) to specify the axis ranges, rotate some labels and headers, and remove the minor tick marks. Note that since the x-axis is a date-time variable, we have to specify the axis range using date-time data, here read in using formats (section A.6.4) and request labels every three days by requesting an interval of three days' worth of seconds. We also use the symbol statement (section 5.2.2, 5.3.11) to specify shapes and colors for the plotted points.


axis1 order = ("09AUG2009/12:00:00"dt to
"27AUG2009/12:00:00"dt by 259200)
minor = none;
axis2 order=(30000 to 290000 by 130000) label=(angle=90)
value=(angle=90) minor=none;
symbol1 i=none v=dot c=red h=.3;
symbol2 i=none v=dot c=black h=.3;


Finally, we request the plot using proc gplot. The a*b=c syntax (as in section 5.6.2) will result in different symbols for each value of the new night variable, and the symbols we just defined will be used. The haxis and vaxis options are used to associate each axis with the axis definitions specified in the axis statements.


proc gplot data=sales2;
plot rank*salestime=night / haxis=axis1 vaxis=axis2;
format salestime dtdate5.;
run; quit;


R
In R, we make a new variable reflecting the date-time at the midnight before we started collecting data. We then coerce the time values to numeric values using the as.numeric() function (section 1.4.2), while subtracting that midnight value. Next, we mod by 24 (using the %% operator, section B.4.3) and lastly round to the integer value (section 1.8.4) to get the hour of measurement. There's probably a more elegant way of doing this in R, but this works.


midnight <- as.POSIXlt("2009-08-09 00:00:00 EDT")
timeofday <- round(as.numeric(timeval-midnight)%%24,0)


Next, we prepare for making a nighttime indicator by intializing a vector with 0. Then we assign a value of 1 when the corresponding element of the hour of measurement vector has a value in the correct range.


night <- rep(0,length(timeofday))
night[timeofday < 8 | timeofday > 18] <- 1


Finally, we're ready to make the plot. We begin by setting up the axes, using the type="n" option (section 5.1.1) to prevent any data being plotted. Next, we plot the nighttime ranks by conditioning the plot vector on the value of the nighttime indicator vector; we then repeact for the daytime values, additionally specifying a color for these points. Lastly, we add a legend to the plot (section 5.2.14).


plot(timeval, rank, type="n")
points(timeval[night==1], rank[night==1], pch=20)
points(timeval[night==0], rank[night==0], pch=20,
col="red")
legend(as.POSIXlt("2009-08-22 00:00:00 EDT"), 250000,
legend=c("day", "night"), col=c("black", "red"),
pch=c(20, 20))




Interpretation: It appears that Amazon's ranking function adjusts for the pre-dawn hours, most likely to reflect a general lack of activity. (Note that these ranks are from Amazon.com. In all likelihood, Amazon.co.uk and other local Amazon sites adjust for local time similarly.) Perhaps some recency in sales allows a decline in rank for some books during these hours? In addition, we see that most sales of this book, as inferred from the discontinuous drops (improvement) in rank, tend to happen near the beginning of the day, or mid-day, rather than at night.

Tuesday, September 29, 2009

Example 7.14: A simple graphic of sales

In this example, we show a simple plot of the sales rank data read in as shown in example 7.13.

SAS
In SAS, we use the symbol statement (section 5.3) to request small (with the h option) dots (with the v option, and that the dots not be connected (with the i option. (See sections 5.2.2, 5.3.9 for more details.)
we request a scatter plot with the gplot procdure (section 5.1.1), and tell SAS how to display the date/time values using the format statement (section A.6.4).


symbol1 v=dot i=none h=.2;
proc gplot data=sales;
plot rank*salestime;
format salestime datetime7.;
run;


Note in the results that the default SAS behavior is to use round and regular axis tick marks, in this case wasting a great deal of space on both axes. Similarly, the orientation of the y-axis labels uses up much space as well. We'll fix this behavior in a later entry.

R
In contrast, the R code is both simpler and more attractive by default; the plot() function (section 5.1.1) has a default treatment for date/time-formatted variables. Our only modification to the defaults is to request dots (with the pch option (section 5.2.2) instead of the default open circles.


plot(timeval, rank, pch=20)





Interpretation: As noted in a previous entry, information about the Amazon sales rank is relatively scant. Amazon considers the number of books it sells to be a competitive secret, and it discloses little information about how the rank is calculated. However, by examining the plot above, we can make some deductions. First, sales ranks are updated at least hourly. Second, there appears to be some adjustment for time of day. This would explain the smooth changes in direction during, for example, the first series of ~20 observations. In contrast, notable discontinuous changes in rank apparently signify sales; for a book with small circulation, such as this one, we can assume that most hours contain only one sale.

Thursday, September 24, 2009

Example 7.13: Read a file with two lines per observation

In example 7.6 we showed how to retrieve the Amazon sales rank of a book. A cron job on one of our machines grabs the sales rank hourly. We’d like to use this data to satisfy our curiosity about when and how often a book sells.

A complication is that the result of the cron job is a file with the time of the sales rank retrieval on one line and the rank retrieved on the following line. Here are some example lines frome the file:

Sun Aug 9 14:38:13 EDT 2009
salesrank= 141913
Sun Aug 9 14:40:04 EDT 2009
salesrank= 141913
Sun Aug 9 15:00:05 EDT 2009
salesrank= 149556
Sun Aug 9 16:00:05 EDT 2009
salesrank= 158132
Sun Aug 9 17:00:05 EDT 2009
salesrank= 166742

In the example below, we show how to read this data into SAS or R so that there is one observation (or row) per time. In a later entry, we'll build a graphic to display it.

SAS
In SAS, we use the infile statement (section 1.1.2) to read the data, as this affords the most control. As in section 6.4.1, we read a key character from the start of the line, holding that line with the trailing ”@” character. Then, dependent on the value of that character, we use different input statements to read in the date and time values on the one hand, or the rank. If the line does not contain the rank, we read in character values containing the day and time in formats which SAS can interpret, then convert this value to a SAS date-time value using an informat (section A.6.4) applied to a new character string made up of the date and time information. We keep these variable values using the retain statement (as in section 6.4.1). If the line has a rank, we read in the rank and write out the line using the output statement. Note that we do not use either the day of week or the time zone. Finally, we check the data by printing a few lines.


data sales;
infile "c:\data\sales.dat";
retain day month date time edt year;
input @1 type $1 @;
if type ne 's' then do;
input @1 day $ Month $ year time $ edt $ year;
end;
else do;
input @12 rank;
datetime = compress(date||month||year||"/"||time);
salestime = input(datetime,datetime18.);
output;
end;
run;

proc print data=sales (obs=6);
var datetime salestime rank;
run;

Obs datetime salestime rank

1 9Aug2009/14:38:13 1565447893 141913
2 9Aug2009/14:40:04 1565448004 141913
3 9Aug2009/15:00:05 1565449205 149556
4 9Aug2009/16:00:05 1565452805 158132
5 9Aug2009/17:00:05 1565456405 166742
6 9Aug2009/18:00:05 1565460005 175812


R
In R, we begin by reading the file (this is done relative to the current working directory (see getwd(), section 1.7.3). We then calculate the number of entries by dividing the file's length by two.

Next, we create two empty vectors of the correct length and type to store the data in, once we extract it from the file.

Once this preparatory work is completed, we loop (section 1.1.11) through the file, reading in the odd-numbered lines (lines (i-1)*2+1), some date/time values from the Eastern US time zone, with daylight savings applied. The gsub() function (also used in section 6.4.1 and related to the grep() function in 1.4.6) is used to replace matches determined by regular expression matching. In this situation, it is used to remove the time zone from the line before this processing. These date/time values are read into the timeval vector. Even-numbered lines (lines i*2) are read into the rank vector, after removing the string "salesrank=" (again using gsub()).

Finally, we make a data frame (section B.4.5) from the two vectors and display the first few lines using the head() function (section 1.13.1).


file <- readLines("sales.dat")
n <- length(file)/2
rank <- numeric(n)
timeval <- as.POSIXlt(rank, origin="1960-01-01")
for (i in 1:n) {
timeval[i] <- as.POSIXlt(gsub('EST', '',
gsub('EDT', '', file[(i-1)*2+1])),
tz="EST5EDT", format="%a %b %d %H:%M:%S %Y")
rank[i] <- as.numeric(gsub('salesrank= ','',file[i*2]))
}
timerank <- data.frame(timeval, rank)

We can review the results:

> head(timerank)
timeval rank
1 2009-08-09 14:38:13 141913
2 2009-08-09 14:40:04 141913
3 2009-08-09 15:00:05 149556
4 2009-08-09 16:00:05 158132
5 2009-08-09 17:00:05 166742
6 2009-08-09 18:00:05 175812

Thursday, September 17, 2009

Example 7.12: Calculate and plot a running average

The Law of Large Numbers concerns the stability of the mean, as sample sizes increase. This is an important topic in mathematical statistics. The convergence (or lack thereof, for certain distributions) can easily be visualized in SAS and R (see also Horton, Qian and Brown, 2004).

Assume that X1, X2, ..., Xn are independent and identically distributed realizations from some distribution with center mu. We define x-bar(k) as the average of the first k observations. Recall that because the expectation of a Cauchy random variable is undefined (Romano and Siegel, 1986) the sample average does not converge to the population parameter.

SAS
In SAS, we begin by writing a macro (section A.8) to generate the random variables and calculate the running average.

The macro runs a data step to generate the data and calculate the average after each value is generated using the sum function (section 1.8.1). The macro accepts the name of a data set to hold the variates and their running average, the number of random variates to generate, and the argument(s) to the rand function (section 1.10) so that variates from many distributions can be generated. For the purposes of making a dramatic example, we set a fixed random number seed (section 1.10.9).


%macro runave(outdata=, n=, distparms=) ;
data &outdata;
call streaminit(1984);
do i = 1 to &n;
x = rand &distparms;
runtot = sum(x, runtot);
avex = runtot/i;
output;
end;
run;
%mend runave;


In this example, we compare the Cauchy distribution to a t distribution with 4 degrees of freedom (for which the average does converge to the mean) for 1000 observations. To do this, we call the macro twice.


%runave(outdata=cauchy, n=1000, distparms= ("CAUCHY"));
%runave(outdata=t4, n=1000, distparms= ('T',4));


To plot the two running averages on a single figure, we combine the two data sets, using the in= data set option to identify which data set each observation came from. Then we plot the two curves using the a*b=c syntax for proc gplot (section 5.2.2). To connect the dots and choose line styles and colors, we use the symbol statement (sections 5.3.9-11), and a title statement (section 5.2.9) to add a title. SAS generates a legend by default with the a*b=c syntax.


data c_t4;
set cauchy t4 (in=t4);
if t4 then dist="t with 4 df";
else dist="Cauchy";
run;

symbol1 i=j c=black l=1 w=2;
symbol2 i=j c=black l=2 w=2;

title"Running average of two distributions";
proc gplot data=c_t4;
plot avex * i = dist / vref=0;
run;
quit;




R
Within R, we define a function (section B.5.2) to calculate the running average for a given vector, again allowing for variates from many distributions to be generated.


runave <- function(n, gendist, ...) {
x <- gendist(n, ...)
avex <- numeric(n)
for (k in 1:n) {
avex[k] <- mean(x[1:k])
}
return(data.frame(x, avex))
}



Next, we generate the data, using our new function. To make sure we have a nice example, we first set a fixed seed (section 1.10.9).


vals <- 1000
set.seed(1984)
cauchy <- runave(vals, rcauchy)
t4 <- runave(vals, rt, 4)



Finally, we're ready to plot. We begin by making an empty plot with the correct x and y limits, using the type="n" specification (section 5.1.1). We then plot the running average using the lines function (section 5.2.1) and varying the line style (section 5.3.9) and thickness (section 5.3.11) with the lty and lwd specification. Finally we add a title (section 5.2.9) and a legend (section 5.2.14).


plot(c(cauchy$avex, t4$avex), xlim=c(1, vals), type="n")
lines(1:vals, cauchy$avex, lty=1, lwd=2)
lines(1:vals, t4$avex, lty=2, lwd=2)
abline(0, 0)
title("Running average of two distributions")
legend(vals*.6, -1, legend=c("Cauchy", "t with 4 df"), lwd=2, lty=c(1,2))


 
Creative Commons License
SAS and R Blog by Ken Kleinman and Nicholas Horton is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.
Based on a work at sas-and-r.blogspot.com.