SAS and R: 2015.2: Did the New England Patriots experience a decrease in fumbles starting in 2007?

Sunday, February 1, 2015

2015.2: Did the New England Patriots experience a decrease in fumbles starting in 2007?

Here's a timely guest entry from Jeffrey Witmer (Oberlin College).

As the “Deflate Gate” saga was unfolding, Warren Sharp analyzed “touches per fumble” for NFL teams before and after 2006, when a rule was changed so that teams playing on the road could provide their own footballs (http://www.sharpfootballanalysis.com/blog/). Sharp noted that, for whatever reason, the Patriots went from being a typical team, as regards fumbling, to a team with a very low fumble rate. Rather than rely on the data the Sharp provides at his website, I choose to collect and analyze some data on my own. I took a random sample of 30 games played by New England and 30 other games. For each game, I recorded all rushing and passing plays (except for QB kneels), but excluded kicking plays (the NFL, rather than the teams, provides special footballs for those plays). (Data source: http://www.pro-football-reference.com/play-index/play_finder.cgi.) I also recorded the weather for the game. (Data source: http://www.wunderground.com/history/.) Once I had the data (in a file that I called AllBig, which can be downloaded from http://www.amherst.edu/~nhorton/AllBig.csv), I noted whether or not there was a fumble on each play, aided by the grep() command:

grep("Fumb", AllBig$Detail, ignore.case=TRUE)

I labeled each play as Late or not according to whether it happened after the rule change:

AllBig$Late <- ifelse(AllBig$Year > 2006, 1, 0)

Now for the analysis. My data set has 7558 plays including 145 fumbles (1.9%). I used the mosaic package and the tally() command to see how often teams other than the Patriots fumble:

require(mosaic)
tally(~Fumble+Late, data=filter(AllBig,Pats==0))

             Late 
Fumble     0    1      
 0      2588 2919      
 1        54   65

Then I asked for the data in proportion terms:

tally(Fumble~Late, data=filter(AllBig,Pats==0))

and got

               Late 
Fumble       0      1      
 0      0.9796 0.9782      
 1      0.0204 0.0218

For non-Pats there is a tiny increase in fumbles. This can be displayed graphically using a mosaiplot (though it's not a particularly compelling figure). mosaicplot(Fumble~Late, data=filter(AllBig,Pats==0)) Repeating this for the Patriots shows a different picture:

tally(~Fumble+Late, data=filter(AllBig,Pats==1))       
         Late 
Fumble   0   1      
 0     996 910      
 1      19   7


tally(Fumble~Late, data=filter(AllBig,Pats==1))       
                Late 
Fumble       0       1      
 0     0.98128 0.99237      
 1     0.01872 0.00763

I fit a logistic regression model with the glm() command: glm(Fumble~Late*Pats, family=binomial, data=AllBig)

Coefficients:             
  Estimate Std. Error z value Pr(>|z|)     
(Intercept)  -3.8697     0.1375  -28.14   <2e-16 *** 
Late          0.0650     0.1861    0.35    0.727     
Pats         -0.0897     0.2693   -0.33    0.739     
Late:Pats    -0.9733     0.4819   -2.02    0.043 *

I wanted to control for any weather effect, so I coded the weather as Bad if it was raining or snowing and good if not. This led to a model that includes BadWeather and Temperature – which turn out not to make much of a difference:

AllBig$BadWeather <- ifelse(AllBig$Weather %in% c("drizzle","rain","snow"), 1, 0)

glm(formula = Fumble ~ BadWeather + Temp + Late * Pats, 
  family = binomial, data = AllBig)

Coefficients:             
               Estimate Std. Error z value Pr(>|z|)     
(Intercept) -4.23344    0.43164   -9.81   <2e-16 *** 
BadWeather   0.33259    0.29483    1.13     0.26     
Temp         0.00512    0.00612    0.84     0.40     
Late         0.08871    0.18750    0.47     0.64     
Pats        -0.14183    0.27536   -0.52     0.61     
Late:Pats   -0.91062    0.48481   -1.88     0.06 .

Because there was suspicion that something changed starting in 2007 I added a three-way interaction:

glm(formula = Fumble ~ BadWeather + Temp + IsAway * Late * Pats,
  family = binomial, data = AllBig)

Coefficients:                  
                    Estimate Std. Error z value Pr(>|z|)     
(Intercept)      -4.51110    0.47707   -9.46   <2e-16 *** 
BadWeather        0.34207    0.30013    1.14    0.254     
Temp              0.00831    0.00653    1.27    0.203     
IsAway            0.14791    0.27549    0.54    0.591     
Late              0.13111    0.26411    0.50    0.620     
Pats             -0.80019    0.54360   -1.47    0.141     
IsAway:Late      -0.07348    0.37463   -0.20    0.845     
IsAway:Pats       0.94335    0.63180    1.49    0.135     
Late:Pats         0.51536    0.71379    0.72    0.470     
IsAway:Late:Pats -3.14345    1.29480   -2.43    0.015 *

There is some evidence here that the Patriots fumble less than the rest of the NFL and that things changed in 2007. The p-values above are based on asymptotic normality, but there is a cleaner and easier way to think about the Patriots’ fumble rate. I wrote a short simulation that mimics something I do in my statistics classes, where I use a physical deck of cards to show what each step in the R simulation is doing.

#Simulation of deflategate data null hypothesis
Late = rep(1,72)  #creates 72 late fumbles
Early = rep(0,73)   #creates 73 early fumbles
alldata = append(Late,Early)   #puts the two groups together
table(alldata)  #check to see that we have what we want

cards =1:length(alldata)  # creates 145 cards, one "ID number" per fumble

FumbleLate = NULL  # initializes a vector to hold the results
for (i in 1:10000){# starts a loop that will be executed 10,000 times
  cardsgroup1 = sample(cards,119, replace=FALSE) # takes a sample of 119 cards
  cardsgroup2 = cards[-cardsgroup1]  # puts the remaining cards in group 2
  NEPats = (alldata[cardsgroup2])  #reads the values of the cards in group 2
  FumbleLate[i] = sum(NEPats)  # counts NEPats late fumbles (the only stat we need)
}

table(FumbleLate) #look at the results
hist(FumbleLate, breaks=seq(2.5,23.5)) #graph the results

sum(FumbleLate <= 7)/10000 # How rare is 7 (or fewer)? Answer: around 0.0086

Additional note: kudos to Steve Taylor for the following graphical depiction of the interaction.

An unrelated note about aggregators: We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

7 comments:

Jeff Witmer said...: Here is another way (and perhaps more natural way) to conduct a simulation, but it is one that cannot be demonstrated with a physical deck of cards.

#Second simulation of deflategate data null hypothesis
Late = rep(1,26) #creates 26 Pats fumbles
Early = rep(0,1906) #creates 1906 Pats non-fumbles
alldata = append(Late,Early) #puts the two groups together
table(alldata) #check to see that we have what we want

cards =1:length(alldata) #creates 1932 cards, one "ID number" per fumble

FumbleLate = NULL #initializes a vector to hold the results
for (i in 1:10000){#starts a loop that will be executed 10,000 times
cardsgroup1 =sample(cards,1015,replace=FALSE) #takes a sample of 1015 cards
cardsgroup2 = cards[-cardsgroup1] #puts the remaining cards in group 2
NEPats = (alldata[cardsgroup2]) #reads the values of the cards in group 2
FumbleLate[i] = sum(NEPats) #counts NEPats late fumbles (the only stat we need)
}
table(FumbleLate)
hist(FumbleLate, breaks=seq(2.5,23.5))
h=hist(FumbleLate, breaks=seq(2.5,23.5),plot=FALSE)
clr=ifelse(h$breaks<7,"red","white")
plot(h,col=clr)
sum(FumbleLate <= 7)/10000; February 2, 2015 at 1:43 PM
Unknown said...: Great analysis. Could you please also release the code for the plot? Adding error bars seems to be a toughy in R.

Cheers, Walter.; February 3, 2015 at 1:36 AM
Steve said...: This comment has been removed by the author.; February 3, 2015 at 7:20 PM
Steve said...: [Edit] The essential parts, after converting the binary variables to factors with levels c('No','Yes') are these:

library(effects)
glm1 = glm(Fumble ~ BadWeather + Temp + Late * IsAway * Pats, binomial, AllBig)
plot(effect("Late*IsAway*Pats", glm1), ylim=c(0,0.06), rescale=FALSE, cex=0.9); February 3, 2015 at 7:23 PM
Nick Horton said...: Steve, thanks for the figure and the code. I really like your personal functions (and thought that others might as well).

library(effects)

# Two of my personal functions included here as they're used below. - Steve Taylor

NoYes = function(bool) {
if(!is.logical(bool)) bool = as.logical(bool) # No warning
factor(ifelse(bool,"Yes","No"),levels=c('No','Yes'))
}

word.png = function(filename="Word_Figure_%03d.png", zoom=4, width=17, height=10, pointsize=10, ...) {
if (!grepl("[.]png$", filename, ignore.case=TRUE))
filename = paste0(filename,".png")
png(filename=filename, res=96*zoom,
width=width, height=height, units='cm', pointsize=pointsize, ...)
}

# downloaded from: http://www.amherst.edu/~nhorton/AllBig.csv
AllBig = read.csv('AllBig.csv')

AllBig = transform(AllBig,
IsAway = NoYes(IsAway==1),
BadWeather = NoYes(BadWeather==1),
Pats = NoYes(Pats==1),
Late = NoYes(Late==1),
Fumble = NoYes(Fumble==1)
)

glm1 = glm(Fumble ~ BadWeather + Temp + Late * IsAway * Pats, binomial, AllBig)
summary(glm1)

word.png('AllBig logistic regression', zoom=2)
{
palette(c('black','grey70'))
pars = trellis.par.get()
pars$fontsize$text = 9
trellis.par.set(pars)
print(plot(effect("Late*IsAway*Pats", glm1), ylim=c(0,0.06), rug=FALSE, rescale=FALSE, cex=0.9))
}
dev.off(); February 4, 2015 at 8:25 AM
scholarship essay writing service said...: i like it; April 6, 2015 at 6:12 AM
Unknown said...: This comment has been removed by the author.; August 7, 2015 at 10:07 AM

Reviews (from the first edition)

"By placing the R and SAS solutions together and by covering a vast array of tasks in one book, Kleinman and Horton have added surprising value and searchability to the information in their book. … a home run, and it is a book I am grateful to have sitting, dust-free, on my shelf."
—Robert Alan Greevy, Jr, Teaching of Statistics in the Health Sciences

"I use SAS and R on a daily basis. Each has strengths and weaknesses, and using both of them gives the advantage of being able to do almost anything when it comes to data manipulation, analysis, and graphics. If you use both SAS and R on a regular basis, get this book. If you know one of the packages and are learning the other, you may need more than this book, but get this book, too. "

Charles Heckler, University of Rochester, Technometrics

"Excellent cross-referencing to other topics and end-of-chapter worked examples on the ‘Health evaluation and linkage to primary care’ data set are given with each topic. … users who are proficient in either of the software packages but with the need to use the other will find this book useful."
—Frances Denny, Journal of the Royal Statistical Society, Series A

About the authors

Nicholas Horton is a Professor of Statistics at Amherst College. He is a biostatistician with expertise in missing data methods, longitudinal regression, statistical computing and statistical education. Nick's home page; Nick's Google Scholar author page

Ken Kleinman is an Associate Professor with the Department of Biostatistics and Epidemiology at the University of Massachusetts, Amherst. He is a consulting biostatistician with expertise in group-randomized trials and disease surveillance; he also offers R training courses. Ken's home page; Ken's Google Scholar author page.

SAS and R

Catalogs of posts

Sunday, February 1, 2015

2015.2: Did the New England Patriots experience a decrease in fumbles starting in 2007?

7 comments:

About SAS and R

Topics discussed