grep("Fumb", AllBig$Detail, ignore.case=TRUE)
I labeled each play as Late or not according to whether it happened after the rule change:
AllBig$Late <- ifelse(AllBig$Year > 2006, 1, 0)Now for the analysis. My data set has 7558 plays including 145 fumbles (1.9%). I used the mosaic package and the tally() command to see how often teams other than the Patriots fumble:
require(mosaic) tally(~Fumble+Late, data=filter(AllBig,Pats==0))
             Late 
Fumble     0    1      
 0      2588 2919      
 1        54   65
Then I asked for the data in proportion terms:
tally(Fumble~Late, data=filter(AllBig,Pats==0))and got
               Late 
Fumble       0      1      
 0      0.9796 0.9782      
 1      0.0204 0.0218
For non-Pats there is a tiny increase in fumbles.
This can be displayed graphically using a mosaiplot (though it's not a particularly compelling figure).
mosaicplot(Fumble~Late, data=filter(AllBig,Pats==0))
Repeating this for the Patriots shows a different picture:
tally(~Fumble+Late, data=filter(AllBig,Pats==1))       
         Late 
Fumble   0   1      
 0     996 910      
 1      19   7
tally(Fumble~Late, data=filter(AllBig,Pats==1))       
                Late 
Fumble       0       1      
 0     0.98128 0.99237      
 1     0.01872 0.00763
I fit a logistic regression model with the glm() command:
glm(Fumble~Late*Pats, family=binomial, data=AllBig)
Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -3.8697 0.1375 -28.14 <2e-16 *** Late 0.0650 0.1861 0.35 0.727 Pats -0.0897 0.2693 -0.33 0.739 Late:Pats -0.9733 0.4819 -2.02 0.043 *I wanted to control for any weather effect, so I coded the weather as Bad if it was raining or snowing and good if not. This led to a model that includes BadWeather and Temperature – which turn out not to make much of a difference:
AllBig$BadWeather <- ifelse(AllBig$Weather %in% c("drizzle","rain","snow"), 1, 0)
glm(formula = Fumble ~ BadWeather + Temp + Late * Pats, 
  family = binomial, data = AllBig)
Coefficients:             
               Estimate Std. Error z value Pr(>|z|)     
(Intercept) -4.23344    0.43164   -9.81   <2e-16 *** 
BadWeather   0.33259    0.29483    1.13     0.26     
Temp         0.00512    0.00612    0.84     0.40     
Late         0.08871    0.18750    0.47     0.64     
Pats        -0.14183    0.27536   -0.52     0.61     
Late:Pats   -0.91062    0.48481   -1.88     0.06 .  
Because there was suspicion that something changed starting in 2007 I added a three-way interaction:
glm(formula = Fumble ~ BadWeather + Temp + IsAway * Late * Pats,
  family = binomial, data = AllBig)
Coefficients:                  
                    Estimate Std. Error z value Pr(>|z|)     
(Intercept)      -4.51110    0.47707   -9.46   <2e-16 *** 
BadWeather        0.34207    0.30013    1.14    0.254     
Temp              0.00831    0.00653    1.27    0.203     
IsAway            0.14791    0.27549    0.54    0.591     
Late              0.13111    0.26411    0.50    0.620     
Pats             -0.80019    0.54360   -1.47    0.141     
IsAway:Late      -0.07348    0.37463   -0.20    0.845     
IsAway:Pats       0.94335    0.63180    1.49    0.135     
Late:Pats         0.51536    0.71379    0.72    0.470     
IsAway:Late:Pats -3.14345    1.29480   -2.43    0.015 *  
There is some evidence here that the Patriots fumble less than the rest of the NFL and that things changed in 2007. The p-values above are based on asymptotic normality, but there is a cleaner and easier way to think about the Patriots’ fumble rate. I wrote a short simulation that mimics something I do in my statistics classes, where I use a physical deck of cards to show what each step in the R simulation is doing.
#Simulation of deflategate data null hypothesis
Late = rep(1,72)  #creates 72 late fumbles
Early = rep(0,73)   #creates 73 early fumbles
alldata = append(Late,Early)   #puts the two groups together
table(alldata)  #check to see that we have what we want
cards =1:length(alldata)  # creates 145 cards, one "ID number" per fumble
FumbleLate = NULL  # initializes a vector to hold the results
for (i in 1:10000){# starts a loop that will be executed 10,000 times
  cardsgroup1 = sample(cards,119, replace=FALSE) # takes a sample of 119 cards
  cardsgroup2 = cards[-cardsgroup1]  # puts the remaining cards in group 2
  NEPats = (alldata[cardsgroup2])  #reads the values of the cards in group 2
  FumbleLate[i] = sum(NEPats)  # counts NEPats late fumbles (the only stat we need)
}
table(FumbleLate) #look at the results
hist(FumbleLate, breaks=seq(2.5,23.5)) #graph the results
sum(FumbleLate <= 7)/10000 # How rare is 7 (or fewer)? Answer: around 0.0086
Additional note: kudos to Steve Taylor for the following graphical depiction of the interaction.
An unrelated note about aggregators: We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

Here is another way (and perhaps more natural way) to conduct a simulation, but it is one that cannot be demonstrated with a physical deck of cards.
ReplyDelete#Second simulation of deflategate data null hypothesis
Late = rep(1,26) #creates 26 Pats fumbles
Early = rep(0,1906) #creates 1906 Pats non-fumbles
alldata = append(Late,Early) #puts the two groups together
table(alldata) #check to see that we have what we want
cards =1:length(alldata) #creates 1932 cards, one "ID number" per fumble
FumbleLate = NULL #initializes a vector to hold the results
for (i in 1:10000){#starts a loop that will be executed 10,000 times
cardsgroup1 =sample(cards,1015,replace=FALSE) #takes a sample of 1015 cards
cardsgroup2 = cards[-cardsgroup1] #puts the remaining cards in group 2
NEPats = (alldata[cardsgroup2]) #reads the values of the cards in group 2
FumbleLate[i] = sum(NEPats) #counts NEPats late fumbles (the only stat we need)
}
table(FumbleLate)
hist(FumbleLate, breaks=seq(2.5,23.5))
h=hist(FumbleLate, breaks=seq(2.5,23.5),plot=FALSE)
clr=ifelse(h$breaks<7,"red","white")
plot(h,col=clr)
sum(FumbleLate <= 7)/10000
Great analysis. Could you please also release the code for the plot? Adding error bars seems to be a toughy in R.
ReplyDeleteCheers, Walter.
This comment has been removed by the author.
ReplyDelete[Edit] The essential parts, after converting the binary variables to factors with levels c('No','Yes') are these:
ReplyDeletelibrary(effects)
glm1 = glm(Fumble ~ BadWeather + Temp + Late * IsAway * Pats, binomial, AllBig)
plot(effect("Late*IsAway*Pats", glm1), ylim=c(0,0.06), rescale=FALSE, cex=0.9)
Steve, thanks for the figure and the code. I really like your personal functions (and thought that others might as well).
ReplyDeletelibrary(effects)
# Two of my personal functions included here as they're used below. - Steve Taylor
NoYes = function(bool) {
if(!is.logical(bool)) bool = as.logical(bool) # No warning
factor(ifelse(bool,"Yes","No"),levels=c('No','Yes'))
}
word.png = function(filename="Word_Figure_%03d.png", zoom=4, width=17, height=10, pointsize=10, ...) {
if (!grepl("[.]png$", filename, ignore.case=TRUE))
filename = paste0(filename,".png")
png(filename=filename, res=96*zoom,
width=width, height=height, units='cm', pointsize=pointsize, ...)
}
# downloaded from: http://www.amherst.edu/~nhorton/AllBig.csv
AllBig = read.csv('AllBig.csv')
AllBig = transform(AllBig,
IsAway = NoYes(IsAway==1),
BadWeather = NoYes(BadWeather==1),
Pats = NoYes(Pats==1),
Late = NoYes(Late==1),
Fumble = NoYes(Fumble==1)
)
glm1 = glm(Fumble ~ BadWeather + Temp + Late * IsAway * Pats, binomial, AllBig)
summary(glm1)
word.png('AllBig logistic regression', zoom=2)
{
palette(c('black','grey70'))
pars = trellis.par.get()
pars$fontsize$text = 9
trellis.par.set(pars)
print(plot(effect("Late*IsAway*Pats", glm1), ylim=c(0,0.06), rug=FALSE, rescale=FALSE, cex=0.9))
}
dev.off()
i like it
ReplyDeleteThis comment has been removed by the author.
ReplyDelete