Sunday, February 1, 2015

2015.2: Did the New England Patriots experience a decrease in fumbles starting in 2007?

Here's a timely guest entry from Jeffrey Witmer (Oberlin College).

As the “Deflate Gate” saga was unfolding, Warren Sharp analyzed “touches per fumble” for NFL teams before and after 2006, when a rule was changed so that teams playing on the road could provide their own footballs (http://www.sharpfootballanalysis.com/blog/). Sharp noted that, for whatever reason, the Patriots went from being a typical team, as regards fumbling, to a team with a very low fumble rate. Rather than rely on the data that Sharp provides at his website, I chose to collect and analyze some data on my own. I took a random sample of 30 games played by New England and 30 other games. For each game, I recorded all rushing and passing plays (except for QB kneels), but excluded kicking plays (the NFL, rather than the teams, provides special footballs for those plays). (Data source: http://www.pro-football-reference.com/play-index/play_finder.cgi.) I also recorded the weather for each game. (Data source: http://www.wunderground.com/history/.) Once I had the data (in a file that I called AllBig, which can be downloaded from http://www.amherst.edu/~nhorton/AllBig.csv), I noted whether or not there was a fumble on each play, aided by the grep() command:
grep("Fumb", AllBig$Detail, ignore.case=TRUE)
I labeled each play as Late or not according to whether it happened after the rule change:
AllBig$Late <- ifelse(AllBig$Year > 2006, 1, 0)
Now for the analysis. My data set has 7558 plays including 145 fumbles (1.9%). I used the mosaic package and the tally() command to see how often teams other than the Patriots fumble:
require(mosaic)
tally(~Fumble+Late, data=filter(AllBig,Pats==0))  
      Late
Fumble    0    1
     0 2588 2919
     1   54   65
Then I asked for the data in proportion terms:
tally(Fumble~Late, data=filter(AllBig,Pats==0))
and got
      Late
Fumble      0      1
     0 0.9796 0.9782
     1 0.0204 0.0218
For the non-Patriots games there is a tiny increase in the fumble rate. This can be displayed graphically using a mosaic plot (though it's not a particularly compelling figure):

mosaicplot(Fumble~Late, data=filter(AllBig,Pats==0))

Repeating this for the Patriots shows a different picture:
tally(~Fumble+Late, data=filter(AllBig,Pats==1))       
      Late
Fumble   0   1
     0 996 910
     1  19   7


tally(Fumble~Late, data=filter(AllBig,Pats==1))       
      Late
Fumble       0       1
     0 0.98128 0.99237
     1 0.01872 0.00763
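To see what a model should find, the two tables can be reduced to odds ratios (a quick sketch using the counts above; the variable names are mine):

or_other <- (65/2919) / (54/2588)  # late-to-early fumble odds ratio, other teams: about 1.07
or_pats <- (7/910) / (19/996)      # late-to-early fumble odds ratio, Patriots: about 0.40
or_pats / or_other                 # ratio of ratios, about 0.38; compare exp(Late:Pats) below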
I fit a logistic regression model with the glm() command:

glm(Fumble~Late*Pats, family=binomial, data=AllBig)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -3.8697     0.1375  -28.14   <2e-16 ***
Late          0.0650     0.1861    0.35    0.727
Pats         -0.0897     0.2693   -0.33    0.739
Late:Pats    -0.9733     0.4819   -2.02    0.043 *
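Exponentiating the interaction puts it on the odds-ratio scale:

exp(-0.9733)  # about 0.38, matching the ratio of odds ratios computed above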
I wanted to control for any weather effect, so I coded the weather as bad (1) if it was raining or snowing and good (0) otherwise. This led to a model that includes BadWeather and Temp (temperature), which turn out not to make much of a difference:
AllBig$BadWeather <- ifelse(AllBig$Weather %in% c("drizzle","rain","snow"), 1, 0)
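As a quick sanity check on the recoding (using the already-loaded mosaic package):

tally(~BadWeather, data=AllBig)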

glm(formula = Fumble ~ BadWeather + Temp + Late * Pats, 
  family = binomial, data = AllBig)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.23344    0.43164   -9.81   <2e-16 ***
BadWeather   0.33259    0.29483    1.13     0.26
Temp         0.00512    0.00612    0.84     0.40
Late         0.08871    0.18750    0.47     0.64
Pats        -0.14183    0.27536   -0.52     0.61
Late:Pats   -0.91062    0.48481   -1.88     0.06 .
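Exponentiating again shows that the interaction barely moves after the weather adjustment:

exp(-0.91062)  # about 0.40, little changed from the unadjusted model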
Because there was suspicion that something changed starting in 2007, and because the rule change concerned road teams providing their own footballs, I added IsAway (whether the team was playing away) and a three-way interaction:
glm(formula = Fumble ~ BadWeather + Temp + IsAway * Late * Pats,
  family = binomial, data = AllBig)

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
(Intercept)      -4.51110    0.47707   -9.46   <2e-16 ***
BadWeather        0.34207    0.30013    1.14    0.254
Temp              0.00831    0.00653    1.27    0.203
IsAway            0.14791    0.27549    0.54    0.591
Late              0.13111    0.26411    0.50    0.620
Pats             -0.80019    0.54360   -1.47    0.141
IsAway:Late      -0.07348    0.37463   -0.20    0.845
IsAway:Pats       0.94335    0.63180    1.49    0.135
Late:Pats         0.51536    0.71379    0.72    0.470
IsAway:Late:Pats -3.14345    1.29480   -2.43    0.015 *
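One way to make the three-way interaction concrete is to compare fitted probabilities. Here is a minimal sketch, assuming the model above is saved as glm3; the covariate values BadWeather=0 and Temp=60 are illustrative choices of mine:

glm3 <- glm(Fumble ~ BadWeather + Temp + IsAway*Late*Pats, family=binomial, data=AllBig)
newdat <- expand.grid(BadWeather=0, Temp=60, IsAway=c(0,1), Late=c(0,1), Pats=1)
cbind(newdat, prob=predict(glm3, newdata=newdat, type="response"))  # fitted fumble probabilities for the Patriots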
There is some evidence here that the Patriots fumble less than the rest of the NFL and that things changed in 2007. The p-values above are based on asymptotic normality, but there is a cleaner and easier way to think about the Patriots’ fumble rate. I wrote a short simulation that mimics something I do in my statistics classes, where I use a physical deck of cards to show what each step in the R simulation is doing.
#Simulation of deflategate data null hypothesis
Late = rep(1,72)  #creates 72 late fumbles
Early = rep(0,73)   #creates 73 early fumbles
alldata = append(Late,Early)   #puts the two groups together
table(alldata)  #check to see that we have what we want

cards = 1:length(alldata)  # creates 145 cards, one "ID number" per fumble

FumbleLate = NULL  # initializes a vector to hold the results
for (i in 1:10000) {  # starts a loop that will be executed 10,000 times
  cardsgroup1 = sample(cards, 119, replace=FALSE)  # takes a sample of 119 cards (the non-Pats fumbles)
  cardsgroup2 = cards[-cardsgroup1]  # puts the remaining 26 cards in group 2 (the Pats fumbles)
  NEPats = alldata[cardsgroup2]  # reads the values of the cards in group 2
  FumbleLate[i] = sum(NEPats)  # counts NEPats late fumbles (the only stat we need)
}

table(FumbleLate) #look at the results
hist(FumbleLate, breaks=seq(2.5,23.5)) #graph the results

sum(FumbleLate <= 7)/10000 # How rare is 7 (or fewer)? Answer: around 0.0086
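As a check on the simulation, the number of late Patriots fumbles under this null is hypergeometric (26 fumbles drawn from 72 late and 73 early), so the exact tail probability is available directly:

phyper(7, m=72, n=73, k=26)  # exact P(7 or fewer late fumbles); close to the simulated 0.0086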
Additional note: kudos to Steve Taylor for the following graphical depiction of the interaction.

[Figure: fitted fumble probabilities by Late, IsAway, and Pats, drawn with the effects package]

Comments:

Jeff Witmer said...

Here is another (and perhaps more natural) way to conduct a simulation, but it is one that cannot readily be demonstrated with a physical deck of cards.

#Second simulation of deflategate data null hypothesis
Late = rep(1,26) #creates 26 Pats fumbles
Early = rep(0,1906) #creates 1906 Pats non-fumbles
alldata = append(Late,Early) #puts the two groups together
table(alldata) #check to see that we have what we want

cards = 1:length(alldata)  #creates 1932 cards, one "ID number" per play

FumbleLate = NULL  #initializes a vector to hold the results
for (i in 1:10000) {  #starts a loop that will be executed 10,000 times
  cardsgroup1 = sample(cards, 1015, replace=FALSE)  #takes a sample of 1015 cards (the early plays)
  cardsgroup2 = cards[-cardsgroup1]  #puts the remaining 917 cards in group 2 (the late plays)
  NEPats = alldata[cardsgroup2]  #reads the values of the cards in group 2
  FumbleLate[i] = sum(NEPats)  #counts late fumbles (the only stat we need)
}
table(FumbleLate)  #look at the results
hist(FumbleLate, breaks=seq(2.5,23.5))  #graph the results
h = hist(FumbleLate, breaks=seq(2.5,23.5), plot=FALSE)  #recompute the histogram without plotting
clr = ifelse(h$breaks < 7, "red", "white")  #color the bars for 7 or fewer late fumbles
plot(h, col=clr)  #redraw with the lower tail highlighted
sum(FumbleLate <= 7)/10000  #proportion of simulations with 7 or fewer late fumbles
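As with the first simulation, this null is hypergeometric (26 fumbles spread over 1015 early and 917 late plays), so the exact tail probability can again be computed directly:

phyper(7, m=26, n=1906, k=917)  # exact P(7 or fewer fumbles among the 917 late plays)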

Unknown said...

Great analysis. Could you please also release the code for the plot? Adding error bars seems to be a toughy in R.

Cheers, Walter.

Steve said...

The essential parts, after converting the binary variables to factors with levels c('No','Yes'), are these:

library(effects)
glm1 = glm(Fumble ~ BadWeather + Temp + Late * IsAway * Pats, binomial, AllBig)
plot(effect("Late*IsAway*Pats", glm1), ylim=c(0,0.06), rescale=FALSE, cex=0.9)

Nick Horton said...

Steve, thanks for the figure and the code. I really like your personal functions (and thought that others might as well).

library(effects)

# Two of my personal functions included here as they're used below. - Steve Taylor

NoYes = function(bool) {
  if (!is.logical(bool)) bool = as.logical(bool)  # coerce 0/1 input to logical without a warning
  factor(ifelse(bool, "Yes", "No"), levels=c('No','Yes'))
}

word.png = function(filename="Word_Figure_%03d.png", zoom=4, width=17, height=10, pointsize=10, ...) {
  if (!grepl("[.]png$", filename, ignore.case=TRUE))
    filename = paste0(filename, ".png")
  png(filename=filename, res=96*zoom,
      width=width, height=height, units='cm', pointsize=pointsize, ...)
}

# downloaded from: http://www.amherst.edu/~nhorton/AllBig.csv
AllBig = read.csv('AllBig.csv')

AllBig = transform(AllBig,
  IsAway = NoYes(IsAway==1),
  BadWeather = NoYes(BadWeather==1),
  Pats = NoYes(Pats==1),
  Late = NoYes(Late==1),
  Fumble = NoYes(Fumble==1)
)

glm1 = glm(Fumble ~ BadWeather + Temp + Late * IsAway * Pats, binomial, AllBig)
summary(glm1)


word.png('AllBig logistic regression', zoom=2)
{
  palette(c('black','grey70'))
  pars = trellis.par.get()
  pars$fontsize$text = 9
  trellis.par.set(pars)
  print(plot(effect("Late*IsAway*Pats", glm1), ylim=c(0,0.06), rug=FALSE, rescale=FALSE, cex=0.9))
}
dev.off()
