In the first piece, here, I discussed checking the replacement level for WAR. This was re-blogged by Tom Tango and led to a discussion with him and Mitchel Lichtman, one of the other co-authors of The Book (a seminal sabermetric book), about how the previous method may have had some survivor bias. With their recommendation I altered the method used to define the value of the players.
Instead of summing up the WAR for all the players then dividing it by the total number of plate appearances, their suggested method was to calculate a per PA/IP WAR for each player and then average them together. This would remove the bias of giving higher weight to the players who played in more games.
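To make the difference between the two averaging methods concrete, here is a minimal Python sketch. The three players and their numbers are made up purely for illustration, and the function names are my own, not anything taken from FanGraphs or The Book.

```python
# Hypothetical players as (WAR, plate appearances); the values are invented for illustration.
players = [(0.5, 600), (-0.2, 50), (-0.1, 20)]

def pooled_war_per_600(players):
    """Original method: total WAR divided by total PA, scaled to 600 PA.
    Players with more playing time dominate the result."""
    total_war = sum(war for war, _ in players)
    total_pa = sum(pa for _, pa in players)
    return 600 * total_war / total_pa

def mean_rate_war_per_600(players):
    """Suggested method: compute each player's WAR per 600 PA first,
    then take a simple average so every player counts equally."""
    rates = [600 * war / pa for war, pa in players if pa > 0]
    return sum(rates) / len(rates)

pooled = pooled_war_per_600(players)
per_player = mean_rate_war_per_600(players)
```

In this toy sample the pooled figure is pulled towards the one full-time player, while the per-player average gives the two short-stint players the same weight as him.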
Using this methodology, the values came out as -0.37 WAR/600PA for hitters and 0.13 WAR/180IP for pitchers. This brought the values down for both but still left them within a reasonable range of 0. Their performance for 2019 so far has averaged 0.32 WAR/600PA for hitters and -0.03 WAR/180IP for pitchers (for players with more than 10 PA or 5 IP).
These findings still suggest that the 1,000 WAR per 2,430 Major League games, currently used by FanGraphs and Baseball-Reference, is roughly in line with what is currently going on in MLB. It might be a chicken-and-egg issue, though, with that value coming out because teams are using it in their own formula to calculate replacement level.
The next step after determining the replacement level is to split these wins between hitters and pitchers. This is really a question about the value of defence. Right now, FanGraphs splits it so that 570 of those 1,000 WAR go to position players, with the remaining 430 credited to the pitchers.
The FanGraphs WAR split is actually 500/70/430 (Hitting/Defence/Pitching) each year, meaning that approximately 70 wins above replacement are given to defence each season. Baseball-Reference has a 590/0/410 split: its defensive WAR totals 0 wins across all teams each season, even though individual teams and players can have positive or negative defensive WAR.
Firstly let us look at what teams are actually doing. If these splits are a good approximation to the real world, we would expect them to spend the wage budget accordingly. Using the data available on Spotrac I got the payroll positional splits for all seasons since 2011. That gave me the following.

The payroll data suggests that MLB teams on average have not been too far away from this split but have generally erred on the side of paying more for pitchers than the split would suggest.
Side Note: An interesting feature of this data is that the closer a team has been to these splits, the higher the average team winning percentage is for the season. In the table you can see that Win% goes up as spending split increases to 40%-45% for pitchers and then dips down once you go above that level.

In theory, a run saved is a run scored. But whereas the relationship between singles, doubles (etc.), and runs produced is easily parsed with linear weights, defence is more difficult to measure. The steps between the events on the field and the runs being saved require more estimation, and that potentially injects more error into the final result.
Because of this error we can look at a regression model to see if the current split under- or over-values the impact of defence on run suppression. To do this we need a model that incorporates the runs a team concedes based on the performance of both the pitching and the defence. I will look at FanGraphs and Baseball-Reference splits separately.
FanGraphs
Looking at FanGraphs data from 2002-2018, I built the following linear model.
(team runs per 9 in year i) ~ ((pitcher WAR in year i) + (defensive WAR in year i))
Runs per 9 and team pitching WAR were simply taken from FanGraphs data, but the defensive WAR had to be calculated. I created defensive WAR by taking the team’s Def, which is a team’s Fielding Runs Above Average + positional adjustment. Then I added replacement value, by taking 12% (7/57) of the hitter replacement level for each team and season, to give Def above replacement. Finally I divided by the runs per win for each season from the FanGraphs seasonal constants page. This gave me roughly 70 Def WAR in total for each season.
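The defensive WAR steps above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function and parameter names are hypothetical, and it assumes the hitter replacement level is expressed in runs, so that it can be added to Def before converting to wins.

```python
def defensive_war(team_def_runs, hitter_replacement_runs, runs_per_win,
                  defence_share=7 / 57):
    """Sketch of the Def WAR calculation described above.

    team_def_runs: team Def (fielding runs above average + positional adjustment)
    hitter_replacement_runs: position-player replacement level in runs (assumed unit)
    runs_per_win: seasonal runs-per-win constant
    defence_share: share of the position-player replacement credited to defence
    """
    def_above_replacement = team_def_runs + defence_share * hitter_replacement_runs
    return def_above_replacement / runs_per_win

# Made-up inputs, chosen only to show the arithmetic.
example = defensive_war(team_def_runs=10.0, hitter_replacement_runs=570.0,
                        runs_per_win=10.0)
```

With these invented inputs, 7/57 of 570 replacement runs is 70 runs, so the team ends up with 80 Def runs above replacement before dividing by runs per win.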
With these three values I built a model which predicted runs per 9 (R/9) based on the other two. I first looked at the fit of the model using the default defensive WAR (r = 1). Then I multiplied the defensive WAR by a scaling factor r, trying values both higher and lower than the default. I scored the fit of the model to R/9 by looking at the root-mean-square error (RMSE); when this value is high, the model fit is poor, and when it is low, the model fits the data better.
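For readers who want to try something similar, here is a rough sketch of that kind of factor search in plain Python. It assumes the factor r multiplies defensive WAR before it is added to pitching WAR, and that R/9 is then regressed on that single combined value; the exact model specification may have differed, and the function names are hypothetical.

```python
import math

def fit_rmse(x, y):
    """Least-squares fit of y on a single predictor x; returns the RMSE of the fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return math.sqrt(sum(e * e for e in residuals) / n)

def best_defence_factor(pitch_war, def_war, runs_per_9, factors):
    """Try each factor r on defensive WAR and return (best r, RMSE for every r)."""
    scores = {}
    for r in factors:
        combined = [p + r * d for p, d in zip(pitch_war, def_war)]
        scores[r] = fit_rmse(combined, runs_per_9)
    best = min(scores, key=scores.get)
    return best, scores
```

A call would pass one entry per team-season in each list, plus a grid of factors such as 0.0, 0.1, 0.2 and so on; r = 0 corresponds to a model with no defence term at all.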

This regression suggests that defence might be slightly over-accounted for in the current model, with the lowest error coming at r = 0.82. The difference from the default isn’t that significant.
I then re-ran the model for ERA instead of R/9 and the result was very similar. The best model had a lower RMSE than the R/9 model. This is probably due to FIP (Fielding Independent Pitching), a core part of FanGraphs pitching WAR, which is scaled to ERA and not R/9.
This time the lowest error occurs around r = 0.73, which would make sense as some element of defence has already been accounted for in ERA (with runs scoring via errors being removed).

Baseball-Reference
I ran the same model as before but used the WAR values from Baseball-Reference. B-Ref lists dWAR values for each season so there was no need for me to calculate them.

For both these models (R/9 and ERA) the regression modelling suggests that defensive WAR is undervalued, with the lowest error models being r = 1.18 and r = 1.13 respectively. This is interesting as the Baseball-Reference method of calculating pitching WAR has a factor for team defence.
These results aren’t entirely surprising. Defensive WAR is not a truth revealed from on high; it was designed (by very capable sabermetricians) with full knowledge of the fact that it improved the overall understanding of runs allowed. Compare the no-defence models (r = 0) with the default: all the models are better with some accounting for defence. But none of its designers have claimed that it is perfect, and that is why the metrics are constantly evolving.
The coefficients which translate defensive play into runs weren’t chosen arbitrarily or from a random number generator, but rather calibrated with at least some attention given to their resulting models’ ability to fit things like ERA and R/9. For this reason, we shouldn’t be surprised to find that these defensive metrics are well-suited to predicting them.
This is by no means saying that FanGraphs’ or Baseball-Reference’s defensive metrics are inaccurate, as removing the defensive element made the models much worse in all cases. It is just a matter of scaling. This is a very simple piece of analysis to see how well these are calibrated within their own framework; if I were to come up with a more definite split suggestion, it would need further investigation of how the pitching and defensive WAR are calculated.
What it may show is that teams might be working off a slightly different distribution of wins than these models, unlike what we saw with the replacement level checks.
“In theory, a run saved is a run scored”
I’ve been thinking about this quite a bit recently, and the more I do, the more skeptical I become of the assumption. This philosophy, for me, competes with the idea that the run scoring environment must impact how valuable either of these actions are.
To research this, I got game results data for 2010-2018. For every score, I tried to determine what was most predictive of each team’s runs.
For example, a single row might look like this:
– Team A plays Team B in 2014.
– Team A averaged 4.5 runs scored per game in 2014
– Team B averaged 5.2 runs allowed per game in 2014.
– Team A scores 5 runs in this game.
– Conclusion (with a very slight degree of confidence): Team B’s pitching and defense are more predictive of Team A’s run scoring than Team A’s historic run scoring. In other words, runs are more dependent on and better predicted by “run prevention” capabilities than “run creation” capabilities.
Over the nine-year sample, I found that 52.5% of all scores were better predicted by the “run preventors” than the “run scorers” (note: this analysis was done without controlling for home/road splits, lineups, starting pitchers or weather, but I believe this noise should get drowned out across the sample). This difference held true year-to-year, with this value (in favor of preventors) fluctuating between 51.6% (2013) and 53.5% (2012).
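A minimal Python sketch of the comparison described above might look like this. It assumes "better predicted" means the season average that lands closer to the actual score, and it drops exact ties; how ties were actually handled is not stated, and the function names are hypothetical.

```python
def closer_predictor(runs_scored, scoring_avg, prevention_avg):
    """Return which season average was closer to the runs actually scored.

    scoring_avg: the batting team's runs scored per game that season
    prevention_avg: the opponent's runs allowed per game that season
    """
    scoring_error = abs(runs_scored - scoring_avg)
    prevention_error = abs(runs_scored - prevention_avg)
    if prevention_error < scoring_error:
        return "prevention"
    if scoring_error < prevention_error:
        return "scoring"
    return "tie"

def prevention_share(rows):
    """Share of non-tied rows where the opponent's runs allowed was the closer guess."""
    labels = [closer_predictor(*row) for row in rows]
    decided = [label for label in labels if label != "tie"]
    if not decided:
        return None
    return sum(label == "prevention" for label in decided) / len(decided)

# The example row from above: Team A scores 5, averages 4.5; Team B allows 5.2.
example_label = closer_predictor(5, 4.5, 5.2)
```

Each row passed to `prevention_share` is a tuple of (runs scored in the game, batting team's season average, opponent's season runs allowed average).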
If valid, do you think this research suggests that perhaps our 57:43 split undervalues either pitching or defense or both?