How to measure a pitcher’s overall performance?

The question of how good someone is at a task extends beyond baseball. In professional sports, where very large sums of money are now involved in player development and contracts, it has become a question people want to answer more accurately than ever before. Over the decades, baseball has used different stats to determine who is best. For pitching we have had win-loss records, saves, ERA (earned run average) and WHIP (walks and hits per inning pitched) all used to describe how good a player is.

In 1999, Voros McCracken became the first person to detail and publicize the effects of defense on pitchers to the baseball research community, taking the evaluation of pitchers to a new level. Until the publication of a more widely read article on Baseball Prospectus in 2001, most of the baseball research community believed that individual pitchers had an inherent ability to prevent hits on balls in play. McCracken reasoned that if this ability existed, it would be noticeable in a pitcher’s BABIP (Batting Average on Balls In Play). His research found the opposite to be true: while a pitcher’s ability to cause strikeouts or prevent home runs remained somewhat constant from season to season, his ability to prevent hits on balls in play did not.

After this breakthrough, various pitcher statistics were developed (FIP, xFIP, cFIP, SIERA, DRA). They attempt to remove the influence of the defenders around a pitcher from the metric describing the pitcher’s performance, to give a true reflection of his ability. These stats all focus on slightly different things, with some trying to describe the present and some trying to predict the future.

How well do these perform as descriptive and predictive stats?

I will use RE24 as the measure of a pitcher’s performance. Unlike ERA, which hits one pitcher with the entire consequence of a handed-off runner, RE24 debits a departing pitcher solely for the run expectancy of the situation left behind. Similarly, it debits a reliever only for the runs scored in light of that pre-existing expectancy. The reliever who gets out of an inherited jam will be credited accordingly. In RE24, runs are runs, regardless of whether they are “earned” or not.
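The handed-off runner case can be sketched in a few lines of Python. This is only an illustration: the run expectancy values are placeholders rather than figures from any real season, the function name is hypothetical, and the sign convention here is "runs charged to the pitcher" (positive is bad for the pitcher), so a published RE24 figure where higher is better would be the negative of this.

```python
# Illustrative run expectancy by (outs, runners) state.
# These numbers are placeholders, not values from a real season.
RUN_EXPECTANCY = {
    (0, "---"): 0.50,
    (1, "1--"): 0.52,
    (3, "---"): 0.00,  # inning over, nothing more can score
}


def runs_charged(start_state, end_state, runs_scored):
    """Change in run expectancy over a pitcher's stint, plus runs that scored."""
    return RUN_EXPECTANCY[end_state] - RUN_EXPECTANCY[start_state] + runs_scored


# Starter opens the inning, then leaves with one out and a runner on first.
starter_charge = runs_charged((0, "---"), (1, "1--"), runs_scored=0)

# Reliever inherits that situation; the runner scores before the inning ends.
reliever_charge = runs_charged((1, "1--"), (3, "---"), runs_scored=1)

print(round(starter_charge, 2))   # 0.02
print(round(reliever_charge, 2))  # 0.48
```

Under ERA the run would be charged in full to the starter. In this sketch the starter is charged only for the small rise in expectancy he left behind, and the reliever is charged for the run minus the expectancy he inherited.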

RE24 is not perfect, either. It does not consider defense and holds a pitcher fully responsible for everything that happens on the field on a play. But these shortcomings are equally true of ERA. In this article, I will use RE24 per plate appearance (RE24/PA) to compare the abilities of these metrics to one another. RE24 is published at FanGraphs.

The pitching metrics all correlate inversely with FanGraphs’ RE24, so remember that -1.0 is the best possible score (a perfect negative correlation) and 0.0 the worst (no correlation). I am using a Pearson correlation across three scenarios: starters, relievers and both combined. The date range for all this work is 2002 to 2018, largely because 2002 is the earliest season for which FanGraphs has fly ball rate data, which the expected stats require.
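For readers who want to see what the Pearson correlation is measuring, here is a minimal pure-Python sketch. The function name and the four data points are made up for illustration and are not taken from the data set used in this article.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(var_x * var_y)


# Made-up example: a run-prevention metric where lower is better,
# against RE24/PA where higher is better.
metric = [2.5, 3.5, 4.5, 5.5]
re24_per_pa = [0.04, 0.01, -0.01, -0.04]

r = pearson_r(metric, re24_per_pa)
print(r < 0)  # True: the two move in opposite directions
```

A value close to -1.0 means that pitchers with a low (good) metric almost always have a high (good) RE24/PA, which is why a more negative correlation is better in the tables.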

For the predictive table, pitchers had to fit in the same category for two consecutive years, i.e. a pitcher who threw 200 IP as a starter two seasons in a row would be part of the data set for both “>100 IP, >80% GS” and “>30 IP”. From 2002 to 2018 there were 1,369 back-to-back “>100 IP, >80% GS” seasons, 2,216 “>30 IP, <50% GS” seasons and 4,859 “>30 IP” seasons.
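The pairing of back-to-back seasons can be sketched as follows. This is an assumption-laden illustration: the record layout (dictionaries with `pitcher`, `year`, `ip`, `g`, `gs` keys), the function names, and the reading of "% GS" as games started divided by games pitched are all hypothetical, and the sample rows are invented.

```python
def consecutive_season_pairs(seasons, qualifies):
    """Return (year N, year N+1) pairs where the pitcher qualifies in both."""
    qualified = {(s["pitcher"], s["year"]): s for s in seasons if qualifies(s)}
    pairs = []
    for (pitcher, year), season in qualified.items():
        following = qualified.get((pitcher, year + 1))
        if following is not None:
            pairs.append((season, following))
    return pairs


# Category filters; "% GS" is assumed to mean games started / games pitched.
def is_starter(s):
    return s["ip"] > 100 and s["gs"] / s["g"] > 0.80


def is_reliever(s):
    return s["ip"] > 30 and s["gs"] / s["g"] < 0.50


def is_any_30ip(s):
    return s["ip"] > 30


# Invented sample rows, for illustration only.
seasons = [
    {"pitcher": "A", "year": 2017, "ip": 200, "g": 32, "gs": 32},
    {"pitcher": "A", "year": 2018, "ip": 210, "g": 33, "gs": 33},
    {"pitcher": "B", "year": 2017, "ip": 60, "g": 60, "gs": 0},
    {"pitcher": "B", "year": 2018, "ip": 20, "g": 20, "gs": 0},
    {"pitcher": "C", "year": 2016, "ip": 65, "g": 65, "gs": 0},
    {"pitcher": "C", "year": 2017, "ip": 70, "g": 68, "gs": 0},
]

starter_pairs = consecutive_season_pairs(seasons, is_starter)
reliever_pairs = consecutive_season_pairs(seasons, is_reliever)
all_pairs = consecutive_season_pairs(seasons, is_any_30ip)
print(len(starter_pairs), len(reliever_pairs), len(all_pairs))  # 1 1 2
```

Pitcher B drops out of every predictive sample because his second season falls below 30 IP, while pitcher A counts in both the starter and the combined sample.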

Not surprisingly, ERA and its park- and league-adjusted variant (ERA-) provided the tightest correlation to RE24/PA. In the predictive table the more advanced metrics come out on top, with xFIP-, the park-, league- and home-run-adjusted version of FIP, performing best. BABIP has a reasonable descriptive correlation with the current season but next to no predictive correlation, which shows that what McCracken found back in 1999 is still relevant today.

Making my own stat

At the start of the off season I challenged myself to create a better statistic than these. One of the most difficult parts of any statistical investigation is defining what question it is you are trying to answer. This is particularly important when it comes to pitcher performance metrics. What is it, exactly, that I want to know?

  • Do I care mostly about a pitcher’s historic performance?
  • Am I more concerned with how many runs the pitcher will allow going forward?
  • Or do I want to know how truly talented the pitcher is, divorced from his results this year or next?

I care most about the second and third questions, as I want to create a predictive statistic of overall pitcher performance.

I am going to work from McCracken’s base assumption that a pitcher has no influence over balls hit in play. I may investigate that further in the future, to see if it still holds true in the light of newer data publicly available via MLB Statcast. I will also start from the FIP and xFIP model but question the reasoning behind each of its parts.

Here are the formulas for FIP and xFIP.
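As a sketch, the commonly published forms of FIP and xFIP can be written in Python as below. The function and parameter names are illustrative; the FIP constant and the league home-run-per-fly-ball rate change every season and have to be supplied, and the values in the example are placeholders rather than real league figures. Innings pitched should be passed as true decimal innings (200⅓ as 200.333, not the box-score notation 200.1).

```python
def fip(hr, bb, hbp, so, ip, fip_constant):
    """Fielding Independent Pitching on the ERA scale."""
    return (13 * hr + 3 * (bb + hbp) - 2 * so) / ip + fip_constant


def xfip(fly_balls, bb, hbp, so, ip, league_hr_per_fb, fip_constant):
    """FIP with actual home runs replaced by fly balls times the league HR/FB rate."""
    expected_hr = fly_balls * league_hr_per_fb
    return (13 * expected_hr + 3 * (bb + hbp) - 2 * so) / ip + fip_constant


# Placeholder season line and placeholder league values.
example_fip = fip(hr=20, bb=50, hbp=5, so=200, ip=200, fip_constant=3.10)
example_xfip = xfip(fly_balls=200, bb=50, hbp=5, so=200, ip=200,
                    league_hr_per_fb=0.10, fip_constant=3.10)
print(round(example_fip, 3))   # 3.225
print(round(example_xfip, 3))  # 3.225
```

The two results match here only because the placeholder pitcher allowed home runs at exactly the assumed league rate (20 on 200 fly balls); a pitcher with more or fewer home runs per fly ball would see xFIP pull his FIP back toward the league rate.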

Run Value Constants for HR, BB, HBP, SO

The base FIP formula contains run value constants for a home run (13), walk (3), hit by pitch (3) and strikeout (-2). These convert a pitcher’s performance into a standard run environment. The values are approximately the run value of each event multiplied by 9, which brings the end result in line with ERA, which is measured over 9 innings. But these values don’t match the current run values for those events. Using the yearly data collated by MLB Statcast and Retrosheet, I calculated the run expectancy matrix for each year and, from that, the run value of every outcome of an at bat; the last 3 years are shown below as an example.
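The usual way to turn a run expectancy matrix into event run values is to average, for each event type, the change in run expectancy plus any runs that scored on the play. The sketch below shows that idea only; the play records, state labels, expectancy numbers and function name are all invented for illustration and are not the code or data behind the figures in this article.

```python
from collections import defaultdict

# Placeholder run expectancy by (outs, runners) state, not real values.
RUN_EXPECTANCY = {
    (0, "---"): 0.50, (0, "1--"): 0.90,
    (1, "---"): 0.27, (1, "1--"): 0.52,
    (2, "1--"): 0.22,
}


def event_run_values(plays, run_expectancy):
    """Average of (RE after - RE before + runs scored) for each event type."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for play in plays:
        change = (run_expectancy[play["after"]]
                  - run_expectancy[play["before"]]
                  + play["runs"])
        totals[play["event"]] += change
        counts[play["event"]] += 1
    return {event: totals[event] / counts[event] for event in totals}


# Invented plays, for illustration only.
plays = [
    {"event": "BB", "before": (0, "---"), "after": (0, "1--"), "runs": 0},
    {"event": "BB", "before": (1, "---"), "after": (1, "1--"), "runs": 0},
    {"event": "SO", "before": (0, "---"), "after": (1, "---"), "runs": 0},
    {"event": "SO", "before": (1, "1--"), "after": (2, "1--"), "runs": 0},
    {"event": "HR", "before": (1, "1--"), "after": (1, "---"), "runs": 2},
]

values = event_run_values(plays, RUN_EXPECTANCY)
print({event: round(value, 3) for event, value in values.items()})
# {'BB': 0.325, 'SO': -0.265, 'HR': 1.75}
```

With a full season of play-by-play data the same averaging gives one run value per event per year, which is why the values drift from season to season as the run environment changes.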

I am building a stat that I want to be useful for predicting performance and showing a pitcher’s true performance, so I want to use the actual run values for these events and not estimates. These values change each year and this new stat will change with them. That brings it more in line with wOBA (weighted on-base average), where the weightings for the different types of hits change each year.

Using the same time frame (2002-2018) and method (Pearson Correlation) as detailed in the section above I get the following correlations for my updated version of FIP, RFIP (Run Value FIP).

This change has had minimal impact on descriptive performance of the stat but has improved the predictive performance (compared to FIP and xFIP) by 2 base points.

Infield Fly Balls

Infield flies are, for all practical purposes, the same as a strikeout. They are basically an automatic out, and runners do not advance. But perhaps most importantly, we can state with a high level of confidence that the abilities of the defenders have nothing to do with the outcome of the play: the MLB infield fly conversion rate was 99.4% in 2018. Based on those characteristics, an argument could be made that infield flies are essentially another fielding independent outcome.

There is no 100% fielding independent outcome as catchers do have some framing ability which can influence BB and K rates, and occasionally a HR is either robbed or knocked over the wall by an outfielder. BB/K/HRs are mostly independent of the pitcher’s teammates, which is why they are the three variables in FIP. So if an infield fly has the same logistical outcome as a strikeout, should we just give a pitcher credit for IFFBs in the same way we give them credit for Ks?

For predictive purposes, you definitely want to make a distinction between the two, as getting strikeouts is far more consistent from year to year than generating popups. A few years back Bill Petti ran the year-to-year correlations for basically any measure you can think of, and he got .82 for K% and .37 for IFFB%. There is no question that strikeout rate is more predictive than infield fly rate. That said, year-to-year home run rate had a correlation of .42, which is much lower than that of either strikeout rate or walk rate, and yet it is included in FIP.

For all of the reasons above I am going to include infield flies in my version of FIP. FanGraphs does, in fact, take infield flies into account in the FIP it uses to calculate fWAR, but does not use them in its main FIP calculation. New formula below.

This change has had minimal impact on descriptive performance of the stat and has further improved the predictive performance of the expected version (xRFIP).

Innings Pitched

While the creators of DIPS, FIP and similar statistics all suggest they are “defense independent”, others have pointed out that their formulas involve innings pitched (IP). Innings pitched is a measure of how many outs were made while a pitcher was pitching, and that depends on the fielders. A truer independent stat would look at total batters faced instead of innings pitched. This change also takes us off the same scale as ERA, so I have removed both the constant that is used to bring FIP back in line with ERA and the factor of 9 included in the weightings.
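Putting the three changes together (yearly run values, infield flies, and batters faced as the denominator), one reading of the resulting calculation is sketched below. This is an assumption about the shape of the formula based on the description in the text, not a confirmed definition: the function name, the dictionary keys, the run values and the season line are all placeholders, and treating an infield fly with the same run value as a strikeout is a simplification made here for illustration.

```python
# Placeholder run values for one season (runs per event), for illustration.
# Giving IFFB the same value as SO is an assumption of this sketch.
RUN_VALUES = {"HR": 1.40, "BB": 0.30, "HBP": 0.33, "SO": -0.27, "IFFB": -0.27}


def rfip_sketch(hr, bb, hbp, so, iffb, batters_faced, run_values):
    """Run value of fielding-independent events per batter faced.

    No factor of 9 and no ERA-scale constant: the result is in runs per
    batter, where negative means run expectancy was removed.
    """
    total_runs = (run_values["HR"] * hr
                  + run_values["BB"] * bb
                  + run_values["HBP"] * hbp
                  + run_values["SO"] * so
                  + run_values["IFFB"] * iffb)
    return total_runs / batters_faced


# Invented season line.
example = rfip_sketch(hr=20, bb=50, hbp=5, so=200, iffb=20,
                      batters_faced=800, run_values=RUN_VALUES)
print(round(example, 4))  # -0.0184
```

An expected version in the spirit of xFIP would swap actual home runs for fly balls multiplied by a league home-run-per-fly-ball rate before applying the same weights.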

We now have a metric which tells us how much run expectancy a pitcher adds or removes in an individual at bat, based on the outcomes which are in his control. For example, in 2018 Jacob deGrom’s RFIP was -0.0585. That means that for each batter he faced he reduced the expected runs scored by that much, which equates to one run fewer for roughly every 17 batters faced. For context, in 2018 the average runs per batter faced was 0.1168. That means that in 2018, through the things he could control, deGrom reduced the run expectancy of an at bat by about 50% compared to the average MLB pitcher.
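The arithmetic behind those two statements can be checked directly from the two figures quoted above. The helper names are just for this example.

```python
def batters_per_run(rfip_value):
    """How many batters it takes for the per-batter value to add up to one run."""
    return 1 / abs(rfip_value)


def share_of_league_rate(rfip_value, league_runs_per_batter):
    """Size of the per-batter value relative to the league runs per batter."""
    return abs(rfip_value) / league_runs_per_batter


DEGROM_2018_RFIP = -0.0585            # quoted in the text
LEAGUE_RUNS_PER_BATTER_2018 = 0.1168  # quoted in the text

batters = batters_per_run(DEGROM_2018_RFIP)
share = share_of_league_rate(DEGROM_2018_RFIP, LEAGUE_RUNS_PER_BATTER_2018)

print(round(batters, 1))  # 17.1
print(round(share, 3))    # 0.501
```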


These three changes were an attempt to create a better predictive statistic, which they did, and they also improved the descriptive side. RFIP has the second-best descriptive correlation, behind FIP-, of any of the advanced statistics that attempt to account for the influence of defenders on a pitcher’s performance. It is the best of those that have not been adjusted for park factors, and all the other stats improve when park factors are taken into account. xRFIP has the best predictive correlation of any of the stats, even though it has yet to be adjusted for park factors.

So those were the statistical tests. Does the output match the eye test as well? Let’s look at the top and bottom pitchers by RFIP over the 2002-2018 period, for starters (>150 IP) and relievers (30-100 IP). Remember that a negative score is good and a positive score is bad.

The best starter seasons since 2002 belong to Chris Sale and Jacob deGrom in 2018, with a couple of seasons from Clayton Kershaw and Pedro Martinez also in the top 10. These are the names and seasons I was expecting. Chris Sale’s performance last season did initially surprise me, but a quick look showed his K rate was 38.4%, the best season by any pitcher ever (minimum 150 IP), which made it much easier to understand. The same goes for the relievers: multiple seasons from Kenley Jansen and Craig Kimbrel, and the monster Cy Young-winning season from Eric Gagne back in 2003, are what I expected.

Next Developments

Generate park factors to enable fairer comparison, and create RFIP-, which converts RFIP to a 100-based scale similar to ERA-, FIP- and xFIP-.

Investigate whether multiplying RFIP by a factor would make its overall value easier to show to the average fan, e.g. multiplying by 24 (4 batters per inning for 6 innings) would show how many runs better or worse than the average pitcher a starter might be over 6 innings.

Create a season-long cumulative version of the stat, useful when considering a pitcher’s overall impact on a season, and build a WAR based on it.

Investigate Statcast expected wOBA to determine whether there is a better method of estimating expected home runs than adjusting to league average.

Use Statcast data to take a look at the principle that pitchers have no control over BABIP and see if there are any other factors to be included.

Thanks to

MLB Statcast, FanGraphs, Baseball Prospectus & Retrosheet for having the MLB data publicly available.

Bill Petti, Max Marchi & Jim Albert for sharing code in R which I have used to extract and manipulate the MLB data.
