I have been making something of a ruckus recently about where I feel the state of current defensive analysis is. I have been long on listing problems, and short on proposing solutions.
Well, allow me to make amends there. I don’t pretend to have the problem solved. I’m not sure any of us will ever see it truly solved. But I think—or at least, hope—this can point us in the right direction.
The Two Problems
We can really subdivide our problems neatly into two: one is the issue of bias, the other of uncertainty.
Let us start with the latter. What we are trying to do here is measure, and then compare, two things:

How many plays a player has made, and
 How many plays we think an average player at that position would have made, given the same chances.
The first, we all think we can measure directly—given the record, we can readily come up with a total. We may have some disagreement over what to count, but if we agree on what we should be counting, we can come to an agreement. The second is an estimate and, as such, is subject to error. Over time, the error in our estimate should come down (as a proportion of our estimate, that is).
Now, what modern defensive metrics (ones based on observational data, like batted-ball types, hit locations, etc.) are trying to do is cut down on the effects of measurement error on our estimate of plays made by an average player.
By attempting to reduce measurement error, those metrics have introduced the potential for bias into their estimates, however. The two key ones are:

Park/scorer biases. To the extent that a park influences the scoring of batted balls, that has an impact on our estimates. It could have to do with the identity of the scorer in different parks. It could relate to the vantage point of the scorers in each park. Regardless of the source, it distorts the estimates of a fielder’s chances.
 Range biases. To the extent that a fielder’s range (or the range of his teammates) influences the scoring of batted balls (either by type or location), that also distorts the picture of a fielder’s abilities. The most obvious possible effect is that a good fielder will raise the number of estimated chances he gets by getting to more balls (or at least getting closer to them)—and vice versa for a poor fielder. This would both artificially compress the observed spread of fielding performance, and systematically underestimate fielders with good range (and overestimate fielders with poor range).
So what we have is some presumption of increased accuracy, in exchange for additional bias. What we do not know, as of yet, is how much accuracy we are gaining, at the expense of how much bias. And I think that’s an important thing to know—if your gain in accuracy is less than the amount of bias you’re introducing, you haven’t actually gotten better, you’ve gotten worse.
And we know how to solve the accuracy problem—get more data! Over a long enough timeline, the estimates will improve on their own. Adding more data doesn’t make bias any better, though—in fact, over time, the effect of bias becomes more powerful.
Just the Facts, Ma’am
So let’s take a different approach. Let’s try to design a fielding metric with no bias—or, at least, attempt to minimize the effect of bias. What we can do is:

Restrict ourselves to looking at only factual data—data we can validate objectively. That means no batted-ball data, no hit location data, etc.

For estimating the number of plays an average player at that position would have made, ignore data about the outcomes of batted balls whenever possible.
 Err on the side of caution when deciding whether or not to adjust—in other words, make as few adjustments as possible. We can allow the data to be expressive by getting the metric out of its way whenever we can.
Over time, the potential inaccuracies of our data should wash out, and because we think we are minimizing our potential for bias, over a long period of time we should be able to be confident of our measure of a fielder’s ability.
Figuring Plays Made
Looking at play-by-play data available from Retrosheet, we can start off by counting the plays a player actually made on the field.
Ideally, what we would do is separate the fielding of balls hit on the ground (OK, OK—ground balls) from balls hit in the air (popups, liners, and fly balls). But we’ve already committed to not using that sort of data. Is there anything we can do, simply looking at facts, to determine what sort of plays a player made?
For outfielders, it’s a simple matter. We just count an outfielder’s unassisted putouts as his plays made. (His assists we can examine separately at a later date.)
For an infielder, how are we to determine whether he caught the ball on the fly or fielded it on the ground, without resorting to batted-ball categorizations? It’s simple (if a bit messy for first basemen and pitchers):

An assist by the infielder who first fielded the ball counts as a play made on a “ground ball.” (This is not always the case—a fielder who deflects a ball that is then fielded by another player for an out is credited with an assist. But this is rare enough that over time we can ignore it, and in the short run we can do little about it.)

An unassisted putout of a baserunner, other than the batter, by an infielder is a play made on a “ground ball.” For catchers, second basemen, third basemen, and shortstops, an unassisted putout of the batter is a play made on an “air ball.” There are rare occasions, mostly for second basemen, where this isn’t the case, but again over time we shouldn’t have to worry about this.
 For first basemen, an unassisted putout of the batter is a “ground-ball” out when it was either on a bunt attempt or hit by a left-handed batter. For pitchers, an unassisted putout of the batter is a “ground-ball” out on a bunt attempt only. All others are classified as “air-ball” outs. This is probably the least-confident part of the system, but for now we’ll leave it as it is.
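The rules above can be collapsed into a single decision function. This is an illustrative Python sketch, not the author's actual code; the argument names are hypothetical stand-ins for fields a real implementation would parse out of Retrosheet event records.

```python
INFIELD_POSITIONS = {"P", "C", "1B", "2B", "3B", "SS"}

def classify_infield_play(position, first_fielder_assist,
                          putout_of_batter, bunt, batter_hand):
    """Classify an infielder's play as made on a 'ground' or 'air' ball,
    using only factual play-by-play information.

    position             -- fielder's position code ("P", "C", "1B", ...)
    first_fielder_assist -- True if this fielder first touched the ball
                            and was credited with an assist
    putout_of_batter     -- True if an unassisted putout retired the
                            batter (False: it retired another runner)
    bunt                 -- True if the play came on a bunt attempt
    batter_hand          -- "L" or "R"
    """
    if position not in INFIELD_POSITIONS:
        raise ValueError("outfielders: count unassisted putouts directly")
    # An assist by the infielder who first fielded the ball -> ground ball.
    if first_fielder_assist:
        return "ground"
    # An unassisted putout of a runner other than the batter -> ground ball.
    if not putout_of_batter:
        return "ground"
    # Unassisted putouts of the batter by C/2B/3B/SS -> air ball.
    if position in {"C", "2B", "3B", "SS"}:
        return "air"
    # 1B: ground ball only on a bunt or with a left-handed batter up.
    if position == "1B":
        return "ground" if (bunt or batter_hand == "L") else "air"
    # P: ground ball on a bunt attempt only.
    return "ground" if bunt else "air"
```

So a shortstop's assist counts as a ground-ball play, while a first baseman's unassisted putout of a right-handed batter (no bunt) counts as an air-ball play.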
So this gives us, at the team level, outs on the ground versus outs in the air. And what we see is a strong negative relationship between ground plays and air plays, with a correlation of -0.77. So when a team makes a lot of ground-ball plays, the most likely explanation is that they saw a lot of ground balls.
So, let’s adjust for that. What we can do is look at how many plays a team made in total, compared to the average team, and then look at how many ground-ball plays a team made compared to how many air-ball plays they made. A team with superior ground-ball fielders will not only have more ground-ball plays but likely more plays made overall.
So for a team that’s above-average on making ground-ball plays but below-average in making total plays, we “shift” the responsibility toward the ground-ball plays (in other words, inflate the number of ground-ball plays we think the team should have made, but deflate the number of air-ball plays we think the team should have made), while keeping the total number of plays we think the team should have made constant.
This is, for lack of a better term, our “ground-ball rate” adjustment. It’s a bit of a misnomer, because we ignore any scorer data on the number of ground balls a defense saw. And it is possible that including that scorer data could improve the process here as well. But for now, let’s err on the side of excluding that data.
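The article doesn't give an exact formula for the shift, so here is one hedged Python sketch of the idea: hold the team's expected total constant and move the expected ground/air split partway from the league-average split toward the team's observed split. The halfway weighting is an arbitrary illustrative choice, not something stated in the text.

```python
def adjust_expected_split(expected_total, league_gb_share,
                          team_gb_plays, team_air_plays, weight=0.5):
    """Shift expected plays between ground and air while holding the
    expected total fixed. `weight` controls how far we move from the
    league-average ground-ball share toward the team's observed share;
    0.5 is purely illustrative."""
    observed_share = team_gb_plays / (team_gb_plays + team_air_plays)
    adjusted_share = (1 - weight) * league_gb_share + weight * observed_share
    expected_gb = expected_total * adjusted_share
    expected_air = expected_total - expected_gb  # total stays constant
    return expected_gb, expected_air
```

Under this sketch, a team that made 550 ground-ball plays and 450 air-ball plays, against a 50 percent league ground-ball share, would have its 1,000 expected plays re-split as 525 ground and 475 air.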
Breaking Down the Fielders
What we do now is apply the process from above to individual fielders. As we did for teams, we break down outfielder plays, infielder plays on the ground, and infielder plays in the air. That tells us how many plays each fielder made.
Then we look at each batted ball and estimate the likelihood that each fielder makes a play on it. The only data we are considering right now is the handedness of the batter who hit the ball. (For first basemen, we’re also considering whether or not they had to hold a runner at first.) We aren’t considering who eventually fielded the ball, whether or not the ball was an out or a hit, etc. Why? Because the outcome of the batted ball is a potential source of bias. By giving up some accuracy in the short run, we allow truly great fielders to look truly great—otherwise, we artificially compact the spread of the impact of top fielders over time.
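With so few inputs, the chance estimate reduces to summing league-average rates keyed by batter handedness. A minimal sketch, using the 2009 shortstop rates quoted later in the article (roughly 12 percent versus right-handed batters, 6 percent versus left-handed) as placeholder values; the dictionary layout and function are my own illustration:

```python
# League rates at which a fielder at a given position makes a play on a
# ball in play, keyed by batter handedness. The SS values are the rough
# 2009 figures quoted in the article; other positions would be filled in
# the same way from play-by-play data.
LEAGUE_PLAY_RATE = {
    ("SS", "R"): 0.12,
    ("SS", "L"): 0.06,
}

def expected_plays(position, batter_hands):
    """Expected plays for a fielder: the sum, over every ball in play he
    was on the field for, of the league-average chance that a player at
    his position makes the play."""
    return sum(LEAGUE_PLAY_RATE[(position, hand)] for hand in batter_hands)
```

Over 100 balls in play from each side of the plate, a shortstop would be expected to make about 18 plays under these placeholder rates.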
So we have our measure of plays made, and our estimate of chances. We can leave off there, at least for infielders (Outfielders will require a bit more work, I’m afraid—and that will have to wait for another day). But we discussed uncertainty—can we at least try and measure it?
Ignore, at least for now, uncertainty about actual plays made—for first basemen and pitchers especially we do have some, but little enough that we can afford to set it aside for a while. But for our estimate of how many plays a fielder should have made, we know there is a margin of error. What we can do is calculate the uncertainty of our estimate per ball in play, and use that to figure our total uncertainty for any given player.
What I did is figure the root mean square error between the average number of plays made and the actual plays made, on an individual basis.
For example: In 2009, with a right-handed hitter batting, a shortstop will make a play on a ball in play roughly 12 percent of the time. (For a left-handed batter, a shortstop will make a play on a ball in play roughly six percent of the time.) But the margin of error around our estimate of how often a shortstop will make any single play is about 30 percent. (Notably the error is asymmetrical—obviously there is no chance of a shortstop making a negative play, even if in exasperation I may have accused Alex Gonzalez of it during the ’03 playoffs.)
To attribute that margin of error over a number of chances, we multiply the per-BIP error by the square root of the number of chances:

MOE = RMSE per BIP × √(chances)

What’s interesting about this is that the margin of error per BIP drops, the more BIP we observe—it shrinks with the square root of the sample. So, after 100 BIP, the margin of error for any one play drops all the way to three percent.
(That’s why, to me, uncertainty is preferable to bias—with enough statistical power, we can plow through uncertainty readily. Without an accounting of what the bias is, we’re essentially powerless against it.)
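A short sketch of that scaling, assuming the per-BIP errors are independent and so add in quadrature—my assumption, but one that reproduces the figures quoted above (a 30 percent per-BIP error becomes about 3 percent per play after 100 BIP):

```python
import math

def total_margin_of_error(rmse_per_bip, chances):
    """Margin of error on total plays made: independent per-BIP errors
    add in quadrature, so the total scales with sqrt(N)."""
    return rmse_per_bip * math.sqrt(chances)

def per_play_error(rmse_per_bip, chances):
    """Margin of error on the per-play rate shrinks like 1/sqrt(N)."""
    return rmse_per_bip / math.sqrt(chances)
```

With a 0.30 per-BIP error, `per_play_error(0.30, 100)` gives 0.03, and over roughly 4,500 chances the total margin of error works out to about 20 plays—in line with the MOE column of the table below.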
Some Examples
After taking you all this way, surely I wouldn’t leave you without something to look at, would I? Here are the top 10 seasons by a shortstop since 1950, according to our new fielding metric:
Name              Year  Chances  Plays  AvgPlays   +/-   MOE  +/- R  MOE_R
Guillen, Ozzie    1988     4480    515     442.3  72.7  19.5   55.9   15.0
Ryan, Brendan     2009     2507    325     259.0  66.0  14.5   53.7   11.8
Fermin, Felix     1989     4217    480     411.0  69.0  19.0   53.3   14.7
Belanger, Mark    1975     3996    467     403.9  63.1  18.9   49.2   14.7
Tulowitzki, Troy  2007     4294    490     432.0  58.0  19.1   48.4   15.9
Sanchez, Rey      1999     3666    391     336.9  54.1  17.5   46.4   15.1
Thon, Dickie      1983     4271    481     423.0  58.0  19.3   45.5   15.1
Smith, Ozzie      1980     4618    570     512.0  58.0  20.2   45.2   15.7
Martinez, Felix   2000     2818    318     265.8  52.2  15.4   45.2   13.4
Sanchez, Rey      2000     3785    394     342.8  51.2  17.9   44.3   15.5
I’ve provided a tentative conversion of plays to runs, although it still needs a little work. Note, for instance—Ozzie Guillen is being credited with about 73 plays above the average shortstop for 1988. That’s pretty impressive. It’s also pretty imprecise, with a margin of error around 20 plays.
What’s important to note is that the error is not symmetrical—we think there’s practically no chance that Guillen really made over 90 plays above average, for instance.
So, on a single-season level, we see some quizzical results. (Brendan Ryan? Really?) The important thing to remember is—we aren’t very confident in those results! Our confidence increases as we move to the career level, though:
Name              +/- R  MOE_R
Smith, Ozzie      322.1   61.4
Belanger, Mark    237.2   50.1
Sanchez, Rey      217.7   37.6
Russell, Bill     190.6   49.7
Valentin, Jose    177.4   43.3
Guillen, Ozzie    168.3   52.3
Templeton, Garry  150.3   53.2
Groat, Dick       144.0   49.0
Maxvill, Dal      139.3   37.5
Gagne, Greg       130.4   49.8
That isn’t to say there’s no uncertainty. We can say, given the statistical evidence we have at hand, that there’s a small (but real) chance that Mark Belanger saved more runs compared to the average shortstop than Ozzie Smith did. And after that, well, nobody else is in the running.
And nobody has really disputed how good Ozzie Smith was—but other metrics haven’t fully captured the magnitude of it. Our own FRAA, for instance, gives Smith 266 runs above average. Sean Smith’s TotalZone says 239 runs above average. In reality, Ozzie was better than that—a lot better.
What’s Next
Well, obviously I have to produce outfielder measurements as well. And there are probably still some tweaks to be made to this system that could improve it.
But past that—these values cannot simply be used in place of FRAA to calculate WARP the way we’re doing it now. We have this measure of uncertainty. We can similarly compute uncertainty for our offensive metrics (and it’s quite a bit smaller on a per-play basis). We cannot, in coming up with a single value to express a player’s season, add defense to offense as though we are equally certain of both.
So we’re going to be revising WARP to account for this uncertainty. Along the way, we’ll be adding some other enhancements to WARP as well. And we’ll be looking at pitching—after all, a lot of what we’ve always thought was pitching is fielding, isn’t it? And so any uncertainty we’ve had in measuring fielding spills over into pitching as well.
So, consider this a beginning, not an end.
Notes and Asides
I should give a nod to Bill James’ Fielding Win Shares, which served as an inspiration for some of my efforts here. I should also give a nod to the work Smith has done on TotalZone, which was also something I spent a lot of time thinking about.
For some discussion on the spread of defense, I cannot recommend these enough:

How lucky has Scott Rolen been with his opportunities to field?

Best, worst WOWY since 1993, through age 34
 How many runs is a good fielding SS worth?
This is a free article. If you enjoyed it, consider subscribing to Baseball Prospectus. Subscriptions support ongoing public baseball research and analysis in an increasingly proprietary environment.
A commonly referenced rule of thumb for the other defensive metrics out there is that three seasons’ worth of data are required before a person can start to draw some meaningful conclusions. As you stated in the article, estimates should improve over time using this approach. Would you consider three years to be a reasonable timeframe to start drawing meaningful conclusions?
So if you multiply your balls in play by three (i.e., three seasons instead of one), your percentage error will drop by 1.7x. (The square root of 3 is about 1.7.) If you multiply balls in play by 10, your percentage error will drop by about 3x.
I'm trying to digest this and see if there is any way to squeeze any more adjustments into it without adding too much bias.
Particularly, I'm wondering if there's a way to adjust for pitcher tendencies without looking to the results of the plays made. The only ones I can think of where we have historical data are pitcher handedness and the ground-ball/air-ball split (as you defined it) by pitcher.
If the ~70 run difference for Ozzie Smith is due to range bias, and 1 play = 0.8 runs, and Ozzie played about the equivalent of 17 seasons, then 70 / 0.8 / 17 = about 5 runs per season due to range bias.
If we apply the same method to a large group of players, we ought to be able to estimate the range bias.
Then, I wonder (thinking out loud here): if the range bias for large samples of players is known, could you then turn around and estimate the park bias?
Normally, when sabermetrics parameterizes a model or tests a hypothesis, these questions are mostly irrelevant. But when the argument is based on the idea that the estimator is consistent or unbiased, they become central. Perhaps your methodology allows the bias to disappear asymptotically, but "asymptotically" can mean "after 10,000 seasons of a player with unchanging defensive ability," not "fifteen seasons during which the player ages and/or learns." Intuitively, I would prefer to introduce some measure of bias if, in the pre-asymptotic real world, it on average gives us more information than it takes away.
Do you have evidence that the estimate converges fast enough for individual players to outweigh the additional information current methodologies provide?
They would have to be external factors, of course; we don't suspect that a fielder's very presence on the field changes the distribution of batted balls (and if we do, we probably want to measure that as part of a fielder's skill). So what persistent effects are there that could prevent the estimate from converging over time?
I would be more comfortable if you could say what the pre-asymptotic distribution of the errors might look like, because that would give a much more genuine basis for evaluating its accuracy against what we have now. The error bar that you suggested sounds like a good rule of thumb after having established the errors are (for instance) normally distributed, but by itself I don't think it provides a basis for evaluating it for individual players.
I think the reply is that the system is worse without that adjustment, and I would intuitively agree. But it seems that it ought to be noted that this is a possible problem with the system. It would also seem to set up the next question: can we make the system better by using simple batted ball types, or do you feel there is too much systematic bias in even that data? What's the tradeoff?
And yes, there's potential for batted ball data, used in a coarse sense, to improve upon what I have right now. What I want to do first is finish the system the way it is for outfielders, and then examine that issue more closely.
Dave, you might enjoy the thread that Tango started in response to Colin's article at the Book blog where similar questions are being discussed:
http://www.insidethebook.com/ee/index.php/site/comments/reducing_bias_in_fielding_metrics/#comments
FWIW, I would advocate using two standard deviations. 68% is okay, but not compelling, and using two standard deviations, or 95%, is more understandable and intuitive to the average reader.
Obviously, down the line, you can make statements of how likely it is that Ozzie *isn't* the best fielding shortstop in your dataset.