Athlete Rankings: What’s In Your Bucket?
You might have noticed the Athlete Rankings tab at the top of the page. If you click on that tab, rather than navigating through the mouseover menus, you’ll see a brief description of what the rankings are, but I thought I’d put a quick note on them here.
First I need to explain why the heck I thought the world needed my own special athlete ranking. The answer is actually not in the ranking itself; it’s in the error estimate. If you go and look at the graphs, you’ll see that I’ve included an error bar for each athlete.
If you check out parts of the FIS website you’ll find that they compile athlete ranking lists several times per year. The basic methodology is to average a skier’s five best races (lower FIS points are better) over the previous year.1 I’ve always been bothered by the fact that these averages are reported to two decimal places with no error estimate.
It leads a person like me to wonder whether a skier with an average of 4.58 FIS points is really “faster” than a skier with 5.28 FIS points. So this is why I was interested in creating my own FIS point-like rankings. I simplified things somewhat, looking at races within a season, rather than over the previous year, and I simply omit skiers with fewer than 5 races in a season.
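A minimal Python sketch of this simplified ranking might look like the following. The function name, skier names and scores are all invented for illustration; this is not the code behind the actual rankings, just one way to express "average the best five races in a season, and skip anyone with fewer than five".

```python
def best_five_average(points):
    """Average of the five lowest FIS point results in a season.

    Returns None when the skier has fewer than five races, mirroring
    the rule of omitting those skiers. (Hypothetical helper.)
    """
    if len(points) < 5:
        return None
    best = sorted(points)[:5]  # lower FIS points are better
    return sum(best) / 5


# Made-up season results, purely for illustration.
season = {
    "Skier A": [4.1, 7.3, 2.9, 5.5, 3.1, 12.0],
    "Skier B": [6.0, 5.2, 4.9],  # only three races, so omitted
}

rankings = {}
for name, points in season.items():
    average = best_five_average(points)
    if average is not None:
        rankings[name] = average

# Print skiers from best (lowest average) to worst.
for name, average in sorted(rankings.items(), key=lambda item: item[1]):
    print(f"{name}: {average:.2f}")
```

Note that this only produces the point estimate for each skier. It says nothing yet about how much that number could plausibly have varied.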
The point of these rankings is not that they do a better job of ordering skiers, but that they give you a sense of what magnitudes of differences in FIS points are meaningful. My method for calculating the error bars is somewhat involved, but I had a rationale for making it that way.
The notion of variability in a skier’s races over a season is a little subtle. Typically, statistics focuses on what’s called sampling variability2, which arises when we collect a sample from a population. If we could record data on every individual of a population, that would be a census, and there would be no variability to measure and hence no need for statistics. But if we only have access to a sample (hopefully random), then we need to measure the variation that arises from the sampling process. Namely, we might have gotten slightly different data.
When it comes to the races a skier does in a season, one could argue that we’re actually doing a census. A skier did 10 races and we have access to them all. This doesn’t seem like collecting a sample at all. So where does the variability come from?
I prefer to take the following view. It requires a rather silly extended metaphor to explain, so bear with me. Imagine each athlete has with them at all times an enormous bucket of chips (e.g. poker chips, or something) that represent “race efforts”. These race efforts vary quite a bit in quality. Skiers can influence the general content of their bucket via training and other forms of preparation, but there will always be some level of variability in the quality of chips available in the bucket.
When they actually do a race, they reach in and grab a chip at random. If they’ve done their preparation well, their bucket will be filled with tons and tons of superb race efforts, and they are likely (but not guaranteed!) to do well. If they haven’t prepared well, their bucket is likely to contain far too many poor race effort chips, and things might go badly.
I consider the collection of races a skier actually does to be a sample (again, hopefully random) from their bucket of race effort chips. So that’s where the variability comes from, in my view, and we need to measure it somehow.
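To make the metaphor concrete, here is a toy simulation in Python. Everything in it is an assumption made for illustration: the bucket size, the bell-shaped spread of chip quality, and the `race_season` helper are all invented, and real race efforts need not look like this. The only point is that two seasons drawn from the very same bucket will generally give different averages.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical bucket: 10,000 race-effort chips, each worth some number
# of FIS points (lower is better). The normal distribution is an
# arbitrary choice; negative values are clipped to zero.
bucket = [max(0.0, rng.gauss(30.0, 10.0)) for _ in range(10_000)]


def race_season(bucket, n_races, rng):
    """Grab n_races chips at random from the bucket (one per race)."""
    return rng.sample(bucket, n_races)


season_a = race_season(bucket, 10, rng)
season_b = race_season(bucket, 10, rng)

mean_a = sum(season_a) / len(season_a)
mean_b = sum(season_b) / len(season_b)

# Same skier, same bucket, two different-looking seasons.
print(f"Season A average: {mean_a:.2f}")
print(f"Season B average: {mean_b:.2f}")
```

In real life we never get to see the bucket, only the one season that was actually raced, which is why we need some other way of estimating the variability.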
There’s a slick technique in statistics for handling “non-standard” situations like these, called bootstrapping. The goal is to somehow estimate how variable the contents of some skier’s race effort bucket were, using only the race efforts we saw during the season. Bootstrapping tells us that we can get a rough sense of this by drawing many, many samples, with replacement, from our actual data.
That probably got too technical for some of you, so here’s what’s going on. Let’s say we have someone who did 5 races, scoring 1, 2, 3, 4 and 5 FIS points. When I say “draw a sample with replacement from this collection of races”, what I mean is that we generate another slightly different collection of five scores, using only these particular 5 scores. Some examples might be (2,2,4,5,5) or (1,1,1,1,1) or (5,3,2,2,1). Some of our original races might not appear at all, and some might appear multiple times. These are called “bootstrap samples”, in contrast with our original sample (1,2,3,4,5).
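For the curious, drawing bootstrap samples like these takes only a few lines of Python. This sketch uses the standard library’s `random.choices`, which samples with replacement; the `bootstrap_sample` helper name is made up for illustration.

```python
import random


def bootstrap_sample(scores, rng):
    """Draw len(scores) values from scores, with replacement."""
    return rng.choices(scores, k=len(scores))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
original = [1, 2, 3, 4, 5]

# Three bootstrap samples, each the same size as the original.
samples = [bootstrap_sample(original, rng) for _ in range(3)]
for sample in samples:
    print(sorted(sample))
```

Each printed list contains only values from the original five scores, but some scores may repeat and others may be missing entirely.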
For each of these new sets of scores, we calculate whatever measure we’re using (average, average of the best five, etc.), and then measure how much these values vary.
So that’s what those blue lines are in my ranking graphs. I generate a bunch of bootstrap samples and calculate the skier’s average of their best five races within each bootstrap sample. The blue lines represent how much variability we saw. This gives a sense of the contents of each skier’s race effort bucket, at least during that season.
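Putting the pieces together, a rough Python sketch of this kind of error bar could look like the following. Several things here are assumptions rather than a description of the actual graphs: the function names, the number of bootstrap samples, the made-up race results, and in particular the choice to summarize the variability as the middle 95% of the bootstrap values. Other summaries, such as the standard deviation, would be just as reasonable.

```python
import random


def best_five_average(points):
    """Average of the five lowest (best) FIS point results."""
    return sum(sorted(points)[:5]) / 5


def bootstrap_interval(points, n_boot=2000, lower=0.025, upper=0.975, seed=0):
    """Rough percentile interval for the best-five average.

    Resamples the season's races with replacement n_boot times,
    recomputes the best-five average each time, and returns the
    values at the lower and upper quantiles. Assumes the skier has
    at least five races.
    """
    rng = random.Random(seed)
    stats = sorted(
        best_five_average(rng.choices(points, k=len(points)))
        for _ in range(n_boot)
    )
    low = stats[int(lower * (n_boot - 1))]
    high = stats[int(upper * (n_boot - 1))]
    return low, high


# Made-up season of ten races, purely for illustration.
races = [12.4, 8.1, 15.0, 9.7, 22.3, 7.5, 11.2, 18.6, 10.0, 13.9]

estimate = best_five_average(races)
low, high = bootstrap_interval(races)
print(f"best-five average: {estimate:.2f}")
print(f"bootstrap interval: {low:.2f} to {high:.2f}")
```

The wider the interval, the more varied the contents of that skier’s bucket appear to have been during the season, and the less weight small differences in average FIS points deserve.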
What’s in your bucket?