Athlete Similarity

This idea comes from a similar sort of game that baseball stat heads play.  The idea is to match up similar athletes based upon their performances.  In baseball, we’d look at a whole slew of statistics for some current player’s career and then scan through the historical records and find other players who had similar careers (at least, where they overlap in terms of age).

Obviously, there’s a whole heck of a lot more information floating around about baseball players than skiers, so we shouldn’t have very high expectations.  But as I’ll demonstrate, we can actually create something that works ok.

First, the obvious stuff.  I’m going to measure similarity in sprint and distance events separately.  Most of my time getting this to work sensibly was spent trying different ways of measuring how “similar” two athletes’ results are.  There are tons of ways to do this, but what I settled on works pretty well.

I’ll walk through the process using an example.  Let’s take the Japanese skier Sumiko Yokoyama.  She’s primarily done distance events, so we’ll only look for skiers who are similar in that respect.  Why her?  Well, no reason in particular, except that she’s had a long career with a bunch of races, so it’s a fairly easy case to start with.[1]

So here are Yokoyama’s distance results from major international events (WC, OWG and WSC):

I wasted a ton of time brainstorming all sorts of features of a skier’s career to include in my similarity measure: averaging things by season, counting the number of results at various levels, blah blah blah.

Waste of time.

Instead, I decided to convert a skier’s result graph, as shown above, into a 2-d density estimate:

Now, hopefully I haven’t lost too many of you already.  Every single one of you has seen a graph like this.  It’s called a topo map.  So the colors are darker where the points in the first graph are denser.
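For anyone who wants to see the mechanics, here is a minimal Python sketch of turning (age, FIS points) results into a density grid. It is not the code behind the plots in this post. The function name `density_grid`, the Gaussian kernel, the grid size, the axis ranges and the bandwidths are all assumptions chosen for illustration, and the sample results are made up rather than Yokoyama’s.

```python
import math

def density_grid(ages, points, age_range=(18.0, 40.0), pts_range=(0.0, 200.0),
                 n_age=20, n_pts=20, bw_age=1.0, bw_pts=15.0):
    """Gaussian kernel density estimate evaluated on a regular grid.

    Returns a list of rows (one per FIS-point level); each row holds one
    density value per age level. Ranges, grid size and bandwidths are
    illustrative assumptions, not values from the original analysis.
    """
    age_step = (age_range[1] - age_range[0]) / (n_age - 1)
    pts_step = (pts_range[1] - pts_range[0]) / (n_pts - 1)
    grid = []
    for i in range(n_pts):
        y = pts_range[0] + i * pts_step
        row = []
        for j in range(n_age):
            x = age_range[0] + j * age_step
            total = 0.0
            # Every race adds a small "bump" centred on its own (age, points).
            for a, p in zip(ages, points):
                za = (x - a) / bw_age
                zp = (y - p) / bw_pts
                total += math.exp(-0.5 * (za * za + zp * zp))
            row.append(total)
        grid.append(row)
    # Scale so the cells sum to 1, which keeps long and short careers comparable.
    cell_sum = sum(sum(row) for row in grid)
    if cell_sum == 0:
        return grid
    return [[v / cell_sum for v in row] for row in grid]

# Made-up results: a cluster of races around age 25 plus one later race.
ages = [24.1, 24.3, 25.0, 25.2, 26.1, 30.5]
points = [60.0, 55.0, 48.0, 52.0, 45.0, 40.0]
grid = density_grid(ages, points)
```

The darkest part of the picture corresponds to the largest values in `grid`, which here sit near the cluster of races around age 25.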

Why do this?  Well, this second plot is really just a matrix of values (like a topo map!).  Each pixel has a number associated with it that tells us how dark to color that pixel.  On a topo map, that number would be the elevation.  To compare two athletes, all I need to do is subtract their images from each other, pixel by pixel.
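The pixel-by-pixel comparison can be sketched in a few lines of Python. How the differences get boiled down to a single number isn’t spelled out above, so summing the absolute differences is an assumption here, and `grid_distance` is a hypothetical name.

```python
def grid_distance(grid_a, grid_b):
    """Sum of absolute cell-by-cell differences between two equal-sized grids.

    Using the absolute difference is an assumption; squared differences
    would be another reasonable choice.
    """
    same_rows = len(grid_a) == len(grid_b)
    same_cols = all(len(ra) == len(rb) for ra, rb in zip(grid_a, grid_b))
    if not (same_rows and same_cols):
        raise ValueError("grids must have the same shape")
    return sum(abs(a - b)
               for row_a, row_b in zip(grid_a, grid_b)
               for a, b in zip(row_a, row_b))

# Two tiny 2x2 "images" with all of their density in different rows.
skier_a = [[0.5, 0.5], [0.0, 0.0]]
skier_b = [[0.0, 0.0], [0.5, 0.5]]
same = grid_distance(skier_a, skier_a)       # identical images
different = grid_distance(skier_a, skier_b)  # no overlap at all
```

Smaller values mean more similar careers, with zero for identical images.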

The beauty of this is that it automatically incorporates every single data point.  Collapsing someone’s career down to arbitrarily defined variables will always miss stuff, and it’s just a headache.

The basic idea is to scan through athletes looking for skiers whose 2-d density looks like the one above.  Now, there are a lot of skiers around, and at the moment I haven’t put too much energy into coding this in a speedy fashion.[2]  So I actually do a tiny bit of pruning of the potential candidates using some crude measures.  Mainly I’m just tossing people who are really, really different using a simpler, faster measure.  This means I generally only need to scan through ~100-200 athletes instead of 1000.
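The prune-then-rank pattern looks roughly like this in Python. The crude measure isn’t specified above, so comparing median FIS points is purely an assumption, as are the names `prune`, `rank` and `placeholder_distance` and the 30-point cutoff. `placeholder_distance` only stands in for the slow density comparison.

```python
from statistics import median

def prune(target_points, candidates, max_gap=30.0):
    """Cheap first pass: keep candidates whose median FIS points are
    within max_gap of the target's median (an assumed crude measure)."""
    target_mid = median(target_points)
    return {name: pts for name, pts in candidates.items()
            if abs(median(pts) - target_mid) <= max_gap}

def rank(target_points, survivors, distance, k=8):
    """Expensive second pass: the k survivors with the smallest distance."""
    ordered = sorted(survivors,
                     key=lambda name: distance(target_points, survivors[name]))
    return ordered[:k]

def placeholder_distance(a, b):
    # Stand-in for the slow density comparison: mean gap between sorted results.
    pairs = list(zip(sorted(a), sorted(b)))
    return sum(abs(x - y) for x, y in pairs) / len(pairs)

# Made-up FIS points for a target skier and three candidates.
target = [52, 58, 61]
candidates = {"A": [50, 55, 60], "B": [150, 160, 170], "C": [45, 70, 65]}
survivors = prune(target, candidates)        # "B" is far too different
closest = rank(target, survivors, placeholder_distance, k=2)
```

The point of the split is that the slow comparison only ever runs on the skiers who survive the cheap one.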

So let’s see how well this works.  The graph below shows the FIS points vs. age plots for Yokoyama and her eight “most similar” athletes, in no particular order:

Not bad, if you ask me.  For a somewhat different look, here’s a single plot with a trend line for each athlete:

All roughly similar, I’d say.

Neat!

Obviously, I’ve chosen an example skier with a long career and a ton of races.  Finding athletes similar to, say, Liz Stephen, isn’t going to work nearly so well.  Garbage in, garbage out, as they say.[3]  On the other hand, on a technical level, this method does work with small amounts of data.  It will simply find athletes who are similar primarily over the portions of their careers where they overlap.  So if you take a skier who’s got 3 years of results, my method will find skiers who had similar results during that age range.
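One way to make the “only where they overlap” behaviour explicit is to compare the two grids only over the age columns where the target skier has results. This is a hedged sketch and not necessarily how the method above gets that behaviour. The name `overlap_distance`, the rows-by-points and columns-by-age layout, and the toy grids are all assumptions.

```python
def overlap_distance(target_grid, other_grid, age_levels, age_min, age_max):
    """Sum of absolute differences restricted to an age window.

    Grids are lists of rows (FIS-point levels) whose columns line up with
    age_levels. Only columns with age_min <= age <= age_max are compared.
    """
    cols = [j for j, age in enumerate(age_levels) if age_min <= age <= age_max]
    if not cols:
        raise ValueError("no grid columns fall inside the age window")
    return sum(abs(row_t[j] - row_o[j])
               for row_t, row_o in zip(target_grid, other_grid)
               for j in cols)

# Toy grids (not normalised): the target only raced at ages 22-24, while the
# other skier matches at those ages but also has results at age 28.
age_levels = [22, 24, 26, 28]
target_grid = [[0.5, 0.5, 0.0, 0.0],
               [0.0, 0.0, 0.0, 0.0]]
other_grid = [[0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]]
within_window = overlap_distance(target_grid, other_grid, age_levels, 22, 24)
whole_career = overlap_distance(target_grid, other_grid, age_levels, 22, 28)
```

Inside the window the two skiers look identical. Over the whole career the later results of the second skier count against the match.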

Baseball geeks use this kind of stuff to make projections about players’ careers: i.e. Joe Mauer is really similar to some collection of guys from the 1970s or something, so he’s likely to have a similar career.

I don’t have the level of confidence needed to make those kinds of projections.  But I’ll show you the results for various skiers in future posts, just ’cause I still think this is super cool.

  1. Yeah, I’m making my method look good here.  I’ll talk about when it doesn’t work in a bit.
  2. Actually, calculating 2-d density estimates is just notoriously slow, no matter what you do.  So it’s not entirely my fault.
  3. Liz Stephen isn’t garbage.  Small data sets are garbage.  You know what I mean.

About Joran

Comments

5 Responses to “Athlete Similarity”
  1. Mountainmums says:

    Great post.
    Making career predictions is tricky though.
    What could be fun is to select skiers that do have a long long career, and just use the data for the first few years, and match them to other skiers using the partial data set only. Then once you’ve got your “matches” you could check with the rest of the data set if they did have similar careers afterwards. It could give you some idea if careers are at all predictable from early results.
    You’d probably have to stick to racers having raced at approximately the same time though. Comparing FIS points from the interval start era and the mass start era might put a lot of noise in the data…

    • Joran says:

      I was being intentionally circumspect about making prospective predictions, instead focusing on the idea of which athletes have had (retrospective) careers similar to this one. I didn’t pursue that kind of career projection mostly because I assumed the data would be far too noisy for it to be of much use (as you point out). Still, it might be a fun exercise down the road…
