Classifying Olympic Athletes by Sport and Event (Part 2)

This is the second post in a three-part series. The first post, giving some background and describing the data, is here. In that post I pointed out David Epstein’s claim that he could identify an Olympian’s event knowing only her height and weight. The sheer number of Olympians–about 10,000–makes me skeptical, but I decided to see whether machine learning could produce the accurate predictions Mr. Epstein claims he could make.

To do this, I tried four different machine learning methods. These are all well-documented methods implemented in existing R packages. Code and data for this project are here (for sports) and here (for events).

The first two methods, conditional inference trees (using the party package) and evolutionary trees (using evtree), are both decision tree-based approaches. That means they sequentially split the data based on binary decisions: if an observation falls on one side of the split (say, height above 1.8 meters) it continues down one fork of the tree, and otherwise it goes down the other. The difference between the two methods is how the tree is formed: the first recursively partitions the data using statistical tests of conditional independence, while the second (as the name suggests) grows the tree with an evolutionary algorithm. To get a feel for how this actually divides the data, see the figure below and this post.


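The actual analysis used the R packages named above; to see what a single binary split does, here is a minimal sketch in Python. The heights, labels, and the 1.8-meter threshold are invented purely for illustration:

```python
# Toy illustration of one binary split in a decision tree: athletes
# taller than a threshold go down one branch, the rest down the other.
# Heights (in meters) and sport labels are made-up examples.
athletes = [
    (2.01, "Basketball"), (1.98, "Basketball"), (2.05, "Basketball"),
    (1.62, "Gymnastics"), (1.55, "Gymnastics"), (1.58, "Gymnastics"),
]

def split(data, threshold):
    """Partition observations on a single height threshold."""
    left = [label for height, label in data if height <= threshold]
    right = [label for height, label in data if height > threshold]
    return left, right

left, right = split(athletes, 1.8)
print(left)   # the shorter group
print(right)  # the taller group
```

A real tree-growing algorithm repeats this step recursively, choosing each threshold by some criterion (significance tests for party, an evolutionary search for evtree).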
If a single tree is good, a whole forest must be better–or at least that’s the thinking behind random forests, the third method I used. This method grows a large number of trees (500 in this case), each trained on a random resample of the data and allowed to consider only a random subset of the features at each split. Once we have a whole forest of trees, we combine their predictions, usually by majority vote. The combination looks a little bit like the figure below, and a good explanation is here.


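The two core ingredients of a random forest, resampling the data for each tree and combining the trees' votes, can be sketched in a few lines. This is an illustration in Python, not the post's actual R code, and the three votes shown stand in for the 500 trees used in the analysis:

```python
import random
from collections import Counter

random.seed(0)

def bootstrap_sample(data):
    """Each tree trains on a resample of the data, drawn with replacement."""
    return [random.choice(data) for _ in data]

def majority_vote(predictions):
    """The forest's prediction is the most common vote among its trees."""
    return Counter(predictions).most_common(1)[0][0]

# Three hypothetical trees' votes for a single athlete.
votes = ["Rowing", "Swimming", "Rowing"]
print(majority_vote(votes))  # the majority label, "Rowing"
```

Because each tree sees a different resample and feature subset, the trees make partly independent errors, which the vote averages away.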
The fourth and final method used–artificial neural networks–is a bit harder to visualize. Neural networks are sort of a black box, making them difficult to interpret and explain. At a coarse level they are intended to work like neurons in the brain: take some input, and produce output based on whether the input crosses a certain threshold. The neural networks I used have a single hidden layer with 30 (for sports classification) or 50 (for event classification) hidden nodes. To get a better feel for how neural networks work, see this three-part series.
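To make the single-hidden-layer idea concrete, here is a forward pass written out in plain Python. The weights are made-up numbers purely for illustration (real weights come from training), and three hidden nodes stand in for the 30 or 50 used in the analysis:

```python
import math

def logistic(x):
    """Smooth threshold: maps any input to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, hidden_weights, output_weights):
    """Inputs -> hidden layer (logistic activations) -> output score."""
    hidden = [logistic(sum(w * x for w, x in zip(ws, inputs)))
              for ws in hidden_weights]
    return logistic(sum(w * h for w, h in zip(output_weights, hidden)))

# Two features (say, rescaled height and weight), three hidden nodes.
features = [0.9, 0.4]
hidden_w = [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.6]]
output_w = [1.2, -0.7, 0.4]
score = forward(features, hidden_w, output_w)
print(round(score, 3))
```

Each hidden node computes a weighted sum of the inputs and squashes it through the logistic function; the output node does the same over the hidden activations, yielding a score that can be read as a class probability.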

That’s a very quick overview of the four machine learning methods that I applied to classifying Olympians by sport and event. The data and R code are available at the link above. In the next post, scheduled for Friday, I’ll share the results.

Classifying Olympic Athletes by Sport and Event (Part 1)

Note: This post is the first in a three-part series. It describes the motivation for this project and the data used. When parts two and three are posted I will link to them here.

Can you predict which sport or event an Olympian competes in based solely on her height, weight, age and sex? If so, that would suggest that physical features strongly drive athletes’ relative abilities across sports, and that they pick sports that best leverage their physical predisposition. If not, we might infer that athleticism is a latent trait (like “grit”) that can be applied to the sport of one’s choice.

David Epstein argues that sporting success is largely based on heredity in his book, The Sports Gene. To support his argument, he describes how elite athletes’ physical features have become more specialized to their sport over time (think Michael Phelps). At a basic level Epstein is correct: generally speaking, males and females differ both genetically and in their physical features.

However, Epstein advanced a stronger claim in an interview (at 29:46) with Russ Roberts:

Roberts: [You argue that] if you simply had the height and weight of an Olympic roster, you could do a pretty good job of guessing what their events are. Is that correct?

Epstein: That’s definitely correct. I don’t think you would get every person accurately, but… I think you would get the vast majority of them correctly. And frankly, you could definitely do it easily if you had them charted on a height-and-weight graph, and I think you could do it for most positions in something like football as well.

I chose to assess Epstein’s claim in a project for a machine learning course at Duke this semester. The data was collected by The Guardian, and includes all participants for the 2012 London Summer Olympics. There was complete data on age, sex, height, and weight for 8,856 participants, excluding dressage (an oddity of the data is that every horse-rider pair was treated as the sole participant in a unique event described by the horse’s name). Olympians participate in one or more events (fairly specific competitions, like a 100m race), which are nested in sports (broader categories such as “Swimming” or “Athletics”).

Athletics is by far the largest sport category (around 20 percent of athletes), so when it was included it dominated the predictions. To get more accurate classifications, I excluded Athletics participants from the sport classification task. This left 6,956 participants in 27 sports, split into a training set of size 3,520 and a test set of size 3,436. The 1,900 Athletics participants were classified into 48 different events, and also split into training (907 observations) and test sets (993 observations). For athletes participating in more than one event, only their first event was used.

What does an initial look at the data tell us? The features of athletes in some sports (Basketball, Rowing, Weightlifting, and Wrestling) and events (100m hurdles, Hammer throw, High jump, and Javelin) exhibit strong clustering patterns. This makes it relatively easy to guess a participant’s sport or event from her features. In other sports (Archery, Swimming, Handball, Triathlon) and events (100m race, 400m hurdles, 400m race, and Marathon) there are many overlapping clusters, making classification more difficult.


Well-defined (left) and poorly-defined clusters of height and weight by sport.

Well-defined (left) and poorly-defined clusters of height and weight by event.


The next post, scheduled for Wednesday, will describe the machine learning methods I applied to this problem. The results will be presented on Friday.

Two Unusual Papers on Monte Carlo Simulation

For Bayesian inference, Markov chain Monte Carlo (MCMC) methods were a huge breakthrough. They provide a principled way to simulate from a posterior probability distribution, and are useful for approximating integrals that are analytically intractable. Usually MCMC is performed by computer, but I recently read two papers that apply Monte Carlo simulation in more unusual ways.

The first is Markov Chain Monte Carlo with People. MCMC with people is somewhat similar to playing the game of telephone–there is input “data” (think of the starting word in the telephone game) that is transmitted across stages where it can be modified and then output at the end. In the paper the authors construct a task so that human learners approximately follow an MCMC acceptance rule. I have summarized the paper in slightly more detail here.
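The acceptance rule the paper's human learners approximate is the standard Metropolis rule: accept a proposed move with probability proportional to how much more likely it is under the target distribution. A minimal sketch in Python, using a standard normal target for illustration (the paper's targets are learned category distributions, not this one):

```python
import math
import random

random.seed(42)

def target(x):
    """Unnormalized density of a standard normal target distribution."""
    return math.exp(-x * x / 2)

def metropolis(n_samples, step=1.0):
    """Random-walk Metropolis sampler with a uniform proposal."""
    x = 0.0
    samples = []
    for _ in range(n_samples):
        proposal = x + random.uniform(-step, step)
        # Accept with probability min(1, target(proposal) / target(x)).
        if random.random() < target(proposal) / target(x):
            x = proposal
        samples.append(x)
    return samples

draws = metropolis(20000)
mean = sum(draws) / len(draws)
print(round(mean, 2))  # should be near 0, the target's mean
```

In the "MCMC with people" setup, the proposal and acceptance steps are embedded in choices made by experimental subjects, so the sequence of their responses forms the chain.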

The second paper is even less conventional: the authors approximate the value of π using a “Mossberg 500 pump-action shotgun as the proposal distribution.” Their simulated value is 3.131, within 0.33% of the true value. As the authors state, “this represents the first attempt at estimating π using such method, thus opening up new perspectives towards computing mathematical constants using everyday tools.” Who said statistics has to be boring?
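The estimator behind the shotgun paper is the classic Monte Carlo approach to π: scatter points uniformly over a square and count the fraction that land inside the inscribed quarter circle. Here is that textbook version in Python, with an ordinary pseudo-random generator standing in for the shotgun:

```python
import random

random.seed(1)

def estimate_pi(n_points):
    """Estimate pi from the fraction of uniform points inside the
    quarter circle of radius 1: area ratio = (pi/4) / 1."""
    inside = sum(1 for _ in range(n_points)
                 if random.random() ** 2 + random.random() ** 2 <= 1)
    return 4 * inside / n_points

print(round(estimate_pi(100_000), 2))  # close to 3.14
```

The accuracy improves only with the square root of the number of points, which helps explain why a shotgun-powered sampler still landed within a fraction of a percent of the true value.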


What Really Happened to Nigeria’s Economy?

You may have heard the news that the size of Nigeria’s economy now stands at nearly $500 billion. Taken at face value (as many commenters have seemed all too happy to do) this means that the West African state “overtook” South Africa’s economy, which was roughly $384 billion in 2012. Nigeria’s reported GDP for that year was $262 billion, meaning its economy roughly doubled in a year.

How did this “growth” happen? As Bloomberg reported:

On paper, the size of the economy expanded by more than three-quarters to an estimated 80 trillion naira ($488 billion) for 2013, Yemi Kale, head of the National Bureau of Statistics, said at a news conference yesterday to release the data in the capital, Abuja….

The NBS recalculated the value of GDP based on production patterns in 2010, increasing the number of industries it measures to 46 from 33 and giving greater weighting to sectors such as telecommunications and financial services.

The actual change appears to be due almost entirely to Nigeria including activity in its GDP calculation that had previously been excluded. There is nothing wrong with this, per se, but it makes year-over-year comparisons misleading. It would be like measuring your height in bare feet for years, then doing it while wearing platform shoes: your reported height would jump without any real growth taking place. Similar complications arise when comparing Nigeria’s new figures to those of other countries that have not changed their methodology.

Nigeria’s recalculation adds another layer of complexity to the problems plaguing African development statistics. Lack of transparency (not to mention accuracy) in reporting economic activity makes decisions about foreign aid and favorable loans more difficult. For more information on these problems, see this post discussing Morten Jerven’s book Poor Numbers. If you would like to know more about GDP and other economic summaries, and how they shape our world, I would recommend Macroeconomic Patterns and Stories (somewhat technical), The Leading Indicators, and GDP: A Brief but Affectionate History.