Merging Arthur Banks’ Time Series with COW

Recently I needed to combine data from two of the most widely used (at least in my subfield) cross-national time-series data sets: Arthur Banks’ time series and the Correlates of War Project (COW). Given how often these data sets are used, I was a bit surprised that I could not find a record of someone else combining them. The closest attempt I could find was Andreas Beger’s country names to COW codes do file.

Beger’s file has all of the country names in lower case, so I used Ruby’s upcase method to fix that. That change took care of just over 75 percent of the observations (10,396 of 14,523). Next, I had to deal with the fact that a number of the countries in Arthur Banks’ data no longer exist (they have names like Campuchea, Ceylon, and Ciskei; see here and here). That step was handled in the main script. After that, the data were ready to merge in Stata.
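The renaming step can be sketched in a few lines of Ruby. This is a minimal illustration, not the actual script: the upcasing is straightforward, and a small hash handles defunct country names (the two entries shown are illustrative samples, not the full mapping):

```ruby
# Map a country name from Arthur Banks' data to the form used in the
# COW-coded file: upcase it, then translate defunct state names.
# The entries below are a tiny illustrative sample of the real mapping.
DEFUNCT = {
  'CEYLON'    => 'SRI LANKA',
  'CAMPUCHEA' => 'CAMBODIA'
}

def normalize(name)
  up = name.upcase
  DEFUNCT.fetch(up, up)   # fall back to the upcased name itself
end
```

For example, normalize('ceylon') yields 'SRI LANKA', while a name with no defunct-state entry simply comes back upcased.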

I am not going to put the full combined data up because the people in control of Arthur Banks’ time series are really possessive. But if you already have both data sets, combining them should be much easier using these scripts.

Wednesday Nerd Fun: Rock, Paper, Scissors, Lizard, Spock

Fans of The Big Bang Theory will already know about this variant on the traditional game, Rock-Paper-Scissors. Adding two new moves–“Lizard” and “Spock”–increases the number of possible combinations in a two-player game from three to 10 (assuming we do not care about who is the first or second player), which greatly decreases the chance of a tie. This is exactly the reasoning that Sheldon gives for using the extended version:

But do you know the real story behind the game? Karen Heyman shares the account of how engineer Sam Kass invented the game with his now-wife Karen Bryla when they were students at Carnegie Mellon University.

For a pair of geeks, Spock’s Vulcan Salute was an obvious choice for an additional gesture. But what else could they use? At first they considered another geek favorite, a sock puppet, since it’s easy to mime. They quickly discarded it, because really, a sock against Spock? “We came up with a poisonous lizard,” says Kass, “Lizard Poisons Spock was the first of the new rules, and everything else kind of fell out from there.”

The expanded RPS turned out to serve a secondary role as a “Geek Test,” according to Kass. As he described the rules, he discovered the world is divided into our kind, and those who need to have “Paper Disproves Spock” explained to them.

Why five gestures? To quote Spock himself: “It’s logical.”

“So long as you have an odd number of hand signals, you can create a fully balanced graph where everything is beaten and beats the same number of things,” says Kass, “Four doesn’t work, because one will be unbalanced; but five works.”
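Kass’s “fully balanced graph” point is easy to check in code. Here is a minimal sketch of the five-gesture game, with the beats-relation written out as a hash (the verbs in the comments follow the show’s rules):

```ruby
# In Rock-Paper-Scissors-Lizard-Spock, every move beats exactly two
# others and loses to exactly two -- the balanced graph Kass describes.
BEATS = {
  'rock'     => ['scissors', 'lizard'],  # rock crushes scissors and lizard
  'paper'    => ['rock', 'spock'],       # paper covers rock, disproves Spock
  'scissors' => ['paper', 'lizard'],     # scissors cuts paper, decapitates lizard
  'lizard'   => ['spock', 'paper'],      # lizard poisons Spock, eats paper
  'spock'    => ['scissors', 'rock']     # Spock smashes scissors, vaporizes rock
}

def winner(a, b)
  return nil if a == b                   # identical gestures tie
  BEATS[a].include?(b) ? a : b
end
```

Counting confirms the balance: each move appears exactly twice among the values of the hash, so each beats two moves and is beaten by two.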

It is a bit easier to understand the game from the following graphic:

Big Bang revisited the game again in the clip below, giving Kass credit for developing the game:

Now, there is a robot that beats humans at Rock-Paper-Scissors every time using high-speed vision. According to the Automaton blog, it’s too fast for players to tell that the robot is cheating. No word on whether the robot will be programmed to include Kass’s extended moves, but at least Sheldon’s dream of the singularity is one tiny step closer.

The Future of Checkpoint Security

Those who know me in person or follow me on Twitter are probably aware of my disdain for current TSA procedures. However, there is a glimmer of hope in this article from USA Today. The article describes new technology to be deployed at Dallas’s Love Field over the next few years.

Here’s the problem:

The Federal Aviation Administration projects the number of passengers flying inside the USA will nearly double in the next 20 years, to 1.2 billion. Security has slowed since the attacks of Sept. 11, 2001. Before then, about 350 people passed through checkpoints each hour, the IATA says. A November survey at 142 airports found processing times fell to 149 an hour, with the worst at 60, Dunlap says.

The proposed solution will involve implementing technology to identify the “riskiest” passengers and expose them to additional scrutiny, while letting most passengers move through much more quickly and conveniently than they can today. I have some privacy concerns, but the benefits are tangible:

Passengers would walk with their carry-ons through a screening tunnel, where they’d undergo electronic scrutiny — replacing what now happens at as many as three different stops as they’re scanned for metal objects, non-metallic items and explosives.

Passengers would no longer have to empty carry-ons of liquids and laptops before putting them on conveyor belts for X-ray scans. They could keep their belts and shoes on. They could avoid a backlog at full-body scanners and a finger swab for explosive residue.

Additional scrutiny since 9/11 has indeed stopped several attempts by hapless amateur terrorists, but the unseen cost of increased travel time is immense. This effort to speed up the process is much-needed.

Update: Scientific American says, “Outdated screening rules aren’t making for safer skies—just longer lines.”

Getting Started with Prediction

From historians to financial analysts, researchers of all stripes are interested in prediction. Prediction asks the question, “given what I know so far, what do I expect will come next?” In the current political season, presidential election forecasts abound. This tradition dates back to the work of Ray Fair, whose book is ridiculously cheap on Amazon. In today’s post, I will give an example of a much more basic–and hopefully more relatable–question: given the height of a father, how do we predict the height of his son?

To see how common predictions about children’s traits are, just Google “predict child appearance” and you will be treated to a plethora of websites and iPhone apps with photo uploads. Today’s example is more basic and will follow three questions that we should ask ourselves for making any prediction:

1. How different is the predictor from its baseline?
It’s not enough to just have a single bit of information from which to predict–we need to know something about the baseline of the information we are interested in (often the average value) and how different the predictor we are using is. The “Predictor” in this case will refer to the height of the father, which we will call U. The “outcome” in this case will be the height of the son, which we will call V.

To keep this example simple, let us assume that U and V are normally distributed–in other words, their distributions look like the familiar “bell curve” when they are plotted. To see how different our given observations of U or V are from their baseline, we “standardize” them into X and Y:

X = {{u - \mu_u} \over \sigma_u }

Y = {{v - \mu_v} \over \sigma_v },

where \mu is the mean and \sigma is the standard deviation. In our example, let \mu_u = 69, \mu_v=70, and \sigma_v = \sigma_u = 2.

2. How much variance in the outcome does the predictor explain?
In a simple one-predictor, one-outcome (“bivariate”) example like this, we can answer question #2 by knowing the correlation between X and Y, which we will call \rho (and which is equal to the correlation between U and V in this case). For simplicity’s sake let’s assume \rho={1 \over 2}. In real life we would probably estimate \rho using regression, which is really just the reverse of predicting. We should also keep in mind that correlation is only useful for describing the linear relationship between X and Y, but that’s not something to worry about in this example. Using \rho, we can set up the following prediction model for Y:

Y= \rho X + \sqrt{1-\rho^2} Z.

Plugging in the values above we get:

Y= {1 \over 2} X + \sqrt{3 \over 4} Z.

Z is explained in the next paragraph.

3. What margin of error will we accept?
No matter what we are predicting, we have to accept that our estimates are imperfect. We hope that on average we are correct, but that just means that all of our over- and under-estimates cancel out. In the above equation, Z represents our errors. For our prediction to be unbiased there has to be zero correlation between X and Z. You might think that is unrealistic, and you are probably right, even for our simple example. In fact, you can build a good career by pestering other researchers with this question every chance you get. But just go with me for now. The level of incorrect prediction that we are able to accept affects the “confidence interval.” We will ignore confidence intervals in this post, focusing instead on point estimates while recognizing that our predictions are unlikely to be exactly correct.

The Prediction

Now that we have set up our prediction model and nailed down all of our assumptions, we are ready to make a prediction. Let’s predict the height of the son of a man who is 72″ tall. In probability notation, we want

\mathbb{E}(V|U=72),

which is the expected son’s height given a father with a height of 72″.

Following the steps above, we first need to know how different 72″ is from the average height of fathers. Looking at the standardizations above, we get

X = {U-69 \over 2}, and

Y = {V - 70 \over 2}, so

\mathbb{E}(V|U=72) = \mathbb{E}(2Y+70|X=1.5) = \mathbb{E}(2({1 \over 2}X + \sqrt{3 \over 4}Z)+70|X=1.5),

which reduces to 1.5 + \sqrt{3}\mathbb{E}(Z|X=1.5) + 70. As long as we were correct earlier about Z not depending on X and having an average of zero, then we get a predicted son’s height of 71.5 inches, or slightly shorter than his dad, but still above average.
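The whole calculation can be checked numerically. A minimal sketch in Ruby, using the values assumed above (\mu_u = 69, \mu_v = 70, \sigma_u = \sigma_v = 2, \rho = 1/2):

```ruby
# Point prediction of a son's height from his father's height, under the
# assumptions in the text: normality, common sigma, and rho = 1/2.
MU_U, MU_V, SIGMA, RHO = 69.0, 70.0, 2.0, 0.5

def predict_son_height(father_height)
  x = (father_height - MU_U) / SIGMA  # standardize the predictor
  y = RHO * x                         # E(Z|X) = 0, so the error term drops out
  MU_V + SIGMA * y                    # translate Y back into inches
end
```

Plugging in a 72″ father reproduces the 71.5″ prediction from the derivation, and an average (69″) father yields an average (70″) son.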

This phenomenon of the outcome (son’s height) being closer to the average than the predictor (father’s height) is known as regression to the mean and it is the source of the term “regression” that is used widely today in statistical analysis. This dates back to one of the earliest large-scale statistical studies by Sir Francis Galton in 1886, entitled, “Regression towards Mediocrity in Hereditary Stature,” (pdf) which fits perfectly with today’s example.

Further reading: If you are already comfortable with the basics of prediction, and know a bit of Ruby or Python, check out Prior Knowledge.

Wednesday Nerd Fun: Where Things Come From

Sourcemap for a Laptop Computer, by Leo Bonanni

Where did your shoes come from? Your coffee? Your laptop? One of the beautiful things about the modern world is that you can hold a piece of technology in your hands–or wear it on your feet–without having to know the answer to this question.

As anyone who has read or heard “I, Pencil” knows, the genealogies of even the most banal products are immensely complicated. The punch line of that story is that pencils have too many components for one person to make or even fully understand, but through innovations like the price system they can be produced without central planning. In the author’s words, the moral is to “Leave all creative energies uninhibited.”

On today’s nerd fun site, Sourcemap, you can see cool visualizations of where all kinds of things come from. The maps for pencils are probably a little oversimplified, but some others are really neat, like the ones at the beginning of the post. My favorites are the food sourcemaps–check out Tropicana, Chicken of the Sea tuna, and Nutella. Now you know.

Gelman’s Five Essential Books on American Elections

The Browser today has an interview with Andrew Gelman, one of the best-informed researchers on American elections (and other things). His selections are a bit eclectic, but readers of this blog might find them interesting. The one that I am most likely to read is The 480, a novel about political consultants.

Notes on the Sinaloa Cartel

From NYT over the weekend. Some of the article is hyperbolic, but I present the interesting parts here without comment.

On logistics:

From the remote mountain redoubt where he is believed to be hiding, surrounded at all times by a battery of gunmen, Chapo oversees a logistical network that is as sophisticated, in some ways, as that of Amazon or U.P.S. — doubly sophisticated, when you think about it, because traffickers must move both their product and their profits in secret, and constantly maneuver to avoid death or arrest…. In its longevity, profitability and scope, it might be the most successful criminal enterprise in history.

On profitability:

The Sinaloa cartel can buy a kilo of cocaine in the highlands of Colombia or Peru for around $2,000, then watch it accrue value as it makes its way to market. In Mexico, that kilo fetches more than $10,000. Jump the border to the United States, and it could sell wholesale for $30,000. Break it down into grams to distribute retail, and that same kilo sells for upward of $100,000 — more than its weight in gold. And that’s just cocaine. Alone among the Mexican cartels, Sinaloa is both diversified and vertically integrated, producing and exporting marijuana, heroin and methamphetamine as well.

On corporate structure:

The organizational structure of the cartel also seems fashioned to protect the leadership. No one knows how many people work for Sinaloa, and the range of estimates is comically broad. Malcolm Beith, the author of a recent book about Chapo, posits that at any given moment, the drug lord may have 150,000 people working for him. John Bailey, a Georgetown professor who has studied the cartel, says that the number of actual employees could be as low as 150. The way to account for this disparity is to distinguish between salaried employees and subcontractors. A labor force of thousands may be required to plow all that contraband up the continent, but a lot of the work can be delegated to independent contractors, people the Mexican political scientist and security consultant Eduardo Guerrero describes as working “for the cartel but outside it.”

On violence:

“In illegal markets, the natural tendency is toward monopoly, so they fight each other,” Antonio Mazzitelli, an official with the United Nations Office on Drugs and Crime in Mexico City, told me. “How do they fight: Go to court? Offer better prices? No. They use violence.” The primal horror of Mexico’s murder epidemic makes it difficult, perhaps even distasteful, to construe the cartel’s butchery as a rational advancement of coherent business aims. But the reality is that in a multibillion-dollar industry in which there is no recourse to legally enforceable contracts, some degree of violence may be inevitable.

“It’s like geopolitics,” Tony Placido said. “You need to use violence frequently enough that the threat is believable. But overuse it, and it’s bad for business.”

Latitude, Longitude, and Culture

It is rare to see a “big idea” in social science that also lends itself to real-world analysis. A pessimistic categorization of the field might group researchers into “storytellers” and “regression runners.” Each group has a few stars who do their work very well, with many more who wish to imitate them. There is little cross-pollination between the groups, however. That is why I was excited to see a pre-print article from the Proceedings of the National Academy of Sciences where a leading empirical researcher, David Laitin, tests a theory of Jared Diamond’s.*

Diamond’s book Guns, Germs, and Steel is part of a genre that tries to explain much of world history in a few themes. Laitin describes one of those ideas in the article’s abstract:

Jared Diamond’s Guns, Germs, and Steel has provided a scientific foundation for answering basic questions, such as why Eurasians colonized the global South and not the other way around, and why there is so much variance in economic development across the globe. Diamond’s explanatory variables are: (i) the susceptibility of local wild plants to be developed for self-sufficient agriculture; (ii) the domesticability of large wild animals for food, transport, and agricultural production; and (iii) the relative lengths of the axes of continents with implications for the spread of human populations and technologies. This third “continental axis” thesis is the most difficult of Diamond’s several explanatory factors to test, given that the number of continents are too few for statistical analysis. This article provides a test of one observable implication of this thesis, namely that linguistic diversity should be more persistent to the degree that a geographic area is oriented more north-south than east-west. Using both modern states and artificial geographic entities as the units of analysis, the results provide significant confirmation of the relationship between geographic orientation and cultural homogenization. Beyond providing empirical support for one observable implication of the continental axis theory, these results have important implications for understanding the roots of cultural diversity, which is an important determinant of economic growth, public goods provision, local violence, and social trust.

A gated version of the paper can be found here, and Zoë Corbyn has a good summary here.


* These categorizations, as I said, are overly pessimistic and should not be taken too seriously. Laitin has ideas and Diamond tests his theories. But they have different comparative advantages.

Eliminate File Redundancy with Ruby

Say you have a file with many repeated, unnecessary lines that you want to remove. For safety’s sake, you would rather make an abbreviated copy of the file than overwrite the original. Ruby makes this a cinch. You just iterate over the file, putting every line the computer has already “seen” into a dictionary. If a line is not in the dictionary, it must be new, so write it to the output file. Here’s the code, designed with .tex files in mind but easily adaptable:

puts 'Filename?'
filename = gets.chomp
input  = + '.tex')
output = + '2.tex', 'w')
seen = {}
input.each_line do |line|
  next if seen[line]     # skip lines we have already written
  output.puts line
  seen[line] = true
end
input.close
output.close

Where would this come in handy? Well, the .tex extension probably already gave you a clue that I am reducing redundancy in a \LaTeX file. In particular, I have an R plot generated as a tikz graphic. The R plot includes a rug at the bottom (tick marks indicating data observations)–but the data set includes over 9,000 observations, so many of the lines are drawn right on top of each other. The \LaTeX compiler got peeved at having to draw so many lines, so Ruby helped it out by eliminating the redundancy. One special tweak for using the script above to modify tikz graphics files is to change the duplicate check

seen[line]

to

seen[line] && !(line.include? 'node') && !(line.include? 'scope') && !(line.include? 'path') && !(line.include? 'define')

if your plot has multiple panes (e.g. par(mfrow=c(1,2)) in R) so that Ruby won’t ignore seemingly redundant lines that are actually specifying new panes. The modified line is a little long and messy, but it works, and that was the main goal here. The resulting \LaTeX file compiles easily and more quickly than it did with all those redundant lines, thanks to Ruby.
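The filtering idea itself is easy to see on a few in-memory lines. A minimal sketch with a hypothetical three-line sample, showing that only the first occurrence of each repeated line survives:

```ruby
# Keep only the first occurrence of each line; later duplicates are dropped.
lines = [
  "\\draw (1,0) -- (1,1);",
  "\\draw (1,0) -- (1,1);",  # duplicate rug tick, will be dropped
  "\\node at (0,0) {x};"
]
seen = {}
kept = lines.reject { |l| duplicate = seen[l]; seen[l] = true; duplicate }
```

Here kept holds two lines: the first rug tick and the node line.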

Wednesday Nerd Fun: Games (and More) in Stata

Stata is a software program for running statistical analysis, as readers who have been to grad school in the social sciences in the last couple of decades will know. Compared to R, Stata is like an old TI-83 calculator, but it remains popular with those who spent the best years of their lives typing commands into its green-on-black interface. I recently discovered that Stata shares one important feature with the TI-83 calculator: the ability to play games. (For TI-83 games, see here and here.)

Eric Booth of Texas A&M shares this implementation of Blackjack in Stata:

The game is played by typing -blackjack- into the command window and then the game prompts the user for the amount she wants to bet (default is $500 which replenishes after you lose it all or you exit Stata), and whether to hit or stay. It doesn’t accurately represent all the rules and scenarios of a real game of blackjack (e.g., no doubling down), so don’t use it to prep for your run at taking down a Vegas casino.

Booth’s blog provides other fun, unconventional uses of Stata as well. There’s a script that lets you Google from the Stata interface, one that lets you control iTunes, and even one for running commands from an iPhone.

This post is probably less “general interest” than most of the nerd fun posts, but I hope you enjoyed it.