A Checklist for Using Open Source Software in Production

A great majority of the web is built on open source software. Approximately two-thirds of public servers on the internet run a *nix operating system, and over half of those are Linux. The most popular server-side programming languages also tend to be open source (including my favorite, Ruby). This post is about adding a new open source library to an existing code base. What questions should you ask before adding such a dependency to a production application?

The first set of questions are the most basic. A “no” to any of these should prompt you to look elsewhere.
  • Is the project written in a language you support? Is it in a language you support? If not, is it compatible (e.g. through stdin/stdout or by compiling to your language of choice)?
  • Is the project in a version of of the language you support? If it’s written in Python 3 and you only support Python 2, for example, using this library could lead to headaches.
  • Can you use the project in your framework of choice (e.g. Rails or Django)?
  • Are there conflicts with other libraries or packages you’re currently using? (This is probably the hardest question to answer, and you might not know until you try it.)
Assuming there are no immediate technical barriers, the next questions to ask are of the legal variety. Open source licenses come in many flavors. In the absence of a license, traditional copyright rules apply. Be especially careful if the project you are investigating uses the GPL license–even basing the code you write off of a GPL open source project can have serious legal ramifications. There’s a great guide to OSS licenses on Github. If you’re the author or maintainer of an open source project checkout choosealicense.com.
The next thing to consider is whether and how the project is tested. If there is not an automated test suite, consider starting one as your first contribution to the project and be very reluctant to add the project to your application. Other related questions include:
  • Are there unit tests?
  • Are there integration tests?
  • What is the test coverage like?
  • Do the tests run quickly?
  • Are the tests clearly written?
Finally, by using an open source project you are also joining a community of developers. None of these questions are necessarily show-stoppers but knowing the size of the community and the tone of its discourse can save you pain down the road.
  • Is the project actively maintained? When was the last commit?
  • Does the community have a civil, professional style of debate and discussion?
  • Is there only one developer/maintainer who knows everything? This doesn’t have to be a deal breaker. However, if there is a single gatekeeper you should make sure you understand the basics of the code and could fork the project if necessary.

This is by no means an exhaustive list but these questions can serve as a useful checklist before adding an open source as a dependency for your project.

Grad Student Gift Ideas

My sister is starting a graduate program this fall, so I wanted to put together a “grad school survival kit” gift basket for her. When I was looking, though, most search results for things like that were put together by gift basket companies and a large number of them include junk food as filler. While junk food can be a great stress reliever, I would not recommend making that the bulk of your gift to a grad student. Instead, consider some of the following gifts that range from practical to fun:


Managing Memory and Load Times in R and Python

Once I know I won’t need a file again, it’s gone. (Regular back-ups with Time Machine have saved me from my own excessive zeal at least once.) Similar economy applies to runtime: My primary computing device is my laptop, and I’m often too lazy to fire up a cloud instance unless the job would take more than a day.

Working with GDELT data for the last few weeks I’ve had to be a bit less conservative than usual. Habits are hard to break, though, so I found myself looking for a way to

  1. keep all the data on my hard-drive, and
  2. read it into memory quickly in R and/or Python.

The .zip files you can obtain from the GDELT site accomplish (1) but not (2). A .rda
binary helps with part of (2) but has the downside of being a binary file that I might not be able to open at some indeterminate point in the future–violating (1). And a memory-hogging CSV that also loads slowly is the worst option of all.

So what satisficing solution did I reach? Saving gzipped files (.gz). Both R and Python can read these files directly (R code shown below; in Python use gzip.open
or the compression option for read_csv in pandas). It’s definitely smaller–the 1979 GDELT historical backfile compresses from 115.3MB to 14.3MB (an eighth of its former size). Reading directly into R from a .gz file has been available since at least version 2.10.

Is it faster? See for yourself:

> system.time(read.csv('1979.csv', sep='\t', header=F, flush=T, as.is=T) )
user system elapsed
48.930 1.126 50.918
> system.time(read.csv('1979.csv.gz', sep='\t', header=F, flush=T, as.is=T))
user system elapsed
23.202 0.849 24.064
> system.time(load('gd1979.rda'))
user system elapsed
5.939 0.182 7.577

Compressing and decompressing .gz files is straightforward too. In the OS X Terminal, just type gzip filename or gunzip filename, respectively

Reading the gzipped file takes less than half as long as the unzipped version. It’s still nowhere near as fast as loading the rda binary, but I don’t have to worry about file readability for many years to come given the popularity of *nix operating systems. Consider using .gz files for easy memory management and quick loading in R and Python.

JavaScript Politics

r-anarchismIn a recent conversation on Twitter, Christopher Zorn said that Stata is fascism, R is anarchism, and SAS is masochism. While only one of these is plausibly a programming language, it’s an interesting political analogy. We’ve discussed the politics of the Ruby language before.

Today I wanted to share a speaker deck by Angus Croll on the politics of Javascript. He describes periods of anarchy (1995-2004), revolution (2004-2006), and coming of age (2007-2010). We’re currently in “the itch” (2011-2013). There are a number of other political dimensions in the slides as well. Click the image below to see the deck in full.


If anyone knows of a video of the presentation, I’d love to see it. Croll also wrote an entertaining article with Javascript code in the style of famous authors like Hemingway, Dickens, and Shakespeare.

Python for Political Scientists, Spring 2013 Recap

pythonThis spring Josh Cutler‘s Python course was back by popular demand. (This time it was known as “Computational Political Economy” but I like the less formal title.) I participated this time around as a teaching assistant rather than student, and it was a thoroughly enjoyable experience. The course syllabus and schedule is on Github.

Class participants were expected to have a basic familiarity with Python from going through Zed Shaw’s book over Christmas break outside of class. Each Tuesday Josh would walk them through a new computer science concept and explain how it could be used for social science research. These topics included databases, networks, web scraping, and linear programming. On Thursdays they would come to a lab session and work together in small groups to solve problems or answer questions based on some starter code that I supplied. I generally tried to make the examples relevant and fun but you would have to ask them whether I succeeded.

The class ended this past Saturday with final presentations, which were all great. The first project scraped data from the UN Millenium Development Goal reports and World Bank statistics to compare measures of maternal mortality in five African countries and show how they differed–within the same country! This reminded me of Morten Jerven’s book Poor Numbers on the inaccuracy of African development statistics (interview here).

In the second presentation, simulated students were treated with one of several education interventions to see how their abilities changed over time. These interventions could be applied uniformly to everyone or targeted at those in the bottom half of the distribution. Each child in the model had three abilities that interacted in different ways, and interventions could target just one of these abilities or several in combination. Combining these models with empirical data on programs like Head Start is an interesting research program.

The third presentation also used a computational model. Finding equilibrium networks of interstate alliances is incredibly difficult (if not impossible) to do analytically when the number of states is large. The model starts with pre-specified utility functions and runs until the network hits an equilibria. Changing starting values allows for the discovery of multiple equilibria. This model will also be combined with empirical data in the future.

For the fourth and final presentation, one participant collected data on campaign events in Germany for each of the political parties during the current election cycle. This reminded me of a Washington Post website (now taken down) detailing campaign visits in 2008 that I scraped last year and used in lab once this semester.

These examples show the wide variety of uses for programming in social science. From saving time in data collection to generating models that could not be done without the help of algorithms, a little bit of programming knowledge goes a long way. Hopefully courses like this will become more prominent in social science graduate (and undergraduate) programs over the coming years. Thanks to Josh and all the class participants for making it a great semester!


Note: I am happy to give each of the presenters credit for their work, but did not want to reveal their names here due to privacy concerns. If you have questions about any of the presentations I can put you in touch with their authors via email.

The Political Economy of Scrabble: Currency, Innovation, and Norms

Scrabble ornaments made by Jennifer Bormann, 2011

Scrabble Christmas ornaments made by Jennifer Bormann, 2011

In Scrabble, there is a finite amount of resources (letter tiles) that players use to create value (points) for themselves. Similarly, in the real world matter cannot be created so much of human effort is rearranging the particles that exist into more optimal combinations. The way that we keep track of how desirable those new combinations are in the economy is with money. Fiat currency has no intrinsic value–it is just said to be worth a certain amount. Sometimes this value changes in response to other currencies. Other times, governments try to hold it fixed. The “law of Scrabble” has remained unchanged since 1938 when it was introduced–but that may be about to change.

Like any well-intentioned dictator, Scrabble inventor Alfred Butts tried to base the value of his fiat money–er, tiles–on a reasonable system:  the frequency of their appearance on the front page of the New York Times. As the English language and the paper of record have evolved over the years, though, the tiles’ stated value has remained static. This has opened the door for arbitrage opportunities, although some players try to enforce norms to discourage this type of play:

What has changed in the intervening years is the set of acceptable words, the corpus, for competitive play. As an enthusiastic amateur player I’ve annoyed several relatives with words like QI and ZA, and I think the annoyance is justified: the values for Scrabble tiles were set when such words weren’t acceptable, and they make challenging letters much easier to play.

That is a quote from Joshua Lewis, who has proposed updating Scrabble scoring using his open source software package Valett. He goes on to say:

For Scrabble, Valett provides three advantages over Butts’ original methodology. First, it bases letter frequency on the exact frequency in the corpus, rather than on an estimate. Second, it allows one to selectively weight frequency based on word length. This is desirable because in a game like Scrabble, the presence of a letter in two- or three-letter words is valuable for playability (one can more easily play alongside tiles on the board), and the presence of a letter in seven- or eight-letter words is valuable for bingos. Finally, by calculating the transition probabilities into and out of letters it quantifies the likelihood of a letter fitting well with other tiles in a rack. So, for example, the probability distribution out of Q is steeply peaked at U, and thus the entropy of Q’s outgoing distribution is quite low.

Lewis’s idea seems to fit with a recent finding by Peter Norvig of Google. Norvig was contacted last month by Mark Mayzner, who studied the same kind of information as the Valett package but did it back in the early 1960s. Mayzner asked Norvig whether his group at Google would be interested in updating those results from five decades ago using the Google Corpus Data. Here’s what Norvig has to say about the process:

The answer is: yes indeed, I (Norvig) am interested! And it will be a lot easier for me than it was for Mayzner. Working 60s-style, Mayzner had to gather his collection of text sources, then go through them and select individual words, punch them on Hollerith cards, and use a card-sorting machine.

Here’s what we can do with today’s computing power (using publicly available data and the processing power of my own personal computer; I’m not not relying on access to corporate computing power):

1. I consulted the Google books Ngrams raw data set, which gives word counts of the number of times each word is mentioned (broken down by year of publication) in the books that have been scanned by Google.

2. I downloaded the English Version 20120701 “1-grams” (that is, word counts) from that data set given as the files “a” to “z” (that is, http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-1gram-20120701-a.gz to http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-1gram-20120701-z.gz). I unzipped each file; the result is 23 GB of text (so don’t try to download them on your phone).

3. I then condensed these entries, combining the counts for all years, and for different capitalizations: “word”, “Word” and “WORD” were all recorded under “WORD”. I discarded any entry that used a character other than the 26 letters A-Z. I also discarded any word with fewer than 100,000 mentions. (If you want you can download the word count file; note that it is 1.5 MB.)

4. I generated tables of counts, first for words, then for letters and letter sequences, keyed off of the positions and word lengths.

Here is the breakdown of word lengths that resulted (average=4.79):


Sam Eifling then took Norvig’s results and translated them into updated Scrabble values:

While ETAOINSR are all, appropriately, 1-point letters, the rest of Norvig’s list doesn’t align with Scrabble’s point values….

This potentially opens a whole new system of weighing the value of your letters….  H, which appeared as 5.1 percent of the letters used in Norvig’s survey, is worth 4 points in Scrabble, quadruple what the game assigns to the R (6.3 percent) and the L (4.1 percent) even though they’re all used with similar frequency. And U, which is worth a single point, was 2.7 percent of the uses—about one-fifth of E, at 12.5 percent, but worth the same score. This confirms what every Scrabble player intuitively knows: unless you need it to unload a Q, your U is a bore and a dullard and should be shunned.

However, Norving included repeats like “THE”–not much fun to play in Scrabble, and certainly not with the same frequency it appears in the text corpus (1 out of 14 turns). With the help of his friend Kyle Rimkus, Eifling conducted a letter-frequency survey of words from the Scrabble dictionary and came up with these revisions to the scoring system:

Image from Slate

Image from Slate

Eifling points out that Q and J seem quite undervalued in the present scoring system. So what is an entrepreneurial player to do? “Get rid of your J and your Q as quickly as possible, because they’re just damn hard to play and will clog your rack. The Q, in fact, is the worst offender,” he says.

Now as with any proposed policy update that challenges long-standing norms, there has been some pushback against these recent developments.  at Slate quotes the old guard of Scrabble saying that the new values “take the fun out” of the game. Fatsis seems to hope that the imbalance between stated and practical values will persist:

Quackle co-writer John O’Laughlin, a software engineer at Google, said the existing inequities also confer advantages on better players, who understand the “equity value” of each tile—that is, its “worth” in points compared with the average tile. That gives them an edge in balancing scoring versus saving letters for future turns, and in knowing which letters play well with others. “If we tried to equalize the letters, this part of the game wouldn’t be eliminated, but it would definitely be muted,” O’Laughlin said. “Simply playing the highest score available every turn would be a much more fruitful strategy than it currently is.”

In political economy this is known as rent-seeking behavior. John Chew, doctoral student in mathematics at the University of Toronto and co-president of the North American Scrabble Players Association, went so far as to call Valett a “catastrophic outrage.”

Who knew that the much beloved board game could provoke such strong feelings? With a fifth edition of the Scrabble dictionary due in 2014 it seems possible but highly unlikely that there could be a response to these new findings. A more probable outcome is that we begin to see “black market” Scrabble valuations that incorporate the new data, much like underground economies emerge in states with strict official control over the value of their money. Yet again, evidence for politics in everyday life.

For more fun with letter games, data, and coding, check out Jeff Knups’ guide to “Creating and Optimizing a Letterpress Cheating Program in Python.”

Simulating the NLDS: Can the Giants Win?

In Allen Downey’s new book, Think Bayes, he relates the “Boston Bruins” problem. The problem is to estimate the Bruins’ probability of winning the 2010-2011 NHL championship after two wins and two losses. I will briefly describe Downey’s approach, and then relate it to the current situation of the San Francisco Giants.

One (naive) approach would be to model this as a gambler’s ruin problem. There are two problems with that model for this problem: the total number of rounds to be played is uncertain (i.e. the championship is a best of n rather than play until one side is totally defeated), and it throws away important information about the score of the games.

Instead, we model baseball as a Poisson process, in which it is equally likely for a run to be scored at any time during the game. This is still somewhat of an oversimplification (the odds are better when you have runners on base, for example), but we are getting closer to the “true” model. Second, we assume that games between the Reds and Giants in this year’s National League Division Series are similar enough that they can be considered as outcomes from Poisson distributions in which each team’s scoring distribution is consistent between games with parameter λ. Different pitchers could cause this assumption to be thrown off (no pun intended), but we will again use it as a not-entirely-implausible simplification.

Having made our assumptions, we now use a four-step process proposed by Downey:

1. Use statistics from previous games to choose a prior distribution for λ.
2. Use the score from the first four games to estimate λ for each team.
3. Use the posterior distributions of λ to compute distribution of goals for each team, the distribution of the goal differential, and the probability that each team wins.
4. Simulate the rest of the series to estimate the probability of each possible outcome.

To calculate λ, we will use the team batting stats from ESPN and the thinkbayes Python package from Downey’s site.

Here is the distribution of λ, using regular season scoring as the prior and updating with the results of the first four games of the division series:

And here are the predicted runs-per-game by team, using simulations:

According to the model’s predictions, the probability that the Giants win today’s game (and the division series) is 0.387. I would have preferred to use a Gamma prior for λ and run some more simulations in R, but I wanted to use Downey’s example and get this up before the game started… which was a few minutes ago (although as I post, the score is still 0-0). Either way, enjoy the game!

Wednesday Nerd Fun: Python for iOS

This one is short and sweet. Would you like to be able to write Python code on an iOS device? Now you can, with this app.

I have spent some time playing around with the app this week, and it seems to have two main uses. The first is entertaining yourself while waiting in line/riding the bus/whatever dead time you have where you would like to do something slightly more productive than checking Twitter. The second usage I see is making small edits with an iPad or iPhone while you happen to be away from your computer. (See the screenshot below showing the file loading functionality, more here.) In other words, this would not be my primary programming environment but it is a nice complement.

If you have never given programming a chance, having this app might make learning Python seem more like a game. If that’s what it takes for you to plunge into coding, go for it!

Getting Started with Prediction

From historians to financial analysts, researchers of all stripes are interested in prediction. Prediction asks the question, “given what I know so far, what do I expect will come next?” In the current political season, presidential election forecasts abound. This dates back to the work of Ray Fair, whose book is ridiculously cheap on Amazon. In today’s post, I will give an example of a much more basic–and hopefully, relatable–question: given the height of a father, how do we predict the height of his son?

To see how common predictions about children’s traits are, just Google “predict child appearance” and you will be treated to a plethora of websites and iPhone apps with photo uploads. Today’s example is more basic and will follow three questions that we should ask ourselves for making any prediction:

1. How different is the predictor from its baseline?
It’s not enough to just have a single bit of information from which to predict–we need to know something about the baseline of the information we are interested in (often the average value) and how different the predictor we are using is. The “Predictor” in this case will refer to the height of the father, which we will call U. The “outcome” in this case will be the height of the son, which we will call V.

To keep this example simple let us assume that U and V are normally distributed–in other words their distributions look like the familiar “bell curve” when they are plotted. To see how different our given observations of U or V are from their baseline, we “standardize” them into X and Y

X = {{u - \mu_u} \over \sigma_u }

Y = {{v - \mu_v} \over \sigma_v },

where \mu is the mean and \sigma is the standard deviation. In our example, let \mu_u = 69, \mu_v=70, and \sigma_v = \sigma_u = 2.

2. How much variance in the outcome does the predictor explain?
In a simple one-predictor, one-outcome (“bivariate”) example like this, we can answer question #2 by knowing the correlation between  X and Y, which we will call \rho (and which is equal to the correlation between U and V in this case). For simplicity’s sake let’s assume \rho={1 \over 2}. In real life we would probably estimate \rho using regression, which is really just the reverse of predicting. We should also keep in mind that correlation is only useful for describing the linear relationship between X and Y, but that’s not something to worry about in this example. Using \rho, we can set up the following prediction model for Y:

Y= \rho X + \sqrt{1-\rho^2} Z.

Plugging in the values above we get:

Y= {1 \over 2} X + \sqrt{3 \over 4} Z.

Z is explained in the next paragraph.

3. What margin of error will we accept? No matter what we are predicting, we have to accept that our estimates are imperfect. We hope that on average we are correct, but that just means that all of our over- and under-estimates cancel out. In the above equation, Z represents our errors. For our prediction to be unbiased there has to be zero correlation between X and Z. You might think that is unrealistic and you are probably right, even for our simple example. In fact, you can build a decent good career by pestering other researchers with this question every chance you get. But just go with me for now. The level of incorrect prediction that we are able to accept affects the “confidence interval.” We will ignore confidence intervals in this post, focusing instead on point estimates but recognizing that our predictions are unlikely to be exactly correct.

The Prediction

Now that we have set up our prediction model and nailed down all of our assumptions, we are ready to make a prediction. Let’s predict the height of the son of a man who is 72″ tall. In probability notation, we want


which is the expected son’s height given a father with a height of 72”.

Following the steps above we first need to know how different 72″ is from the average height of fathers.  Looking at the standardizations above, we get

X = {U-69 \over 2}, and

Y = {V - 70 \over 2}, so

\mathbb{E}(V|U=72) = \mathbb{E}(2Y+70|X=1.5) = \mathbb{E}(2({1 \over 2}X + \sqrt{3 \over 4}Z)+70|X=1.5),

which reduces to 1.5 + \mathbb{E}(Z|X=1.5) + 70. As long as we were correct earlier about Z not depending on X and having an average of zero, then we get a predicted son’s height of 71.5 inches, or slightly shorter than his dad, but still above average.

This phenomenon of the outcome (son’s height) being closer to the average than the predictor (father’s height) is known as regression to the mean and it is the source of the term “regression” that is used widely today in statistical analysis. This dates back to one of the earliest large-scale statistical studies by Sir Francis Galton in 1886, entitled, “Regression towards Mediocrity in Hereditary Stature,” (pdf) which fits perfectly with today’s example.

Further reading: If you are already comfortable with the basics of prediction, and know a bit of Ruby or Python, check out Prior Knowledge.

Schelling’s Model of Segregation (Python)

Over the weekend I implemented a version of the agent-based model from Thomas Schelling’s 1971 paper in Python. Schelling’s story about segregation is simple: there are two colors of agents, happiness is based on whether two or more of your neighbors are the same color as you, and you move if you are unhappy.

The model has been studied thoroughly by smarter people than me, and is included as one of the early discussions in Scott Page’s Model Thinking course. My instantiation comes from an exercise in chapter 10 of Allen Downey’s Think Complexity, which is accompanied by the following comments:

You might not be surprised to hear that this model leads to some segregation, but you might be surprised by the degree. Fairly quickly, clusters of similar agents appear. The clusters grow and coalesce over time until there are a small number of large clusters and most agents live in homogeneous neighborhoods.
If you did not know the process and only saw the result, you might assume that the agents were racist, but in fact all of them would be perfectly happy in a mixed neighborhood. Since they prefer not to be greatly outnumbered, they might be considered xenophobic at worst. Of course, these agents are a wild simplification of real people, so it may not be appropriate to apply these descriptions at all.

Here is a screenshot from the running of my model:

Screenshot from my implementation of Schelling’s model in Python

Unhappy agents are represented by X’s, happy ones by O’s. The only substantial departure from the exercise here is that I base happiness on four neighbors (up, down, left, right), rather than eight (including the diagonals) because I misread the instructions the first time. This departure makes the agents less tolerant than in Schelling’s original. The neighborhood shown above is 10×10, but the code should be generalizable. Ninety percent of the houses are filled, which leaves some empty ones for moving around.

My code is here, including some comments indicating planned improvements. For the pointer to Downey’s book, I thank Josh Cutler.