Automatically Set Up New R and LaTeX Projects

You have a finite number of keystrokes in your life. Automating repetitive steps is one of the great benefits of knowing a bit about coding, even if the code is just a simple shell script. That’s why I set up an example project on GitHub using a Makefile and a sample article (inspired by Rob Hyndman). This post explains the structure of that project and how to modify it for your own purposes.

Running the Makefile with the example article in an otherwise-empty project directory will create:

  • a setup.R file for clearing the workspace, setting paths, and loading libraries
  • a data directory for storing files (csv, rda, etc)
  • a drafts directory for LaTeX, including a generic starter article
  • a graphics directory for storing plots and figures to include in your article
  • an rcode directory for your R scripts

It also supplies some starter code for the setup.R file in the main directory and a start.R file in the rcode directory. setup.R detects the current user and sets relative paths to the project directories as plain R variables. For example, after running setup.R in your analysis file you can switch to the data directory with setwd(pathData), then create a plot and save it after running setwd(pathGraphics). Because of the way the setup.R file works, multiple users can work on the same project without changing any of the other scripts.
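To make that concrete, here is a minimal sketch of what the generated setup.R might contain; the user names and paths below are hypothetical, and the real file is simply whatever the Makefile’s echo lines write out:

# setup.R: clear the workspace, set per-user paths, and load libraries
rm(list = ls())

user <- Sys.info()[["user"]]

if (user == "yourname") {
  pathMain <- "~/projects/example-article"
} else if (user == "coauthor") {
  pathMain <- "~/research/example-article"
} else {
  stop("Add your user name and project path to setup.R")
}

pathData     <- file.path(pathMain, "data")
pathDrafts   <- file.path(pathMain, "drafts")
pathGraphics <- file.path(pathMain, "graphics")
pathRcode    <- file.path(pathMain, "rcode")

library(foreign)  # load whatever libraries the project needs

Since the analysis scripts only ever refer to pathData, pathGraphics, and so on, a new collaborator just adds a branch for their own user name and everything else runs unchanged.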

If you want to change this structure, there are two main ways to do it. You can add more (or fewer) directories by modifying the mkdir lines in the Makefile. You can add (or remove) text in the R files by changing the echo lines.

If you decide to make major customizations to this file or already have your own structure for projects like this, leave a link in the comments for other readers.

H/T to Josh Cutler for encouraging me to write a post about my project structure and automation efforts. As a final note, this is the second New Year’s Eve in a row that I have posted a tech tip. Maybe it will become a YSPR tradition!

Update: If you’re interested in your own automated project setup, check out ProjectTemplate by John Myles White. Thanks to Trey Causey on Twitter and Zachary Jones in the comments for sharing this.

Merging Arthur Banks’ Time Series with COW

Recently I needed to combine data from two of the most widely used (at least in my subfield) cross-national time-series data sets: Arthur Banks’ time series and the Correlates of War Project (COW). Given how often these data sets are used, I was a bit surprised that I could not find a record of someone else combining them. The closest attempt I could find was Andreas Beger’s country names to COW codes do file.

Beger’s file has all of the country names in lower-case, so I used Ruby’s upcase method to change that. That change took care of roughly 72 percent of the observations (10,396 of 14,523). Next, I had to deal with the fact that a number of the countries in Arthur Banks’ data no longer exist (they have names like Campuchea, Ceylon, and Ciskei; see here and here). I handled those in the main file. After that, the combined data were ready to use in Stata.
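The actual scripts are in Ruby and Stata, but the matching step is simple enough to sketch in R; the file and variable names below are made up for illustration:

# Upcase the country names in both files, then attach COW codes by merging on the cleaned name
banks <- read.csv("banks_cnts.csv", stringsAsFactors = FALSE)                   # hypothetical file name
namekey <- read.csv("country_names_to_cowcodes.csv", stringsAsFactors = FALSE)  # hypothetical name-to-code key

banks$country <- toupper(banks$country)
namekey$country <- toupper(namekey$country)

merged <- merge(banks, namekey, by = "country", all.x = TRUE)

# Defunct countries (Campuchea, Ceylon, Ciskei, ...) come back with missing
# COW codes and have to be recoded by hand
table(is.na(merged$ccode))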

I am not going to put the full combined data up because the people in control of Arthur Banks’ time series are really possessive. But if you already have both data sets, combining them should be much easier using these scripts.

Getting Started with Prediction

From historians to financial analysts, researchers of all stripes are interested in prediction. Prediction asks the question, “given what I know so far, what do I expect will come next?” In the current political season, presidential election forecasts abound; this line of work dates back to Ray Fair, whose book is ridiculously cheap on Amazon. In today’s post, I will give an example of a much more basic–and hopefully more relatable–question: given the height of a father, how do we predict the height of his son?

To see how common predictions about children’s traits are, just Google “predict child appearance” and you will be treated to a plethora of websites and iPhone apps with photo uploads. Today’s example is more basic and will follow three questions that we should ask ourselves when making any prediction:

1. How different is the predictor from its baseline?
It’s not enough to have a single bit of information from which to predict–we also need to know something about the baseline of the quantity we are interested in (often its average value) and how far our particular predictor is from that baseline. The predictor in this case is the height of the father, which we will call U; the outcome is the height of the son, which we will call V.

To keep this example simple, let us assume that U and V are normally distributed–in other words, their distributions look like the familiar “bell curve” when they are plotted. To see how different our given observations of U or V are from their baseline, we “standardize” them into X and Y

X = {{u - \mu_u} \over \sigma_u }

Y = {{v - \mu_v} \over \sigma_v },

where \mu is the mean and \sigma is the standard deviation. In our example, let \mu_u = 69, \mu_v=70, and \sigma_v = \sigma_u = 2.
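In R, standardization is just subtraction and division; here is a quick sketch using the numbers above (the helper function is my own, not part of the original post):

# Standardize a raw height given its mean and standard deviation
standardize <- function(height, mu, sigma) (height - mu) / sigma

# A father who is 67 inches tall is one standard deviation below average
standardize(67, mu = 69, sigma = 2)   # -1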

2. How much variance in the outcome does the predictor explain?
In a simple one-predictor, one-outcome (“bivariate”) example like this, we can answer question #2 by knowing the correlation between X and Y, which we will call \rho (and which is equal to the correlation between U and V in this case). For simplicity’s sake, let’s assume \rho={1 \over 2}. In real life we would probably estimate \rho using regression, which is really just the reverse of predicting. We should also keep in mind that correlation is only useful for describing the linear relationship between X and Y, but that’s not something to worry about in this example. Using \rho, we can set up the following prediction model for Y:

Y= \rho X + \sqrt{1-\rho^2} Z.

Plugging in the values above we get:

Y= {1 \over 2} X + \sqrt{3 \over 4} Z.

Z is explained in the next paragraph.

3. What margin of error will we accept?
No matter what we are predicting, we have to accept that our estimates are imperfect. We hope that on average we are correct, but that just means that all of our over- and under-estimates cancel out. In the above equation, Z represents our errors. For our prediction to be unbiased there has to be zero correlation between X and Z. You might think that is unrealistic and you are probably right, even for our simple example. In fact, you can build a good career by pestering other researchers with this question every chance you get. But just go with me for now. The level of incorrect prediction that we are able to accept affects the “confidence interval.” We will ignore confidence intervals in this post, focusing instead on point estimates but recognizing that our predictions are unlikely to be exactly correct.
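If you want to convince yourself that this setup behaves as advertised, a quick simulation helps; this is purely illustrative and not part of the original derivation:

# Simulate the model Y = 0.5*X + sqrt(0.75)*Z with X and Z independent
set.seed(42)
n <- 100000
x <- rnorm(n)                   # standardized predictor
z <- rnorm(n)                   # errors, independent of x
y <- 0.5 * x + sqrt(0.75) * z   # the prediction model from above

cor(x, y)           # close to the assumed rho = 0.5
mean(y - 0.5 * x)   # over- and under-estimates roughly cancel out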

The Prediction

Now that we have set up our prediction model and nailed down all of our assumptions, we are ready to make a prediction. Let’s predict the height of the son of a man who is 72″ tall. In probability notation, we want

\mathbb{E}(V|U=72),

which is the expected son’s height given a father with a height of 72”.

Following the steps above, we first need to know how different 72″ is from the average height of fathers. Looking at the standardizations above, we get

X = {U-69 \over 2}, and

Y = {V - 70 \over 2}, so

\mathbb{E}(V|U=72) = \mathbb{E}(2Y+70|X=1.5) = \mathbb{E}(2({1 \over 2}X + \sqrt{3 \over 4}Z)+70|X=1.5),

which reduces to 1.5 + \sqrt{3}\,\mathbb{E}(Z|X=1.5) + 70. As long as we were correct earlier about Z not depending on X and having an average of zero, we get a predicted son’s height of 71.5 inches, or slightly shorter than his dad but still above average.
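The arithmetic is easy to check in a few lines of R:

# Predicted height of the son of a 72-inch father
mu_u <- 69; mu_v <- 70; sigma <- 2; rho <- 0.5

x <- (72 - mu_u) / sigma       # standardized father's height: 1.5
y_hat <- rho * x               # E(Y | X = 1.5), since E(Z | X) = 0 by assumption
v_hat <- sigma * y_hat + mu_v  # back to inches: 71.5
v_hat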

This phenomenon of the outcome (son’s height) being closer to the average than the predictor (father’s height) is known as regression to the mean, and it is the source of the term “regression” that is used widely today in statistical analysis. The term dates back to one of the earliest large-scale statistical studies, Sir Francis Galton’s 1886 paper “Regression towards Mediocrity in Hereditary Stature” (pdf), which fits perfectly with today’s example.

Further reading: If you are already comfortable with the basics of prediction, and know a bit of Ruby or Python, check out Prior Knowledge.

Wednesday Nerd Fun: Games (and More) in Stata

Stata is a software program for running statistical analysis, as readers who have been to grad school in the social sciences in the last couple of decades will know. Compared to R, Stata is like an old TI-83 calculator, but it remains popular with those who spent the best years of their lives typing commands into its green-on-black interface. I recently discovered that Stata shares one important feature with the TI-83 calculator: the ability to play games. (For TI-83 games, see here and here.)

Eric Booth of Texas A&M shares this implementation of Blackjack in Stata:

The game is played by typing -blackjack- into the command window; the game then prompts the user for the amount she wants to bet (the default is $500, which replenishes after you lose it all or exit Stata) and whether to hit or stay. It doesn’t accurately represent all the rules and scenarios of a real game of blackjack (e.g., no doubling down), so don’t use it to prep for your run at taking down a Vegas casino.

Booth’s blog provides other fun, unconventional uses of Stata as well. There’s a script that lets you Google from the Stata interface, one that lets you control iTunes, and even one for running commands from an iPhone.

This post is probably less “general interest” than most of the nerd fun posts, but I hope you enjoyed it.

Ethical Programming

xkcd on good code

Don’t let anyone tell you that as a programmer you don’t have to make moral or ethical decisions. Every time you decide that making users feel stupid is better than fixing your code, you’re making an ethical decision. – Joel Spolsky

This may belong in a separate post, but while we’re on the subject of programming, I am trying to come up with a project that will help researchers/bloggers/writers/academics save time/money/frustration. The requirements are that it be doable within the next month or so (at least a minimally viable version) while I work on other projects, and that it have a substantial Python component with maybe some R, SQL, or Django thrown in. Requests or suggestions in the comments or by email are welcome.

[quote via @newsyc20]

PyCon 2012 Video Round-Up

The videos from PyCon 2012 are posted. Here are the ones I plan to watch, along with their summaries:

Checking Mathematical Proofs Written in TeX

ProofCheck is a set of Python scripts which parse and check mathematics written using TeX. Its homepage is http://www.proofcheck.org. Unlike computer proof assistants which require immersion in the equivalent of a programming language, ProofCheck attempts to handle mathematical language formalized according to the author’s preferences as much as possible.

Sketching a Better Product

If writing is a means for organizing your thoughts, then sketching is a means for organizing your thoughts visually. Just as good writing requires drafts, good design requires sketches: low-investment, low-resolution braindumps. Learn how to use ugly sketching to iterate your way to a better product.

Bayesian Statistics Made (as) Simple (as Possible)

This tutorial is an introduction to Bayesian statistics using Python. My goal is to help participants understand the concepts and solve real problems. We will use material from my (nb: Allen Downey’s) book, Think Stats: Probability and Statistics for Programmers (O’Reilly Media).

SQL for Python Developers

Relational databases are often the bread-and-butter of large-scale data storage, yet they are often poorly understood by Python programmers. Organizations even split programmers into SQL and front-end teams, each of which jealously guards its turf. These tutorials will take what you already know about Python programming, and advance into a new realm: SQL programming and database design.

Web scraping: Reliably and efficiently pull data from pages that don’t expect it

Exciting information is trapped in web pages and behind HTML forms. In this tutorial, you’ll learn how to parse those pages and when to apply advanced techniques that make scraping faster and more stable. We’ll cover parallel downloading with Twisted, gevent, and others; analyzing sites behind SSL; driving JavaScript-y sites with Selenium; and evading common anti-scraping techniques.

Some of it may be above my head at this stage, but I think it’s great that the Python community makes all of these resources available.

Meta-Blogging, Pt. 2: Weekly Trend in Tweets, Likes, and Comments

This post begins to describe the blog data collected (separately) by Anton Strezhnev and myself. One of the first things I did was convert the date variable to R’s Date class so that I could do some exploration.

library(foreign)  # read.dta() for importing Stata files

# setwd('get your own')
monkey1 <- read.dta('finalMonkeyCageData.dta')

# Convert the date strings to Date objects, then extract the day of the week
# ("%w" gives 0 = Sunday through 6 = Saturday)
monkey1$newdate <- as.Date(monkey1$date, "%m/%d/%Y")
monkey1$weekdaynum <- format(monkey1$newdate, "%w")

day_abbr_list <- c("Sun","Mon","Tue","Wed","Thu","Fri","Sat")

# Stack three boxplots (tweets, likes, comments) by day of the week
par(mfrow=c(3,1))

boxplot(monkey1$tweets ~ monkey1$weekdaynum, xaxt='n', xlab='', ylab="Tweets", col='blue')
axis(1, labels=day_abbr_list, at=1:7)

boxplot(monkey1$likes ~ monkey1$weekdaynum, xaxt='n', xlab='', ylab="Likes", col='red')
axis(1, labels=day_abbr_list, at=1:7)

boxplot(monkey1$comments ~ monkey1$weekdaynum, xaxt='n', xlab='', ylab="Comments", col='green')
axis(1, labels=day_abbr_list, at=1:7)

The result was this plot:

Monkey Cage Activity by Weekday

For tweets and likes it looks like earlier in the week (Sunday, Monday) is better, while comments get an additional bump on Saturday and Wednesday. In the next couple of posts we’ll look at how these three activities are correlated with page views, and how comments are distributed on the other blogs I scraped.