Managing Memory and Load Times in R and Python

Once I know I won’t need a file again, it’s gone. (Regular back-ups with Time Machine have saved me from my own excessive zeal at least once.) Similar economy applies to runtime: My primary computing device is my laptop, and I’m often too lazy to fire up a cloud instance unless the job would take more than a day.

Working with GDELT data for the last few weeks I’ve had to be a bit less conservative than usual. Habits are hard to break, though, so I found myself looking for a way to

  1. keep all the data on my hard-drive, and
  2. read it into memory quickly in R and/or Python.

The .zip files you can obtain from the GDELT site accomplish (1) but not (2). A .rda binary helps with part of (2) but has the downside of being a binary file that I might not be able to open at some indeterminate point in the future–violating (1). And a memory-hogging CSV that also loads slowly is the worst option of all.

So what satisficing solution did I reach? Saving gzipped files (.gz). Both R and Python can read these files directly (R code shown below; in Python use the gzip module or the compression option for read_csv in pandas). It’s definitely smaller–the 1979 GDELT historical backfile compresses from 115.3MB to 14.3MB (about an eighth of its former size). Reading directly into R from a .gz file has been available since at least version 2.10.
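
For the Python side, here is a minimal sketch using only the standard library. The helper name read_gzipped_tsv and its max_rows argument are my own inventions for illustration, and I am assuming the file is tab-delimited UTF-8 text with no quoting, like the GDELT backfiles used here.

```python
import csv
import gzip


def read_gzipped_tsv(path, max_rows=None):
    """Read a gzipped, tab-delimited file into a list of rows (lists of strings).

    Hypothetical helper for illustration; assumes UTF-8 text and no quoting.
    """
    rows = []
    # Mode "rt" opens the compressed file as text, decompressing on the fly,
    # so no uncompressed copy is ever written to disk.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for i, row in enumerate(reader):
            if max_rows is not None and i >= max_rows:
                break
            rows.append(row)
    return rows
```

Calling read_gzipped_tsv('1979.csv.gz', max_rows=5) would return the first five rows as lists of strings. If pandas is installed, pandas.read_csv('1979.csv.gz', sep='\t', header=None, compression='gzip') should do the same job in one line and give you a data frame instead.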

Is it faster? See for yourself:

> system.time(read.csv('1979.csv', sep='\t', header=F, flush=T))
user system elapsed
48.930 1.126 50.918
> system.time(read.csv('1979.csv.gz', sep='\t', header=F, flush=T))
user system elapsed
23.202 0.849 24.064
> system.time(load('gd1979.rda'))
user system elapsed
5.939 0.182 7.577

Compressing and decompressing .gz files is straightforward too. In the OS X Terminal, just type gzip filename or gunzip filename, respectively.
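
If you would rather stay inside Python than switch to the Terminal, the same two operations can be sketched with the standard library. The function names gzip_file and gunzip_file are hypothetical. Unlike the command-line tools' default behavior, these versions leave the input file in place.

```python
import gzip
import shutil


def gzip_file(src_path, dest_path=None):
    """Write a gzipped copy of src_path and return the new path (original is kept)."""
    if dest_path is None:
        dest_path = src_path + ".gz"
    with open(src_path, "rb") as src, gzip.open(dest_path, "wb") as dest:
        # copyfileobj streams in chunks, so the whole file never sits in memory
        shutil.copyfileobj(src, dest)
    return dest_path


def gunzip_file(src_path, dest_path=None):
    """Write a decompressed copy of src_path and return the new path (.gz is kept)."""
    if dest_path is None:
        # Drop a trailing ".gz" if there is one; otherwise append ".out"
        dest_path = src_path[:-3] if src_path.endswith(".gz") else src_path + ".out"
    with gzip.open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest)
    return dest_path
```

Because both functions stream the data, they should cope with files larger than available memory, which is the whole point of the exercise.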

Reading the gzipped file takes less than half as long as the unzipped version. It’s still nowhere near as fast as loading the rda binary, but I don’t have to worry about file readability for many years to come given the popularity of *nix operating systems. Consider using .gz files for easy memory management and quick loading in R and Python.

Statistics as Principled Argument

That’s the title of a book I recently came across by the late Robert P. Abelson. The thesis of the book is that statistics is a tool for organizing an argument. Abelson’s focus is his own discipline of psychology but many of his points apply to social science more broadly.

Throughout the book Abelson accumulates a list of his “laws”:

  1. Chance is lumpy.
  2. Overconfidence abhors uncertainty.
  3. Never flout a convention just once.
  4. Don’t talk Greek if you don’t know the English translation.
  5. If you have nothing to say, don’t say anything.
  6. There is no free hunch.
  7. You can’t see the dust if you don’t move the couch.
  8. Criticism is the mother of methodology.

My main gripe with the book is how much of it hinges on frequentist hypothesis testing. For example, I don’t consider the difference between a p-value of .05 and one of .07 to be a “principled argument.” Abelson does give some attention to Bayesian methods, but a book developing the idea of statistics as rhetoric from a Bayesian point of view would be more coherent. Perhaps we will see something along these lines from Andrew Gelman’s work on ethical statistics.

Dissertations as Essays Rather than Treatises

The essay format is increasingly popular in economics, according to a new paper by Wendy Stock and John Siegfried in the American Economic Review (gated, ungated). They find that “most of the evidence suggests that essay-style dissertations enhance economists’ early career research productivity.”

Here are some other trends they identify:

  • Economics dissertations in the form of essays rose from 0.3 percent of the total in 1970 to 69 percent in 2010
  • Economists who take an academic position are more likely to have written a dissertation consisting of essays (it would be interesting to see this conditional probability reversed)
  • Students in higher-ranking programs, in the micro-economics subfield, and from outside of the US adopted this strategy earlier than others

I am grateful that my own department permits the multiple-essay format. Although I have not submitted a dissertation prospectus yet, I anticipate that I will go this route myself.

[via Organizations and Markets]

JavaScript Politics

In a recent conversation on Twitter, Christopher Zorn said that Stata is fascism, R is anarchism, and SAS is masochism. While only one of these is plausibly a programming language, it’s an interesting political analogy. We’ve discussed the politics of the Ruby language before.

Today I wanted to share a speaker deck by Angus Croll on the politics of JavaScript. He describes periods of anarchy (1995-2004), revolution (2004-2006), and coming of age (2007-2010). We’re currently in “the itch” (2011-2013). There are a number of other political dimensions in the slides as well.


If anyone knows of a video of the presentation, I’d love to see it. Croll also wrote an entertaining article with JavaScript code in the style of famous authors like Hemingway, Dickens, and Shakespeare.

Micro-Institutions Everywhere: Virus Naming

Giant stuffed microbes make the lethal loveable

The alphabet soup of naming new viruses rivals Pentagonese. AIDS. SARS. MRSA. Where do these names come from? One major source of influence in this area is the International Committee on Taxonomy of Viruses (ICTV).

Their latest innovation is MERS, referring to a new form of coronavirus that was first reported in September, 2012. In the meantime the virus has gone by the various abbreviations hCov-EMC, HCOV, NCoV, and nCoV (the last two referring to a “novel coronavirus”).

Coming up with a good name is tricky. It should be descriptive and memorable, but naming a virus after a geographic area has major downsides:

Historically, many infectious disease agents—or the diseases themselves—have been named after the place where they were first found. But increasingly, scientists and public health officials have shied away from that system to avoid stigmatizing a particular country or city. When a serious new type of pneumonia started spreading from Asia in 2003, officials at WHO coined the term severe acute respiratory syndrome (SARS) to prevent the disease from being named “Chinese flu” or something similar. (As it happened, the name ruffled feathers in Hong Kong anyway, because the city’s official name is Hong Kong SAR, for special administrative region—a fact that WHO had overlooked.)…

The new name is only a recommendation—one which the study group hopes will be adopted widely but which it has no power to enforce, Gorbalenya says. That’s because ICTV has the authority only to classify and name entire virus species

For more, check out this post from Science.

More on Food Truck Regulation

Popular Durham-area food truck Chirba Chirba serves dumplings. Photo via livewell.

More on the plight of food truck operators in NYC, from the Times:

There are numerous (and sometimes conflicting) regulations required by the departments of Health, Sanitation, Transportation and Consumer Affairs. These rules are enforced, with varying consistency, by the New York Police Department. As a result, according to City Councilman Dan Garodnick, it’s nearly impossible (even if you fill out the right paperwork) to operate a truck without breaking some law. Trucks can’t sell food if they’re parked in a metered space . . . or if they’re within 200 feet of a school . . . or within 500 feet of a public market . . . and so on.

Enforcement is erratic. Trucks in Chelsea are rarely bothered, Nafziger said. In Midtown South, where I work and can attest to the desperate need for more lunch options, the N.Y.P.D. has a dedicated team of vendor-busting cops. “One month, we get no tickets,” Thomas DeGeest, the founder of Wafels & Dinges, a popular mobile-food business that sells waffles and things, told me. “The next month, we get tickets every day.” DeGeest had two trucks and five carts when he decided he couldn’t keep investing in a business that was so vulnerable to overzealous cops or city bureaucracy. Instead, DeGeest reluctantly decided to open a regular old stationary restaurant.

We’ve discussed food truck regulations and the competition between vendors before. There is certainly a place for regulation, but inconsistent and seemingly arbitrary enforcement undermines the goal of clarifying expectations between all parties.

Net Neutrality: Why You Should Care

Image via TheNextWeb

What is net neutrality? It’s the idea that Internet service providers (ISPs) should treat all traffic equally, not giving preferential treatment to certain users, types of data, or equipment. With FCC Chairman Julius Genachowski on the way out, nominee Tom Wheeler may not be able to avoid this fight if he succeeds Genachowski.

Here’s Tim Wu of the New Yorker on the essence of the issue:

An important aspect of the Internet’s original design is that many prices were set at zero—what have been called zero-price rules. The price to join the network is zero. The price that users and sites pay to reach others is zero: a blogger doesn’t need to pay to reach Comcast’s customers. And the price that big Web sites charge broadband operators to carry their content is also zero. It’s a subtle point, but these three zeros are a large part of what makes the Internet what it is. If net neutrality goes away, so does the agreement to freeze prices at zero….

Admittedly, it is hard to know exactly how things would work out if the zero-price rules are abandoned. Cable still has serious market power, and might, on balance, be able to charge more than it gets charged. But if you’re a cable operator, why take that bet when you’re already sitting on giant profit margins? Why risk the best business going? Beyond cable operators, a battle royale over Internet programming and termination fees would ultimately be terrible for consumers; the Internet would start to get both worse and more expensive.

Think of it this way: net neutrality, which sets all these prices at zero, is effectively a grand truce between the big app firms and the infrastructure providers. It eliminates an unnecessary middleman: consumers deal directly with content vendors and app firms. That’s a much healthier market dynamic than one driven by hidden, passed-on costs. If cable TV isn’t a good enough example, consider the dysfunction of the health-care industry, where consumers never see what they are paying for. That’s what the present rule avoids.

YSPR will continue to monitor this issue and provide updates here.

Great Gatsby, Copyright, and the Public Domain

Is The Great Gatsby in the public domain? The book was published in 1925 and Fitzgerald passed away in 1940. Copyright generally expires 70 years after the author’s death, so you could be forgiven for thinking the answer is “yes.”

If you live in Australia, Canada, or another jurisdiction outside the US, you can already get the book through sites like Project Gutenberg Australia. US residents should not click that link–had SOPA been passed, this site could have been censored for even providing the link. In these United States, however, Gatsby is still not in the public domain.

Here’s Duke’s Kevin Smith (who we’ve talked to before) on the convoluted reasoning behind this:

Let’s look for a minute at F. Scott. Because he died in December of 1940, his unpublished works do enter the public domain in the United States as of 1/1/11. His published works, however, are another story. If a Fitzgerald work was published between 1920 and 1922, as This Side of Paradise was, for example, it is in the public domain. But any works published in 1923 or later, such as The Great Gatsby, are still protected. After 1922 (and prior to 1963), a work that was published with copyright notice and the copyright in which was renewed is given a term of 95 years from publication (the initial 28 year term plus a renewal term, after the Sonny Bono Copyright Term Extension Act, of 67 years). Thus published works from this time period are protected until at least 2019; 1923 plus 95 years equals 2018, so works published that year will rise into the public domain on 1/1/2019. The author’s date of death does not make any difference for these works.

This distinction seems designed to confuse librarians and other users of works.  An archive of Fitzgerald manuscripts, for example, could digitize and make available those items that were never published, or that were published earlier in F. Scott’s career (like Tales of the Jazz Age).  But a manuscript of Gatsby or Tender is the Night is still subject to protection.
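
To make Smith’s arithmetic concrete, here is a small Python sketch. The function name is my own, and it covers only the case he describes: a work published after 1922 and before 1963, with copyright notice and a renewed copyright, under the 95-year term. Anything outside that window is deliberately rejected rather than guessed at.

```python
def us_public_domain_year(publication_year, term_years=95):
    """Year on whose January 1 a work enters the US public domain.

    Hypothetical helper. Assumes the work was published with copyright
    notice, the copyright was renewed, and the 95-year term applies.
    """
    if not 1923 <= publication_year <= 1962:
        raise ValueError("only covers works published from 1923 through 1962")
    # Protection lasts through the end of (publication_year + term_years),
    # so the work becomes free on January 1 of the following year.
    return publication_year + term_years + 1
```

For a 1923 work this returns 2019, matching Smith’s example. By the same rule Gatsby (1925) comes out as 2021, and the author’s date of death never enters the calculation.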

The EFF had a nice explainer on this topic recently as well. Copyright restrictions aren’t just tougher in the US, they’re also subject to the whims of Congress. Congressional action can remove books from the public domain even after they’re put there by law, thanks to this Supreme Court decision.

How does this regulation affect the availability of books? Rebecca Rosen of The Atlantic called it the “missing 20th century” based on Paul Heald’s study, “Do Bad Things Happen When Works Fall Into the Public Domain?” Here’s a chart of books available from Amazon by decade of publication:

[Chart: books available from Amazon by decade of publication]

Continuing to extend copyright protection every time Mickey Mouse gets close to being put in the public domain helps Disney, but it does not help the spread of knowledge. Don’t get me started on Hollywood, though–I’m off to see the movie.

Internet Sales Tax FAQ

We’ve got a week of Internet politics-related topics queued up for you. Today we’ll take a look at the prospect of an Internet sales tax. Later in the week we’ll discuss why The Great Gatsby still isn’t in the public domain, and then take an overview of the net neutrality debate. The FAQs below are a summary of this explainer from CNN.

What’s the current state of sales tax law? 

In the US Supreme Court’s last major decision on the issue (Quill Corp. v. North Dakota), it ruled that a retailer must have a physical presence in a state in order to be required to collect sales taxes in that state. Technically you are required to pay a use tax by your state if you order online from another state–just as you would be required to do so when purchasing physical goods outside your home state. But who actually does that? Virtually no one.

How much revenue would an online sales tax bring in?

The National Conference of State Legislatures estimated that states could gain $23 billion from sales taxes on internet commerce.

What’s going to change, and when? 

Last week the Senate voted 69-27 in favor of the so-called Marketplace Fairness Act. It now has to pass the House, where it will likely face more resistance. The Obama administration supports the bill, so if it passes the House it will become law. Even if passed, the changes will go into effect no earlier than October 1, 2013. If you have any major online purchases in mind you may want to make them before then–another stimulus of sorts.

Micro-Institutions Everywhere: Gypsy Law

Cartoon gypsy Esmeralda in Disney's "The Hunchback of Notre Dame"

Forthcoming from Peter Leeson (who previously brought us an analysis of pirate democracy), a new paper on self-governance among Gypsies (via Mike Munger):

Gypsies are nomads. They’re often separated from one another, which precludes direct monitoring. Further, Gypsies’ locations are changing continuously. In the past Gypsies arranged debris on roadsides and configured bits of torn cloth in nearby tree branches to communicate messages to passing fellow Roms (Yoors 1967: 126). Still, “As most of these Roms” were “constantly travelling about, the problem of communication with one another [was] a serious one” (Brown 1929: 158). Nomadism rendered direct monitoring impossible for all but a few and made society-wide communication very expensive for Gypsies. (pp. 12-13)

Gypsies’ inability to rely on government for many of their most important relationships means not only that they must enforce social rules regulating such relationships privately. More fundamentally still, they must create those rules in the first place. Romaniya superstition achieves this by folding worldly crimes—traditional antisocial behaviors, such as theft and contractual breach—into its “spiritual” crimes, such as using the wrong bar of soap to clean one’s head. Thus the “unbending notion of purity (and impurity) which governs most [of Gypsies’] behaviour” described above has two meanings: one “spiritual” and the other very much of this world (Liégeois 1986: 84). (pp. 15-16)