Once I know I won’t need a file again, it’s gone. (Regular back-ups with Time Machine have saved me from my own excessive zeal at least once.) Similar economy applies to runtime: My primary computing device is my laptop, and I’m often too lazy to fire up a cloud instance unless the job would take more than a day.
Working with GDELT data for the last few weeks, I’ve had to be a bit less conservative than usual. Habits are hard to break, though, so I found myself looking for a way to
1. keep all the data on my hard drive, and
2. read it into memory quickly in R and/or Python.
The .zip files you can obtain from the GDELT site accomplish (1) but not (2). An .rda binary helps with part of (2) but has the downside of being a binary file that I might not be able to open at some indeterminate point in the future, violating (1). And a space-hogging CSV that also loads slowly is the worst option of all.
So what satisficing solution did I reach? Saving gzipped files (.gz). Both R and Python can read these files directly (R code shown below; in Python use the standard-library gzip module or the compression option for read_csv in pandas). It’s definitely smaller: the 1979 GDELT historical backfile compresses from 115.3MB to 14.3MB (an eighth of its former size). Reading directly into R from a .gz file has been available since at least version 2.10.
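For the Python side, here is a minimal sketch that uses only the standard library. The file name sample.csv.gz and its three-column rows are made up for illustration (they are not real GDELT records), and the helper names write_sample_gz and read_tsv_gz are my own. The idea is simply that gzip.open in text mode hands you a file object that decompresses as you read.

```python
import csv
import gzip
import os
import tempfile


def write_sample_gz(path):
    """Write a tiny headerless, tab-separated file, gzip-compressed.

    The rows are invented placeholders, not real GDELT data.
    """
    rows = [["1", "19790101", "USA"], ["2", "19790102", "FRA"]]
    with gzip.open(path, "wt", newline="", encoding="utf-8") as fh:
        csv.writer(fh, delimiter="\t").writerows(rows)


def read_tsv_gz(path):
    """Read a gzipped tab-separated file into a list of rows.

    Decompression happens on the fly; no unzipped copy is written to disk.
    """
    with gzip.open(path, "rt", newline="", encoding="utf-8") as fh:
        return list(csv.reader(fh, delimiter="\t"))


# Hypothetical sample file in a temporary directory.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.csv.gz")
write_sample_gz(sample_path)
rows = read_tsv_gz(sample_path)
print(len(rows), rows[0])
```

If you already use pandas, passing compression='gzip' to read_csv (together with sep='\t' and header=None for a file like this one) should do the same job in one call.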
Is it faster? See for yourself:
> system.time(read.csv('1979.csv', sep='\t', header=F, flush=T, as.is=T))
   user  system elapsed
 48.930   1.126  50.918
> system.time(read.csv('1979.csv.gz', sep='\t', header=F, flush=T, as.is=T))
   user  system elapsed
 23.202   0.849  24.064
> system.time(load('gd1979.rda'))
   user  system elapsed
  5.939   0.182   7.577
Compressing and decompressing .gz files is straightforward too. In the OS X Terminal, just type gzip filename or gunzip filename, respectively.
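If you would rather stay inside Python than drop to the Terminal, the same round trip can be sketched with the standard gzip and shutil modules. The helper names gzip_file and gunzip_file and the demo.csv file are hypothetical, chosen just for this example. One difference to keep in mind: the command-line gzip replaces the original file by default, while these helpers leave the source in place.

```python
import gzip
import os
import shutil
import tempfile


def gzip_file(src, dst):
    """Compress src into dst, streaming in chunks so large files fit in memory."""
    with open(src, "rb") as f_in, gzip.open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)


def gunzip_file(src, dst):
    """Decompress the gzipped file src into dst."""
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)


# Build a small, highly repetitive demo file (made-up content).
workdir = tempfile.mkdtemp()
plain = os.path.join(workdir, "demo.csv")
with open(plain, "wb") as fh:
    fh.write(b"a\tb\tc\n" * 10000)

packed = plain + ".gz"
restored = os.path.join(workdir, "demo_restored.csv")

gzip_file(plain, packed)
gunzip_file(packed, restored)

# Print the sizes in bytes before and after compression.
print(os.path.getsize(plain), os.path.getsize(packed))
```

How much space you save depends heavily on the data; the repetitive demo file here compresses far better than a real dataset would.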
Reading the gzipped file takes less than half as long as reading the uncompressed version. It’s still nowhere near as fast as loading the rda binary, but I don’t have to worry about file readability for many years to come given the popularity of *nix operating systems. Consider using .gz files for compact storage and quick loading in R and Python.