# Eliminate File Redundancy with Ruby

Say you have a file with many repeated, unnecessary lines that you want to remove. For safety’s sake, you would rather make an abbreviated copy of the file than replace it. Ruby makes this a cinch. You just iterate over the file, putting every line the computer has already “seen” into a hash. If a line is not in the hash, it must be new, so write it to the output file. Here’s the code, designed with .tex files in mind but easily adaptable:

```ruby
puts 'Filename?'
filename = gets.chomp
input = File.open(filename + '.tex')
output = File.open(filename + '2.tex', 'w')
seen = {}
input.each do |line|
  if (seen[line])
    # already written once; skip it
  else
    output.write(line)
    seen[line] = true
  end
end
input.close
output.close
```

Where would this come in handy? Well, the .tex extension probably already gave you a clue that I am reducing redundancy in a $\LaTeX$ file. In particular, I have an R plot generated as a tikz graphic. The R plot includes a rug at the bottom (tick marks indicating data observations), but the data set includes over 9,000 observations, so many of the lines are drawn right on top of each other. The $\LaTeX$ compiler got peeved at having to draw so many lines, so Ruby helped it out by eliminating the redundancy. One special tweak for using the script above to modify tikz graphics files is to change the line

`if (seen[line])`

to

`if (seen[line]) && !(line.include? 'node') && !(line.include? 'scope') && !(line.include? 'path') && !(line.include? 'define')`

if your plot has multiple panes (e.g. `par(mfrow=c(1,2))` in R) so that Ruby won’t ignore seemingly redundant lines that are actually specifying new panes. The modified line is a little long and messy, but it works, and that was the main goal here. The resulting $\LaTeX$ file compiles easily and more quickly than it did with all those redundant lines, thanks to Ruby.
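For reference, the same keep-the-structural-lines logic can be factored into a small helper, with the keyword list taken from the tweak above (a sketch; the method and constant names are my own):

```ruby
# Drop repeated lines, but always keep lines containing tikz
# structural keywords, which may legitimately repeat across panes.
KEYWORDS = %w[node scope path define]

def dedup_lines(lines)
  seen = {}
  lines.select do |line|
    structural = KEYWORDS.any? { |kw| line.include?(kw) }
    first_time = !seen[line]
    seen[line] = true
    first_time || structural
  end
end
```

You could then write `dedup_lines(File.readlines(filename + '.tex'))` to get the filtered lines back as an array instead of streaming them to a second file.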

# Meta-Blogging, Pt. 2: Weekly Trend in Tweets, Likes, and Comments

This post begins to describe the blog data collected (separately) by Anton Strezhnev and myself. One of the first things I did was to convert the date variable to R’s Date class so that I could do some exploration.

```r
library(foreign)

monkey1$newdate <- as.Date(monkey1$date, "%m/%d/%Y")
monkey1$weekdaynum <- format(monkey1$newdate, "%w")  # 0 = Sunday, ..., 6 = Saturday

day_abbr_list <- c("Sun","Mon","Tue","Wed","Thu","Fri","Sat")

par(mfrow=c(3,1))

boxplot(monkey1$tweets ~ monkey1$weekdaynum, xaxt='n', xlab='', ylab="Tweets", col='blue')
axis(1, labels=day_abbr_list, at=c(1,2,3,4,5,6,7))

boxplot(monkey1$likes ~ monkey1$weekdaynum, xaxt='n', xlab='', ylab="Likes", col='red')
axis(1, labels=day_abbr_list, at=c(1,2,3,4,5,6,7))

boxplot(monkey1$comments ~ monkey1$weekdaynum, xaxt='n', xlab='', ylab="Comments")
axis(1, labels=day_abbr_list, at=c(1,2,3,4,5,6,7))
```
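As an aside, the same date-to-weekday conversion is a one-liner in Ruby (the language from the previous post); `Date#wday` returns the same 0-through-6 numbering as R’s `"%w"` (a sketch with a made-up example date):

```ruby
require 'date'

# Parse an m/d/Y date string (the date here is a made-up example)
# and get its weekday number: 0 = Sunday through 6 = Saturday,
# matching R's format(..., "%w").
d = Date.strptime('03/14/2011', '%m/%d/%Y')
puts d.wday  # => 1 (a Monday)
```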

Monkey Cage Activity by Weekday

For tweets and likes it looks like earlier in the week (Sunday, Monday) is better, while comments get an additional bump on Saturday and Wednesday. In the next couple of posts we’ll look at how these three activities are correlated with page views, and how comments are distributed on the other blogs I scraped.

# Finding a Series of Confidence Intervals in R

[Note: While many of my posts appeal only to readers with certain interests (specifically, mine), this one is meant to provide a public good in the form of an R script that can be run to find multiple confidence intervals around the same sample value. This and other methodological posts in the future may not appeal to the general reader, so they are posted here under the category “Technical.” Read only the “Uncategorized” posts if you prefer my random miscellany. Comments from readers of all methodological traditions and experience levels are invited.]

Purpose: Find a series of lower and higher bounds for the confidence interval around a sample statistic.

Script:

```r
###################################################
# Computing a Series of Confidence Intervals in R
# Matt Dickenson
# yspr dot wordpress dot com
# Released under a Creative Commons Licence
###################################################

# INSTRUCTIONS
# First, you will need your desired confidence levels, sample statistic,
# and standard error of the sample statistic.
# Compute those, and then run this script.
CONFIDENCE <- function(x, y, se){
  intervals <- matrix(NA, nrow=length(x), ncol=2)
  levels <- matrix(rep(x,2), nrow=length(x), ncol=2)

  for (i in 1:(length(x)*2)){
    if (i %% 2 != 0){
      # odd i: lower bound (zi is negative, so y + zi*se < y)
      zi <- qnorm((1-(levels[(floor(i/2)+1),1]))/2)
      low <- y+(zi*se)
      intervals[floor(i/2)+1,1] <- low
    }
    else{
      # even i: upper bound
      zi <- qnorm((1-(levels[(i/2),2]))/2)
      high <- y-(zi*se)
      intervals[(i/2),2] <- high
    }
  }
  row.names(intervals) <- x
  colnames(intervals) <- c("Lower Bound", "Higher Bound")
  intervals
}
# Now, run the command as
# CONFIDENCE(x, y, se)
# CONDITIONS:
# where x is the vector containing your desired confidence levels (0 < x[i] < 1 for all i)
# and y is your sample statistic and se is the standard error of your sample statistic.
# Note that the actual variable names can be anything you want,
# as long as they are entered in this order.
```

Notice that in less than twenty actual lines of code we have developed a function that can find any number of normal-distribution confidence intervals. It is also flexible: with just a bit of tweaking you could change this to another distribution, like Student’s t, binomial, or Poisson. Here is an example of one use for this function:

```r
# EXAMPLE
# Finding the proportion of mortgage holders who had subprime mortgages
ptrue <- mean(subprime$high.rate, na.rm=T)

# Drawing a simple random sample of the population
set.seed(126)
phat <- mean(subprime.sample, na.rm=T)
phat
# = 0.22

# Computing the standard error of the sample proportion
sampleSE <- sqrt((phat*(1-phat))/length(subprime.sample))

# Say we want to find the 50, 95, and 99 percent confidence interval estimates...
conlev <- c(.50, .95, .99)
CONFIDENCE(conlev, phat, sampleSE)
#      Lower Bound Higher Bound
# 0.5    0.2023289    0.2376711
# 0.95   0.1686504    0.2713496
# 0.99   0.1525152    0.2874848

# What about including the ninety percent confidence interval?
conlev2 <- c(.50, .90, .95, .99)
CONFIDENCE(conlev2, phat, sampleSE)
#      Lower Bound Higher Bound
# 0.5    0.2023289    0.2376711
# 0.9    0.1769061    0.2630939
# 0.95   0.1686504    0.2713496
# 0.99   0.1525152    0.2874848

# How about an interval for each percentile?
CONFIDENCE(seq(0,1,by=.01), phat, sampleSE)
# output not shown for the sake of space, BUT
# we can graph these:
x <- seq(0,1,by=.01)
y <- CONFIDENCE(x, phat, sampleSE)
library(grDevices)
plot(x, y[,1], type="l", ylim=c(0,0.4), xlim=c(0,1),
     ylab="Confidence Interval", xlab=expression(alpha))
polygon(c(x[1],x[2:98],x[99],x[99],x[98:2],x[1]),
        c(y[1,2],y[2:98,2],y[99,2],y[99,1],y[98:2,1],y[1,1]),
        border="grey", col="grey")
lines(x, y[,1], type="l")
lines(x, y[,2], type="l")
```

Here is the plot:

The width of the confidence interval (the grey area) increases as we seek more confidence about our estimate. Since we set the seed to 126, your random sample should be exactly the same as mine, which allows for direct comparison of results for the example.

```r
# How does this compare to the true population proportion?
points(x, y=rep(ptrue, length(x)), type="l", col="red")
```

Here is the new plot:

We can see that we have included the true population proportion by the time our confidence level grows to include 60 percent of the distribution of our sample proportion.
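Incidentally, the same multi-level computation is easy to sketch in Ruby (the language from the first post). Ruby’s `Math` module has `erf` but no normal quantile function, so a quick bisection stands in for R’s `qnorm`; all names below are my own, and this is a sketch rather than production code:

```ruby
# Standard normal CDF, via the error function.
def phi(z)
  0.5 * (1.0 + Math.erf(z / Math.sqrt(2.0)))
end

# Inverse normal CDF (a stand-in for R's qnorm), by bisection:
# phi is monotone increasing, so halve the bracket until it converges.
def qnorm(p)
  lo, hi = -10.0, 10.0
  60.times do
    mid = (lo + hi) / 2.0
    if phi(mid) < p
      lo = mid
    else
      hi = mid
    end
  end
  (lo + hi) / 2.0
end

# Normal confidence intervals for several levels at once, mirroring
# the CONFIDENCE function above: one [level, lower, upper] row per level.
def confidence(levels, stat, se)
  levels.map do |lev|
    z = qnorm(1.0 - (1.0 - lev) / 2.0)
    [lev, stat - z * se, stat + z * se]
  end
end
```

Called as `confidence([0.50, 0.95, 0.99], phat, sampleSE)`, this returns the same bounds as the R version; sixty bisection steps pin the quantile down far below plotting precision, but a proper quantile routine from a statistics library would be the right tool for serious use.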

Feedback is welcome in the comments, especially links to scripts that modify this function.