Random Buffy the Vampire Slayer Episode Generator

Let me first say that I'm a fan of everything Joss Whedon does.  So when I was asked last year to create a 'Random Episode Generator' for Buffy, I jumped at the chance.  I also think this is a great way to illustrate the usefulness of Excel.  That's right, I just wrote the words 'useful' and 'Excel' in the same sentence.  I think Excel is vastly underrated in the nerdy-data community and is very helpful.  Again, certain things it does very well, and certain (many) things it does horribly.

But this project is perfect because of its simplicity.  Excel can do this in a heartbeat and look halfway decent as well.  Here is a screenshot of the spreadsheet:

All you have to do is hit 'F9' (Windows) or 'Command + =' (Mac) and it will pick a random episode and tell you which season, disk, and episode number to find it on.  I've attached the spreadsheet here.
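For anyone curious what the guts look like outside of Excel, here's the same idea sketched in R. The three episode rows below are placeholders (the real spreadsheet has all seven seasons, and the disk numbers here are invented):

```r
# Tiny stand-in for the full episode table in the spreadsheet
episodes = data.frame(
  season  = c(1, 2, 3),
  disk    = c(1, 4, 2),
  episode = c(7, 22, 9),
  title   = c("Angel", "Becoming: Part 2", "The Wish"),
  stringsAsFactors = FALSE
)

# Each run picks one random row -- the R equivalent of hitting F9
pick = episodes[sample(nrow(episodes), 1), ]
pick
```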


Posted in excel, Visualization | Tagged | Leave a comment

Uber Taxi Bar Chart Fail

I thought I would share something I found in an advertisement/marketing email from Uber Taxi.  See anything amiss about the bar charts?

Let's hope Uber doesn't charge rates like they scale their bar charts.  Could be bad for business.

Posted in analysis, data, Visualization | Tagged , , | Leave a comment

The Hobbies of the Scripps 2014 Spelling Bee Contestants

I downloaded text data from http://public.spellingbee.com/public/results/2014/round_results and performed some text mining on the contestant interviews, and came up with some interesting results.  The hobby with the highest average round reached was Volunteering, followed by Chess and Movies.  I filtered for hobbies that appeared for at least 10 contestants, and I stopped counting at round 13.  The reason I stopped at round 13 is that there was a tie this year and the top two contestants went many rounds past 13.  If I had kept all the rounds in, the top hobby would have been playing the Oboe, as both winners played the oboe.

Anyway, here are the results:


As a side note, I'll post the two R scripts to my GitHub soon.  (One script for web scraping, one script for analysis.)
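Until those scripts are up, here's a toy sketch of the kind of aggregation the analysis does. Everything below is invented for illustration (made-up contestants and rounds, and the cutoff is lowered from the real 10 contestants to 2 because the toy data is tiny):

```r
# Invented mini data set in the rough shape of the scraped interviews:
# one row per contestant, the furthest round reached, and a
# semicolon-separated hobby list
contestants = data.frame(
  round   = c(2, 13, 7, 13, 4, 9),
  hobbies = c("Chess;Movies", "Chess", "Volunteering", "Volunteering;Chess",
              "Movies", "Movies;Volunteering"),
  stringsAsFactors = FALSE
)

# Expand to one row per (contestant, hobby) pair
hobby_list = strsplit(contestants$hobbies, ";")
long = data.frame(
  round = rep(contestants$round, lengths(hobby_list)),
  hobby = unlist(hobby_list),
  stringsAsFactors = FALSE
)

# Average round per hobby, keeping only hobbies with enough contestants,
# sorted from highest average round to lowest
counts      = table(long$hobby)
avg_round   = tapply(long$round, long$hobby, mean)
keep        = avg_round[counts >= 2]
top_hobbies = keep[order(-keep)]
top_hobbies
```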

Posted in analysis, data, R | Tagged , , , | Leave a comment

On the Seasonality of Hoppy Beers

Hops give beer bitter, tangy, citrusy, earthy, fruity, floral, herbal, woody, piney, and/or spicy aromas and flavors. Originally they were used as a preservative because of their natural antimicrobial powers; now they are a defining characteristic of beer, and with the craft beer movement in full swing they are getting lots of attention. Hoppy beers like IPAs and pale ales seem to be everyone’s favorites, and these days every brewery seems to be judged on its version of an IPA or pale ale. Everyone is falling in behind the view that the hoppier a beer is, the better it must be. I’ve always disagreed with this, and I’ve come to the realization that I only like a few hoppy beers, and I have to be in the mood for them. This leads me to ask: are there more people out there like me? Are hops craved seasonally?

To answer this question, I built an R web scraper to download the top 100 beers and the reviews/ratings/dates for each of those beers (code for that scraper will be posted on GitHub soon). Here are the first six rows of a data set with nearly 90,000 entries:

> head(data)
name date rating state
1 90 Minute IPA 2007-04-09 4.33 Illinois
2 90 Minute IPA 2007-04-07 4.80 California
3 90 Minute IPA 2007-04-02 4.85 Pennsylvania
4 90 Minute IPA 2005-07-03 4.47 Oregon
5 Adam From The Wood 2005-09-12 4.53 Ontario (Canada)
6 AleSmith IPA 2013-09-30 4.20 Wisconsin
review
1 redoing this as my review was unceremoniously...
2 Received in a trade from geexploitation. An excellent...
3 This was really an exceptional beer from start to finish...
4 This IPA pours a beautiful golden color, it leaves...
5 Poured with a medium amount of heat behind it the...
6 The liquid portion is a hazy marmalade orange color...
style abv brewery
1 American Double / Imperial IPA 9.00 Dogfish Head Brewery
2 American Double / Imperial IPA 9.00 Dogfish Head Brewery
3 American Double / Imperial IPA 9.00 Dogfish Head Brewery
4 American Double / Imperial IPA 9.00 Dogfish Head Brewery
5 Old Ale 12.00 Hair of the Dog Brewing Company / Brewery and Tasting Room
6 American IPA 7.25 AleSmith Brewing Company
link
1 http://beeradvocate.com/beer/profile/10099/2093/
2 http://beeradvocate.com/beer/profile/10099/2093/
3 http://beeradvocate.com/beer/profile/10099/2093/
4 http://beeradvocate.com/beer/profile/10099/2093/
5 http://beeradvocate.com/beer/profile/173/20767/
6 http://beeradvocate.com/beer/profile/396/3916/


Yup, the first four entries are for the world-famous Dogfish Head 90 Minute IPA, a classic representation of the style. So what styles are we including in ‘Hoppy Beers’? The styles I chose are “American Double / Imperial IPA”, “American IPA”, and “American Pale Ale (APA)”. There are many other styles that are hoppy, and you might ask, “What about an English Pale Ale or an Extra Special Bitter?” The reason they weren’t included is that those styles didn’t appear in the top 100 at the time the data was retrieved.

Looking at the exact point plot of ratings over time is meaningless, so let’s look at a smoothed representation of these beers.


Definitely a trend upwards. This could be the hoppy craft beer culture at work here. What about seasonality? I added an over-smoothed line (in red) to the chart to see the yearly trend. To look at the seasonality within years, let’s subtract the two lines and remove the yearly trend.


Hmmmm, maybe a possible dip every year in the late winter/spring and a peak in the fall? Interesting. Let’s verify the seasonality by finding the peak of the Fourier spectrum of this signal. This tells us the dominant frequency in the above time series, which turns out to be around f = 0.003006012 cycles per day. To find the period of this frequency, we invert it to get:

T = 332.67 days.
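That inversion is just one line of R:

```r
f_peak = 0.003006012   # dominant frequency from the spectrum, in cycles per day
period_days = 1 / f_peak
round(period_days, 2)  # 332.67
```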


I bet that with more data, and controlling for review volume, we could narrow this down to exactly one year. In the future I’m going to do some resampling to estimate the error in the period, and hopefully 365.25 days will fall within the error bounds. For now, I think I’m going to have a beer.


R code for analysis:

library(TSA) # Time series analysis package
##----Plot Ratings of Hoppy Beers----
hoppy_beerstyles = c("American Double / Imperial IPA","American IPA","American Pale Ale (APA)")
date.seq = seq(from=as.Date("2008-01-01"),to=range(data$date)[2],by=1)
hoppy_ratings_avg = sapply(date.seq,function(x){
  mean(data$rating[data$style%in%hoppy_beerstyles & data$date==x],na.rm=TRUE)
})
date.seq = date.seq[!is.na(hoppy_ratings_avg)]
hoppy_ratings_avg = hoppy_ratings_avg[!is.na(hoppy_ratings_avg)]
hoppy_ratings_smooth = lowess(date.seq,hoppy_ratings_avg,f=0.08)$y # local smooth
hoppy_over_smooth = lowess(date.seq,hoppy_ratings_avg,f=0.4)$y # over-smoothed yearly trend
grid_lines = seq(as.Date("2008-01-01"),as.Date("2014-01-01"), by="+6 month")
plot(date.seq,hoppy_ratings_smooth,main="Avg. Daily Rating of Hoppy Beers",type="l",
     xlab="Date",ylab="Avg. Daily Rating",lwd=2)
abline(v=axis.Date(1, x=grid_lines),col = "lightgray", lty = "dotted", lwd = par("lwd"))
##----Remove hoppy trend / find periodicity----
hoppy_no_trend = hoppy_over_smooth - hoppy_ratings_smooth
plot(date.seq,hoppy_no_trend,main="Avg. Daily Difference of Ratings from Yearly Trend",type="l",
     xlab="Date",ylab="Avg. Daily Rating Difference",lwd=2)
abline(v=axis.Date(1, x=grid_lines),col = "lightgray", lty = "dotted", lwd = par("lwd"))
# periodogram(hoppy_no_trend) # alternative: raw periodogram from the TSA package
hoppy_spec = spectrum(hoppy_no_trend, method = "ar") # AR-based spectral estimate
hoppy_spec_max = hoppy_spec$freq[which.max(hoppy_spec$spec)] # dominant frequency

Posted in analysis, data, R, Webscraping | Tagged , , , | 1 Comment

A Small Trick for Big Data in R

The other day I was writing a prediction script in R for a very large data set. While prepping the data, I needed to create a logical vector from a numeric vector, and I didn't want to spend the time loading everything onto our R server if I could avoid it. I found a very neat memory-saving trick that is also quite a bit faster than the usual code.

Let's say you want to create a logical vector from a numeric vector. The logical should be 0 where the numeric is zero and 1 where the numeric is not zero. The obvious way would be a vectorized if-else statement.

> test = sample(0:10,1000000,replace=T)
> system.time(ifelse(test!=0,1,0))
user system elapsed
1.31 0.00 1.31

This may not seem bad, but for my case, I had much more data to deal with, and the regular ifelse statement resulted in a memory overflow. After trying all the normal memory tricks (clearing the workspace, garbage collecting, etc...) with no success, I thought of a better plan.

> system.time(ceiling(pmax(0,pmin(test,1))))
user system elapsed
0.19 0.01 0.20

Nice. This is a vectorized min nested inside a vectorized max, wrapped in a ceiling function, and it works for all non-negative values. But we can go quicker. Of course there are many ways to skin the proverbial cat in R, and the fastest I've found so far is:

> system.time(test[test!=0]<-1)
user system elapsed
0.13 0.02 0.14

This works with all numeric values (negative ones included), is quicker, and uses the same storage space. The only downside of this type of command is that it overwrites the original numeric vector. To work around that, just duplicate the column or vector of interest beforehand.
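A minimal sketch of that workaround: copy the vector first, then apply the fast overwrite to the copy.

```r
set.seed(42)
test = sample(0:10, 1000000, replace = TRUE)

flag = test             # duplicate so the original values survive
flag[flag != 0] <- 1    # the fast in-place trick, applied to the copy

head(test)  # original, untouched
head(flag)  # 0/1 indicator
```

The copy does cost a second vector's worth of memory, of course, so on truly huge data you may still prefer to overwrite in place.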

Posted in analysis, data, R | Tagged , , | Leave a comment

Webscraping in R: Part 1

It's hard to believe, but webscraping in R can be done really easily, and with no fancy packages either. I recently ran into the need to scrape weather information from the web. After writing the program, I realized this specific task was quite easy compared to other webscrapes, so I decided to make it a post here and show the steps. Soon I'll write a more complicated webscrape, but for now this will get us started.

To get simple weather data, Weather Underground provides an almanac link which has an option to view in CSV form.

Let's say that I'm interested in the weather in Las Vegas on a specific date. The first step is to find the web address. After some simple searching, the CSV file is located at:

http://www.wunderground.com/history/airport/KLAS/2014/03/09/DailyHistory.html?format=1
It's important to look at and understand the link provided. If need be, explore more CSV links from different dates/cities. It turns out that the 'KLAS' part of the link represents the Las Vegas airport location, and it's pretty obvious the '...2014/03/09...' part of the link represents the year/month/day of the date queried.

Using this information, we can write a simple R script to retrieve this information and save it in a data frame.

##----Set working directory----
##----Set Location Code----
loc_code = "KLAS"
##----Set Retrieval Date----
retrieval_date = as.Date("2014-03-09") # Could also be today: 'Sys.Date()'

Now we need to piece together the site address. To do this we'll use the 'paste' R command.

##----Set Site Address----
# Site = "http://www.wunderground.com/history/airport/KLAS/2014/3/10/DailyHistory.html?format=1"
site_prefix = "http://www.wunderground.com/history/airport/"
site_suffix = "/DailyHistory.html?format=1"
weather_link = paste(site_prefix,loc_code,"/",gsub("-","/",retrieval_date),site_suffix,sep="")

Let's go out and get that data! We'll use the R command 'readLines' to take the information from the web.

##----Retrieve Web Info----
weather_info = readLines(weather_link)[-1]

The '[-1]' is included because R sometimes sticks in a blank first line when using 'readLines' on a website. Now we have 25 strings: a header row plus 24 hourly rows, each containing all the information we need (14 metrics) separated by commas. Let's parse it up and extract the headers.

weather_info = strsplit(weather_info,",")
headers = weather_info[[1]]
weather_info = weather_info[-1]

Now we have a 24 element list that contains the vectors we want. Let's transform it into a data frame and label the columns with the headers we saved.

weather_info = do.call(rbind.data.frame,weather_info)
names(weather_info)= headers

We are essentially done, except that the data frame is comprised entirely of factors. Not ideal. What we really need is to look through it, determine which columns we want to be numeric and which we want to be strings, and convert the corresponding columns. We do this by initially converting everything to strings (using the 'lapply' and 'as.character' functions), then specifying which columns are numeric, converting them, and finally doing some date/time clean-up.

##----Convert Data Frame Columns----
weather_info <- data.frame(lapply(weather_info, as.character), stringsAsFactors=FALSE)
numeric_cols = c(2,3,4,5,6,8,9,10,13)
weather_info[numeric_cols] = lapply(weather_info[numeric_cols],as.numeric)
weather_info[is.na(weather_info)] = 0
colnames(weather_info)[14] = "Date"
weather_info$Date = as.Date(substr(weather_info$Date,1,10))

Done. This R script can be found in a complete file under my Github account.

For completeness, and because I like graphs, let's look at what the hourly temperature was like on March 9th, 2014.



Yes, it seems that the Las Vegas "winter" isn't really a winter at all. Seventy five degrees in early March.

Posted in analysis, data, R, Webscraping | Tagged , , | Leave a comment

Using DPLYR in R: A Quicker, Easier Way to Work with Data

As I promised, I thought I would show an example using dplyr. I decided to create my own data set instead of going with a canned data set in R; I did this to preface my next few posts, since after this one I will be switching to webscraping in R and analysis of beer reviews. I know, I'm excited too.

The data set is one I recently got by webscraping a very popular beer review site that publishes a highly respected top-250 beer list. It's not a terribly large data set, but it will illustrate the usefulness of dplyr and answer a question I've always had about beer: "Does the ABV of a beer influence its rating?"

I will make the beer webscraper available soon on my GitHub, but I want to optimize the code first. Since I got it working, I downloaded a top-250 beer list and all the reviews/dates/ratings for those beers. Let's take a look at the data.

> str(data)
'data.frame': 72971 obs. of 9 variables:
$ name : chr "90 Minute IPA" "90 Minute IPA" "90 Minute IPA" "90 Minute IPA" ...
$ date : Date, format: "2007-04-09" "2007-04-07" "2007-04-02" "2005-07-03" ...
$ rating : num 4.33 4.8 4.85 4.47 4.53 4.2 4.51 4.48 4.34 4 ...
$ state : Factor w/ 165 levels "","0","Alabama",..: 22 12 49 48 47 65 32 43 38 13 ...
$ review : chr "redoing this as my review was unceremoniously deleted (BA tip: never, ever complain at all if your avatar, and free speech righ"| __truncated__ "Received in a trade from geexploitation. An excellent beer in my opinion. When pouring, I thought it looked a little lite and p"| __truncated__ "This was really an exceptional beer from start to finish. I can't say enough about it. Absolutely remarkable drinkability and b"| __truncated__ "This IPA pours a beautiful golden color, it leaves a nice sticky white lace that sticks to the side of the glass. The smell is "| __truncated__ ...
$ style : Factor w/ 37 levels "American Amber / Red Ale",..: 4 4 4 4 30 7 7 7 7 7 ...
$ abv : num 9 9 9 9 12 7.25 7.25 7.25 7.25 7.25 ...
$ brewery: chr "Dogfish Head Brewery" "Dogfish Head Brewery" "Dogfish Head Brewery" "Dogfish Head Brewery" ...
$ link : chr "http://beeradvocate.com/beer/profile/10099/2093/" "http://beeradvocate.com/beer/profile/10099/2093/" "http://beeradvocate.com/beer/profile/10099/2093/" "http://beeradvocate.com/beer/profile/10099/2093/" ...

So we are dealing with about 70,000 observations of ratings/reviews for the top 250 beers. And yes, the first reviews there are for the famous Dogfish Head 90 Minute IPA. Yum. I'm getting thirsty already.

Back to 'dplyr'. dplyr introduces something that should be standard in R: a data frame that won't accidentally print itself in full. One of my pet peeves is accidentally asking R to print out a LARGE data frame, eating up memory, CPU, and time. Check out the fix.

> data_dplyr = tbl_df(data)
> data_dplyr
Source: local data frame [72,971 x 9]
name date rating state
1 90 Minute IPA 2007-04-09 4.33 Illinois
2 90 Minute IPA 2007-04-07 4.80 California
3 90 Minute IPA 2007-04-02 4.85 Pennsylvania
4 90 Minute IPA 2005-07-03 4.47 Oregon
5 Adam From The Wood 2005-09-12 4.53 Ontario (Canada)
6 AleSmith IPA 2013-09-30 4.20 Wisconsin
7 AleSmith IPA 2013-09-30 4.51 Massachusetts
8 AleSmith IPA 2013-09-29 4.48 New York
9 AleSmith IPA 2013-09-15 4.34 Netherlands
10 AleSmith IPA 2013-09-15 4.00 Colorado
.. ... ... ... ...
Variables not shown: review (fctr), style (fctr), abv (dbl), brewery (fctr), link (chr)

Nice. No more nagging worry about accidentally printing out data frames. AND you can do all the same operations on the new 'tbl_df' object as on a regular data frame.

I'm going to compare base R with dplyr in ordering data, selecting subsets, and mutating data by comparing system times for each. Let's look at ordering.

> system.time(arrange(data_dplyr,name,date,rating))
user system elapsed
0.10 0.00 0.11
> system.time(data[order(data$name,data$date,data$rating),])
user system elapsed
0.33 0.00 0.32

This is actually quite the speed-up: dplyr's 'arrange' function is about THREE TIMES FASTER than the base function 'order'. AND this isn't even that large of a data set. Zoinks, Batman! Let's do the same with selecting subsets:

> system.time(select(data_dplyr,name,brewery,state))
user system elapsed
0 0 0
> system.time(data[,c("name","brewery","state")])
user system elapsed
0.08 0.00 0.08

Ok, so my system is obviously rounding down on the first time, so we can safely assume that the first time is less than 0.005 seconds. You might not notice much of a time difference here, but increase the size 1000-fold, and the difference between 80 seconds and 5 seconds is huge.

Let's say that I'm interested in the average rating of a beer PER abv. That means that I have to tell R that I want to take the rating value and divide it by the abv value for each row. (I realize there's a quicker pre-processing method, but for now, let's stick to this example for illustrative purposes). Comparing the times:

> system.time(mutate(data_dplyr,rating_per_abv = rating/abv))
user system elapsed
0 0 0
> system.time(data$rating/data$abv)
user system elapsed
0.01 0 0.01

Ok, so base R is very close here; at a guess, dplyr is about halving the time.

Let's use some more dplyr functions to see how the rating of each beer relates to its ABV. First we'll use the 'group_by' function to group our data frame by beer name. Then we'll create a summarized data frame that contains the number of reviews, the average rating, and the average ABV for each beer.

> abv_rating_data = group_by(data_dplyr, name)
> rating_info = summarise(abv_rating_data,
count = n(),
rating_sum = mean(rating,na.rm=TRUE),
abv_sum = mean(abv,na.rm=TRUE))

Now let's load ggplot2 and make a nice plot with a smooth lowess line to illustrate a relationship.

> library(ggplot2)
> rating_info = filter(rating_info,abv_sum<20)
> ggplot(rating_info, aes(abv_sum,rating_sum)) +
    geom_point(aes(size=count), alpha = 0.7) +
    geom_smooth() +
    scale_size_area()

This will create the nice looking graph:


Yay! That was easy enough. What does this graph say? Maybe we can infer that from 4% ABV to 12% ABV there's evidence for a very slight upward trend, but mostly no observable trend. (Notice that the sample size for beers above 12% ABV is VERY small.)


Posted in analysis, data, R | Tagged , , , | Leave a comment