As promised, I thought I would show an example using dplyr. I decided to create my own data set instead of going with a canned data set in R, partly to preface my next few posts: after this one, I will be switching to web scraping in R and analysis of beer reviews. I know, I'm excited too.
The data set is one I recently put together by web scraping a very popular beer review site that publishes a highly respected top-250 beer list. It's not a terribly large data set, but it will illustrate the usefulness of dplyr and answer a question I've always had about beer: does the ABV of a beer influence its rating?
I will make the beer web scraper available on my GitHub soon, but I want to optimize the code first. In the meantime, since I got it working, I downloaded the top-250 beer list and all the reviews, dates, and ratings for those beers. Let's take a look at the data.
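A minimal sketch of loading and inspecting the scrape. The file name `beer_reviews.csv` and the column names (`beer_name`, `abv`, `rating`, `review`, `date`) are assumptions for illustration; adjust them to match your own scrape.

```r
# Hypothetical file and column names -- adjust to your own scraped output.
beer <- read.csv("beer_reviews.csv", stringsAsFactors = FALSE)

str(beer)    # structure: ~70,000 rows of beer_name, abv, rating, review, date
head(beer)   # peek at the first few reviews
```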
So we are dealing with about 70,000 observations of ratings/reviews for the top 250 beers. And yes, the first reviews there are for the famous Dogfish Head 90 Minute IPA. Yum. I'm getting thirsty already.
Back to dplyr. dplyr introduces something that should be standard in R: a data frame wrapper that won't flood your console when printed. One of my pet peeves is accidentally asking R to print out a LARGE data frame and eating up memory, CPU, and time. Check out the fix.
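A sketch of the fix, assuming the data frame from above is named `beer`. (`tbl_df()` was the wrapper function in dplyr at the time of writing; newer versions use `as_tibble()` for the same purpose.)

```r
library(dplyr)

# Wrap the plain data frame in dplyr's tbl_df class
beer_tbl <- tbl_df(beer)

# Printing now shows only a short preview, not all ~70,000 rows
beer_tbl
```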
Nice. No more nagging worry about accidentally printing out data frames. AND you can do all the same operations on the new 'tbl_df' object as on a regular data frame.
I'm going to compare base R with dplyr for ordering data, selecting subsets, and mutating data by comparing system times for each. Let's look at ordering.
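The comparison can be sketched like this, assuming a `beer` data frame with a `rating` column as above; `system.time()` reports elapsed time for each approach.

```r
# Base R: order the data frame by rating, descending
system.time(beer[order(-beer$rating), ])

# dplyr: the same ordering with arrange()
system.time(arrange(beer_tbl, desc(rating)))
```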
This is actually quite a speedup. dplyr's arrange function is about THREE TIMES FASTER than the base order function. AND this isn't even that large a data set. Zoinks, Batman! Let's do the same with selecting subsets:
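A sketch of the subsetting comparison, again assuming the columns above; the 4.5 cutoff is an arbitrary example value.

```r
# Base R subset: all reviews with a rating of at least 4.5
system.time(beer[beer$rating >= 4.5, ])

# dplyr equivalent with filter()
system.time(filter(beer_tbl, rating >= 4.5))
```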
OK, so my system is obviously rounding the first time down to zero, so we can safely assume that the first time is less than 0.005 seconds. You might not notice much of a time difference here, but increase the data size 1000-fold, and the difference between 80 seconds and 5 seconds is huge.
Let's say I'm interested in the average rating of a beer PER ABV. That means I have to tell R to take the rating value and divide it by the ABV value for each row. (I realize there's a quicker pre-processing method, but for now, let's stick with this example for illustrative purposes.) Comparing the times:
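The mutation step can be sketched as follows, assuming `rating` and `abv` columns; the new column name `rating_per_abv` is made up for the example.

```r
# Base R: add the new column by direct assignment
system.time(beer$rating_per_abv <- beer$rating / beer$abv)

# dplyr: the same transformation with mutate()
system.time(beer_tbl <- mutate(beer_tbl, rating_per_abv = rating / abv))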
OK, base R is much closer here; I'd guess dplyr roughly halves the time.
Let's use some more dplyr functions to see how the rating of each beer relates to its ABV. First, we'll use the group_by function to group our data frame by beer name. Then we'll create a summarized data frame containing the number of reviews, the average rating, and the ABV for each beer.
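The two steps above can be sketched like this, assuming the column names used so far; `first(abv)` works because ABV is constant within a beer.

```r
# Group the reviews by beer, then collapse each group to one summary row
by_beer <- group_by(beer_tbl, beer_name)

beer_summary <- summarise(by_beer,
                          n_reviews  = n(),           # number of reviews per beer
                          avg_rating = mean(rating),  # average rating per beer
                          abv        = first(abv))    # ABV is the same for every review of a beer
```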
Now let's load ggplot2 and make a nice plot with a smooth lowess line to illustrate a relationship.
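A sketch of the plot, assuming the summarized data frame and column names from the previous step; `geom_smooth()` fits a loess smoother by default for a data set of this size.

```r
library(ggplot2)

ggplot(beer_summary, aes(x = abv, y = avg_rating)) +
  geom_point() +
  geom_smooth() +   # loess smoother by default
  labs(x = "ABV (%)", y = "Average rating")
```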
This produces a nice-looking graph:
Yay! That was easy enough. What does this graph say? Between 4% and 12% ABV, there is evidence of, at most, a very slight overall upward trend, but mostly no observable trend. (Note that the sample size for beers above 12% ABV is VERY small.)