## Fast Data Frame Modification in R

One of the data structures I use most frequently in R is the data frame. Data frames are similar to matrices, except they allow different types of variables in each column. I always rejoice when I can reduce the analysis at hand to a matrix of numbers, because R is inherently faster at dealing with matrices than with data frames. When I get data that is millions of rows long and over a hundred columns wide, R still works, but it slows down significantly. Lately, I spend most of my time figuring out the fastest way to transform data and perform calculations on big data frames.

Here is a great solution from the folks at RStudio. I had to post this to get the word out.

**Introducing dplyr**

'dplyr' is the successor to the plyr package, focused specifically on data frames, and it increases speed tremendously. I plan to post examples using this new package soon.

## Short Circuiting Logical Statements in R

Short-circuiting logical statements are the way 'lazy' programming languages evaluate logical statements.  To identify whether the programming language you are using is lazy or eager, reference the chart on the wiki page: http://en.wikipedia.org/wiki/Short-circuit_evaluation

The lazy (and quicker) way of evaluating logical statements works like this: if any of the logicals joined with an 'AND' are false, the whole statement is false.  Conversely, if the logicals are joined with an 'OR' and any are true, the whole statement is true.  Eager programs will evaluate the whole statement before moving on, while lazy programs will move on as soon as one of those conditions is hit.

Python's logical operators are, by default, lazy, with no eager alternatives (although you could explicitly program the eagerness out yourself).  R has both.  Yup, we should have seen this coming, seeing as R has FIVE different assignment operators (http://stat.ethz.ch/R-manual/R-patched/library/base/html/assignOps.html). In R, '&' (AND) and '|' (OR) are the eager operators (they are also vectorized, evaluating element by element), while '&&' (AND) and '||' (OR) are the lazy, short-circuiting evaluators.

Mandatory Python announcement: the logical operators are simply 'and' and 'or' (written exactly like that) and are lazy by default.  In my opinion, this is a big advantage that Python has over R: simplicity.
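To see Python's laziness in action, here is a small sketch (the `explode` helper is made up purely to prove that the second operand is never touched):

```python
def explode():
    # This would raise an error if it were ever evaluated.
    raise RuntimeError("should never be reached")

# `and` stops at the first falsy operand, so explode() is never called.
result = (5 != 5) and explode()
print(result)  # False

# `or` stops at the first truthy operand.
result = (5 == 5) or explode()
print(result)  # True
```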

Anyways, back to R.  Let's implement this to show that these operators do indeed work how they should. Remember that comparing a NULL value to a numeric value in R yields a zero-length logical, which causes an error when used as the condition of an if statement.

Eager evaluation:

```
> if(5!=5 & NULL==5){print("This is true!")}else{print("This is false!")}
Error in if (5 != 5 & NULL == 5) { : argument is of length zero
```

Lazy evaluation:

```
> if(5!=5 && NULL==5){print("This is true!")}else{print("This is false!")}
[1] "This is false!"
```

This should raise an eyebrow for a few reasons. Besides being really cool, it is also dangerous. If I'm testing a program with lazy evaluation, it will not throw an error if it ends up comparing NULL values to numeric values. Hence, good procedure is to test programs with eager evaluation and then switch to lazy evaluation after testing.

Now, I promised an application to big data. When we need to go through and test criteria on a very large data set, it's fair to say that lazy evaluation will be quicker than eager evaluation. Let's see by how much.

```
> test.mat = matrix(rnorm(1000000), nrow=1000)
> system.time(apply(test.mat, 1, function(r) sapply(r, function(c) if(c>0.99 & c<0.999){c}else{0})))
   user  system elapsed
   5.27    0.00    5.38
> system.time(apply(test.mat, 1, function(r) sapply(r, function(c) if(c>0.99 && c<0.999){c}else{0})))
   user  system elapsed
   4.15    0.00    4.24
```

On one million elements, testing each one individually for two conditions, we saved a whole second! Now let's up the ante.

```
> system.time(apply(test.mat, 1, function(r) sapply(r, function(c) if(c>0.99 & abs(c**2+3*c-2)<0.999 & sin(tan(abs(pi*c)))>0.5){c}else{0})))
   user  system elapsed
  11.43    0.00   11.50
> system.time(apply(test.mat, 1, function(r) sapply(r, function(c) if(c>0.99 && abs(c**2+3*c-2)<0.999 && sin(tan(abs(pi*c)))>0.5){c}else{0})))
   user  system elapsed
   4.90    0.00    4.98
```

We saved over half the time! The above example also points out that the easiest-to-evaluate logical statements should come first, so that the short circuit kicks in before the expensive conditions are ever reached.
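The same ordering principle holds in Python. Here is a rough timing sketch (the exact numbers will vary by machine, so only the relative ordering matters):

```python
import math
import random
import timeit

random.seed(0)
xs = [random.gauss(0, 1) for _ in range(50000)]

def expensive_first():
    # The costly trig test runs for every element.
    return [x for x in xs if math.sin(math.tan(abs(math.pi * x))) > 0.5 and x > 0.99]

def cheap_first():
    # x > 0.99 rejects most elements before the trig test ever runs.
    return [x for x in xs if x > 0.99 and math.sin(math.tan(abs(math.pi * x))) > 0.5]

slow = timeit.timeit(expensive_first, number=5)
fast = timeit.timeit(cheap_first, number=5)
print(f"expensive first: {slow:.3f}s, cheap first: {fast:.3f}s")
```

Both versions return identical results; only the amount of wasted work differs.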
As a side note, please do NOT use short circuiting logical statements when selecting a subset of data. See the following example as to why.

```
> data = data.frame("a"=c(1,3,5), "b"=c(2,4,6))
> data
  a b
1 1 2
2 3 4
3 5 6
> data.subset = data[data$a==3 && data$b==4,]
> print(data.subset)
[1] a b
<0 rows> (or 0-length row.names)
> data.subset = data[data$a==3 & data$b==4,]
> print(data.subset)
  a b
2 3 4
```

You can see that using the short-circuiting logical operators reduces our data set down to nothing. Hence, we need to use eager logical operators when sub-setting data.
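pandas users face the mirror image of this pitfall: the scalar `and` cannot combine two boolean Series at all, so the element-wise `&` is the only correct choice when subsetting. A sketch, assuming pandas is available:

```python
import pandas as pd

data = pd.DataFrame({"a": [1, 3, 5], "b": [2, 4, 6]})

# Element-wise & keeps exactly the row where both conditions hold.
subset = data[(data["a"] == 3) & (data["b"] == 4)]
print(subset)

# Scalar `and` needs each Series collapsed to a single True/False,
# so pandas raises an error instead of silently guessing.
try:
    data[(data["a"] == 3) and (data["b"] == 4)]
except ValueError as err:
    print("error:", err)
```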

To summarize, we use lazy evaluation to make programs run quicker with larger data sets and we use eager evaluation to test programs and for sub-setting data.

## Big Data Introduction

Happy Holidays!

For the next few posts, I will be talking about using some tools/tricks on big data. I think that the term 'big data' gets thrown around a bit much these days. The more common definition of big data is any data whose size presents problems when analyzing it. At work, I have access to a server and Hadoop, so the sizes that give me problems are a lot bigger than at home with my workstation. There are a few ways to deal with big data:

1. Subset the data and work with only the data you need.  Surprisingly, many people don't realize the usefulness of this.  Get rid of those extra columns!  Get rid of missing data!  Get rid of outliers!  Many times, reducing data sets properly solves the big-data problem and not much information is lost.
2. Get bigger hardware.  My primary tool, R, is a memory-based program.  This means all of its work is done in memory.  That makes it fast but limited.  Not quite 'Excel' limited, but limited nonetheless.  Currently, I usually only work with up to about 4-6GB of data in memory.  Since I do have server access at work, I can deal with up to 60-90GB of data.  Still, the more data, the slower any program will run.  If the analysis is quite complicated, we should move on to other options.
3. Use software that can handle big data.  This means Hadoop (Hive, Pig, Impala), Python, C++/C#, etc.  Some also argue that the ff/bigmemory packages in R work well, and they do, up to a point.  After trying to get more complicated analyses to work in ff/bigmemory, I just feel as if I'm tricking R and duct-taping things together.
4. Be smart.  If you can write programs that are efficient and fast, you will thank yourself later.  Stay away from loops!  Learn your apply functions.  This type of smart-programming can get you quite far.
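Point 1 above can be sketched in a few lines of pandas (the data and column names here are made up purely for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "value": [1.0, 2.0, 200.0, np.nan, 3.0],
    "group": ["a", "b", "a", "b", "a"],
    "unused_note": ["x"] * 5,          # a column the analysis never touches
})

df = df.drop(columns=["unused_note"])  # get rid of those extra columns
df = df.dropna(subset=["value"])       # get rid of missing data
df = df[df["value"] < 100]             # get rid of the obvious outliers

print(df.shape)  # (3, 2)
```

Three simple filters, and the data set is a fraction of its original size with essentially no information lost.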

Next, I will be posting on being smart in R with short-circuiting logical statements.  They are way cool.

## The Simplicity of Python (an example)

Previously, I posted on how to forward fill missing data in columns of data in R and Excel. I just figured out how to do this in Python, and was blown away. Ready?

```python
from pandas import Series

data = Series(['A','B','C','D','E','F'], index=[0,2,5,7,12,14])
data.reindex(range(15), method='ffill')
```

Done! It's as simple as telling Python, "Hey I want to reindex the 15 items (0-14) with a forward-fill method". And Python responds, "Done".

Me: "OK, Jump!"

Python: "How high?"

Also, I should mention that "method='bfill'" is... you guessed it... a backwards fill method. Enjoy.
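A minimal sketch of that backward fill, using a small made-up series:

```python
from pandas import Series

data = Series(['A', 'B', 'C'], index=[0, 3, 6])

# Backward fill: each missing position takes the NEXT known value.
filled = data.reindex(range(7), method='bfill')
print(filled.tolist())  # ['A', 'B', 'B', 'B', 'C', 'C', 'C']
```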

ps- I think that the more I post about Python, the shorter my posts will be. (Thanks to the simplicity of Python).

## Intro to Survey Analysis

Lately, more people have approached me with questions on how to analyze survey data. There are very interesting problems associated with survey analysis. In dealing with surveys, I have seen mistakes made in creating them. Here are some general rules to follow when making surveys.

1. For multiple choice answers, make sure that your intended audience always has an option to answer, even if that means incorporating an “other” answer.
2. If you are performing pre- and post-survey analysis, do not vary your categorical questions even the slightest.  E.g., if the pre-survey asks:

    What is your income?  A) <20K    B) 20-50K    C) >50K

Then make sure your post survey has the exact same phrasing and answers.  This helps with analyzing the effects before and after an experiment.

3. Be careful with using too many short-answer or write-in answers.  Before you know it, all of your analysis will be using text mining to answer questions.  Text mining can be very hard and complicated.
4. Keep questions short and simple.  The more you confuse your target audience, the more confusing your results will be.
5. If you are using a rating-scale throughout the survey, be consistent.  Surveys that switch rating scales tend to confuse the audiences.
6. Low response rates lead to bias.  Try to get a large audience (See # 7).
7. I always encourage incentives for surveys.  Just be careful that the incentive doesn’t make your audience biased.  E.g. offering an entry for a tablet drawing could draw a younger, more tech-savvy crowd.  If that doesn’t impact your audience, then go for it.

I will not cover things related to determining audience, supervising data collection, obtaining consent, etc.

When it comes to the actual analysis, the first part is data entry.  This is a time consuming, monotonous, and generally torturous activity.  Depending on the size of the audience and response, there are a few options.

1. Manual labor: If you are an academic researcher, you can probably offer undergraduate credits for data entry.  If not, then you are either on your own, or you can hire someone to help you.
2. Software conversions:  There exists software to convert paper surveys or electronic documents into data.  They can range from free to expensive, but beware; you get what you pay for.  This method will probably involve lots of data janitoring (Yes, go ahead and add the word janitoring to your word/phone dictionary).
3. Website survey services: I highly recommend using one of these online services.  They help you create the survey, do data entry, and even do some simple high-level analysis for you (mostly summary statistics).

Now we get to move on to data janitoring.  Some data entry methods are more prone to errors than others.  Software data entry, in which programs try to determine what people have physically written down, is the most prone to errors and will require some heavy data cleanup.  Manual entry is probably the easiest, as the data should be clean to start with and in a form that is pretty easy to use.  The survey websites will deliver clean data, but some data janitoring might be needed to get the data into a usable form.

In order to do analysis, most surveys will require dealing with silly human errors.  Spelling errors, grammatical errors, abbreviations, and more will make it hard to do analysis.  If you have write-in/short-answer questions, you have two options to overcome this problem.  The first is overcoming it with a large enough sample size, such that one person writing "This place rox" will not have an impact on the many others who write "This place rocks".  The second option, given a smaller data set, is to run spell check.  Be careful with this option because spell check can be overactive.  For example, if people write proper nouns starting in lowercase, spell check will want to look for English words to replace them.

Finally we have arrived at the data analysis part.  Give yourself a pat on the back, the hard stuff is over.

## The Continuous Breeder's Equation

The Breeder's Equation has always fascinated me. I worked a bit with this equation in graduate school, and now that I'm free to work with whatever interests me at the moment, I thought I would bring it back up. The equation is as follows:

$R=h^{2}S$

This equation tells us the phenotypic response (R) of a population in one generation under some selection differential (S), which depends also on the heritability (h) of the trait. A naive example is to consider a trait that is fully heritable (h=1); then the response in a generation is exactly equal to the selection differential. Unfortunately, most traits have heritability that falls somewhere between 0 and 1, and this is the hard part to figure out. But all we have to do is wait a generation of intense selection, see what proportion of the selection the population actually follows, and we have our heritability.

This plays out well with controlled populations that breed discretely (e.g., agricultural products like corn, soybeans, cows...). Like it or not, we have been performing genetic modification of species ever since mankind changed from a hunter/gatherer society to an agriculture-based society thousands of years ago. A farmer would always keep the best crops and use them for reseeding next year. Doing this, she imposes a selection force on the population toward a desired characteristic.

What about populations that don't breed/reproduce so discretely? You might even suggest that wild, uncontrolled populations tend to breed discretely, as there are many wild animal populations that have a yearly breeding season or two. But let's widen our scope. Most of the life on this planet is single-celled organisms: bacteria, archaea, single-celled eukaryotes, etc. These populations definitely do NOT breed discretely. If you consider a whole population of bacteria at any moment in time, there are many cells in various stages of reproduction (assuming a growing population). The discrete equation can NOT be applied to such a population.

In mathematics there are a few ways to convert discrete equations to continuous. But we have to unravel the breeder's equation first. Let's consider it as follows.

$(\mu_{o}-\mu)=h^{2}(\mu_{p}-\mu)$

Here, the response, R, equals a difference in population means, $\mu_{o}$ is the mean phenotypic trait of the offspring generation, $\mu_{p}$ is the mean phenotypic trait of the parent generation (that we are selecting for), and $\mu$ the mean of the population as a whole. Again, a quick sanity check (h=1) will give us $\mu_{o}=\mu_{p}$, a purely heritable trait. Now I want to write the equation to include time. We assume that $g$ is the 'generation time' of the population.

$(\mu(t+g)-\mu(t))=h^{2}(\mu_{p}-\mu(t))$

or

$\frac{(\mu(t+g)-\mu(t))}{g}=\frac{h^{2}}{g}(\mu_{p}-\mu(t))$

I think you can see how this is going to work. Now we want the limit as $g\rightarrow 0$ and we'll arrive at a continuous derivative.

$\frac{d\mu}{dt}=\mu^{'}(t)=\frac{h^{2}}{g}(\mu_{p}-\mu(t))$

This non-homogeneous first order differential equation is solvable with the condition that $\mu(0)=\mu_{0}$. I'll save you the fun details that you can find in any differential equations book and skip to the answer.

$\mu(t)=\mu_{p}\left[ 1+ \left( \frac{\mu_{0}}{\mu_{p}} -1 \right) e^{-\frac{h^{2}}{g}t} \right]$

This is a very interesting equation. Let's look at a graph of the situation where the heritability (h) = 0.5, the generation time (g) = 1 unit, the starting phenotypic trait ($\mu_{0}$) is 1 and we are selecting for a phenotype of 2 ($\mu_{p}$).
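Since the graph itself may not survive here, a quick Python check of the values it would show, plugging those parameters into the closed-form solution:

```python
import math

# Parameters from the graph description above.
h, g, mu0, mup = 0.5, 1.0, 1.0, 2.0

def mu(t):
    # mu(t) = mu_p * (1 + (mu0/mu_p - 1) * exp(-(h^2/g) * t))
    return mup * (1.0 + (mu0 / mup - 1.0) * math.exp(-(h**2 / g) * t))

for t in [0, 5, 10, 20, 40]:
    print(f"t={t:>2}: mu(t) = {mu(t):.4f}")
# The trait starts at mu0 = 1 and approaches the selected value mu_p = 2.
```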

This graph looks exactly as we would expect the population to behave: over time, the population approaches the phenotypic value we select for. I hope to find some actual data out there soon that approximates this curve. But I'll save that for later, as I don't feel like combing through evolutionary microbiology articles right now.

## Separating Blocks of Numbers in Columns of Data

More and more I'm using the 'apply' functions in R (apply, mapply, sapply, lapply, ...). The help pages on these are hard to decipher. Not until I read a post by Neil Saunders did I really start using them instead of loops.

Lately, I've been creating more nested apply functions, i.e., an apply function within an apply function. The latest one I created does something really neat with a specific kind of data: it creates, for each column of a sparse matrix, a list of the indices of each contiguous block of non-zero numbers. Let's look at a simple example.

Let's say we have some raw data, and what we are looking to create in R is a list of the block indices of non-zero numbers by column. Here is the data:

```
> data = matrix(data=c(2,2,3,0,0,0,0,1,1,0,0,4,3,0,0,2,3,4,0,3,3),
+               nrow=7, ncol=3, dimnames=list(1:7, c("A","B","C")))
> data
  A B C
1 2 1 0
2 2 1 2
3 3 0 3
4 0 0 4
5 0 4 0
6 0 3 3
7 0 0 3
```

The output we seek is a list containing the indices for each non-zero block. For example we want the output for column A to look like:

A[[1]] = 1 2 3 (because the first three positions in column A are non-zero and comprise one block of numbers).

Column B has two blocks, so the output should be:

B[[1]] = 1 2
B[[2]] = 5 6

Here's the nested apply function to create this list (lapply nested inside an apply on the columns).

```
> list.data = apply(data, 2, function(x)
+   lapply(unique(cumsum(c(FALSE, diff(which(x>0)) != 1)) + 1), function(y) {
+     which(x>0)[(cumsum(c(FALSE, diff(which(x>0)) != 1)) + 1) == y]
+   }))
> list.data
$A
$A[[1]]
1 2 3
1 2 3


$B
$B[[1]]
1 2
1 2

$B[[2]]
5 6
5 6


$C
$C[[1]]
2 3 4
2 3 4

$C[[2]]
6 7
6 7
```

Let's break it down step by step. The first thing is to find the non-zero indices. (Let's perform this on the second column, B)

```
> which(data[,2]>0)
1 2 5 6
1 2 5 6
```

Remember R reads down columns. Now let's find where non-consecutive differences occur (to separate the blocks).

```
> diff(which(data[,2]>0))
2 5 6
1 3 1
```

Notice that the breaks occur wherever the difference is not one. But because of how the 'diff' function works (its result is one element shorter than its input), we need to add a placeholder position in front. Let's also convert the differences to logicals.

```
> c(FALSE, diff(which(data[,2]>0))!=1)
          2     5     6
FALSE FALSE  TRUE FALSE
```

If we do a cumulative sum of these logicals (remember FALSE = 0, and TRUE = 1) we will have separated the blocks with integers starting at 1.

```
> cumsum(c(FALSE, diff(which(data[,2]>0))!=1))
  2 5 6
0 0 1 1
```

If we add one and find the unique ones, we get:

```
> unique(cumsum(c(FALSE, diff(which(data[,2]>0))!=1))+1)
[1] 1 2
```

Now let's wrap it up in a 'lapply' function to return a list for column B:

```
> lapply(unique(cumsum(c(FALSE, diff(which(data[,2]>0)) != 1)) + 1), function(y) {
+   which(data[,2]>0)[(cumsum(c(FALSE, diff(which(data[,2]>0)) != 1)) + 1) == y]
+ })
[[1]]
1 2
1 2

[[2]]
5 6
5 6
```

Now we just stick it in an apply function across the columns to find the indices of blocks of non-zero numbers. (Or across rows if you want.)

```
> apply(data, 2, function(x)
+   lapply(unique(cumsum(c(FALSE, diff(which(x>0)) != 1)) + 1), function(y) {
+     which(x>0)[(cumsum(c(FALSE, diff(which(x>0)) != 1)) + 1) == y]
+   }))
$A
$A[[1]]
1 2 3
1 2 3


$B
$B[[1]]
1 2
1 2

$B[[2]]
5 6
5 6


$C
$C[[1]]
2 3 4
2 3 4

$C[[2]]
6 7
6 7
```
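For comparison, the same block-finding idea translates directly into Python. This is just a sketch on the example matrix above, not a port of the R one-liner:

```python
def nonzero_blocks(col):
    """Return 1-based index lists for each contiguous run of non-zero values."""
    idx = [i + 1 for i, v in enumerate(col) if v > 0]   # 1-based, like R
    blocks, current = [], []
    for i in idx:
        if current and i != current[-1] + 1:   # a gap starts a new block
            blocks.append(current)
            current = []
        current.append(i)
    if current:
        blocks.append(current)
    return blocks

# The same matrix as above, stored column by column.
cols = {"A": [2, 2, 3, 0, 0, 0, 0],
        "B": [1, 1, 0, 0, 4, 3, 0],
        "C": [0, 2, 3, 4, 0, 3, 3]}

result = {name: nonzero_blocks(col) for name, col in cols.items()}
print(result)  # {'A': [[1, 2, 3]], 'B': [[1, 2], [5, 6]], 'C': [[2, 3, 4], [6, 7]]}
```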

Sometimes I get so proud of my nested apply functions... sniff... sniff. One day I hope to write a triple nested apply function. I hope someone finds this useful.