A Small Trick for Big Data in R

The other day I was writing a prediction script in R for a very large data set. While prepping the data, I needed to create a logical vector from a numeric vector, and I didn't want to go through the inconvenience of loading everything onto our R server if I didn't have to. I found a neat memory-saving trick that is also considerably faster than the usual code.

Let's say you want to create a logical vector, coded as 0/1, from a numeric vector. The result should be 0 where the numeric value is zero and 1 where it is not. The obvious way is a vectorized if-then statement using ifelse.

> test <- sample(0:10,1000000,replace=TRUE)
> system.time(ifelse(test!=0,1,0))
user system elapsed
1.31 0.00 1.31

This may not seem bad, but in my case I had much more data to deal with, and the regular ifelse call ran out of memory. After trying all the usual memory tricks (clearing the workspace, garbage collecting, etc.) with no success, I came up with a better plan.

> system.time(ceiling(pmax(0,pmin(test,1))))
user system elapsed
0.19 0.01 0.20

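Nice. This is a vectorized min (pmin) nested inside a vectorized max (pmax), wrapped in a ceiling function: pmin caps every value at 1, pmax raises anything below 0 up to 0, and ceiling rounds any fraction between 0 and 1 up to 1. It works for all non-negative values; a negative value would be clamped to 0 instead of becoming 1. But we can go quicker. Of course there are many ways to skin the proverbial cat in R, and the fastest I've found so far is: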

> system.time(test[test!=0]<-1)
user system elapsed
0.13 0.02 0.14

This works with all numeric values, negative ones included, is quicker, and uses the same storage space. The only downside of this type of command is that it overwrites the original numeric vector. To work around that, just duplicate the column or vector of interest beforehand.
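If you'd rather not overwrite anything, here is a minimal sketch of that workaround. The helper name to_indicator and the small vectors x and y are made up for illustration; they are not part of my original script. It also checks that the three approaches agree on non-negative input, and shows where the pmin/pmax version differs on negative input.

```r
# Hypothetical helper (not from the original script): build the 0/1 indicator
# on a copy so the caller's vector is left untouched.
to_indicator <- function(x) {
  out <- x            # R copies on modification, so x itself is not changed
  out[out != 0] <- 1  # same replacement trick, applied to the copy
  out
}

x <- c(0, 3, 0.5, 10, 0)   # small made-up example, non-negative values only
ind <- to_indicator(x)

# On non-negative input, all three approaches give the same result.
via_ifelse <- ifelse(x != 0, 1, 0)
via_clamp  <- ceiling(pmax(0, pmin(x, 1)))
stopifnot(identical(ind, via_ifelse), identical(ind, via_clamp))

# With a negative value, the clamp version returns 0 where replacement returns 1.
y <- c(-2, 0, 4)
neg_replace <- to_indicator(y)
neg_clamp   <- ceiling(pmax(0, pmin(y, 1)))
stopifnot(neg_replace[1] == 1, neg_clamp[1] == 0)
```

One thing to keep in mind: the copy costs as much memory as the original vector, so this only helps if you have room for both. Also, if the vector is stored as integer (as sample(0:10, ...) produces), assigning 1L instead of 1 keeps it an integer vector rather than converting it to double.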
