For the next few posts, I will be talking about tools and tricks for working with big data. I think the term 'big data' gets thrown around a bit much these days. The most common definition is any data whose size presents problems when analyzing it. At work I have access to a server and Hadoop, so the sizes that give me problems are a lot bigger than at home on my workstation. There are a few ways to deal with big data:
- Subset the data and work with only the data you need. Surprisingly, many people don't realize how useful this is. Get rid of those extra columns! Get rid of missing data! Get rid of outliers! Many times, reducing the data set properly solves the big-data problem outright, and not much information is lost.
- Get bigger hardware. My primary tool, R, is a memory-based program: all of its work is done in RAM. That makes it fast but limited. Not quite 'Excel' limited, but limited nonetheless. Currently, I usually work with up to about 4-6 GB of data in memory; since I do have server access at work, I can deal with up to 60-90 GB. Still, the more data, the slower any program will run. If the analysis is quite complicated, we should move on to other options.
- Use software that can handle big data. This means Hadoop (Hive, Pig, Impala), Python, C++/C#, and so on. Some also argue that the ff/bigmemory packages in R work well, and they do up to a point. After trying to get more complicated analyses to work in ff/bigmemory, though, I just feel as if I'm tricking R and duct-taping things together.
- Be smart. If you write programs that are efficient and fast, you will thank yourself later. Stay away from loops! Learn your apply functions. This kind of smart programming can get you quite far.
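The subsetting advice above takes only a few lines of base R. Everything here is hypothetical for illustration: the data frame, the column names, the NA, and the outlier cutoff of 100.

```r
# A toy data frame; the 'junk' column, the NA, and the value 250 are made up
df <- data.frame(id    = 1:6,
                 value = c(1.2, 2.1, NA, 3.3, 250, 2.8),
                 junk  = letters[1:6])

df <- df[, c("id", "value")]   # drop the columns you don't need
df <- na.omit(df)              # drop rows with missing data
df <- df[df$value < 100, ]     # drop an obvious outlier
nrow(df)                       # 4 rows left of the original 6
```

On a real data set, each of those one-liners can shrink gigabytes down to something that fits comfortably in memory.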
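To see where you stand against those memory limits, base R's `object.size()` reports how much RAM an object occupies. The vector below is just an illustration:

```r
# One million doubles at 8 bytes each is roughly 8 MB of RAM
x <- rnorm(1e6)
print(object.size(x), units = "Mb")
```

Scaling that arithmetic up is a quick sanity check before you try to load a file: a billion doubles is already about 8 GB, before R makes any copies.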
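On the loops-versus-apply point, here is a minimal comparison. Both versions compute row sums of a small matrix, but the `apply()` version drops the preallocation and index bookkeeping entirely:

```r
m <- matrix(as.numeric(1:6), nrow = 2)

# Loop version: preallocate the result, then fill it element by element
row_sums_loop <- numeric(nrow(m))
for (i in seq_len(nrow(m))) {
  row_sums_loop[i] <- sum(m[i, ])
}

# apply() version: one call over the row margin (MARGIN = 1)
row_sums_apply <- apply(m, 1, sum)

identical(row_sums_loop, row_sums_apply)  # TRUE
```

For row sums specifically, the fully vectorized `rowSums(m)` is faster still; `apply()` shines when there is no built-in vectorized equivalent.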
Next, I will be posting on being smart in R with short-circuiting logical statements. They are way cool.