Webscraping in R: Part 1

It's hard to believe, but webscraping in R can be done really easily, with no fancy packages. I recently needed to scrape weather information from the web. After writing the program, I realized this task was quite easy compared to other webscrapes, so I decided to write it up here and show the steps. I'll post a more complicated webscrape soon, but for now this will get us started.

For simple weather data, Weather Underground provides an almanac page with an option to view the data in CSV form.

Let's say that I'm interested in the weather in Las Vegas on a specific date. The first step is to find the web address. After some simple searching, I found the CSV file at:

"http://www.wunderground.com/history/airport/KLAS/2014/3/09/DailyHistory.html?format=1"

It's important to look at the link and understand it. If need be, explore more CSV links for different dates and cities. It turns out that the 'KLAS' part of the link represents the Las Vegas location, and the '...2014/3/09...' part is the year/month/day of the date queried.

Knowing this, we can write a simple R script to retrieve the data and save it in a data frame.


##----Set working directory----
setwd("C:/Users/Nick/FromData/Rcode")
##----Set Location Code----
loc_code = "KLAS"
##----Set Retrieval Date----
retrieval_date = as.Date("2014-03-09") # Could also be today: 'Sys.Date()'

Now we need to piece together the site address. To do this we'll use the 'paste' R command.


##----Set Site Address----
# Site = "http://www.wunderground.com/history/airport/KLAS/2014/03/09/DailyHistory.html?format=1"
site_prefix = "http://www.wunderground.com/history/airport/"
site_suffix = "/DailyHistory.html?format=1"
weather_link = paste(site_prefix,loc_code,"/",gsub("-","/",retrieval_date),site_suffix,sep="")
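
If you want more than one day, the same pieces can be wrapped in a small function. This is only a sketch: the name 'build_weather_link' is hypothetical and not part of the original script, and it uses 'format' instead of 'gsub' to build the date part, which should give the same zero-padded "2014/03/09" string.

```r
# Hypothetical helper (not in the original script): build the CSV address
# for any airport code and date, or for a whole vector of dates.
build_weather_link <- function(loc_code, retrieval_date) {
  site_prefix <- "http://www.wunderground.com/history/airport/"
  site_suffix <- "/DailyHistory.html?format=1"
  # format() turns a Date into "YYYY/MM/DD"
  date_part <- format(as.Date(retrieval_date), "%Y/%m/%d")
  # paste0() is paste() with sep = ""
  paste0(site_prefix, loc_code, "/", date_part, site_suffix)
}

# One link per day for the week starting on the retrieval date
week_dates <- seq(as.Date("2014-03-09"), by = "day", length.out = 7)
week_links <- build_weather_link("KLAS", week_dates)
```

Because 'paste0' and 'format' are both vectorized, passing a vector of dates gives back one address per date without a loop.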

Let's go out and get that data! We'll use the R command 'readLines' to take the information from the web.


##----Retrieve Web Info----
weather_info = readLines(weather_link)[-1]
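
The '[-1]' above drops the first line no matter what it contains. A slightly more defensive variation, sketched here on made-up lines rather than the real download, is to drop only the lines that are actually empty. The sample text and the names 'sample_lines', 'raw_lines' and 'clean_lines' are assumptions for illustration.

```r
# Made-up stand-in for the downloaded text: a blank first line, a header,
# and two data rows (the values are illustrative only).
sample_lines <- c("", "Time,TemperatureF", "12:56 AM,57.0", "1:56 AM,55.9")

# readLines() works on any connection, so a textConnection lets us try
# this without touching the network.
con <- textConnection(sample_lines)
raw_lines <- readLines(con)
close(con)

# Keep only the lines that contain something besides whitespace
clean_lines <- raw_lines[nzchar(trimws(raw_lines))]
```

This way the header always ends up as the first element, whether or not a blank line was there to begin with.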

The '[-1]' is included because R sometimes sticks in a blank first line when using 'readLines' on a website. Now we have a header line followed by 24 rows, each stored as one string. Each string contains all the information we need (14 metrics), separated by commas. Let's split the strings up and extract the headers.


weather_info = strsplit(weather_info,",")
headers = weather_info[[1]]
weather_info = weather_info[-1]
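
To see what these three lines do, here is the same pattern on a tiny made-up example. The column names and values are assumptions chosen for illustration, not the real feed.

```r
# Three made-up lines: one header line and two data rows
toy_lines <- c("Time,TemperatureF,Conditions",
               "12:56 AM,57.0,Clear",
               "1:56 AM,55.9,Clear")

toy_split <- strsplit(toy_lines, ",")  # list with one character vector per line
toy_headers <- toy_split[[1]]          # first element holds the column names
toy_rows <- toy_split[-1]              # remaining elements are the data rows
```

One thing to watch for: 'strsplit' drops an empty field at the very end of a string, so a line ending in a comma comes back one element short.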

Now we have a 24 element list that contains the vectors we want. Let's transform it into a data frame and label the columns with the headers we saved.


weather_info = do.call(rbind.data.frame,weather_info)
names(weather_info)= headers
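
As an alternative sketch (not what the script above does), plain 'rbind' can stack the vectors into a character matrix first, and 'as.data.frame' with 'stringsAsFactors = FALSE' then keeps every column as a string. The toy values below are made up, and this assumes every row has the same number of fields.

```r
# Made-up headers and rows, shaped like the output of the strsplit step
toy_headers <- c("Time", "TemperatureF", "Conditions")
toy_rows <- list(c("12:56 AM", "57.0", "Clear"),
                 c("1:56 AM", "55.9", "Clear"))

# rbind() stacks the vectors into a character matrix, one row per vector
toy_matrix <- do.call(rbind, toy_rows)

# Convert to a data frame without turning the strings into factors
toy_df <- as.data.frame(toy_matrix, stringsAsFactors = FALSE)
names(toy_df) <- toy_headers
```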

We are essentially done, except that every column of the data frame is a factor. Not ideal. What we really need is to look through it, decide which columns should be numeric and which should be strings, and convert them accordingly. We do this by first converting everything to a string (using the lapply and as.character functions). Then we'll specify which columns are numeric, convert them, and do some date/time cleanup.


##----Convert Data Frame Columns----
weather_info <- data.frame(lapply(weather_info, as.character), stringsAsFactors=FALSE)
numeric_cols = c(2,3,4,5,6,8,9,10,13)
weather_info[numeric_cols] = lapply(weather_info[numeric_cols],as.numeric)
weather_info[is.na(weather_info)]=0
colnames(weather_info)[14]="Date"
weather_info$Date = as.Date(substr(weather_info$Date,1,10))
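
Here is the same conversion on a small made-up data frame, so you can see what each step does. The column name 'Stamp' and all values are assumptions for illustration. Since replacing NA with 0 makes a missing reading look like a real zero, this sketch counts the NAs before deciding what to do with them.

```r
# Made-up data: string columns, with one non-numeric temperature entry
toy_df <- data.frame(TemperatureF = c("57.0", "55.9", "N/A"),
                     Stamp = c("2014-03-09 08:56:00",
                               "2014-03-09 09:56:00",
                               "2014-03-09 10:56:00"),
                     stringsAsFactors = FALSE)

# as.numeric() returns NA (with a warning) for strings it cannot parse
toy_df$TemperatureF <- suppressWarnings(as.numeric(toy_df$TemperatureF))

# Count the missing readings before overwriting them with anything
n_missing <- sum(is.na(toy_df$TemperatureF))

# The first 10 characters of the timestamp are the "YYYY-MM-DD" date
toy_df$Date <- as.Date(substr(toy_df$Stamp, 1, 10))
```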

Done. The complete R script can be found under my GitHub account.

For completeness, and because I like graphs, let's look at what the hourly temperature was like on March 9th, 2014.


plot(weather_info$TemperatureF,type="l",main="Temperature",ylab="Temp",xlab="Hour")

[Figure: Temp_LV_March11th (hourly temperature plot)]

Yes, it seems that the Las Vegas "winter" isn't really a winter at all. Seventy-five degrees in early March.
