Importing Weather Data from Wunderground
Submitted by dylan on Mon, 2011-05-09 22:10.
The Wunderground.com website offers several creative interfaces to current and historic weather information. One of the more interesting features is the URL-based interface to personal weather stations. As far as I can tell, the Wunderground website only returns hourly data for a single day from personal weather stations... I wanted an entire year's worth of data, so it made sense to abstract the process of fetching a single day's worth of data from a named station into an R function. In this way, it is possible to quickly query the Wunderground website for arbitrary chunks of data. A semi-tested function, along with some examples, is posted below. Enjoy!

# be sure to load the function from below first

# get a single day's worth of (hourly) data
w <- wunder_station_daily('KCAANGEL4', as.Date('2011-05-05'))

# get data for a range of dates
library(plyr)
date.range <- seq.Date(from=as.Date('2009-01-01'), to=as.Date('2011-05-06'), by='1 day')

# pre-allocate list
l <- vector(mode='list', length=length(date.range))

# loop over dates, and fetch data
for(i in seq_along(date.range)) {
  print(date.range[i])
  l[[i]] <- wunder_station_daily('KCAANGEL4', date.range[i])
}

# stack elements of list into DF, filling missing columns with NA
d <- ldply(l)

# save to CSV
write.csv(d, file=gzfile('KCAANGEL4.csv.gz'), row.names=FALSE)

wunder_station_daily <- function(station, date) {
  base_url <- 'http://www.wunderground.com/weatherstation/WXDailyHistory.asp?'

  # parse date
  m <- as.integer(format(date, '%m'))
  d <- as.integer(format(date, '%d'))
  y <- format(date, '%Y')

  # compose final url
  final_url <- paste(base_url, 'ID=', station, '&month=', m, '&day=', d, '&year=', y, '&format=1', sep='')

  # reading in as raw lines from the web server
  # contains <br> tags on every other line
  u <- url(final_url)
  the_data <- readLines(u)
  close(u)

  # only keep responses with more than 5 lines of data
  # (otherwise the function falls through and returns NULL)
  if(length(the_data) > 5 ) {
    # remove the first and last lines
    the_data <- the_data[-c(1, length(the_data))]

    # remove odd numbers starting from 3 --> end
    the_data <- the_data[-seq(3, length(the_data), by=2)]

    # extract header and cleanup
    the_header <- the_data[1]
    the_header <- make.names(strsplit(the_header, ',')[[1]])

    # convert to CSV, without header
    tC <- textConnection(paste(the_data, collapse='\n'))
    the_data <- read.csv(tC, as.is=TRUE, row.names=NULL, header=FALSE, skip=1)
    close(tC)

    # remove the last column, created by trailing comma
    the_data <- the_data[, -ncol(the_data)]

    # assign column names
    names(the_data) <- the_header

    # convert Time column into properly encoded date time
    the_data$Time <- as.POSIXct(strptime(the_data$Time, format='%Y-%m-%d %H:%M:%S'))

    # remove UTC and software type columns
    the_data$DateUTC.br. <- NULL
    the_data$SoftwareType <- NULL

    # sort and fix rownames
    the_data <- the_data[order(the_data$Time), ]
    row.names(the_data) <- 1:nrow(the_data)

    # done
    return(the_data)
  }
}
I extended it a bit.
I extended this a bit, based somewhat on the previous comments.
Another Base URL that also works
For others who are trying to use this code:
Please note that you can get the data for ANY AIRPORT, using its 3-letter code.
Try:
http://www.wunderground.com/history/airport/K<3-letter airport code>/2010/11/18/DailyHistory.html?format=1
So you'd just change the base URL and it should work. However, note that the data are much sparser: only 1 or 2 data points per hour.
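As a rough sketch of the change described above, the airport URL could be composed like this. The function name `airport_history_url` and its arguments are made up for illustration; the URL pattern is the one quoted in the comment. Whether the parsing steps in wunder_station_daily() work unchanged on what this URL returns is not something this sketch checks.

```r
# hypothetical helper: build the airport history URL quoted above
# 'code' is the 3-letter airport code; the 'K' prefix follows the comment
airport_history_url <- function(code, date) {
  base_url <- 'http://www.wunderground.com/history/airport/'
  # format the date as YYYY/MM/DD, matching the example URL
  date_path <- format(date, '%Y/%m/%d')
  paste(base_url, 'K', code, '/', date_path, '/DailyHistory.html?format=1', sep='')
}

# example: airport code 'SFO', same date as the URL above
u <- airport_history_url('SFO', as.Date('2010-11-18'))
print(u)
```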
Your code doesn't work
Your code doesn't retrieve Wunderground data for KCSANFR012. I'd like to get data for a series of days and then see whether it has been cloudy more often on weekends or on weekdays, as a percentage of the total number of each.
KCSANFR012
Are you sure that the station ID is correct? It appears that there are no data associated with this ID:
http://www.wunderground.com/weatherstation/WXDailyHistory.asp?ID=KCSANFR012&month=5&day=10&year=2011&format=1
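For the weekend versus weekday comparison asked about above, something along these lines might work once data from a valid station are in hand. This is only a sketch: it assumes the data frame has a 'Conditions' text column alongside 'Time' (not confirmed for every station), the sample rows are invented just to exercise the code, and it counts individual records rather than whole days.

```r
# assumption: 'd' has a POSIXct 'Time' column and a character 'Conditions'
# column; these four rows are made up (Sat, Sun, Mon, Tue)
d <- data.frame(
  Time = as.POSIXct(c('2011-05-07 10:00:00', '2011-05-08 10:00:00',
                      '2011-05-09 10:00:00', '2011-05-10 10:00:00'), tz='UTC'),
  Conditions = c('Overcast', 'Mostly Cloudy', 'Clear', 'Partly Cloudy'),
  stringsAsFactors = FALSE
)

# flag records whose condition text mentions clouds or overcast skies
d$cloudy <- grepl('cloud|overcast', d$Conditions, ignore.case=TRUE)

# wday: 0 = Sunday ... 6 = Saturday, independent of locale
wd <- as.POSIXlt(d$Time)$wday
d$day_type <- ifelse(wd %in% c(0, 6), 'weekend', 'weekday')

# percentage of cloudy records within each day type
pct_cloudy <- tapply(d$cloudy, d$day_type, mean) * 100
print(pct_cloudy)
```

Collapsing to one value per day first (for example, "cloudy if any record that day was cloudy") would be a different, equally defensible choice.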
Very nice
Very nice. Maybe initializing the list with 'l <- list()' is cleaner:
1. it avoids the unnecessary 'mode' and 'length' ids in the final data frame
2. it works with only one date in date.range (otherwise ldply chokes on combining the list elements)
With item 2 we can write a wrapper like
wunder_station <- function(station='KMYSTATION',start=Sys.Date(),end=start)
with defaults of just today.
And maybe paste 'station', 'start', and 'end' into the output file name.
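A minimal sketch of such a wrapper follows. It is not the commenter's code: the 'fetch_fun' argument and the helper name 'wunder_outfile' are made up here, and the rows are stacked with base R's rbind, which assumes every day returns the same columns (ldply from the post fills missing columns instead).

```r
# hypothetical wrapper around wunder_station_daily(); 'fetch_fun' is an added
# argument so that a different fetcher can be swapped in
wunder_station <- function(station='KMYSTATION', start=Sys.Date(), end=start,
                           fetch_fun=wunder_station_daily) {
  date.range <- seq.Date(from=as.Date(start), to=as.Date(end), by='1 day')
  # start from an empty list and let it grow, as suggested above
  l <- list()
  for(i in seq_along(date.range)) {
    res <- fetch_fun(station, date.range[i])
    # days with no data come back as NULL; skip them
    if(!is.null(res)) l[[length(l) + 1]] <- res
  }
  # assumes identical columns for every day; use plyr::ldply(l) otherwise
  do.call(rbind, l)
}

# hypothetical helper: compose an output file name from station, start, and end
wunder_outfile <- function(station, start, end=start) {
  paste(station, '_', format(as.Date(start)), '_', format(as.Date(end)),
        '.csv.gz', sep='')
}

# usage (needs wunder_station_daily() from the post and network access):
# d <- wunder_station('KCAANGEL4', as.Date('2011-05-05'))
# write.csv(d, file=gzfile(wunder_outfile('KCAANGEL4', '2011-05-05')), row.names=FALSE)
```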
Thank you for this.
Dave
Updates
Thanks Dave for the suggestions. I realized that I had made a mistake in how the list was being pre-allocated. The following works much better:

l <- vector(mode='list', length=length(date.range))

ldply() shouldn't choke on the updated version of the pre-allocated list, unless the station you are querying gives results that are drastically different from the one I tried... I'll give your second example a try as well.
Your revised version also works
Your revised version also works for me, because it starts off at the right length. And my suggestion works since it starts off at zero length and builds up as it goes along. The original started at length 2, so there was a problem if I only gave it one day of data.
pre-allocated list
The pre-allocated list will consume less memory and CPU time when working with massive collections of weather data. My previous post used an incorrect initialization. Thanks for the correction.
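The difference between the initializations discussed in this thread can be seen in a few lines. The thread does not show the original, incorrect call; a call of the form list(mode=..., length=...) is only a guess at what produced the length-2 list and the 'mode' and 'length' ids mentioned above.

```r
n <- 3

# pre-allocated: n empty (NULL) slots, no names
good <- vector(mode='list', length=n)
print(length(good))

# list() treats its arguments as elements, so this yields a 2-element
# list named 'mode' and 'length', regardless of n (guessed original mistake)
bad <- list(mode='list', length=n)
print(names(bad))

# empty list that grows as elements are assigned
grow <- list()
grow[[1]] <- 'first day'
print(length(grow))
```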