Importing Weather Data from Wunderground

Submitted by dylan on Mon, 2011-05-09 22:10.

Figure: Wunderground Example

The Wunderground.com website offers several creative interfaces to current and historic weather information. One of the more interesting features is the URL-based interface to personal weather stations. As far as I can tell, the website returns hourly data from a personal weather station for only a single day per request. I wanted an entire year's worth of data, so it made sense to wrap the process of fetching a single day's worth of data from a named station in an R function. That makes it possible to quickly query the Wunderground website for arbitrary chunks of data. A semi-tested function, along with some examples, is posted below. Enjoy!

 
Example Usage

# be sure to load the function from below first
# get a single day's worth of (hourly) data
w <- wunder_station_daily('KCAANGEL4', as.Date('2011-05-05'))

# get data for a range of dates
library(plyr)
date.range <- seq.Date(from=as.Date('2009-01-01'), to=as.Date('2011-05-06'), by='1 day')

# pre-allocate list
l <- vector(mode='list', length=length(date.range))

# loop over dates, and fetch data
for(i in seq_along(date.range))
  {
  print(date.range[i])
  l[[i]] <- wunder_station_daily('KCAANGEL4', date.range[i])
  }

# stack elements of list into DF, filling missing columns with NA
d <- ldply(l)

# save to CSV
write.csv(d, file=gzfile('KCAANGEL4.csv.gz'), row.names=FALSE)

 
Function Definitions

wunder_station_daily <- function(station, date)
  {
  base_url <- 'http://www.wunderground.com/weatherstation/WXDailyHistory.asp?'
 
  # parse date
  m <- as.integer(format(date, '%m'))
  d <- as.integer(format(date, '%d'))
  y <- format(date, '%Y')
 
  # compose final url
  final_url <- paste(base_url,
  'ID=', station,
  '&month=', m,
  '&day=', d,
  '&year=', y,
  '&format=1', sep='')
 
  # reading in as raw lines from the web server
  # contains <br> tags on every other line
  u <- url(final_url)
  the_data <- readLines(u)
  close(u)
 
  # only parse responses that contain more than 5 lines
  if(length(the_data) > 5 )
        {
        # remove the first and last lines
        the_data <- the_data[-c(1, length(the_data))]
       
        # remove odd numbers starting from 3 --> end
        the_data <- the_data[-seq(3, length(the_data), by=2)]
       
        # extract header and cleanup
        the_header <- the_data[1]
        the_header <- make.names(strsplit(the_header, ',')[[1]])
       
        # convert to CSV, without header
        tC <- textConnection(paste(the_data, collapse='\n'))
        the_data <- read.csv(tC, as.is=TRUE, row.names=NULL, header=FALSE, skip=1)
        close(tC)
       
        # remove the last column, created by trailing comma
        the_data <- the_data[, -ncol(the_data)]
       
        # assign column names
        names(the_data) <- the_header
       
        # convert Time column into properly encoded date time
        the_data$Time <- as.POSIXct(strptime(the_data$Time, format='%Y-%m-%d %H:%M:%S'))
       
        # remove UTC and software type columns
        the_data$DateUTC.br. <- NULL
        the_data$SoftwareType <- NULL
       
        # sort and fix rownames
        the_data <- the_data[order(the_data$Time), ]
        row.names(the_data) <- 1:nrow(the_data)
       
        # done
        return(the_data)
        }
  }
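
To make the line-cleaning steps inside the function easier to follow, here is a minimal offline sketch. The `raw` lines are made up for illustration: their layout is inferred from the comments in the function (a header ending in `<br>`, a `<br>` line after every record, a trailing comma on each record), not captured from the server, so the real response may differ.

```r
# Made-up response lines; the layout is an assumption based on the
# comments in wunder_station_daily(), not real server output.
raw <- c(
  '',
  'Time,TemperatureF,SoftwareType,DateUTC<br>',
  '2011-05-05 00:05:00,52.1,ExampleSoft,2011-05-05 07:05:00,',
  '<br>',
  '2011-05-05 00:10:00,51.9,ExampleSoft,2011-05-05 07:10:00,',
  '<br>',
  '<!-- made-up trailing line -->'
)

# drop the first and last lines
x <- raw[-c(1, length(raw))]

# drop the '<br>' lines, which now sit at positions 3, 5, ...
x <- x[-seq(3, length(x), by = 2)]

# the first remaining line is the header; make.names() turns
# 'DateUTC<br>' into the syntactically valid name 'DateUTC.br.'
hdr <- make.names(strsplit(x[1], ',')[[1]])

# parse the records, skipping the header line
parsed <- read.csv(text = paste(x, collapse = '\n'), as.is = TRUE,
                   header = FALSE, skip = 1)

# the trailing comma produces an empty final column; remove it
parsed <- parsed[, -ncol(parsed)]
names(parsed) <- hdr

str(parsed)
```

This also shows where the odd-looking `DateUTC.br.` column name in the function comes from.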


Hi- First off, thanks a bunch

Hi-
First off, thanks a bunch for this.
I'm having some trouble running this because (I think) outputs on different days are of different lengths (apparently there are some gaps in the data). When I run as is, I get the following error:

Error in list_to_dataframe(res, attr(.data, "split_labels")) :
Results do not have equal lengths

str(l) indicates that some elements in the list are chr[1:73], some are [1:70], and some are [1:74].

I'm somewhat new to R and I've tried googling ways to deal with this and I'm coming up with nothing. There must be a very simple line I'm missing somewhere. Any ideas?

Thanks,
Elizabeth
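
One possible cause of the error above is that some list elements are character vectors rather than data frames, which is what the `chr[1:73]` output suggests. The sketch below is a guess at a workaround, not a confirmed fix: `stack_days` is a hypothetical base-R helper that keeps only the data frame elements and pads missing columns with NA before stacking.

```r
# Hypothetical helper (not part of the original post): stack a list of
# daily results, ignoring anything that is not a data frame.
stack_days <- function(l) {
  # flag the elements that really are data frames
  ok <- vapply(l, is.data.frame, logical(1))
  frames <- l[ok]
  if (length(frames) == 0) return(NULL)

  # union of all column names seen across days
  all_cols <- unique(unlist(lapply(frames, names)))

  # add any missing columns as NA, and put columns in a common order
  padded <- lapply(frames, function(df) {
    for (nm in setdiff(all_cols, names(df))) df[[nm]] <- rep(NA, nrow(df))
    df[, all_cols, drop = FALSE]
  })

  out <- do.call(rbind, padded)
  row.names(out) <- NULL
  out
}

# small made-up list: two data frames, one stray character vector, one NULL
example_list <- list(
  data.frame(a = 1:2, b = c('x', 'y'), stringsAsFactors = FALSE),
  c('raw', 'lines'),
  NULL,
  data.frame(a = 3L)
)
stacked <- stack_days(example_list)
```

Running `which(!vapply(l, is.data.frame, logical(1)))` on the real list shows which days came back as something other than a data frame, which is probably the first thing to check.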

I extended it a bit.

I extended this a bit, based somewhat on the previous comments.

Another Base URL that also works

For others who are trying to use this code:
Please note that you can get the data for ANY AIRPORT, using its 3-letter code.
Try:

http://www.wunderground.com/history/airport/K<3-letter airport code>/2010/11/18/DailyHistory.html?format=1

So you'd just change the base URL and it should work. However, note that the data is much more sparse, only 1 or 2 data points per hour.
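
As a rough sketch of the URL pattern described above, the helper below builds the airport address from a 3-letter code and a date. `airport_history_url` is a hypothetical name, and whether the server expects zero-padded month and day values is an assumption; the single example URL above does not settle it.

```r
# Hypothetical helper: compose the airport history URL from the pattern
# quoted in this comment. Zero-padding of month/day is an assumption.
airport_history_url <- function(code, date) {
  sprintf('http://www.wunderground.com/history/airport/K%s/%s/DailyHistory.html?format=1',
          toupper(code),
          format(date, '%Y/%m/%d'))
}

u <- airport_history_url('SFO', as.Date('2010-11-18'))
u
```

The response from this address may not have the same line layout as the personal weather station data, so the parsing steps in wunder_station_daily() might need adjusting.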

Your code doesn't work to

Your code doesn't work to retrieve wunderground data for KCSANFR012. I'd like to get data for a series of days and then see if it has been cloudy more often at weekends or weekdays as a percentage of the total number of each.
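
Once data are available, the weekend versus weekday comparison described above could be sketched as follows. The `days` data frame and its `cloudy` column are made up for illustration; the personal weather station output may not contain a cloudiness field at all, so this only shows the grouping step.

```r
# Made-up daily summary: one row per day with a hypothetical logical
# 'cloudy' flag. 2011-05-02 is a Monday, so this spans Monday to Sunday.
days <- data.frame(
  date   = as.Date('2011-05-02') + 0:6,
  cloudy = c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE)
)

# day of week as a number: 0 = Sunday, ..., 6 = Saturday
wday <- as.POSIXlt(days$date)$wday
days$day_type <- ifelse(wday %in% c(0, 6), 'weekend', 'weekday')

# percentage of cloudy days within each group
pct_cloudy <- tapply(days$cloudy, days$day_type, mean) * 100
pct_cloudy
```

Using the numeric `wday` field instead of weekdays() avoids depending on the locale's day names.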

KCSANFR012

Are you sure that the station ID is correct? It appears that there are no data associated with this ID:

http://www.wunderground.com/weatherstation/WXDailyHistory.asp?ID=KCSANFR012&month=5&day=10&year=2011&format=1

Very nice. Maybe

Very nice. Maybe initializing the list with 'l <- list()' is cleaner:

1. it avoids the unnecessary 'mode' and 'length' entries in the final data frame
2. it works with only one date in date.range (otherwise ldply chokes on combining the list elements)

With item 2 we can write a wrapper like
wunder_station <- function(station='KMYSTATION',start=Sys.Date(),end=start)
with defaults of just today.

And maybe paste 'station', 'start', and 'end' in the outfile name
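
A minimal sketch of what such a wrapper might look like, using the signature suggested above. The `fetch` argument and the `wunder_outfile` helper are assumptions added for illustration: `fetch` defaults to the wunder_station_daily() function from the post but lets a stand-in be used offline. Stacking with rbind assumes every day returns the same columns; ldply() from the post could be swapped in when they differ.

```r
# Hypothetical wrapper; 'fetch' is an added argument (not in the post)
# so the function can be tried without hitting the web server.
wunder_station <- function(station = 'KMYSTATION', start = Sys.Date(),
                           end = start, fetch = wunder_station_daily) {
  date.range <- seq.Date(from = as.Date(start), to = as.Date(end), by = '1 day')

  # index into date.range so each element keeps its Date class
  l <- lapply(seq_along(date.range),
              function(i) fetch(station, date.range[i]))

  # drop days with no usable result
  l <- Filter(is.data.frame, l)
  if (length(l) == 0) return(NULL)

  # assumes identical columns on every day
  do.call(rbind, l)
}

# Hypothetical helper: build an output file name from station and dates
wunder_outfile <- function(station, start, end = start) {
  paste(station, format(start, '%Y-%m-%d'), format(end, '%Y-%m-%d'),
        sep = '_') |> paste0('.csv.gz')
}

# offline stand-in for wunder_station_daily(), returning made-up values
fake_fetch <- function(station, date) {
  data.frame(Time = as.POSIXct(paste(format(date), '12:00:00'), tz = 'UTC'),
             TemperatureF = 50)
}

three_days <- wunder_station('KCAANGEL4', as.Date('2011-05-05'),
                             as.Date('2011-05-07'), fetch = fake_fetch)
one_day <- wunder_station('KCAANGEL4', as.Date('2011-05-05'),
                          fetch = fake_fetch)
```

With the real fetch function, the result could then be saved with something like write.csv(three_days, file=gzfile(wunder_outfile('KCAANGEL4', as.Date('2011-05-05'), as.Date('2011-05-07'))), row.names=FALSE). The `|>` pipe needs R 4.1 or later; nesting the two paste calls works on older versions.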

Thank you for this.

Dave

Updates

Thanks Dave for the suggestions. I realized that I had made a mistake in how the list was being pre-allocated. The following works much better: l <- vector(mode='list', length=length(date.range)). ldply() shouldn't choke on the updated version of the pre-allocated list, unless the station you are querying gives results that are drastically different than the one I tried... I'll give your second example a try as well.

Your revised version also works for

Your revised version also works for me, because it starts off at the right length. And my suggestion works since it starts off at zero length and builds up as it goes along. The original started at length 2, so there was a problem if I only gave it one day of data.
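
For anyone following along, the three starting points discussed in this thread can be compared directly. The third line is only a guess at what the original mistake may have looked like, based on the mention of 'mode' and 'length' entries and a starting length of 2; the original code is not shown here.

```r
n <- 3  # pretend date.range has three dates

# corrected version from the post: n empty slots
a <- vector(mode = 'list', length = n)

# the list() suggestion: zero slots, grows as elements are assigned
b <- list()

# GUESS at the original mistake: list() treats these as two named
# elements rather than as instructions for building the list
c_guess <- list(mode = 'list', length = n)

length(a)        # 3
length(b)        # 0
length(c_guess)  # 2
names(c_guess)   # "mode" "length"
```

If the guess is right, it would explain both symptoms: a one-day run leaves the stray 'length' element in the list, and the names carry through to the stacked result.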

pre-allocated list

The pre-allocated list will consume less memory and CPU time when working with massive collections of weather data. My previous post used an incorrect initialization. Thanks for the correction.