Project 2: Holiday weather

by Rob Griffiths and Andres Muniz, 11 September 2015

This is the project notebook for Week 2 of The Open University's Learn to code for Data Analysis course.

There is nothing I like better than taking a holiday. In the winter I like to have a two-week break in a country where I can be guaranteed sunny, dry days. In the summer I like to have two weeks off relaxing in my garden in London. However, I'm often disappointed because I pick a fortnight when the weather is dull and it rains. So in this project I am going to use the historic weather data from the Weather Underground for London to try to predict two good weather weeks to take off as holiday next summer. Of course, the weather in the summer of 2016 may be very different from 2014, but it should give me some indication of when would be a good time to take a summer break.

Getting the data

If you haven't already downloaded the dataset for London, right-click on the following URL and choose 'Open Link in New Window' (or similar, depending on your browser):

http://www.wunderground.com/history

When the new page opens, start typing 'London' in the 'Location' input box. When the pop-up menu offers the option 'London, United Kingdom', select it and then click on 'Submit'.

When the next page opens, click on the 'Custom' tab, select the time period From: 1 January 2014 to: 31 December 2014 and then click on 'Get History'. The data for that year should then be displayed. Scroll to the end of the data and then right-click on the blue link labelled 'Comma Delimited File':

  • if you are using the Safari Browser choose Download Linked File As ...
  • if you are using the Chrome Browser choose Save Link As ...

Then, in the file dialogue that appears, save the file with its default name of 'CustomHistory' to the folder you created for this course, where this notebook is located. Once the file has been downloaded, rename it from 'CustomHistory.html' to 'London_2014.csv'.

Now load the CSV file into a dataframe, making sure that any extra spaces are skipped. This notebook loads the equivalent file for Gijon, saved as 'Gijon_2014.csv':

In [2]:
from pandas import *
gijon = read_csv('Gijon_2014.csv', skipinitialspace=True)
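To see what skipinitialspace=True does, here is a minimal sketch on a small made-up CSV string. The column names and the two rows are invented for illustration and are not taken from the Weather Underground file. Without the option, the space after each comma ends up in the column names.

```python
from io import StringIO
import pandas as pd

# Hypothetical sample: note the space after each comma.
sample = "Date, Events, MeanTempC\n2014-7-1, Rain, 18\n2014-7-2, Fog, 19\n"

raw = pd.read_csv(StringIO(sample))
clean = pd.read_csv(StringIO(sample), skipinitialspace=True)

print(list(raw.columns))    # names after the first keep their leading space
print(list(clean.columns))  # leading spaces are skipped
```

With the spaces skipped, a column can be selected as clean['Events'] rather than raw[' Events'].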

Cleaning the data

First we need to clean up the data. I'm not going to make use of 'WindDirDegrees' in my analysis, but you might in yours, so we'll rename 'WindDirDegrees<br />' to 'WindDirDegrees'.

In [3]:
gijon = gijon.rename(columns={'WindDirDegrees<br />' : 'WindDirDegrees'})

Next, remove the <br /> HTML line breaks from the values in the 'WindDirDegrees' column:

In [4]:
gijon['WindDirDegrees'] = gijon['WindDirDegrees'].str.rstrip('<br />')
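One detail worth knowing: rstrip('<br />') does not remove the literal suffix '<br />'. It strips any trailing characters drawn from the set '<', 'b', 'r', ' ', '/' and '>'. That is safe here because the wind direction values end in digits. A small sketch with made-up values shows both the intended effect and the pitfall:

```python
import pandas as pd

# Hypothetical values in the style of the raw 'WindDirDegrees' column.
wind = pd.Series(['234<br />', '90<br />', '7<br />'])

stripped = wind.str.rstrip('<br />')
print(stripped.tolist())

# rstrip treats its argument as a set of characters, so a value that
# itself ends in 'b' or 'r' would lose those characters too.
print('amber<br />'.rstrip('<br />'))
```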

and change the values in the 'WindDirDegrees' column to float64:

In [5]:
gijon['WindDirDegrees'] = gijon['WindDirDegrees'].astype('float64')   

We definitely need to convert the date values to the datetime64 type. In the Gijon file the date column appears to be named 'CET' rather than 'GMT', so we convert the 'CET' values and store the result in a new 'GMT' column:

In [7]:
gijon['GMT'] = to_datetime(gijon['CET'])
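As a minimal sketch of the conversion, the following applies to_datetime to a few made-up date strings. The unpadded year-month-day style is an assumption about how the file writes its dates.

```python
import pandas as pd

# Hypothetical date strings (assumed format, not read from the real file).
dates = pd.Series(['2014-1-1', '2014-1-2', '2014-12-31'])

converted = pd.to_datetime(dates)

print(dates.dtype)      # text values before the conversion
print(converted.dtype)  # a datetime64 type after the conversion

# Datetime values can be compared and subtracted, unlike plain strings.
print(converted.iloc[2] - converted.iloc[0])
```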

We also need to change the index from the default to the datetime64 values in the 'GMT' column so that it is easier to pull out rows between particular dates and display more meaningful graphs:

In [8]:
gijon.index = gijon['GMT']

Finding a summer break

According to meteorologists, summer extends for the whole months of June, July, and August in the northern hemisphere and the whole months of December, January, and February in the southern hemisphere. So as I'm in the northern hemisphere I'm going to create a dataframe that holds just those months using the datetime index, like this:

In [9]:
# select 1 June to 31 August by date label (both end dates are included)
summer = gijon.loc['2014-06-01' : '2014-08-31']
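To illustrate why a datetime64 index makes this selection easy, here is a minimal sketch on a made-up dataframe with one row per day of 2014. The column name 'MeanTempC' and its values are placeholders, not real weather data.

```python
import pandas as pd

# Hypothetical data: one made-up value per day for the whole of 2014.
days = pd.date_range('2014-01-01', '2014-12-31', freq='D')
demo = pd.DataFrame({'MeanTempC': range(len(days))}, index=days)

# Label-based slicing on a datetime index includes both end dates.
summer_demo = demo.loc['2014-06-01':'2014-08-31']

print(len(summer_demo))  # 30 + 31 + 31 days
print(summer_demo.index.min(), summer_demo.index.max())
```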

I now look for the days with warm temperatures.

In [13]:
summer[summer['Temperatura mediaC'] >= 25]
Out[13]:
CET Temperatura mÔximaC Temperatura mediaC Temperatura mínimaC Punto de rocíoC MeanDew PointC Min DewpointC Max Humedad Mean Humedad Min Humedad ... Mean VisibilidadKm Min VisibilidadkM Max Velocidad del vientoKm/h Mean Velocidad del vientoKm/h Max Velocidad de rÔfagasKm/h Precipitaciónmm CloudCover Eventos WindDirDegrees GMT
GMT

0 rows Ɨ 24 columns

Summer 2014 was rather cool in Gijon: there are no days with a mean temperature of 25 Celsius or higher. It is best to look at a graph of the temperature and find the warmest period.
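The filter used above works in two steps: the comparison produces a column of True/False values, and indexing the dataframe with that column keeps only the True rows. A minimal sketch on made-up temperatures (the column name 'MeanTempC' is a placeholder):

```python
import pandas as pd

# Hypothetical mean temperatures for five days in July.
temps = pd.DataFrame(
    {'MeanTempC': [19, 22, 26, 24, 25]},
    index=pd.date_range('2014-07-01', periods=5, freq='D'),
)

mask = temps['MeanTempC'] >= 25  # one True/False value per row
warm = temps[mask]               # keeps only the rows where mask is True

print(mask.tolist())
print(warm)

# When a filter returns no rows, the maximum shows how far off the threshold was.
print(temps['MeanTempC'].max())
```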

So next we tell Jupyter to display any graph created inside this notebook:

In [14]:
%matplotlib inline

Now let's plot the 'Temperatura mediaC' (mean temperature) for the summer:

In [16]:
summer['Temperatura mediaC'].plot(grid=True, figsize=(10,5))
Out[16]:
<matplotlib.axes.AxesSubplot at 0xb0b4a64c>

Looking at the graph, the second half of July looks good, with mean temperatures over 20 degrees C, so let's put precipitation on the graph too:

In [19]:
summer[['Temperatura mediaC', 'Precipitaciónmm']].plot(grid=True, figsize=(10,5))
Out[19]:
<matplotlib.axes.AxesSubplot at 0xb0a804ac>

The second half of July is still looking good, with just a couple of peaks showing heavy rain. Let's have a closer look by plotting just the mean temperature and precipitation for July.

In [20]:
julio = summer.loc['2014-07-01' : '2014-07-31']
julio[['Temperatura mediaC', 'Precipitaciónmm']].plot(grid=True, figsize=(10,5))
Out[20]:
<matplotlib.axes.AxesSubplot at 0xb0b6860c>

Yes, the second half of July looks pretty good. Just two days have significant rain, the 25th and the 28th, and there is just one day when the mean temperature drops below 20 degrees, also the 28th.
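Reading the fortnight off a graph works, but the search could also be made more systematic. One possible approach, sketched here on made-up data rather than the real file, is a rolling 14-day window that reports the mean temperature and total rainfall for every fortnight. The column names 'MeanTempC' and 'Precipmm' and all values are invented for illustration.

```python
import pandas as pd

days = pd.date_range('2014-07-01', '2014-07-31', freq='D')

# Hypothetical values: cool start, a warm spell, then slightly cooler again.
temp = [18] * 15 + [22] * 14 + [19] * 2
rain = [0.0] * 31
rain[24] = 5.0  # 25 July
rain[27] = 8.0  # 28 July
demo = pd.DataFrame({'MeanTempC': temp, 'Precipmm': rain}, index=days)

window = 14
mean_temp = demo['MeanTempC'].rolling(window).mean()
total_rain = demo['Precipmm'].rolling(window).sum()

# Each rolling value is labelled with the LAST day of its window.
end = mean_temp.idxmax()
start = end - pd.Timedelta(days=window - 1)

print('Warmest fortnight:', start.date(), 'to', end.date())
print('Mean temperature:', mean_temp[end], 'Total rain:', total_rain[end])
```

Picking the window by temperature alone is a simplification; a fuller analysis would also weigh the rainfall total when ranking fortnights.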

Conclusions

The graphs have shown how changeable the summer weather can be, but a couple of weeks were found when the weather wasn't too bad in 2014. Of course, this is no guarantee that the weather pattern will repeat itself in future years. To make a sensible prediction we would need to analyse the summers for many more years. By the time you have finished this course you should be able to do that.