In [1]:
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np


plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (15, 3)
plt.rcParams['font.family'] = 'sans-serif'

Summary

By the end of this chapter, we're going to have downloaded all of Canada's weather data for 2012, and saved it to a CSV.

We'll do this by downloading it one month at a time, and then combining all the months together.

Here's the temperature every hour for 2012!

In [2]:
data_url = "https://sciencedata.dk/public/6e3ed434c0fa43df906ce2b6d1ba9fc6/pandas-cookbook/data/weather_2012.csv"
In [3]:
weather_2012_final = pd.read_csv(data_url, index_col='Date/Time')

weather_2012_final['Temp (C)'].plot(figsize=(15, 6))
Out[3]:
<AxesSubplot:xlabel='Date/Time'>

5.1 Downloading one month of weather data

When playing with the cycling data, I wanted temperature and precipitation data to find out if people like biking when it's raining. So I went to the site for Canadian historical weather data, and figured out how to get it automatically.

Here we're going to get the data for March 2012, and clean it up

Here's an URL template you can use to get data in Montreal.

In [4]:
url_template = "https://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&stationID=5415&Year={year}&Month={month}&Day=1&time=&timeframe=2&submit=Download+Data"

To get the data for March 2012, we need to format it with month=3, year=2012.

In [5]:
url = url_template.format(month=3, year=2012)

weather_mar2012 = pd.read_csv(url, skiprows=0, index_col='Date/Time', parse_dates=True)

This is super great! We can just use the same read_csv function as before, and just give it a URL as a filename. Awesome.

There are 16 rows of metadata at the top of this CSV, but pandas knows CSVs are weird, so there's a skiprows options. We parse the dates again, and set 'Date/Time' to be the index column. Here's the resulting dataframe.

In [6]:
weather_mar2012
Out[6]:
Longitude (x) Latitude (y) Station Name Climate ID Year Month Day Data Quality Max Temp (°C) Max Temp Flag ... Total Snow (cm) Total Snow Flag Total Precip (mm) Total Precip Flag Snow on Grnd (cm) Snow on Grnd Flag Dir of Max Gust (10s deg) Dir of Max Gust Flag Spd of Max Gust (km/h) Spd of Max Gust Flag
Date/Time
2012-01-01 -73.75 45.47 MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A 7025250 2012 1 1 NaN 5.8 NaN ... 0.0 NaN 1.4 NaN 3 NaN NaN NaN NaN NaN
2012-01-02 -73.75 45.47 MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A 7025250 2012 1 2 NaN 4.6 NaN ... 0.0 T 0.0 T 0 T NaN NaN NaN NaN
2012-01-03 -73.75 45.47 MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A 7025250 2012 1 3 NaN -9.7 NaN ... 0.0 T 0.0 T 0 T NaN NaN NaN NaN
2012-01-04 -73.75 45.47 MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A 7025250 2012 1 4 NaN -7.3 NaN ... 1.4 NaN 1.0 NaN 0 T NaN NaN NaN NaN
2012-01-05 -73.75 45.47 MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A 7025250 2012 1 5 NaN -4.1 NaN ... 0.4 NaN 0.4 NaN 1 NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2012-12-27 -73.75 45.47 MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A 7025250 2012 12 27 NaN -3.3 NaN ... 45.6 NaN 46.2 NaN 11 NaN NaN NaN NaN NaN
2012-12-28 -73.75 45.47 MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A 7025250 2012 12 28 NaN -4.2 NaN ... 0.2 NaN 0.4 NaN 40 NaN NaN NaN NaN NaN
2012-12-29 -73.75 45.47 MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A 7025250 2012 12 29 NaN -8.4 NaN ... 4.6 NaN 4.6 NaN 32 NaN NaN NaN NaN NaN
2012-12-30 -73.75 45.47 MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A 7025250 2012 12 30 NaN -9.0 NaN ... 1.2 NaN 1.2 NaN 34 NaN NaN NaN NaN NaN
2012-12-31 -73.75 45.47 MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A 7025250 2012 12 31 NaN 0.8 NaN ... 1.4 NaN 1.4 NaN 30 NaN NaN NaN NaN NaN

366 rows × 30 columns

Let's plot it!

In [7]:
weather_mar2012[u"Mean Temp (°C)"].plot(figsize=(15, 5))
Out[7]:
<AxesSubplot:xlabel='Date/Time'>

Notice how it goes up to 25° C in the middle there? That was a big deal. It was March, and people were wearing shorts outside.

And I was out of town and I missed it. Still sad, humans.

Let's fix up the columns. We're going to just print them out, copy, and fix them up by hand.

In [10]:
weather_mar2012[:5]
Out[10]:
Longitude (x) Latitude (y) Station Name Climate ID Year Month Day Data Quality Max Temp (°C) Max Temp Flag ... Total Snow (cm) Total Snow Flag Total Precip (mm) Total Precip Flag Snow on Grnd (cm) Snow on Grnd Flag Dir of Max Gust (10s deg) Dir of Max Gust Flag Spd of Max Gust (km/h) Spd of Max Gust Flag
Date/Time
2012-01-01 -73.75 45.47 MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A 7025250 2012 1 1 NaN 5.8 NaN ... 0.0 NaN 1.4 NaN 3 NaN NaN NaN NaN NaN
2012-01-02 -73.75 45.47 MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A 7025250 2012 1 2 NaN 4.6 NaN ... 0.0 T 0.0 T 0 T NaN NaN NaN NaN
2012-01-03 -73.75 45.47 MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A 7025250 2012 1 3 NaN -9.7 NaN ... 0.0 T 0.0 T 0 T NaN NaN NaN NaN
2012-01-04 -73.75 45.47 MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A 7025250 2012 1 4 NaN -7.3 NaN ... 1.4 NaN 1.0 NaN 0 T NaN NaN NaN NaN
2012-01-05 -73.75 45.47 MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A 7025250 2012 1 5 NaN -4.1 NaN ... 0.4 NaN 0.4 NaN 1 NaN NaN NaN NaN NaN

5 rows × 30 columns

You'll notice in the summary above that there are a few columns which are are either entirely empty or only have a few values in them. Let's get rid of all of those with dropna.

The argument axis=1 to dropna means "drop columns", not rows", and how='any' means "drop the column if any value is null".

This is much better now -- we only have columns with real data.

In [11]:
weather_mar2012 = weather_mar2012.dropna(axis=1, how='any')

weather_mar2012[:5]
Out[11]:
Longitude (x) Latitude (y) Station Name Climate ID Year Month Day Max Temp (°C) Min Temp (°C) Mean Temp (°C) Heat Deg Days (°C) Cool Deg Days (°C) Total Rain (mm) Total Precip (mm) Snow on Grnd (cm)
Date/Time
2012-01-01 -73.75 45.47 MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A 7025250 2012 1 1 5.8 -1.8 2.0 16.0 0.0 1.4 1.4 3
2012-01-02 -73.75 45.47 MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A 7025250 2012 1 2 4.6 -9.7 -2.6 20.6 0.0 0.0 0.0 0
2012-01-03 -73.75 45.47 MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A 7025250 2012 1 3 -9.7 -17.9 -13.8 31.8 0.0 0.0 0.0 0
2012-01-04 -73.75 45.47 MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A 7025250 2012 1 4 -7.3 -18.8 -13.1 31.1 0.0 0.0 1.0 0
2012-01-05 -73.75 45.47 MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A 7025250 2012 1 5 -4.1 -10.2 -7.2 25.2 0.0 0.0 0.4 1

The Year/Month/Day columns are redundant, though. Let's get rid of those.

The axis=1 argument means "Drop columns", like before. The default for operations like dropna and drop is always to operate on rows.

In [12]:
weather_mar2012 = weather_mar2012.drop(['Year', 'Month', 'Day'], axis=1)

weather_mar2012[:5]
Out[12]:
Longitude (x) Latitude (y) Station Name Climate ID Max Temp (°C) Min Temp (°C) Mean Temp (°C) Heat Deg Days (°C) Cool Deg Days (°C) Total Rain (mm) Total Precip (mm) Snow on Grnd (cm)
Date/Time
2012-01-01 -73.75 45.47 MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A 7025250 5.8 -1.8 2.0 16.0 0.0 1.4 1.4 3
2012-01-02 -73.75 45.47 MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A 7025250 4.6 -9.7 -2.6 20.6 0.0 0.0 0.0 0
2012-01-03 -73.75 45.47 MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A 7025250 -9.7 -17.9 -13.8 31.8 0.0 0.0 0.0 0
2012-01-04 -73.75 45.47 MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A 7025250 -7.3 -18.8 -13.1 31.1 0.0 0.0 1.0 0
2012-01-05 -73.75 45.47 MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A 7025250 -4.1 -10.2 -7.2 25.2 0.0 0.0 0.4 1

Awesome! We now only have the relevant columns, and it's much more manageable.

5.2 Plotting the snow on ground by temperature

This one's just for fun -- we've already done this before, using groupby and aggregate! But let's do it anyway.

In [14]:
snow = weather_mar2012[[u'Snow on Grnd (cm)']].copy()

snow.loc[:,'Mean Temp (°C)'] = weather_mar2012[[u'Mean Temp (°C)']].copy()

snow.groupby('Mean Temp (°C)').aggregate(np.median).plot()
Out[14]:
<AxesSubplot:xlabel='Mean Temp (°C)'>

5.3 Getting the whole year of data

Okay, so what if we want the data for the whole year? Ideally the API would just let us download that, but I couldn't figure out a way to do that.

First, let's put our work from above into a function that gets the weather for a given month.

I noticed that there's an irritating bug where when I ask for January, it gives me data for the previous year, so we'll fix that too. [no, really. You can check =)]

In [24]:
def download_weather_month(year, month):

    url = url_template.format(year=year, month=month)

    weather_data = pd.read_csv(url, skiprows=0, index_col='Date/Time', parse_dates=True)

    weather_data = weather_data.dropna(axis=1)

    weather_data = weather_data.drop(['Year', 'Day', 'Month'], axis=1)

    return weather_data

We can test that this function does the right thing:

In [25]:
download_weather_month(2012, 1)[:5]
Out[25]:
Longitude (x) Latitude (y) Station Name Climate ID Max Temp (°C) Min Temp (°C) Mean Temp (°C) Heat Deg Days (°C) Cool Deg Days (°C) Total Rain (mm) Total Precip (mm) Snow on Grnd (cm)
Date/Time
2012-01-01 -73.75 45.47 MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A 7025250 5.8 -1.8 2.0 16.0 0.0 1.4 1.4 3
2012-01-02 -73.75 45.47 MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A 7025250 4.6 -9.7 -2.6 20.6 0.0 0.0 0.0 0
2012-01-03 -73.75 45.47 MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A 7025250 -9.7 -17.9 -13.8 31.8 0.0 0.0 0.0 0
2012-01-04 -73.75 45.47 MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A 7025250 -7.3 -18.8 -13.1 31.1 0.0 0.0 1.0 0
2012-01-05 -73.75 45.47 MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A 7025250 -4.1 -10.2 -7.2 25.2 0.0 0.0 0.4 1

Now we can get all the months at once. This will take a little while to run.

In [26]:
data_by_month = [download_weather_month(2012, i) for i in range(1, 12)]

Once we have this, it's easy to concatenate all the dataframes together into one big dataframe using pd.concat. And now we have the whole year's data!

In [27]:
weather_2012 = pd.concat(data_by_month)

weather_2012
Out[27]:
Longitude (x) Latitude (y) Station Name Climate ID Max Temp (°C) Min Temp (°C) Mean Temp (°C) Heat Deg Days (°C) Cool Deg Days (°C) Total Rain (mm) Total Precip (mm) Snow on Grnd (cm)
Date/Time
2012-01-01 -73.75 45.47 MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A 7025250 5.8 -1.8 2.0 16.0 0.0 1.4 1.4 3
2012-01-02 -73.75 45.47 MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A 7025250 4.6 -9.7 -2.6 20.6 0.0 0.0 0.0 0
2012-01-03 -73.75 45.47 MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A 7025250 -9.7 -17.9 -13.8 31.8 0.0 0.0 0.0 0
2012-01-04 -73.75 45.47 MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A 7025250 -7.3 -18.8 -13.1 31.1 0.0 0.0 1.0 0
2012-01-05 -73.75 45.47 MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A 7025250 -4.1 -10.2 -7.2 25.2 0.0 0.0 0.4 1
... ... ... ... ... ... ... ... ... ... ... ... ...
2012-12-27 -73.75 45.47 MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A 7025250 -3.3 -9.4 -6.4 24.4 0.0 0.2 46.2 11
2012-12-28 -73.75 45.47 MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A 7025250 -4.2 -12.0 -8.1 26.1 0.0 0.0 0.4 40
2012-12-29 -73.75 45.47 MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A 7025250 -8.4 -16.0 -12.2 30.2 0.0 0.0 4.6 32
2012-12-30 -73.75 45.47 MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A 7025250 -9.0 -14.1 -11.6 29.6 0.0 0.0 1.2 34
2012-12-31 -73.75 45.47 MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A 7025250 0.8 -11.8 -5.5 23.5 0.0 0.0 1.4 30

4026 rows × 12 columns

5.4 Saving to a CSV

It's slow and unnecessary to download the data every time, so let's save our dataframe to ScienceData for later use! If we use the unqualified hostname sciencedata, we'll be using the internal network, and ssl will complain about the certificate of not matching the host. We tell ssl to ignore this.

We save to a folder tmp. If you don't already have a folder by that name, just create it.

In [28]:
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
In [39]:
import requests
requests.put('https://sciencedata/files/tmp/weather_2012.csv', data=weather_2012.to_csv())
Out[39]:
<Response [204]>

We can also save a copy to the pod we're running on, for faster access, but it'll be gone once we delete the pod.

In [40]:
weather_2012.to_csv('weather_2012.csv')

And we're done!