In [1]:
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (15, 3)
plt.rcParams['font.family'] = 'sans-serif'

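import ssl
# Workaround for the download below: this disables HTTPS certificate verification.
# It is insecure, so only use it with hosts you trust.
ssl._create_default_https_context = ssl._create_unverified_context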

We saw earlier that pandas is really good at dealing with dates. It is also amazing with strings! Here we're going to go back to our weather data from Chapter 5.

In [7]:
weather_2012 = pd.read_csv('https://sciencedata/files/tmp/weather_2012.csv', parse_dates=True, encoding='latin1', index_col='Date/Time')

weather_2012[:5]
Out[7]:
            Longitude (x)  Latitude (y)                            Station Name  Climate ID  Max Temp (°C)  Min Temp (°C)  Mean Temp (°C)  Heat Deg Days (°C)  Cool Deg Days (°C)  Total Rain (mm)  Total Precip (mm)  Snow on Grnd (cm)
Date/Time
2012-01-01         -73.75         45.47  MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A     7025250            5.8           -1.8             2.0                16.0                 0.0              1.4                1.4                  3
2012-01-02         -73.75         45.47  MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A     7025250            4.6           -9.7            -2.6                20.6                 0.0              0.0                0.0                  0
2012-01-03         -73.75         45.47  MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A     7025250           -9.7          -17.9           -13.8                31.8                 0.0              0.0                0.0                  0
2012-01-04         -73.75         45.47  MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A     7025250           -7.3          -18.8           -13.1                31.1                 0.0              0.0                1.0                  0
2012-01-05         -73.75         45.47  MONTREAL/PIERRE ELLIOTT TRUDEAU INTL A     7025250           -4.1          -10.2            -7.2                25.2                 0.0              0.0                0.4                  1

6.1 Column operations

Pandas provides vectorized functions that make it easy to operate on a whole column at once. There are some great string examples in the documentation. Here we will consider a simple numerical comparison.

In [8]:
snow = weather_2012['Snow on Grnd (cm)']

is_snowing = snow.gt(0)

This gives us a boolean vector (True on days with snow on the ground), which is a bit hard to look at, so we'll plot it.

In [9]:
# Not super useful

is_snowing[:5]
Out[9]:
Date/Time
2012-01-01     True
2012-01-02    False
2012-01-03    False
2012-01-04    False
2012-01-05     True
Name: Snow on Grnd (cm), dtype: bool
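
If `.gt(0)` looks unfamiliar: as far as pandas is concerned it is the method form of the `>` operator, applied to every element. Here is a minimal sketch using the five snow-depth values shown in the table above (the variable names are made up for this example):

```python
import pandas as pd

# First five 'Snow on Grnd (cm)' values from the table above.
snow_depth = pd.Series([3, 0, 0, 0, 1])

# Method form and operator form of the same element-wise comparison.
mask_method = snow_depth.gt(0)
mask_operator = snow_depth > 0

print(mask_method.equals(mask_operator))  # True
print(mask_method.tolist())               # [True, False, False, False, True]
```
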
In [10]:
# More useful!

is_snowing = is_snowing.astype(float)

is_snowing.plot()
Out[10]:
<AxesSubplot:xlabel='Date/Time'>
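
The string functions mentioned at the start of this section work the same way, through the `.str` accessor. Here is a small sketch; the station names are invented for illustration and are not taken from the weather file:

```python
import pandas as pd

# Hypothetical station names, made up for this example.
stations = pd.Series(["Montreal Intl A", "Toronto City", "Montreal East"])

# .str exposes vectorized string methods that act on every element.
is_montreal = stations.str.contains("Montreal")
shouted = stations.str.upper()

print(is_montreal.tolist())  # [True, False, True]
print(shouted[0])            # MONTREAL INTL A
```

Just like the numerical comparison, `str.contains` returns a boolean Series that you could plot or use as a filter.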

6.2 Use resampling to find the snowiest month

If we wanted the median temperature each month, we could use the resample() method like this:

In [11]:
weather_2012['Mean Temp (°C)'].resample('M').median().plot(kind='bar')
Out[11]:
<AxesSubplot:xlabel='Date/Time'>

Unsurprisingly, July and August are the warmest.
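
To see what resample() is doing without downloading anything, here is a sketch on synthetic data (the values are just 0 to 59, not real temperatures). One assumption to flag: as far as I know, newer pandas releases prefer the alias 'ME' over 'M' for month-end bins, so the hypothetical helper `month_end_alias` picks whichever one the installed version accepts.

```python
import numpy as np
import pandas as pd
from pandas.tseries.frequencies import to_offset


def month_end_alias():
    """Return 'ME' if this pandas accepts it, otherwise the older 'M'."""
    try:
        to_offset("ME")
        return "ME"
    except ValueError:
        return "M"


# Synthetic daily values for January and February 2012 (60 days).
days = pd.date_range("2012-01-01", "2012-02-29", freq="D")
values = pd.Series(np.arange(len(days), dtype=float), index=days)

# Bin the days by calendar month, then take the median of each bin.
monthly = values.resample(month_end_alias()).median()

print(monthly.tolist())  # [15.0, 45.0]
```

Each monthly bin is labelled with the last day of the month, which is why the outputs in this chapter show dates like 2012-01-31.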

We can do the same for snow. First, think of snowiness as being a bunch of 1s and 0s instead of Trues and Falses:

In [12]:
is_snowing.astype(float)[:10]
Out[12]:
Date/Time
2012-01-01    1.0
2012-01-02    0.0
2012-01-03    0.0
2012-01-04    0.0
2012-01-05    1.0
2012-01-06    1.0
2012-01-07    1.0
2012-01-08    1.0
2012-01-09    1.0
2012-01-10    1.0
Name: Snow on Grnd (cm), dtype: float64

and then use resample to find the fraction of days each month with snow on the ground (multiply by 100 for a percentage):

In [13]:
is_snowing.astype(float).resample('M').mean()
Out[13]:
Date/Time
2012-01-31    0.903226
2012-02-29    0.931034
2012-03-31    0.225806
2012-04-30    0.000000
2012-05-31    0.000000
2012-06-30    0.000000
2012-07-31    0.000000
2012-08-31    0.000000
2012-09-30    0.000000
2012-10-31    0.000000
2012-11-30    0.000000
2012-12-31    0.612903
Freq: M, Name: Snow on Grnd (cm), dtype: float64
In [14]:
is_snowing.astype(float).resample('M').mean().plot(kind='bar')
Out[14]:
<AxesSubplot:xlabel='Date/Time'>

So now we know! In 2012, February was the snowiest month.
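
Why does the mean work here? The mean of a bunch of 1s and 0s is just the fraction of 1s. A tiny sketch with made-up flags, plus a check that the January and February numbers above correspond to 28 of 31 days and 27 of 29 days:

```python
import pandas as pd

# Made-up flags: three True values out of four.
flags = pd.Series([True, False, True, True])
fraction = flags.astype(float).mean()
print(fraction)  # 0.75

# The monthly values above are day counts divided by the month length.
jan = round(28 / 31, 6)
feb = round(27 / 29, 6)
print(jan, feb)  # 0.903226 0.931034
```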

6.3 Plotting temperature and snowiness stats together

We can also combine these two statistics (temperature and snowiness) into one DataFrame and plot them together:

In [15]:
temperature = weather_2012['Mean Temp (°C)'].resample('M').median()

is_snowing = weather_2012['Snow on Grnd (cm)'].gt(0)

snowiness = is_snowing.astype(float).resample('M').mean()


# Name the columns

temperature.name = "Temperature"

snowiness.name = "Snowiness"

We'll use concat again to combine the two statistics into a single dataframe.

In [16]:
stats = pd.concat([temperature, snowiness], axis=1)

stats
Out[16]:
            Temperature  Snowiness
Date/Time
2012-01-31        -7.00   0.903226
2012-02-29        -5.90   0.931034
2012-03-31         2.90   0.225806
2012-04-30         6.20   0.000000
2012-05-31        15.70   0.000000
2012-06-30        19.20   0.000000
2012-07-31        22.20   0.000000
2012-08-31        22.30   0.000000
2012-09-30        15.45   0.000000
2012-10-31        11.00   0.000000
2012-11-30         0.45   0.000000
2012-12-31        -3.30   0.612903
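
Here is a stripped-down sketch of what concat with axis=1 does: it lines the Series up on their shared index and uses each Series' name as the column name. The two rows below are rounded from the table above and are only there for illustration.

```python
import pandas as pd

idx = pd.to_datetime(["2012-01-31", "2012-02-29"])

# Two named Series sharing the same index (values rounded for illustration).
temp = pd.Series([-7.0, -5.9], index=idx, name="Temperature")
snow = pd.Series([0.90, 0.93], index=idx, name="Snowiness")

# axis=1 places the Series side by side as columns.
combined = pd.concat([temp, snow], axis=1)

print(list(combined.columns))  # ['Temperature', 'Snowiness']
print(combined.shape)          # (2, 2)
```

This is why the columns were named before concatenating: without names, the columns would simply be numbered.
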
In [17]:
stats.plot(kind='bar')
Out[17]:
<AxesSubplot:xlabel='Date/Time'>

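Uh, that didn't work so well: temperature spans tens of degrees while snowiness stays between 0 and 1, so the snowiness bars are barely visible on a shared scale. We can do better by plotting them on two separate graphs: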

In [18]:
stats.plot(kind='bar', subplots=True, figsize=(15, 10))
Out[18]:
array([<AxesSubplot:title={'center':'Temperature'}, xlabel='Date/Time'>,
       <AxesSubplot:title={'center':'Snowiness'}, xlabel='Date/Time'>],
      dtype=object)
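
If you would rather keep both statistics on one graph, another option (not what this chapter uses, just a sketch) is to give snowiness its own y-axis with the `secondary_y` argument of pandas' plot(). The three rows of `demo` are rounded from the table above for illustration, and the Agg backend is an assumption so that the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook

import pandas as pd

# Three illustrative rows, rounded from the stats table above.
demo = pd.DataFrame(
    {"Temperature": [-7.0, 2.9, 22.2], "Snowiness": [0.90, 0.23, 0.0]},
    index=pd.to_datetime(["2012-01-31", "2012-03-31", "2012-07-31"]),
)

# Temperature uses the left axis; Snowiness gets a separate right axis.
ax = demo.plot(secondary_y=["Snowiness"])
ax.set_ylabel("Temperature")
ax.right_ax.set_ylabel("Snowiness")
```

Separate subplots are usually easier to read, but a secondary axis can be handy when you want to see the two curves move against each other.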