%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (15, 3)
plt.rcParams['font.family'] = 'sans-serif'
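import ssl
# Disables HTTPS certificate verification for urllib-based downloads
# (such as the read_csv call below). Only do this for hosts you trust.
ssl._create_default_https_context = ssl._create_unverified_context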
We saw earlier that pandas is really good at dealing with dates. It is also amazing with strings! Here we're going to go back to our weather data from Chapter 5.
weather_2012 = pd.read_csv('https://sciencedata/files/tmp/weather_2012.csv', parse_dates=True, encoding='latin1', index_col='Date/Time')
weather_2012[:5]
Pandas provides vectorized functions that make it easy to operate on whole columns at once. There are some great string examples in the documentation. Here we will consider a simple numerical comparison.
snow = weather_2012['Snow on Grnd (cm)']
is_snowing = snow.gt(0)
This gives us a boolean Series (True wherever there was snow on the ground), which is a bit hard to look at, so we'll plot it.
# Not super useful
is_snowing[:5]
# More useful!
is_snowing = is_snowing.astype(float)
is_snowing.plot()
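To see what these vectorized operations do without downloading anything, here is a minimal sketch on made-up data. The Series `snow_cm` and `conditions` and their values are hypothetical and are not taken from weather_2012.csv. It also shows one of the string methods the documentation covers, `.str.contains`.

```python
import pandas as pd

# Hypothetical toy data, not from weather_2012.csv
snow_cm = pd.Series([0, 3, 0, 12], name="snow_cm")

# .gt(0) compares every element to 0; it gives the same result as snow_cm > 0
on_ground = snow_cm.gt(0)

# Converting booleans to floats turns True/False into 1.0/0.0
as_float = on_ground.astype(float)

# The .str accessor applies string methods to every element at once
conditions = pd.Series(["Snow", "Fog", "Snow Showers", "Clear"])
mentions_snow = conditions.str.contains("Snow")

print(on_ground.tolist())
print(as_float.tolist())
print(mentions_snow.tolist())
```

No loops needed: each call works on the whole column in one go.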
If we wanted the median temperature each month, we could use the resample() method like this:
weather_2012['Mean Temp (°C)'].resample('M').apply(np.median).plot(kind='bar')
Unsurprisingly, July and August are the warmest.
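If resample feels a bit magical, here is a small sketch of what it does, using four made-up daily readings that straddle a month boundary. The values are hypothetical. One assumption to flag: the sketch uses the `'MS'` (month start) alias rather than `'M'`, because newer pandas versions warn about `'M'` for resampling; the only difference is that each bin is labelled by the first day of the month instead of the last.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings: Jan 30, Jan 31, Feb 1, Feb 2
days = pd.date_range("2012-01-30", periods=4, freq="D")
temps = pd.Series([-5.0, -1.0, 2.0, 4.0], index=days)

# Group the readings into calendar months and take the median of each group.
# 'MS' is swapped in for 'M' here (assumption: avoids version-specific warnings).
monthly_median = temps.resample("MS").median()

# Same computation written the way the cell above does it
via_apply = temps.resample("MS").apply(np.median)

print(monthly_median)
```

January's two readings (-5 and -1) give a median of -3, and February's (2 and 4) give 3.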
So we can think of snowiness as being a bunch of 1s and 0s instead of Trues and Falses:
is_snowing.astype(float)[:10]
and then use resample to find the fraction of the time it was snowing each month (a value between 0 and 1):
is_snowing.astype(float).resample('M').apply(np.mean)
is_snowing.astype(float).resample('M').apply(np.mean).plot(kind='bar')
So now we know! In 2012, February was the snowiest month.
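Why does taking the mean give us a fraction? Here is a tiny sketch with made-up values (the Series `snow_on_ground` is hypothetical). The mean of a bunch of 1s and 0s is just the share of 1s, and pandas will also average booleans directly, so the `astype(float)` step is optional for this particular calculation.

```python
import pandas as pd

# Hypothetical data: snow on the ground for 3 out of 4 days
days = pd.date_range("2012-02-01", periods=4, freq="D")
snow_on_ground = pd.Series([True, True, False, True], index=days)

# Mean of 1.0/0.0 values is the fraction of True entries
fraction = snow_on_ground.astype(float).mean()

# Booleans are treated as 1/0 when averaged, so this gives the same number
fraction_direct = snow_on_ground.mean()

# Multiply by 100 if you want an actual percentage
percent = 100 * fraction

print(fraction, fraction_direct, percent)
```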
We can also combine these two statistics (temperature, and snowiness) into one dataframe and plot them together:
temperature = weather_2012['Mean Temp (°C)'].resample('M').apply(np.median)
is_snowing = weather_2012['Snow on Grnd (cm)'].gt(0)
snowiness = is_snowing.astype(float).resample('M').apply(np.mean)
# Name the columns
temperature.name = "Temperature"
snowiness.name = "Snowiness"
We'll use concat again to combine the two statistics into a single dataframe.
stats = pd.concat([temperature, snowiness], axis=1)
stats
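Here is a minimal sketch of why naming the Series first matters, using two made-up monthly values (the numbers are hypothetical, not results from the weather data). With `axis=1`, concat lines the Series up side by side on their shared index and uses each Series' name as the column header.

```python
import pandas as pd

# Hypothetical month-end index and values
months = pd.to_datetime(["2012-01-31", "2012-02-29"])
temperature = pd.Series([-7.0, -4.0], index=months, name="Temperature")
snowiness = pd.Series([0.9, 1.0], index=months, name="Snowiness")

# axis=1 places the Series next to each other as columns
combined = pd.concat([temperature, snowiness], axis=1)

print(combined)
```

If the Series had no names, the columns would just be labelled 0 and 1.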
stats.plot(kind='bar')
Uh, that didn't work so well because the scale was wrong: temperature spans many degrees, while snowiness only ranges from 0 to 1. We can do better by plotting them on two separate graphs:
stats.plot(kind='bar', subplots=True, figsize=(15, 10))