# The usual preamble
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Make the graphs a bit prettier, and bigger
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (15, 5)
# This is necessary to show lots of columns in pandas 0.12.
# Not necessary in pandas 0.13.
pd.set_option('display.width', 5000)
pd.set_option('display.max_columns', 60)
Let's continue with our NYC 311 service requests example.
data_url = "https://sciencedata.dk/public/6e3ed434c0fa43df906ce2b6d1ba9fc6/pandas-cookbook/data/311-service-requests.csv"
# Some columns contain mixed types, so read every column as a string to avoid dtype problems
complaints = pd.read_csv(data_url, dtype=str)
I'd like to know which borough has the most noise complaints. First, we'll take a look at the data to see what it looks like:
complaints[:5]
To get the noise complaints, we need to find the rows where the "Complaint Type" column is "Noise - Street/Sidewalk". I'll show you how to do that, and then explain what's going on.
noise_complaints = complaints[complaints['Complaint Type'] == "Noise - Street/Sidewalk"]
noise_complaints[:3]
If you look at noise_complaints, you'll see that this worked, and it only contains complaints with the right complaint type. But how does this work? Let's deconstruct it into two pieces.
complaints['Complaint Type'] == "Noise - Street/Sidewalk"
This is a big array of Trues and Falses, one for each row in our dataframe. When we index our dataframe with this array, we get just the rows where our boolean array evaluated to True. Note that for row filtering to work, the boolean array must have the same length as the dataframe's index.
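To make the mechanics concrete, here is a minimal sketch on a tiny made-up dataframe (the `toy` data is hypothetical, not taken from the 311 file). It shows the mask having one boolean per row, and what happens when a boolean list of the wrong length is used.

```python
import pandas as pd

# Hypothetical toy data standing in for the 311 dataset
toy = pd.DataFrame({
    "Complaint Type": ["Noise - Street/Sidewalk", "Illegal Parking", "Noise - Street/Sidewalk"],
    "Borough": ["MANHATTAN", "BROOKLYN", "BROOKLYN"],
})

# Comparing a column to a value gives one boolean per row
mask = toy["Complaint Type"] == "Noise - Street/Sidewalk"
print(mask.tolist())  # [True, False, True]

# Indexing with the mask keeps only the rows marked True
filtered = toy[mask]
print(len(filtered))  # 2

# A boolean list whose length differs from the number of rows is rejected
raised = False
try:
    toy[[True, False]]
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```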
You can also combine more than one condition with the & operator like this:
is_noise = complaints['Complaint Type'] == "Noise - Street/Sidewalk"
in_brooklyn = complaints['Borough'] == "BROOKLYN"
complaints[is_noise & in_brooklyn][:5]
Or if we just wanted a few columns:
complaints[is_noise & in_brooklyn][['Complaint Type', 'Borough', 'Created Date', 'Descriptor']][:10]
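Beyond &, boolean masks can also be combined with | (or) and negated with ~ (not). The sketch below uses a small hypothetical dataframe (`toy`) rather than the real data. It also shows two things that tend to trip people up: comparisons written inline need parentheses because & binds more tightly than ==, and .loc can pick rows and columns in a single step instead of chaining two selections.

```python
import pandas as pd

# Hypothetical toy data standing in for the 311 dataset
toy = pd.DataFrame({
    "Complaint Type": ["Noise - Street/Sidewalk", "Illegal Parking",
                       "Noise - Street/Sidewalk", "Noise - Commercial"],
    "Borough": ["MANHATTAN", "BROOKLYN", "BROOKLYN", "BROOKLYN"],
    "Descriptor": ["Loud Talking", "Blocked Hydrant",
                   "Loud Music/Party", "Loud Music/Party"],
})

is_noise = toy["Complaint Type"] == "Noise - Street/Sidewalk"
in_brooklyn = toy["Borough"] == "BROOKLYN"

both = toy[is_noise & in_brooklyn]              # street noise AND Brooklyn
either = toy[is_noise | in_brooklyn]            # street noise OR Brooklyn
noise_elsewhere = toy[is_noise & ~in_brooklyn]  # street noise, NOT Brooklyn

# Inline comparisons must be parenthesized, since & binds tighter than ==
inline = toy[(toy["Complaint Type"] == "Noise - Street/Sidewalk")
             & (toy["Borough"] == "BROOKLYN")]

# .loc takes the row mask and the column list together
subset = toy.loc[is_noise & in_brooklyn, ["Complaint Type", "Borough"]]
print(subset)
```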
On the inside, the type of a column is pd.Series
pd.Series([1,2,3])
and pandas Series are internally numpy arrays. If you add .values to the end of any Series, you'll get its internal numpy array:
np.array([1,2,3])
pd.Series([1,2,3]).values
So this boolean-array-selection business is actually something that works with any numpy array:
arr = np.array([1,2,3])
arr != 2
arr[arr != 2]
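As a small side-by-side sketch, the same mask can be applied to a Series and to its underlying numpy array. This uses .to_numpy(), which recent pandas versions offer alongside .values; the variable names are only illustrative. The selected values are the same, but the Series result keeps the original index labels of the rows that survived.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])
arr = s.to_numpy()      # the numpy array behind the Series

# Boolean selection on the plain numpy array
kept_np = arr[arr != 2]
print(kept_np)

# The same selection on the Series keeps the index labels 0 and 2
kept_pd = s[s != 2]
print(kept_pd)
```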
is_noise = complaints['Complaint Type'] == "Noise - Street/Sidewalk"
noise_complaints = complaints[is_noise]
noise_complaints['Borough'].value_counts()
It's Manhattan! But a borough that files more complaints overall will naturally have more noise complaints too. What if we divided by the total number of complaints in each borough, to get a fairer comparison? That would be easy too:
noise_complaint_counts = noise_complaints['Borough'].value_counts()
complaint_counts = complaints['Borough'].value_counts()
noise_complaint_counts / complaint_counts
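Oops, why was that zero? That's no good. This is because of integer division in Python 2: dividing one integer Series by another truncates each result. (In Python 3, / is true division, so you get fractions right away.) Let's fix it by converting complaint_counts into an array of floats, which works in either version.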
noise_complaint_counts / complaint_counts.astype(float)
(noise_complaint_counts / complaint_counts.astype(float)).plot(kind='bar')
So Manhattan really does complain more about noise than the other boroughs! Neat.
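One detail worth spelling out about the division above: pandas matches the two Series by index label (the borough name), not by position, so it does not matter that value_counts() orders each result differently. Here is a minimal sketch with made-up counts (the numbers and the `noise` and `total` names are hypothetical). A label present in only one Series produces NaN, and Series.div with fill_value is one way to treat the missing count as zero.

```python
import pandas as pd

# Hypothetical counts, deliberately listed in different orders
noise = pd.Series({"MANHATTAN": 30, "BROOKLYN": 10})
total = pd.Series({"BROOKLYN": 100, "MANHATTAN": 200, "QUEENS": 50})

# Division aligns on the borough labels, not on position
ratio = noise / total
print(ratio)

# QUEENS has no entry in `noise`, so its ratio is NaN;
# fill_value=0 treats the missing noise count as zero instead
ratio_filled = noise.div(total, fill_value=0)
print(ratio_filled)
```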