# The usual preamble
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Make the graphs a bit prettier, and bigger
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (15, 5)
# This is necessary to show lots of columns in pandas 0.12.
# Not necessary in pandas 0.13.
pd.set_option('display.width', 5000)
pd.set_option('display.max_columns', 60)
Let's continue with our NYC 311 service requests example.
data_url = "https://sciencedata.dk/public/6e3ed434c0fa43df906ce2b6d1ba9fc6/pandas-cookbook/data/311-service-requests.csv"
# because of mixed types we specify dtype to prevent any errors
complaints = pd.read_csv(data_url, dtype='unicode')
I'd like to know which borough has the most noise complaints. First, we'll take a look at the data to see what it looks like:
complaints[:5]
To get the noise complaints, we need to find the rows where the "Complaint Type" column is "Noise - Street/Sidewalk". I'll show you how to do that, and then explain what's going on.
noise_complaints = complaints[complaints['Complaint Type'] == "Noise - Street/Sidewalk"]
noise_complaints[:3]
If you look at noise_complaints, you'll see that this worked, and it only contains complaints with the right complaint type. But how does this work? Let's deconstruct it into two pieces
complaints['Complaint Type'] == "Noise - Street/Sidewalk"
This is a big array of Trues and Falses, one for each row in our dataframe. When we index our dataframe with this array, we get just the rows where our boolean array evaluated to True. It's important to note that for row filtering by a boolean array the length of our dataframe's index must be the same length as the boolean array used for filtering.
You can also combine more than one condition with the & operator like this:
is_noise = complaints['Complaint Type'] == "Noise - Street/Sidewalk"
in_brooklyn = complaints['Borough'] == "BROOKLYN"
complaints[is_noise & in_brooklyn][:5]
Or if we just wanted a few columns:
complaints[is_noise & in_brooklyn][['Complaint Type', 'Borough', 'Created Date', 'Descriptor']][:10]
On the inside, the type of a column is pd.Series
pd.Series([1,2,3])
and pandas Series are internally numpy arrays. If you add .values to the end of any Series, you'll get its internal numpy array
np.array([1,2,3])
pd.Series([1,2,3]).values
So this binary-array-selection business is actually something that works with any numpy array:
arr = np.array([1,2,3])
arr != 2
arr[arr != 2]
is_noise = complaints['Complaint Type'] == "Noise - Street/Sidewalk"
noise_complaints = complaints[is_noise]
noise_complaints['Borough'].value_counts()
It's Manhattan! But what if we wanted to divide by the total number of complaints, to make it make a bit more sense? That would be easy too:
noise_complaint_counts = noise_complaints['Borough'].value_counts()
complaint_counts = complaints['Borough'].value_counts()
noise_complaint_counts / complaint_counts
Oops, why was that zero? That's no good. This is because of integer division in Python 2. Let's fix it, by converting complaint_counts into an array of floats.
noise_complaint_counts / complaint_counts.astype(float)
(noise_complaint_counts / complaint_counts.astype(float)).plot(kind='bar')
So Manhattan really does complain more about noise than the other boroughs! Neat.