# The usual preamble
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
# Make the graphs a bit prettier, and bigger
plt.style.use('ggplot')
# This is necessary to show lots of columns in pandas 0.12.
# Not necessary in pandas 0.13.
pd.set_option('display.width', 5000)
pd.set_option('display.max_columns', 60)
plt.rcParams['figure.figsize'] = (15, 5)
We're going to use a new dataset here, to demonstrate how to deal with larger datasets. This is a subset of the 311 service requests from NYC Open Data.
data_url = "https://sciencedata.dk/public/6e3ed434c0fa43df906ce2b6d1ba9fc6/pandas-cookbook/data/311-service-requests.csv"
# because of mixed types we specify dtype to prevent any errors
complaints = pd.read_csv(data_url, dtype='unicode')
If you leave out the dtype argument then, depending on your pandas version, you might see a warning like "DtypeWarning: Columns (8) have mixed types". This means that pandas encountered a problem reading in our data. In this case it almost certainly means that the file has columns where some of the entries are strings and some are integers.
For now we're going to sidestep it by reading every column as strings and hope we don't run into a problem, but in the long run we'd need to investigate this warning.
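As a starting point for that investigation, here is a minimal sketch on a made-up miniature CSV (the column names and values below are hypothetical stand-ins, not rows from the 311 file). It reads everything as text and then flags the entries in one column that are not purely digits, which is the kind of mix that typically triggers the warning:

```python
import io
import pandas as pd

# Hypothetical miniature CSV: the second column mixes digit-only and free-text entries
csv_text = "Unique Key,Incident Zip\n1,11432\n2,UNKNOWN\n3,10002-1234\n"

# dtype=str reads every column as text, so pandas never has to guess a type
sample = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Flag the entries that are made up purely of digits
looks_numeric = sample['Incident Zip'].str.isdigit()

# Show only the rows whose value would not parse as a plain integer
print(sample[~looks_numeric])
```

The same check could be pointed at whichever column the warning names in the real data.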
When you print a large dataframe, it will only show you the first few rows.
If you don't see this, don't panic! The default behavior for large dataframes changed between pandas 0.12 and 0.13. Before 0.13 it would show you a summary of the dataframe instead. That summary includes all the columns, and how many non-null values there are in each column.
complaints
To select a column, we index with the name of the column, like this:
complaints['Complaint Type']
To get the first 5 rows of a dataframe, we can use a slice: df[:5].
This is a great way to get a sense for what kind of information is in the dataframe -- take a minute to look at the contents and get a feel for this dataset.
complaints[:5]
We can combine these to get the first 5 rows of a column:
complaints['Complaint Type'][:5]
and it doesn't matter which direction we do it in:
complaints[:5]['Complaint Type']
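To see why the order doesn't matter, here is a minimal sketch on a small made-up dataframe (the toy frame and its values are hypothetical, not taken from the 311 data). Selecting a column gives a Series and slicing rows gives a dataframe, but both routes end at the same Series:

```python
import pandas as pd

# Hypothetical toy data standing in for the complaints dataframe
toy = pd.DataFrame({
    'Complaint Type': ['Noise', 'Heating', 'Noise', 'Illegal Parking'],
    'Borough': ['QUEENS', 'BRONX', 'BROOKLYN', 'QUEENS'],
})

col_then_rows = toy['Complaint Type'][:2]   # pick the column, then its first 2 entries
rows_then_col = toy[:2]['Complaint Type']   # take the first 2 rows, then pick the column

# equals() compares both the values and the index
print(col_then_rows.equals(rows_then_col))
```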
What if we just want to know the complaint type and the borough, but not the rest of the information? Pandas makes it really easy to select a subset of the columns: just index with a list of the columns you want.
complaints[['Complaint Type', 'Borough']]
That showed us a summary. To look at just the first 10 rows, we can add a slice:
complaints[['Complaint Type', 'Borough']][:10]
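One detail worth checking for yourself is the difference between indexing with a single name and indexing with a list of names. The sketch below uses a made-up toy dataframe (hypothetical values, not the real 311 data) to show that a list always gives back a dataframe, even when it holds only one column name:

```python
import pandas as pd

# Hypothetical toy data standing in for the complaints dataframe
toy = pd.DataFrame({
    'Complaint Type': ['Noise', 'Heating', 'Noise', 'Illegal Parking'],
    'Borough': ['QUEENS', 'BRONX', 'BROOKLYN', 'QUEENS'],
    'Agency': ['NYPD', 'HPD', 'NYPD', 'NYPD'],
})

subset = toy[['Complaint Type', 'Borough']]  # list of names -> dataframe with those columns
one_col_frame = toy[['Borough']]             # one-element list -> still a dataframe
one_col_series = toy['Borough']              # bare name -> Series

print(type(subset).__name__, list(subset.columns))
print(type(one_col_frame).__name__, type(one_col_series).__name__)
```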
What's the most common complaint type? This is a really easy question to answer! There's a .value_counts() method that we can use:
complaints['Complaint Type'].value_counts()
If we just want the top 10 most common complaints, we can do this:
complaint_counts = complaints['Complaint Type'].value_counts()
complaint_counts[:10]
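Taking the first entries works because value_counts() sorts its result from most to least frequent by default. Here is a minimal sketch on a made-up Series (the values are hypothetical), using .head() as a more explicit way to take the top entries:

```python
import pandas as pd

# Hypothetical toy column of complaint types
toy_types = pd.Series(
    ['Noise', 'Heating', 'Noise', 'Illegal Parking', 'Noise', 'Heating'],
    name='Complaint Type',
)

# Counts come back indexed by value, sorted from most to least common
counts = toy_types.value_counts()

# The two most common values and how often each appears
top2 = counts.head(2)
print(top2)
```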
But it gets better! We can plot them!
complaint_counts[:10].plot(kind='bar')