Data analyses with Python & Jupyter

Introduction

You can do complex biological data manipulation and analyses using the pandas Python package (or, by switching kernels, using R!).

We will look at pandas here, which provides R-like functions for data manipulation and analyses. pandas is built on top of NumPy. Most importantly, it offers an R-like DataFrame object: a two-dimensional table with explicit row and column names whose columns can hold different types of data as well as missing values, which is awkward to do with plain NumPy arrays.
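
To make the contrast with NumPy concrete, here is a minimal sketch. The species names and masses are made-up toy values, and the names arr and df are only for illustration.

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single dtype, so mixing text and numbers
# coerces every element to a string.
arr = np.array([["Fox", 4.5], ["Wolf", 40.0]])
print(arr.dtype.kind)  # 'U' is NumPy's code for unicode strings

# A DataFrame stores each column separately, so types can differ
# and a missing value (NaN) can sit alongside real numbers.
df = pd.DataFrame(
    {"species": ["Fox", "Wolf", "Rabbit"], "mass_kg": [4.5, 40.0, np.nan]},
    index=["a", "b", "c"],  # explicit row names
)
print(df.dtypes)
print(df.loc["c", "mass_kg"])  # look up a cell by row and column name
```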

pandas also implements a number of powerful data operations for filtering, grouping and reshaping data similar to R or spreadsheet programs.
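
As a rough taste of these three kinds of operation, here is a small sketch on a made-up table (the values are invented, and the column names are assumptions for illustration only):

```python
import pandas as pd

# Made-up toy data, only for illustration
df = pd.DataFrame({
    "family": ["Indridae", "Indridae", "Cheirogaleidae", "Cheirogaleidae"],
    "sex": ["male", "female", "male", "female"],
    "mass_kg": [1.0, 1.2, 0.4, 0.3],
})

# Filtering: keep rows that satisfy a condition
heavy = df[df["mass_kg"] > 0.5]

# Grouping: mean body mass per family
mean_mass = df.groupby("family")["mass_kg"].mean()

# Reshaping: one row per family, one column per sex
wide = df.pivot(index="family", columns="sex", values="mass_kg")

print(heavy)
print(mean_mass)
print(wide)
```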

Installing Pandas

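pandas requires NumPy. See the pandas documentation for installation details. If you installed Anaconda, you already have pandas installed. Otherwise, you can install it with pip (pip install pandas) or, on Debian/Ubuntu, with sudo apt install python3-pandas.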

Assuming pandas is installed, you can import it and check the version:

In [1]:
import pandas as pd
pd.__version__
Out[1]:
'1.2.5'

Also import scipy:

In [2]:
import scipy as sc

Reminder about tabbing and help!

As you read through these chapters, don't forget that Jupyter lets you quickly explore the contents of a package, or the methods applicable to an object, by using the tab-completion feature. Documentation for functions can also be accessed using the ? character. For example, to display all the contents of the pandas namespace, you can type

In [1]: pd.<TAB>

And to display Pandas's built-in documentation, you can use this:

In [2]: pd?
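
The <TAB> and ? shortcuts belong to Jupyter/IPython. If you are in a plain Python session, a rough equivalent uses the built-in dir() and the __doc__ attribute. This is a sketch; public_names and doc are names I made up for the example.

```python
import pandas as pd

# dir() lists the names an object exposes, much like pressing <TAB>
public_names = [name for name in dir(pd) if not name.startswith("_")]
print(len(public_names), "public names, for example:", public_names[:5])

# __doc__ holds the docstring that `pd.read_csv?` displays (among other
# details); help(pd.read_csv) prints the same text in a pager
doc = pd.read_csv.__doc__
print(doc[:200])
```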

Pandas dataframes

The dataframe is the main data object in pandas.

Importing data

Dataframes can be created from multiple sources, e.g. CSV files, Excel files, and JSON.

In [3]:
baseDataUrl = "https://sciencedata.dk/public/6e3ed434c0fa43df906ce2b6d1ba9fc6/the_multilingual_quantitative_biologist/content/data/"
In [4]:
MyDF = pd.read_csv(baseDataUrl+'testcsv.csv', sep=',')
MyDF
Out[4]:
   Species                       Infraorder      Family          Distribution  Body mass male (Kg)
0  Daubentonia_madagascariensis  Chiromyiformes  Daubentoniidae  Madagascar    2.700
1  Allocebus_trichotis           Lemuriformes    Cheirogaleidae  Madagascar    0.100
2  Avahi_laniger                 Lemuriformes    Indridae        America       1.030
3  Avahi_occidentalis            Lemuriformes    Indridae        Madagascar    0.814
4  Avahi_unicolor                Lemuriformes    Indridae        America       0.830
5  Cheirogaleus_adipicaudatus    Lemuriformes    Cheirogaleidae  Madagascar    0.200
6  Cheirogaleus_crossleyi        Lemuriformes    Cheirogaleidae  Madagascar    0.400
7  Cheirogaleus_major            Lemuriformes    Cheirogaleidae  Madagascar    0.450
8  Cheirogaleus_medius           Lemuriformes    Cheirogaleidae  Madagascar    0.217
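
The same reader functions also accept in-memory text, which is handy for trying things out without a file or a network connection. Below is a minimal sketch with a made-up two-row table; csv_text, json_text and the column names are my own assumptions. For Excel files there is pd.read_excel, which needs an extra package such as openpyxl, so I leave it out here.

```python
import io
import pandas as pd

# Made-up two-row table, written inline so the example needs no files
csv_text = "Species,Mass\nA_a,1.5\nB_b,0.2\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# The same table as a JSON list of records (one object per row)
json_text = '[{"Species": "A_a", "Mass": 1.5}, {"Species": "B_b", "Mass": 0.2}]'
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv)
print(df_json)
```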

Creating dataframes

You can also create dataframes using a Python dictionary-like syntax, where each key is a column name and each value is the list of entries in that column:

In [5]:
import numpy as np  # np.nan marks the missing value below

MyDF = pd.DataFrame({
   'col1': ['Var1', 'Var2', 'Var3', 'Var4'],
   'col2': ['Grass', 'Rabbit', 'Fox', 'Wolf'],
   'col3': [1, 2, np.nan, 4]
})

MyDF
Out[5]:
   col1    col2  col3
0  Var1   Grass   1.0
1  Var2  Rabbit   2.0
2  Var3     Fox   NaN
3  Var4    Wolf   4.0
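
The dictionary above is organised by column. If your data arrive row by row instead, one possible alternative is to pass a list of dictionaries, one per row. This is a sketch; rows and df_rows are names I made up for the example.

```python
import pandas as pd

# Each dictionary is one row; a key missing from a row becomes NaN
rows = [
    {"col1": "Var1", "col2": "Grass", "col3": 1},
    {"col1": "Var2", "col2": "Rabbit", "col3": 2},
    {"col1": "Var3", "col2": "Fox"},  # no 'col3' entry here
    {"col1": "Var4", "col2": "Wolf", "col3": 4},
]
df_rows = pd.DataFrame(rows)
print(df_rows)
```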

Examining your data

In [6]:
# Displays the top 5 rows. Accepts an optional int parameter - num. of rows to show
MyDF.head()
Out[6]:
   col1    col2  col3
0  Var1   Grass   1.0
1  Var2  Rabbit   2.0
2  Var3     Fox   NaN
3  Var4    Wolf   4.0
In [7]:
# Similar to head, but displays the last 5 rows (here the frame has only 4, so all are shown)
MyDF.tail()
Out[7]:
   col1    col2  col3
0  Var1   Grass   1.0
1  Var2  Rabbit   2.0
2  Var3     Fox   NaN
3  Var4    Wolf   4.0
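
Because this dataframe has only four rows, head() and tail() return the same thing. Passing the optional row count makes the difference visible. This sketch rebuilds the same small table so that it runs on its own; df, first_two and last_two are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": ["Var1", "Var2", "Var3", "Var4"],
    "col2": ["Grass", "Rabbit", "Fox", "Wolf"],
    "col3": [1, 2, np.nan, 4],
})

first_two = df.head(2)  # rows with index 0 and 1
last_two = df.tail(2)   # rows with index 2 and 3

print(first_two)
print(last_two)
```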
In [8]:
# The dimensions of the dataframe as a (rows, cols) tuple
MyDF.shape
Out[8]:
(4, 3)
In [9]:
# The number of rows. Equal to MyDF.shape[0]
len(MyDF)
Out[9]:
4
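
To keep rows and columns straight, it can help to unpack shape into two names. A small sketch on the same toy table, rebuilt here so it runs on its own (n_rows and n_cols are names I chose):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": ["Var1", "Var2", "Var3", "Var4"],
    "col2": ["Grass", "Rabbit", "Fox", "Wolf"],
    "col3": [1, 2, np.nan, 4],
})

# shape is (rows, columns), so unpack it in that order
n_rows, n_cols = df.shape

# len() on a dataframe counts rows; count columns through df.columns
print(n_rows == len(df))
print(n_cols == len(df.columns))
```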
In [10]:
# An Index object holding the column names
MyDF.columns 
Out[10]:
Index(['col1', 'col2', 'col3'], dtype='object')
In [11]:
# Columns and their types
MyDF.dtypes
Out[11]:
col1     object
col2     object
col3    float64
dtype: object
In [12]:
# Returns the data as a two-dimensional NumPy array (dtype object here, because the column types are mixed)
MyDF.values
Out[12]:
array([['Var1', 'Grass', 1.0],
       ['Var2', 'Rabbit', 2.0],
       ['Var3', 'Fox', nan],
       ['Var4', 'Wolf', 4.0]], dtype=object)
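
The pandas documentation recommends the to_numpy() method over the values attribute for this conversion. Here is a minimal sketch on the same toy table, rebuilt so it runs on its own (whole and col3 are illustrative names). Note that converting a single numeric column keeps its numeric dtype, whereas the whole mixed-type frame does not.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": ["Var1", "Var2", "Var3", "Var4"],
    "col2": ["Grass", "Rabbit", "Fox", "Wolf"],
    "col3": [1, 2, np.nan, 4],
})

# Whole frame: text and numbers must share one array
whole = df.to_numpy()
print(whole.shape)

# Single numeric column: stays a float64 array
col3 = df["col3"].to_numpy()
print(col3.dtype)
```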
In [13]:
# Displays descriptive stats for the numeric columns (here only col3)
MyDF.describe()
Out[13]:
           col3
count  3.000000
mean   2.333333
std    1.527525
min    1.000000
25%    1.500000
50%    2.000000
75%    3.000000
max    4.000000
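
Note that only col3 appears, and that its count is 3 rather than 4 because the missing value is skipped. If you also want a summary of the text columns, describe() accepts include="all". A sketch on the same toy table, rebuilt so it runs on its own (numeric_only and summary are names I made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": ["Var1", "Var2", "Var3", "Var4"],
    "col2": ["Grass", "Rabbit", "Fox", "Wolf"],
    "col3": [1, 2, np.nan, 4],
})

# Default behaviour: numeric columns only
numeric_only = df.describe()

# include="all" adds count/unique/top/freq rows for the text columns;
# statistics that do not apply to a column are shown as NaN
summary = df.describe(include="all")
print(summary)
```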

OK, I am going to stop this brief intro to Jupyter with pandas here! I think you can already see the potential value of Jupyter for data analyses and visualization. As I mentioned above, you can also use R (e.g., using tidyr + ggplot2) for this.