Data analyses with Python & Jupyter¶

Introduction¶

You can do complex biological data manipulation and analyses using the pandas Python package (or, by switching kernels, using R).

We will look at pandas here, which provides R-like functions for data manipulation and analyses. pandas is built on top of NumPy. Most importantly, it offers an R-like DataFrame object: a two-dimensional table with explicit row and column names whose columns can hold different types of data as well as missing values, which a NumPy array with a single numeric dtype cannot do.
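As a small illustration of this difference, the sketch below builds a DataFrame with a text column and a numeric column containing a missing value, then forces the same data into a single NumPy array. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a text column and a numeric column with one missing value
df = pd.DataFrame(
    {"species": ["Fox", "Rabbit", "Wolf"], "mass_kg": [6.5, np.nan, 40.0]},
    index=["a", "b", "c"],
)

# Each column keeps its own type
print(df.dtypes)

# Missing values are tracked per cell
missing = df["mass_kg"].isna().tolist()
print(missing)

# The same data in one NumPy array is coerced to a single (string) dtype
arr = np.array([["Fox", 6.5], ["Rabbit", np.nan], ["Wolf", 40.0]])
print(arr.dtype)
```

Here the numeric column stays numeric inside the DataFrame, whereas the plain array stores every value, including the numbers, as text.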

pandas also implements a number of powerful data operations for filtering, grouping and reshaping data similar to R or spreadsheet programs.
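To make these three kinds of operation concrete, here is a minimal sketch on a tiny table. The data and column names are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical counts of two species at two sites
df = pd.DataFrame({
    "site": ["A", "A", "B", "B"],
    "species": ["Fox", "Rabbit", "Fox", "Rabbit"],
    "count": [2, 30, 5, 12],
})

# Filtering: keep rows that satisfy a boolean condition
abundant = df[df["count"] > 4]

# Grouping: total count per species
totals = df.groupby("species")["count"].sum()

# Reshaping: one row per site, one column per species
wide = df.pivot(index="site", columns="species", values="count")

print(abundant)
print(totals)
print(wide)
```

The reshaped table is the "wide" layout you would typically build by hand in a spreadsheet.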

Installing Pandas¶

pandas requires NumPy; see the pandas documentation for installation details. If you installed Anaconda, pandas is already installed. Otherwise, you can install it with sudo apt install python3-pandas (on Debian or Ubuntu) or with pip install pandas.

Assuming pandas is installed, you can import it and check the version:

import pandas as pd
pd.__version__

'1.2.5'

Also import scipy:

import scipy as sc

Reminder about tabbing and help!¶

As you read through these chapters, don't forget that Jupyter lets you quickly explore the contents of a package, or the methods applicable to an object, using tab completion. Documentation for functions can also be accessed using the ? character. For example, to display all the contents of the pandas namespace, you can type

In [1]: pd.<TAB>

And to display pandas's built-in documentation, you can use this:

In [2]: pd?
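The tab and ? shortcuts are features of Jupyter/IPython. As a rough equivalent that works in any Python session, the sketch below uses the built-in dir() and the standard-library inspect module. Filtering on the "read_" prefix is just an illustrative choice.

```python
import inspect
import pandas as pd

# Roughly what pd.<TAB> shows: names in the pandas namespace,
# here restricted to the file-reading functions
readers = [name for name in dir(pd) if name.startswith("read_")]
print(readers)

# Roughly what pd.read_csv? shows: the function's docstring
doc = inspect.getdoc(pd.read_csv)
print(doc.splitlines()[0])

# help(pd.read_csv) prints the full documentation page
```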

Pandas `dataframes`¶

The DataFrame is the main data object in pandas.

Importing data¶

Dataframes can be created from multiple sources, e.g., CSV files, Excel files, and JSON.

baseDataUrl = "https://sciencedata.dk/public/6e3ed434c0fa43df906ce2b6d1ba9fc6/the_multilingual_quantitative_biologist/content/data/"

MyDF = pd.read_csv(baseDataUrl+'testcsv.csv', sep=',')
MyDF
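The other sources work in the same way. The sketch below reads the same two records from CSV text and from JSON text, using io.StringIO in place of files so that it runs without any downloads. The records are made up for illustration. Excel files can be read with pd.read_excel, which needs an optional engine such as openpyxl to be installed.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = "Species,Mass\nFox,6.5\nWolf,40.0\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# The same records written as JSON text
json_text = '[{"Species": "Fox", "Mass": 6.5}, {"Species": "Wolf", "Mass": 40.0}]'
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv)
print(from_json)
```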

Creating dataframes¶

You can also create dataframes from a Python dictionary that maps column names to lists of values:

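import numpy as np  # np.nan is the standard missing-value marker; scipy's re-export (sc.nan) is deprecated

MyDF = pd.DataFrame({
   'col1': ['Var1', 'Var2', 'Var3', 'Var4'],
   'col2': ['Grass', 'Rabbit', 'Fox', 'Wolf'],
   'col3': [1, 2, np.nan, 4]  # np.nan marks a missing value
})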

MyDF

Examining your data¶

# Displays the first 5 rows by default; accepts an optional integer giving the number of rows to show
MyDF.head()

# Similar to head, but displays the last rows
MyDF.tail()
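Both methods take the number of rows as an optional argument. A minimal sketch, using a hypothetical ten-row dataframe so that the effect is visible:

```python
import pandas as pd

# Hypothetical dataframe with ten rows
df = pd.DataFrame({"x": range(10)})

first_three = df.head(3)  # rows 0, 1, 2
last_two = df.tail(2)     # rows 8, 9

print(first_three)
print(last_two)
```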

# The dimensions of the dataframe as a (rows, cols) tuple
MyDF.shape

(4, 3)

# The number of rows. Equal to MyDF.shape[0]
len(MyDF)

4

# An Index object holding the column names
MyDF.columns

Index(['col1', 'col2', 'col3'], dtype='object')

# Columns and their types
MyDF.dtypes

col1     object
col2     object
col3    float64
dtype: object

# Returns the data as a two-dimensional NumPy array (row and column labels are dropped)
MyDF.values

array([['Var1', 'Grass', 1.0],
       ['Var2', 'Rabbit', 2.0],
       ['Var3', 'Fox', nan],
       ['Var4', 'Wolf', 4.0]], dtype=object)
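The pandas documentation recommends the to_numpy() method over the .values attribute for this conversion. The sketch below, on made-up data, also shows why the array above has dtype=object: mixed column types have no common numeric type, whereas a purely numeric selection converts to a float array.

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with one text and one numeric column
df = pd.DataFrame({"name": ["Fox", "Wolf"], "mass": [6.5, 40.0]})

# Mixed column types fall back to a generic object array
mixed = df.to_numpy()

# Selecting only numeric columns gives a float array
numeric = df[["mass"]].to_numpy()

print(mixed.dtype, numeric.dtype)
```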

# Displays descriptive stats for the numeric columns (non-numeric columns are skipped by default)
MyDF.describe()
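By default, describe() summarises only numeric columns. Passing include="all" adds the non-numeric columns as well, reporting count, unique, top and freq for them. A minimal sketch on a two-column version of the dataframe above (the name df is used here to avoid overwriting MyDF):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col2": ["Grass", "Rabbit", "Fox", "Wolf"],
    "col3": [1, 2, np.nan, 4],
})

# Default: numeric columns only; the missing value is excluded from the count
numeric_only = df.describe()

# include="all": text columns are summarised too
everything = df.describe(include="all")

print(numeric_only)
print(everything)
```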

The table outputs of the cells above are as follows.

MyDF after reading testcsv.csv:

   Species                       Infraorder      Family          Distribution  Body mass male (Kg)
0  Daubentonia_madagascariensis  Chiromyiformes  Daubentoniidae  Madagascar    2.700
1  Allocebus_trichotis           Lemuriformes    Cheirogaleidae  Madagascar    0.100
2  Avahi_laniger                 Lemuriformes    Indridae        America       1.030
3  Avahi_occidentalis            Lemuriformes    Indridae        Madagascar    0.814
4  Avahi_unicolor                Lemuriformes    Indridae        America       0.830
5  Cheirogaleus_adipicaudatus    Lemuriformes    Cheirogaleidae  Madagascar    0.200
6  Cheirogaleus_crossleyi        Lemuriformes    Cheirogaleidae  Madagascar    0.400
7  Cheirogaleus_major            Lemuriformes    Cheirogaleidae  Madagascar    0.450
8  Cheirogaleus_medius           Lemuriformes    Cheirogaleidae  Madagascar    0.217

MyDF created from the dictionary. MyDF.head() and MyDF.tail() display the same table, because the dataframe has fewer than five rows:

   col1  col2    col3
0  Var1  Grass   1.0
1  Var2  Rabbit  2.0
2  Var3  Fox     NaN
3  Var4  Wolf    4.0

MyDF.describe():

       col3
count  3.000000
mean   2.333333
std    1.527525
min    1.000000
25%    1.500000
50%    2.000000
75%    3.000000
max    4.000000

OK, I am going to stop this brief intro to Jupyter with pandas here! I think you can already see the potential value of Jupyter for data analyses and visualization. As I mentioned above, you can also use R (e.g., using tidyr + ggplot) for this.