You can do complex biological data manipulation and analyses using the pandas Python package (or, by switching kernels, using R!).
We will look at pandas here, which provides R-like functions for data manipulation and analyses. pandas is built on top of NumPy. Most importantly, it offers an R-like DataFrame object: a two-dimensional table with explicit row and column names that can hold a different type of data in each column, as well as missing values, which is awkward to do with plain NumPy arrays.
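As a minimal sketch of that difference (the species names and masses below are made up for illustration), a NumPy array coerces mixed values to one common type, whereas a DataFrame keeps a separate type per column and stores the gap as a missing value:

```python
import numpy as np
import pandas as pd

# A NumPy array has a single dtype, so mixing text and a number
# turns every element into a string
arr = np.array(["Grass", 2.0])
print(arr.dtype.kind)  # "U" means a unicode string dtype

# Hypothetical example data: a text column and a numeric column with a gap
df = pd.DataFrame({
    "species": ["Grass", "Rabbit", "Fox"],
    "mass": [0.1, 2.0, None],  # None is stored as NaN in a numeric column
})

# Each column keeps its own dtype; the numeric column stays float64
print(df.dtypes)

# Number of missing values in the numeric column
print(df["mass"].isna().sum())
```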
pandas also implements a number of powerful data operations for filtering, grouping and reshaping data, similar to those in R or spreadsheet programs.
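To make these three kinds of operation concrete, here is a small sketch on invented data (the `obs` table, its column names and its values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical observations: body mass of animals by species and site
obs = pd.DataFrame({
    "species": ["Rabbit", "Rabbit", "Fox", "Fox"],
    "site": ["A", "B", "A", "B"],
    "mass_kg": [1.8, 2.2, 5.5, 6.5],
})

# Filtering: keep only the rows that satisfy a condition
heavy = obs[obs["mass_kg"] > 2.0]

# Grouping: mean mass per species
mean_mass = obs.groupby("species")["mass_kg"].mean()

# Reshaping: one row per species, one column per site
wide = obs.pivot(index="species", columns="site", values="mass_kg")

print(heavy, mean_mass, wide, sep="\n\n")
```

Boolean indexing, `groupby` and `pivot` are only one way to express each step; pandas offers several alternatives (for example `query` or `pivot_table`).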
pandas requires NumPy. See the pandas documentation.
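If you installed Anaconda, you already have pandas installed. Otherwise, you can install it with your system package manager (on Ubuntu or Debian, sudo apt install python3-pandas) or with pip install pandas.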
Assuming pandas is installed, you can import it and check the version:
import pandas as pd
pd.__version__
Also import scipy:
import scipy as sc
As you read through these chapters, don't forget that Jupyter lets you quickly explore the contents of a package, or the methods applicable to an object, using the tab-completion feature. The documentation of a function or object can also be accessed using the ? character. For example, to display all the contents of the pandas namespace, you can type
In [1]: pd.<TAB>
And to display Pandas's built-in documentation, you can use this:
In [2]: pd?
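Tab completion and ? are features of the Jupyter (IPython) front end. In a plain Python script or console, a rough equivalent can be sketched with the built-in `dir()` and the standard `inspect` module; the choice of `read_csv` as the function to look up is just an example:

```python
import inspect
import pandas as pd

# List the public names in the pandas namespace that start with "read_",
# similar to typing pd.read_<TAB> in Jupyter
readers = [name for name in dir(pd) if name.startswith("read_")]
print(readers)

# First line of the docstring that `pd.read_csv?` would display in full
summary = inspect.getdoc(pd.read_csv).splitlines()[0]
print(summary)

# help(pd.read_csv) prints the complete documentation
```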
# Base URL of the data files
baseDataUrl = "https://sciencedata.dk/public/6e3ed434c0fa43df906ce2b6d1ba9fc6/the_multilingual_quantitative_biologist/content/data/"
# Read a comma-separated file from that URL into a DataFrame
MyDF = pd.read_csv(baseDataUrl + 'testcsv.csv', sep=',')
MyDF
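The cell above fetches the file over the network. To try `read_csv` offline, here is a minimal sketch that reads the same kind of text from an in-memory string; the column names and values are invented and are not the contents of testcsv.csv:

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk or at a URL
csv_text = """species,count
Grass,120
Rabbit,14
Fox,2
"""

# read_csv accepts a file-like object as well as a path or URL
df = pd.read_csv(io.StringIO(csv_text), sep=",")
print(df)
```

The first line of the text is taken as the header row, and the type of each column is inferred from its values.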
You can also create dataframes using a Python dictionary-like syntax, where each key is a column name and each value is the list of entries in that column:
MyDF = pd.DataFrame({
'col1': ['Var1', 'Var2', 'Var3', 'Var4'],
'col2': ['Grass', 'Rabbit', 'Fox', 'Wolf'],
'col3': [1, 2, float('nan'), 4]  # NaN marks a missing value
})
MyDF
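The third column of this data frame contains a missing value. As a hedged sketch of how such gaps might be handled (filling with 0 is an arbitrary choice made for illustration, not a recommendation), pandas can count, drop or fill them:

```python
import pandas as pd

# Same layout as the data frame above, with one missing value in col3
df = pd.DataFrame({
    "col1": ["Var1", "Var2", "Var3", "Var4"],
    "col2": ["Grass", "Rabbit", "Fox", "Wolf"],
    "col3": [1, 2, float("nan"), 4],
})

# Count the missing values in each column
n_missing = df.isna().sum()

# Drop every row that contains a missing value
complete = df.dropna()

# Or replace the missing value in col3 with a chosen value (0 here)
filled = df.fillna({"col3": 0})

print(n_missing, complete, filled, sep="\n\n")
```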
# Displays the top 5 rows. Accepts an optional int parameter - num. of rows to show
MyDF.head()
# Similar to head, but displays the last rows
MyDF.tail()
# The dimensions of the dataframe as a (rows, cols) tuple
MyDF.shape
# The number of rows. Equal to MyDF.shape[0]
len(MyDF)
# An array of the column names
MyDF.columns
# Columns and their types
MyDF.dtypes
# The data as a two-dimensional NumPy array, without row and column labels
MyDF.values
# Displays descriptive stats for the numeric columns (pass include='all' for every column)
MyDF.describe()
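To check what these attributes and methods return, here is a small self-contained sketch that rebuilds the dictionary-based data frame from above under the assumed name `df` and inspects it:

```python
import pandas as pd

# Rebuild the four-row example so this block runs on its own
df = pd.DataFrame({
    "col1": ["Var1", "Var2", "Var3", "Var4"],
    "col2": ["Grass", "Rabbit", "Fox", "Wolf"],
    "col3": [1, 2, float("nan"), 4],
})

# shape is (rows, columns); len() gives the number of rows
n_rows, n_cols = df.shape
print(n_rows, n_cols, len(df))

# head() takes an optional number of rows to show
first_two = df.head(2)

# By default describe() summarises only the numeric columns
summary_numeric = df.describe()

# include="all" adds the text columns to the summary
summary_all = df.describe(include="all")

print(summary_numeric, summary_all, sep="\n\n")
```

Note that the count reported for col3 excludes the missing value, so it is one less than the number of rows.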
OK, I am going to stop this brief intro to Jupyter with pandas here! I think you can already see the potential value of Jupyter for data analyses and visualization. As I mentioned above, you can also use R (e.g., using tidyr + ggplot2) for this.