Data Analysis with Jupyter Notebooks.

Tutorial 3

Benjamin J. Morgan, University of Bath.

Data types

So far we have talked about “data” and “results”, but what are the pieces of information that we want to manipulate? Typically numbers (or groups of numbers) or text (or lists of text). Different kinds of data can be used for different things: numbers can be combined in mathematical expressions, text can be printed, searched, or reorganised; numbers can be arranged by magnitude, names can be arranged by alphabetical order. In Python, these differences are represented by different data types.

Numbers: int and float

We will discuss two kinds of numeric types: integers and floating point numbers. Python has other built-in numeric data types, including complex numbers, which are useful in specialised cases.

Whole numbers, without decimal points, are integers, e.g. 1, 6, 2331.

Numbers with decimal points are floating point numbers or “floats”, e.g. 1.0, 232.141, 1.3e5.

That last example, 1.3e5, uses scientific notation and is shorthand for 130000.0.
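The distinction between the two numeric types can be checked directly with the built-in type() function. The sketch below is illustrative only; the variable names are made up for this example.

```python
# Ask Python which type it assigns to each kind of number.
whole = 2331         # no decimal point: int
decimal = 232.141    # decimal point: float
scientific = 1.3e5   # scientific notation always gives a float

print(type(whole))       # <class 'int'>
print(type(decimal))     # <class 'float'>
print(type(scientific))  # <class 'float'>

# Combining an int with a float gives a float.
mixed = whole + 1.0
print(type(mixed))       # <class 'float'>

# Division with / always gives a float, even for a whole-number result.
quotient = 6 / 3
print(quotient)          # 2.0
```

Note that 1 and 1.0 compare as equal in value, even though they are stored as different types.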

Scientific Notation

Very large and very small numbers can be written using scientific notation. For example, instead of 0.0000241, we would normally write $2.41 \times 10^{-5}$. In Python this would be written 2.41e-5 or 2.41e-05.
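As a small sketch of how this notation behaves, the two spellings of the exponent give the same float, and a float can be printed back in scientific notation using a format specifier. The format specifier `.2e` is not covered in this tutorial and is shown here only as an optional extra.

```python
small = 2.41e-5
large = 1.3e5

# Both ways of writing the exponent give the same number.
print(small == 2.41e-05)   # True

# Numbers written in scientific notation are stored as ordinary floats.
print(large)               # 130000.0

# Format a float in scientific notation with two decimal places.
print(f"{small:.2e}")      # 2.41e-05
print(f"{large:.2e}")      # 1.30e+05
```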

2.41e-5 == 0.0000241
In [ ]:
 

Strings

Strings are any sequence of text. We indicate that a sequence of text is a string, and not a Python command, by enclosing it in single or double quotes. Being able to use either quote type allows strings that themselves contain quotes.
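The following sketch shows why having two quote types is convenient. The last example uses a backslash escape, which goes slightly beyond what is described here and is included only for comparison.

```python
# Double quotes inside a single-quoted string.
single = 'She said "hello"'

# An apostrophe inside a double-quoted string.
double = "It's a string"

# Alternatively, a backslash escapes a quote that matches the delimiters.
escaped = 'It\'s also a string'

print(single)    # She said "hello"
print(double)    # It's a string
print(escaped)   # It's also a string

# The enclosing quotes mark the start and end of the string,
# but are not part of the text itself.
print(len('abc'))        # 3
print('abc' == "abc")    # True
```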

'this is a string using single quotes'
In [ ]:
 
"this is a string using double quotes"
In [ ]:
 
'this string has "nested quotes"'
In [ ]:
 

Lists

Python also contains built-in data types for collections of things. For data analysis we often deal with sets of numbers. These can be collected in lists.

A list is denoted by a series of items separated by commas, and enclosed in square brackets:

my_list = [ 1, 2, 3, 4 ]

my_list
In [ ]:
 

Lists are not limited to numbers, however: they can contain any Python objects, even other lists:

my_other_list = [ 4, 1.5, 'peach' ]

my_other_list
In [ ]:
 
both_lists = [ my_list, my_other_list ]

both_lists
In [ ]:
 

To refer to one element in a list, use the index of that element. Index numbering counts the number of jumps along the sequence, so starts at zero.
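As a minimal sketch, the same indexing rule applies to a list stored inside another list: the first index selects the inner list, and a second index selects an element within it. The lists are redefined here so the example runs on its own.

```python
my_list = [1, 2, 3, 4]
my_other_list = [4, 1.5, 'peach']
both_lists = [my_list, my_other_list]

# Zero jumps along the sequence gives the first element.
first = my_other_list[0]
print(first)               # 4

# both_lists[1] is my_other_list, so a second index looks inside it.
nested = both_lists[1][2]
print(nested)              # peach

# The outer list has two elements, each of which is itself a list.
print(len(both_lists))     # 2
```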

# 1st element (zero jumps along the sequence)

print( my_other_list[0] )

# 2nd element (one jump along the sequence)

print( my_other_list[1] ) 

# 3rd element (two jumps along the sequence)

print( my_other_list[2] )
In [ ]:
 

Using an index outside the range of elements in the list will produce an error. For example, my_other_list has three elements, but my_other_list[3] tries to return the 4th element (which does not exist).
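Because numbering starts at zero, the largest valid index is always one less than the length of the list. The sketch below checks this with len(), and uses try and except (not covered in this tutorial) so that the error is reported without stopping the code.

```python
my_other_list = [4, 1.5, 'peach']

# Three elements, so the valid indices are 0, 1, and 2.
print(len(my_other_list))          # 3
last_index = len(my_other_list) - 1
print(my_other_list[last_index])   # peach

# Index 3 would be a 4th element, which does not exist.
try:
    my_other_list[3]
except IndexError as error:
    message = str(error)
    print('IndexError:', message)  # IndexError: list index out of range
```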

print( my_other_list[3] )
In [ ]:
 

You can also refer to a sequence of elements by giving a range as the index:

In [ ]:
# run this cell to create the list `alphabet`

alphabet = [ 'a', 'b', 'c', 'd', 'e', 'f', 'g', 

             'h', 'i', 'j', 'k', 'l', 'm', 'n', 

             'o', 'p', 'q', 'r', 's', 't', 'u',

             'v', 'w', 'x', 'y', 'z' ]
alphabet[3:8]

→ start from three jumps, stop before eight jumps, i.e. the 4th to the 8th elements ('d' to 'h').

In [ ]:
 

Negative numbers count backwards from the end of the sequence.

alphabet[-8:-3]

→ 8th from the end up to 4th from the end, i.e. 's' to 'w'.

In [ ]:
 

And leaving out one of the numbers in the range will include all elements up to the start or end of the sequence.
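The slicing rules above can be summarised in a short sketch. Here the alphabet list is built from string.ascii_lowercase in the standard library, as a shortcut to typing out all 26 letters; this is an assumption made for the example, not how the list is created in this tutorial.

```python
import string

# A list of the 26 lowercase letters, 'a' to 'z'.
alphabet = list(string.ascii_lowercase)

# The slice starts at the first index and stops before the second.
print(alphabet[3:8])     # ['d', 'e', 'f', 'g', 'h']

# Negative indices count back from the end of the list.
print(alphabet[-8:-3])   # ['s', 't', 'u', 'v', 'w']

# Leaving out a number runs to the start or end of the list.
print(alphabet[:3])      # ['a', 'b', 'c']
print(alphabet[23:])     # ['x', 'y', 'z']

# The number of elements in a slice is (stop - start).
print(len(alphabet[3:8]))   # 5
```

Because the stop index is excluded, alphabet[:14] and alphabet[14:] together contain every letter exactly once.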

alphabet[14:]
In [ ]:
 
alphabet[:14]
In [ ]:
 

numpy and arrays

Although lists can be very useful for handling ordered collections of things, for data manipulation we usually deal with ordered lists of only numbers. The flexibility of lists means using them is (relatively) computationally slow. This is not an issue for small data sets, but can be prohibitive for large data sets, with perhaps millions or more entries.

An alternative data type, specifically designed for manipulating (large) numerical data sets, is the numpy array. numpy is a module for numerical scientific computing with Python, and is conventionally imported via:

import numpy as np

This is similar to the import math we saw above, but uses the as keyword to make numpy more convenient to work with.

import math

math.sqrt(4)
In [ ]:
import math

Having imported numpy (as np) we can store lists of numbers as numpy arrays.

import numpy as np

a = np.array( [ 1, 2, 3, 4 ] )

a
In [ ]:
 

You can think of a 1-dimensional numpy array as a vector, and we can use very compact code to perform vector mathematical operations on the entire array.

a + 1
In [ ]:
 
a**2

Remember that ** is the power operator. This code calculates $a^2$ for every number stored in a.

In [ ]:
 

In both these cases, the mathematical operation (add one; square) is applied to every element in the array, and a new array with all the results is returned.
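To see what this saves, the sketch below performs the same calculation on a plain list and on a numpy array. The list version uses a list comprehension, which is not covered in this tutorial and is shown only for contrast.

```python
import numpy as np

values = [1, 2, 3, 4]
a = np.array(values)

# With a plain list, each element has to be handled explicitly.
squares_list = [v**2 for v in values]
print(squares_list)             # [1, 4, 9, 16]

# With an array, the operation is applied to every element at once.
squares_array = a**2
plus_one = a + 1
print(squares_array.tolist())   # [1, 4, 9, 16]
print(plus_one.tolist())        # [2, 3, 4, 5]

# A new array is returned each time; the original is unchanged.
print(a.tolist())               # [1, 2, 3, 4]
```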

If the mathematical expression contains two (or more) arrays, then an element-by-element operation is performed:

e.g. vector addition:

b = np.array( [ 5, 6, 7, 8 ] )

a + b
In [ ]:
 
a * b
In [ ]:
 

Let us try to calculate the square root of all the numbers in a:

from math import sqrt

sqrt(a)
In [ ]:
import numpy as np

a = np.array( [ 1, 2, 3, 4 ] )

a

np.sqrt(a)

This gives an error.

Because numpy is not part of the standard Python library, the sqrt function provided by the math module does not know how to treat a numpy array of numbers. To do what we want we can use the sqrt function in numpy instead.
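The difference between the two sqrt functions can be sketched as follows. The try and except statements are not covered in this tutorial; they are used here only so that the error from math.sqrt is reported without stopping the code.

```python
import math
import numpy as np

a = np.array([1, 2, 3, 4])

# math.sqrt expects a single number, so passing a whole array fails.
try:
    math.sqrt(a)
    math_sqrt_failed = False
except TypeError as error:
    math_sqrt_failed = True
    print('TypeError:', error)

# np.sqrt works element by element and returns a new array.
roots = np.sqrt(a)
print(roots[0])   # 1.0
print(roots[3])   # 2.0

# np.sqrt also accepts a single number.
print(np.sqrt(9))   # 3.0
```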

np.sqrt(a)
Edit the previous code cell to use the numpy version of sqrt instead of the standard function from the math module.
In [ ]:
# This cell tests your answers from the three previous code cells.

# You do not need to edit it

assert _[0] == math.sqrt(1)

assert _[1] == math.sqrt(2)

assert _[2] == math.sqrt(3)

assert _[3] == math.sqrt(4)

numpy contains a great many functions for performing mathematical operations on arrays of numbers, which are all listed on the numpy website.

To limit the number of decimal places in our result we can use round():

np.round( np.sqrt(a), 2 ) # round the result to 2 decimal places
In [ ]:
 

Notice that here the first argument is np.sqrt(a), which is itself a call to a numpy function. This is analogous to a function of a function in mathematics: $f(g(x))$. Nesting functions like this helps write compact code without storing the intermediate results. Nesting several functions can make your code confusing to read, however, and your primary goal should be to write clear, understandable code.
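As a sketch of the trade-off, the nested call can be rewritten in two steps with a named intermediate result. Both versions give the same answer; the variable names here are chosen for this example.

```python
import numpy as np

a = np.array([1, 2, 3, 4])

# Compact: the result of np.sqrt is passed straight to np.round.
nested = np.round(np.sqrt(a), 2)

# Step by step: store the intermediate result, then round it.
roots = np.sqrt(a)
rounded = np.round(roots, 2)

print(np.array_equal(nested, rounded))   # True
print(rounded.tolist())                  # [1.0, 1.41, 1.73, 2.0]
```

The two-step version takes more lines, but each line does one thing, which can make longer calculations easier to follow and to check.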

Generating sequences of numbers

Often, we will want to use numpy arrays to store experimental data. Other times we might just want a list of numbers, e.g. from 1 to 20. We could write these out to create the array:

one_to_twenty = np.array( [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 ] )

To save typing (and make your code easier to read) numpy contains a function for creating lists of numbers:

n = np.arange(1,21)

n
In [ ]:
 

Notice that arange gives us numbers starting from 1, up to, but not including, 21.

We can generate lists of numbers with different spacings by providing a step-size (which has a default value of 1).
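The behaviour of the start, stop, and step arguments can be checked with a short sketch. The last example relies on the start value defaulting to 0 when only one argument is given.

```python
import numpy as np

# Default step of 1: starts at 1, stops before 21.
n = np.arange(1, 21)
print(n[0], n[-1], len(n))   # 1 20 20

# Step of 2: the even numbers from 2 to 20.
m = np.arange(2, 21, 2)
print(m.tolist())            # [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]

# The stop value is excluded even when a step would land exactly on it.
stop_excluded = np.arange(0, 10, 5)
print(stop_excluded.tolist())   # [0, 5]

# With a single argument, counting starts from 0.
from_zero = np.arange(5)
print(from_zero.tolist())       # [0, 1, 2, 3, 4]
```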

m = np.arange(2,21,2)

m
In [ ]:
 

Another way to generate an evenly spaced list of numbers is to use linspace().

p = np.linspace(0,10,50)

p
In [ ]:
 

linspace() takes three arguments: the starting number, the end number, and the total number of values in the sequence.

linspace() is particularly useful for generating evenly spaced points that are not integers.
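A brief sketch of how linspace() differs from arange(): by default the end number is included, and the spacing follows from the number of values requested rather than being given directly. The variable names below are chosen for this example.

```python
import numpy as np

# 50 evenly spaced values from 0 to 10, including both ends.
p = np.linspace(0, 10, 50)
print(len(p))         # 50
print(p[0], p[-1])    # 0.0 10.0

# 50 values means 49 gaps, so the spacing is 10/49 (not 10/50).
spacing = p[1] - p[0]
print(np.isclose(spacing, 10 / 49))   # True

# Five values from 0 to 1 gives four gaps of 0.25.
q = np.linspace(0, 1, 5)
print(q.tolist())     # [0.0, 0.25, 0.5, 0.75, 1.0]
```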