Data Analysis with Jupyter Notebooks.

Tutorial 5

Benjamin J. Morgan, University of Bath.

Data analysis and statistics with numpy

numpy contains a lot of powerful functions for performing simple statistical analysis on our data. For example, consider the set of numbers 1 to 50:

import numpy as np

a = np.arange(1,51)

a

To find the minimum and maximum values we can use np.min() and np.max():

np.min(a)
np.max(a)

To find the sum of all these numbers, we can use np.sum():

np.sum(a)
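
As a quick sanity check, the sum of the integers 1 to $N$ has the closed form $N(N+1)/2$, so the result of np.sum() can be verified by hand. A minimal sketch (the variable names `n` and `closed_form` are chosen here for illustration):

```python
import numpy as np

a = np.arange(1, 51)

# Closed-form sum of the integers 1..N
n = len(a)
closed_form = n * (n + 1) // 2

print(np.sum(a))     # sum computed by numpy
print(closed_form)   # same value from the formula
```

Both lines should print the same number.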

The mean of a set of numbers is defined as

\begin{equation} \mu = \frac{1}{N}\sum_{i=1}^N x_i \end{equation}

which we could calculate with

np.sum(a) / len(a)

# len(a) returns the length of the array `a`

or with np.mean()

np.mean( a )

The standard deviation, $\sigma$, quantifies how much the numbers in our set deviate from the mean:

\begin{equation} \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^N(x_i-\mu)^2} \end{equation}

where $\mu$ is the mean.

Again, we could write this out in code:

import math

sigma = math.sqrt( np.sum( ( a - np.mean(a))**2 ) / len(a) )

sigma

Or use the np.std() function:

np.std(a)
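
Note that the formula above divides by $N$, which gives the population standard deviation; this is what np.std() computes by default. If the numbers are a sample drawn from a larger population, the sample standard deviation (dividing by $N-1$) is often preferred, and np.std() supports this through its `ddof` argument. A short sketch comparing the two (the variable names are illustrative):

```python
import numpy as np

a = np.arange(1, 51)

# Population standard deviation: divide by N (the default, ddof=0)
sigma_population = np.std(a)

# Sample standard deviation: divide by N - 1 (ddof=1)
sigma_sample = np.std(a, ddof=1)

print(sigma_population)
print(sigma_sample)
```

The sample value is always slightly larger, and the difference shrinks as $N$ grows.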

Linear Regression

Another commonly used data analysis technique is linear regression. This is used to calculate the relationship between two data sets, $X$ and $Y$, assuming that this relationship can be described by a straight line

\begin{equation} y_i = m x_i + c. \end{equation}

For any real data set, the data points are unlikely to all fall exactly on the same line. Linear regression is the process of calculating the line that “best fits” the given data.
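
One common definition of “best fit”, used in ordinary least-squares regression, is the line that minimises the sum of the squared vertical distances between the data points and the line. For a straight line this has a closed-form solution: $m = \sum_i (x_i - \bar{x})(y_i - \bar{y}) / \sum_i (x_i - \bar{x})^2$ and $c = \bar{y} - m\bar{x}$. A minimal sketch of this calculation, assuming a small made-up data set (`x_demo` and `y_demo` are purely illustrative):

```python
import numpy as np

# Illustrative data only: points scattered around a straight line
x_demo = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_demo = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

x_mean = np.mean(x_demo)
y_mean = np.mean(y_demo)

# Least-squares slope and intercept from the closed-form expressions
m = np.sum((x_demo - x_mean) * (y_demo - y_mean)) / np.sum((x_demo - x_mean) ** 2)
c = y_mean - m * x_mean

print("slope =", m)
print("intercept =", c)
```

In practice there is no need to write this out by hand, since library functions perform the same calculation.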

Look at the following snippet and try to work out what the result will be. Note that np.random.rand(10) creates an array of 10 random numbers drawn uniformly from the interval [0, 1).

import matplotlib.pyplot as plt

x = np.arange(1,11)

offset = 2.0

y = x + ( np.random.rand(10) - 0.5 ) * offset

plt.plot( x, y, 'o' )

plt.show()
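
Because the noise is random, the data will be slightly different every time the cell is run. If reproducible numbers are needed, one option is to seed numpy's random number generator before drawing the random values. A minimal sketch (the seed value 42 is arbitrary, and `y_first` and `y_second` are illustrative names):

```python
import numpy as np

x = np.arange(1, 11)
offset = 2.0

# Seeding with the same value gives the same "random" numbers
np.random.seed(42)
y_first = x + (np.random.rand(10) - 0.5) * offset

np.random.seed(42)
y_second = x + (np.random.rand(10) - 0.5) * offset

print(np.array_equal(y_first, y_second))
```

The random term lies between -offset/2 and +offset/2, so each point sits within 1.0 of the line $y=x$.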

The plot shows that the data approximately follow the straight line $y=x$. We can use linear regression to calculate the “best” straight line that describes these data.

There are a number of different ways in Python to calculate a line of best fit. One of the simplest is to use another module, scipy.stats. As you might suspect from the name, scipy.stats contains a large set of statistical analysis tools. We want linregress():

from scipy.stats import linregress

linregress( x, y )

The output looks complicated, but it is a collection of values that includes the slope and the intercept. In fact, you can treat the output like a list and use indexing to select a specific result.

linregress( x, y )[0] # use indexing to get the slope
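
The object returned by linregress() also exposes its values as named attributes, which can be easier to read than numerical indices. A short sketch, using an illustrative data set that lies exactly on a straight line (`x_demo`, `y_demo` and `result` are names made up for this example):

```python
import numpy as np
from scipy.stats import linregress

# Illustrative data: an exact straight line y = 2x + 1
x_demo = np.arange(1, 11)
y_demo = 2.0 * x_demo + 1.0

result = linregress(x_demo, y_demo)

# Access values by name rather than by position
print(result.slope)
print(result.intercept)
print(result.rvalue)
```

Here `result.slope` is the same value as `result[0]`.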

Another option is to collect all five of the output values at once:

slope, intercept, rvalue, pvalue, stderr = linregress( x, y )

print( "slope =", slope )

print( "intercept =", intercept )
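
The rvalue returned by linregress() is the correlation coefficient between the two data sets. Its square, often written $R^2$, is commonly quoted as a measure of how well the straight line describes the data, with values close to 1 indicating a good fit. A sketch showing that this agrees with $R^2$ calculated directly from the residuals (the data and variable names below are illustrative, and the result is read through the `rvalue`, `slope` and `intercept` attributes of the returned object):

```python
import numpy as np
from scipy.stats import linregress

# Illustrative data only
x_demo = np.arange(1, 11)
y_demo = np.array([1.2, 1.9, 3.4, 3.8, 5.1, 6.3, 6.6, 8.2, 9.1, 9.7])

fit = linregress(x_demo, y_demo)
r_squared = fit.rvalue ** 2

# R^2 from its definition: 1 - (residual sum of squares / total sum of squares)
y_pred = fit.slope * x_demo + fit.intercept
ss_res = np.sum((y_demo - y_pred) ** 2)
ss_tot = np.sum((y_demo - np.mean(y_demo)) ** 2)
r_squared_check = 1 - ss_res / ss_tot

print(r_squared)
print(r_squared_check)
```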

To plot the best-fit line against the original data, we generate a new data set according to $y=mx+c$, where $m$ and $c$ are the slope and intercept returned by linregress().

y_fit = slope * x + intercept 

plt.plot( x, y, 'o' )

plt.plot( x, y_fit, '-' )

plt.xlabel( 'x' )

plt.ylabel( 'y' )

plt.title( 'y = {:.2f} x + {:.2f}'.format( slope, intercept ) )

plt.show()
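
As noted earlier, there is more than one way to calculate a line of best fit in Python. One alternative is np.polyfit(), which fits a polynomial of a chosen degree; with degree 1 it fits a straight line and returns the coefficients from the highest power down, so the slope comes first and the intercept second. A minimal sketch comparing it with linregress() (the data and variable names are illustrative):

```python
import numpy as np
from scipy.stats import linregress

# Illustrative data only
x_demo = np.arange(1, 11)
y_demo = np.array([1.2, 1.9, 3.4, 3.8, 5.1, 6.3, 6.6, 8.2, 9.1, 9.7])

# Degree-1 polynomial fit: returns [slope, intercept]
m_poly, c_poly = np.polyfit(x_demo, y_demo, 1)

result = linregress(x_demo, y_demo)

print(m_poly, result.slope)
print(c_poly, result.intercept)
```

Both approaches should give the same slope and intercept to within floating-point precision; linregress() additionally reports statistics such as the correlation coefficient.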