numpy contains many powerful functions for performing simple statistical analysis on our data. For example, consider the set of numbers 1 to 50:
import numpy as np
a = np.arange(1,51)
a
To find the minimum and maximum values we can use np.min() and np.max():
np.min(a)
np.max(a)
To find the sum of all these numbers, we can use np.sum()
np.sum(a)
The mean, $\mu$, of a set of $N$ numbers is defined as
\begin{equation} \mu = \frac{\sum_{i=1}^N x_i}{N} \end{equation}
which we could calculate with
np.sum(a) / len(a)
# len(a) returns the length of the array `a`
or with np.mean()
np.mean( a )
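As a quick check of the two approaches, the sketch below compares them against the closed-form result for consecutive integers: the sum of 1 to $N$ is $N(N+1)/2$, so the mean is $(N+1)/2$. The variable names `manual_mean`, `numpy_mean` and `expected_mean` are illustrative only.

```python
import numpy as np

a = np.arange(1, 51)                 # the integers 1 to 50
n = len(a)

manual_mean = np.sum(a) / n          # sum divided by the number of elements
numpy_mean = np.mean(a)              # the same quantity from numpy
expected_mean = (n + 1) / 2          # closed form for the integers 1..N

print(manual_mean, numpy_mean, expected_mean)
print(np.isclose(manual_mean, numpy_mean))
```

All three values should agree, which is a useful sanity check before relying on np.mean() for less regular data.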
The standard deviation, $\sigma$, quantifies how much the numbers in our set deviate from the mean:
\begin{equation} \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^N(x_i-\mu)^2} \end{equation}
where $\mu$ is the mean and $N$ is the number of values.
Again, we could write this out in code:
import math
sigma = math.sqrt( np.sum( ( a - np.mean(a))**2 ) / len(a) )
sigma
Or use the np.std() function:
np.std(a)
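By default, np.std() divides by $N$, which matches the formula above (the "population" standard deviation). If the data are a sample drawn from a larger population, it is common to divide by $N-1$ instead. The sketch below shows both, using the `ddof` argument of np.std(); the variable names are illustrative.

```python
import numpy as np

a = np.arange(1, 51)
n = len(a)

population_std = np.std(a)           # divides by N (ddof=0 is the default)
sample_std = np.std(a, ddof=1)       # divides by N - 1

# The same sample standard deviation written out by hand
manual_sample_std = np.sqrt(np.sum((a - np.mean(a))**2) / (n - 1))

print("population:", population_std)
print("sample:    ", sample_std)
print(np.isclose(sample_std, manual_sample_std))
```

For large $N$ the two values are very close, but the distinction matters for small data sets.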
Another commonly used data analysis technique is linear regression. This is used to calculate the relationship between two data sets, $X$ and $Y$, assuming that this relationship can be described by a straight line
\begin{equation} y_i = m x_i + c. \end{equation}
For any real data set, the data points are unlikely to all fall exactly on the same line. Linear regression is the process of calculating the line that “best fits” the given data.
import matplotlib.pyplot as plt
x = np.arange(1,11)
offset = 2.0
y = x + ( np.random.rand(10) - 0.5 ) * offset
plt.plot( x, y, 'o' )
plt.show()
You can see that the data approximately follow the straight line $y=x$. We can use linear regression to calculate the “best” straight line that describes these data.
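To make "best fit" concrete, one common definition is the least-squares line, which minimises the sum of squared vertical distances between the data and the line. For a straight line this has a closed-form solution: $m = \sum_i (x_i-\bar{x})(y_i-\bar{y}) / \sum_i (x_i-\bar{x})^2$ and $c = \bar{y} - m\bar{x}$. The following is a minimal sketch of that calculation using only numpy. The seeded random generator is an assumption made here so the example is reproducible; it is not part of the data set above.

```python
import numpy as np

rng = np.random.default_rng(0)       # seeded so the example is reproducible (assumption)
x = np.arange(1, 11)
offset = 2.0
y = x + (rng.random(10) - 0.5) * offset   # y = x plus noise in [-1, 1)

x_mean = np.mean(x)
y_mean = np.mean(y)

# Least-squares slope and intercept from the closed-form expressions
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean)**2)
c = y_mean - m * x_mean

print("slope =", m)
print("intercept =", c)
```

Because the noise is small compared with the range of $x$, the slope should come out close to 1.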
There are a number of different ways in Python to calculate a line of best fit. One of the simplest is to use another module, scipy.stats. As you might suspect from the name, scipy.stats contains a large set of statistical analysis tools. We want linregress():
from scipy.stats import linregress
linregress( x, y )
The output looks complicated, but its values include the slope and the intercept. In fact you can treat the output like a list, and use indexing to select a specific result:
linregress( x, y )[0] # use indexing to get the slope
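The object returned by linregress() also exposes its values by name, which can be easier to read than numeric indices. The sketch below uses an exact straight line, $y = 2x + 1$, chosen here purely for illustration so that the expected slope and intercept are known in advance.

```python
import numpy as np
from scipy.stats import linregress

x = np.arange(1, 11)
y = 2.0 * x + 1.0                    # exact line, so the fit is known in advance

result = linregress(x, y)

print("slope =", result.slope)           # same value as result[0]
print("intercept =", result.intercept)   # same value as result[1]
print("r =", result.rvalue)              # correlation coefficient
```

Accessing `result.slope` rather than `result[0]` makes the code self-documenting and avoids having to remember the order of the outputs.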
Another option is to unpack all five of the output values at once:
slope, intercept, rvalue, pvalue, stderr = linregress( x, y )
print( "slope =", slope )
print( "intercept =", intercept )
To plot the best-fit line against the original data, we generate a new data set according to $y=mx+c$, where $m$ and $c$ are set to the slope and intercept calculated from linregress.
y_fit = slope * x + intercept
plt.plot( x, y, 'o' )
plt.plot( x, y_fit, '-' )
plt.xlabel( 'x' )
plt.ylabel( 'y' )
plt.title( f'y = {slope:.3f} x + {intercept:.3f}' ) # plt.title() takes a single string, so format the values into it
plt.show()
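As one of the other ways of calculating a line of best fit, numpy itself provides np.polyfit(), which fits a polynomial of a chosen degree; a degree of 1 gives a straight line. The sketch below is a minimal example on an exact line, $y = 3x - 2$, chosen here for illustration only. The names `slope_pf` and `intercept_pf` are illustrative.

```python
import numpy as np

x = np.arange(1, 11)
y = 3.0 * x - 2.0                    # exact line used for illustration

# Degree-1 fit: coefficients are returned highest power first
slope_pf, intercept_pf = np.polyfit(x, y, 1)

print("slope =", slope_pf)
print("intercept =", intercept_pf)
```

For a simple straight-line fit this should give the same slope and intercept as linregress(), but without the additional statistics such as the correlation coefficient.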