Data Fitting Exercises

2. Linear Least-Squares

How to define a line of “best fit”

In the previous exercise, you used the linregress function to calculate a line of best fit to your experimental data. But what is this function actually doing, and how do we define the quality of fit for a particular straight line?

In this exercise, you will work through the maths behind linear regression, and then learn how to implement this using functions in Python as a more general optimisation problem.

The maths

Linear regression addresses the problem of how to find a line of “best fit” for some data that is approximately described by a straight line. For example, we might have the data shown in the figure below:

and we want to find the straight line that best describes the relationship between $x$ and $y$:

How do we find the “best” line for this particular data set?

The equation of a straight line is

\begin{equation} y=mx+c. \end{equation}

This is a mathematical function that defines a set of $y$ values from a set of $x$ values, for any pair of parameters, $m$ and $c$.

More generally, we can say that $y$ is a function of some model parameters, $P$, and the input $x$:

\begin{equation} y = f(P,x). \end{equation}

In our example of a straight line, the parameter set $P$ contains the slope and intercept of the line:

\begin{equation} P=\left\{m,c\right\}, \end{equation}

and the function $f$ is

\begin{equation} f(P,x) = P_0 x + P_1. \end{equation}

Now $P_0$ is the slope, and $P_1$ is the intercept.

In principle, we could choose any combination of $P_0$ and $P_1$ for our model parameters. Every combination gives a different straight line, which might be a better, or worse, fit to our experimental data.

To decide which set of model parameters best describes our real data, we need a way to score the agreement between the model $f(P,x)$ and the data. A common way to quantify the error between the model and the real data is to calculate the sum of squared errors. For every $x$ value in our input data set, we can calculate the value predicted by the model, $y_\mathrm{predicted}$. The difference between this predicted value and the observed $y$ value is the error for that data point.

To quantify the overall error of this particular model, we can add up the squares of all the error terms:

\begin{equation} \mathrm{error} = \sum_i\left[y_i - f(P,x_i)\right]^2 \end{equation}

Squaring the error terms means that positive and negative errors contribute equally to the total error.

We can now determine mathematically whether one model is a better “fit” to the data than another. A better quality of fit will give us a smaller error. The best fit to the data (for this particular model function) is given by the set of parameters that minimises the error. The procedure of finding a set of model parameters that minimises the sum of squared errors is often called least-squares fitting.
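
For example (with invented numbers), suppose our data set contains the three points $(1, 3.1)$, $(2, 4.9)$ and $(3, 7.2)$, and we try the model $y = 2x + 1$. The predicted values are $3$, $5$ and $7$, so the sum of squared errors is

\begin{equation} \mathrm{error} = (3.1-3)^2 + (4.9-5)^2 + (7.2-7)^2 = 0.06. \end{equation}

A different choice of slope and intercept would give a different error, and the least-squares fit is the choice of parameters that makes this number as small as possible.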

The code

In Python, the straight-line model can be expressed using the following function:

def model_function( P, x ):
    m = P[0]
    c = P[1]
    y = m * x + c
    return y

If you compare this to the mathematical function definition above, you should see the same structure in both the mathematical description and the Python representation.

The function model_function() also looks similar to the line() function you used in Exercise 1.

def line( m, c, x ):
    y = m * x + c
    return y

The only difference is that in model_function() we pass in the model parameters as a list. This is then unpacked inside the function, instead of passing in $m$ and $c$ separately.
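
As a quick illustration (the numbers here are arbitrary), both calling conventions describe the same straight line, so both calls below should return the same value:

print( model_function( [ 2.0, 1.0 ], 3.0 ) )   # parameters passed as a list
print( line( 2.0, 1.0, 3.0 ) )                 # parameters passed separately
# Both should print 7.0, since both describe the line y = 2x + 1 evaluated at x = 3.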

In the code cell below, define the functions model_function() and line(). Using x = 2, check that both these functions return the same $y$ values for pairs of parameters, $m$ and $c$. For the model_function() function, remember that the parameters need to be passed in as a **list**.
In [ ]:
 

We can also write a Python function that calculates the error between the predictions of our model function and the real $y$ values:

def error_function( P, x, y ):
    y_predicted = model_function( P, x )
    error_terms = y - y_predicted
    total_error = np.sum( error_terms**2 )
    return total_error
  • The first line calls the model_function() function with the list of model parameters, $P$, and the input $x$ values. This returns a set of predicted $y$ values for this particular model.
  • The second line calculates the error terms, i.e. the differences between the predicted and actual $y$ values.
  • The third line calculates the sum of squared errors.
  • The fourth line returns the error.
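
As a sketch of how this function might be used (the data values and array names below are invented for illustration; in the exercise you will use your experimental data instead):

import numpy as np

# Invented example data that lies roughly on the line y = 2x + 1.
x_data = np.array( [ 1.0, 2.0, 3.0, 4.0 ] )
y_data = np.array( [ 2.9, 5.1, 7.0, 9.2 ] )

# Score two guessed parameter sets. The second guess lies much closer to the
# data, so error_function() should return a smaller number for it.
print( error_function( [ 1.0, 1.0 ], x_data, y_data ) )
print( error_function( [ 2.0, 1.0 ], x_data, y_data ) )
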
By copying the appropriate code from Exercise 1, or writing it from scratch, read in the experimental data from 'data/equilibrium_constant.dat' and plot it as points.

Using the model_function() function, generate a series of lines with different parameters, and plot these against the experimental data.

Define the function error_function(). For each of your guessed model parameter sets, calculate the error. You should see that models that appear closer to the experimental data give you lower errors.
In [ ]:
 

You should have seen how changing the parameters of your model lets you generate different lines, each of which does a better or worse job of describing the experimental data. In each case, the quality of fit between your model and the data is given by the error, calculated with your error function.

The final step in the problem is to find the least-squares solution. This corresponds to the set of model parameters that minimises the result from your error function.

Finding the minimum or maximum of a function is a common mathematical problem, and there are a large number of algorithms for doing this numerically, using computers.

The scipy module contains functions for doing exactly this, in scipy.optimize. For this exercise, you will import the minimize() function, and use this to find the parameters that minimise the output of your error function. You can import the minimize() function as follows:

from scipy.optimize import minimize

The minimize() function can be given a large number of options, but the simplest usage requires three arguments:

minimize( function_to_minimise, initial_guess, other_arguments )
  • The first argument, function_to_minimise, is the name of the function to minimise. For this exercise, this is error_function.

  • The second argument, initial_guess, is a starting guess for the parameters that you want to fit. For example, you might guess that P=[1.0,1.0].

  • The third argument is a tuple of any other arguments that need to be passed into the function you are minimising. In this exercise, your function error_function takes three arguments: error_function( P, x, y ). The first is the set of model parameters, which you want to change during the fitting. The second and third are the observed $x$ and $y$ values (here these are $1/T$ and $\ln{K}$) for each data point. These do not change during the fit, and are passed into minimize() as a tuple in the other_arguments position.

For example, if you have stored the $1/T$ ($x$ values) and $\ln{K}$ ($y$ values) for the experimental data in numpy arrays called, respectively, inverse_temperature and ln_K, then you could write

P_initial = [ 1.0, 1.0 ]
other_args = inverse_temperature, ln_K
minimize( error_function, P_initial, other_args )

The other_args = inverse_temperature, ln_K command stores the $x$ and $y$ data values in a data type called a tuple. For the purposes of this exercise, a tuple is like a list, in that it can store multiple values (here you store two numpy arrays).
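
A brief sketch of what this means in practice (assuming inverse_temperature and ln_K hold your data, as above; P_trial is just an illustrative name for one guessed parameter set):

# other_args is a tuple holding the two data arrays.
other_args = inverse_temperature, ln_K

# During the fit, minimize() scores each trial parameter set by calling
# error_function( P_trial, *other_args ), i.e. it unpacks the tuple into
# the x and y arguments of your error function.
P_trial = [ 1.0, 1.0 ]
error_1 = error_function( P_trial, *other_args )
error_2 = error_function( P_trial, inverse_temperature, ln_K )
print( error_1 == error_2 )   # True: both calls compute the same error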

In the following code cell, import the minimize function, and use it to minimise your error function, with the experimental data you loaded previously.
In [1]:
 

Your output should look like:

      fun: 0.026018593683863903
 hess_inv: array([[  1.53892987e+06,  -4.95784961e+03],
       [ -4.95784961e+03,   1.60346929e+01]])
      jac: array([  1.53901055e-07,  -1.45286322e-07])
  message: 'Optimization terminated successfully.'
     nfev: 60
      nit: 4
     njev: 15
   status: 0
  success: True
        x: array([ 6917.390252  ,   -21.05540855])

You will see that a lot of information about the fit is produced. For this exercise, the important parts of the output are the message, which tells you whether the optimisation was successful, and the result labelled x, which gives the input parameters that minimise your error function. The value of the error function with these optimised model parameters is given by the fun result.

You can access specific parts of the output by appending the attribute name to the end of your minimize() call; e.g. to get the optimal parameters you can write

minimize( error_function, P_initial, other_args ).x
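
Alternatively, you can run the fit once, store the full result, and then read off the attributes you need, which avoids repeating the optimisation. A minimal sketch (result is just a suggested variable name):

result = minimize( error_function, P_initial, other_args )
print( result.message )   # reports whether the optimisation was successful
print( result.x )         # optimised model parameters: [ slope, intercept ]
print( result.fun )       # sum of squared errors at the optimum
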
Repeat your optimisation, and store the optimised model parameters in a variable P_opt. Check these values against the fitted slope and intercept you found in Exercise 1.
In [ ]: