library(repr) ; options(repr.plot.res = 100, repr.plot.width = 6, repr.plot.height = 5) # Change plot sizes (in inches); only relevant if you are using a Jupyter notebook, ignore otherwise
In this chapter we will explore fitting a linear model to data when you have multiple explanatory (predictor) variables.
The aims of this chapter are$^{[1]}$:
Learning to build and fit a linear model that includes several explanatory variables
Learning to interpret the summary tables and diagnostics after fitting a linear model with multiple explanatory variables
The models we looked at in the ANOVA chapter explored whether the log genome size (C value, in picograms) of terrestrial mammals varied with trophic level and whether or not the species is ground dwelling. We will now look at a single model that includes both explanatory variables.
The first thing to do is look at the data again.
$\star$ Create a new blank script called `MulExpl.R` in your `Code` directory and add some introductory comments.
$\star$ Load the data saved at the end of the ANOVA chapter:
load('../data/mammals.Rdata')
Look back at the end of the previous chapter to see how you saved the RData file. If `mammals.Rdata` is missing, just import the data again using `read.csv` and add the log C value column (`logCvalue`) to the imported data frame again (go back to the ANOVA chapter and have a look if you have forgotten how).
Use `ls()`, and then `str`, to check that the data has loaded correctly:
str(mammals)
Previously, we asked if carnivores or herbivores had larger genomes. Now we want to ask questions like: do ground-dwelling carnivores have larger genomes than arboreal or flying omnivores? We need to look at plots within groups.
Before we do that, there is a lot of missing data in the data frame and we should make sure that we are using the same data for our plots and models. We will subset the data down to the complete data for the three variables:
mammals <- subset(mammals, select = c(GroundDwelling, TrophicLevel,
logCvalue))
mammals <- na.omit(mammals)
str(mammals)
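As a minimal sketch of what this subsetting step does, the toy example below uses a made-up data frame (`toy`, with a hypothetical extra `Notes` column) rather than the mammals data. It also shows that `complete.cases` flags the same rows that `na.omit` keeps.

```r
# Toy data frame (hypothetical values, not the mammals data)
toy <- data.frame(
  GroundDwelling = factor(c("Yes", "No", NA, "Yes", "No")),
  TrophicLevel   = factor(c("Carnivore", "Herbivore", "Omnivore", NA, "Omnivore")),
  logCvalue      = c(1.2, NA, 1.1, 0.9, 1.3),
  Notes          = c("a", "b", "c", "d", "e")   # an extra column we do not need
)

# Keep only the three columns of interest, then drop incomplete rows
toySub <- subset(toy, select = c(GroundDwelling, TrophicLevel, logCvalue))
toyComplete <- na.omit(toySub)

# complete.cases() flags the same rows that na.omit() keeps
keep <- complete.cases(toySub)
print(sum(keep))          # number of complete rows
print(nrow(toyComplete))  # should be the same number
```

Only rows with values for all three variables survive, so plots and models built from the result use exactly the same observations.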
Previously, we used the `subset` option to fit a model just to dragonflies. You can use `subset` with plots too.
$\star$ Add `par(mfrow=c(1,2))` to your script to split the graphics into two panels.
$\star$ Copy over and modify the code from the ANOVA chapter to create a boxplot of genome size by trophic level into your script.
$\star$ Now further modify the code to generate the plots shown in the figure below (you will have to subset your data for this, and also use the `subset` option of the `plot` command).
{tip}
You can use the `plot` function's option `main = ` to add titles to a plot.
The `ggplot2` package provides a very neat way to plot data in groups with its `facet` commands (here, `facet_grid`).
library(ggplot2)
ggplot(mammals, aes(x = TrophicLevel, y= logCvalue)) +
geom_boxplot() + facet_grid(. ~ GroundDwelling)
The `lattice` package offers a formula-based alternative. The code `logCvalue ~ TrophicLevel | GroundDwelling` means plot the relationship between genome size and trophic level, but group within levels of ground dwelling. The function `bwplot`, which is provided by `lattice`, uses such a formula to create box and whisker plots.
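The formula described above would be passed to `bwplot` roughly as in this sketch. It uses simulated values in a made-up data frame called `toy`, not the mammals data; with the real data you would pass `data = mammals` instead.

```r
library(lattice)   # a recommended package that ships with standard R installations

# Simulated stand-in for the mammals data (values are made up)
set.seed(1)
toy <- data.frame(
  TrophicLevel   = factor(rep(c("Carnivore", "Herbivore", "Omnivore"), each = 20)),
  GroundDwelling = factor(rep(c("No", "Yes"), times = 30)),
  logCvalue      = rnorm(60, mean = 1.1, sd = 0.2)
)

# Box and whisker plots of logCvalue by TrophicLevel,
# with one panel per level of GroundDwelling
p <- bwplot(logCvalue ~ TrophicLevel | GroundDwelling, data = toy)
print(p)   # lattice plots are objects; print() draws them
```

Swapping the two variables either side of `|` changes which variable defines the panels.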
$\star$ Create the lattice plots above from within your script.
Rearrange this code to have three plots, showing the box and whisker plots for `GroundDwelling`, grouped within the levels of `TrophicLevel`.
Try reshaping the R plot window and running the command again. Lattice tries to make good use of the available space when creating lattice plots.
We're going to make the barplot code from the Regression chapter even more complicated! This time we want to know the mean log genome size within combinations of `TrophicLevel` and `GroundDwelling`. We can still use `tapply`, providing more than one grouping factor. We create a set of grouping factors like this:
groups <- list(mammals$GroundDwelling, mammals$TrophicLevel)
groupMeans <- tapply(mammals$logCvalue, groups, FUN = mean)
print(groupMeans)
$\star$ Copy this code into your script and run it.
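To see what `tapply` returns when given two grouping factors, here is a small sketch with six made-up observations (the data frame `toy` is hypothetical). The result is a matrix with one row per level of the first factor and one column per level of the second.

```r
# Six made-up observations (not the mammals data)
toy <- data.frame(
  GroundDwelling = factor(c("No", "No", "Yes", "Yes", "No", "Yes")),
  TrophicLevel   = factor(c("Carnivore", "Herbivore", "Carnivore",
                            "Herbivore", "Carnivore", "Herbivore")),
  logCvalue      = c(1.0, 1.2, 1.4, 1.6, 1.1, 1.8)
)

# Two grouping factors: rows are GroundDwelling, columns are TrophicLevel
toyMeans <- tapply(toy$logCvalue,
                   list(toy$GroundDwelling, toy$TrophicLevel),
                   FUN = mean)

print(dim(toyMeans))   # 2 rows by 2 columns
print(toyMeans)
```

This row and column layout is what `barplot` later uses to group the bars.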
Use this code and the script from the ANOVA chapter to get the set of standard errors for the groups, `groupSE`:
seMean <- function(x){
    # get rid of missing values
    x <- na.omit(x)
    # calculate the standard error
    se <- sqrt(var(x)/length(x))
    # tell the function to report the standard error
    return(se)
}
groups <- list(mammals$GroundDwelling, mammals$TrophicLevel)
groupMeans <- tapply(mammals$logCvalue, groups, FUN=mean)
print(groupMeans)
groupSE <- tapply(mammals$logCvalue, groups, FUN=seMean)
print(groupSE)
Now we can use `barplot`. The default option for a barplot of a table is to create a stacked barplot, which is not what we want. The option `beside=TRUE` makes the bars for each column appear side by side.
Once again, we save the midpoints of the bars to add the error bars. The other options in the code below change the colours of the bars and the length of error bar caps.
# get upper and lower standard error height
upperSE <- groupMeans + groupSE
lowerSE <- groupMeans - groupSE
# create barplot
barMids <- barplot(groupMeans, ylim=c(0, max(upperSE)), beside=TRUE, ylab='log C value (pg)', col=c('white', 'grey70'))
arrows(barMids, upperSE, barMids, lowerSE, angle=90, code=3, length=0.05)
$\star$ Generate the barplot above and then edit your script to change the colours and error bar lengths to your taste.
We'll use the `plotmeans` function again as an exercise to change graph settings and to prepare figures for reports and write ups. You should be able to reproduce the figure below.
$\star$ Use `plotmeans` from the ANOVA chapter and the `subset` option to generate the two plots below. You will need to set the `ylim` option for the two plots to make them use the same $y$ axis.
$\star$ Use `text` to add labels. The command `par('usr')` will show you the limits of the plot ($x_{min}, x_{max}, y_{min}, y_{max}$) and help pick a location for the labels.
$\star$ Change the `par` settings in your code and redraw the plots to try and make better use of the space. In the example below, the box shows the edges of the R graphics window.
Note the following about the figure above (generated using `plotmeans`):
White space: The default options in R use wide margins and spaced out axes and take up a lot of space that could be used for plotting data. You've already seen the `par` function and the options `mfrow` for multiple plots and `mar` to adjust margin size. The option `mgp` adjusts the placement of the axis label, tick labels and tick locations. See `?par` for help on these options.
Main titles: Adding large titles to graphs is also a bad idea: it uses lots of space to explain something that should be in the figure legend. With multiple plots in a figure, you have to label graphs so that the figure legend can refer to them. You can add labels using `text(x,y,'label')`.
Figure legends: A figure caption and legend should give a clear stand-alone description of the whole figure.
Referring to figures: You must link from your text to your figures — a reader has to know which figures refer to which results. So: "There are clear differences in mean genome size between species at different trophic levels and between ground dwelling and other species, Figure xx".
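The white space and labelling advice above can be sketched as follows. The plotted points are made up, and the particular `mar` and `mgp` values are just one reasonable choice rather than recommended settings; the panel labels are positioned using the plot limits returned by `par('usr')`.

```r
# Tighter margins and axis label placement than the R defaults;
# par() returns the old settings so they can be restored afterwards
oldPar <- par(mfrow = c(1, 2), mar = c(3, 3, 1, 1), mgp = c(2, 0.7, 0))

set.seed(1)
for (lab in c("a)", "b)")) {
  # Made-up data standing in for one panel of the real figure;
  # a shared ylim puts both panels on the same y axis
  plot(1:10, rnorm(10, mean = 1.1, sd = 0.1), ylim = c(0.8, 1.4),
       xlab = "x", ylab = "log C value (pg)")
  usr <- par("usr")   # c(xmin, xmax, ymin, ymax) of the current plot
  # Place the panel label just inside the top left corner
  text(usr[1] + 0.05 * (usr[2] - usr[1]),
       usr[4] - 0.05 * (usr[4] - usr[3]),
       lab, adj = c(0, 1))
}

par(oldPar)   # restore the previous settings
```

Positioning labels relative to `par('usr')` means they stay in the same corner even if the axis limits change.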
All those exploratory visualizations suggest:
Carnivores have smaller genome size; omnivores have larger genome size.
Herbivores are somewhere in between, but not consistently.
Ground dwelling mammals typically have larger genome sizes.
We suspected these things from the ANOVA chapter analyses, but now we can see that they might have separate effects. We'll fit a linear model to explore this and add the two explanatory variables together.
$\star$ This is an important section — read it through carefully and ask questions if you are unsure. Copy the code into your script and add comments. Do not just jump to the next action item!
$\star$ First, fit the model:
model <- lm(logCvalue ~ TrophicLevel + GroundDwelling, data = mammals)
We're going to do things right this time and check the model diagnostics before we rush into interpretation.
library(repr) ; options(repr.plot.res = 100, repr.plot.width = 7, repr.plot.height = 8) # Change plot size
par(mfrow=c(2,2))
plot(model)
library(repr) ; options(repr.plot.res = 100, repr.plot.width = 6, repr.plot.height = 5) # Change plot size back
Examine these diagnostic plots. There are six predicted values now - three trophic levels for each of the two levels of ground dwelling. Those plots look ok so now we can look at the analysis of variance table:
anova(model)
Ignore the $p$ values! Yes, they're highly significant but we want to understand the model, not rubber stamp it with 'significant'.
The sums of squares for the variables are both small compared to the residual sums of squares — there is lots of unexplained variation. We can calculate the $r^2$ as explained sums of squares over total sums of squares:
$$\frac{0.81 + 2.75}{0.81 + 2.75 + 13.21} = \frac{3.56}{16.77} = 0.212$$
Trophic level explains much less variation than ground dwelling. This makes intuitive sense from the figure we generated above using `plotmeans`: there are big differences between the two panels (a vs b), but small differences within each panel.
We could also calculate a significance for the whole model by merging the terms. The total explained sums of squares of $0.81 + 2.75 = 3.56$ uses $2+1 =3$ degrees of freedom, so the mean sums of squares for all the terms together is $3.56/3=1.187$. Dividing this by the residual mean square of 0.052 gives an F of $1.187 / 0.052 = 22.83$.
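The same arithmetic can be done directly from the table returned by `anova`. The sketch below uses simulated data with made-up effect sizes (the names `toy` and `toyModel` are hypothetical), so the numbers will not match those quoted above, but the calculation is the same.

```r
# Simulated data (hypothetical effect sizes, not the mammals data)
set.seed(42)
n <- 120
toy <- data.frame(
  TrophicLevel   = factor(sample(c("Carnivore", "Herbivore", "Omnivore"),
                                 n, replace = TRUE)),
  GroundDwelling = factor(sample(c("No", "Yes"), n, replace = TRUE))
)
toy$logCvalue <- 1 + 0.1 * (toy$TrophicLevel == "Omnivore") +
  0.2 * (toy$GroundDwelling == "Yes") + rnorm(n, sd = 0.2)

toyModel <- lm(logCvalue ~ TrophicLevel + GroundDwelling, data = toy)
tab <- anova(toyModel)

# The last row of the table holds the residuals; the others are model terms
nTerms      <- nrow(tab) - 1
explainedSS <- sum(tab[["Sum Sq"]][1:nTerms])
explainedDF <- sum(tab[["Df"]][1:nTerms])
residMS     <- tab[["Mean Sq"]][nrow(tab)]

# r squared: explained sums of squares over total sums of squares
rSquared <- explainedSS / sum(tab[["Sum Sq"]])
# whole model F: explained mean square over residual mean square
fWhole   <- (explainedSS / explainedDF) / residMS

print(c(rSquared = rSquared, F = fWhole))
# Compare with the values reported by summary()
print(summary(toyModel)$r.squared)
print(summary(toyModel)$fstatistic["value"])
```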
Now we can look at the summary table to see the coefficients:
summary(model)
Starting at the bottom of this output, `summary` has again calculated $r^2$ for us and also an $F$ statistic for the whole model, which matches the calculation above.
The other important bits are the four coefficients. The intercept is now the reference level for two variables: it is the mean for carnivores that are not ground dwelling. We then have differences from this value for being an omnivore or herbivore and for being ground dwelling. There is a big change in genome size associated with ground dwelling and omnivory and both of these have large effect sizes, each introducing about a 20% difference in genome size from the non-ground dwelling carnivores. In contrast, herbivory makes a small difference: about 8%.
Because the difference is small and the standard error is large, the $t$ value suggests that this difference might arise just by chance. Put another way, it isn't significant.
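The $t$ value in the coefficient table is simply the estimate divided by its standard error, which is why a small difference with a large standard error gives a small $t$. A minimal sketch, again using simulated data with made-up effect sizes (including a deliberately tiny herbivore effect) rather than the mammals data:

```r
# Simulated data (hypothetical effect sizes, not the mammals data)
set.seed(42)
n <- 120
toy <- data.frame(
  TrophicLevel   = factor(sample(c("Carnivore", "Herbivore", "Omnivore"),
                                 n, replace = TRUE)),
  GroundDwelling = factor(sample(c("No", "Yes"), n, replace = TRUE))
)
toy$logCvalue <- 1 + 0.02 * (toy$TrophicLevel == "Herbivore") +
  0.15 * (toy$TrophicLevel == "Omnivore") +
  0.2 * (toy$GroundDwelling == "Yes") + rnorm(n, sd = 0.2)

toyModel <- lm(logCvalue ~ TrophicLevel + GroundDwelling, data = toy)

# Coefficient table as a matrix: Estimate, Std. Error, t value, Pr(>|t|)
coefTab <- coef(summary(toyModel))

# t value by hand: estimate divided by standard error
tByHand <- coefTab[, "Estimate"] / coefTab[, "Std. Error"]
print(round(cbind(coefTab, tByHand), 3))
```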
The table below shows how these four coefficients combine to give the predicted values for each of the group means.
| | Carnivore | Herbivore | Omnivore |
|---|---|---|---|
| Not ground | 0.98 = 0.98 | 0.98 + 0.08 = 1.06 | 0.98 + 0.17 = 1.15 |
| Ground | 0.98 + 0.21 = 1.19 | 0.98 + 0.08 + 0.21 = 1.27 | 0.98 + 0.17 + 0.21 = 1.36 |
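The table above can be rebuilt from the coefficient vector. The sketch below does this for a model fitted to simulated data (made-up effect sizes, hypothetical names `toy` and `toyModel`), assuming the default treatment contrasts and factor levels named as in the simulation, so that the coefficients are called `TrophicLevelHerbivore`, `TrophicLevelOmnivore` and `GroundDwellingYes`.

```r
# Simulated data (hypothetical effect sizes, not the mammals data)
set.seed(42)
n <- 120
toy <- data.frame(
  TrophicLevel   = factor(sample(c("Carnivore", "Herbivore", "Omnivore"),
                                 n, replace = TRUE)),
  GroundDwelling = factor(sample(c("No", "Yes"), n, replace = TRUE))
)
toy$logCvalue <- 1 + 0.08 * (toy$TrophicLevel == "Herbivore") +
  0.17 * (toy$TrophicLevel == "Omnivore") +
  0.21 * (toy$GroundDwelling == "Yes") + rnorm(n, sd = 0.2)

toyModel <- lm(logCvalue ~ TrophicLevel + GroundDwelling, data = toy)
b <- coef(toyModel)

# Differences from the reference level (zero for the reference itself)
trophic <- c(Carnivore = 0,
             Herbivore = unname(b["TrophicLevelHerbivore"]),
             Omnivore  = unname(b["TrophicLevelOmnivore"]))
ground  <- c("Not ground" = 0,
             Ground       = unname(b["GroundDwellingYes"]))

# Intercept plus every combination of the two sets of differences
predTable <- unname(b["(Intercept)"]) + outer(ground, trophic, "+")
print(round(predTable, 2))
```

Each cell is the intercept plus the relevant trophic level difference plus the relevant ground dwelling difference, exactly as in the hand calculation.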
(16-MulExp:Predicted-values)=
Getting the model predictions by hand in this way is tedious and error prone. There is a handy function called `predict` which uses the model directly to calculate values. The default is to give you the prediction for each point in the original data, but you can also ask for specific predictions.
The first thing to do is to set up a small data frame containing the explanatory values we want to use. The variable names and the level names have to match exactly, so we'll use the `levels` function to get the names. We want to look at all six combinations, so we'll use the `rep` function to set this up. The `each = 2` option repeats each value twice in succession; the `times = 3` option repeats the whole set of values three times.
Let's do it:
# data frame of combinations of variables
gd <- rep(levels(mammals$GroundDwelling), times = 3)
print(gd)
tl <- rep(levels(mammals$TrophicLevel), each = 2)
print(tl)
predVals <- data.frame(GroundDwelling = gd, TrophicLevel = tl)
Now we have the data frame of values we want, we can use `predict`. Just as when we created log values, we can save the output back into a new column in the data frame:
predVals$predict <- predict(model, newdata = predVals)
print(predVals)
Note that these are in the same order as the bars from your barplot.
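As an aside, base R's `expand.grid` builds every combination of its arguments in one call and could be used in place of the two `rep` calls. The sketch below uses hypothetical level names (`"No"`/`"Yes"` and the three trophic levels) rather than reading them from the data with `levels`, and checks that both approaches give rows in the same order.

```r
gdLevels <- c("No", "Yes")   # hypothetical level names
tlLevels <- c("Carnivore", "Herbivore", "Omnivore")

# rep() approach, as in the text
viaRep <- data.frame(GroundDwelling = rep(gdLevels, times = 3),
                     TrophicLevel   = rep(tlLevels, each = 2))

# expand.grid() builds every combination; the first variable varies fastest
viaGrid <- expand.grid(GroundDwelling = gdLevels,
                       TrophicLevel   = tlLevels,
                       stringsAsFactors = FALSE)

print(viaGrid)
# Same rows in the same order as the rep() version
print(all(viaRep$GroundDwelling == viaGrid$GroundDwelling &
          viaRep$TrophicLevel == viaGrid$TrophicLevel))
```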
$\star$ Make a copy of the barplot and arrows code from above and modify it to add the predicted values as points:
barMids <- barplot(groupMeans, ylim=c(0, 1.4), ylab='log C value (pg)', beside=TRUE, col=c('white', 'grey70'))
arrows(barMids, upperSE, barMids, lowerSE, angle=90, code=3, length=0.1)
points(barMids, predVals$predict, col='red', pch=12)
The red markers do not match the calculated means. This is because the model only includes a single difference between ground and non-ground species, which has to be the same for each trophic group. That is, there is no interaction between trophic level and the ground / non-ground identity of each species in the current model.
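A minimal sketch of this constraint, using simulated data in which the ground dwelling effect is deliberately made larger for omnivores (all values and the names `toy`, `toyModel` and `grid` are made up for illustration). The additive model still predicts exactly the same ground versus non-ground difference in every trophic level, while the observed group means are free to differ.

```r
# Simulated data (not the mammals data)
set.seed(7)
n <- 300
toy <- data.frame(
  TrophicLevel   = factor(sample(c("Carnivore", "Herbivore", "Omnivore"),
                                 n, replace = TRUE)),
  GroundDwelling = factor(sample(c("No", "Yes"), n, replace = TRUE))
)
# The simulated ground dwelling effect is deliberately larger for omnivores
toy$logCvalue <- 1 + 0.2 * (toy$GroundDwelling == "Yes") +
  0.3 * (toy$GroundDwelling == "Yes" & toy$TrophicLevel == "Omnivore") +
  rnorm(n, sd = 0.1)

# Additive model: no interaction term
toyModel <- lm(logCvalue ~ TrophicLevel + GroundDwelling, data = toy)

# Predictions for all six combinations (GroundDwelling varies fastest)
grid <- expand.grid(GroundDwelling = levels(toy$GroundDwelling),
                    TrophicLevel   = levels(toy$TrophicLevel))
grid$predict <- predict(toyModel, newdata = grid)

# Ground minus non-ground difference within each trophic level
predDiff <- tapply(grid$predict, grid$TrophicLevel, FUN = diff)
obsMeans <- tapply(toy$logCvalue,
                   list(toy$GroundDwelling, toy$TrophicLevel), FUN = mean)
obsDiff  <- obsMeans["Yes", ] - obsMeans["No", ]

print(round(predDiff, 3))   # the same value for every trophic level
print(round(obsDiff, 3))    # free to vary between trophic levels
```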
$\star$ Add the code for this plot to your script file.
Next, we will look at interactions, which allow these values to differ by adding an interaction term to the model.