library(repr); options(repr.plot.width = 7, repr.plot.height = 6) # Change plot sizes (in inches) - this bit of code is only relevant if you are using a Jupyter notebook - ignore otherwise
Clutter and confusion are failures of design, not attributes of information. – Edward Tufte
This chapter introduces key principles and methods for data processing, storage, exploration, and visualization.
In this modern world, massive amounts of data are being generated in biology due to rapid advances in technologies for collecting new data, as well as for digitizing old data. Some prominent examples are genomic, ecosystem respiration, climatic, and animal tracking data. Ultimately, the goal of quantitative biology is both to discover patterns in these data and to fit mathematical models to them. Reproducible data manipulation, analysis and visualization are particularly necessary when data are large and complex, and few datasets are more so than biological ones, which are extremely heterogeneous in their structure and quality. Furthermore, when these data are "big" (more below on what makes a dataset "big"), computationally-efficient data handling, manipulation and analysis techniques are needed.
You are likely to spend far more time than you think dredging through data files manually – checking them, editing them, and reformatting them to make them useful for data exploration and analysis. It is often the case that you'll have to deal with messy or incomplete data, either because of sampling challenges (e.g., "field" data), or because you were given data that were poorly recorded and maintained. The data we obtain from different sources are often unusable at the beginning; for example, you may need to:

- Identify and replace special placeholder values (e.g., `-999999`) that were used to denote missing observations.

This is not an exhaustive list. Doing so many different things to your raw data is both time-consuming and risky. Why risky? Because to err is very human, and every new, tired mouse-click and/or keyboard-stab has a high probability of inducing an erroneous data point!
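In R, recoding such placeholder values is a one-liner. A minimal sketch (the file name and the `-999999` convention are hypothetical):

```r
mydata <- read.csv("../data/MyMessyData.csv") # hypothetical raw data file
mydata[mydata == -999999] <- NA # recode the missing-observation placeholder as NA
```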
So you would like to keep a record of the data wrangling process (so that it is repeatable and even reversible), and to automate it to the extent possible. To this end, here are some guidelines:
Store data in universally (machine-)readable, non-proprietary formats; basically, use plain ASCII text for your file names, variable/field/column names, and data values. And make sure the data file's "text encoding" is correct and standard (e.g., UTF-8).
Keep a metadata file for each unique dataset (again, in non-proprietary format).
Minimize modifying raw data by hand; use scripts instead, and keep a copy of the data as they were recorded.
Use meaningful names for your data files and for the field (column) names within them.
When you add data, try not to add columns (widening the format); rather, design your tables/data-sheets so that you add only rows (lengthening the format), and convert "wide format data" to "long format data" using scripts, not by hand.
All cells within a data column should contain only one type of information (e.g., either character or numeric, not a mixture).
Ultimately, consider creating a relational database for your data (More on this below).
This is not an exhaustive list either; see the Readings & Resources section.
We will use the Pound Hill dataset collected by students in a past Silwood Field Course for understanding some of these principles.
To start with, we need to import the raw data file; follow these steps:
★ Copy the files `PoundHillData.csv` and `PoundHillMetaData.csv` from TheMulQuaBio's `data` directory into your own R week's `data` directory. Then load the data in R:
setwd("../code/")
baseDataUrl <- "https://sciencedata.dk/public/6e3ed434c0fa43df906ce2b6d1ba9fc6/the_multilingual_quantitative_biologist/content/data/"
MyData <- as.matrix(read.csv(paste0(baseDataUrl,"PoundHillData.csv"),header = FALSE))
class(MyData)
Using `as.matrix()`, and setting `header = FALSE`, guarantees that the data are imported "as is" so that you can wrangle them; otherwise `read.csv` will convert the first row to column headers. Note also that every entry of `MyData` has been coerced to the `character` class, because at least one of the entries is already of character class.

{note}
As of R version 4.0.0, released in April 2020, the default for `stringsAsFactors` [was changed to `FALSE`](https://cran.r-project.org/doc/manuals/r-devel/NEWS.html). If you are using R version 3.x.x, you will need to add `stringsAsFactors = FALSE` to the above command to prevent R from converting all columns of character type (strings) to the `factor` data type (this would create problems with subsequent data wrangling).
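For example, on R 3.x.x the import command above becomes (a sketch reusing the `baseDataUrl` defined earlier):

```r
# R 3.x.x only: stop strings being auto-converted to factors on import
MyData <- as.matrix(read.csv(paste0(baseDataUrl, "PoundHillData.csv"),
                    header = FALSE, stringsAsFactors = FALSE))
```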
★ Now load the Metadata:
MyMetaData <- read.csv(paste0(baseDataUrl,"PoundHillMetaData.csv"),header = TRUE, sep=";")
class(MyMetaData)
We use `header = TRUE` because we do have metadata headers (`FieldName` and `Description`), and a semicolon (`;`) as the delimiter because there are commas in one of the field descriptions.

{tip}
**The text encoding of your data file:** If you have a string (`character` data type) stored as a variable in your R workspace, or a file containing strings, the computer has to know what [character encoding](https://en.wikipedia.org/wiki/Character_encoding) it is in, or it cannot interpret or display it to you correctly. Usually, the encoding will be UTF-8 or ASCII, which is easily handled by most computer languages. Sometimes you may run into (unexpected) bugs when importing and running scripts in `R` because your file has a non-standard text encoding. You will need to specify the encoding in that case, using the `encoding` argument of `read.csv()` and `read.table()`. You can check the encoding of a file by using the `file` command in Linux/Mac. Try in your UNIX terminal:
`file -i ../data/PoundHillData.csv`
or, check encoding of all files in the `Data` directory:
`file -i ../data/*.csv`
Use `file -I` instead of `file -i` in a Mac terminal.
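For example, to declare the encoding while importing (a sketch; that this particular file is UTF-8 is an assumption you should verify with `file -i` first):

```r
# fileEncoding re-encodes the file as it is read; see also the related 'encoding' argument
MyData <- as.matrix(read.csv(paste0(baseDataUrl, "PoundHillData.csv"),
                    header = FALSE, fileEncoding = "UTF-8"))
```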
Now check out what the data look like:
head(MyData)
Note that the column names V1-V60 were generated automatically by R when you imported the data.
In RStudio you can also do `View(MyData)`. At the R prompt, or in any other code editor, you can use `fix(MyData)`. We won't do anything with the metadata file in this session except inspect the information it contains.
Now let's implement some of the key data wrangling principles listed above on this dataset. You should keep a record of what you do to the data in an R script file. This is illustrated in the `DataWrang.R` script file.
★ Copy the script `DataWrang.R` into the `code` directory of your own R training directory and open it.
Go through the script carefully line by line, and make sure you understand what's going on. Read the comments — add to them if you want.
OK, so onward with our data wrangling.
Data wrangling really begins immediately after data collection. You may collect data of different kinds (e.g., diversity, biomass, tree girth). Keep the original spreadsheet well documented using a "metadata" file that describes the data (you would hopefully have written the first version of this even before you started collecting the data!). The minimum information needed to make a metadata file useful is a description of each of the fields — the column or row headers under which the information is stored in your data/spreadsheet.
Have a look at the metadata file for the Pound Hill dataset:
{tip}
**Boolean arguments in R**: In `R`, you can use `F` and `T` for boolean `FALSE` and `TRUE` respectively. To see this, type

```r
a <- T
```

in the R commandline, and then see what R returns when you type `a`. Using `F` and `T` for boolean `FALSE` and `TRUE` respectively is not necessarily good practice, but be aware that this option exists.
MyMetaData
Ideally, you would also like to add more information about the data, such as the measurement units of each type of observation. These data include just one type of observation: the number of individuals of each species per sample (plot), which is a count (integer, or `int`, data class).
When the dataset is large (e.g., 1000's of rows), cleaning and exploring it can get tricky, and you are very likely to make many mistakes. You should record all the steps you used to process it with an R script rather than risking a manual and basically irreproducible processing. Most importantly, avoid or minimize editing your raw data file—make a copy (with a meaningful tag in the file name to indicate the date and author) before making hand edits.
All blank cells in the data are true absences, in the sense that the species was actually not present in that quadrat. So we can replace those blanks with zeros:
MyData[MyData == ""] <- 0
One typically records data in the field or experiments using a "wide" format, where a subject's (e.g., habitat, plot, treatment, species, etc.) repeated responses or observations (e.g., species count, biomass, etc.) are in a single row, with each response in a separate column. The raw Pound Hill data were recorded in this way. However, the wide format is not ideal for data analysis — instead you need the data in a "long" format, where each row is one observation point per subject, so each subject has data in multiple rows. Any measures/variables/observations that don't change across the subjects will have the same value in all the rows. For humans, the wide format is generally more intuitive for recording data (e.g., on field data sheets). However, for data inspection and analysis, the long format is preferable for two main reasons:
If you have many response and/or treatment variables, it is hard to inspect the data values in a wide-format version of the dataset. In the case of the Pound Hill dataset, the response variable is species count, the treatment variable is cultivation month (with sequentially nested replicates: block, plot, quadrat), and the data values are the number (count) of individuals of each species per quadrat. As you can see, there are a large number of columns (60, to be exact), with columns V2-V60 containing different treatment combinations. This makes it hard to visually spot problems with the data values: you would have to look across all these columns to see any issues. And if you wanted to run a single command on all the data values (e.g., `is.integer()` to check whether they are all integers, as you would expect), it would be harder to do so, or to interpret the output, than if all the species counts were in a single column.
Long-format datasets are typically required by statistical analysis and visualization packages and commands in R (or Python, for that matter). For example, if you wanted to fit a linear model using R's `lm()` command, with treatment (cultivation month) as the independent variable, you would need to structure your data in long format. Similarly, if you wanted to plot histograms of species numbers by treatment using ggplot (coming up), you would also need these data in long format.
OK, so let's go from wide to long format already!
You can switch between wide and long formats using `melt()` and `dcast()` from the `reshape2` package, as illustrated in the script `DataWrang.R` available in TheMulQuaBio repository. But first, let's transpose the data, because for a long format, the (nested) treatment variables should be in rows:
MyData <- t(MyData)
head(MyData)
At this point, you should note that the first row of the data matrix actually contains the column headers, but these are not encoded by R as being column headers. As far as R is concerned, the first row is just another data row. You can check this:
colnames(MyData)
We need to fix this. Also, we would like to create a dataframe now, as this is needed by the R commands we will be using, and ultimately, any visualization and analysis we will be doing. To this end, let's first create a temporary dataframe with just the data, without the column names:
TempData <- as.data.frame(MyData[-1,],stringsAsFactors = F)
head(TempData)
Note that `stringsAsFactors = F` is important here because we don't want R to convert columns to the factor class without supervision. This might be a good idea in many cases, but let's just do it manually later. This will also force you to think carefully about what data each of your columns contains.
Now let's assign the original column names to the temporary dataset:
colnames(TempData) <- MyData[1,] # assign column names from original data
head(TempData)
The row names still remain, but they are not really bothering us, so we can ignore them. But if you do want to get rid of them, you can:
rownames(TempData) <- NULL
head(TempData)
Finally, let's convert the data to long format. For this, we need the `reshape2` package:
{note}
Both `library()` and `require()` are commands/functions to load packages. The difference is that `require()` is designed for use inside other functions, so it returns `FALSE` and gives a warning if the package does not exist, whereas `library()` returns an error by default.
require(reshape2)# load the reshape2 package
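This difference makes `require()` handy for the common install-if-missing idiom (a sketch):

```r
if (!require("reshape2")) { # require() returns FALSE, not an error, if the package is absent
    install.packages("reshape2")
    library(reshape2)
}
```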
Check out the help for the `melt` command of `reshape2`: `?melt`.
OK finally, let's wrangle this dataset into submission!
MyWrangledData <- melt(TempData, id=c("Cultivation", "Block", "Plot", "Quadrat"), variable.name = "Species", value.name = "Count")
head(MyWrangledData); tail(MyWrangledData)
This is the long format we have been extolling! It looks much simpler, with all the count data in a single, long column with each count value associated with some "attributes". It is now clear that the fundamental unit of analysis is a species' count in a quadrat, each nested within a plot, each in turn nested within a block, and finally, each of which is nested within a cultivation month (the main treatment).
We can also now assign the correct data types to each column:
MyWrangledData[, "Cultivation"] <- as.factor(MyWrangledData[, "Cultivation"])
MyWrangledData[, "Block"] <- as.factor(MyWrangledData[, "Block"])
MyWrangledData[, "Plot"] <- as.factor(MyWrangledData[, "Plot"])
MyWrangledData[, "Quadrat"] <- as.factor(MyWrangledData[, "Quadrat"])
MyWrangledData[, "Count"] <- as.integer(MyWrangledData[, "Count"])
str(MyWrangledData)
Once you have wrangled the Pound Hill data to its long format, you are ready to go! You may want to start by examining and visualizing the basic properties of the data, such as the number of species (41) in the dataset, the number of quadrats (replicates) per plot and cultivation treatment, etc.
We will learn about visualization next. After that, you can return to this dataset and try out some visual data exploration. For example, a useful visualization would be a histogram of species abundances, grouped by different factors; in particular, you could look at the distributions of species' abundances grouped by the fundamental treatment, `Cultivation`.
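For instance, a couple of quick checks along these lines (a sketch using the column names created above):

```r
length(unique(MyWrangledData$Species)) # how many species are in the dataset?
table(MyWrangledData$Cultivation) # how many observations (rows) per cultivation month?
```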
## Data wrangling with `tidyverse`

So if you think this is the end of the options you have for data wrangling in R, think again. There is a whole data science "ecosystem" you can use in R through the `tidyverse` package. This meta-package includes `dplyr` (the next iteration of `plyr`, which addresses the speed issues of the latter), `tidyr` (essentially a nicer wrapper to the `reshape2` package with additional functions), and `ggplot2` (coming up later).
Assuming you have installed `tidyverse` (see the end of the R Chapter), load it:
require(tidyverse)
This loads a number of packages at one go (and you might get some name-conflict-masking messages; see the note below). You can list these packages:
tidyverse_packages(include_self = TRUE) # the include_self = TRUE means list "tidyverse" as well
That's a lot of packages!
The first time you load tidyverse, you will get some feedback from R, including something like the following:
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
Depending on which packages you load, you may get such warning messages, saying that "so and so" function is being "masked". Basically, any additional package you load may have one or more functions with the same name(s) as existing function(s) (which will typically do something completely different) in base R's packages or in other loaded packages. This creates name conflicts. In such cases, R gives priority to the function from the most recently loaded package, which "masks" the pre-existing one. So, for example, after loading `tidyverse`, if you call just `filter()`, the command that will be used is the one from `dplyr`, not the one from `stats`. You can always make explicit which function you mean by using the double colon notation `::`, like so: `dplyr::filter()` or `stats::filter()`.
{tip}
The `::` notation is like the dot notation in Python — it allows you to access a particular function (like a Python module) from a package. So while you can also call the function alone, without the package prefix (try just `as_tibble(MyWrangledData)`, for example), it is safer to use the `::` notation because then you can rest assured that you are actually using the function you intended.
More `::` usage coming up below!
OK, let's try data exploration in tidyverse.
First, let's convert the dataframe to a "tibble":
MyWrangledData <- tibble::as_tibble(MyWrangledData)
MyWrangledData
class(MyWrangledData)
OK, there's the funky double colon (`::`) notation again...
{note}
A "tibble" in tidyverse is equivalent to R's traditional `data.frame`. Tibbles *are* data frames, but modified to make data exploration more robust. As such, you do not need to convert a data frame to a tibble to use the tidyverse functions, but it's a good idea to do so. Read more about why, [here](https://tibble.tidyverse.org/).
dplyr::glimpse(MyWrangledData) #like str(), but nicer!
Or this:
utils::View(MyWrangledData) # spreadsheet-like data viewer (read-only, unlike fix())
dplyr::filter(MyWrangledData, Count>100) #like subset(), but nicer!
dplyr::slice(MyWrangledData, 10:15) # Look at an arbitrary set of data rows
Learning to use `tidyverse` involves learning some new syntax and a lot of new commands, but if you plan to do a lot of data wrangling and exploration, you will benefit from getting to know them well.
Have a look at some tidyverse-related cheatsheets to explore further.
One additional thing that you will need to get used to in the `tidyverse` is piping multiple commands using the `%>%` notation, which makes your code more compact and, in theory, more readable.
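For example, a small (hypothetical) pipeline on the wrangled Pound Hill data:

```r
MyWrangledData %>%
    dplyr::filter(Count > 100) %>% # keep only large counts
    head() # preview the result
```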
### Practical: Wrangling the Pound Hill dataset with `tidyverse`

We used `reshape2` in `DataWrang.R` for wrangling that dataset. Write a new script called `DataWrangTidy.R` that uses `tidyverse` (and specifically, `dplyr` and `tidyr`) for the same data wrangling steps. The best way to do this is to copy `DataWrang.R` and rename it `DataWrangTidy.R`. Then systematically redo the script from start to end, looking for a function in `dplyr` and `tidyr` that does the same thing in each wrangling step.
For example, to convert from wide to long format, instead of using `melt()` (or `dcast()`) from the `reshape2` package, you can use `gather()` from `tidyr`.
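Here is a toy illustration of `gather()` on a made-up dataframe (a sketch; note that in `tidyr` >= 1.0.0, `gather()` has been superseded by `pivot_longer()`, though it still works):

```r
require(tidyr)
ToyWide <- data.frame(Plot = c("A", "B"), Sp1 = c(3, 0), Sp2 = c(1, 5)) # hypothetical wide data
gather(ToyWide, key = "Species", value = "Count", Sp1, Sp2) # one row per Plot x Species
```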
Don't forget the cheatsheets, in particular the Data Transformation Cheatsheet.
The buzzword 'Big Data' basically refers to datasets that have the following properties:
A dataset that does not fit into available RAM on one system (say, 2 gigabytes).
A dataset that has so many rows (when in its long format; see the sections above) that it significantly slows down your analysis or simulation without vectorization (that is, when looping).
Both these criteria are programming language-, and computer hardware-dependent, of course. For example, a 32-bit OS can only handle ~2 GB of RAM, so this computer will struggle every time you try to handle a dataset in that size range.
R reads data into RAM all at once when you use `read.table` or its wrapper `read.csv()` (maybe you have realized by now that `read.csv()` basically calls `read.table()` with a particular set of options). That is, objects in R live entirely in memory, and big-ish data in RAM will cause R to choke. Python has similar problems, but you can circumvent these to an extent by using `numpy` arrays (see the second Python Chapter).
There are a few options (which you can combine, of course) if you are actually using datasets that are so large:
Import large files smartly; e.g., using `scan()` in R, and then create just the subsets of the data that you need (see the sketch after this list). Also, use the `reshape2` or `tidyr` packages to convert your data to the most "square" (neither too long nor too wide) format possible. Of course, you will need subsets of data in long format for analysis (see the sections above).
Use the `bigmemory` package to load data in the GB range (e.g., use `read.big.matrix()` instead of `read.table()`). The associated package ecosystem also offers other useful tools, such as `foreach()` (from the `foreach` package) instead of `for()` for better memory management.
Use a 64 bit version of R with enough memory and preferably on UNIX/Linux!
Vectorize your analyses/simulations to the extent possible (See Python and R chapters).
Use databases (more on this below).
Use distributed computing (distribute the analysis/simulation across multiple CPU's).
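To make the first option concrete, here is a minimal sketch of chunked importing with `scan()` (the file name and its structure are hypothetical):

```r
# read 10000 data lines at a time from a hypothetical large CSV of numbers
chunk <- scan("../data/BigData.csv", what = numeric(), sep = ",",
              skip = 1, nlines = 10000) # skip = 1 skips the header row
# process `chunk`, then call scan() again with skip = 10001, and so on
```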
The next subsection superficially covers databases. We will cover memory management in the advanced Python, HPC and C weeks.
R can be used to link to and extract data from online databases such as PubMed and GenBank, or to manipulate and access your own. Computational biology datasets are often quite large, and it makes sense to access their data by querying the databases instead of downloading them manually. Likewise, your own data may be complex and large, in which case you may want to organize and manage them in a proper relational database.
Practically all the data wrangling principles in the previous sections are a part and parcel of relational databases.
There are many R packages that provide an interface to databases (SQLite, MySQL, Oracle, etc.). Check out the R packages `DBI` and `RMySQL`.
R can also be used to access, update and manage databases. In particular, `SQLite` allows you to build, manipulate, and access databases easily. Try the script called `SQLinR.R` available in the `code` directory of TheMulQuaBio. Also have a look at this Appendix.
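To give you a flavour, here is a minimal sketch using the `DBI` and `RSQLite` packages (both must be installed; the database file name is hypothetical):

```r
require(DBI) # RSQLite must also be installed
con <- DBI::dbConnect(RSQLite::SQLite(), "../results/PoundHill.sqlite") # create/open a database file
DBI::dbWriteTable(con, "PoundHill", as.data.frame(MyWrangledData), overwrite = TRUE) # add a table
DBI::dbGetQuery(con, "SELECT Species, Count FROM PoundHill WHERE Count > 10 LIMIT 5") # query it
DBI::dbDisconnect(con)
```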
Now that you have learned how to wrangle data, let's learn some key principles of graphics and visualization, and how to implement them in R. You can use R to build a complete, reproducible workflow for data visualization, for both exploratory and publication purposes. We will start with some basic plotting and data exploration. You will then learn to generate publication-quality visualizations using the `ggplot2` package.
Before you do any statistical analyses with data, you must clean, explore, and visualize them. And eventually, you want to produce a finished product that presents visualizations of your data and your results clearly and concisely.
Ultimately, at both the data exploration and finished-product stages, the goal of graphics is to present information so that it provides intuitive insight. As Edward Tufte says: "Graphical excellence is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space."
R can produce beautiful graphics without the time-consuming and fiddly methods that you might have used in Excel or equivalent. You should also make it a habit to quickly plot the data for exploratory analysis. So we are going to learn some basic plotting first.
Here is a menu of basic R plotting commands (use `?commandname` to learn more about each):

| Command | What it does |
|---|---|
| `plot(x,y)` | Scatterplot |
| `plot(y~x)` | Scatterplot with `y` as a response variable |
| `hist(mydata)` | Histogram |
| `barplot(mydata)` | Bar plot |
| `points(y1~x1)` | Add another series of points |
| `boxplot(y~x)` | Boxplot |
{tip}
**R graphics devices**: In all that follows, you may often end up plotting multiple plots on the same graphics window without intending to do so, because R by default keeps plotting in the most recent plotting window that was opened. You can close a particular graphics window or "device" by using `dev.off()`, and all open devices/windows with `graphics.off()`. By default, `dev.off()` will close the most recent figure device that was opened.
Note that there are invisible devices as well! For example, if you are printing to pdf (coming up below), the device or graphics window will not be visible on your computer screen.
Now let's try some simple plotting for data exploration. As a case study, we will use a dataset on Consumer-Resource (e.g., Predator-Prey) body mass ratios taken from the Ecological Archives of the ESA (Barnes et al. 2008, Ecology 89:881).
$\star$ Copy `EcolArchives-E089-51-D1.csv` from TheMulQuaBio's `data` directory to your own data directory. Then, launch R and read in these data to a data frame:
MyDF <- read.csv(paste0(baseDataUrl, "EcolArchives-E089-51-D1.csv"))
dim(MyDF) #check the size of the data frame you loaded
Let's look at what the data contain. Type `MyDF$` and hit the TAB key twice in the R commandline.
{tip}
**Remember Tabbing**: In your smart code editor or IDE such as vscode or RStudio, you will see a drop-down list of all the column headers in a dataframe when you hit TAB once after `MyDF$`
You can also use `str()` and `head()`:
str(MyDF)
head(MyDF)
You might want to try the `dplyr` way to do this data inspection/exploration as well, as you did above.
require(tidyverse)
dplyr::glimpse(MyDF)
As you can see, these data contain predator-prey body size information. This is an interesting dataset because it is huge, and covers a wide range of body sizes of aquatic species involved in consumer-resource interactions — from unicells to whales. Analyzing this dataset should tell us a lot about what sizes of prey predators like to eat.
Let's change the type of certain columns to `factor` because we will want to use them as grouping variables:
MyDF$Type.of.feeding.interaction <- as.factor(MyDF$Type.of.feeding.interaction)
MyDF$Location <- as.factor(MyDF$Location)
str(MyDF)
Let's start by plotting Predator mass vs. Prey mass.
plot(MyDF$Predator.mass,MyDF$Prey.mass)
That doesn't look very meaningful! Let's try taking logarithms. Why? Because body sizes across species tend to be log-normally distributed, with a lot of small species and a few large ones. Taking a log allows you to inspect the body size range on a meaningful (logarithmic) scale, and reveals the true relationship. This also illustrates an important point: just like statistical analyses, the effectiveness of data visualization depends on the distribution of the data.
plot(log(MyDF$Predator.mass),log(MyDF$Prey.mass))
Let's look at the same using a base-10 log transform:
plot(log10(MyDF$Predator.mass),log10(MyDF$Prey.mass))
Using a `log10` transform is often a good idea because then you can see things in terms of "orders of magnitude" ($10^1$, $10^2$, $10^3$, etc.), which makes it easier to determine what the actual values in the original scale are.
We can change almost any aspect of the resulting graph; let's change the symbols by specifying the **p**lot **ch**aracters using `pch`:
plot(log10(MyDF$Predator.mass),log10(MyDF$Prey.mass),pch=20) # Change marker
plot(log10(MyDF$Predator.mass),log10(MyDF$Prey.mass),pch=20, xlab = "Predator Mass (g)", ylab = "Prey Mass (g)") # Add labels
A really great summary of basic R graphical parameters can be found here.
Why did we have to take a logarithm to see the relationship between predator and prey size? Plotting histograms of the two classes (predator, prey) should be insightful, as we can then see the "marginal" distributions of the two variables.
Let's first plot a histogram of predator body masses:
hist(MyDF$Predator.mass)
Clearly, the data are heavily right skewed, with small body sized organisms dominating (that's a general pattern, as mentioned above). Let's now take a logarithm and see if we can get a better idea of what the distribution of predator sizes looks like:
hist(log10(MyDF$Predator.mass), xlab = "log10(Predator Mass (g))", ylab = "Count") # include labels
hist(log10(MyDF$Predator.mass),xlab="log10(Predator Mass (g))",ylab="Count",
col = "lightblue", border = "pink") # Change bar and borders colors
So, taking a log really makes clearer what the distribution of body predator sizes looks like. Try the same with prey body masses.
$\star$ You can do a lot of beautification and fine-tuning of your R plots! As an exercise, try adjusting the histogram bin widths to make them the same for predator and prey, and making the x and y labels larger and in boldface. To get started, look at the help documentation of `hist`.
We can also plot both predator and prey body masses in different sub-plots using `par`, so that we can compare them visually.
par(mfcol=c(2,1)) #initialize multi-paneled plot
par(mfg = c(1,1)) # specify which sub-plot to use first
hist(log10(MyDF$Predator.mass),
xlab = "log10(Predator Mass (g))", ylab = "Count", col = "lightblue", border = "pink",
main = 'Predator') # Add title
par(mfg = c(2,1)) # Second sub-plot
hist(log10(MyDF$Prey.mass), xlab="log10(Prey Mass (g))",ylab="Count", col = "lightgreen", border = "pink", main = 'Prey')
The `par()` function can set multiple graphics parameters (not just multi-panel plots), including figure margins, axis labels, and more. Check out the help for this function.
Another option for making multi-panel plots is the `layout` function.
Better still, we would like to see if the predator mass and prey mass distributions are similar by overlaying them.
hist(log10(MyDF$Predator.mass), # Predator histogram
xlab="log10(Body Mass (g))", ylab="Count",
col = rgb(1, 0, 0, 0.5), # Note 'rgb', fourth value is transparency
main = "Predator-prey size Overlap")
hist(log10(MyDF$Prey.mass), col = rgb(0, 0, 1, 0.5), add = T) # Plot prey
legend('topleft',c('Predators','Prey'), # Add legend
fill=c(rgb(1, 0, 0, 0.5), rgb(0, 0, 1, 0.5))) # Define legend colors
Plot annotation text can be enclosed in either single or double quotes, i.e., 'Plot Title' or "Plot Title". It is generally a good idea to use double quotes, because you may want an apostrophe within your title or axis-label strings.
★ It would be nicer to have both plots with the same bin sizes – try to do it!
Now, let's try plotting boxplots instead of histograms. These are useful for getting a visual summary of the distribution of your data.
boxplot(log10(MyDF$Predator.mass), ylab = "log10(Predator Mass)", main = "Predator mass")
Now let's see how many locations the data are from:
boxplot(log(MyDF$Predator.mass) ~ MyDF$Location, # Why the tilde?
xlab = "Location", ylab = "Predator Mass",
main = "Predator mass by location")
Note the tilde (`~`). This tells R to subdivide or categorize your analysis and plot by the "factor" Location. More on this later.
That's a lot of locations! You will need an appropriately wide plot to see all the boxplots adequately. Now let's try boxplots by feeding interaction type:
boxplot(log(MyDF$Predator.mass) ~ MyDF$Type.of.feeding.interaction,
    xlab = "Feeding interaction type", ylab = "Predator Mass",
    main = "Predator mass by feeding interaction type")
It would be nice to see both the predator and prey (marginal) distributions as well as the scatterplot for an exploratory analysis. We can do this by adding boxplots of the marginal variables to the scatterplot.
par(fig=c(0,0.8,0,0.8)) # specify figure size as proportion
plot(log(MyDF$Predator.mass),log(MyDF$Prey.mass), xlab = "Predator Mass (g)", ylab = "Prey Mass (g)") # Add labels
par(fig=c(0,0.8,0.4,1), new=TRUE)
boxplot(log(MyDF$Predator.mass), horizontal=TRUE, axes=FALSE)
par(fig=c(0.55,1,0,0.8),new=TRUE)
boxplot(log(MyDF$Prey.mass), axes=FALSE)
mtext("Fancy Predator-prey scatterplot", side=3, outer=TRUE, line=-3)
To understand this plotting method, think of the full graph area as going from (0,0) in the lower left corner to (1,1) in the upper right corner. The `fig=` parameter takes a numerical vector of the form `c(x1, x2, y1, y2)`, i.e., `c(left, right, bottom, top)`, as fractions of the figure region. First, `par(fig=c(0,0.8,0,0.8))` sets up the scatterplot going from 0 to 0.8 on the x axis and 0 to 0.8 on the y axis, leaving some area for the boxplots at the top and right. The top boxplot goes from 0 to 0.8 on the x axis and 0.4 to 1 on the y axis. The right-hand boxplot goes from 0.55 to 1 on the x axis and 0 to 0.8 on the y axis. You can experiment with these proportions to change the spacing between plots.
This plot is useful, because it shows you what the marginal distributions of the two variables (Predator mass and Prey mass) are.
Below you will learn to use `ggplot` to produce much more elegant multi-panel plots.
And you can also save the figure in a vector graphics format like a pdf. It is important to learn to do this, because you want to be able to save your plots in good resolution, and want to avoid the manual steps of clicking on the figure, doing "save as", etc. So let's save the figure as a PDF:
pdf("../results/Pred_Prey_Overlay.pdf", # Open blank pdf page using a relative path
11.7, 8.3) # These numbers are page dimensions in inches
hist(log(MyDF$Predator.mass), # Plot predator histogram (note 'rgb')
xlab="Body Mass (g)", ylab="Count", col = rgb(1, 0, 0, 0.5), main = "Predator-Prey Size Overlap")
hist(log(MyDF$Prey.mass), # Plot prey weights
col = rgb(0, 0, 1, 0.5),
add = T) # Add to same plot = TRUE
legend('topleft',c('Predators','Prey'), # Add legend
fill=c(rgb(1, 0, 0, 0.5), rgb(0, 0, 1, 0.5)))
graphics.off(); #you can also use dev.off()
Copy the file back to permanent storage on ScienceData. We're using `system` to call a shell command. Unfortunately, `system` doesn't give any feedback, so you should check on sciencedata.dk that the file got written.
Notice: change `tmp` to a suitable directory on ScienceData.
system('curl --insecure --location-trusted --upload "/home/sciencedata/Documents/CMEECourseWork/week1/results/Pred_Prey_Overlay.pdf" "https://sciencedata/files/tmp/Pred_Prey_Overlay.pdf"')
Always try to save results in a vector format, which can be scaled up to any size. For more on vector vs raster images/graphics, see this.
Note that you are saving to the `results` directory now. This is the recommended project organization and workflow: store and retrieve data from a `data` directory, keep your code and work in a `code` directory, and save outputs to a `results` directory.
You can also try other graphic output formats; for example, `png()` (a raster format) instead of `pdf()`. As always, look at the help documentation of each of these commands!
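For instance, a minimal sketch of the same workflow with `png()` (the pixel dimensions are arbitrary):

```r
png("../results/Pred_Prey_Overlay.png", width = 1200, height = 800) # open a PNG device
hist(log(MyDF$Predator.mass), xlab = "Body Mass (g)", ylab = "Count")
graphics.off() # close the device so the file is written
```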
Write a script that draws and saves three figures, each containing subplots of the distributions, by feeding interaction type, of predator mass, prey mass, and the size ratio of prey mass over predator mass. Use logarithms of masses (or size ratios) for all three plots. In addition, the script should calculate the (log) mean and median predator mass, prey mass and predator-prey size ratio by feeding type, and write them to a csv file.
Call the script `PP_Dists.R` and save it in the `Code` directory. Sourcing or running this script should result in three files called `Pred_Subplots.pdf`, `Prey_Subplots.pdf`, and `SizeRatio_Subplots.pdf` (the names should be self-explanatory), and a single csv file `PP_Results.csv` containing the mean and median log predator mass, prey mass, and predator-prey size ratio, by feeding type. The csv file should have appropriate headers (e.g., Feeding type, Mean, Median). (Hint: you will have to initialize a new dataframe or matrix in the script to first store the calculations.)
The script should be self-sufficient and not need any external inputs. It should import the above predator-prey dataset from the appropriate directory, and save the graphic plots to the appropriate directory (Hint: use relative paths!).
Use the `par()` function to create the subplots. For calculating the body size statistics by feeding type, you can either use the "loopy" way: first obtaining a list of feeding types (look up the `unique` or `levels` functions) and then looping over them, using `subset` to extract the data for each feeding type at each iteration; or the R-savvy way, using `tapply` or `ddply` and avoiding looping altogether.
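As a pointer for the R-savvy route, here is `tapply` computing just one of the required statistics (a sketch):

```r
# mean log predator mass within each feeding interaction type
tapply(log(MyDF$Predator.mass), MyDF$Type.of.feeding.interaction, mean)
```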
R can produce beautiful visualizations, but it typically takes a lot of work to obtain the desired result. This is because the starting point is pretty much a "bare" plot, and adding features commonly required for publication-grade figures (legends, statistics, regressions, sub-plots, etc.) can require many small and painful additional arguments to the plotting commands, or even additional steps (such as the fancy predator-prey scatterplot above).
Moreover, it is very difficult to switch from one representation of the data to another (e.g., from boxplots to scatterplots), or to plot several datasets together. The R package `ggplot2` overcomes these issues, and produces truly high-quality, publication-ready graphics suitable for papers, theses and reports.
{note}
**3D plots:** *Currently, `ggplot2` cannot be used to create 3D graphs or mosaic plots (but [see this](https://www.tylermw.com/3d-ggplots-with-rayshader)). In any case, most of you won't be needing 3D plots. If you do, there are many ways to do 3D plots using other plotting packages in R. In particular, look up the `scatterplot3d` and `plot3D` packages.*
`ggplot2` differs from other approaches in that it attempts to provide a "grammar" for graphics, in which each layer is the equivalent of a verb, subject, etc., and a plot is the equivalent of a sentence. All graphs start with a layer showing the data; other layers and attributes/styles are added to modify the plot. Specifically, according to this grammar, a statistical graphic is a "mapping" from data to geometric objects (points, lines, bars; set using `geom`) with aesthetic attributes (colour, shape, size; set using `aes`).
For more on the ideas underlying ggplot, see the book "ggplot2: Elegant Graphics for Data Analysis", by H. Wickham (in your Reading directory). Also, the ggplot2 website is an excellent resource.
If `ggplot2` is not available on your computer, look up the section on installing packages in the R Chapter.
ggplot can be used in two ways: with `qplot` (for **q**uick **plot**ting) and with `ggplot` for fully customized plotting.
Note that `ggplot2` only accepts data in data frames.
## Quick plotting with `qplot`

`qplot` can be used to quickly produce graphics for exploratory data analysis, and as a base for more complex graphics. It uses syntax that is closer to the standard R plotting commands.
We will use the same predator-prey body size dataset again – you will soon see how much nicer the same types of plots you made above look when done with ggplot!
First, load the package:
require(ggplot2)
Let's start by plotting `Predator.mass` vs `Prey.mass`:
qplot(Prey.mass, Predator.mass, data = MyDF)
As before, let's take logarithms and plot:
qplot(log(Prey.mass), log(Predator.mass), data = MyDF)
Now, color the points according to the type of feeding interaction:
qplot(log(Prey.mass), log(Predator.mass), data = MyDF, colour = Type.of.feeding.interaction)
But the figure's aspect ratio is not very nice. Let's change it using the `asp` option:
qplot(log(Prey.mass), log(Predator.mass), data = MyDF, colour = Type.of.feeding.interaction, asp = 1)
The same as above, but changing the shape:
qplot(log(Prey.mass), log(Predator.mass), data = MyDF, shape = Type.of.feeding.interaction, asp = 1)
These examples demonstrate a key difference between `qplot` (and indeed, ggplot2's approach) and the standard `plot` command: when you want to assign colours, sizes or shapes to the points on your plot using the `plot` command, it's your responsibility to convert (i.e., "map") a categorical variable in your data (e.g., type of feeding interaction in the above case) onto colors (or shapes) that `plot` knows how to use (e.g., by specifying "red", "blue", "green", etc.).
ggplot does this mapping for you automatically, and also provides a legend! This makes it really easy to quickly include additional data (e.g., if a new feeding interaction type was added to the data) on the plot.
Instead of using ggplot's automatic mapping, if you want to manually set a color or a shape, you have to use `I()` (meaning "Identity"). To see this in practice, try the following:
qplot(log(Prey.mass), log(Predator.mass),
data = MyDF, colour = "red")
You chose red, but ggplot used mapping to convert it to a particular shade of red. To set it manually to the real red, do this:
qplot(log(Prey.mass), log(Predator.mass), data = MyDF, colour = I("red"))
Similarly, for point size, compare these two:
qplot(log(Prey.mass), log(Predator.mass), data = MyDF, size = 3) #with ggplot size mapping
qplot(log(Prey.mass), log(Predator.mass), data = MyDF, size = I(3)) #no mapping
But for shape, ggplot doesn't have a continuous mapping, because shapes are discrete. So
qplot(log(Prey.mass), log(Predator.mass), data = MyDF, shape = 3)
will give an error.
Instead, try this:
qplot(log(Prey.mass), log(Predator.mass), data = MyDF, shape= I(3))
Because there are so many points, we can make them semi-transparent using `alpha` so that the overlaps can be seen:
qplot(log(Prey.mass), log(Predator.mass), data = MyDF, colour = Type.of.feeding.interaction, alpha = I(.5))
Here, try using `alpha = .5` instead of `alpha = I(.5)` and see what happens.
Now add a smoother to the points:
qplot(log(Prey.mass), log(Predator.mass), data = MyDF, geom = c("point", "smooth"))
If we want a linear regression, we need to specify the method as being `lm`:
qplot(log(Prey.mass), log(Predator.mass), data = MyDF, geom = c("point", "smooth")) + geom_smooth(method = "lm")
`lm` stands for **l**inear **m**odels. Linear regression is a type of linear model. More on this in the BASIC DATA ANALYSES AND STATISTICS section of these notes.
We can also add a "smoother" for each type of interaction:
qplot(log(Prey.mass), log(Predator.mass), data = MyDF, geom = c("point", "smooth"),
colour = Type.of.feeding.interaction) + geom_smooth(method = "lm")
To extend the lines to the full range, use `fullrange = TRUE`:
qplot(log(Prey.mass), log(Predator.mass), data = MyDF, geom = c("point", "smooth"),
colour = Type.of.feeding.interaction) + geom_smooth(method = "lm",fullrange = TRUE)
Now let's see how the ratio between prey and predator mass changes according to the type of interaction:
qplot(Type.of.feeding.interaction, log(Prey.mass/Predator.mass), data = MyDF)
Because there are so many points, we can "jitter" them to get a better idea of the spread:
qplot(Type.of.feeding.interaction, log(Prey.mass/Predator.mass), data = MyDF, geom = "jitter")
Or we can draw a boxplot of the data (note the `geom` argument, which stands for **geom**etry):
qplot(Type.of.feeding.interaction, log(Prey.mass/Predator.mass), data = MyDF, geom = "boxplot")
Now let's draw a histogram of predator-prey mass ratios:
qplot(log(Prey.mass/Predator.mass), data = MyDF, geom = "histogram")
Color the histogram according to the interaction type:
qplot(log(Prey.mass/Predator.mass), data = MyDF, geom = "histogram",
fill = Type.of.feeding.interaction)
You can also define your own bin widths (in units of the x axis):
qplot(log(Prey.mass/Predator.mass), data = MyDF, geom = "histogram",
fill = Type.of.feeding.interaction, binwidth = 1)
To make it easier to read, we can plot the smoothed density of the data:
qplot(log(Prey.mass/Predator.mass), data = MyDF, geom = "density",
fill = Type.of.feeding.interaction)
And you can make the densities transparent so that the overlaps are visible:
qplot(log(Prey.mass/Predator.mass), data = MyDF, geom = "density",
fill = Type.of.feeding.interaction,
alpha = I(0.5))
Using `colour` instead of `fill` draws only the edge of the curve:
qplot(log(Prey.mass/Predator.mass), data = MyDF, geom = "density",
colour = Type.of.feeding.interaction)
Similarly, `geom = "bar"` produces a barplot, `geom = "line"` a series of points joined by a line, and so on.
An alternative way of displaying data belonging to different classes is using "faceting":
qplot(log(Prey.mass/Predator.mass), facets = Type.of.feeding.interaction ~., data = MyDF, geom = "density")
The `~ .` notation (the space is unimportant) tells ggplot whether to facet by row or by column: `facets = Type.of.feeding.interaction ~ .` gives one facet panel per row. If you want a by-column configuration instead, swap the positions of the variable and the `.` around the `~`:
qplot(log(Prey.mass/Predator.mass), facets = .~ Type.of.feeding.interaction, data = MyDF, geom = "density")
A better way to plot data in the log scale is to also set the axes to be logarithmic:
qplot(Prey.mass, Predator.mass, data = MyDF, log="xy")
Let's add a title and labels:
qplot(Prey.mass, Predator.mass, data = MyDF, log="xy",
main = "Relation between predator and prey mass",
xlab = "log(Prey mass) (g)",
ylab = "log(Predator mass) (g)")
Adding `+ theme_bw()` makes the plot suitable for black and white printing:
qplot(Prey.mass, Predator.mass, data = MyDF, log="xy",
main = "Relation between predator and prey mass",
xlab = "Prey mass (g)",
ylab = "Predator mass (g)") + theme_bw()
Finally, let's save a pdf file of the figure (same approach as we used before):
pdf("../results/MyFirst-ggplot2-Figure.pdf")
print(qplot(Prey.mass, Predator.mass, data = MyDF,log="xy",
main = "Relation between predator and prey mass",
xlab = "log(Prey mass) (g)",
ylab = "log(Predator mass) (g)") + theme_bw())
dev.off()
system('curl --insecure --location-trusted --upload "/home/sciencedata/Documents/CMEECourseWork/week1/results/MyFirst-ggplot2-Figure.pdf" "https://sciencedata/files/tmp/MyFirst-ggplot2-Figure.pdf"')
Using `print` ensures that the whole command is kept together and that you can use the command in a script.
Other important options to keep in mind:

| Option | What it does |
|---|---|
| `xlim` | limits for x axis: `xlim = c(0,12)` |
| `ylim` | limits for y axis |
| `log` | log-transform a variable: `log = "x"`, `log = "y"`, `log = "xy"` |
| `main` | title of the plot: `main = "My Graph"` |
| `xlab` | x-axis label |
| `ylab` | y-axis label |
| `asp` | aspect ratio: `asp = 2`, `asp = 0.5` |
| `margins` | whether or not margins will be displayed |
### The `geom` argument

`geom` specifies the geometric objects that define the graph type. The `geom` option is expressed as an R character vector with one or more entries. `geom` values include `"point"`, `"smooth"`, `"boxplot"`, `"line"`, `"histogram"`, `"density"`, `"bar"`, and `"jitter"`.
Try the following:
# load the data
MyDF <- as.data.frame(read.csv(paste0(baseDataUrl,"EcolArchives-E089-51-D1.csv")))
# barplot
qplot(Predator.lifestage, data = MyDF, geom = "bar")
# boxplot
qplot(Predator.lifestage, log(Prey.mass), data = MyDF, geom = "boxplot")
# violin plot
qplot(Predator.lifestage, log(Prey.mass), data = MyDF, geom = "violin")
# density
qplot(log(Predator.mass), data = MyDF, geom = "density")
# histogram
qplot(log(Predator.mass), data = MyDF, geom = "histogram")
# scatterplot
qplot(log(Predator.mass), log(Prey.mass), data = MyDF, geom = "point")
# smooth
qplot(log(Predator.mass), log(Prey.mass), data = MyDF, geom = "smooth")
qplot(log(Predator.mass), log(Prey.mass), data = MyDF, geom = "smooth", method = "lm")
## Advanced plotting with `ggplot`

The command `qplot` allows you to use only a single dataset and a single set of "aesthetics" (x, y, etc.). To make full use of `ggplot2`, we need to use the command `ggplot`, which allows "layering". Layering is the mechanism by which additional data elements are added to a plot. Each layer can come from a different dataset and have a different aesthetic mapping, allowing us to create plots that could not be generated using `qplot()`.
For a `ggplot` plotting command, we need at least:

- The data to be plotted, in a data frame;
- Aesthetic mappings, specifying which variables we want to plot, and how;
- The `geom`, defining the geometry for representing the data;
- (Optionally) some `stat` that transforms the data or performs statistics using the data.
To start a graph, we must specify the data and the aesthetics:
p <- ggplot(MyDF, aes(x = log(Predator.mass),
y = log(Prey.mass),
colour = Type.of.feeding.interaction))
Here we have created a graphics object `p` to which we can add layers and other plot elements.
Now try to plot the graph:
p
The plot is blank because we have yet to specify a geometry; only then can we see the graph:
p + geom_point()
We can use the "+" sign to concatenate different commands:
p <- ggplot(MyDF, aes(x = log(Predator.mass), y = log(Prey.mass), colour = Type.of.feeding.interaction ))
q <- p +
geom_point(size=I(2), shape=I(10)) +
theme_bw() + # make the background white
theme(aspect.ratio=1) # make the plot square
q
Let's remove the legend:
q + theme(legend.position = "none") + theme(aspect.ratio=1)
Now, same as before, but using ggplot instead of qplot:
p <- ggplot(MyDF, aes(x = log(Prey.mass/Predator.mass), fill = Type.of.feeding.interaction )) + geom_density()
p
p <- ggplot(MyDF, aes(x = log(Prey.mass/Predator.mass), fill = Type.of.feeding.interaction)) + geom_density(alpha=0.5)
p
options(repr.plot.width = 12, repr.plot.height = 8) # Change plot size (in inches)
You can also create a multifaceted plot:
p <- ggplot(MyDF, aes(x = log(Prey.mass/Predator.mass), fill = Type.of.feeding.interaction )) + geom_density() + facet_wrap( .~ Type.of.feeding.interaction)
p
The axis limits in these plots are fixed irrespective of the data ranges, making it harder to visualize the shapes of the distributions. You can free up the axes to allow data-specific axis limits using the `scales = "free"` argument:
p <- ggplot(MyDF, aes(x = log(Prey.mass/Predator.mass), fill = Type.of.feeding.interaction )) + geom_density() + facet_wrap( .~ Type.of.feeding.interaction, scales = "free")
p
{tip}
In `facet_wrap`, you can also free up just the x or y scales; look up the documentation for this function.
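For example, freeing only the x scales would look like this (a sketch):

```r
p <- ggplot(MyDF, aes(x = log(Prey.mass/Predator.mass), fill = Type.of.feeding.interaction)) +
    geom_density() +
    facet_wrap(. ~ Type.of.feeding.interaction, scales = "free_x") # y scales stay fixed
p
```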
options(repr.plot.width = 10, repr.plot.height = 12) # Change plot size (in inches)
Now let's plot the size-ratio distributions by location (and again, with scales free):
p <- ggplot(MyDF, aes(x = log(Prey.mass/Predator.mass))) + geom_density() + facet_wrap( .~ Location, scales = "free")
p
A different example:
p <- ggplot(MyDF, aes(x = log(Prey.mass), y = log(Predator.mass))) + geom_point() + facet_wrap( .~ Location, scales = "free")
p
You can also combine categories like this (this will be a BIG plot!):
options(repr.plot.width = 12, repr.plot.height = 14) # Change plot size (in inches)
p <- ggplot(MyDF, aes(x = log(Prey.mass), y = log(Predator.mass))) + geom_point() + facet_wrap( .~ Location + Type.of.feeding.interaction, scales = "free")
p
And you can also change the order of the combination:
p <- ggplot(MyDF, aes(x = log(Prey.mass), y = log(Predator.mass))) + geom_point() + facet_wrap( .~ Type.of.feeding.interaction + Location, scales = "free")
p
For more fine-tuned faceting, look up the `facet_grid()` and `facet_wrap()` functions within `ggplot2`. Look up this variant of the R Cookbook for more examples.
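For instance, `facet_grid()` lays panels out on an explicit rows ~ columns grid. A sketch combining two of this dataset's factors:

```r
p <- ggplot(MyDF, aes(x = log(Prey.mass), y = log(Predator.mass))) +
    geom_point() +
    facet_grid(Type.of.feeding.interaction ~ Predator.lifestage) # rows ~ columns
p
```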
options(repr.plot.width = 7, repr.plot.height = 6) # Change plot size (in inches)
Let's plot the values of a matrix. This is basically the same as rendering a 2D image. We will visualize random values drawn from a uniform distribution $\mathcal U [0,1]$. Because we want to plot a matrix, and ggplot2 accepts only dataframes, we use the package `reshape2`, which can "melt" a matrix into a dataframe:
require(reshape2)
GenerateMatrix <- function(N){
M <- matrix(runif(N * N), N, N)
return(M)
}
M <- GenerateMatrix(10)
Melt <- melt(M)
p <- ggplot(Melt, aes(Var1, Var2, fill = value)) + geom_tile() + theme(aspect.ratio = 1)
p
Add a black line dividing cells:
p + geom_tile(colour = "black") + theme(aspect.ratio = 1)
Remove the legend:
p + theme(legend.position = "none") + theme(aspect.ratio = 1)
Remove all the rest:
p + theme(legend.position = "none",
panel.background = element_blank(),
axis.ticks = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.text.x = element_blank(),
axis.title.x = element_blank(),
axis.text.y = element_blank(),
axis.title.y = element_blank())
Explore some colors:
p + scale_fill_continuous(low = "yellow", high = "darkgreen")
p + scale_fill_gradient2()
p + scale_fill_gradientn(colours = grey.colors(10))
p + scale_fill_gradientn(colours = rainbow(10))
p + scale_fill_gradientn(colours = c("red", "white", "blue"))
{tip}
Look up the `plotly` package, which allows you to create interactive graphs from ggplot2 plots.
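For example, a minimal sketch (assuming `plotly` is installed), reusing the tile-plot object `p` from above:

```r
require(plotly)
ggplotly(p) # renders an interactive (zoom/hover) version of the ggplot object
```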
According to Girko's circular law, the eigenvalues of a matrix $M$ of size $N \times N$ (with independent, zero-mean, unit-variance entries) are approximately contained in a circle in the complex plane with radius $\sqrt{N}$. Let's draw the results of a simulation displaying this result.
First, we need to build a function object that will calculate the ellipse (the predicted bounds of the eigenvalues):
build_ellipse <- function(hradius, vradius){ # function that returns an ellipse
npoints = 250
a <- seq(0, 2 * pi, length = npoints + 1)
x <- hradius * cos(a)
y <- vradius * sin(a)
return(data.frame(x = x, y = y))
}
N <- 250 # Assign size of the matrix
M <- matrix(rnorm(N * N), N, N) # Build the matrix
eigvals <- eigen(M)$values # Find the eigenvalues
eigDF <- data.frame("Real" = Re(eigvals), "Imaginary" = Im(eigvals)) # Build a dataframe
my_radius <- sqrt(N) # The radius of the circle is sqrt(N)
ellDF <- build_ellipse(my_radius, my_radius) # Dataframe to plot the ellipse
names(ellDF) <- c("Real", "Imaginary") # rename the columns
Now the plotting:
# plot the eigenvalues
p <- ggplot(eigDF, aes(x = Real, y = Imaginary))
p <- p +
geom_point(shape = I(3)) +
theme(legend.position = "none")
# now add the vertical and horizontal line
p <- p + geom_hline(aes(yintercept = 0))
p <- p + geom_vline(aes(xintercept = 0))
# finally, add the ellipse
p <- p + geom_polygon(data = ellDF, aes(x = Real, y = Imaginary, alpha = 1/20, fill = "red"))
p
$\star$ Create a self-sufficient R script called `Girko.R` that combines all the above commands for plotting Girko's law simulation, including saving the resulting figure as a file called `Girko.pdf` in the results directory.
In this example, we will use the ggplot `text` geom (`geom_text`) to annotate a plot.
a <- read.table(paste0(baseDataUrl,"Results.txt"), header = TRUE)
Check what the data looks like:
head(a)
a$ymin <- rep(0, dim(a)[1]) # append a column of zeros
# Print the first linerange
p <- ggplot(a)
p <- p + geom_linerange(data = a, aes(
x = x,
ymin = ymin,
ymax = y1,
size = (0.5)
),
colour = "#E69F00",
alpha = 1/2, show.legend = FALSE)
# Print the second linerange
p <- p + geom_linerange(data = a, aes(
x = x,
ymin = ymin,
ymax = y2,
size = (0.5)
),
colour = "#56B4E9",
alpha = 1/2, show.legend = FALSE)
# Print the third linerange:
p <- p + geom_linerange(data = a, aes(
x = x,
ymin = ymin,
ymax = y3,
size = (0.5)
),
colour = "#D55E00",
alpha = 1/2, show.legend = FALSE)
# Annotate the plot with labels:
p <- p + geom_text(data = a, aes(x = x, y = -500, label = Label))
# now set the axis labels, remove the legend, and prepare for bw printing
p <- p + scale_x_continuous("My x axis",
breaks = seq(3, 5, by = 0.05)) +
scale_y_continuous("My y axis") +
theme_bw() +
theme(legend.position = "none")
p
$\star$ Create a self-sufficient R script called `MyBars.R` that combines all the above commands for annotating a plot, including saving the resulting figure as a file called `MyBars.pdf`.
Let's try mathematical annotation of an axis as well as within the plot area.
First create some linear regression "data":
x <- seq(0, 100, by = 0.1)
y <- -4. + 0.25 * x +
rnorm(length(x), mean = 0., sd = 2.5)
# and put them in a dataframe
my_data <- data.frame(x = x, y = y)
# perform a linear regression
my_lm <- summary(lm(y ~ x, data = my_data))
# plot the data
p <- ggplot(my_data, aes(x = x, y = y,
colour = abs(my_lm$residual))
) +
geom_point() +
scale_colour_gradient(low = "black", high = "red") +
theme(legend.position = "none") +
scale_x_continuous(
expression(alpha^2 * pi / beta * sqrt(Theta)))
# add the regression line
p <- p + geom_abline(
intercept = my_lm$coefficients[1][1],
slope = my_lm$coefficients[2][1],
colour = "red")
# throw some math on the plot
p <- p + geom_text(aes(x = 60, y = 0,
label = "sqrt(alpha) * 2* pi"),
parse = TRUE, size = 6,
colour = "blue")
p
$\star$ Create a self-sufficient R script called `plotLin.R` that combines all the above commands for annotating a linear regression plot, including saving the resulting figure as a file called `MyLinReg.pdf`.
The package `ggthemes` provides some additional `geom`s, `scale`s, and `theme`s for `ggplot`. These include a theme based on Tufte's The Visual Display of Quantitative Information (see suggested readings).
Let's try this package. First install it:
install.packages("ggthemes")
Then try:
library(ggthemes)
p <- ggplot(MyDF, aes(x = log(Predator.mass), y = log(Prey.mass),
colour = Type.of.feeding.interaction )) +
geom_point(size=I(2), shape=I(10)) + theme_bw()
p + geom_rangeframe() + # now fine tune the geom to Tufte's range frame
theme_tufte() # and theme to Tufte's minimal ink theme
Go to https://github.com/jrnold/ggthemes for more information and a list of `geom`s, `theme`s, and `scale`s.
In this practical, you will write a script that draws and saves a pdf file of the following figure, and writes the accompanying regression results to a formatted table in csv. Note that, as the plot suggests, the analysis must be subsetted by the `Predator.lifestage` field of the dataset.
Guidelines:
Write an `R` script file called `PP_Regress.R` and save it in the `Code` directory. Sourcing or running this script should result in one pdf file, containing the figure, being saved in the `results` directory. (Hint: use the `print()` command to write to the pdf.)
In addition, the script should calculate the regression results corresponding to the lines fitted in the figure and save them to a csv-delimited table called `PP_Regress_Results.csv`, in the `results` directory. (Hint: you will have to initialize a new dataframe in the script to first store the calculations, and then `write.csv()` or `write.table()` it.)
All that you are being asked for here are the results of linear regression analyses on subsets of the data corresponding to each available Feeding Type $\times$ Predator life stage combination, not a multivariate linear model with these two as separate covariates!
The regression results should include the following with appropriate headers (e.g., slope, intercept, etc, in each Feeding type $\times$ life stage category): regression slope, regression intercept, R$^2$, F-statistic value, and p-value of the overall regression (Hint: Review the Stats week!).
The script should be self-sufficient and not need any external inputs: it should import the above predator-prey dataset from the appropriate directory, and save the graphic plots to the appropriate directory (Hint: use relative paths). Anybody should be able to `source` it without errors.
You can also use `dplyr` functions instead of looping (see the R Chapter), and the `ggplot` command instead of `qplot`.
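As a starting point for the `dplyr` route, a sketch that computes only the group sizes (not the regressions you actually need):

```r
require(dplyr)
MyDF %>%
    group_by(Type.of.feeding.interaction, Predator.lifestage) %>%
    summarise(n = n()) # one row per Feeding type x life stage combination
```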
You can do mapping in R using the `maps` package. In this practical, you will map the Global Population Dynamics Database (GPDD). This is a freely available database that was developed at Silwood. The Living Planet Index is based upon these data.

- Use `load()` to load `GPDDFiltered.RData`, which is available in the TheMulQuaBio git repository, and have a look at the data contents.
- Use `install.packages()` to install the package `maps`, as you did with `ggplot2`.
- Write a script called `GPDD_Data.R` that uses `maps` to plot the GPDD locations on a world map.
Do the same as (1) above, but this time the analysis should be separated by the dataset's `Location` field. Call the script `PP_Regress_loc.R`. No need to generate plots for this (just write the analysis results to a `.csv` file), as a combination of `Type.of.feeding.interaction`, `Predator.lifestage`, and `Location` would be far too busy (faceting by three variables is too much to be meaningful)!
Check out DataDataData!, Visualization and R, under readings on the TheMulQuaBio repository.
Brian McGill's Ten commandments for data management
This paper covers similar ground (available in the readings directory): Borer et al (2009). Some Simple Guidelines for Effective Data Management. Bulletin of the Ecological Society of America, 90(2), 205-214.
The classic Tufte; Available in the Imperial College Central Library. I have also added extracts and a related book in pdf on the git repository. BTW, check out what Tufte thinks of PowerPoint.