Note: This material was modified from http://www.molecularecologist.com/2016/02/quick-and-dirty-tree-building-in-r and Dr. Schwartz’s example script.
Creating mammals phylogenies from example DNA sequences.
Creating new directory:
-Select “File - New Project” -Select “New Directory” and “Empty Project” -Give a name to the project (for eg. “Phylogenies”) -Create a new R script
Required Packages: “ape”, “pharngorn”, “seqinr”, “ggtree”, “Biostrings”.
install.packages("ape")
install.packages("phangorn")
install.packages("seqinr")
install.packages("ggtree")
source("https://bioconductor.org/biocLite.R")
biocLite("Biostrings")
library("ape")
library("phangorn")
library("seqinr")
library("ggplot2")
library("ggtree")
Step 1 - reading DNA files (for more information: https://cran.r-project.org/web/packages/ape/ape.pdf)
Options: Downloading example data from:
you can read DNA seq. from GenBank via Internet
you can read DNA seq. from other databases (EBI etc.).
To do that see “ape” manual read.DNA section.
In this example we will get DNA seq. from http://evolution.gs.washington.edu/gs570/2016/#data
Download ADAM7.dna file to your project folder.
mammals <- read.dna("ADAM7.dna", format="interleaved")
mammals_phyDat <- phyDat(mammals, type = "DNA", levels = NULL)
Next step: Determining your nucleotide evolution model that best fits your data. Best model should have lowest AIC score.
mt <- modelTest(mammals_phyDat)
print(mt)
min(mt[4])
In this case, your best model should be GTR+G+I but dist.ml() function doesn’t support this model. You can choose JC69 (Jukes and Cantor, 1969). This is the simplest substitution model and it assumes equal base and equal mutation rates.
Distance matrix:
dna_dist <- dist.ml(mammals_phyDat, model="JC69")
Let’s get a quick phylogenetic tree based on distance matrices.
UPGMA = Unweighted Pair Group Method with Arithmetic Mean NJ = Neighbor Joining Algorithm
mammals_UPGMA <- upgma(dna_dist)
mammals_NJ <- NJ(dna_dist)
plot(mammals_UPGMA, main="UPGMA")
plot(mammals_NJ, main = "Neighbor Joining")
Which tree is the best? We can use parsimony method.
parsimony(tree = mammals_UPGMA, data = mammals_phyDat)
parsimony(mammals_NJ,mammals_phyDat)
mammals_optim <- optim.parsimony(mammals_NJ, mammals_phyDat)
plot(mammals_optim)
“Though more computationally intensive than distance-matrix methods, building trees with maximum likelihood methods allows you to include the full data from your sequence alignment in a statistical framework that estimates model parameters. You can begin by computing the likelihood of a given tree with the function pml(). Then, the function optim.pml() will optimize tree topology and branch length for your selected model of nucleotide evolution.”
fit <- pml(mammals_NJ, data=mammals_phyDat)
print(fit)
fitJC <- optim.pml(fit, model = "JC", rearrangement = "stochastic")
logLik(fitJC)
You can also bootstrap your results for a better picture of how much a small portion of your data affects your overall results.
bs <- bootstrap.pml(fitJC, bs=100, optNni=TRUE, multicore=TRUE, control = pml.control(trace=0))
plotBS(midpoint(fitJC$tree), bs, p = 50, type="p")
Writing tree:
bstree <- plotBS(midpoint(fitJC$tree), bs, p = 50, type="p")
plotBS(bstree, type = "p")
write.tree(bstree, file="mammals_example.tre")
You can also change plotting options:
plot(bstree, type = "phylogram", use.edge.length = FALSE, show.node.label = TRUE,
label.offset = 0, node.depth = 2)