
Source: sco.h-its.org/exelixis/pubs/dissFernando.pdf





Zusammenfassung

Phylogenetic trees represent the evolutionary relationships between different organisms (species) that presumably share common ancestors. The outer nodes of a phylogenetic tree (taxa) denote living species for which molecular data are available or can be sequenced. Inner nodes denote hypothetical extinct species for which, as a rule, no data source is available.

In recent years, the field has changed dramatically with the introduction of so-called NGS (Next-Generation Sequencing) methods. These methods, which established themselves rapidly, allow for substantially increased throughput. As a consequence, current molecular databases grow faster than predicted by Moore's law. This creates the challenge of efficiently analyzing such large amounts of molecular data.

Maximum likelihood tree reconstruction attempts to find the tree with the highest likelihood score for a given sequence alignment (the input). From a computer science perspective, the challenge consists in efficiently computing the Phylogenetic Likelihood Function (PLF) for scoring alternative topologies.

In this thesis, we study and develop methods that can facilitate this task in the context of large-scale datasets. We present three distinct approaches: reducing the memory requirements of the PLF, reducing running times, and automating phylogenetic analyses.

Reducing the memory requirements of the PLF: We developed three methods that reduce the memory required for intermediate results. The (i) out-of-core and (ii) recomputation approaches have in common that only a fraction of the intermediate results must be held in main memory. The remaining intermediate results are either stored on disk (out-of-core) or computed anew when needed (recomputation trade-off).

Our evaluations indicate that the recomputation approach (trade-off) is substantially more efficient, as it halves the memory requirements for typical datasets at the cost of an approximately 10% increase in running time.

The third method is called SEV (Subtree Equality Vectors) and reduces memory requirements almost proportionally to the fraction of missing data in the dataset, without increasing running time. In general, these methods can be deployed simultaneously.

Reducing running times: We ported the PLF implementation to GPUs using OpenCL. In doing so, we adapted the memory layout to achieve optimal GPU performance. Our evaluations show that, for sufficiently long datasets, offloading computations to the GPU can be up to twice as fast as the most efficient vectorized CPU code.

From an algorithmic perspective, we developed the so-called Backbone Algorithm, which can constrain the search space of all possible trees. Applying this method can reduce running time by up to 50% while yielding trees that are statistically comparable to those found without search space restrictions.

Automating phylogenetic analyses: The last part of this thesis addresses the problem that, after a short time, existing phylogenetic trees no longer reflect the latest data in rapidly growing databases. We developed a framework called PUmPER (Phylogenies Updated PERpetually). With PUmPER, pipelines can be built that iteratively assemble new sequence alignments and extend trees from previous iterations. The framework can either be configured as a stand-alone pipeline or offload compute-intensive tasks to a cluster.

Although the methods described in this thesis were implemented as proofs of concept within the RAxML codebase, they are nevertheless relatively easy to implement in other state-of-the-art likelihood-based codes. In general, we expect these methods to help current phylogenetic applications scale better and thereby also enable phylogenetic analyses based on complete genomes.


Acknowledgements

This research project would not have been possible without the support of many people. I wish to express my gratitude to my supervisor, Prof. Dr. Alexandros Stamatakis, who was abundantly helpful and offered invaluable assistance, support, and dedicated guidance. Deepest gratitude also to Prof. Dr. Arndt von Haeseler, without whose knowledge and assistance this study would not have been successful. In addition, I would like to express my sincere gratitude to Casey Dunn, Stephen A. Smith, and John Cazes for kindly hosting me at their lab during my research stay at Brown University. Special thanks also to all my collaborators and fellow group members: Nikos Alachiotis, Simon Berger, Andre Aberer, Pavlos Pavlidis, Solon P. Pissis, Tomas Flouri, Dilrini de Silva, Paschalia Kapli, Kassian Kobert, Jiajie Zhang, and Alexey Kozlov, for being an always-collaborative and helpful group.

I would like to thank my parents, brothers, and family for their unconditional support, as well as my friends from Valladolid and Deggendorf, who have always stood by me during my time in Heidelberg. Last but not least, my dearest thanks to Beifei for being an endless source of encouragement and inspiration.

This work was funded via the German Science Foundation (DFG) grants STA 860/2 and STA 860/3. Part of this work used resources provided by the iPlant Collaborative (funded by NSF grant #DBI-0735191).


Contents

Zusammenfassung
Acknowledgements
1 Introduction
  1.1 Motivation
  1.2 Scientific Contribution
  1.3 Structure of this thesis
2 Computational Molecular Phylogenetics
  2.1 Statistical Models of Evolution
  2.2 Sequence Alignment
    2.2.1 Pairwise Sequence Alignment
    2.2.2 Multiple Sequence Alignment
  2.3 Phylogenetic Trees
  2.4 Tree Reconstruction Methods
  2.5 Computing the Likelihood of a Tree
    2.5.1 Accounting for rate heterogeneity
  2.6 Maximum Likelihood Tree Search
  2.7 Phylogenetic Likelihood Library
  2.8 RAxML-family implementation concepts
    2.8.1 Node records
    2.8.2 Internal tree nodes
    2.8.3 Data structure for Trees
    2.8.4 Computing the likelihood on a Tree
    2.8.5 Optimizing branch lengths
3 Memory-Saving Techniques
  3.1 Memory requirements for the PLF
  3.2 Out-of-Core
    3.2.1 Related Work
    3.2.2 Computing the PLF Out-of-Core
    3.2.3 Experimental Setup & Results
  3.3 Recomputation of Ancestral Vectors
    3.3.1 Recomputation of Ancestral Probability Vectors
    3.3.2 Experimental Setup and Results
    3.3.3 Evaluation of traversal overhead
  3.4 Subtree Equality Vectors
    3.4.1 Gappy Subtree Equality Vectors
    3.4.2 Generation of Biological Test Datasets
    3.4.3 SEV Performance
  3.5 Summary
4 The Backbone Algorithm
  4.1 Constraining tree search to a backbone tree
  4.2 Algorithm
    4.2.1 Starting tree
    4.2.2 Tip Clustering
    4.2.3 Backbone construction
    4.2.4 Backbone-constrained Tree Search
  4.3 Evaluation and Results
    4.3.1 Performance
    4.3.2 Simulated Datasets (Accuracy)
  4.4 Summary
5 Introduction to GPU Programming
  5.1 Overview
  5.2 CUDA
    5.2.1 CUDA Hardware and Architecture
    5.2.2 CUDA Programming Model
    5.2.3 Performance Considerations
  5.3 OpenCL
    5.3.1 OpenCL performance portability
6 GPU implementation of Phylogenetic Kernels
  6.1 Related work
  6.2 Generic Vectorization
  6.3 GPU Implementation
    6.3.1 Kernel Implementation
    6.3.2 GPU Memory Organization
    6.3.3 OpenCL Implementation
  6.4 Experimental setup and results
  6.5 Summary
7 Perpetual Phylogenies with PUmPER
  7.1 Related Work
  7.2 Framework Overview
    7.2.1 MSA Construction/Extension with PHLAWD
    7.2.2 Phylogenetic Inference
    7.2.3 Manual and automatic tree updates
  7.3 Software and Availability
    7.3.1 Standalone implementation
    7.3.2 Distributed implementation
    7.3.3 Custom iPlant setup
  7.4 Evaluation and Results
    7.4.1 Biological examples
    7.4.2 Simulated Data
  7.5 Discussion
  7.6 Summary
8 Conclusion And Outlook
  8.1 Conclusion
  8.2 Future Work
    8.2.1 GPUs
    8.2.2 PUmPER
  8.3 Outlook
List of Figures
List of Tables
List of Acronyms
Bibliography


Chapter 1

Introduction

1.1 Motivation

The landscape of genomics and phylogenetics has changed dramatically since the advent of Next-Generation Sequencing (NGS) technologies [75]. As a consequence of the throughput increases, the amount of data accumulated in molecular databases keeps growing at an accelerated pace. A current overview of NGS technology platforms is available in [51]. This data avalanche facilitates the exploration of new research areas, such as phylogenomics, where phylogenetic techniques are applied to infer evolutionary relationships based on multi-species genomes [63]. Furthermore, the reduced sequencing cost has enabled the execution of collaborative large-scale genome sequencing projects, such as the 10K vertebrate genome project (http://www.genome10k.org/) and the 1000 insect transcriptome evolution project (http://www.1kite.org/). The size of the datasets used in these projects poses new challenges for large-scale maximum likelihood-based phylogenetic analyses.

On the other hand, computing power has been growing exponentially in the last decades, as predicted by Moore's law. However, at present, due to the high throughput of new sequencing technologies, molecular sequence data accumulates at a pace faster than Moore's law. This induces a growing gap between the data that is available and the computing resources that can be used for analyzing these data.

Phylogenetic trees are tree topologies that represent the evolutionary history of a set of organisms. The field of computational phylogenetics entails the inference of phylogenetic trees. The input data for phylogenetic inference are generally a pre-processed (aligned) set of molecular sequences. The main computational challenge consists in efficiently computing the PLF (Phylogenetic Likelihood Function) for scoring alternative tree topologies.

Given the above considerations, in this thesis we identified and addressed the following computational challenges in the field of computational phylogenetics. Firstly, we aim to reduce the high memory requirements of the PLF to allow for analyzing larger datasets with the available hardware. Secondly, we intend to decrease the running time of tree inferences, either by algorithmic means or by architecture-specific optimizations, for instance, using graphics cards (GPUs). Finally, we also study and provide solutions related to the automation of phylogenetic analyses (sequence alignment generation and tree inference). In this case, the focus is not on computational optimizations to reduce memory footprints or achieve speedups, but rather on the man-hours spent in planning and running phylogenetic analyses. The automation of computational workflows will become an increasingly important task in genomics, as manual inspection and analysis of datasets becomes impractical at the genome scale.

1.2 Scientific Contribution

In this thesis, we explored several areas relevant to the inference of phylogenetic trees under maximum likelihood. The phylogenetic likelihood function (PLF), introduced by Joe Felsenstein in 1981 [23], is one of the most important statistical functions in the area of evolutionary bioinformatics.

The existing open-source software RAxML [87] is a widely used tool for phylogenetic inference. In RAxML, the PLF dominates inference times (typically accounting for 80-95% of total execution time) and overall memory requirements (accounting for at least 70% of total RAM consumption).

In this thesis, we used RAxML as a platform to implement and test new search algorithms, port the PLF to GPUs, implement memory-saving techniques, and infer extremely large plant trees from real biological data. These ad hoc implementations are freely available as open-source code under the GNU GPL license. While we have used RAxML as a workbench, the techniques described in this dissertation are conceptually generic enough to be applied to other likelihood-based state-of-the-art programs for phylogenetic inference such as IQPNNI [53], GARLI [116], PHYML 3.0 [29], FastTree 2.0 [66], MrBayes [72], PhyloBayes [46], and BEAST [17], as well as to PLF libraries such as BEAGLE [4], and other phylogenetic applications such as DPPDiv [31], which requires computing the PLF to estimate divergence times on fixed topologies.

We have developed the backbone algorithm, which can reduce the time required for tree inferences by more than 50% while yielding 'good' trees in the statistical sense. This is achieved by constraining the space of possible topologies that are evaluated by the PLF.

Regarding memory requirements, we present three different techniques that can be applied to compute the likelihood score for a given tree while reducing memory usage. The first two techniques, named the out-of-core and recomputation approaches, reduce memory requirements at the cost of longer running times. The out-of-core approach stores on disk the intermediate results that do not fit into memory. The recomputation approach is a trade-off in which additional computations are used to re-compute intermediate results when not enough RAM is available to store them.

The third memory-saving technique presented, based on the SEV (Subtree Equality Vectors) technique, implements the PLF so that some computations are skipped when data is not present. This can reduce memory requirements almost proportionally to the amount of missing data and, in some cases, can also significantly reduce running time. However, it is only applicable to input datasets that show certain patterns of missing data.

BEAGLE [4] is a general-purpose library for evaluating the likelihood of sequence evolution on trees. One major advantage of using such software libraries is that they often give the user access to parallelized and optimized function kernel implementations. The Exelixis Lab (http://exelixis-lab.org/) is currently developing the Phylogenetic Likelihood Library (PLL) [27]. Based on the optimized and parallelized PLF implementation of RAxML, the PLL is a highly optimized open-source library that entails state-of-the-art implementations for common input datatypes (currently DNA and protein data). The PLL can be executed transparently on a large number of emerging parallel architectures. In the context of this project, we developed a proof-of-concept OpenCL GPU implementation of the key PLF functions for DNA data.

One of the consequences of the rapid growth of molecular databases is that existing phylogenies, after short periods of time, may no longer reflect the latest available data. Furthermore, re-starting phylogeny reconstruction from scratch to add new data is an expensive task that involves computational resources and man-hours. In order to address this issue, we have developed PUmPER (Phylogenies Updated PERpetually), a framework for developing automated pipelines for phylogenetic analysis. PUmPER can automatically update existing phylogenies by iteratively assembling new alignments and extending existing trees, without human intervention. The framework can be configured to run as a stand-alone pipeline on a single machine, or to offload computationally expensive tasks to a cluster. PUmPER is written in Ruby and transparently uses state-of-the-art tools for alignment construction (PHLAWD [81]) and phylogenetic inference (RAxML-Light [91]).


The main results presented in this thesis have been published in four peer-reviewed conference papers [35, 37, 39, 40] and two journal articles [36, 38].

During the course of this thesis, we collaborated on and conducted additional work on other research topics not presented here. These projects involved the development and usage of the PLL library [15, 27], heuristic algorithms for the protein assignment problem [30], the integration of the recomputation technique into RAxML-Light [91], and phylogenetic analyses in the context of metagenomic species profiling [101]. We also collaborated with the ongoing 1000 insect transcriptome evolution project (http://www.1kite.org/).

1.3 Structure of this thesis

This thesis is structured as follows: Chapter 2 provides a general introduction to maximum likelihood-based inference of phylogenetic trees, and includes details on RAxML concepts that are a prerequisite for the rest of the text. In Chapter 3, we introduce methods to reduce the memory requirements of the PLF. The backbone algorithm for reducing the tree search space is described in Chapter 4. Chapter 5 gives an introduction to GPU programming, followed by Chapter 6, which describes our OpenCL-based GPU implementation of the PLF. In Chapter 7, we introduce PUmPER, our framework for perpetually updated phylogenies. Finally, we conclude and discuss future work in Chapter 8.


Chapter 2

Computational Molecular Phylogenetics

In this chapter, we briefly introduce some basic concepts related to the field of Molecular Phylogenetics.

Phylogenetics is the study of evolutionary relationships among organisms that share common ancestors. The most common structure used to represent these relationships is a phylogenetic tree, which depicts ancestor-descendant relationships. From a biological perspective, it is known that evolution cannot be modeled as a strict branching process, because genetic information is not always transferred vertically. For instance, phenomena such as hybridization and lateral gene transfer (LGT) cannot be modeled by a phylogenetic tree. More complex structures, such as phylogenetic networks, are topics of active research [33]. In this thesis, however, we focus solely on phylogenetic trees.

The input data used to infer trees can be morphological or molecular sequences. Morphological sequences are generated by encoding observed morphological traits. Molecular data (e.g., DNA, RNA, or AA) are obtained through sequencing technologies. Due to the continuously decreasing cost of Next-Generation Sequencing (NGS) technologies, molecular sequences are nowadays the most common source of data, especially for large-scale datasets. In this thesis, we will assume that molecular information, in particular DNA or protein data, is used in all the methods and applications discussed.

2.1 Statistical Models of Evolution

A molecular sequence (e.g., a DNA molecule) can be represented as a string of characters, where each character is an element of a finite alphabet. In the case of DNA data, the four nucleotides (also called bases) are A, C, G, T. For RNA data, these are A, C, G, U. For amino acid (AA) data, the alphabet comprises 20 characters, which are also called residues. Each element of the alphabet represents a possible character state for a given sequence index (position or site).

A statistical model of evolution describes how likely it is for a state to mutate into a different state (a transition) within a given evolutionary time t. We will now briefly describe nucleotide substitution models of evolution for DNA data. The distance between two DNA sequences is defined as the expected number of nucleotide substitutions per site. The number of substitutions can be estimated with a continuous-time Markov model. We assume that sites evolve independently of each other. The Markov chain has four possible states (A, C, G, T) and is memory-less (the next state depends only on the current one). The substitution rate matrix Q = {q_{i,j}} defines the instantaneous rate at which state i mutates into state j within an infinitely small amount of time dt. Each row of the Q matrix sums to zero. To compute the probability of a state mutating into another within time t, we need to compute the transition-probability matrix as follows:

P(t) = e^{Qt}    (2.1)

If Q is symmetrical, we call the model time-reversible. The Markov chain is reversible if and only if π_i q_{i,j} = π_j q_{j,i} for any i ≠ j, where π_i is the probability that the chain is in state i as time t → ∞. The set (π_A, π_C, π_G, π_T) is known as the limiting distribution or stationary distribution of the chain. If the states of the chain are in the stationary distribution, the chain will stay in that distribution. These state frequencies are also called equilibrium base frequencies. The equilibrium frequencies are determined by the Q matrix, but in practice they are often estimated from the data or optimized numerically.

The numerical optimization of the equilibrium base frequencies adds three free parameters to the model, since π_A, π_C, π_G, and π_T need to sum to one. If the equilibrium base frequencies are estimated from the data (by counting the number of occurrences of each state), they are called empirical base frequencies.
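For illustration, empirical base frequencies can be computed by a simple counting pass over the alignment. The following is a minimal sketch; the toy sequences and the helper name empirical_base_frequencies are invented for this example, and gap or undetermined characters are simply skipped:

```python
from collections import Counter

def empirical_base_frequencies(sequences):
    """Estimate empirical base frequencies by counting state occurrences.

    Gap and undetermined characters (e.g., '-', 'N', '?') are ignored,
    so only observed A/C/G/T states contribute to the counts.
    """
    counts = Counter()
    for seq in sequences:
        counts.update(c for c in seq.upper() if c in "ACGT")
    total = sum(counts.values())
    return {base: counts[base] / total for base in "ACGT"}

# Toy two-sequence "alignment" (hypothetical data, for illustration only).
freqs = empirical_base_frequencies(["ACGT-ACGT", "ACGTNACGA"])
print(freqs)  # frequencies sum to 1
```

Note that the resulting frequencies always sum to one by construction, which is exactly why optimizing them numerically adds only three (not four) free parameters.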

The standard approach to calculating the transition probability matrix P(t) is to numerically compute the eigenvalues and eigenvectors of the rate matrix Q [112], as shown in Equation 2.2:

Q = U Λ U^{-1}    (2.2)

where Λ is a diagonal matrix containing the eigenvalues of Q and U is a square matrix whose i-th column is the i-th eigenvector of Q. We can now rewrite Equation 2.1 as:

P(t) = e^{Qt} = U e^{Λt} U^{-1}    (2.3)
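To make Equation 2.3 concrete, the following sketch computes P(t) via an eigendecomposition for the Jukes-Cantor rate matrix (introduced below as a special case of GTR). This is a minimal illustration, not the thesis implementation: numpy is assumed to be available, the rate scaling mu is arbitrary, and because this particular Q is symmetric we can use eigh, so that U^{-1} = U^T:

```python
import numpy as np

# Jukes-Cantor rate matrix: equal rates and equal base frequencies.
# Off-diagonal entries are 0.25, and each row sums to zero as required.
mu = 1.0
Q = mu * (np.full((4, 4), 0.25) - np.eye(4))

# Eigendecomposition Q = U Lambda U^{-1} (Equation 2.2). Q is symmetric
# here, so eigh applies and the inverse of U is simply its transpose.
lam, U = np.linalg.eigh(Q)

def transition_probabilities(t):
    """P(t) = U exp(Lambda * t) U^{-1} (Equation 2.3)."""
    return U @ np.diag(np.exp(lam * t)) @ U.T

P = transition_probabilities(0.5)
print(P)  # a valid transition matrix: each row sums to 1
```

For a general (non-symmetric) reversible Q, one would instead symmetrize the matrix with the equilibrium frequencies before decomposing it, which is the usual numerical strategy.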

We can now describe the GTR (General Time Reversible) model [45], which is the most commonly used model of nucleotide substitution in large-scale phylogenetic inference. A detailed description of GTR and other nucleotide and amino acid evolution models can be found in Chapters 1 and 2 of [112].

Q = \{q_{i,j}\} =
\begin{pmatrix}
\cdot & q_{A,C} & q_{A,G} & q_{A,T} \\
q_{C,A} & \cdot & q_{C,G} & q_{C,T} \\
q_{G,A} & q_{G,C} & \cdot & q_{G,T} \\
q_{T,A} & q_{T,C} & q_{T,G} & \cdot
\end{pmatrix}
=
\begin{pmatrix}
\cdot & a\pi_C & b\pi_G & c\pi_T \\
a\pi_A & \cdot & d\pi_G & e\pi_T \\
b\pi_A & d\pi_C & \cdot & f\pi_T \\
c\pi_A & e\pi_C & f\pi_G & \cdot
\end{pmatrix}
\quad (2.4)

Equation 2.4 can be decomposed into the product of a symmetric matrix of substitution rates and a diagonal matrix of equilibrium frequencies:

Q = \{q_{i,j}\} =
\begin{pmatrix}
\cdot & a & b & c \\
a & \cdot & d & e \\
b & d & \cdot & f \\
c & e & f & \cdot
\end{pmatrix}
\begin{pmatrix}
\pi_A & 0 & 0 & 0 \\
0 & \pi_C & 0 & 0 \\
0 & 0 & \pi_G & 0 \\
0 & 0 & 0 & \pi_T
\end{pmatrix}
\quad (2.5)

The substitution rates a, b, c, d, e, f are parameters that can be freely optimized (usually one fixes f := 1). GTR therefore has a total of eight free parameters: five relative substitution rates plus three free equilibrium base frequencies.

Other nucleotide substitution models can be derived from GTR by simply imposing restrictions on the base frequencies and/or the substitution rates. These derived models are simpler (nested within GTR) and have fewer free parameters. For instance, the Jukes-Cantor model [41] is given by π_A = π_C = π_G = π_T = 0.25 and a = b = c = d = e = f = 1.0.
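The nesting can be illustrated by assembling a GTR Q matrix from the six rates and the base frequencies following Equation 2.5 and then plugging in the Jukes-Cantor restrictions. This sketch is for illustration only (the function name gtr_rate_matrix is invented here), and it assumes numpy:

```python
import numpy as np

def gtr_rate_matrix(rates, freqs):
    """Build the GTR rate matrix Q from six rates (a..f) and base frequencies.

    Following Equation 2.5, Q is the product of a symmetric rate matrix
    and diag(pi); the diagonal is then set so that each row sums to zero.
    """
    a, b, c, d, e, f = rates
    R = np.array([[0, a, b, c],
                  [a, 0, d, e],
                  [b, d, 0, f],
                  [c, e, f, 0]], dtype=float)
    Q = R @ np.diag(freqs)
    # Rows of Q must sum to zero; the off-diagonal sums fix the diagonal.
    np.fill_diagonal(Q, -Q.sum(axis=1))
    return Q

# Jukes-Cantor as a nested special case of GTR:
# all rates equal to 1 and all base frequencies equal to 0.25.
Q_jc = gtr_rate_matrix([1.0] * 6, [0.25] * 4)
print(Q_jc)  # off-diagonal entries 0.25, diagonal entries -0.75
```

Because the rate matrix R is symmetric, any Q built this way satisfies the reversibility condition π_i q_{i,j} = π_j q_{j,i} by construction.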

2.2 Sequence Alignment

A sequence alignment can be seen as a matrix of n sequences of molecular data (DNA, RNA, or protein), where each row corresponds to one sequence. In general, before the sequences are aligned, each input sequence has a different length.

Any non-trivial alignment contains gaps (sites to which no state is assigned). Gaps may occur for biological reasons (character insertions or deletions). Data may also be undetermined for technical reasons (data is not read correctly during sequencing), or because information is missing (a specific gene for a specific taxon has never been sequenced). In the methods described in this thesis, missing and undetermined characters are treated as gaps.

Aligning sequences consists in identifying regions of common similarity; that is, gaps are inserted between characters so that the s columns of the matrix contain similar characters. Alignment algorithms strive to maximize this similarity according to a given optimization criterion. Gaps observed in an alignment column are interpreted as indels (insertions or deletions in the sequence). Columns including mismatches can be interpreted as sequence sites where one or several species have been subject to point mutations. In DNA data, point mutations are random SNPs (Single Nucleotide Polymorphisms), which usually take place during DNA replication.

If the codon resulting from the mutation is translated into the same amino acid, the substitution is called silent or synonymous. Otherwise, if the mutation induces a change into a different amino acid, it is called a nonsynonymous substitution, and depending on how different (in terms of biochemical properties) the new amino acid is, the protein function can be affected.
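As a toy illustration, the distinction can be expressed by comparing the amino acids encoded before and after a point mutation. The snippet below uses only a four-entry excerpt of the standard genetic code; the table and function names are made up for this example:

```python
# A small excerpt of the standard genetic code (DNA codons -> amino acids).
CODON_TABLE = {
    "GAA": "Glu", "GAG": "Glu",  # glutamate
    "AAA": "Lys", "AAG": "Lys",  # lysine
}

def classify_substitution(codon_before, codon_after):
    """Classify a point mutation as synonymous or nonsynonymous."""
    if CODON_TABLE[codon_before] == CODON_TABLE[codon_after]:
        return "synonymous"
    return "nonsynonymous"

print(classify_substitution("GAA", "GAG"))  # synonymous: both encode Glu
print(classify_substitution("GAA", "AAA"))  # nonsynonymous: Glu -> Lys
```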

Sequence alignment is based on the notion that similar regions share a common evolutionary history. The interpretation is that regions with high similarity do not show changes because they correspond to an important, evolutionarily conserved, structural property or function.

2.2.1 Pairwise Sequence Alignment

A pairwise sequence alignment is an alignment of two molecular sequences, that is, n := 2. In global alignment, every character of each sequence is aligned, that is, the alignment is at least as long as the longest sequence. In local alignment, a segment of one sequence, often called the query, is aligned against a segment of a reference sequence.

The two main computational approaches are based on dynamic programming and k-tuple methods.

The Needleman–Wunsch algorithm [57], introduced in 1970, is based on dynamic programming and performs global alignment. In 1981, the Smith–Waterman algorithm [83], an adaptation of the Needleman–Wunsch algorithm, was introduced to solve the local pairwise alignment problem.

The importance of Needleman–Wunsch and Smith–Waterman is that, given a scoring function, they can find optimal solutions for the matching problem (by using dynamic programming) with time and space complexity given by O(s1 s2), where si is the length of sequence i.


Sequence 1 for Gene arb AGCATCGATACCGATATCGCGAT

Sequence 2 for Gene arb AGAAGCGATCCCATACAGATGCGAT

Sequence 3 for Gene arb AGCATCGATACAGATGGCACGAT

Sequence 4 for Gene arb AGCAGCGATACATGGCACGAT

<---- MSA for Gene arb ---->

Seq1 AGCATCGAT-----ACCGATATCGCGAT

Seq2 AGAAGCGATCCCATACAGATG---CGAT

Seq3 AGCATCGAT-----ACAGATGGCACGAT

Seq4 AGCAGCGAT-----AC--ATGGCACGAT

Figure 2.1: From a set of sequences to an alignment: Gaps are introduced in order to align the available sequences.

The Smith–Waterman algorithm is extensively used in applications such as genome mapping and gene prediction, and has therefore been the subject of intensive optimization and parallelization. Currently, there exist efficient implementations based on FPGAs (Field-Programmable Gate Arrays) [2], GPUs [50], and vectorization on CPUs [71].

Word (k-tuple) methods are heuristic methods that do not guarantee to find an optimal solution, but usually run faster than dynamic programming approaches. These methods are employed by tools that approximately match queries against large sequence databases, such as BLAST [3] or FASTA [47]. Following a seed-and-extend strategy, exact matches of k-tuples (seeds) are identified first. Then, the matching locations are extended into longer alignment candidates called HSPs (high-scoring pairs). HSPs that are statistically significant may then be locally aligned with Smith–Waterman.

2.2.2 Multiple Sequence Alignment

A multiple sequence alignment (MSA) is an alignment of n molecular sequences, where n > 2. An alignment has s alignment columns, which are also often referred to as sites. The alignment length typically refers to the number of sites. The phylogenetic signal present in an alignment generally increases with alignment length [56].

For illustration, an MSA with a length of s := 28 sites, corresponding to a fictional gene arb, and 4 sequences is shown in Figure 2.1.


<-------- Gene arb --------><----- Gene arb2 ------>

Seq1 AGCATCGAT-----ACCGATATCGCGATA-TAAGCTA-CGT-AGTTGAGGGT

Seq2 AGAAGCGATCCCATACAGATG---CGATG-TAAGCAA-CGT-AGTTGAGGGT

Seq3 AGCATCGAT-----ACAGATGGCACGATG-TAGCCTA-CCC-AGTTTCGCGG

Seq4 AGCAGCGAT-----AC--ATGGCACGATG-TGGCCTA-CCC-AGTTGCGCGT

Figure 2.2: Alignments for gene arb (red) and gene arb2 (blue) are concatenated to build a multiple-gene MSA.

It is common to concatenate alignments from different sources (e.g., different genes) for the same set of taxa. Such large concatenated datasets can then be organized into partitions. For instance, each gene may correspond to a separate partition. In a Maximum Likelihood partitioned analysis, the evolutionary model is evaluated under a specific optimized set of parameters for each partition.

Concatenated alignments, as shown in Figure 2.2, where two partitions have been concatenated, are also commonly referred to as supermatrices.

The sites assigned to one partition often, but not necessarily, correspond to genes. The partition boundaries can also be based on other criteria, such as codon positions. It is important to note that there is a distinction between gene and species phylogenies. Due to biological events such as gene loss, gene duplication, and incomplete lineage sorting, a gene phylogeny may show a number of topological conflicts with the species phylogeny, which is therefore much more difficult to reconstruct. In this thesis, we always refer to a phylogeny as the tree we reconstruct given the alignment.

The process of building a multiple sequence alignment for phylogenetic inference is composed of several phases, which we summarize below:

1. Selection of the region of interest: The region may be a specific gene, a set of genes, or even a whole genome (phylogenomics). This determines the length of the alignment.

2. Selection of a taxonomic group and sequence retrieval: The group of sequences (taxa) that will be included in the alignment. This step involves homology identification (two sequences are homologous if they share a lineage and have descended from a common ancestor). These sequences may be obtained through sequencing or by querying databases such as GenBank [6].


3. Sequence alignment: Sequences are aligned and gaps are introduced so that a scoring criterion is maximized (matching sites are favoured, insertions of gaps are penalized). Sequence alignment of n sequences (or taxa) has been shown to be NP-hard [21], that is, computing the globally optimal solution is not tractable. In practice, a heuristic search strategy called progressive alignment is often employed. Some popular multiple sequence aligners are mafft [42] and muscle [19].

The above procedure can be (partially) automated by computational pipelines such as PHLAWD [81]. The quality of the generated alignment, however, is difficult to assess, because the true alignment is unknown. The quality of the MSA has a substantial impact on the accuracy of the resulting phylogenetic reconstruction, and in general we assume that the given MSA is "correct". Nonetheless, it is important to be aware that, no matter how accurate the phylogenetic search method may be, a good phylogenetic tree can only be inferred if a good MSA is available. Unfortunately, there is no straightforward mechanism to assess the quality of an MSA. However, given a pairwise scoring function (see Subsection 2.2.1), it is possible to score an MSA by computing the sum-of-pairs (SP) function. The SP score is given by the sum of all column scores in the alignment, where the score of each column is the sum of the pair scores for all pairs that occur in the column. The accuracy of the phylogenetic method itself is hard to evaluate because the true evolutionary tree is unknown.

Furthermore, available high-quality databases of curated sequence alignments, such as Pandit [107], can be used to benchmark MSA tools.

2.3 Phylogenetic Trees

In graph theory, a graph is a representation of a set of vertices where some pairs of vertices are connected by edges. In an undirected graph, edges have no orientation. A path is a sequence of edges that connects a sequence of vertices. A tree is an undirected graph in which any two vertices are connected by exactly one path. Therefore cycles, where two vertices are connected by more than one path, are not present in a tree.

In phylogenetics, such trees are called phylogenetic trees. The branching pattern defines the topology. Vertices are usually called nodes; nodes are also called TUs (taxonomic units). Edges are referred to as branches. Terminal nodes of the tree are often called tips, taxa, or OTUs (Operational Taxonomic Units). A single tip is called a taxon, or OTU. The taxa represent living (extant) organisms for which molecular data is available and can be sequenced. The inner nodes represent hypothetical extinct common ancestors.


[Figure: a protein alignment of seven sequences (seq1–seq7), displayed with seaview.]

Figure 2.3: A protein alignment comprising seven sequences. Image generated with seaview version 4.3.1.

In general, a phylogenetic tree is represented as an unrooted binary tree, where all nodes have degree one (leaves) or three (inner nodes). Thus, given n taxa, there exist n − 2 inner nodes (ancestral nodes) and 2n − 3 edges (branches). For example, Figure 2.4 shows an unrooted parsimony tree for the protein alignment shown in Figure 2.3.

The number of distinct possible unrooted topologies is given by Equation 2.6 [20].

N_{\mathrm{unrooted}} = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}, \quad n \ge 3. \qquad (2.6)

A root is a node from which a unique path leads to any other node of the tree. If a root is given in the tree, the direction of the evolutionary path is uniquely determined. Such topologies are called rooted trees.

The number of possible rooted topologies for n tips can be computed as the number of unrooted topologies for n + 1 tips and is given by Equation 2.7.

N_{\mathrm{rooted}} = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}, \quad n \ge 2. \qquad (2.7)

In a rooted tree, a clade is a group of nodes that have evolved from a common ancestor.

The length of a branch represents the evolutionary distance between the two nodes it connects. Branch lengths are measured in units of evolutionary change. A tree without branch length values is called a cladogram. For example, Figure 2.5 shows an unrooted tree with branch lengths for the protein alignment shown in Figure 2.3.

Figure 2.4: An unrooted phylogenetic tree based on the alignment from Figure 2.3.

Figure 2.5: An unrooted tree with branch lengths based on the alignment from Figure 2.3.

2.4 Tree Reconstruction Methods

There exist different methods for reconstructing trees from molecular sequences. The common principle is the analysis of the differences among sequences and the use of that information to reconstruct an evolutionary history.

We can classify reconstruction methods into distance-based and character-based methods.

Distance methods, where a distance matrix is computed from the pairwise distances among sequences, are based on the notion that common ancestry is reflected by sequence similarity. Once a distance matrix is available, a clustering algorithm is applied to deterministically generate a phylogenetic tree. A representative example of this group is Neighbour Joining (NJ). The canonical NJ algorithm requires O(n^3) time and O(n^2) space, where n is the number of species. This can lead to expensive memory requirements when the number of species is large, but efficient implementations, such as NINJA [106], have been designed to handle such datasets.

We now introduce character-based methods, where different tree topologies are searched and evaluated under a given scoring criterion.

Parsimony methods compare trees based on the parsimony criterion. The parsimony score of a given tree is the minimum number of evolutionary state changes required to generate the MSA (Multiple Sequence Alignment) given the tree under evaluation.

Theoretically, in order to find the most parsimonious tree, all possible topologies would have to be evaluated. In practice, because the number of possible topologies grows super-exponentially with the number of taxa, the objective is to search the tree space for local optima by applying heuristics that evaluate a promising subset of topologies.

Maximum Likelihood (ML) and Bayesian methods use a multiple sequence alignment and a statistical evolutionary model. Both methods are based on statistical frameworks that extensively rely on the computation of the Phylogenetic Likelihood Function (PLF). In Maximum Likelihood, the score to maximize is the likelihood, which is the probability of observing the MSA given the hypothetical tree and other parameters. Thus, we strive to explore and find the tree and set of parameters that maximize this score for our MSA. Likelihood scores cannot be compared across different MSAs.

Under Bayesian methods, the computed score is the (posterior) probability of the hypothetical tree, given the data (MSA and model). While this score has a clearer statistical interpretation (the score is the probability that the tree is the true tree), it also entails challenges such as, for instance, how to set the prior probability of a tree, which is unknown. Like parsimony-based methods, Maximum Likelihood and Bayesian methods explore a subset of possible topologies using heuristics.

Distance matrix methods are usually the fastest, since they do not require scoring alternative hypotheses during the tree search. Parsimony trees are much faster to score than Likelihood/Bayesian trees, because they rely on a simpler score calculation.

The accuracy of phylogenetic inference methods can be evaluated by conducting simulations, for instance, with tools like INDELible [26]. Such experiments allow us to generate simulated true alignments (MSAs) and true trees. The simulated alignment can then be used to reconstruct trees with different methods. The accuracy of each method can then be evaluated by computing the topological distance between the reconstructed tree and the simulated (true) tree. Trees on simulated data, however, tend to be easier to reconstruct than on real data. One possible explanation is that simulation programs generate the data using the same statistical models of evolution as the programs used to infer the tree, that is, the evolutionary model is given a priori [94]. Thus, accuracy results derived from such experiments should be interpreted with caution.

In general, distance methods are the least accurate, but also the fastest. In terms of accuracy, Likelihood and Bayesian methods clearly outperform the other methods, mainly due to the fact that they rely on an explicit evolutionary model [52]. In particular, Maximum Likelihood is consistent in the statistical sense: the reconstructed tree converges to the true tree as the number of aligned sites goes to infinity [25]. In this thesis, we focus on Maximum Likelihood methods.

2.5 Computing the Likelihood of a Tree

As a consequence of the Maximum Likelihood search process, applying the PLF (Phylogenetic Likelihood Function) to evaluate alternative tree topologies dominates both the running time and the memory requirements of phylogenetic inference programs [39]. In Chapter 3, we describe the memory requirements in detail. We will now discuss how to compute the PLF.

The computation of the PLF relies on the following assumptions:

1. Sites in the MSA evolve independently from each other.

2. Evolution in different parts of the tree is independent.

3. A comprehensive unrooted phylogeny T, including a set of branch lengths b_{ij}, is given.

4. An evolutionary model is available, which defines the transition probabilities P_{ij}(b), that is, the probability that j will be the final state at the end of a branch of length b, given that the initial state is i.

5. The evolutionary model is time-reversible, that is, \pi_i P_{ij}(b) = \pi_j P_{ji}(b).


These assumptions allow us to compute the likelihood of the tree as the product of the site likelihoods of each column i of the MSA with s columns, as shown in Equation 2.8.

Let T be an unrooted binary tree with n tips. Let θ be a set of (optimized or given) evolutionary model parameters (see Section 2.1). Let φ = {b_{xy}} be a set of (optimized or given) branch length values for tree T, where b_{xy} is the length of the branch connecting nodes x and y in tree T (b_{xy} = b_{yx}, x ≠ y, and |φ| = 2n − 3).

L_T = \Pr(D \mid T, \theta, \phi) = \prod_{i=1}^{s} \Pr(D_i \mid T, \theta, \phi) \qquad (2.8)

In order to prevent numerical underflow, it is common practice to compute and report log likelihood values:

\log(L_T) = \log(\Pr(D \mid T, \theta, \phi)) = \sum_{i=1}^{s} \log(\Pr(D_i \mid T, \theta, \phi))

The root of the tree is, in general, unknown. In order to compute the likelihood on unrooted topologies, a virtual root is placed into an arbitrary branch. Because the substitution model is time-reversible, the likelihood of the tree is identical regardless of the branch chosen for placing the virtual root. Figure 2.6 shows a virtual root placed on an arbitrary branch of the tree depicted in Figure 2.5.

For notational clarity, let us assume that we are working with a rooted tree and DNA data, where we have an alphabet of size four. Thus, only four states are possible, corresponding to the nucleotides A, C, G, and T.

For each site i, four entries must be computed to store the conditional likelihood for the nucleotide states A, C, G, and T at node p. The conditional likelihood entry L_S^{(p)}(i) is the probability of everything that is observed from node p on the tree on up, at site i, conditional on node p having state S [25]. We can define the ancestral probability vector (APV) at node p and site i as:

\vec{L}^{(p)}(i) = (L_A^{(p)}(i),\, L_C^{(p)}(i),\, L_G^{(p)}(i),\, L_T^{(p)}(i)) \qquad (2.9)

Let us consider Equation 2.10. Its derivation is explained in detail in Chapter 4 of [112].

L_A^{(p)}(i) = \left( \sum_{S=A}^{T} P_{AS}(b_{qp})\, L_S^{(q)}(i) \right) \left( \sum_{S=A}^{T} P_{AS}(b_{rp})\, L_S^{(r)}(i) \right) \qquad (2.10)

This equation computes the ancestral probability vector (APV) entry L_A^{(p)} for observing nucleotide A at site i of a parent node p with two child nodes q and r, given the respective branch lengths b_{qp} and b_{rp}, the corresponding transition probability matrices P(b_{qp}) and P(b_{rp}), and the probability vectors \vec{L}^{(q)} and \vec{L}^{(r)} of the children for site i. The child/parent relationship is given by the position of the virtual root (see Figure 2.7).

Figure 2.6: A rooted view of the unrooted ML phylogenetic tree from Figure 2.5. The virtual root is placed in an arbitrary branch.

The tips, which do not have any child nodes, must be initialized. Probability vectors at the tips are also called tip vectors. In general, the sequence at a tip has a known state at each site, which can therefore be directly translated into a probability. For instance, if site i is an A, the tip vector can be directly initialized as (L_A^{(p)}(i), L_C^{(p)}(i), L_G^{(p)}(i), L_T^{(p)}(i)) := (1.0, 0.0, 0.0, 0.0).

To efficiently calculate the likelihood of a given, fixed tree, we execute a depth-first post-order tree traversal that starts at the virtual root. Via the post-order tree traversal, the probabilities at the inner nodes (ancestral probability vectors) are computed bottom-up, from the tips toward the virtual root. This procedure for calculating the likelihood is called the Felsenstein pruning algorithm [23].

At the virtual root node, the site likelihood of the tree can be computed using Equation 2.11.

L_T(i) = \Pr(D_i \mid T, \theta, \phi) = \sum_{x=A}^{T} \pi_x L_x^{(\mathrm{root})}(i) \qquad (2.11)

In Equation 2.11, π_x is the equilibrium frequency of state x. It represents the probability that the nucleotide at the root is x.


Figure 2.7: Ancestral probability vector entries are computed based on the APVs of the child nodes, the model, and the transition probability matrices P(b_{qp}) and P(b_{rp}).


A more detailed description of PLF computations and efficient PLF implementations can be found in [88].

2.5.1 Accounting for rate heterogeneity

An important extension of the PLF is motivated by the observation that not all alignment sites evolve at the same rate [48]. Ignoring rate variation may result in incorrect reconstruction of phylogenies [111]. Thus, models of rate heterogeneity have been introduced in order to accommodate the biological phenomenon of rate variation.

The most common model is the Γ model of rate heterogeneity [110], where for each site the likelihood is integrated over a continuous Γ distribution of rates. The gamma density is given by Equation 2.12.

g(r; \alpha, \beta) = \frac{\beta^{\alpha} r^{\alpha-1} e^{-\beta r}}{\Gamma(\alpha)} \qquad (2.12)

The mean of the distribution is α/β. The Γ model sets α := β, so that the mean is 1. Thus, the distribution has only one parameter, α, which is usually adjusted by numerical optimization. Figure 2.8 shows the effect of α on the distribution of rates. If α < 1, the distribution implies that there is a large amount of rate variation, that is, many sites evolve slowly but some sites may have high rates. Large values of α tend to minimize the effect of rate variation, since most rates fall close to the mean.

We can adapt Equation 2.11 to compute the site likelihood by integrating over the rate distribution.

L_T(i) = \Pr(D_i \mid T, \theta, \phi, \alpha) = \int_{0}^{\infty} g(r; \alpha) \sum_{x=A}^{T} \pi_x L_x^{(\mathrm{root})}(i, r)\, dr \qquad (2.13)

In phylogenetic software tools, a more computationally tractable discrete gamma model is implemented. This model approximates the continuous distribution using K rate categories with equal probability. Accounting for rate heterogeneity increases accuracy but has a substantial impact on performance: for every node, site, and state, K entries must be computed in the APVs. Therefore, likelihood computations and memory requirements increase by a factor of K. The mean rate in each category is used to represent all rates in that category. The most common choice is to use four rates, K := 4, which is sometimes called Γ4.

[Figure: probability density of the gamma distribution with α = β, plotted for α = 0.1, 0.5, 5, 20, and 60.]

Figure 2.8: The shape parameter α is inversely related to the rate variation among sites: larger values of α reduce the variation.

Individual per-site evolutionary rates are not used because this might lead to over-fitting and over-parametrization of the data. The PSR model (previously known as CAT [86]) attempts to alleviate this issue by using a fixed number of rate categories c, with c ≪ s, where s is the number of sites. Each individually optimized evolutionary rate r_i, with 1 ≤ i ≤ s, is mapped to a rate category ρ_j, with 1 ≤ j ≤ c. The relevance of the PSR model is that, while memory usage and the number of computations are four times lower than with the Γ4 model, reconstructed trees show comparable Γ-likelihood values [86].

2.6 Maximum Likelihood Tree Search

The objective of ML (Maximum Likelihood) phylogenetic inference is to find the topology that maximizes the (log) likelihood score (computed by the PLF), given the data (MSA) and a statistical model of evolution.

A naïve strategy for tree space exploration is an exhaustive search, that is, all possible unrooted topologies are enumerated. Then, for each topology, the branch lengths and model parameters are optimized to maximize the likelihood. The topology with the highest likelihood is the optimal ML topology. In practice, this is only feasible for very small trees. In general, finding the optimal ML topology is NP-hard [70].

Tree space is explored by making use of heuristics, with the goal of scoring only a small subset of good topologies. One idea for exploring tree space is to use a greedy algorithm. First, a comprehensive tree is generated randomly or with a faster method such as Neighbour Joining or Parsimony. Then, small rearrangements are applied to the topology of this initial tree, generating new trees that can be scored with the PLF. Whenever a new tree is better (i.e., has a higher likelihood score), it is kept and new rearrangements are applied to this tree.

Typical topological moves for finding/reconstructing a tree topology with an improved likelihood score are SPR (Subtree Pruning and Re-grafting), NNI (Nearest Neighbor Interchange), and TBR (Tree Bisection and Reconnection) moves.

Most of the time, as the SPR example in Figure 2.9 shows, these moves only induce local changes to the tree topology. In other words, the majority of the ancestral probability vectors are not affected by the topological change and therefore do not need to be recomputed/updated with respect to the locally altered tree topology via a full post-order tree traversal (see Section 2.5). In these cases, we can compute the likelihood of the tree after a topological move by only updating the (mostly small number of) ancestral probability vectors included in a local post-order tree traversal.

Figure 2.9: SPR (Subtree Pruning and Re-grafting) moves can be divided into stages. In (1), a comprehensive tree is available. In (2), the subtree defined by node 3 is selected and pruned; a new branch length needs to be computed to reconnect nodes 0 and 2. In (3), the algorithm selects an insertion branch, usually close to the pruning point. In (4), the pruned subtree is reinserted and new branch lengths are recomputed. In lazy SPRs like the one shown here, only the surrounding branch lengths are re-estimated.

All standard ML-based programs deploy search strategies that typically require updating only a small fraction of the probability vectors in the vicinity of the topological change.

The phylogenetic search is finished when no better topology can be found according to a convergence criterion. There are several possible convergence criteria, which vary among different phylogenetic software implementations. For instance, in PhyML, search convergence is determined by NNIs [29], whereas RAxML [87] bases its criterion on SPRs.

The best tree after convergence is usually a local optimum. There is no guarantee, however, that the global optimum can be found. Therefore, the usual practice is to run several searches from different starting trees to explore the landscape of local optima.

Assessment of Phylogenetic Trees

In general, the true evolutionary history is unknown. Therefore, an important question is how to assess the significance of a tree, and how to quantify the differences within a set of trees.

A bipartition, also referred to as a split, represents a branch of the tree and defines two disjoint sets of taxa. An unrooted topology is uniquely defined by its set of bipartitions. A non-trivial bipartition is a bipartition where both sets contain more than one element. A trivial bipartition is a bipartition where one set contains exactly one element.

For example, given the tree ((A,B),((C,D),E)), there are two non-trivial bipartitions, defined by {A,B},{C,D,E} and {A,B,E},{C,D}.

One common approach to measure how well a tree represents the underlying data is to repeat the phylogenetic search on resampled datasets and quantify how often each bipartition is recovered.

Given a Maximum Likelihood tree, a bootstrap analysis [24] can be conducted to resample the data. First, bootstrap replicates are generated by resampling columns of the original MSA with replacement. Thereafter, a phylogenetic tree is inferred on each bootstrap replicate MSA, using the same inference method as for the maximum likelihood tree. Finally, for each bipartition of the maximum likelihood tree, a bootstrap support value is computed as the percentage of bootstrap trees that contain that bipartition. Bootstrap support values reflect how well the maximum likelihood tree represents the underlying data. Bootstrap replicate trees can also be summarized into a single consensus tree. For instance, the majority rule (MR) criterion yields a consensus tree which only includes bipartitions that exist in half or more of the replicate trees.

Topological distances between unrooted trees are often measured with the Robinson-Foulds (RF) distance [69]. The RF distance between two trees, also termed symmetric difference or partition metric, is defined as the number of bipartitions that are not shared between the two trees. This value is usually normalized by the total possible number of non-trivial bipartitions. Thus, identical trees have an RF distance of 0, while trees that only share trivial bipartitions have an RF distance of 1.

Phylogenetic trees can be considered as competing evolutionary hypotheses. Multiple plausible hypotheses (trees) can be compared with site-likelihood based statistical tests such as the Kishino-Hasegawa (KH) [43], the Shimodaira-Hasegawa (SH) [77], and the Approximately Unbiased (AU) [76] tests. Given the site-likelihood values of each tree, the software package CONSEL [78] can compute p-values for each of these tests. A confidence set of trees (the set of trees that are significantly better than the others) can be obtained by collecting the trees with p-values larger than a chosen significance level, for instance, trees with p > 0.05.

2.7 Phylogenetic Likelihood Library

The Phylogenetic Likelihood Library (PLL) [27] is a parallelized and highly optimized software library for phylogenetic calculations. It has been derived from the highly tuned PLF implementation of RAxML [87].

The PLL comprises implementations of state-of-the-art algorithms and data structures, including those described in Section 2.8, along with low-level technical and hardware-dependent optimizations. The library can calculate (and optimize) the likelihood of a phylogenetic tree for different statistical models and data types. It also implements branch length optimization and various tree rearrangement techniques, such as Subtree Pruning and Re-grafting (SPR) and Nearest Neighbor Interchange (NNI). Moreover, the PLL can use multiprocessor architectures via a fine-grain parallelization of the PLF that relies on PThreads or MPI. In the parallel version, the alignment sites (or alignment partitions) are distributed across processors. Single x86 cores use SSE3 or AVX intrinsics to accelerate computations.

While the PLL is still work in progress, it has already been successfully deployed to substantially accelerate DPPDIV [31], a Bayesian program for divergence time estimates. It has also been used to implement the proof-of-concept GPU implementation described in Chapter 6.


2.8 RAxML-family implementation concepts

In this Section, we describe some implementation details regarding data structures and algorithms that are required for phylogenetic computations. The data structures and terminology we present here resemble the ones used in the RAxML-family, which includes the original standard-RAxML [87], as well as more recent phylogenetic codes such as RAxML-light [91], ExaML [90], and the PLL library [27].

Thus, we introduce some generic terminology, including pseudo-code for some functions and data structures whose naming and implementation may slightly differ between these codes, but which serve the same functionality. Most of these concepts are also present in other state-of-the-art phylogenetic codes.

These implementation details are required to outline the techniques described in Chapters 3, 4, and 6.

2.8.1 Node records

The node record is the fundamental data structure for computing the likelihood of a tree. Each node record can be seen as a subtree root. The data structure in Pseudo-code 1 represents a node record.

Pseudo-code 1 The node record data structure

typedef struct noderec
{
  double z;              /* branch length value                        */
  struct noderec *next;  /* next record in the circular node list      */
  struct noderec *back;  /* record at the other end of the branch      */
  int number;            /* tree node identifier                       */
  char x;                /* true if oriented towards the virtual root  */
} node;

2.8.2 Internal tree nodes

The internal nodes of a binary tree are represented by circular lists of three node records. Each internal node of the tree corresponds to one ancestral probability vector (APV). However, the entries of the APV differ depending on the orientation of the node. Within each circular list, by definition, the condition x == TRUE holds for exactly one node record. In Figure 2.10, three node records


Figure 2.10: Circular doubly linked lists of node records (black) represent tree nodes (red). Back links are not shown. Each node record represents a possible subtree with different ancestral probabilities. For each tree node, the APV entries are only valid for the node record oriented towards the virtual root (p, q, and r).

are labeled p, q, and r. The ancestral probability vector for the node corresponds to the subtree root defined by the node record oriented towards the virtual root.

Thus, given a node record p, all node records in its circular list share the same number, which uniquely identifies the internal node (p->number == p->next->number).
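To make the circular-list invariants concrete, the following self-contained sketch builds one internal node from three node records (makeInnerNode() is an illustrative helper, not part of the RAxML/PLL API):

```c
#include <stdlib.h>

/* Node record as in Pseudo-code 1. */
typedef struct noderec
{
  double z;
  struct noderec *next;
  struct noderec *back;
  int number;
  char x;
} node;

/* Hypothetical helper: allocate an internal node as a circular list of
 * three node records sharing the same node identifier. */
node *makeInnerNode(int number)
{
  node *a = calloc(1, sizeof(node));
  node *b = calloc(1, sizeof(node));
  node *c = calloc(1, sizeof(node));

  a->next = b; b->next = c; c->next = a;      /* circular list of three     */
  a->number = b->number = c->number = number; /* shared node identifier     */
  a->x = 1;                                   /* exactly one record oriented
                                                 towards the virtual root   */
  return a;
}
```

For any record p in the resulting list, p->number == p->next->number and p->next->next->next == p hold, and exactly one of the three records has x set.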

2.8.3 Data structure for Trees

The partition data (see Pseudo-code 2) can be accessed through tipSequences, which point to the raw alignment data in memory. The array tipVector contains the product of the left eigenvectors for all possible states. It can be used as a look-up table to compute the likelihood entries at the tips.

The traversal descriptor (see Pseudo-code 3) is used as a list of update operations for ancestral probability vectors at inner nodes. Each update operation involves two input child nodes (by convention q and r) and one parent output node (p). The traversalInfo data structure represents one update operation. The numbers pNumber, qNumber, and rNumber uniquely


Pseudo-code 2 The partition data structure

typedef struct
{
  double **apvEntries;           /* ancestral probability vectors             */
  unsigned char **tipSequences;  /* pointers to the raw alignment data        */
  double *tipVector;             /* look-up table for tip likelihood entries  */
} pInfo;

identify nodes p, q, and r. The parent node is always an inner node of the tree. The child nodes can either be tips or inner nodes. Thus, there are three possible cases. The tipCase field stores the type of configuration (inner/inner, inner/tip, or tip/tip).

Pseudo-code 3 The traversal-descriptor data structure

typedef struct
{
  int tipCase;   /* TIP_TIP, TIP_INNER, or INNER_INNER  */
  int pNumber;   /* parent (output) node                */
  int qNumber;   /* first child (input) node            */
  int rNumber;   /* second child (input) node           */
  double qz;     /* branch length towards q             */
  double rz;     /* branch length towards r             */
} traversalInfo;

typedef struct
{
  traversalInfo *ti;  /* list of update operations       */
  int count;          /* number of operations in the list */
} traversalDescriptor;

The tree data structure (see Pseudo-code 4) comprises the previously presented data structures.

2.8.4 Computing the likelihood on a Tree

For computing the likelihood on a tree, we need the following two core functions:


Pseudo-code 4 A simplified tree data structure

typedef struct
{
  node **nodep;             /* tree nodes, indexed by node number      */
  node *start;              /* entry point for tree traversals         */
  pInfo *partitionData;     /* per-partition data (see Pseudo-code 2)  */
  traversalDescriptor *td;  /* list of pending APV updates             */
} tree;

Pseudo-code 5 Building the traversal descriptor with a post-order traversal

void computeTraversalDescriptor(node *p, traversalDescriptor *td)
{
  if(isTip(p->number))
    return;

  node *q = p->next->back;
  node *r = p->next->next->back;

  if(isTip(q->number) && isTip(r->number))
  {
    addEntry(td, p, q, r, TIP_TIP);
  }
  else if(isTip(q->number) || isTip(r->number))
  {
    /* by convention, place the tip child in q */
    if(isTip(r->number))
      swap(&q, &r);
    /* only the inner child requires a recursive descent */
    computeTraversalDescriptor(r, td);
    addEntry(td, p, q, r, TIP_INNER);
  }
  else
  {
    computeTraversalDescriptor(q, td);
    computeTraversalDescriptor(r, td);
    addEntry(td, p, q, r, INNER_INNER);
  }
}


The newview(tree *tr, node *p) function updates an ancestral probability vector (apvEntries[p->number]) given two child nodes and the two transition probability matrices P for the respective branch lengths leading to these child nodes (see Figure 2.7).

The evaluate(tree *tr) function is called at the virtual root that has been placed into the unrooted tree for scoring it. Given the two ancestral probability vectors at either end of the rooted branch and the branch length, evaluate() computes the overall log likelihood of the tree.

As described in Section 2.5, in order to compute the log likelihood of a tree, we need to conduct a depth-first post-order traversal of the tree topology (starting at the virtual root). Pseudo-code 5 shows how the traversal descriptor is filled. Thereafter, we invoke newview() for each likelihood operation included in the descriptor. If we conduct a full traversal, the length of the descriptor will be equal to the number of inner nodes (all internal nodes are visited).

In some cases, for instance, while executing SPR moves (see Figure 2.9), only a partial traversal is required, because the local perturbation only requires a subset of the APVs to be updated. Once the traversal descriptor has been executed, all APVs are oriented towards the virtual root. Thus, we invoke evaluate() to calculate the overall log likelihood score of the tree.
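Putting the two core functions together, scoring a tree can be sketched in the same pseudo-code style as the listings above (names follow the data structures introduced earlier; this is an illustration, not the literal implementation):

```
computeTraversalDescriptor(virtualRoot, tr->td);     /* collect APV updates  */
for(i = 0; i < tr->td->count; i++)
    newview(tr, tr->nodep[tr->td->ti[i].pNumber]);   /* execute each update  */
logLikelihood = evaluate(tr);                        /* combine the two APVs */
                                                     /* at the virtual root  */
```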

2.8.5 Optimizing branch lengths

The functions evaluate() and newview() are sufficient to implement a Bayesian inference program, since the MCMC procedure, unlike the maximum likelihood method, does not require dedicated parameter optimization routines.

In the Maximum Likelihood (ML) framework, however, we do need such explicit parameter optimization routines. Typically, branch length optimization is implemented via the Newton-Raphson procedure.

Branch length optimization accounts for approximately 20% to 30% of total execution time in state-of-the-art ML tree inference algorithms [10, 93]. To optimize a specific branch, we first need to invoke newview() on the nodes at either end of the branch to ensure that the ancestral probability vectors are oriented towards the branch that is being optimized. In fact, this corresponds to placing the virtual root of the tree into the branch that shall be optimized. Furthermore, we also need to invoke newview() when a branch has been changed to ensure that the ancestral probability vectors in the tree are in a consistent state and reflect the altered branch.

The Newton-Raphson branch length optimization procedure involves the following two routines:


• coreDerivative() computes the first and second derivatives of the likelihood function at a given branch.

• sumGAMMA() pre-computes the element-wise product of the ancestral probability vectors to the left and the right of the branch under optimization. This product is then re-used repeatedly by iterations of coreDerivative() and saves time by avoiding redundant computations.

The RAxML-family codes provide a function for direct branch length optimization (called makenewz()) that implements the Newton-Raphson procedure using the two functions described above.
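The iteration underlying makenewz() can be sketched generically as follows; the derivative callback stands in for coreDerivative(), and toyDerivatives() is a purely illustrative likelihood surface with its maximum at z = 0.5 (a sketch, not the PLL implementation):

```c
#include <math.h>

/* Newton-Raphson on a branch length z: iterate z <- z - dl/d2l until the
 * first derivative dl of the log likelihood (almost) vanishes. */
double optimizeBranch(double z,
                      void (*derivatives)(double z, double *dl, double *d2l),
                      double eps, int maxIterations)
{
  for(int i = 0; i < maxIterations; i++)
  {
    double dl, d2l;
    derivatives(z, &dl, &d2l);   /* stands in for coreDerivative() */
    if(fabs(dl) < eps)
      break;                     /* converged                      */
    z -= dl / d2l;
  }
  return z;
}

/* Illustrative log-likelihood surface with its maximum at z = 0.5. */
void toyDerivatives(double z, double *dl, double *d2l)
{
  *dl  = -2.0 * (z - 0.5);
  *d2l = -2.0;
}
```

On this quadratic toy surface the iteration converges in a single step; on real per-branch likelihood surfaces, several iterations are typically required.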


Chapter 3

Memory-Saving Techniques

The computation of the phylogenetic likelihood function (PLF) for reconstructing evolutionary trees from molecular sequence data is both memory- and compute-intensive.

Based on our interactions with the RAxML [87] user community, we find that memory shortages are increasingly becoming a problem and represent the main limiting factor for large-scale phylogenetic analyses, especially at the genome level. At the same time, the amount of available genomic data is growing at a faster pace than RAM sizes. For instance, to compute the likelihood on a simulated DNA alignment with 1,481 species and 20,000,000 sites (corresponding roughly to the 20,000 genes in the human genome) on a single tree topology under a simple statistical model of nucleotide substitution within 48 hours, 1 TB of memory and a total of 672 cores are required.

Some solutions to this problem have been presented previously. The use of single precision arithmetics [7] can reduce memory requirements by 50%, but it can also potentially introduce numerical instability. Novel algorithmic solutions [92] have been introduced to exploit large sections of missing data. Those approaches, however, remain dataset-specific, that is, their efficiency and applicability depend on the specific properties of the Multiple Sequence Alignment (MSA) that is used as input.

The concepts presented in this chapter can be used to exactly compute the phylogenetic likelihood function while significantly reducing memory requirements. These techniques can be applied to all programs that rely on the phylogenetic likelihood function and can contribute significantly to enabling the computation of whole-genome phylogenies. In Section 3.1, we discuss the memory requirements associated with the computation of the PLF, which dominate the memory requirements of a typical phylogenetics application. In Section 3.2, we present an out-of-core implementation of the PLF. Section 3.3 describes a technique to compute the PLF by trading memory for


[Figure content: for a node v, each alignment site (Site 1, Site 2) holds two rate blocks (Rate 0, Rate 1); each rate block contains the four entries L_A, L_C, L_G, L_T.]

Figure 3.1: Memory layout of an ancestral probability vector with a two-rate Γ model of rate heterogeneity.

recomputations of ancestral probability vectors. In Subsection 3.4.1, we discuss the reduction of memory and computations by means of subtree equality vectors when alignments with large gappy sections are present.

3.1 Memory requirements for the PLF

The PLF is defined on unrooted binary trees. The n extant species/organisms of the MSA under study are located at the tips of the tree, whereas the n − 2 inner nodes represent extinct common ancestors. The molecular sequence data in the MSA, which has a length of s sites (alignment columns), is located at the tips of the tree. The memory requirements for storing those n tip vectors of length s are not problematic, because one 32-bit integer is sufficient to store, for instance, 8 nucleotides when ambiguous DNA character encoding is used. The memory requirements are dominated by the ancestral probability vectors that are located at the ancestral nodes of the tree. Depending on the PLF implementation, at least one such vector (a total of n − 2) will need to be stored per ancestral node. For each alignment site i, i = 1, ..., s, an ancestral probability vector needs to hold the probabilities of observing an A, C, G, or T. Thus, under double precision arithmetics and for DNA data, a total of (n − 2) · 8 · 4 · s bytes is required for the simplest evolutionary models. If the standard (and biologically meaningful) Γ model of rate heterogeneity [110] with 4 discrete rates is deployed, this number increases to (n − 2) · 8 · 16 · s bytes, since we need to store 16 probabilities for each alignment site. Further, if protein data with 20 instead of 4 states is used, the memory requirements of the ancestral probability vectors under a Γ model increase to (n − 2) · 8 · 80 · s bytes.

Figure 3.1 depicts a standard memory layout. For each alignment site, the ancestral probability vector contains 2 rate blocks, which correspond to a two-rate discretization. In real-world applications, four rates are typically used (see Section 2.5). Each rate block contains 4 entries (1 per state, denoted by L_A, L_C, L_G, and L_T).

Thus, total memory requirements can be easily estimated in advance, since they are strongly dominated by the size of the ancestral probability vectors (APVs). The APV memory requirements are directly proportional to the alignment size.
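The estimate above can be computed directly; as a sketch (apvBytes() is an illustrative helper, not RAxML code), assuming double precision (8 bytes per entry):

```c
/* Estimate the total APV memory in bytes for an unrooted tree with
 * n taxa and s alignment sites, 'states' likelihood entries per site
 * (4 for DNA, 20 for protein), and 'rates' discrete rate categories
 * (1 for the simplest models, 4 for the standard Γ model). */
unsigned long long apvBytes(unsigned long long n, unsigned long long s,
                            unsigned long long states,
                            unsigned long long rates)
{
  /* (n - 2) ancestral vectors, 8 bytes per double */
  return (n - 2) * 8 * states * rates * s;
}
```

With the whole-genome example from the text, apvBytes(1481, 20000000, 4, 1) yields 946,560,000,000 bytes, matching the roughly 1 TB quoted above for the simple model; under a four-rate Γ model, the estimate grows fourfold.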

3.2 Out-of-Core

The content of this Section has been derived from the following peer-reviewed publication: F. Izquierdo-Carrasco and A. Stamatakis. Computing the phylogenetic likelihood function out-of-core. In Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), 2011 IEEE International Symposium on, pages 444–451, 2011.

In this Section, we study the performance of an out-of-core execution of the phylogenetic likelihood function by means of a proof-of-concept implementation in standard RAxML.

In cases where the data structures for computing a function do not fit into the available Random Access Memory (RAM), out-of-core execution may be significantly more efficient than relying on paging by the Operating System (OS). This is usually the case because application-specific knowledge and 'page' granularity can be deployed to exchange data between RAM and disk more efficiently. Since the PLF is characterized by predictable linear data accesses to vectors, as we show, PLF-based programs are well-suited to the out-of-core paradigm.

We find that, in our proof-of-concept implementation, RAM miss rates are below 10%, even if only 5% of the required data structures are held in RAM. Moreover, we show that our proof-of-concept implementation runs more than 5 times faster than the respective standard implementation when paging is used. The method is generally applicable. The source code and results are available at http://www.exelixis-lab.org/web/personal_page/izquierdo/ooc.tar.gz

The remainder of this Section is organized as follows: In Subsection 3.2.1, we briefly discuss related work in the general area of out-of-core computing, as well as some applications to phylogenetic reconstruction using Neighbor Joining, which exhibits substantially different data access patterns. In Subsection 3.2.2, we initially outline the necessary underlying principles of the PLF that allow for out-of-core execution and describe the optimization of the proof-of-concept implementation in RAxML. In the subsequent Subsection 3.2.3, we describe the experimental setup and provide respective performance results.


3.2.1 Related Work

The I/O bandwidth and communication between internal memory (RAM) and slower external devices (disks) can represent a bottleneck in large-scale applications. Methods that are specifically designed to minimize the I/O overhead via explicit, application-specific control of data placement and movement (e.g., between disk and RAM) are termed out-of-core algorithms (frequently also called External-Memory (EM) algorithms; we will henceforth use the terms as synonyms).

EM data structures and algorithms have already been deployed for a wide range of problems in scientific computing, including sorting, matrix multiplication, FFT computation, computational geometry, and text processing. Vitter provides a detailed review of work on EM algorithms in [105].

With respect to applications in phylogenetics, EM algorithms have so far only been applied to Neighbor-Joining (NJ) algorithms [79, 106]. NJ is fundamentally different from PLF-based analyses (see Chapter 2). NJ is a clustering technique that relies on updating an O(n^2) distance matrix that comprises the pairwise distances of the n organisms for which an evolutionary tree is reconstructed. The size of this matrix becomes prohibitive for datasets with several thousand organisms. The data access pattern is dominated by searching for the minimum in the O(n^2) distance matrix at each step of the tree building process.

3.2.2 Computing the PLF Out-of-Core

Data Access Patterns

The reason why the PLF is particularly well-suited for out-of-core execution is the regularity and predictability of its data access patterns. The likelihood on the tree is computed according to the Felsenstein pruning algorithm [23], which we described in Section 2.5. Given an arbitrary rooting of the tree, one conducts a post-order tree traversal to compute the likelihood. The s values in an ancestral probability vector are computed recursively by combining the values in the respective left and right child vectors. Thus, such a tree traversal to compute the likelihood proceeds from the tips towards the virtual root of the tree. The ancestral probability vectors are accessed linearly, and the ancestral probability vector access pattern is given by the tree topology.

In general terms, good I/O performance in EM algorithms is achieved by modifying an application such as to achieve a high degree of data locality. For the PLF, the straight-forward candidate data structures (the 'pages') for transfers between disk and RAM are the ancestral probability vectors. In


RAxML, they are stored linearly in memory. The replacement strategy simply needs to exploit the access pattern induced by the tree. In current ML search algorithms, the tree is not entirely re-traversed for every candidate tree that is analyzed. A large number of the topological changes that are evaluated are local changes. Thus, only a small fraction of ancestral probability vectors needs to be accessed and updated for each tree that is analyzed.

The typical minimum HW block size is 512 bytes, although some operating systems use a larger block size of 8 KB [105]. For the PLF, this granularity is not an issue, since a representative ancestral probability vector is significantly larger than the block size. For instance, consider a typical, but still comparatively small, MSA of DNA data with length s = 10,000 and n = 10,000 species. To compute the PLF, 9,998 ancestral probability vectors need to be stored. Each of these vectors is stored contiguously in memory and has a size of 10,000 · 8 · 4 · 4 = 1,280,000 bytes (1.28 MB) under double precision arithmetics for a Γ model of rate heterogeneity with 4 discrete rates.

Thus, we can simply set the logical block size b (i.e., the 'page' size) to the size of an individual ancestral probability vector. Therefore, I/O operations can be amortized, that is, each read or write to disk will access a contiguous number of bytes on disk that is significantly larger than the minimum block size.

Basic Implementation

In our basic implementation, we store all ancestral probability vectors that do not fit into RAM contiguously in a single binary file (see Figure 3.2). We deploy an appropriate data structure to keep track of which vectors are currently available in RAM and which vectors are stored on disk.

Let n be the number of ancestral probability vectors and m the number of vectors in memory, where m < n (i.e., n − m vectors will be stored on disk). Due to the way the likelihood is computed by combining the values of two child vectors for obtaining an ancestral probability vector (see Section 2.5 and Figure 2.7), we must ensure that m ≥ 3. In other words, the space allocated in RAM must be large enough to hold at least three ancestral probability vectors. To allow for easy assessment of various values of m with respect to n, we use a parameter f that determines which fraction of the required RAM will be made available, that is, m := f · n. Now, let w be the number of bytes required for storing an ancestral probability vector, that is, our proof-of-concept implementation will only allocate m · w bytes. We henceforth use the term slot (one may think of this as a page) to refer to a segment of available memory (one ancestral probability vector) with a size of


Figure 3.2: Each numbered ancestral node in the tree (left part of the figure) needs to hold a vector of doubles. These vectors are either stored on disk (red) or in memory (green). A data structure (middle part of the figure) is used to look up the memory location (right part of the figure) of each ancestral vector. Each slot has the size of an ancestral probability vector.

w bytes. To orchestrate data transfers and to control the location of vectors, we

use the following C data structure (unnecessary details omitted).

typedef struct
{
  FILE **fs;                      /* backing binary file(s)                  */
  unsigned int num_files;
  size_t slot_width;              /* size w of one slot (one APV) in bytes   */
  unsigned int num_slots;         /* m: number of slots in RAM               */
  unsigned int *item_in_mem;      /* which vector occupies which slot        */
  unsigned int num_items;         /* n: total number of ancestral vectors    */
  unsigned int disk_items;        /* vectors currently residing on disk      */
  nodemap *itemVector;            /* per-node location (RAM address/offset)  */
  double *tempslot;               /* scratch buffer for swapping             */
  boolean skipReads;              /* read skipping enabled?                  */
  replacement_strategy strategy;  /* RANDOM, LRU, LFU, or TOPOLOGICAL        */
} map;

The array itemVector is a list of n pointers (see Figure 3.2) that is indexed using the (unique) ancestral node numbers. Each entry of type nodemap keeps track of whether the respective ancestral vector is stored on disk or in memory. More specifically, if the ancestral vector resides in RAM,


we maintain its starting address in memory. If the vector resides on disk, we maintain an offset value for its starting position in the binary file.

We also maintain an array of n integers (item_in_mem) that keeps track of which vector is currently stored in which memory slot.

Replacement Strategies

In the standard implementation of RAxML, where n = m, all vectors are stored in RAM. Whenever an ancestral probability vector is required, we simply start reading data from the respective starting address for node i that is stored in the address vector xVector[i]. In the out-of-core version, we use a function getxVector(i), which returns the memory address of the requested ancestral vector for node i. The entire out-of-core functionality is transparently encapsulated in getxVector(i). The function will initially check whether the requested vector is already mapped to RAM. If not, it will determine an adequate memory slot (according to some replacement strategy, see below) and swap the currently stored vector in that slot with the requested vector in the binary file.

A constraint for the replacement strategy is that we must ensure that the 3 vectors required to compute the values at the current ancestral node i (vector i and the vectors of the two child nodes j and k) reside in memory. Using the example in Figure 3.2, let us assume that we are traversing the tree and that the virtual root is located in the direction of the dotted line. When we need to compute the values for vector 3 (i), vectors 1 (j) and 2 (k) need to reside in memory. Calling getxVector(1) will return the address of memory slot #1 (where the vector for node 1 is already available). However, vector 2 may be located on disk. A call to getxVector(2) will thus require a swap of vectors, but slots #1 and #3 must be excluded (pinned to memory) from the possible swap candidates, since the values of vectors 1 and 3 are required for the immediate computation. For this reason, getxVector() has two additional parameters that specify which inner nodes must be pinned to memory and cannot be swapped out.
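The control flow of getxVector() can be summarized in pseudo-code (a sketch of the behavior described above, not the literal implementation):

```
double *getxVector(map *m, int i, int pinned1, int pinned2)
{
    if(vector i already resides in a RAM slot)
        return address of that slot;

    slot = chooseVictim(m, pinned1, pinned2);   /* replacement strategy;     */
                                                /* never evicts pinned slots */
    write vector currently held in slot back to its file offset;
    read vector i from its file offset into slot;   /* skipped if the first  */
                                                    /* access is write-only  */
    update itemVector and item_in_mem accordingly;
    return address of slot;
}
```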

Even if we optimize data transfer performance between disk and RAM at a technical level, accessing data on disk has a significantly higher latency than accessing data in RAM. Therefore, it is important to minimize the number of I/O accesses (the number of ancestral probability vector swaps).

As already mentioned, a vector is always either stored on disk or in RAM. Whenever RAxML tries to access a vector that resides on disk via getxVector(), we need to choose a vector that resides in RAM and then swap the two vectors. Therefore, we require a replacement strategy that is conceptually similar to cache line replacement or page swap strategies.


To conduct a thorough assessment, we have implemented and tested the following four replacement strategies:

Random The vector to be replaced is chosen at random with minimum overhead (one call to a random number generator).

LRU Least Recently Used. The vector to be replaced is the one that has been accessed the furthest back in time. This requires a list of n time-stamps as well as an O(log(n)) binary search for the oldest time-stamp. Note the use of n rather than m, because we only search among the time-stamps of vectors that are currently in RAM.

LFU Least Frequently Used. The vector to be replaced is the one that has been accessed the least number of times. This requires maintaining a list of m entries containing the access frequencies and an O(log(n)) binary search for the smallest value.

Topological The vector to be replaced is the one at the most distant node (in terms of the number of nodes along the path in the tree) from the node/vector currently being requested. The node distance between a pair of nodes in a binary tree is defined as the number of nodes along the unique path that connects them.

The rationale for the topological replacement strategy is that, due to the locality of the tree search and the computations, we expect the most distant node/vector to be accessed again the furthest ahead in the future.
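For illustration, the LRU victim choice can be implemented as a simple linear scan over per-slot access time-stamps (a sketch with hypothetical names, using a linear scan rather than the binary search mentioned above; the two slots pinned by getxVector() are skipped):

```c
/* Pick the RAM slot whose vector was accessed longest ago, skipping the
 * two pinned slots needed by the current APV update. Time-stamps grow
 * monotonically with each access; pass -1 for an unused pin. */
int lruVictim(const unsigned long *lastAccess, int numSlots,
              int pinned1, int pinned2)
{
  int victim = -1;

  for(int s = 0; s < numSlots; s++)
  {
    if(s == pinned1 || s == pinned2)
      continue;  /* pinned: required for the immediate computation */
    if(victim < 0 || lastAccess[s] < lastAccess[victim])
      victim = s;  /* oldest time-stamp seen so far */
  }
  return victim;
}
```

The same scan structure applies to the LFU strategy with access counters in place of time-stamps.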

Reducing the Number of Swaps

So far, our EM algorithm has been integrated into RAxML in a fully transparent way. We have shown that it is possible to modify the program, and any PLF-based program for that matter, such that the complexity is entirely encapsulated by a function call that returns the address of an ancestral probability vector. However, it is possible to further reduce the number of I/O operations by exploiting some implementation-specific characteristics of RAxML that can also be deployed analogously in other PLF-based implementations.

For each global or local traversal (re-computation of a part or of all ancestral probability vectors) of the tree, we know a priori (based on the tree structure) that some of the vectors that will be swapped into RAM will be completely overwritten, that is, they will be used in write-only mode during the first access. Thus, whenever we swap in a vector from file, of which


we know that it will initially be used for writing, we can omit reading its current contents from file. We denote this technique as read skipping and implement it as follows: We introduce a flag in our EM bookkeeping data structure that indicates whether read skipping can be applied or not, that is, whether a vector will be written during the first access. We instrument the search algorithm such that, when the global or local tree traversal order is determined (this is done prior to the actual likelihood computations), the flag is set appropriately.

3.2.3 Experimental Setup & Results

Evaluation of replacement strategies

Assessing the correctness of our implementation is straight-forward, since this only requires comparing the log likelihood scores obtained from tree searches using the standard RAxML version and the out-of-core version. Hence, we initially focus on analyzing the performance (vector miss rate) of our replacement strategies and the impact of the read skipping technique as a function of f (the proportion of vectors residing in RAM).

To evaluate the replacement strategies, we used 2 real-world biological datasets with 1,288 species (DNA data, MSA length s := 1,200 sites/columns) and 1,908 species (DNA data, MSA length s := 1,424 sites/columns), respectively. Tree searches were executed under the Γ model of rate heterogeneity with four discrete rates. We used the SSE3-based [7] sequential version of RAxML v7.2.8.

For each of the four replacement strategies, we performed three different runs of the out-of-core version with f := 0.25 (25% of vectors memory-mapped), f := 0.50 (50% of vectors memory-mapped), and f := 0.75 (75% of vectors memory-mapped).

Given a fixed starting tree, RAxML is deterministic, that is, regardless of f and the selected replacement strategy, the resulting tree (and log likelihood score) must always be identical to the tree returned by the standard RAxML implementation. For each run, we verified that the standard version and the out-of-core version produced exactly the same results.

This part of our computational experiments was conducted on a single core of an unloaded multi-core system (Intel Xeon E5540 CPU running at 2.53 GHz with 36 GB of RAM). On this system, the amount of available RAM was sufficient to hold all vectors in memory for the two test datasets, both for the standard implementation and when using memory-mapped I/O for the out-of-core version.

We found that, with the exception of the LFU strategy, even mapping


[Figure: miss rate (% of total vector requests) plotted against the fraction of RAM allocated (0.2–0.8) for the Topological, LFU, RAND, and LRU strategies.]

Figure 3.3: Vector miss rates for different replacement strategies using a dataset with 1,288 species. We allocated 25%, 50%, and 75% of the required memory for storing ancestral probability vectors in RAM. Every time a vector is not available in RAM, we count a miss.

only 25% of the probability vectors to memory results in miss rates under 10%. As expected, miss rates converge to zero as the fraction of available RAM is increased (see Figure 3.3). In the trivial case (f := 1.0), the miss rate is zero, since all vectors reside in RAM. The Random, LRU, and Topological strategies perform almost equally well. Thus, one would prefer the random or LRU strategy over the topological strategy, because the latter incurs a larger computational overhead for determining the replacement candidate.

In Figure 3.4, we show the positive effect of the read skipping technique for analogous runs on the same dataset with 1288 species. Here, we quantify the fraction of ancestral vector reads from disk that are actually carried out per ancestral vector access. Note that, without the read skipping technique, this fraction would be identical to the miss rate in Figure 3.3. Thus, by deploying this technique, we can omit more than 50% of all vector read operations.


[Plot: read rate (% of total vector requests) vs. fraction of RAM allocated, for the Topological, LFU, RAND, and LRU strategies.]

Figure 3.4: Effect of read skipping: we count the fraction of vector accesses for which a vector actually needs to be read from file, using the same parameters as in Figure 3.3. Without the read skipping strategy, the read rate is equivalent to the miss rate.

[Plot: miss rate (% of total vector requests) vs. fraction f of RAM allocated, for the RAND strategy.]

Figure 3.5: Miss rates for several runs using the random replacement strategy on the dataset with 1288 species. The fraction f of memory-mapped ancestral probability vectors was divided by two for each run.


Miss Rates as a Function of f

To more thoroughly assess the impact of f on the miss rate, we conducted additional experiments for different values of f using a random replacement strategy on the test dataset with 1288 taxa. The fraction f was successively divided by two. The smallest f value we tested corresponds to only five ancestral probability vector slots in RAM.

Figure 3.5 depicts the increase in miss rates for decreasing f. Even the most extreme case with only five RAM slots still exhibits a comparatively low miss rate of 20%. This is due to the good vector usage locality of the RAxML algorithm. One reason for this behavior, which is inherent to ML programs, is the branch length optimization procedure. Branch length optimization is typically implemented via a Newton-Raphson procedure that iterates over a single branch of the tree. Thus, only memory accesses to the same two vectors (located at either end of the branch) are required in this phase, which accounts for approximately 20-30% of overall execution time. In RAxML, access locality is also achieved by, in most cases, only re-optimizing three branch lengths after a change of the tree topology during the tree search (lazy SPR technique; see [87]).

Real Test Case

Finally, we conducted realistic tests by analyzing large datasets on a system with only 2 GB of RAM. Here, we compare execution times of the standard algorithm (using paging) with the out-of-core performance.

For these tests, we generated large simulated DNA datasets using INDELible [26]. We intentionally generated datasets whose memory requirements for storing ancestral probability vectors (see Figure 3.6) exceed the main memory available on the test system (Intel i5 running at 2.53 GHz with 2 GB RAM, configured with 36 GB swap space). To achieve this, we deployed INDELible to simulate DNA data on a tree with 8192 species and varying alignment lengths s. We chose values of s such that the simulated datasets had (ancestral probability) memory requirements ranging between 1 GB and 32 GB. Because of the prohibitive execution times for full tree searches on such large datasets, we did not execute the standard RAxML search algorithm. Instead, we executed a subset of the PLF as implemented in RAxML (-f z option in the modified code) by simply reading in a given, fixed tree topology and computing five full tree traversals (recomputing all ancestral probability vectors in the tree five times) according to the Felsenstein pruning algorithm. This represents a worst-case analysis, since full tree traversals exhibit the smallest degree of vector locality. Full tree traversals are required


[Plot: elapsed execution time (s) vs. RAM required for inner nodes (GB), for the Standard, ooc-1GB-LRU, and ooc-1GB-RAND runs.]

Figure 3.6: Execution times for five full tree traversals on a tree with 8192 sequences and variable dataset width s, for the standard RAxML version (using paging) and the out-of-core version.

to optimize likelihood model parameters such as the α shape parameter of the Γ model of rate heterogeneity.

The out-of-core runs were invoked with the -L 1,000,000,000 flag to force the program to use less than 1 GB of RAM for storing ancestral probability vectors.

Figure 3.6 demonstrates that the execution times of the out-of-core implementation scale well with dataset size (ancestral probability vector space requirements). As expected, the standard approach is faster for datasets that still fit into RAM, while it is more than five times slower for the largest dataset with 32 GB. For ancestral vector memory footprints between 2 GB and 32 GB, the run-time performance gap between the out-of-core implementation and the standard version increases, since the standard algorithm starts using virtual memory (e.g., the number of page faults increases from 346,861 for 2 GB to 902,489 for 5 GB). Note that, for the out-of-core runs, we only use 1 GB of RAM, that is, better performance can be achieved by slightly increasing the value for -L. Thus, under memory limitations and given the runtimes in Figure 3.6, using the out-of-core approach is significantly faster than the standard approach for computing the likelihood on large datasets.


3.3 Recomputation of Ancestral Vectors

The content of this Section has been derived from the following peer-reviewed publication:

F. Izquierdo-Carrasco, J. Gagneur, and A. Stamatakis. Trading running time for memory in phylogenetic likelihood computations. In J. Schier, C. M. B. A. Correia, A. L. N. Fred, and H. Gamboa, editors, BIOINFORMATICS, pages 86–95. SciTePress, 2012.

J. Gagneur is the author of the proof in Figure 3.3.1 on the theoretical minimum memory requirements to compute the PLF.

Given enough execution time and disk space, the out-of-core version can be deployed to essentially infer trees on datasets of arbitrary size. However, the increase in execution times of the out-of-core implementation is still large. While we are confident that the miss rate and miss penalty can be further improved by low-level I/O performance optimization, we have explored the overhead of recomputing ancestral probability vectors instead of storing them on disk.

In this Section we introduce the implementation of a versatile concept to trade memory for additional computations in the likelihood function, which exhibits a surprisingly small impact on overall execution times. When trading 50% of the required RAM for additional computations, the average execution time increase because of additional computations amounts to only 15%. This slowdown (due to the additional recomputations) is comparable to the standard deviation of the running time of a set of independent searches, that is, it is not significant from the user perspective. We demonstrate that, theoretically, for a phylogeny with n species only log(n) + 2 vectors of memory space are required for computing the likelihood of a tree. We also assess practical lower bounds.

The term time-memory trade-off engineering refers to situations/approaches where memory requirements/utilization are reduced at the cost of additional computations, that is, at the expense of increased run time. This trade-off engineering approach has been applied in diverse fields such as language recognition [18], cryptography [5], and packet scheduling [109]. We are not aware of any applications of or experiments with time-memory trade-off engineering approaches in the field of computational phylogenetics.

We next describe the underlying idea for the memory saving approach, demonstrate that only log(n) + 2 vectors are required to compute the PLF,


and introduce two vector replacement strategies. In Subsection 3.3.2 we describe the experimental setup and provide corresponding results.

3.3.1 Recomputation of Ancestral Probability Vectors

Specific PLF Memory Requirements

We have previously discussed the memory requirements for the PLF computation in Section 3.1. Next, we will discuss how these requirements can be lowered with the recomputation trade-off. Let us consider again Equation 2.10 for the PLF computation of one entry in the ancestral probability vector of node p at site i.

$$L_A^{(p)}(i) = \left( \sum_{S=A}^{T} P_{AS}(b_{lp}) \, L_S^{(l)}(i) \right) \left( \sum_{S=A}^{T} P_{AS}(b_{rp}) \, L_S^{(r)}(i) \right)$$

In order to compute L_A^(p)(i), we do not necessarily need to immediately have available in memory the vectors L_S^(l)(i) and L_S^(r)(i), since they can be obtained by a recursive descent (a post-order subtree traversal) into the subtrees rooted at l and r using the above equation. In other words, if we do not have enough memory available, we can recursively re-compute the values of L^(l) and L^(r). Note that the recursion terminates when we reach the tips (leaves) of the tree and that memory requirements for storing tip vectors are negligible compared to ancestral vectors.

If not all ancestral probability vectors fit in RAM, the vectors required for the operation at hand can still be obtained by conducting some additional computations, that is, by applying Equation 2.10. We observe that, when both L_S^(l)(i) and L_S^(r)(i) have been used (consumed) for calculating L_A^(p)(i), the two child vectors are not required any more. That is, those two vectors can be omitted/dropped from RAM or be overwritten in RAM to save space. Therefore, the likelihood of a tree can be computed without storing all n − 2 ancestral vectors. Instead, a smaller amount of space for only storing x < n − 2 vectors can be used. Those x vectors can then be dynamically assigned to a subset (varying over time) of the n − 2 inner nodes. This gives rise to the following question: Given a MSA with n taxa, what is the minimum number of inner vectors x_min that must reside in memory to be able to compute the likelihood on any unrooted binary tree topology with n taxa?

Due to the post-order traversal of the binary tree topology, some ancestral vectors need to be stored as intermediate results. Consider an ancestral node p that has two subtrees rooted at child nodes l and r with identical subtree


[Diagram: balanced subtree with nodes numbered 1–7 and the virtual root above node 1.]

Figure 3.7: A balanced subtree, where the vector of inner node 1 is oriented in the direction of the virtual root. In order to compute the likelihood of vector 1, the bold vectors must be held in memory. The transparent vectors may reside in memory but are not required to.

depth. When the computations on the first subtree for obtaining L^(l) have been completed, this vector needs to be kept in memory until the ancestral probability vector L^(r) has been calculated, to be able to compute L^(p).

In the following Figure 3.3.1, we demonstrate that the minimum number of ancestral vectors x_min(n) that must reside in RAM to be able to compute the likelihood of a tree with n taxa is log2(n) + 2. For obtaining this lower memory bound, we consider the worst-case tree shape, that is, a perfectly balanced unrooted binary tree with the virtual root placed into the innermost branch. As we show in Figure 3.3.1, this represents the worst case because all subtrees from the virtual root have the same length.

Figure 3.7, where we first descended into the left subtree, depicts this worst-case scenario for n = 4. The probability vectors will be written in the following order: 3, 4, 2, 6, 7, 5, 1. Figure 3.7 shows the number of vectors required in memory (log2(4) + 2 = 4) at the point of time where vector 5 needs to be computed. At any other point of time during the post-order traversal, holding 3 vectors in RAM is sufficient to successfully proceed with the computations.

Theoretical Minimum Memory Requirements

The underlying idea of our approach is to reduce the total number of ancestral probability vectors residing in RAM at any given point of time. Let PLF be the likelihood function (see Equation 2.10), which, for the sake of simplifying our notation, can be computed at a tip (essentially at zero computational cost) or at an inner node, given the ancestral probability vectors of its two child nodes. The PLF always returns an ancestral probability vector


as result.

As a pre-processing step, we compute the number of descendants (size of the respective subtree) for each inner node. This can be implemented via a single post-order tree traversal. The memory required for storing the number of descendants as integers at each node is negligible compared to the required probability vector space. Given this information, the binary tree data structure can be re-ordered such that, for every node p, the left child node l contains the larger number of descendants and the right child node r contains the smaller number of descendants. We can now compute the PLF of the tree by invoking the following recursive procedure f at the virtual root p of the tree:

proc f(p) ≡
    if isALeaf(p) then return(PLF(p)) fi;
    vl := f(p.l);          (Step 1)
    vr := f(p.r);          (Step 2)
    return(PLF(vl, vr))    (Step 3)

During this computation, the maximum number of probability vectors x_min(n) that need to reside in memory for any tree with n leaves is log2(n) + 2. This upper bound is required if and only if the binary tree is balanced.

Proof: We demonstrate the above theorem by induction. x_min(1) = 1, since for a tree with a single node, only the sequence data (a single probability vector) need to be stored. Now assume that x_min(n − 1) = log2(n − 1) + 2 and consider a tree with n leaves. We execute all steps of f(p), where p is the root, and keep track of the number of probability vectors that are stored simultaneously in RAM:

• Step 1: Computing f(p.l) requires storing at most x_min(n − 1) probability vectors simultaneously, which is strictly less than log2(n) + 2. Once f(p.l) has been computed, only the result vector is stored (i.e., only one single probability vector).

• Step 2: Because the subtree size of the right descendant of the root p.r must be ≤ n/2, this computation needs to simultaneously store at most log2(n/2) + 2 = log2(n) + 1 probability vectors. Here, we need to add the number of probability vectors that must be maintained in RAM after completion of Step 1. Thus, overall, Step 2 needs to simultaneously hold at most log2(n) + 2 probability vectors in RAM.


Once the results of f(p.l) and f(p.r) have been computed, we are left with two probability vectors vl, vr that need to reside in memory.

• Step 3: This step requires three probability vectors to be stored at the same time, namely vl, vr, and the vector to store the result of PLF(vl, vr). Note that, to obtain the overall likelihood of the tree (a single numerical value), we need to apply a function g() to the single result vector returned by f(), that is, g(f(p)). Function g() simply uses the data in the result vector to calculate the likelihood score over all root vector entries.

Hence the peak in memory usage is reached during Step 2 and its upper bound is log2(n) + 2. Moreover, this upper bound is reached if and only if the number of descendants in the respective left and right subtrees is identical for all nodes, that is, for a balanced tree.

Basic Implementation

While holding log2(n) + 2 vectors in RAM provides sufficient space for computing the PLF, this will induce a significant run-time increase (due to recomputations) compared to holding n vectors in memory. In practice, we need to analyze and determine a reasonable run-time versus RAM trade-off as well as an appropriate vector replacement/overwriting strategy. For instance, in the RAxML or MrBayes PLF implementations, some ancestral vectors (depending on their location in the tree) can be recomputed faster than others. In particular, cases where the left and/or right child vectors are tip sequences can be handled more efficiently. For instance, an observed nucleotide A at a tip sequence corresponds to a simple probability vector of the form [P(A) := 1.0, P(C) := 0.0, P(G) := 0.0, P(T) := 0.0]. This property of tip vectors can be used for saving computations in Equation 2.10.

Devising an appropriate strategy (see Section 3.3.1) for deciding which vectors shall remain in RAM and which can be discarded (because they can be recomputed at a lower computational cost) can have a substantial impact on the induced run-time overhead when holding, for instance, x := n/2 vectors in RAM. In the following, we will outline how to compute the PLF and conduct SPR-based tree searches with x < n vectors in RAM.

Let n be the number of species, n − 2 the number of ancestral probability vectors, and x the number of available slots in memory, where log2(n) + 2 ≤ x < n (i.e., n − x vectors are not stored, but recomputed on demand). Let w be the number of bytes required for storing an ancestral probability vector (all vectors have the same size). Our implementation will only allocate x · w


bytes, rather than n · w bytes. We henceforth use the term slot to denote a RAM segment of w bytes that can hold an ancestral probability vector.

We define the following C data structure (details omitted) to keep track of the vector-to-slot mapping of all ancestral vectors and for implementing replacement strategies:

typedef struct
{
  int            numVectors;
  size_t         width;
  double       **tmpvectors;
  int           *iVector;
  int           *iNode;
  int           *unpinnable;
  boolean        allSlotsBusy;
  unpin_strategy strategy;
} recompVectors;

The array tmpvectors is a list of pointers to a set of slots (i.e., starting addresses of allocated RAM) of size numVectors (x). The array iVector has length x and is indexed by the slot id. Each entry holds the node id of the ancestral vector that is currently stored in the indexed slot. If the slot is free, the value is set to a dedicated SLOT_UNUSED code. The array iNode has length n − 2 and is indexed using the unique node ids of all ancestral vectors in the tree. When the corresponding vector resides in RAM, its array entry holds the corresponding slot id. If the vector does not reside in RAM, the array entry is set to the special code NODE_UNPINNED. Henceforth, we denote the availability/unavailability of an ancestral vector in RAM as pinned/unpinned. The array unpinnable tracks which slots are available for unpinning, that is, which slots that currently hold an ancestral vector can be overwritten, if required.

The set of ancestral vectors that are stored in the memory slots changes dynamically during the computation of the PLF (i.e., during full tree traversals and tree searches). The pattern of dynamic change in the slot vector also depends on the selected recomputation/replacement strategy. For each PLF invocation, be it for evaluating an SPR move or completely re-traversing the tree, the above data structure is updated accordingly to ensure consistency.

Whenever we need to compute a local tree traversal (following the application of an SPR move) to compute the likelihood of the altered tree topology, we initially just compute the traversal order, which is part of the standard RAxML implementation. The traversal order is essentially a list that stores in which order ancestral probability vectors need to be computed. In other


words, the traversal descriptor describes the partial or full post-order tree traversal required to correctly compute the likelihood of a tree. For using x < n vectors, we introduce a so-called traversal order check, which extends the traversal steps (the traversal list) that assume that all n vectors reside in RAM. By this traversal order extension, we guarantee that all missing vectors (not residing in RAM) will be recomputed as needed. The effect of reducing the number of vectors residing in RAM is that traversal lists become longer, that is, more nodes are visited and thereby run time increases. When the traversal is initiated, all vectors in the traversal list that already reside in RAM (they are pinned to a slot) are protected (marked as not unpinnable) such that they will not be overwritten by intermediate vectors of the recomputation steps.

If an ancestral vector needs to be written/stored by the traversal, there are three cases:

1. The vector resides in a slot (already in memory). We can just read and/or write to this slot.

2. The vector is not pinned, but there exists a free slot, which is thenpinned to this vector.

3. The vector is not pinned and there is no free slot available. A residing vector must be replaced and the corresponding slot needs to be pinned to the required vector.

Since the traversal only visits each vector at most once, the corresponding children of this vector can be unpinned once it has been written to memory. Instead of unpinning them directly, they are merely marked for unpinning. The real overwrite only takes place if the slot is selected by the replacement strategy for storing another vector. Otherwise, the slot will store the values of the current vector for as long as possible for potential future re-use.

Replacement Strategies

In analogy to the replacement strategies discussed for the out-of-core implementation in Section 3.2, there are numerous approaches for deciding which memory slot should be overwritten by a new ancestral vector that does not currently reside in RAM. We implement and analyze the following two replacement strategies.

Random A random slot not flagged as pinned is selected. The random strategy is a naïve approach with minimal overhead and is used as a baseline for performance comparisons.


MRC Minimum Recomputation Cost. The slot with minimum subtree size (see below) that is not flagged as pinned is selected.

The MRC strategy entails a slight overhead for keeping track of which vectors will be most expensive to recompute and should therefore be kept in RAM for as long as possible. Consider an unrooted binary tree T with n tips. Each inner node i in an unrooted binary tree with n taxa can be regarded as a trifurcation that defines three subtrees T_i,a, T_i,b, and T_i,c corresponding to the three outgoing branches a, b, and c. Given a subtree T_i,x, we define its subtree size sts(T_i,x) as the number of tips beneath inner node i in the direction of the outgoing branch x. Thus, sts(T_i,a) + sts(T_i,b) + sts(T_i,c) = n holds for any inner node i in an unrooted binary tree with n taxa/tips. When conducting likelihood computations, the tree is always rooted by a virtual root. Hence, if the virtual root is located in the direction of branch c, the relevant subtree size with respect to the recomputation cost at an inner node i is sts(T_i^rooted) := sts(T_i,a) + sts(T_i,b). We use this rooted subtree size sts(T_i^rooted) to determine the recomputation cost for each ancestral vector i, given a virtual rooting of the tree. In particular, the case sts(T_i^rooted) = 2 (both children are tips) is particularly cheap to recompute, since tip vectors always reside in RAM and recomputing ancestral vector i is cheap (see above). In a perfectly balanced tree with the root placed in the innermost branch, half of the inner vectors have subtree size sts(T_i^rooted) = 2.

As already mentioned, during a partial or full tree traversal for computing the likelihood, all inner nodes (vectors) involved are oriented in a given direction toward the virtual root. Evidently, the subtree sizes will change when the topology is altered and will need to be updated accordingly.

In order to account for this, we keep an array of subtree sizes, that is, for each inner node we store a subtree size value. Whenever the topology changes, a local traversal descriptor is created. This local traversal descriptor starts at the new virtual root and recursively includes the inner nodes whose orientation has changed after the given topology change. Since this exactly corresponds to the set of nodes whose subtree size values must be updated, the subtree size array can be updated via the same recursive descent procedure.

Implementation Details

In the following, we discuss some important details of the recomputation process.

Largest subtree first The standard implementation of the PLF, where all ancestral vectors are available in memory, computes the PLF by conducting


[Diagram: unbalanced subtree with nodes numbered 1–5 and the virtual root above node 1.]

Figure 3.8: An unbalanced subtree, where the vector of inner node 1 is oriented in the direction of the virtual root. Bold rectangles represent vectors that must be held in memory if we first descend into the left subtree and node 4 is being written.

a post-order traversal from an arbitrary rooting of the tree. Thus, an ancestral probability vector can be computed once the respective left and right child vectors have been computed. In the standard RAxML implementation, the traversal always recursively descends into the left subtree first. The (arbitrary) choice whether to descend into the left or right subtree first does not affect performance (nor the results) when all ancestral vectors reside in RAM.

However, when not all vectors reside in RAM, the choice whether to descend into the left or right subtree first does matter. This is particularly important if we use the minimum setting x := log2(n) + 2, since otherwise we may encounter situations where not enough slots are available (see Figure 3.3.1).

Suppose that, as in the standard implementation, we always descend into the left subtree first. In the example shown in Figure 3.8, the left subtree is significantly smaller than the right subtree. We would first descend into the left subtree, which consists of a single inner node. The child ancestral vector corresponding to node 2 must be pinned to its slot. Thereafter, we descend into the right subtree, writing and pinning nodes 3, 4, 5 (always assuming that we descend into the left, smaller, subtree of each node first). While we keep descending into the right subtree, the ancestral vector corresponding to node 2 remains pinned, because it represents an intermediate result.

In contrast, if we initially descend into the right subtree (which is always


larger in the example), there is no need to store intermediate results of the left subtree (node 2). Also, nodes 4 and 5 can be immediately unpinned as soon as their parent nodes have been computed. Thus, by inverting the descent order such as to always descend into the larger subtree first (as required by our proof), we minimize the amount of time for which intermediate vectors must remain pinned. Note that, when two subtrees have the same size, it does not matter into which subtree we descend first.

If we descend into the smaller subtrees first, more vectors will need to remain pinned for a longer time. This would also reduce the effective size of the set of inexpensive-to-recompute vectors that shall preferentially be overwritten, because more vectors that are cheap to recompute need to remain pinned since they hold intermediate results. In this scenario, more expensive-to-recompute vectors will need to be overwritten in memory and dropped from RAM.

Determining the appropriate descent order (largest subtree first) is trivial and induces a low computational overhead. When the traversal list is computed, we simply need to compare the subtree sizes of child nodes and make sure to always descend into the largest subtree first.

Priority List For this additional optimization, we exploit a property of the SPR move technique. When a candidate subtree is pruned (see Figure 2.9) from the currently best tree, it will be re-inserted into several branches of the tree from which it was removed, to evaluate the likelihood of different placements of the candidate subtree within this tree.

In the course of those insertions, the subtree itself will not be changed and only the ancestral vector at the root of the subtree will need to be accessed for computations. Hence, we maintain a dedicated list of pruned candidate subtree nodes (an unpinning priority list) that can be preferentially unpinned. Because of the design of lazy SPR moves in RAxML (similar SPR flavors are used in GARLI and PHYML), those nodes (corresponding to ancestral vectors) will not be accessed while the candidate subtree is inserted into different branches of the tree. Once this priority list is exhausted, the standard MRC recomputation strategy is applied.

Full Traversals Full post-order traversals of the tree are required during certain phases of typical phylogenetic inference programs, for instance when optimizing global maximum likelihood model parameters (e.g., the α shape parameter of the Γ distribution or the rates in a GTR matrix) on the entire tree. Full tree traversals are also important for the post-analysis of fixed tree topologies, for instance, to estimate species divergence times. Full tree


traversals represent a particular case because every inner vector of the tree needs to be visited and computed. Hence, the number of vectors that need to be computed under our memory reduction approach is exactly identical to the number of vectors that need to be computed under the standard implementation. Thus, there is no need for additional computations, while a large amount of memory can be saved. While full tree traversals do not dominate run times in standard tree search algorithms, they can dominate execution times in other phylogenetic downstream analyses, such as, e.g., branch length optimization or estimation of species divergence times and lineage-specific substitution rates [31].

3.3.2 Experimental Setup and Results

We have implemented the techniques described in Subsection 3.3.1 in RAxML-Light v1.0.4 [91]. RAxML-Light is a stripped-down, dedicated version of RAxML intended for large-scale phylogenetic inferences on supercomputers. It implements a light-weight software-based checkpointing mechanism and offers fine-grained PThreads and MPI parallelizations of the PLF. It has been used to compute a tree on a dense simulated MSA with 1481 taxa and 20,000,000 sites that required 1TB of RAM and ran in parallel with MPI on 672 cores. RAxML-Light will no longer be maintained, but the ExaML code [90], which uses a more efficient MPI parallelization strategy, can be used for phylogenetic inferences on supercomputers. The recomputation implementation used for this evaluation is a fork of RAxML-Light and is available at https://github.com/fizquierdo/vectorRecomp.

Evaluation of recomputation strategies

The recomputation algorithm yields exactly the same log likelihood scores as the standard algorithm. Thus, for validating the correctness of our implementation, it is sufficient to verify that the resulting trees and log likelihood scores of an ML tree search with the standard and the recomputation implementation are identical. The increase in total run time depends on the number x of inner vectors that are held in memory and on the chosen unpinning strategy (MRC versus RANDOM). We used INDELible [26] to generate simulated MSAs of 1500, 3000, and 5000 species. All experiments described in this Section were conducted on these three datasets. For this purely computational work, it does not matter whether simulated or real data are used.

Initially, we ran the Parsimonator program [85] to generate 10 distinct randomized stepwise addition order parsimony starting trees for each MSA. For each starting tree, we then executed a standard ML tree search with RAxML-Light (sequential version 1.0.4 with SSE3 intrinsics) and ML tree searches with the recomputation version for the two replacement strategies (MRC and RANDOM) and five different RAM reduction factors (-x and -r options, respectively). In the recomputation version used for the evaluation, both the Largest subtree first and the Priority List optimizations were activated. All experiments were executed on a 48-core AMD system with 256GB RAM. For all runs, RAM memory usage was measured every 600 seconds with top.

Figure 3.9: Different replacement strategies: average running time in seconds versus the reduction factor (percentage of inner vectors held in RAM) for 10 runs on the 1500-taxon dataset, comparing the standard implementation (std) against the RANDOM (recomRANDOM) and MRC (recomSLOT_MIN_COST) recomputation strategies. The dataset was run with RAM allocations of 10%, 25%, 50%, 75%, and 90% of the total memory required for storing all probability vectors. Run times are averaged across 10 searches with distinct starting trees.

Figure 3.10 shows the corresponding decrease in RAM usage. The values in Figure 3.10 correspond to the maximum observed RAM usage.

Figure 3.9 depicts the run time increase as a function of the available space for storing ancestral probability vectors.

Clearly, the MRC strategy outperforms the RANDOM strategy, and the induced run-time overhead is surprisingly small (approximately 40%), even when the available RAM space is reduced to only 10%. This slowdown is acceptable, considering that, instead of analyzing a large dataset on a machine with 256GB RAM, a significantly smaller and less expensive system with, for instance, 32GB RAM will be sufficient.


Figure 3.10: Overall RAM usage: average memory used versus the reduction factor (percentage of inner vectors held in RAM), measured for the 1500-taxon dataset across 10 trees, comparing the standard implementation (std) against the MRC (recomSLOT_MIN_COST) recomputation strategy when allocating only 10%, 25%, 50%, 75%, and 90% of the required ancestral probability vectors.

3.3.3 Evaluation of traversal overhead

In order to evaluate the overhead of the extended (larger) tree traversals due to the required additional ancestral probability vector computations, we modified the source code to count the number of ancestral vector computations. We distinguish between three cases with different recomputation costs (see Subsection 3.3.1). For each case, there exists a dedicated PLF implementation in RAxML.

Tip/Tip Both child nodes are tips.

Tip/Vector One child node is a tip, and the other is an ancestral vector (subtree).

Vector/Vector Both child nodes are ancestral vectors.

Table 3.1 shows a dramatic, yet desired, increase in the number of Tip/Tip vector computations for the MRC strategy. The number of ancestral node computations of the slowest type (Vector/Vector), however, increases by only 0.16% compared to the standard implementation.

56

Page 67: sco.h-its.orgsco.h-its.org/exelixis/pubs/dissFernando.pdf · Die dritte Methode wird SEV (Subtree Equality Vectors) genannt und reduziert den Speicherbedarf nahezu proportional zum

Strategy       Tip/Tip      Tip/Vector   Vector/Vector   Total         Runtime (s)
Standard       11,443,484   57,884,490   76,325,233      145,653,207   5678
MRC (0.5)      20,368,957   61,224,562   76,444,874      158,038,393   6453
Random (0.5)   37,778,575   85,303,730   104,398,910     227,481,215   7999

Table 3.1: Frequency of ancestral vector cases for the standard implementation and the recomputation strategies (50% of ancestral vectors allocated).

Dataset   Standard   R:=0.1   R:=0.9
500       0.122      0.121    0.130
1500      0.430      0.430    0.434
5000      2.402      2.412    2.438

Table 3.2: Average run times in seconds for 20 full traversals, averaged across 5 runs.

Evaluation of full tree traversals

We created a simple test case that parses an input tree and conducts 20 full tree traversals on the given tree. We used the aforementioned starting trees and the 500, 1500, and 5000 taxon datasets. Each run was repeated 5 times and we averaged the running times. All runs returned exactly the same likelihood scores.

Table 3.2 indicates that, even for very small R values (the fraction of inner vectors allocated in memory), the run time overhead is negligible compared to the standard implementation.

The recomputation approach permits saving memory at the cost of some additional computations, independently of the data patterns in the alignment. However, certain data patterns in the MSAs can be exploited to save memory without any computational overhead. In fact, as we show in the next Section, if large blocks of gaps are present in the alignment, it is possible to reduce both the number of computations and the memory requirements.


3.4 Subtree Equality Vectors

3.4.1 Gappy Subtree Equality Vectors

The content of this Section has been derived from the following peer-reviewed publication: F. Izquierdo-Carrasco, S. Smith, and A. Stamatakis. Algorithms, data structures, and numerics for likelihood-based phylogenetic inference of huge trees. BMC Bioinformatics, 12(1):470, 2011. S. Smith generated the 38K and 56K datasets described in Subsection 3.4.2.

The concept of Subtree Equality Vectors (SEVs) to accelerate likelihood computations by reducing the number of required floating point operations was first introduced in 2002 [96]. Conceptually similar approaches were presented in 2004 [64] and 2010 [100].

The underlying idea is based on the following observation: Given two identical alignment sites i and j in the MSA that evolve under the same evolutionary model (GTR parameters, α shape parameter of the Γ function, etc.), and for which a joint branch length has been estimated, their per-site log likelihoods LnL(i) and LnL(j) will be identical. Hence, to save computations, one can compress the identical sites into a single site pattern and assign a respective site pattern count (weight) to this site pattern. Thus, for two identical sites i and j, we can compute the per-site log likelihood as 2 · LnL(i). This global compression of alignments (executed prior to conducting likelihood computations) is implemented in all current likelihood-based codes. This basic idea of site compression can be extended to the subtree level, by using SEVs for instance, to save additional computations. Let us consider again Equation 2.10, which computes the ancestral probability entry to observe state A at node p and site i.

L_A^(p)(i) = ( Σ_{S=A}^{T} P_AS(b_lp) · L_S^(l)(i) ) × ( Σ_{S=A}^{T} P_AS(b_rp) · L_S^(r)(i) )    (2.10)

We can now consider again Equation 3.1, which computes the ancestral probability entry to observe state A at node p and site j.

L_A^(p)(j) = ( Σ_{S=A}^{T} P_AS(b_lp) · L_S^(l)(j) ) × ( Σ_{S=A}^{T} P_AS(b_rp) · L_S^(r)(j) )    (3.1)


Figure 3.11: All alignment columns are different. Therefore, at the root, each site will contribute a different site likelihood. However, at the internal nodes, the first and the last site share identical patterns (green) at the node-level sub-alignment. Since the APVs for the first and the last site will be identical at the internal nodes, there is potential for data (APV entries) re-use.

Both entries only differ in L_S^(l) and L_S^(r), which by recursion depend only on the likelihood entries at the tip vectors. We can think of node p as the root node of a subtree (clade), whose likelihood is computed with data from a sub-alignment. If the data patterns at the tips of the subtree rooted at p are identical for sites i and j, then L_A^(p)(i) and L_A^(p)(j) must have the same value. Therefore, the likelihood entry only needs to be computed once. The same argument can be made for states C, G, and T.
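The global site compression described above can be sketched as follows. This is a minimal illustration with hypothetical helper names, not the compression code of any specific likelihood implementation:

```python
# Minimal sketch: identical alignment columns are collapsed into one
# site pattern with a weight, and the total log likelihood becomes a
# weighted sum over the unique patterns.
from collections import Counter

def compress_columns(alignment):
    """alignment: list of equal-length sequences; returns (patterns, weights)."""
    columns = list(zip(*alignment))               # one tuple per site
    counts = Counter(columns)
    patterns = list(counts)
    weights = [counts[p] for p in patterns]
    return patterns, weights

aln = ["ACCA",
       "ACGA",
       "ATTA"]                                    # sites 1 and 4 are identical
patterns, weights = compress_columns(aln)
assert len(patterns) == 3 and sum(weights) == 4   # 4 sites -> 3 unique patterns

# Total log likelihood is then a weighted sum over unique patterns, e.g.:
# lnl = sum(w * per_site_lnl(p) for p, w in zip(patterns, weights))
```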

The key technical challenge with this approach is that it requires a large amount of bookkeeping to keep track of identical subtree site patterns (for details see [96]). However, this approach can be simplified by considering only the subtree-level presence of gaps in the alignment.

Gaps and undetermined characters are mathematically equivalent in the standard ML framework. Since structured patches of missing data dominate many phylogenomic datasets (the amount of missing data can be up to 90%) [74, 92], we only track subtree site patterns that entirely consist of gaps/undetermined characters (e.g., we are only interested in subtree sites of the form ---- in a subtree of size 4). We assume that undetermined characters (?, N) have been translated into gap symbols (-). Thereby, we avoid the more complex task (see [64] and [100]) of tracking all identical subtree site patterns (e.g., detecting all sites of the form ACCT in a subtree of size 4). We show an example in Figure 3.11.

This restriction simplifies the required bookkeeping procedure and data structure significantly, because we only need to know whether a subtree site consists entirely of gaps or not. Thus, given an alignment with s sites, it suffices to enhance the data structures for storing tips and inner nodes by a simple bit vector with s bits. If all-gap sites are represented by 1 and non-gap sites by 0, we simply need to execute a bit-wise AND on the respective bit vectors of the child nodes l and r, in conjunction with the tree traversal for computing the likelihood, to determine the all-gap sites at the ancestral node p (see Figure 3.12). We can then use this bit vector at p to determine whether we need to compute the likelihood entries of the ancestral probability vector at a site i.
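The bit-vector propagation can be sketched as follows, assuming an illustrative representation of one Python integer per node as a bit vector (bit i set when site i is all-gap within the node's subtree):

```python
# Sketch of the all-gap bit-vector propagation: the ancestral vector is
# the bit-wise AND of the child vectors, and marked sites are skipped.

def gap_bits(sequence):
    """Bit i is set when site i of this tip sequence is a gap ('-')."""
    bits = 0
    for i, ch in enumerate(sequence):
        if ch == "-":
            bits |= 1 << i
    return bits

def ancestral_gap_bits(left_bits, right_bits):
    # A site is all-gap at p iff it is all-gap in both child subtrees.
    return left_bits & right_bits

l = gap_bits("A-T-T")   # gaps at site indices 1 and 3
r = gap_bits("G--TT")   # gaps at site indices 1 and 2
p = ancestral_gap_bits(l, r)
assert p == 0b00010     # only site index 1 is all-gap in both subtrees

# The likelihood computation then skips exactly the marked sites at p:
skipped = [i for i in range(5) if (p >> i) & 1]
assert skipped == [1]
```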

We have implemented this method for DNA and protein data under the Γ model of rate heterogeneity in RAxML v728 (alpha), available at http://www.exelixis-lab.org/web/personal_page/izquierdo/sev.tar.gz. Evidently, the efficiency of this approach depends on the proportion and the distribution of gaps in the input alignment. Since areas of missing data are typically well-structured in current phylogenomic datasets, this approach is expected to work well with this kind of input data.

While SEVs can speed up ancestral probability vector computations, they slightly slow down the branch length optimization and the likelihood computation (at the root) because of the memory accesses to the bit vectors.

Saving Memory with SEVs

SEVs, as implemented here, can also be deployed to reduce memory requirements. As mentioned above, if we encounter an all-gap site at an ancestral node p, we completely omit its computation. In order to accomplish this, we need to maintain only one additional ancestral probability vector site that contains the signal for all-gap sites. Consider an ancestral probability vector where 50% of the entries in the all-gap site bit vector are set to 1, that is, where we only need to compute 50% of the ancestral probability vector entries with respect to the total alignment length.

We can observe that, in addition to saving 50% of the computations required for this ancestral probability vector, we can also save 50% of the memory space required for storing it (see Figure 3.13). Thus, the memory requirements for each ancestral node can be determined on-the-fly as we traverse the tree, by subtracting the number of entries that are set to 1 in the bit vector from the input alignment length. Remember that the bit vectors we deploy are always as long as the input alignment.
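Under the bit-vector representation, the on-the-fly size computation amounts to subtracting a population count. This is an illustrative sketch (hypothetical helper name, not the actual RAxML code):

```python
# Sketch: the number of APV entries to allocate at node p equals the
# alignment length minus the number of all-gap sites, i.e., minus the
# popcount of p's all-gap bit vector.

def apv_sites_required(alignment_length, all_gap_bits):
    return alignment_length - bin(all_gap_bits).count("1")

s = 8                          # alignment length in sites
p_bits = 0b00111100            # 4 of the 8 sites are all-gap at node p
assert apv_sites_required(s, p_bits) == 4   # only half of the APV is stored
```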

The key technical problem that arises is that the required ancestral probability vector lengths at inner nodes change dynamically when the tree topology changes, or even when the tree is merely re-rooted. Given a rooting of the tree, one may think of this as ancestral probability vectors becoming longer as one approaches the root of the tree.

Figure 3.12: Using Subtree Equality Vectors to save computations for all-gap alignment sites in subtrees. (In the figure, the two child subtrees l and r, with 3-taxon and 4-taxon sub-alignments, carry the all-gap bit vectors 01110 and 01010; toward the virtual root, the ancestral node p obtains the bit vector 01010 := 01110 AND 01010, and the data in the corresponding all-gap areas does not need to be computed.)

At present, we have implemented this by dynamically freeing and allocating memory (using free() and malloc()) at each ancestral node. The reallocation only takes place when the all-gap bit-vector count (number of bits set to 1) corresponding to the required ancestral probability vector does not equal the all-gap bit-vector count of the current ancestral probability vector at an ancestral node. Due to the large number of calls to malloc()

and free(), performance can suffer when the PThreads version is deployed, due to thread contention. The technique may not scale well with the number of threads, because the reallocation of shared memory by one thread can block other threads. One possible solution is the deployment of lock-less memory allocators [34], where the synchronization overhead disappears because each thread works with local allocations and memory.

Note that the concepts presented here can also be applied to phylogenomic datasets with joint branch length estimates across partitions, while the conceptually different ideas presented in [92] can only be applied to partitioned phylogenomic datasets with per-partition branch length estimates.

3.4.2 Generation of Biological Test Datasets

To assess our methods, we used two large multi-gene datasets of plants. The first dataset comprises 37,831 taxa and 9,028 sites and was obtained as follows: We assembled a DNA sequence matrix of 37,831 seed plant taxa consisting of the chloroplast regions atpB (1,861 taxa, >2.6 Megabases [Mb]), matK (10,886 taxa, >14.3 Mb), rbcL (7,319 taxa, >9.7 Mb), trnK (4,163 taxa, >7.5 Mb), and trnL-trnF (17,618 taxa, >13 Mb), and the internal transcribed spacer (ITS; 26,038 taxa, >14.3 Mb), using the Phylogeny Assembly with Databases tool (PHLAWD [81]). All sequence alignments were conducted using MAFFT version 6 [42] for initial alignments and MUSCLE for profile alignments [19]. Alignment matrix manipulations were performed with Phyutility [82].

Figure 3.13: Using Subtree Equality Vectors to save computations and memory for all-gap alignment sites in subtrees. (As in Figure 3.12, the ancestral node p obtains the all-gap bit vector 01010 := 01110 AND 01010; here, additionally, the ancestral probability entries for all-gap columns do not need to be stored.)

The second dataset comprises 55,593 taxa and 9,853 sites and was obtained using the same pipeline as described above. The gene regions used were atpB (2,346 taxa, >3.6 Megabases [Mb]), matK (14,848 taxa, >33.6 Mb), rbcL (10,269 taxa, >14.9 Mb), trnK (5,859 taxa, >15.3 Mb), and trnL-trnF (25,346 taxa, >30.1 Mb), and the internal transcribed spacer (ITS; 37,492 taxa, >30.9 Mb).

For ease of reference, we henceforth denote the 37,831-taxon dataset as 38K and the 55,593-taxon dataset as 56K. Trees computed on the 56K dataset have been published [80].

3.4.3 SEV Performance

We used the 38K and 56K datasets to test the memory savings and speedups achieved by applying the adapted SEV technique to phylogenomic datasets. The gappyness (percentage of missing data in the alignment) is 81.53% for 38K and 83.40% for 56K, respectively.

For each alignment, we computed a parsimony starting tree with RAxML, which was then evaluated (model parameter and branch length optimization without tree search, RAxML -f e option) with RAxML under the GTR+Γ model using the SEV reimplementation (with and without memory saving) and using the standard likelihood implementation.

                 SEVs       SEVs with memory saving   Standard
Runtime (s)      4125.1     4116.8                    6541.1
Memory (GB)      42         15                        41
Log likelihood   -5528590   -5528590                  -5528590

Table 3.3: Execution times and memory requirements for optimizing model parameters and branch lengths on the 38K dataset using SEVs, SEVs with memory saving, and the standard likelihood implementation.

                 SEVs       SEVs with memory saving   Standard
Runtime (s)      7145.2     8095.1                    11181.4
Memory (GB)      67         29                        67
Log likelihood   -7059556   -7059556                  -7059556

Table 3.4: Execution times and memory requirements for optimizing model parameters and branch lengths on the 56K dataset using SEVs, SEVs with memory saving, and the standard likelihood implementation.

The standard implementation required 41GB of memory on the 38K dataset and 66GB of memory on the 56K dataset. The SEV technique with the memory saving option enabled (-U option, available as of RAxML v7.2.7) reduced the memory footprints under Γ to 14GB (38K) and 21GB (56K), respectively. The log likelihood scores for all three implementations were exactly identical. As shown in Table 3.3 and Table 3.4, the run times of the SEV-based versions are 25-40% faster than the standard implementation. The run time differences between the SEV-based implementation with memory saving enabled and the plain SEV version without memory saving can be attributed to differences in memory access patterns. While both versions conduct the same number of computations, the memory-saving version needs to make millions of calls to OS routines (free() and malloc()), while the plain SEV version exhibits a higher memory footprint and thereby, potentially, a higher cache miss rate.

3.5 Summary

In this chapter, we have described the memory requirements of phylogenetic inference under ML. Accommodating such huge memory requirements is necessary for analyzing phylogenomic datasets. We have presented several techniques to effectively reduce these requirements. This will allow for computing the PLF on larger datasets than ever before, especially when the limiting factor is RAM.

We have presented the first implementation of the PLF that relies on out-of-core execution. We find that, given the locality of ancestral probability vector access patterns, miss rates are very low, even if the amount of available RAM is limited to a small fraction of the actually required memory. We demonstrate that our out-of-core implementation performs substantially better than the standard implementation that relies on paging.

We have presented a generic strategy for the exact computation of log likelihood scores and ML tree searches with significantly reduced memory requirements. The additional computational cost incurred by the larger number of required ancestral vector recomputations is comparatively low when an appropriate vector replacement strategy is deployed. The memory versus additional-computations trade-off can be adapted by the user via a command line switch to fit the available computational resources. We also show that the minimum number of ancestral probability vectors that need to be kept in memory for computing the PLF on a tree with n taxa is log2(n) + 2. This result may be particularly interesting for designing equally fast, but highly memory-efficient phylogenetic post-analysis tools that rely on full tree traversals.

The MRC strategy has been integrated into RAxML-Light [91] and into the Phylogenetic Likelihood Library [27]. This will allow inferring trees and computing likelihood scores on a single multi-core system for datasets that previously would have required a supercomputer. The recomputation strategy clearly outperforms the out-of-core approach. For this reason, the latter is not currently maintained in any production codebase.

We have adapted and re-implemented the SEV technique for phylogenomic datasets with missing data and enhanced it with a novel memory-saving option. This technique can reduce execution times by 25-40% on sufficiently 'gappy' datasets by omitting redundant computations. More importantly, the revised SEV technique can be deployed to achieve significant memory savings that are almost proportional to the amount of missing data in the test datasets. This technique has already been fully integrated into the standard RAxML distribution and is also available in ExaML [90].

On the software engineering side, it is important to note that the recomputation and the SEV memory-saving techniques are orthogonal. The recomputation technique reduces the number of APVs stored in memory; the SEV technique reduces the size of individual APVs. Therefore, both techniques can be applied simultaneously, and combined with potential future techniques such as the use of lossless compression algorithms for storing ancestral probability vectors.


Chapter 4

The Backbone Algorithm

The content of this chapter has been derived from the following peer-reviewed publication: F. Izquierdo-Carrasco, S. Smith, and A. Stamatakis. Algorithms, data structures, and numerics for likelihood-based phylogenetic inference of huge trees. BMC Bioinformatics, 12(1):470, 2011. S. Smith generated the 38K and 56K datasets described in Subsection 3.4.2.

This chapter introduces an algorithm for space-constrained phylogenetic tree searches, which we denote as the backbone algorithm. The notion behind this search algorithm is that, in large alignments, the sequences at the tips are close enough to each other to be grouped correctly in the early phases of the tree search. Therefore, it should be possible to devise search heuristics that spend more time evaluating topological moves on the inner parts of the tree, which are more likely to yield likelihood increases. Furthermore, by exploring a smaller tree space, the convergence criteria described in Section 2.6 should be met earlier. The backbone algorithm, which we describe in detail in this chapter, can reduce the time required for tree inferences by more than 50% while yielding equally 'good' trees in the statistical sense.

4.1 Constraining tree search to a backbone tree

PhyNav (Phylogenetic Navigator [104]) first introduced the idea of reducing the dimension of the tree for the search phase. PhyNav reduces the number of sequences in the MSA to a subset that contains the most relevant phylogenetic information. The subtree based on the reduced alignment is faster to search on and optimize, and can be used as a scaffold to construct the full, comprehensive tree.

Similarly, we explored the idea of identifying closely related taxa and collapsing them into a subtree root, which can be considered as a single virtual tip, also called a super-taxon. The tree induced by all virtual tips and the remaining original (non-collapsed) tips is called the backbone tree. This technique can also potentially reduce the memory footprint of the tree, because the number of internal nodes that are actually updated is reduced.

By clustering taxa into virtual tips, the dimension of the tree can be reduced, allowing for a tree search on the backbone tree that is induced by the virtual tips. Given a perfectly balanced tree, a reduction of 50% corresponds to collapsing each pair of taxa into a single virtual tip. Thus, for each pair of tips, there is one less inner node to operate on, and the total number of inner nodes is halved. We henceforth denote such a reduction of the tree dimension as the reduction factor.

Once an appropriate backbone tree has been computed, topological moves can be applied, following a standard tree search strategy such as the one described in Section 2.6. The difference with respect to the standard search is that the topological moves are restricted to the backbone tree.

In other words, the virtual tips are interpreted as tips of the backbone tree on which we conduct the tree search. In our RAxML proof-of-concept implementation [38], which deploys SPR moves, only subtrees that form part of the backbone tree are pruned, and they are exclusively re-inserted into branches that lie within the backbone.

Despite restricting the tree search to the backbone, in our setup we always compute the log likelihood score of the comprehensive tree during the backbone tree search. The log likelihood score of the comprehensive tree can be easily computed, because virtual tips are ancestral probability vectors that summarize the signal of the (excluded) real tips situated below the respective virtual tip.

As discussed in Section 3.1, the memory requirements for storing the ancestral probability vector representing a virtual tip are significantly higher than for storing a terminal taxon.

4.2 Algorithm

The main user parameter for the backbone tree algorithm is the desired tree size reduction factor R, where 0.0 < R < 1.0. This parameter controls how much the backbone tree shall be reduced in size. Ideally, the backbone tree will then comprise n·R − 2 ancestral nodes and n·R backbone tips. Backbone tips may either be virtual tips (ancestral nodes) or real tips. Evidently, choosing very low values of R may significantly impact the quality of the inference, especially if very short branches are present. According to our experiments (see Section 4.3), R > 0.25 is a conservative minimum value for R.

The algorithm comprises the following computational steps:

Starting tree A reasonable starting tree is given by the user or generated by a fast method, for instance, parsimony.

Tip Clustering We assign the n tips of the starting tree to n · R clusters, that is, c = ⌈n · R⌉, where c is the total number of clusters obtained. For each tip, we store a cluster identifier that denotes the cluster to which the tip has been assigned.

Backbone delimitation Determine and mark the ancestral probability vectors that will become virtual tips in the backbone. We traverse the tree and use the cluster identifiers to label all ancestral nodes as residing inside, outside, or on the boundary of the backbone.

Backbone Tree Search Conduct a standard tree search (see Section 2.6), restricting topological moves to the backbone tree.

To also achieve a memory footprint reduction, one could write a multiple sequence alignment for the backbone to file, which would partially consist of nucleotide sequences and partially of ancestral probability vectors representing virtual tips. This reduced alignment can then be parsed together with the backbone tree for conducting a tree search. This approach, however, has not been implemented.

We next outline the four algorithmic steps in detail.

4.2.1 Starting tree

To build a backbone, we assume that a reasonable (i.e., non-random), fully resolved comprehensive tree T comprising all taxa (e.g., obtained via parsimony using TNT [28], Parsimonator [85], or RAxML [87]) is computed or provided as input. This comprehensive n-taxon tree has n tips and n − 2 ancestral (inner) nodes.

Once the parsimony tree is available, the branch lengths and the statistical model parameters are optimized under Maximum Likelihood.


4.2.2 Tip Clustering

We present an approach based on computing a distance matrix and applying average-linkage hierarchical clustering [55].

Hierarchical clustering with k small distance matrices

In standard hierarchical clustering, the first step consists of calculating a distance matrix that contains the pair-wise distances between all items (tips) to be clustered. However, given a comprehensive tree T with ML estimates of branch lengths, we can directly obtain this distance matrix from the tree by calculating the pair-wise patristic distances. The patristic distance between two taxa is the sum of branch lengths on the path in the tree connecting the two taxa. Thus, the distance matrix is symmetric. The space requirements for storing such a patristic distance matrix are in O(n²), which can become prohibitive for large alignments with n ≥ 30,000 tips. We observe that the pair-wise patristic distances between most tips will be very large and hence these tips will be assigned to different clusters anyway. Therefore, to save memory, one can decompose this process into computing several smaller, partial distance matrices, since the comprehensive starting tree already induces a hierarchical clustering structure. If we subdivide the problem into computing k partial pair-wise distance matrices, and each partial matrix i defines c_i clusters, we need to ensure that c = Σ_{i=1}^{k} c_i = ⌈n · R⌉, so that the total number of desired clusters still corresponds to the specified reduction factor R. To achieve this, we do not fix the number of partial matrices k a priori. Instead, we define a threshold value m that represents an upper bound for the number of tips contained in each partial matrix. Let n be the total number of taxa and n_i the number of tips in partial matrix i, where n_i ≤ m and n = Σ_{i=1}^{k} n_i. From each partial matrix, we extract a number of clusters proportional to its size, that is, c_i ∝ c · n_i / n.
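Before turning to the partial matrices, the patristic-distance computation itself can be sketched as follows (a pure-Python illustration, not the thesis implementation; the tuple-based tree encoding and the example branch lengths are ours):

```python
# Sketch: derive the patristic distance matrix directly from a tree with
# branch lengths, in a single post-order pass. Tree nodes are
# (name, branch_length, children) tuples (a made-up encoding).

def patristic_matrix(tree):
    """Return {(a, b): distance} for all tip pairs, with a < b."""
    dist = {}

    def walk(node):
        name, _, children = node
        if not children:                      # tip: distance 0 to itself
            return {name: 0.0}
        below = []
        for child in children:
            _, cb, _ = child
            tips = walk(child)
            # extend every tip path by the child's incident branch length
            below.append({t: d + cb for t, d in tips.items()})
        # tips from different child subtrees meet at this node
        for i in range(len(below)):
            for j in range(i + 1, len(below)):
                for a, da in below[i].items():
                    for b, db in below[j].items():
                        dist[tuple(sorted((a, b)))] = da + db
        merged = {}
        for m in below:
            merged.update(m)
        return merged

    walk(tree)
    return dist

# ((A:0.1, B:0.2):0.3, C:0.4) -> d(A,B)=0.3, d(A,C)=0.8, d(B,C)=0.9
tree = ("root", 0.0, [("x", 0.3, [("A", 0.1, []), ("B", 0.2, [])]),
                      ("C", 0.4, [])])
d = patristic_matrix(tree)
```

Because each pair is visited once, building all k partial matrices costs no more than building the full matrix; only the storage is reduced.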

This is implemented as follows: First, we find a set of subtrees such that (i) each subtree has as many tips as possible and at most m tips, and (ii) each tip is included in exactly one subtree, that is, all tree tips are included in one subtree and no tip forms part of more than one subtree.

For each such subtree i, we then build a (partial) patristic distance matrix for all n_i subtree tips. Thereafter, we cluster them by generating a hierarchical cluster tree. This hierarchical tree may be cut at different levels to generate a varying number of subtree tip groups. We choose to cut the tree such that it generates c_i clusters of subtree tips. If required, the number of desired clusters c_i will have been iteratively adjusted beforehand (for further details see below) for each partial matrix i to ensure that c = Σ_{i=1}^{k} c_i.


For example, consider a 40,000-taxon tree, a reduction factor of 0.5 (corresponding to 20,000 clusters), and a partial matrix threshold of 32,000 taxa. In this example, we may obtain distance matrices of 10,000 and 30,000 taxa, respectively. Then, we will need to extract 15,000 clusters from the 30,000-taxon distance matrix and 5,000 clusters from the 10,000-taxon distance matrix.

To be able to apply this method and compute partial patristic distance matrices, we need to devise an algorithm that selects subtrees from the comprehensive phylogeny such that they contain at most m taxa. We start by selecting the innermost node of the tree, as described below.

Computation of the innermost node

Each inner node i of an unrooted binary tree T is a trifurcation that defines three subtrees T_{i,a}, T_{i,b} and T_{i,c}. We define the subtree length stl(T_i) as the sum of all branch lengths in subtree T_i. Thus, stl(T_{i,a}) + stl(T_{i,b}) + stl(T_{i,c}) = stl(T) holds for any inner node i, where T is the comprehensive tree.

In our current default implementation, we select the innermost node j that maximizes stl(T) − max{stl(T_{j,a}), stl(T_{j,b}), stl(T_{j,c})}. An alternative criterion for selecting the innermost node is to determine the node that minimizes the variance of the three outgoing subtree lengths. Other possible criteria that are not based on subtree lengths may be defined, for instance, finding the node that minimizes the variance of the node-to-tip distances or finding the node with the highest minimum node-to-tip distance. The node-to-tip distance is defined as the sum of branch lengths on the path in the tree leading from an ancestral node to a tip.
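The default criterion can be sketched as follows (an illustrative pure-Python version, not the RAxML code; the toy adjacency list, node names, and branch lengths are ours):

```python
# Sketch of the default innermost-node criterion: pick the inner node j
# that maximizes stl(T) - max of its three directed subtree lengths.
from functools import lru_cache

adj = {  # unrooted tree: inner nodes u, v; tips A..D (made-up example)
    "u": [("A", 1.0), ("B", 1.0), ("v", 0.5)],
    "v": [("C", 4.0), ("D", 4.0), ("u", 0.5)],
    "A": [("u", 1.0)], "B": [("u", 1.0)],
    "C": [("v", 4.0)], "D": [("v", 4.0)],
}

@lru_cache(maxsize=None)
def stl(frm, to, blen):
    """Total branch length of the subtree seen from `frm` through `to`."""
    return blen + sum(stl(to, w, b) for w, b in adj[to] if w != frm)

def innermost_node():
    # every branch appears twice in the adjacency list, hence / 2
    total = sum(b for u in adj for _, b in adj[u]) / 2
    best, best_score = None, float("-inf")
    for j, nbrs in adj.items():
        if len(nbrs) != 3:          # only trifurcating inner nodes qualify
            continue
        score = total - max(stl(j, w, b) for w, b in nbrs)
        if score > best_score:
            best, best_score = j, score
    return best
```

In the toy tree above, the long branches cluster around node v, so v wins: its largest outgoing subtree (length 4.0) is smaller than u's (length 8.5).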

The tree diameter is defined as the number of nodes on the longest path between any pair of tips. We define the node distance between two nodes as the number of nodes on the path that connects them. The normalized node distance is defined as the raw node distance between alternative innermost nodes, divided by the tree diameter. We conducted an empirical assessment (based on our collection of large real-world trees) to compare the node distances between the innermost node selected by our default criterion and the innermost nodes selected by these alternative approaches. Table 4.1 shows the results for a real-world dataset comprising 55,593 taxa (see Subsection 3.4.2). The respective innermost nodes (as identified by the alternative criteria) are either identical or close neighbors, that is, located in the same region of the tree.


Alternative criterion                    Node distance   Normalized node distance
Lowest subtree length variance           0.00            0.00
Lowest node-to-tip distance variance     2.50            0.01
Maximal minimum node-to-tip distance     11.90           0.06

Table 4.1: Node distances from the default criterion (55,593 taxa, averaged across 10 trees). Several criteria can be employed to select the innermost node of an unrooted tree. The alternative innermost nodes are located close to each other with respect to the tree diameter (186 nodes).

Tip clusters are delimited from k subtree roots

Once we have determined the innermost node, we conduct a depth-first tree traversal starting at this node and descending into each of the three subtrees. The depth-first traversal terminates when a subtree root is encountered that comprises ≤ m tips. All subtree roots that contain ≤ m tips are stored in a list for further processing. Thus, when the depth-first traversal has been completed, this list of k subtree roots can be used to generate the k partial patristic distance matrices of maximum size O(m²). In our implementation, we set m := 1024. This is a suitable value, since the time required for processing partial distance matrices of this size remains in the order of seconds, which is negligible in comparison with the total runtime of the tree search algorithm.
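The traversal can be sketched as follows (an illustrative pure-Python version under our own made-up tree encoding; the real implementation works on RAxML's internal tree structure):

```python
# Sketch: starting from the innermost node, descend depth-first and
# record the first subtree root on each path whose subtree holds at
# most m tips. Nodes are (name, children) tuples; naive tip counting
# is O(n^2) here, which is fine for a sketch.

def collect_subtree_roots(tree, m):
    """tree: (name, children) rooted at the innermost node; returns root names."""
    roots = []

    def count_tips(node):
        _, children = node
        if not children:
            return 1
        return sum(count_tips(c) for c in children)

    def descend(node):
        if count_tips(node) <= m:
            roots.append(node[0])   # stop: this subtree becomes one partial matrix
            return
        for child in node[1]:
            descend(child)

    # descend into each subtree hanging off the innermost node
    for sub in tree[1]:
        descend(sub)
    return roots

# m = 2: s1 (2 tips) and s3 (2 tips) are kept whole; s2 (3 tips) is split.
tree = ("inner", [("s1", [("A", []), ("B", [])]),
                  ("s2", [("C", []), ("s3", [("D", []), ("E", [])])]),
                  ("F", [])])
roots = collect_subtree_roots(tree, 2)
```

Since every tip lies under exactly one recorded root, the resulting subtrees partition the tip set, as required by condition (ii) above.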

For each subtree root (i.e., each partial patristic distance matrix), we determine how many clusters should approximately be extracted, via c_i := ⌊1/2 + c · n_i / n⌋, where i is the subtree number, n_i is the number of tips in the respective subtree, c = ⌈n · R⌉ is the total number of desired clusters, and c_i is the number of clusters for subtree i. In general, c ≠ Σ_{i=1}^{k} c_i. The overhead (or deficit) of clusters, given by Δc = c − Σ_{i=1}^{k} c_i, is then proportionally distributed across all remaining partial matrices. This process is repeated iteratively until no overhead (or deficit) remains. In each iteration, we reassign c_i := ⌈c_i + Δc · c_i / c⌉ until c = Σ_{i=1}^{k} c_i. Then, for each subtree i = 1 . . . k we proceed as follows:

1. For all tips in subtree i, calculate the patristic distances to all other tips in this subtree and save them in the respective distance matrix.

2. Apply pairwise average clustering to generate a hierarchical tree of joins from the distance matrix.

3. Cut the tree such that exactly c_i clusters are generated.


4. Add those clusters to a global list of clusters. Maintain a list that keeps track of which cluster each tip belongs to.
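The iterative adjustment of the per-matrix cluster counts described above can be sketched as follows (a hedged pure-Python sketch; the function name is ours, and we use `round` in the reassignment step plus a unit-step fallback to guarantee termination, whereas the thesis formulates the reassignment with a ceiling):

```python
# Sketch: start from c_i = floor(1/2 + c*n_i/n), then spread the
# surplus/deficit Δc = c - Σ c_i proportionally until Σ c_i = c.
import math

def balance_cluster_counts(ns, c, max_rounds=100):
    """ns: tips per partial matrix; c: total desired number of clusters."""
    n = sum(ns)
    cs = [math.floor(0.5 + c * ni / n) for ni in ns]
    for _ in range(max_rounds):
        delta = c - sum(cs)
        if delta == 0:
            return cs
        # proportional reassignment; rounding may leave a residue,
        # hence the iteration
        cs = [max(1, ci + round(delta * ci / c)) for ci in cs]
    # settle any last unit differences one matrix at a time
    i = 0
    while sum(cs) != c:
        cs[i % len(cs)] += 1 if sum(cs) < c else -1
        i += 1
    return cs
```

On the worked example from above (partial matrices of 10,000 and 30,000 taxa, c = 20,000), the initial proportional assignment already balances: `balance_cluster_counts([10000, 30000], 20000)` yields `[5000, 15000]`.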
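Steps 2 and 3 above can be sketched with a minimal average-linkage agglomeration (our own naive O(n³) illustration, not the implementation cited in [55]); stopping once c_i groups remain is equivalent to cutting the hierarchical tree at the level that yields c_i clusters:

```python
# Sketch: average-linkage (UPGMA-style) agglomeration until exactly k
# groups remain. Naive O(n^3), acceptable for partial matrices capped
# at m = 1024 tips.

def average_linkage(dist, items, k):
    """dist: {(a, b): d} with a < b; returns a list of k clusters (sets)."""
    clusters = [{i} for i in items]

    def d(x, y):
        return dist[(x, y)] if (x, y) in dist else dist[(y, x)]

    while len(clusters) > k:
        # find the pair of clusters with the smallest average distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                avg = (sum(d(a, b) for a in clusters[i] for b in clusters[j])
                       / (len(clusters[i]) * len(clusters[j])))
                if best is None or avg < best[0]:
                    best = (avg, i, j)
        _, i, j = best
        clusters[i] |= clusters.pop(j)   # merge the closest pair
    return clusters
```

For example, with tip pairs {A,B} and {C,D} at distance 1 and all other pairs at distance 10, cutting at k = 2 recovers exactly those two groups.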

4.2.3 Backbone construction

When all subtrees have been processed, we have a list of c clusters. Note that each cluster contains x tips, where 1 ≤ x ≤ m, and that each tip is assigned to exactly one cluster. The step that builds the backbone from the clusters is not trivial. We use labels (inside, boundary and outside) to identify which nodes belong to the backbone and which ones do not.

The backbone tree is defined by the nodes marked as inside and boundary. Once the clusters have been computed, we build the backbone as follows: Initially, we label each inner node in the tree as inside, tip nodes which belong to clusters of size one as boundary, and all remaining terminal nodes as outside. In addition, we maintain a list for storing the cluster identifiers of ancestral nodes that will not form part of the backbone.

Once this is done, we update/adapt the backbone assignment for ancestral nodes: The nodes of the comprehensive tree that represent the k subtree roots will remain inside the backbone. On each of the k subtree roots, we initiate a post-order traversal to relabel the ancestral nodes, if required, according to the following rule set:

• If the two child nodes are labeled as inside or boundary, the ancestral node remains labeled as inside.

• If one child is labeled as inside or boundary and the other child as outside, the ancestral node remains labeled as inside and the outside child node is relabeled as boundary.

• If both children are labeled as outside, we need to check to which cluster they belong. If they belong to the same cluster, the parent node is labeled as outside and the shared cluster identifier of the child nodes is propagated to the parent node. If the two children do not belong to the same cluster, the parent node is labeled as inside and both children are relabeled as boundary.

When the post-order traversal is about to be completed, we arrive at the subtree root i again, which was originally labelled as inside. At this point, we check whether the adjacent backbone node of subtree root i has been labeled as outside. Whenever this is the case (see Figure 4.1 for an example), the adjacent backbone node is relabeled as boundary for consistency.
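The rule set above can be sketched as a recursive post-order pass (an illustrative Python sketch under our own tree encoding, assuming binary subtrees; labels "I", "B", "O" stand for inside, boundary, outside):

```python
# Sketch of the relabelling rules. Each node is a
# (label, cluster_id, children) tuple; tips enter labelled "O"
# (or "B" for singleton clusters) and carry their cluster id.

def relabel(node):
    """Post-order pass; returns the relabelled node."""
    label, cid, children = node
    if not children:
        return node                       # tips keep their initial label
    left, right = (relabel(c) for c in children)
    ll, rl = left[0], right[0]
    if ll != "O" and rl != "O":           # both children inside/boundary
        return ("I", None, [left, right])
    if ll == "O" and rl == "O":
        if left[1] == right[1]:           # same cluster: collapse upwards
            return ("O", left[1], [left, right])
        # different clusters: both children become virtual tips
        return ("I", None, [("B",) + left[1:], ("B",) + right[1:]])
    # one outside child: it becomes a boundary node (virtual tip)
    promote = lambda n: ("B",) + n[1:] if n[0] == "O" else n
    return ("I", None, [promote(left), promote(right)])
```

For instance, a cherry of two outside tips from cluster 2 collapses to an outside node carrying id 2, whereas a cherry mixing clusters 1 and 2 stays inside with two boundary children.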


Figure 4.1: Consistency of labels at the backbone boundaries. At first (left), an initial backbone exists (thick branches), all inner nodes are labelled as inside (I), and each tip node has a cluster id. After completion of the post-order traversal (right), each inner node has been relabelled accordingly, if required. Here, cluster 2 is monophyletic, hence the cluster id was propagated back to the initial backbone node. This produced a branch (edge) with an inside and an outside node; therefore the outside (O) node is relabelled (arrow) as a boundary (B) node.

Reduction Factor R    n := 37831 (expected)    n := 55593 (expected)
0.25                  12668.0 (9457.75)        19366.7 (13898.25)
0.50                  22340.0 (18915.5)        33501.5 (27796.5)

Table 4.2: Average number of computed backbone tips over 10 distinct trees. The average number of backbone tips is higher than the expected number n · R.

Given a set of tips that form part of the same cluster, it may occur that these tips also form a monophyletic group. In this case, during the post-order traversal, all ancestral nodes will be grouped together under the same cluster identifier and the common ancestral node will become a backbone boundary (virtual tip). However, if the tips in a cluster are not monophyletic (see, for instance, Figure 4.2), the application of the above rules requires some additional boundary relabelling.

Based on the above, a single cluster may thus induce more than a single virtual tip. As a consequence, the number of virtual tips may actually be higher than the number of clusters. In turn, the reduction of tree size that can be achieved will be smaller than specified by R. The impact and frequency of occurrence of this phenomenon (non-monophyletic clusters) depends on the shape of the tree and the branch lengths. In Table 4.2, we outline this effect for trees with 37,831 and 55,593 taxa. We computed the average number of virtual tips generated by our algorithm on 10 distinct trees per dataset and reduction factors of 0.25 and 0.5, respectively.


Figure 4.2: At first (left), an initial backbone exists (thick branches), all inner nodes are labelled as inside (I), and each tip node has a cluster id. Upon completion of the post-order traversal (right), each inner node has been relabelled accordingly. Here, cluster 2 is not monophyletic. Hence, an additional virtual tip is created, that is, cluster 2 generates two boundary tips.

4.2.4 Backbone-constrained Tree Search

Once the backbone has been built, a standard phylogenetic search algorithm can be applied. We have implemented the above backbone algorithm in a dedicated proof-of-concept RAxML version. This implementation is available for download at http://www.exelixis-lab.org/web/personal_page/izquierdo/backbone.tar.gz

Initially, RAxML will generate a comprehensive randomized stepwise addition order parsimony tree, or read in a user-specified tree via -t. Then, it will optimize the ML model parameters—including branch lengths—on the comprehensive tree. Thereafter, it will execute the backbone algorithm as described above. The tree searches on the backbone are based on the standard RAxML hill-climbing algorithm. The standard algorithm implements lazy SPR steps, for which the likelihood scores obtained are only approximate. As in standard SPRs (see Section 2.6), a subtree is pruned and re-grafted onto several neighbouring branches. However, after re-grafting, only the three branch lengths around the insertion branch are re-optimized, which induces significant time savings in comparison with re-optimizing all branch lengths of the tree. These lazy SPRs are used as a fast pre-scoring mechanism to find good topologies, which are later optimized more thoroughly [95] in a subsequent evaluation step. In our algorithm, lazy SPR moves are only conducted within the backbone.

After each complete cycle of SPR moves (see [87] for details), the backbone tree will be re-computed based on the currently best tree. Also, the branch lengths of the entire tree (including those branches not forming part of the backbone) will be re-optimized once after each SPR cycle.

4.3 Evaluation and Results

4.3.1 Performance

To test the backbone algorithm, we executed the dedicated RAxML version with the experimental -L command line option. This option initially builds a backbone tree and then deploys the CAT approximation of rate heterogeneity [86] with the standard RAxML hill-climbing search algorithm [87, 93] to apply lazy SPR moves (see [87]) within the backbone only. We used tree size reduction factors of 0.25 and 0.5. As starting trees, we used randomized stepwise addition order parsimony starting trees generated with RAxML v727 (-y option). For each dataset, we inferred 10 ML trees, one for each of the 10 parsimony starting trees. RAxML was executed using the Pthreads-based parallel version [97] with 16 threads on unloaded Quad-Core AMD Opteron nodes with 16 cores and 128GB RAM each.

We computed average runtimes over 10 runs for the 38K and 56K datasets (see Subsection 3.4.2), respectively. For each backbone tree, we also computed the theoretical minimum number of bytes (denoted as Memory for Backbone) required to store the ancestral probability vectors at the virtual tips and the inner nodes, which dominate the memory requirements. If the branch length optimization process, unlike in our current implementation, is limited to optimizing branches within the backbone, this theoretical minimum value represents a good estimate of the memory footprint of a backbone tree search. We also computed the respective memory requirements for the comprehensive tree (denoted as Memory for Full tree), which reflect the 'standard' memory requirements when no reduction factor is applied.

These values (see Table 4.3 and Table 4.4) provide a notion of the potential memory savings that can be achieved by the backbone approach. In Table 4.3 and Table 4.4, we also provide the respective execution times and average log likelihood scores obtained by using the backbone algorithm (R := 0.25, R := 0.5) and a comprehensive search on the full tree (R := 1.0). These values have been averaged over 10 runs (10 starting trees). While execution times can be reduced by the backbone approach, the log likelihood scores obtained by conducting searches on a backbone are slightly worse than those obtained by searching on the full tree.

In Figure 4.3 and Figure 4.4 we show that the choice of the random number seed (-p option in RAxML), which determines the shape of the starting trees, has a significant impact on the final log likelihood score (computed under GTR+Γ), irrespective of the search strategy that is used. On average, searches on the full tree yield better likelihood scores than searches on backbone trees. However, the variance of the likelihood score as a function of the starting tree (parsimony random number seed) is comparable to the score difference between full and backbone tree searches. For example, on the 38K dataset, the log likelihood scores of the 10 final trees obtained from full searches show a standard deviation of 1307 log likelihood units. The average per-starting-tree difference in log likelihood between the full search and a backbone search is only 645 log likelihood units for R := 0.50 and 2030 log likelihood units for R := 0.25.

Figure 4.3: Log likelihood scores for different reduction factors (38K dataset). Plot of the log likelihood scores under GTR+Γ of the final trees obtained by each method (R = 0.5, R = 0.25, standard) as a function of the starting tree (random number seed) for the 38K dataset. Each LH score (point) results from an independent search; the lines linking the points only guide the eye.

Figure 4.4: Log likelihood scores for different reduction factors (56K dataset). Plot of the log likelihood scores under GTR+Γ of the final trees obtained by each method (R = 0.5, R = 0.25, standard) as a function of the starting tree (random number seed) for the 56K dataset. Each LH score (point) results from an independent search; the lines linking the points only guide the eye.

                            R=0.25     R=0.5      R=1
Runtime (h)                 30.41      38.60      54.03
Memory for Backbone (GB)    4.90       7.70       N/A
Memory for Full tree (GB)   10.33      10.33      10.33
Log likelihood (Avg)        -5531436   -5530051   -5529406
Log likelihood (Std Dev)    943.26     770.47     1307.16
Avg (logLH - logLH(R=1))    2030.24    645.31     0.0

Table 4.3: Average runtimes, memory requirements, and log likelihood scores (over 10 runs) for the 38K dataset.

                            R=0.25     R=0.5      R=1
Runtime (h)                 50.17      63.22      85.89
Memory for Backbone (GB)    8.22       12.72      N/A
Memory for Full tree (GB)   16.82      16.82      16.82
Log likelihood (Avg)        -7063342   -7061516   -7060488
Log likelihood (Std Dev)    1727.90    1761.27    1718.47
Avg (logLH - logLH(R=1))    2853.41    1028.04    0.0

Table 4.4: Average runtimes, memory requirements, and log likelihood scores (over 10 runs) for the 56K dataset.

Given the runtime savings that can be achieved by the backbone approach, backbone tree searches can be used, for instance, to explore a larger number of parsimony starting trees, which substantially influence the final log likelihood scores. A reasonable strategy for finding best-known ML trees may consist in starting many fast searches with a relatively aggressive setting of R := 0.25 to identify/determine a set of 'good' starting trees that yield the best final log likelihood scores. In a second step, full tree searches can be conducted on those promising starting trees to find trees with even better scores.


            R := 0.25   R := 0.5   R := 1
R := 0.25   182.6       169.9      188.0
R := 0.5    169.9       152.8      146.2
R := 1      188.0       146.2      133.0
True Tree   398.8       382.0      388.0

Table 4.5: Average symmetric differences (over 5 runs) for the 1500-taxon simulated dataset.

4.3.2 Simulated Datasets (Accuracy)

We used simulated datasets in order to better understand the impact of the backbone algorithm on topological accuracy. We ran INDELible [26] to generate simulated MSAs of 1500 taxa (575 bp) and 5000 taxa (1074 bp). We compared the RF distance [69] (the number of bipartitions that differ between two topologies) between the true tree and the topologies from the standard full search and the backbone-based searches. For each dataset, the full search and the backbone searches with R := 0.25 and R := 0.5 were run five times with different starting trees. Table 4.5 shows the average symmetric differences among all approaches for the dataset with 1500 taxa. We see that, in terms of topological accuracy, applying reductions of R := 0.25 and R := 0.5 yields topologies that are close to those of the standard full search. Furthermore, the distance to the true tree is not increased by the reduction.
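The symmetric (RF) distance used here can be sketched as follows (a minimal pure-Python illustration under our own nested-tuple tree encoding, not the implementation behind [69]): collect the non-trivial bipartitions each topology induces and count those present in exactly one tree.

```python
# Sketch: RF distance as the symmetric difference of bipartition sets.
# Trees are nested tuples of tip-name strings (a made-up encoding).

def bipartitions(tree, all_tips):
    """Return the set of non-trivial splits, each as one canonical side."""
    splits = set()

    def walk(node):
        if isinstance(node, str):
            return {node}
        below = set()
        for child in node:
            below |= walk(child)
        if 1 < len(below) < len(all_tips) - 1:   # skip trivial splits
            side, other = frozenset(below), frozenset(all_tips - below)
            # canonical side: the lexicographically smaller one
            splits.add(min(side, other, key=sorted))
        return below

    walk(tree)
    return splits

def rf_distance(t1, t2, tips):
    b1, b2 = bipartitions(t1, tips), bipartitions(t2, tips)
    return len(b1 ^ b2)                          # symmetric difference

tips = {"A", "B", "C", "D", "E"}
t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
# t1 and t2 disagree on one split (AB|CDE vs AC|BDE), so the RF distance is 2.
```

Canonicalizing each split to one side makes splits comparable regardless of where the (arbitrary) root of the tuple encoding happens to sit.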

4.4 Summary

In this chapter, we have proposed an algorithm for reducing the tree size for phylogenetic inference under likelihood-based methods on trees with several tens of thousands of taxa. We have explored different backbone construction techniques and described the method that worked best with respect to final log likelihood scores. Such backbone-based techniques can help to reduce memory footprints and execution times. However, in almost all cases they yield final trees with worse likelihoods compared to comprehensive tree searches on a full, unreduced tree. We find that the likelihood scores of final trees heavily depend on the respective starting trees and conclude that backbone approaches can be deployed for identifying 'good' starting trees, which can then be further refined using a comprehensive tree search.


Chapter 5

Introduction to GPU Programming

In this chapter, we briefly introduce the main programming models for GPU computing, where parallel compute-intensive calculations are offloaded to the GPU, while the rest of the code is run on the CPU.

5.1 Overview

GPUs (Graphics Processing Units) were originally designed as a dedicated hardware architecture for accelerating graphics rendering. GPGPU (General-Purpose computation on GPUs) [62], or GPU computing, refers to the use of GPUs to accelerate scientific applications, a concept that dates back to the early 2000s [102].

Modern CPUs consist of a few cores optimized for serial processing, while GPUs contain thousands of small, efficient cores designed for parallel floating-point operations, owing to the computational requirements of graphics rendering.

The main programming models/interfaces for GPUs are CUDA (Compute Unified Device Architecture) and OpenCL. CUDA is a parallel computing platform for GPUs introduced by NVIDIA in 2006. It includes an SDK and an API that, using the C language, give access to the virtual instruction set and memory of the parallel computational elements in CUDA GPUs.

OpenCL (Open Computing Language) is a more generic framework. It was initially introduced in 2009 and is currently maintained by the Khronos group [14]. OpenCL programs can be executed across heterogeneous platforms, such as central processing units (CPUs), graphics processing units (GPUs), and digital signal processors (DSPs). OpenCL includes a language (based on C99) for writing kernels (functions that execute on OpenCL devices), plus application programming interfaces (APIs) that are used to define and then control the platforms.

The high-level APIs of CUDA and OpenCL allow the application programmer to create C programs that can execute specified functions (kernels). These kernels can be run on the GPU's streaming processors.

In both platforms, the master system that steers the computations is denoted as the host, which is usually a CPU. In addition, one or several devices are available. A device is a massively parallel processor with a large number of arithmetic and floating-point processing units. In terms of memory, both approaches assume a similar memory hierarchy, although the terminology differs.

5.2 CUDA

In this section, we introduce the main terminology and concepts of the CUDA framework [61].

5.2.1 CUDA Hardware and Architecture

From a hardware architecture perspective, NVIDIA GPUs consist of scalable arrays of multi-threaded Streaming Multiprocessors (SMs). A group of threads running on the same processor core is called a thread block. The number of threads in a thread block is limited by the hardware. On current GPUs, a thread block can contain up to 1024 threads [61]. Furthermore, thread blocks can be scheduled in any order because they are required to execute independently. During a CUDA program execution, as thread blocks complete their execution, new blocks are launched on the vacant streaming multiprocessors. Finally, blocks are organized in one-, two- or three-dimensional grids of thread blocks.

We now briefly introduce the NVIDIA Fermi architecture [11]. The Fermi architecture (see Figure 5.1) includes up to 512 CUDA cores, which are distributed across 16 streaming multiprocessors, each with 32 CUDA cores.

Each streaming multiprocessor (see Figure 5.2) has 16 load/store units, four SFUs (special function units), and 64KB of on-chip memory which is used as shared memory and L1 cache. A streaming multiprocessor manages and executes threads in groups of 32 parallel threads (a warp). A warp executes one common instruction at a time.

A coherent L2 cache of 768KB is shared across all multiprocessors in the GPU. The host interface connects the GPU to the CPU via a PCI Express bus.

Figure 5.1: The Fermi architecture contains 16 streaming multiprocessors. Source [11].

Figure 5.2: Architecture of a Fermi streaming multiprocessor. Source [59].

The SIMT (Single Instruction, Multiple Threads) and SIMD (Single Instruction, Multiple Data) architectures are closely related, since a single instruction is executed on multiple processing elements on different data or a data vector. In SIMT architectures (GPUs), the programmer can write thread-level parallel code for independent threads, as well as data-parallel code for coordinated threads. However, with respect to performance, it is important to avoid divergence among threads belonging to the same warp, because a thread can either execute the same instruction as the other threads in the warp, or idle. Therefore, for maximum efficiency, all threads of a warp should share the same execution path.

In 2012, NVIDIA released a newer architecture called Kepler [60]. In the course of this thesis, however, we only made use of the Fermi architecture.

5.2.2 CUDA Programming Model

In the CUDA programming model, the host is a CPU running the main (serial) C program. The host has its own RAM memory. The CUDA threads are executed on a separate device (GPU).

A kernel is a CUDA C (an extension of C) function, which is executed in parallel N times by N different CUDA threads.

Barriers are used to synchronize thread blocks and coordinate memory accesses to global memory. Figure 5.3 shows this memory hierarchy in detail: Each thread has a private local memory. Each thread block has shared memory that is visible only to threads within the same block (memory near the corresponding processor cores). All threads (and thread blocks) have access to global memory.

The CUDA programming model assumes that the device has its own DRAM memory (device memory). Therefore, the programmer is responsible for managing the memory units that are visible to kernels through calls to the CUDA runtime. This includes memory allocation and deallocation on the device, as well as data transfers between host and device memory.

5.2.3 Performance Considerations

Latencies should be hidden. The latency is the number of clock cycles that a warp waits before executing its next instruction. This can happen when the input operands of the next instruction are not yet available. Another reason can be that the warp is waiting at a memory fence or a synchronization point. Each GPU multiprocessor can in principle hide these latencies and maximize the utilization of its functional units. If all warp schedulers always have at least one instruction to issue for some warp at every clock cycle, the latency can be fully hidden. In other words, ideally a warp will be issued at every cycle, and it is therefore desirable to have a high number of resident warps, so that these can be scheduled alternately.

Figure 5.3: CUDA memory hierarchy. Source [61].

Figure 5.4: This access pattern results in a single 128-byte transaction, indicated by the red rectangle. Source [61].

Data transfer between host and device. The amount of data transferred between host and device should be minimized whenever possible due to the low bandwidth. A typical approach is to sacrifice parallelism in the kernels by moving more code from the host to the device. This involves adding kernel code and data structures to compute and store intermediate results in the GPU DRAM that never need to be communicated to the host. Transfer overheads can also be minimized by batching many small transfers into single larger transfers.

Accesses to global memory. Global memory resides in device memory, which is accessed via 32-, 64-, or 128-byte memory transactions. Segments of device memory (32, 64 or 128 bytes) are aligned if their first address is a multiple of their size. If the threads within a warp access such neighboring aligned memory elements, only one memory transaction is needed. In other words, data memory layouts should be designed by the programmer so that warps read or write contiguous memory elements, as depicted in Figure 5.4.

5.3 OpenCL

The programming model of OpenCL is analogous to the one presented for CUDA. We briefly present it to clarify differences in terminology.

OpenCL (Open Computing Language) is an open standard for parallel programming of heterogeneous systems. It provides a language (a subset of ISO C99) for software developers to write portable code for SIMT (Single Instruction, Multiple Threads) architectures.


The OpenCL Execution Model consists of an application running on a Host (CPU), which offloads work to one or more Compute Devices (for instance, GPUs). Each compute device is composed of one or more Compute Units. In CUDA, these are called Streaming Multiprocessors (SMs).

A Kernel represents the code for a work-item (thread). Work-items are the basic units of work. Work-items are grouped into local work-groups, equivalent to CUDA thread blocks. OpenCL applications can access various types of memory: host memory (on the host CPU), global (visible to all work-groups, e.g., DRAM on the GPU board), local (shared within a work-group, called shared memory in CUDA), and private (registers per work-item, called local memory in CUDA).

5.3.1 OpenCL performance portability

OpenCL provides developers with portability by enabling the use of, and deployment on, diverse processing platforms. In particular, performance differences with CUDA are rather small on GPUs (CUDA performs at most 30% better), and tend to disappear under fair comparisons [22].

The OpenCL standard also guarantees functional compliance with other devices such as CPUs, DSPs, and other hardware platforms; that is, portable OpenCL code will run correctly and generate the same results. Thus, the main advantage is that a single implementation can be executed on different platforms. However, performance portability is not guaranteed. In order to obtain maximum performance, OpenCL code still requires architecture-specific implementations.

Recently, Zhang, Sinclair and Chien [114] studied the performance portability of OpenCL across diverse architectures (NVIDIA GPU, Intel Ivy Bridge CPU, and AMD Fusion APU) with typical benchmarks, such as SpMV (Sparse Matrix Vector multiply) and FFT (Fast Fourier Transform). The results showed poor performance for single-source OpenCL programs (on average 15% of the performance of state-of-the-art implementations). However, with architecture-oriented tuning, the performance was improved to 67% on the Ivy Bridge CPU [114]. In general, there is a significant gap between single-source OpenCL programs and architecture-oriented, tuned OpenCL programs.


Chapter 6

GPU Implementation of Phylogenetic Kernels

The content of this chapter has been derived from the following peer-reviewed publication: F. Izquierdo-Carrasco, N. Alachiotis, S. Berger, T. Flouri, S. P. Pissis, and A. Stamatakis. A generic vectorization scheme and a GPU kernel for the phylogenetic likelihood library. In Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), 2013 IEEE International Symposium on, 2013. Simon Berger designed and implemented the generic vectorization scheme described in Section 6.2.

In this chapter, we describe in detail a GPU implementation for the computation of the main phylogenetic functions as a proof-of-concept implementation for the Phylogenetic Likelihood Library (PLL, introduced in Section 2.7). These functions involve the computation of the likelihood score and the Newton-Raphson method for branch length optimization. The proof-of-concept implementation works for DNA data and the Γ model of rate heterogeneity.

We also introduce a GPU-specific memory organization scheme that reduces data transfer between the GPU and the CPU to an absolute minimum, thereby improving performance. The memory layout of the ancestral probability vectors (APVs) stored on the GPU is an adapted version of a generic vectorization scheme.

This generic vectorization scheme for the phylogenetic likelihood function (PLF) allows for transparently deploying vector units of arbitrary length for PLF computations. These vector instructions can be x86 intrinsics (128-bit wide SSE3 instructions and 256-bit wide AVX instructions) as well as SIMT instructions on GPUs. A generic vectorization scheme is important to ensure portability of the code to increasing vector lengths (e.g., the 512-bit wide vector units on the recent Xeon Phi processor).

According to our experiments, our GPU implementation of the PLF is approximately twice as fast as the highly tuned x86 version of the PLF that relies on manually inserted AVX vector intrinsics.

The remainder of this chapter is organized as follows: In Section 6.1 we survey related work on PLF libraries and GPU implementations. Thereafter, in Section 6.2, a generic vectorization scheme for PLF computations is introduced. In the subsequent Section 6.3, we cover technical details of the GPU implementation. Thereafter, we describe the experimental setup and the results obtained (Section 6.4) and conclude in Section 6.5.

6.1 Related work

Early work on porting the RAxML likelihood functions, which comprise the core of the PLL, to GPUs in the pre-CUDA and pre-OpenCL era was reported in [12]. Exploiting fine-grain parallelism with GPUs for the PLF has previously been addressed in [65] and [115] for MrBayes [73]. However, these implementations represent case studies or only cover a small subset of the PLF in MrBayes (for specific data types such as DNA data). Hence, these initial efforts do not represent production-level implementations, but rather proof-of-concept studies.

The BEAGLE [4] (general purpose library for evaluating the likelihood of sequence evolution on trees) library introduced an application programming interface (API) for PLF computations and also offers efficient implementations thereof. BEAGLE can exploit modern hardware using SSE3 intrinsics, multi-threading, and GPUs. It has been integrated into Bayesian programs (BEAST [99] and MrBayes [73]) and Maximum Likelihood programs (GARLI [116]). The BEAGLE paper [4] reports performance results for DNA and codon data on two 15-taxon datasets. The test datasets contained 8558 unique nucleotide (DNA) site patterns and 6080 unique codon site patterns, respectively. For each of the three programs integrated with BEAGLE, the authors measured the speedup of the BEAGLE CPU, SSE3, and GPU (under single and double precision) implementations with respect to the corresponding native implementations. The largest speedups were obtained for the GPU implementation. For GARLI, only GPU speedups were reported (factor 3.8 for DNA data and 12 for codon data under double precision). The BEAGLE-based version of MrBayes yielded a maximum speedup of 16 (DNA data) and of 31 (codon data) on the GPU using double precision arithmetics. Note that the relative speedup for MrBayes comparing the BEAGLE CUDA against the BEAGLE SSE3 performance was approximately 4.6 for DNA data. BEAST showed similar speedups for the GPU implementation under double precision (14-fold for DNA data and 37-fold for codon data). The speedups for single precision were larger. However, for large-scale real-world datasets (in particular with a high number of taxa), double precision arithmetics are typically required to guarantee the numerical stability of the PLF [7].

The PLL library introduced in Section 2.7, which we use to develop our GPU proof-of-concept implementation, offers additional features that BEAGLE does not support. The PLL can also use AVX intrinsics. Furthermore, it implements numerical optimization functions such as, for instance, the Newton-Raphson method for branch length optimization. BEAGLE defers these tedious programming tasks to the application programmer. It only offers functions for computing the first and second derivatives of the likelihood function, which can then be used by the application programmer to implement a Newton-Raphson branch length optimization procedure. Moreover, BEAGLE currently does not allow for conducting partitioned analyses which, given that partitioned analyses (distinct sets of likelihood model parameters are estimated for different parts of the multiple sequence alignment) are becoming increasingly common, represents a drawback of BEAGLE. As a consequence, BEAGLE also does not implement techniques [98, 113] for improving parallel load balance in partitioned analyses. Unlike the PLL, it does not offer a fine-grain MPI parallelization of the PLF and is hence limited to stand-alone shared memory nodes. Finally, BEAGLE does not implement the PSR (originally CAT) model of rate heterogeneity [86], which can yield substantial computational savings in terms of floating point operations and memory utilisation compared to the standard Γ model of rate heterogeneity [110].

6.2 Generic Vectorization

An important part of the PLF is the newview() function (see Section 2.7), which updates the ancestral probability vectors (APVs) at inner nodes of the tree in the course of a post-order traversal. The innermost loop of newview() calculates the sum over products between elements of the transition probability matrix P and corresponding elements in the APV L:

L^{(p)}_A(i) = \left( \sum_{S=A}^{T} P_{AS}(b_{lp}) L^{(l)}_S(i) \right) \left( \sum_{S=A}^{T} P_{AS}(b_{rp}) L^{(r)}_S(i) \right)   (2.10)

Figure 6.1: Memory layout of an ancestral probability vector

Figure 6.2: Memory layout of an ancestral probability vector with a vector width of 2 (VW := 2)

We assume DNA data and the Γ model of rate heterogeneity. The entries of these vectors are computed according to Equation 2.10 and stored following the layout shown in Figure 6.1, assuming DNA data with 4 states, 2 Γ rates, and 4 alignment sites.

For each alignment site, the ancestral probability vector contains 2 rate blocks. Each rate block contains 4 probabilities (one per state and site s, denoted by L_{s,A}, L_{s,C}, L_{s,G}, and L_{s,T}, for site s). Using this memory layout, the probability values of the states can be read efficiently from contiguous memory locations to calculate the scalar products in Equation 2.10.

This memory layout is used directly in the initial ad hoc SSE3 and AVX versions of RAxML. The calculation of the scalar products in the innermost newview() loop can be implemented by using element-wise multiply and horizontal add operations. However, this approach is only efficient if the number of states (e.g., 4) is equal to or larger than the width of the vector unit. Since we use double precision floating point numbers, this is the case both for SSE3 (vector width: 2) and for AVX (vector width: 4) vector units. In contrast to this, modern GPUs have much wider vector units. In addition, the width of x86 vector units is also expected to increase (e.g., Intel Xeon Phi). Hence, the initial ad hoc vectorization scheme can no longer be used for the PLL. Moreover, manual vectorization for each model and data type combination is error-prone and labor-intensive. Thus, we devised a more generic vectorization scheme that is easier to port to new models and can conveniently be adapted to vector units of arbitrary length.

In order to use the wider vector units on GPUs, we introduce a new and more generic vectorization scheme. Instead of exploiting parallelism within the innermost loop iteration of newview(), the new scheme calculates a part of the ancestral probability vectors simultaneously for multiple sites. We denote this approach as across-site vectorization. In principle, across-site vectorization is analogous to the sequential implementation of the PLF: the calculations of the scalar products in the innermost newview() loop are carried out sequentially. The main difference is that the scalar products are now being calculated for multiple sites (i.e., 2 or 4 sites for SSE3/AVX or more than 64 sites on the GPU) in parallel. In the SSE3 and AVX implementations, this parallelism is exploited by using vector intrinsics. A similar scheme has previously been used for the inter-sequence vectorization of the PaPaRa 2.0 [8] dynamic programming algorithm.

This simple vectorization scheme can only be used when the data (i.e., the APVs) are stored using an appropriate memory layout. Such a layout needs to allow for reading the probability values of a specific state and rate (e.g., L_{s,A}) that belong to neighboring alignment sites from contiguous memory locations; that is, we require that L_{s,A} and L_{s+1,A} (for a given rate) occupy contiguous positions in memory. Since this is not possible using the standard memory layout (see Figure 6.1), we introduce an appropriately adapted and flexible (regarding the vector unit width) memory layout (see Figure 6.2).

To assess the efficiency of this more generic vectorization scheme, we implemented it on both CPUs (using SSE as well as AVX) and GPUs. The major change consists in an adapted memory layout for the APVs, which now allows us to efficiently exploit across-site parallelism on CPUs and GPUs. Note that SSE and AVX instructions currently do not offer efficient operations for loading data from non-contiguous sites (data locations) into vector registers (see Figure 6.1). Generally, GPUs offer greater flexibility with respect to loading values from non-contiguous memory locations (e.g., loading the values corresponding to state A and discrete Γ rate 0 of sites 0, 1, ..., 32). However, overall GPU performance can be increased by accessing values from contiguous global memory locations, because read/write accesses can be coalesced and delays related to bank conflicts can be avoided. We have therefore generalized the ancestral probability vector memory layout to store corresponding values from different sites in contiguous memory. This allows for accessing the data at contiguous memory locations in the vectorized version of Equation 2.10. The memory layout is parameterized by the desired vector unit width (VW). For VW := 1, the memory layout corresponds exactly to the original memory layout of RAxML (see Figure 6.1). The analogous layout for VW := 2 is shown in Figure 6.2. As outlined in Figure 6.2, corresponding values from different alignment sites (e.g., state A, discrete Γ rate 0 of sites 0 and 1) are located at contiguous memory locations. Therefore, they can be loaded directly into an SSE register via a single load operation. Given this altered and adaptive memory layout, implementing a vectorized version of Equation 2.10 for sites 0 and 1 becomes straightforward. In Section 6.3 we show how we adapt this scheme to arbitrary widths for GPUs.

However, there are some limitations. Equation 2.10 can only be vectorized for sites that evolve according to the same model of evolution. In other words, the sites need to share the same P matrices and the same α shape parameter that determines the form of the Γ curve. With partitioned datasets, the parameters of the model of evolution are optimized independently for each partition. Thus, the maximum vector width is limited by the number of sites that evolve according to the same model, which corresponds to the length of the given partition.

Moreover, it is also difficult to apply the above scheme to the, otherwise efficient, PSR model of rate heterogeneity [86]. Instead of integrating the likelihood over different rates, the PSR model assigns one rate category (out of typically 25) to each alignment site. This means that there are at least 25 different P matrices and that Equation 2.10 can only be vectorized across sites that evolve according to the same P matrix (rate category). Hence, devising a generic vectorization scheme for the PSR model is not straightforward.

6.3 GPU Implementation

We now describe the design and implementation of GPU kernels for the key functions of the PLL (newview(), evaluate(), and coreDerivative()). The model of rate heterogeneity is the Γ model. These functions are described in detail in Section 2.7, and account for more than 95% of total execution time in likelihood-based phylogenetic inference programs [10, 93]. For these functions, the ancestral probability vectors are read as input and, in the case of newview(), an additional APV is written as output.

These APV access patterns have two important implications for the design of the GPU implementation.

The first design criterion is associated with improved GPU thread performance when threads access contiguous memory locations in global memory. To maximize global memory throughput in the OpenCL model, it is essential to optimize memory coalescence and minimize address scatter [1]. The standard layout for APVs (see Figure 6.1) is problematic because contiguous memory locations do not store likelihood entries belonging to the same state and rate. We address this by using the ancestral probability vector memory layout presented in Section 6.2 (see Figure 6.2) and adapting it to GPUs.

The second challenge is that severe performance penalties are induced by frequently transferring large amounts of data between the CPU and the GPU. Because the APVs dominate the memory requirements of the PLF computations (see discussion in Section 3.1), we devise an appropriate memory organization strategy. Under this specific scheme, the ancestral probability vectors are stored and updated exclusively on the GPU. The host program on the CPU simply orchestrates the tree search and invokes PLF computations on the ancestral probability vectors that reside on the GPU. We have developed an OpenCL kernel that implements this strategy. At each kernel call, the host program only needs to pass the memory addresses, that is, the starting positions of an ancestral probability vector (corresponding to a node of the tree) in GPU memory, to the GPU. Apart from that, the CPU only needs to communicate the substantially smaller P matrices and one additional variable to the GPU. The variable indicates which PLF function (e.g., newview(), evaluate(), etc.) shall be executed.

Each kernel call returns at most one or two floating-point values to the CPU. In particular, calls to evaluate() return the overall log likelihood, and calls to coreDerivative() return the first and second derivatives of the likelihood function. When newview() is invoked, no values are returned because this function simply updates the ancestral probability vectors that reside in GPU memory.

6.3.1 Kernel Implementation

Update of ancestral probability vectors The CPU version of newview() (see Figure 6.4) includes three distinct cases: tip-tip (both children are leaves), inner-tip (one child is a leaf and the other is an inner node), and inner-inner (both children are inner nodes). For details, see Section 2.8.

In the GPU version, these three cases have been reduced to one generic case (inner-inner). This also induces a change in the layout of the vectors at the tips, which are stored in the form of inner ancestral probability vectors (tip-APVs), rather than as a look-up table that is indexed by the raw alignment sequence data (for details see [89]).

While this doubles the memory requirements for storing ancestral probability vectors, it simplifies the storage of vectors in GPU memory, as well as the OpenCL code implementation (all cases are executed with the same kernel, avoiding conditional statements that would deteriorate performance). Hence, we allocate space for storing 2n − 2 ancestral probability vectors on the GPU, where n is the number of taxa in the multiple sequence input alignment. Finally, the newview() function also implements a numerical scaling procedure to avoid numerical underflow in likelihood computations (for a detailed description, see [89]).

The GPU execution of the traversal descriptor described in Section 2.8 is depicted in Figure 6.4. Each entry in the descriptor represents an inner-inner operation, where three APVs are involved. The input APVs r and q are the child nodes. The output APV p is the parent of r and q. The parent/children relationship refers to the direction of the virtual root. The APVs r and q can be tip-APVs. The APV p always corresponds to an inner node in the tree. The branch length connecting the corresponding nodes of APVs p and q is given by b_qp. The branch length connecting p and r is given by b_rp. The traversal descriptor is processed sequentially as follows: for each entry in the descriptor, the host (CPU) computes the matrices P(b_rp) and P(b_qp). The host then transfers the APV identifiers of q, r, and p (the APVs are stored in GPU global memory) and the P matrices to the device (GPU). The host then launches the newview() kernel. For each work-group, the corresponding P matrix is copied to local memory. Thus, threads belonging to the same work-group are executed on the same streaming multiprocessor, read and write contiguous memory positions (APV entries) in global memory, and have fast access to the P entries required to compute the new APV values. The kernel also performs additional operations related to numerical scaling. The host does not need to read back anything from the device.

Evaluation of the likelihood The GPU implementation of evaluate() is analogous to the newview() kernel. Because of the changed ancestral probability vector representation at the tips, we simply omit the case where the left or right node of the branch at which the likelihood is calculated is a tip. The host must read back a double precision floating-point number (the log likelihood value).

Branch length optimization We observed that, during branch length optimization on the GPU, the overhead of invoking the pre-computation kernel (sumGAMMA()) and storing the results is larger than re-computing the product of the entries prior to each invocation of coreDerivative(). Hence, we merged sumGAMMA() and coreDerivative() into a single kernel.

Re-loading tip information on the GPU Apart from the 2n − 2 full ancestral probability vectors, we also store the raw alignment sequence data as well as the so-called tipVector data structure on the GPU. The reason for this is that we need to re-calculate the ancestral probability vectors at the tips each time we change the values in the instantaneous substitution matrix Q (e.g., when optimizing the parameters of the General Time Reversible model of nucleotide substitution), which is required to calculate P. Changes to Q occur frequently when the rates in the Q matrix are being optimized.

While the values in the tip-APVs would normally be expected to be constant, this is not the case for the numerical implementation of the PLF used in the PLL. In fact, the matrix of left Eigenvectors is multiplied with the tip probability vectors prior to any further likelihood calculations. This allows us to save some computations later on.

Each time the values in the Q matrix are changed, this induces a change of the Eigenvector decomposition. As introduced in Section 2.1, the Eigenvector decomposition is used to exponentiate Q for obtaining the transition probability matrix P for a given branch length t (i.e., P(t) = e^{Qt}). Thus, the ancestral probability vectors at the tips need to be updated accordingly when the Eigenvector decomposition changes. However, transferring n tip vectors from the CPU to the GPU has a deteriorating effect on performance. Thus, we only transfer the substantially smaller tipVector array, which contains the product of the left Eigenvectors with all possible DNA states (this is the look-up table used in the non-GPU PLL implementation), from the CPU to the GPU.

Once the tipVector is available on the device, the new ancestral probability vectors at the tips (tip-APVs) can be efficiently updated on the GPU using the already available raw alignment sequence data. The overhead of re-computing the ancestral probability vectors at the tips is negligible; it accounts for less than 1% of total run time. In contrast, transferring all n ancestral probability vectors for the tips from the CPU to the GPU for each change in Q induced a run time overhead of up to 70%.

6.3.2 GPU Memory Organization

Due to the aforementioned performance considerations, the APVs must be stored on the GPU. In order to correctly calculate the PLF, however, there are other APV-related data structures that must be stored on the GPU. We list these requirements below:

• Ancestral probability vectors for the n − 2 inner nodes and the n tips.

• The raw sequence alignment data of the tips, which is required for re-computing the ancestral probability vectors at the tips when the Q matrix changes.


• The tipVector array, which is also required for re-assembling the tip-APVs.

• A weight vector that indicates how many times each alignment site pattern occurs in the uncompressed MSA. This is called site pattern compression and is used in all standard PLF implementations to avoid unnecessary computations (the per-site likelihood is the same for two identical sites).

• Two diagptable arrays (one per child node) representing the P matrices for newview() and evaluate() invocations.

• A globalScaler array of size 2n − 2 for storing (and later on undoing) the scaling multiplications conducted to avoid numerical underflow at each node of the tree.

• Buffers to accumulate results, sum the per-site log likelihoods, and sum over the number of scaling multiplications.

In terms of memory requirements, this scheme is dominated by the 2n − 2 ancestral probability vectors. Thus, the memory requirements can be approximated in advance. The proof-of-concept GPU implementation assumes that enough memory is available on the GPU. When this is not the case, it should be possible to apply the memory reduction strategy presented in Section 3.3, trading the lack of memory for additional computations. Alternatively, the computations can be split up among several GPUs.

6.3.3 OpenCL Implementation

In OpenCL (see Section 5.3), the work-group size, which is set by the kernel programmer, corresponds to the number of threads that are executed per streaming multiprocessor (SM). After experimenting with several multiples of 32, we empirically determined that a value of 64 worked best for our target application and hardware platform. At each kernel call, all threads within a block can read data from contiguous positions in global memory. Thus, in our configuration we access 64 contiguous entries that share the same state and evolve according to the same discrete Γ rate category. Our GPU kernel initializes the tips, and reads/writes the ancestral probability vectors according to this data layout (workgroup := 64), which is represented in Figure 6.3.

Figure 6.3: GPU memory layout of an ancestral probability vector with work-group size x. Only 2 discrete Γ rates are represented.

In order to improve performance, we applied optimization techniques such as loop unrolling [1], and we store the transition probability matrices and the Eigenvectors in shared memory (local memory of each SM). We also explicitly use registers to store global variables that are read and written several times during kernel execution. The newview() implementation is depicted in Figure 6.4.

6.4 Experimental setup and results

We simulated DNA datasets of different dimensions using INDELible [26] on random trees under the Jukes-Cantor model. Initially, we used INDELible (v1.03) to generate a large alignment of 15 taxa (species) and 900,000 sites. We then used a Ruby script to extract subsets of unique site patterns from this alignment so as to generate 15-taxon datasets with distinct numbers of unique sites. Thus, in these datasets, the number of sites actually corresponds to the number of distinct alignment patterns, which facilitates the discussion of the results. The use of simulated data is sufficient for measuring the performance of the PLL GPU version.

We executed the default RAxML-Light [91] search algorithm (version 1.0.5) to infer trees on these datasets. To conduct a realistic performance assessment for a real application using a GPU implementation of the PLF for the PLL, we re-implemented the search algorithm of RAxML-Light [91] using the PLL library. We compared the GPU implementation (see Section 6.3) and the AVX implementation based on the new generic memory layout (see Section 6.2) against the fastest serial version of RAxML-Light using the ad hoc AVX vectorization. The three code versions are implemented with the PLL library. They only differ with respect to their PLF implementations (newview(), evaluate(), sumGAMMA(), coreDerivative()). As mentioned before, the execution times for these functions dominate the run time. We manually instrumented the code to measure how much time is spent in each function. On the current datasets, the cumulative execution time of these four functions accounts for more than 95% of overall execution time on all datasets, and for 98% on the largest dataset with 262,144 unique site patterns.

Figure 6.4: GPU implementation of newview(). The CPU computes a traversal descriptor, which is a list of operations that must be executed before computing the likelihood at the virtual root. The host transfers the APV identifiers of q, r, and p, and the P matrices to the device (GPU). For each work-group, the corresponding P matrix is copied to local memory (LM). The APV entries for q and r are read from GPU global memory. The APV entries for p are computed on an SM and written to global memory.

Patterns   2048   4096   8192  16384   32768   65536  131072   262144
AVX        5.76  14.50  25.01  75.81  117.30  300.05  508.45  1503.48
AVX-NEW    6.20  14.90  25.29  77.23  117.57  302.24  510.82  1506.45
GPU       40.19  52.75  50.12  90.83   85.46  166.69  246.03   652.90

Table 6.1: Total run times (in seconds) for the RAxML-Light search algorithm. AVX-NEW corresponds to the AVX implementation with the new generic memory layout for APVs.

The source code and results are available at http://www.exelixis-lab.org/web/personal_page/izquierdo/gpu.tar.gz

It is straightforward to verify code correctness, since the RAxML-Light search algorithm is deterministic for given, fixed starting trees. Thus, it is sufficient to compare the resulting tree topologies and log likelihood scores.

We executed the AVX versions (standard and new generic layout) on an Intel i5-3550 CPU running at 3.30 GHz with 8 GB RAM. The GPU version was executed on the same host system, which is also equipped with an NVIDIA Tesla C2075 card (448 CUDA cores, 1.15 GHz, and 6 GB of GDDR5 memory).

The total execution times are shown in Table 6.1. We achieved overall speedups exceeding a factor of two for the longest test dataset (see Figure 6.5). The three PLL kernels show comparable speedups. We observed a maximum speedup of three for the derivative computation. For newview(), which consumes the largest fraction of execution time, we obtained a maximum speedup of two (see Figure 6.6).
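For illustration, the overall speedup can be recomputed directly from the measured run times in Table 6.1; a minimal sketch using the totals for the two largest pattern counts:

```ruby
# Run times in seconds from Table 6.1.
avx_time = 1503.48   # fastest serial AVX version, 262,144 patterns
gpu_time = 652.90    # GPU implementation, 262,144 patterns

speedup = avx_time / gpu_time
puts format('GPU speedup over AVX: %.2fx', speedup)  # ~2.30x

# The same ratio for 65,536 patterns (300.05 / 166.69) gives roughly 1.8x.
```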

The number of scaling multiplications is proportional to the number of taxa (see [89] for details). To verify the correctness of the numerical scaling procedure, we also generated and analyzed a dataset with 200 taxa to force the codes to conduct numerical scaling. For this larger dataset, we did not run the full RAxML-Light algorithm, but a benchmark that evaluates the likelihood and optimizes the branch lengths of a given starting tree. As Table 6.2 shows, scaling does not significantly affect GPU performance, that is, the speedups are comparable to those observed on the 15-taxa dataset, where scaling is not required.

Overall, there are no significant GPU speedups for DNA data. This is mainly because we are comparing the GPU code to what is probably the fastest currently available PLF implementation, which relies on code that has been manually vectorized and tuned with AVX intrinsics. Thus, the GPU speedups obtained for DNA data are substantially lower than those reported in the


Patterns  2048  4096  8192
AVX       2.04  4.13  8.25
GPU       4.76  5.46  7.66

Table 6.2: Total run times (in seconds) for 10 evaluations and branch length optimization of a 200-taxon tree.

Figure 6.5: Speedups for a full application run of RAxML-Light. The reference is the AVX version with the standard layout. (The plot shows total speedup versus the AVX version, for GPU and AVX-NEW, over the number of unique patterns, 1024 to 262144.)

BEAGLE paper. However, the speedups reported for BEAGLE compare the GPU performance of BEAST, MrBayes, and GARLI to plain C and SSE3-based implementations. Nonetheless, the performance of our PLL GPU implementation is expected to improve when models/data with more states are used, such as protein models, because they perform more computations per data access than for DNA data. Note that the amount of PLF floating point computations per site increases with the square of the number of states.
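A quick arithmetic illustration of that scaling argument; only the 4-state DNA and 20-state protein alphabets are taken from the text:

```ruby
# Per-site PLF floating point work grows with the square of the
# number of states: 4 states for DNA, 20 for protein models.
dna     = 4 ** 2   # relative per-site cost for DNA
protein = 20 ** 2  # relative per-site cost for protein data

ratio = protein.fdiv(dna)
puts "Protein models perform roughly #{ratio.round}x more per-site work than DNA."  # 25x
```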

Nonetheless, GPUs can yield two-fold speedups over our highly optimized manual AVX implementation. Another interesting observation is that almost no performance penalty is induced by using the more generic vectorization scheme with AVX instructions.


Figure 6.6: Speedups for each function of the PLL. The reference is the AVX version with the standard layout. (The plot shows total speedup versus the AVX version over the number of unique patterns, 1024 to 262144, for GPU-newview(), AVX-NEW-newview(), GPU-evaluate(), AVX-NEW-evaluate(), GPU-derivatives(), and AVX-NEW-derivatives().)

6.5 Summary

We have presented a GPU implementation of the main functions that are required for Bayesian and ML-based phylogenetic inference. In our approach, we store all ancestral probability vectors in GPU memory to avoid transferring large amounts of data between the GPU and the CPU. We have also introduced an alternative and more generic layout for the ancestral probability vectors, which is suitable for x64 vector intrinsics and GPU architectures. This layout facilitates porting the library to the larger x86 vector units that have recently become available.


Chapter 7

Perpetual Phylogenies with PUmPER

This chapter describes a framework named PUmPER, which makes use of the following publicly available open source tools: PHLAWD [81] (developed by Stephen A. Smith, who also implemented the features described in Subsection 7.2.1), RAxML-Light [91], Parsimonator [85], and standard-RAxML [87] (developed by Alexandros Stamatakis). John Cazes implemented the scripts described in Subsection 7.3.3. The content of this chapter has been derived from the following peer-reviewed publication: F. Izquierdo-Carrasco, J. Cazes, S. Smith, and A. Stamatakis. PUmPER: Phylogenies Updated Perpetually. Bioinformatics, 30(10):1476–1477, 2014.

In this chapter, we focus on topics related to the automated inference and extension of phylogenetic trees. While the previous chapters focused on low-level optimizations, here we discuss the benefits of automation with a focus on saving man-hours.

Existing phylogenies of taxonomic groups need to be updated as new data for new species, individuals, and/or new genes are added to databases. The straightforward approach is to re-initiate phylogenetic inferences from scratch (every time data are added to public databases), which represents a waste of effort (man-hours) and computations/energy. Nonetheless, adding new taxa or genes to a phylogenetic tree may also unravel new evolutionary relationships that were not supported by previous, smaller datasets. Albeit still an on-going debate, the taxon sampling density can have an impact on final tree shapes [9, 117].


Thus, it is worth exploring the tree topologies generated by datasets whose taxon sampling has been extended. In this chapter, we address these challenges and present a framework called PUmPER (Phylogenies Updated Perpetually). PUmPER can iteratively construct multi-gene alignments (with PHLAWD) and phylogenetic trees (with RAxML-Light) for a given NCBI taxonomic group. Existing large reference phylogenies (and alignments) can be extended with PUmPER without human intervention, and without the need to re-compute everything from scratch.

According to its configuration, PUmPER can detect when a sufficient amount of new data for the target clade and genes is available in GenBank, and automatically update existing alignments and phylogenies. We call this procedure a perpetual tree update. This can be helpful for commonly used datasets such as rbcL for seed plants, which has been used for broad-scale phylogenies since 1993 [13], and also for large ribosomal datasets that are used extensively [68].

PUmPER can be deployed either as a stand-alone tool on single machines, or on High Performance Computing systems, where phylogenetic inferences can be offloaded to a cluster. PUmPER is available (under the GNU GPL license) at https://github.com/fizquierdo/perpetually-updated-trees. PUmPER not only allows for extending phylogenies at a lower cost (in terms of energy and man-hours), but also yields equally good likelihood trees as de novo tree inferences conducted from scratch.

The rest of this chapter is organized as follows. In Section 7.1, we discuss previous work on automated inference and extension of phylogenies. In Section 7.2, we provide an overview of the PUmPER framework. Specific details about the software design of the stand-alone and distributed versions are provided in Section 7.3, including a description of the pipeline setup that perpetually updates the Viridiplantae clade. An evaluation of this pipeline is presented in Section 7.4.

7.1 Related Work

Previous work on perpetually updated trees focused on a framework called mor, which was designed for maintaining a specific automated taxonomy of Homobasidiomycetes [32]. The mor framework retrieves, screens, aligns, and analyzes nuc-lsu rDNA (nuclear large subunit ribosomal DNA) sequences of Homobasidiomycetes from GenBank and generates three phylogenetic trees each week. It generates an unconstrained jackknife neighbor-joining tree, a topologically constrained maximum parsimony consensus tree, and a topologically constrained maximum likelihood tree. While mor continuously updates


phylogenies, its main purpose is to produce automated taxonomies. To this end, a 'clade parser' is used to translate trees into rank-free classifications using node-based phylogenetic taxon definitions. For details, see [32]. The mor framework is accessible as a web service, where phylogenetic trees are published and archived on a weekly basis (at http://mor.clarku.edu). The mor software is written in Perl and available as open-source code. It can, in principle, be modified to work with any gene and group of organisms. At present, however, the web service is not being actively developed anymore (pers. comm. with David Hibbett; March 27, 2013). The last archived files date back to 2008. The latest sequence added dates back to 2010, and the largest Maximum Likelihood tree produced by the system comprises 8,019 taxa.

Further work has been done on automating the process of phylogenetic inference. STAP (Small Subunit rRNA Taxonomy and Alignment Pipeline) [108] is a pipeline that uses publicly available packages, such as ClustalW [103], PhyML [29], and BLASTN [3], to automate the process of phylogenetic inference and taxonomic assignment for ss-rRNA sequence data. STAP retrieves data from two public databases of ss-rRNA sequences: Greengenes [16] and RDP-II [49]. STAP is a collection of scripts available as a Perl package.

Phylometrics [84] is another automated pipeline for the inference of phylogenetic trees. Like STAP, it uses BLASTN hits and ClustalW for alignment building, and PhyML for Maximum Likelihood tree inference. The pipeline is implemented in PHP as a web application, which can be installed locally or hosted remotely. In addition, batch jobs can be queued for each stage of the pipeline.

The total running time and memory usage of such automated pipelines depend on the performance of their core components (e.g., ClustalW for alignment and PhyML for tree inference). They can also be compared in terms of feature availability, extendability, and usability. In this chapter, we present a flexible framework that can be used to automate the process of alignment construction and tree inference for arbitrary taxonomic groups. It can be deployed on both standalone systems (single machines) and High Performance Computing systems, enabling the long-term availability of up-to-date phylogenetic trees.

7.2 Framework Overview

In the following, we outline the structure of PUmPER, our framework for perpetually updating phylogenetic trees of more than 20,000 taxa.

PUmPER is composed of (i) the multiple sequence alignment (MSA) generation/extension component, and (ii) the phylogenetic tree inference component, which infers/extends the trees via Maximum Likelihood (ML) tree searches. The MSA component is an extension of PHLAWD [81], which can retrieve GenBank sequences and subsequently build or extend MSAs. The tree inference component is based on RAxML-Light [91], a dedicated HPC version of RAxML that can be executed on clusters using the Message Passing Interface (MPI). It can be used to infer new trees from scratch or to extend given trees by inserting additional taxa.

Figure 7.1: Initial iteration. An initial alignment is built for sequences covering a given clade and description search term (gene). Parsimony starting trees are used for Maximum Likelihood searches. The best trees are kept.

On top of these two components, we developed an iterative procedure that perpetually updates trees. Each iteration consists of two stages: the generation of an MSA, and the subsequent inference of a set of trees based on the generated MSA.

The initial iteration is special, since it builds the initial MSA and ML tree set from scratch. We call the remaining iterations update iterations, because they simply extend the MSAs and trees of the preceding iteration.

The setup of the initial iteration (see Figure 7.1) involves editing a configuration file for PHLAWD. In this file, the user must provide the NCBI taxonomic rank (clade name) and the gene(s) for which an MSA shall be assembled. PUmPER invokes PHLAWD to query GenBank and construct an initial MSA.

PUmPER uses Parsimonator to generate an initial set of distinct (randomized stepwise addition order) parsimony starting trees based on this initial MSA. For each parsimony tree, PUmPER conducts an independent ML tree search with RAxML-Light to generate a set of ML trees. The user can specify the number of tree searches to be conducted and the size of the tree set to be kept in a configuration file.

In an update iteration (see Figure 7.2), PUmPER carries out the following four steps:

1. Update (re-assembly) of the MSA with PHLAWD with new GenBank data according to the initial configuration file.


Figure 7.2: Update iteration. The alignment is extended, and previous trees are re-used to continue searching ML space with a set of different starting trees.

2. Generation of distinct randomized stepwise addition order parsimony starting trees with Parsimonator to extend the set of trees from the previous iteration with the newly added taxa.

3. ML optimization of the comprehensive parsimony starting trees (from step 2) with RAxML-Light.

4. Selection of a subset of these ML trees (based on their likelihood scores) that will be used as starting points for the next iteration.
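The four steps above can be summarized in code. This is only an illustrative sketch: the callables stand in for the PHLAWD, Parsimonator, and RAxML-Light invocations, and the method name is hypothetical.

```ruby
# Sketch of one PUmPER update iteration (illustrative only).
# extend_msa, make_parsimony_trees, and optimize stand in for the
# PHLAWD, Parsimonator, and RAxML-Light calls, respectively.
def update_iteration(extend_msa, make_parsimony_trees, optimize, prev_trees, p, b)
  msa = extend_msa.call                              # step 1: extend the MSA
  starting_trees = prev_trees.flat_map do |tree|     # step 2: p comprehensive
    make_parsimony_trees.call(tree, msa, p)          #   parsimony trees per kept tree
  end
  ml_trees = starting_trees.map { |t| optimize.call(t, msa) }  # step 3: ML optimization
  ml_trees.max_by(b) { |t| t[:lnl] }                 # step 4: keep the b best trees
end
```

With p parsimony trees generated from each of the trees kept in the previous iteration, the number of ML searches per update iteration is the product of the two settings.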

7.2.1 MSA Construction/Extension with PHLAWD

MSA construction in PHLAWD has been described in [81], but for the purposes of understanding the perpetual procedure, we briefly outline the basic PHLAWD procedure again. PHLAWD requires the user to supply a configuration file that specifies for which organisms (as defined by the NCBI taxonomy) and which gene region(s) to construct a dataset. The user can identify the focal gene region by supplying a set of exemplar sequences. These sequences will be used for pairwise alignments and homology assessment. Additionally, the user can provide search terms that will be compared against the description of the sequences in the database to limit the scope of the sequence search. These are used in a Smith–Waterman procedure that discards sequences that are too dissimilar to the exemplars. Using the remaining sequences, PHLAWD attempts to construct MSAs. If the sequences included in the MSA are too divergent to construct a reliable MSA, PHLAWD splits up these sequences based on a guide tree, the default of which is the NCBI taxonomy. These subsets are initially aligned independently (with MAFFT [42]); then profile


alignment with MUSCLE [19] is used to align the subsets using the guide tree.

The PUmPER framework requires some adaptations in PHLAWD. We implemented an option to also pass user-supplied sequences to PHLAWD, in addition to retrieving them from GenBank. This facilitated tests with simulated data, and also allows the user to extend the sequence set with sequences that are not available in GenBank (e.g., simulated data or unpublished sequences). For tests with simulated data, we also extended PHLAWD to read in user-supplied comprehensive (containing all taxa) or non-comprehensive (containing a subset of taxa) MSA guide trees. These trees can be used as an alternative to the NCBI taxonomy for splitting up sets of sequences for profile alignment and putting them back together. This feature can also be used to assist in the profile alignments of the user-supplied sequences. We also changed the underlying PHLAWD file organization. Previously, PHLAWD stored all intermediate alignments and other information in flat files. Now, all files produced by a PHLAWD run are stored in a SQLite database file. This allows for easier replication of the PHLAWD MSA procedure by storing the order of subset alignment profiling, which is required for PHLAWD-based MSA extension when new sequences are added. The database also stores sequences that have been retrieved from GenBank and included in the MSA, as well as sequences that have been added by users. We extended PHLAWD to allow for updating this local database with new sequences.

When PHLAWD has already been executed once and new sequences were added to the database (automatically or by the user), PHLAWD can be executed in update mode (updatedb option). Initially, the sequences of each new taxon are aligned to the closest existing subalignment. Depending on the information available, the closest subalignment is determined either by taxonomy or by sequence similarity. Then, PHLAWD executes profile-profile alignments in the same order as in the original run. The information about the sub-alignment profile-profile alignment order is stored in the SQLite database. If the new MSA comprises at least 20% more species than the preceding alignment, PHLAWD will re-divide the sequences into subsets and re-align them from scratch. The 20% threshold was empirically determined as a 'good' default value, but can be changed by the user.
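The re-alignment rule can be sketched as a small predicate; this is illustrative only, with a hypothetical method name and made-up species counts, not part of the actual PHLAWD code:

```ruby
# Decide whether PHLAWD should re-divide the sequences and re-align
# from scratch: this happens when the updated MSA holds at least 20%
# more species than the preceding alignment (default threshold).
def realign_from_scratch?(old_species, new_species, threshold = 0.20)
  new_species >= old_species * (1.0 + threshold)
end

realign_from_scratch?(1000, 1150)  # => false: extend existing subalignments
realign_from_scratch?(1000, 1250)  # => true:  re-divide and re-align
```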

Automated Assembly of Multi-gene datasets with PHLAWD

PUmPER also supports the generation of multi-gene alignments. For each gene region of interest, an independent PHLAWD instance (single-gene MSA) is run. Each instance has its own configuration and exemplar sequence file. Thereafter, PUmPER concatenates all single-gene MSAs into a multi-partitioned


dataset and stores the gene boundaries in a RAxML-formatted partition file. During an update iteration, each PHLAWD instance is extended independently (as described above).

The end result of a PHLAWD stage, be it initial or update, is a supermatrix stored in PHYLIP format in a per-iteration folder, which acts as an interface with the phylogenetic inference stage. Since these two stages are decoupled, it is straightforward to substitute PHLAWD with another MSA construction method, that is, user-provided alignments can be used seamlessly with PUmPER.

7.2.2 Phylogenetic Inference

The second stage of every iteration is the phylogenetic inference of a set of trees based on the most recent MSA. The number of independent tree searches conducted at each iteration depends on two user parameters: the parsimony parameter p, and the size b of the tree set that shall be selected and kept in the end. In our experiments, we used constant values for p and b over all iterations (p := 30, b := 10 for initial iterations and p := 3, b := 10 for update iterations). These parameters can be modified by the user for each individual iteration. We denote the values of p and b for iteration i as p(i) and b(i).

Generation of starting trees

In the initial iteration i := 0, p determines how many randomized stepwise addition order parsimony starting trees will be generated (i.e., Parsimonator is invoked with p distinct random seeds).

In an update iteration i > 0, given an extended MSA and a set of selected ML trees (from the preceding iteration), p(i) denotes how many comprehensive randomized stepwise addition order parsimony starting trees will be generated from the tree set of iteration i − 1. Thus, PUmPER calls Parsimonator to extend the b(i − 1) trees from the preceding iteration. In each call, p(i) distinct comprehensive parsimony trees are generated from the same preceding (non-comprehensive) ML tree.

Maximum Likelihood Inference

Each comprehensive parsimony starting tree topology is then optimized under ML with RAxML-Light. Thus, PUmPER conducts p(0) ML searches for the initial iteration (i = 0) and p(i) · b(i − 1) ML searches for all consecutive


iterations (i > 0). The choice of which flavor of RAxML-Light will be deployed (SSE3/AVX vectorization, Pthreads, or MPI) depends on the available hardware.
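Under these definitions, the number of ML searches per iteration can be written as a small helper; the sketch below uses the constant settings from our experiments (p(0) = 30, p(i) = 3, b(i) = 10):

```ruby
# Number of independent ML searches in iteration i:
#   p(0)             for the initial iteration (i = 0)
#   p(i) * b(i - 1)  for update iterations (i > 0)
def ml_searches(i, p, b)
  i.zero? ? p[0] : p[i] * b[i - 1]
end

p_settings = { 0 => 30, 1 => 3, 2 => 3 }  # parsimony starting trees per iteration
b_settings = { 0 => 10, 1 => 10 }         # trees kept per iteration

ml_searches(0, p_settings, b_settings)  # => 30 searches in the initial iteration
ml_searches(1, p_settings, b_settings)  # => 30 searches in the first update
```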

Scoring and selecting the best trees

PUmPER waits until all ML searches have finished, and then scores the p(i) · b(i − 1) topologies (-f J option) with standard-RAxML [87] under the Γ model of rate heterogeneity [110]. This option also computes SH-like branch support values as described in [29]. Then, PUmPER selects the b(i) best-scoring ML tree topologies, which will be the starting tree set used in iteration i + 1, and finalizes the iteration.

7.2.3 Manual and automatic tree updates

Update iterations can be initiated manually through the command line interface. However, these updates can also be triggered automatically. An update iteration is started if (i) the alignment from the previous iteration has been extended, and (ii) the phylogenetic analyses of the previous iteration have been completed. PUmPER can generate a cron job that periodically checks whether the two conditions are true.

The MSA extension using PHLAWD can be automated via another cron job that periodically (default: once per week) queries GenBank and launches PHLAWD to extend the MSA if enough new sequences (according to the configuration) have become available.

7.3 Software and Availability

PUmPER is available as open source code. In terms of installation requirements, all components used in the framework are open source and publicly available at http://phlawd.net (PHLAWD) and http://www.exelixis-lab.org/ (RAxML). The design comprises Ruby modules that can be included in Ruby scripts. Each Ruby module encapsulates some independent functionality, that is, the user does not need to be aware of the specific usage of the underlying tools.

For instance, in Code Sample 1 we show part of the class implementation that abstracts the PHLAWD usage. The hash @opts contains a list of parameters read from user input and configuration files, and defines the configuration of PHLAWD, as well as the database to be used.


Ruby Code Sample 1: Code extract from the PHLAWD wrapper. Multiple instances of PHLAWD are generated to independently build gene multiple sequence alignments.

class Phlawd
  def initialize(opts, log)
    @opts = opts
    @phlawd_runner = PhlawdRunner.new(log, @opts['phlawd_binary'])
    @instances = find_folder_instances
    @genbank_db = GenbankDB.new(@instances, @opts)
  end

  def run_initial
    @phlawd_iteration = PhlawdIteration.new(@phlawd_runner)
    # Run phlawd sequentially
    valid_instances.each do |instance|
      instance.run_initial unless File.exist? instance.result_file
      @phlawd_iteration.add_fasta_alignment instance
    end
    @phlawd_iteration
  end

  def find_folder_instances
    working_dir = @opts['phlawd_working_dir']
    instances = []
    Dir.entries(working_dir).reject { |f| f =~ /^\.+$/ }.each do |f|
      path = File.join working_dir, f
      if File.directory? path
        instances << PhlawdInstance.new(path, @phlawd_runner)
      end
    end
    instances
  end
end


The application programmer can generate multi-gene alignments with a simple call, as shown in Code Sample 2. Further detailed examples and configuration details are available in the code repository.

Ruby Code Sample 2: Calling PHLAWD from the PUmPER framework. A multi-gene dataset can be generated with automated subsequent calls to PHLAWD.

opts = PerpetualTreeConfiguration::Configurator.new(ARGV.first)
log = PerpetualTreeUtils::MultiLogger.new
phlawd = PerpetualPhlawd::Phlawd.new(opts, log)
msa = phlawd.run_initial

Configuration files are used to determine specific settings. While our main use case is the automated update of phylogenetic trees, the framework can easily be used to build custom phylogenetic pipelines. For example, if alignments are already available, the PHLAWD component can be omitted. The on-line repository includes an installation guide, as well as basic usage and configuration examples.

7.3.1 Standalone implementation

When using the default configuration, PUmPER operates in stand-alone mode on a single server. PHLAWD and RAxML-Light are executed locally on this server. The distinct RAxML-Light tree searches are conducted one after the other, but the framework can be configured such that the Pthreads version of RAxML-Light is used on a multi-core machine. RAxML-Light implements the memory saving techniques described in Section 3.3 and Section 3.4. Thus, this stand-alone version allows for updating large trees on a medium-sized lab server.

7.3.2 Distributed implementation

For large trees, the computational resources of a single server may be insufficient due to memory and/or time constraints. Thus, PUmPER can also offload the computationally intense ML calculations to a cluster system. Thereby, the tree can be updated in a timely manner while the process is still orchestrated on a local server. This requires PUmPER to interface with remote systems using standard communication tools (scp and ssh) and batch schedulers (SGE, SLURM, etc.), and to also use executables that have been optimized


for the remote system (Parsimonator, RAxML-Light, and RAxML). Although this adds another level of complexity, it is required for trees that take days and/or multiple nodes to process.

In the PUmPER framework, the local server orchestrates the remote workflow (ML calculations). By using cluster-specific configuration files and batch scheduler templates, the local server creates and submits batch scripts to execute steps 2, 3, and 4 of the update process (see Section 7.2). At the end of each step, the batch job transfers the results back to the local system.

The ML optimization of step 3 may run for a long time and require (multiple) restarts from a checkpoint file. RAxML-Light offers such a checkpoint and restart facility, which allows for conducting a single tree inference in multiple steps if the run time exceeds the queue limits. For instance, for the setup described in Subsection 7.3.3, the standard queue time limit was 24 hours, which was occasionally exceeded, thus requiring a restart.

We have successfully used PUmPER with two popular job submission engines: SGE and SLURM. It should be straightforward to adapt the current template files to other schedulers. However, cluster setups, security policies, queuing system configurations, etc., differ on each individual installation. Therefore, deploying PUmPER in conjunction with a cluster using SGE or SLURM might still need manual reconfiguration.

7.3.3 Custom iPlant setup

We are currently running a pipeline based on PUmPER as part of the iPlant collaborative (http://www.iplantcollaborative.org/). The goal is to maintain and make available perpetually updated trees for the Viridiplantae clade (using the rbcL, matK, and atpB genes). In this section, we describe some details concerning the setup of this pipeline.

We use a dedicated server to control the workflow on a remote cluster. This server, Wooster, is a dedicated virtual machine (VM) provided by the iPlant collaborative to orchestrate the inference of perpetually updated trees. It is configured with 8 Intel Westmere cores and 16 GB of memory. The processes running on the local server are relatively lightweight.

There were two clusters available, both at the Texas Advanced Computing Center (TACC) and part of the XSEDE (Extreme Science and Engineering Discovery Environment) program. The Linux cluster, Lonestar, was used during the development phase of the cluster computing component. The Lonestar cluster is composed of just under 2,000 compute nodes, each with two 6-core Intel Westmere processors and 24 GB of RAM. Lonestar uses the Sun Grid Engine (SGE) batch facility to schedule jobs and allows users to connect directly via ssh. This batch scheduling facility, along with the use of ssh for


remote commands, allows the use of a locally managed master server, in this case Wooster, to distribute the computationally intensive tasks. By using template files to describe the cluster and the appropriate batch system, we have developed the cluster component to be portable to most HPC systems.

In January 2013, a newer, more powerful cluster, Stampede, was made available at TACC, and is accessible to iPlant via an XSEDE project. Stampede is composed of 6,400 nodes, each with two 8-core Intel Sandy Bridge processors, 32 GB of RAM, and a Xeon Phi coprocessor. Although RAxML, RAxML-Light, and Parsimonator do not yet take advantage of the Xeon Phi, the RAxML family of codes contains AVX optimizations to take advantage of the Sandy Bridge processors. Stampede differs from Lonestar in a few areas that require changes to the template file.

Since the cluster file and batch template files were developed for use on Lonestar and the SGE system, they had to be modified to run on the newer system, Stampede. This is easily done for the cluster file, which contains a description of the compute node resources and the path to the appropriate binaries. Below is an example of a cluster template file for Stampede.

# Info about cluster
cores_per_node: 16
mem_per_node: 31000
# Project required for batch scheduler
project: TG-MCB110022
submission: slurm
# Installed binaries, absolute paths in remote machine
parsimonator: ~/remote/wooster/bin/parsimonator-AVX
raxmllight: ~/remote/wooster/bin/raxmlLight-AVX
raxmllight_MPI: ~/remote/wooster/bin/raxmlLight-MPI-AVX
raxmllight_pthreads: ~/remote/wooster/bin/raxmlLight-PTHREADS-AVX
raxmlHPC_pthreads: ~/remote/wooster/bin/raxmlHPC-PTHREADS-SSE3

Since most of the job control logic has been integrated into the local server rather than the scheduler, the batch script templates are simple and may be ported to most batch scheduling systems. The only part that must be ported is the batch directives. An example of the SLURM directives required to run the RAxML-Light component of PUmPER is given below.

#SBATCH -J raxmllight_<%=params[:exp_name_run_num]%>

#SBATCH -d singleton # ensures that each job with this name

# will only run one at a time


#SBATCH -n <%=params[:num_tasks]%>

#SBATCH -p normal

#SBATCH -o raxmllight_<%=params[:exp_name_run_num]%>.o%j

#SBATCH -e raxmllight_<%=params[:exp_name_run_num]%>.o%j

#SBATCH -t 24:00:00

#SBATCH -A <%=params[:project]%>

The params variables are set by the local server to launch individually named jobs. The only requirement on the scheduler is that it should be able to handle a simple job dependency in which jobs of the same name are run consecutively. The logic of staging input files and starting or restarting the RAxML-Light component is handled in the body of the batch file and should not need to be changed from system to system.
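The placeholders in the directives above are ERB expressions. Since PUmPER is written in Ruby, the expansion step can be sketched with the standard ERB library; the parameter values below are made up for illustration, and PUmPER's actual template handling may differ in detail.

```ruby
require 'erb'

# Illustrative parameter hash; the keys mirror the placeholders in the
# SLURM snippet above, the values are invented for this sketch.
params = { exp_name_run_num: 'embryophyta_run3',
           num_tasks: 48,
           project: 'TG-MCB110022' }

template = <<~ERB
  #SBATCH -J raxmllight_<%= params[:exp_name_run_num] %>
  #SBATCH -n <%= params[:num_tasks] %>
  #SBATCH -A <%= params[:project] %>
ERB

# Render the directives with the current parameters before submission.
script = ERB.new(template).result(binding)
puts script
```

Because only the directive lines are scheduler-specific, swapping SLURM for SGE reduces to exchanging this template.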

In our TACC setup, at the end of each iteration the best tree is uploaded to the iPlant collaborative tree visualization system, which uses Phyloviewer (publicly available at http://portnoy.iplantcollaborative.org/) to create a tree visualization on the iPlant server, Portnoy, with a link to the latest tree.

For example, Figure 7.3 shows the best-scoring and most up-to-date tree from Table 7.2.

7.4 Evaluation and Results

We have tested and evaluated PUmPER with simulated and real biological datasets. For each experiment (e.g., different clade or gene), we executed several iterations. For each update iteration, we also executed a control run which we denote as scratch iteration. A scratch iteration behaves like an initial iteration, that is, it builds the MSA from scratch on all sequences. It also executes the same number of independent ML tree searches as the update iteration, but without using previous topologies. We used the CONSEL package [78] to statistically assess whether update and scratch iterations yield topologies with significantly different likelihood scores.

7.4.1 Biological examples

We have constructed two biologically relevant datasets for testing the PUmPER approach. The first dataset consists of the rbcL gene region for the clade of land plants (Embryophyta). The second dataset consists of the 18S ribosomal region for the Eukaryota clade.

PHLAWD uses identity and coverage to determine which sequences are similar enough based on Smith–Waterman comparisons. Values of 0.5 were used for rbcL and 0.4 for 18S. These thresholds were based on plots of identity and coverage produced with the seqquery command in PHLAWD. In order to simulate the update procedure on real data, each of these two datasets was created for sequences available before January 2008, before January 2010, and before September 2012.

Figure 7.3: A perpetually updated tree for the 18S gene. Viewed with Phyloviewer (http://portnoy.iplantcollaborative.org).


Iteration            Taxa   Sites  Avg LH (30)   Avg LH (10)   Runtime (h)  Avg Branch support
2008                 12072  1437   -848794.80    -848745.23    46.55        67.78
2010                 16962  1427   -1005824.25   -1005762.81   68.36        64.25
2010 scratch         16962  1427   -1005931.37   -1005863.32   70.89        64.26
2012 (Sept)          21791  1424   -1108161.66   -1107598.42   93.40        59.56
2012 (Sept) scratch  21791  1424   -1108243.29   -1107774.80   97.42        59.46

Table 7.1: Original run and two updates of the rbcL datasets. Average ML scores at the end of each iteration (averaged over all 30 trees and the 10 best trees) and overall runtimes of all searches. The branch support is the average of SH-like support values on the best tree. The running time is the sum of the 30 ML searches.


Iteration            Taxa   Sites  Avg LH (30)   Avg LH (10)   Runtime (h)  Avg Branch support
2008                 14634  2363   -1950468.79   -1950281.56   74.16        70.24
2010                 18480  2214   -2340314.06   -2340142.38   88.73        68.73
2010 scratch         18480  2214   -2340563.92   -2340260.37   94.08        68.59
2012 (Sept)          23298  2110   -2782234.35   -2781965.38   116.48       68.13
2012 (Sept) scratch  23298  2110   -2782132.94   -2781959.11   124.23       68.00

Table 7.2: Original run and two updates of the 18S datasets. Average ML score at the end of each iteration (averaged over all 30 trees and the 10 best trees) and overall runtime of all searches. The branch support is the average of SH-like support values on the best tree. The running time is the sum of the 30 ML searches.


Table 7.1 and Table 7.2 show the average ML scores for three update iterations. Each update iteration was run with parameters p := 3, b := 10, resulting in 30 independent ML searches.

We concatenated all resulting trees from scratch and update iterations based on the same real biological MSA, and used the CONSEL package [78] to assess the confidence of phylogenetic tree selection, that is, which trees were significantly better in terms of likelihood scores. The confidence set according to the approximately unbiased (AU) test [76], delimited by a p-value of p := 0.05, included for each year and dataset topologies from both the scratch and update iterations.

For example, in Table 7.3 we present the resulting confidence set of trees (out of all 60 possible trees) that were selected by CONSEL for the 18S dataset update labelled as 2010 (see Table 7.2). Trees with ids from 1 to 30 correspond to the PUmPER trees, and trees with ids from 31 to 60 correspond to the restart approach. We clearly see that these tree sets, which are not significantly different from each other, include trees from both approaches.
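The selection criterion behind the confidence set can be illustrated with a few lines of Ruby. This is a sketch of the filtering step only, not the CONSEL implementation; the three sample entries are taken from Table 7.3.

```ruby
# Keep trees whose AU p-value exceeds 0.05 and check that both approaches
# (PUmPER: ids 1..30, scratch: ids 31..60) remain represented.
au_results = [
  { id: 18, p: 0.701 }, # a PUmPER tree
  { id: 41, p: 0.517 }, # a scratch tree
  { id: 42, p: 0.034 }, # below the threshold, excluded
]

confidence_set = au_results.select { |t| t[:p] > 0.05 }
pumper  = confidence_set.count { |t| t[:id] <= 30 }
scratch = confidence_set.count { |t| t[:id] > 30 }
puts "confidence set: #{confidence_set.size} trees " \
     "(#{pumper} PUmPER, #{scratch} scratch)"
```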


Rank Tree id LH difference AU (p-value)

1    18   -26.5   0.701
2    29    26.5   0.666
3    26    36.6   0.672
4    27    71.5   0.602
5    41   125.7   0.517
6    1    133.9   0.504
7    60   140.6   0.476
8    40   145.5   0.464
9    20   167.4   0.448
10   59   173.3   0.443
11   52   183.8   0.462
12   36   185.6   0.423
13   19   199.8   0.317
14   6    203.8   0.358
15   8    230.0   0.308
16   2    234.3   0.343
17   16   271.5   0.270
18   11   279.4   0.274
19   30   320.0   0.201
20   15   320.6   0.284
21   13   345.0   0.253
22   14   345.2   0.178
23   21   356.5   0.222
24   4    375.2   0.177
25   39   413.6   0.189
26   28   428.8   0.122
27   10   429.0   0.111
28   57   430.0   0.171
29   56   440.3   0.203
30   25   441.7   0.105
31   55   456.8   0.138
32   7    458.5   0.138
33   9    460.3   0.049
34   51   470.2   0.095
35   12   471.0   0.103
36   23   517.3   0.077
37   22   528.8   0.052
38   42   545.7   0.034
39   45   558.8   0.082
40   32   562.1   0.057
41   37   616.9   0.042
42   47   620.1   0.057

Table 7.3: AU test for iteration 2010 of the 18S dataset. Trees are ranked by log likelihood (first column). The second column shows the tree id; ids 1 to 30 are PUmPER trees, ids 31 to 60 scratch trees. The third column shows the log LH difference with the best tree. The fourth column is the p-value as computed by the AU (approximately unbiased) test. Only trees within the confidence set (p > 0.05) are shown.


7.4.2 Simulated Data

We used INDELible [26] to generate a simulated dataset with 9079 taxa on a tree inferred on the rbcL gene for the Viridiplantae clade. We used the simulated sequences and the underlying true tree as guide tree to assess the iterative MSA update procedure in PHLAWD.

From the 9079 existing sequences in the simulated alignment, we selected 4000 at random as user sequences to generate the initial PHLAWD-based MSA. The remaining sequences were randomly distributed across 3 update blocks of 1345, 1396, and 2329 sequences. Each update block was used as user sequences to extend the MSA, generating extended MSAs of 5345, 6741, and 9079 taxa.

We pruned the true tree accordingly such that for each MSA and iteration a corresponding true tree was available. We used these pruned true trees to determine the topological accuracy of the inferred trees (at each iteration) using the Robinson-Foulds distance [69], denoted as RF-distance.
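The RF-distance counts the non-trivial bipartitions that occur in one tree but not in the other. The following is a minimal sketch with trees written as nested arrays; the helper is illustrative and not the implementation used in this thesis.

```ruby
require 'set'

# Collect the non-trivial bipartitions (splits) of a tree. Each split is
# canonicalized to the side that does not contain a fixed reference taxon,
# so that a split and its complement compare equal.
def splits_of(tree, all_taxa)
  ref = all_taxa.min
  splits = Set.new
  walk = lambda do |node|
    return Set[node] unless node.is_a?(Array)
    below = node.map { |child| walk.call(child) }.reduce(:|)
    side = below.include?(ref) ? all_taxa - below : below
    # Skip trivial splits: a single taxon or its complement.
    splits << side if side.size > 1 && side.size < all_taxa.size - 1
    below
  end
  walk.call(tree)
  splits
end

taxa = %w[A B C D E].to_set
t1 = [%w[A B], ['C', %w[D E]]] # ((A,B),(C,(D,E)))
t2 = [%w[A C], ['B', %w[D E]]] # ((A,C),(B,(D,E)))

s1 = splits_of(t1, taxa)
s2 = splits_of(t2, taxa)
rf = (s1 - s2).size + (s2 - s1).size # symmetric difference of split sets
puts "RF distance: #{rf}"
```

Here the two trees agree on the DE|ABC split but disagree on one split each (AB|CDE versus AC|BDE), giving an RF-distance of 2.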

As before, each iteration included a total of 30 independent ML searches. We observed that starting from extended topologies (update iterations) neither increases nor decreases topological accuracy (see Table 7.4).


Iteration   Taxa  Sites  Avg LH (30)   Avg LH (10)   RF (true tree)  Runtime (h)  Avg support
0           4000  1500   -589036.97    -589035.58    0.146           2.92         77.63
1           5345  1500   -715683.68    -715682.66    0.163           7.46         76.14
1 scratch   5345  1500   -715682.75    -715681.41    0.162           7.89         76.16
2           6741  1500   -838437.33    -838436.48    0.176           6.21         74.72
2 scratch   6741  1500   -838440.38    -838438.06    0.174           8.17         74.73
3           9079  1500   -1033499.23   -1033498.23   0.184           10.10        73.12
3 scratch   9079  1500   -1033498.15   -1033495.85   0.185           19.27        73.12

Table 7.4: Original run and three simulated updates of the simulated datasets. Average likelihood at the end of each iteration (30 total trees and 10 best trees) and total runtime of all searches. The branch support values are the average of SH-like support values of the best tree. The running time is the sum of the 30 maximum likelihood searches.


7.5 Discussion

According to our first results, the iterative MSA and tree extension (PUmPER) approach yields neither significantly better nor worse trees than the standard (inference from scratch) approach with respect to the likelihood scores. The topological accuracy in our simulations is comparable for both approaches. The relatively high RF distances are expected for phylogenetic reconstructions on short simulated single-gene datasets [44].

The runtimes in the PUmPER (iteration-based) approach are slightly, but consistently, lower. We view the main contribution, however, in saving man-hours: alignment construction, job setup, filtering, and post-analyzing results are tedious tasks that consume a significant, and hard to estimate, amount of time.

The framework offers the required flexibility to set up self-maintained ongoing analyses such as, for example, simultaneous perpetual inference of gene/species trees using multi-gene and single-gene phylogenetic inferences.

7.6 Summary

We have presented and made available a framework named PUmPER that can be used to maintain and perpetually update phylogenetic trees. We have used this framework to implement and set up a pipeline for updating phylogenies based on multi-gene alignments for the Viridiplantae clade. PUmPER can operate in stand-alone mode on a single server, but can also offload computationally expensive ML searches to an external cluster. The perpetually updated phylogenies can be computed slightly faster and are neither significantly (in the statistical sense) worse nor better than phylogenies that are inferred from scratch.

We are currently using PUmPER to maintain a perpetually updated phylogeny for the green plants within the framework of the iPlant collaborative.


Chapter 8

Conclusion And Outlook

8.1 Conclusion

The main goal of this thesis was to study and develop methods for analysing large-scale datasets for Maximum Likelihood based phylogenetic inference. We have approached this challenge using different strategies: reduction of memory requirements, reduction of running time, and reduction of man-hours.

The reduction of memory footprints involved three different techniques. The out-of-core and the recomputation strategy share the idea of only storing a subset of the required ancestral probability vectors in memory. The remaining vectors are stored on disk (out-of-core) or obtained through additional computations (recomputation). Our assessment shows that the recomputation approach clearly outperforms the out-of-core strategy. We have proved that the recomputation technique can compute the log likelihood by storing only log2(n) + 2 APVs for trees with n taxa. Furthermore, the overhead in running time is negligible for full tree traversals. For partial tree traversals, the overhead is typically 10% in running time when only half of the required APVs are stored in memory. These features can be useful for inference of large trees when memory is a limiting factor.
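The practical effect of the log2(n) + 2 bound can be checked with a short back-of-the-envelope calculation. The ceiling and the example tree size below are our own illustrative choices; the baseline of n - 2 corresponds to one APV per inner node of an unrooted binary tree.

```ruby
# Number of ancestral probability vectors the recomputation technique must
# keep in memory for a tree with n taxa (integer count via ceiling).
def apvs_needed(n_taxa)
  Math.log2(n_taxa).ceil + 2
end

n = 55_000 # a tree size in the range of the large datasets discussed here
puts "full allocation: #{n - 2} APVs, recomputation: #{apvs_needed(n)} APVs"
```

For a 55,000-taxon tree this reduces the APV storage from tens of thousands of vectors to fewer than twenty, at the cost of recomputing vectors on demand.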

The re-implementation of the SEV technique can reduce memory almost proportionally to the amount of missing data, without sacrificing runtime. Furthermore, the SEV and recomputation techniques are orthogonal and can thus be deployed simultaneously, as we have shown in the RAxML-Light implementation.

We have also presented an algorithm to reduce the tree size and thus the space of possible tree topologies during tree search. We have shown that, for large topologies comprising tens of thousands of taxa, the resulting trees can be computed faster and are not significantly worse than trees obtained from standard searches. Backbone-based likelihood scores were consistently lower, but correlated with topologies inferred with the same starting tree under a standard search. Therefore, we conclude that this approach can be useful to identify good starting trees. In other words, the backbone approach could be used to discard non-promising starting trees during the early phase of the phylogenetic search.

We have explored and efficiently ported the computation of the phylogenetic likelihood function to GPUs. This required adapting the memory layout for ancestral probability vectors in order to achieve optimal performance on a GPU architecture. We have shown that, for large datasets, offloading PLF computations on DNA data to the GPU can be up to two times faster than the most efficient vectorized CPU code in our RAxML benchmark.

While we used the RAxML codebase to develop proof-of-concept implementations of the above-mentioned techniques, they are generic enough to be applicable to other state-of-the-art likelihood-based codes.

Finally, we have released the PUmPER open-source framework, which can be used to build perpetual phylogenetic pipelines. This framework can be used to maintain and extend up-to-date comprehensive reference trees containing all taxa of a specific taxonomic rank. PUmPER is written in Ruby and can be configured to operate in stand-alone mode on a single server, or to offload computationally expensive maximum likelihood searches to an external cluster. Currently, we are using this pipeline to maintain and update phylogenies for a multi-gene alignment of the Viridiplantae clade.

8.2 Future Work

Most of the potential future work concerns the GPU implementation and the PUmPER framework. The out-of-core approach has been completely abandoned, since it was clearly outperformed by the recomputation approach. The other two orthogonal memory-saving techniques (SEV and recomputation) are currently being integrated into the codebase of the PLL library.

8.2.1 GPUs

With respect to future work on the GPU implementation of the PLF, we plan to fully integrate the proof-of-concept GPU kernel with the PLL and support all models and data types (e.g., protein data, partitioned datasets, and the PSR model of rate heterogeneity). Another possible enhancement is to implement the recomputation technique described in Subsection 3.3.1 on GPUs, since this would enable the computation of larger datasets on GPUs.

In order to leverage the computational power of modern desktop systems and clusters, we intend to implement a hybrid CPU/GPU system (using GPUs and x86 cores). The underlying idea is that, during the execution of the GPU kernel, the CPU does not need to wait for the kernel to complete, but can also execute some PLF computations. For instance, the APVs can be divided into disjoint fragments that are assigned to the GPU and to the CPU. When the operations stored in the traversal descriptor are executed, each processing unit can asynchronously update its APV fragment. Once all fragments have been updated, the CPU can easily combine the partial results when required, for instance, for evaluate() and coreDerivative(). While the CPU and GPU implementations are already available, the implementation of this approach is not straightforward, because load balancing issues must be addressed.
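The combination step can be sketched in a few lines, with plain Ruby threads standing in for the GPU kernel and the x86 cores. Because the PLF log likelihood is a sum of per-site terms, disjoint site fragments can be evaluated asynchronously and combined by addition afterwards; the 70/30 split and the dummy per-site scores below are illustrative only.

```ruby
# Dummy per-site log-likelihood contributions for a 10,000-site alignment.
per_site_log_lh = Array.new(10_000) { -0.5 }

# Assign disjoint fragments to the two processing units.
split    = (per_site_log_lh.size * 0.7).to_i
gpu_part = Thread.new { per_site_log_lh[0...split].sum } # stands in for the GPU kernel
cpu_part = Thread.new { per_site_log_lh[split..].sum }   # concurrent CPU work

# Combine the partial results, as evaluate() would.
total = gpu_part.value + cpu_part.value
puts total
```

The open problem noted above is choosing the split so that both units finish at roughly the same time, which is exactly the load-balancing issue.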

8.2.2 PUmPER

Future developments for the PUmPER project include developing a web service to facilitate the use of the automated update pipeline for a broader community. The framework modules can easily be adapted to fit a Model-View-Controller (MVC) based web application. In this context, the user could configure his perpetually updated phylogenetic inference (genes, clade, number of phylogenies) and visualize the resulting trees in a web browser.

We also intend to integrate the recently introduced approximation techniques for computing bootstrap values [54] into RAxML-Light and the framework, and we consider replacing RAxML-Light with the newer and more efficient ExaML code [90] for phylogenetic inference.

8.3 Outlook

Given the current developments in Next-Generation Sequencing technologies, such as single-molecule sequencing [58], the cost of sequencing genomes is expected to keep decreasing over the next years. At the same time, the scale of ongoing and planned biological data analysis projects is larger than ever in terms of data volume. For instance, the 100K Genome Project (http://www.genomicsengland.co.uk/) will sequence the full genomes of 100,000 patients over the next five years.

In this context, the techniques presented here will become more relevant over the next years. In particular, we envision that phylogenetic inference on alignments including hundreds or thousands of genes may become common practice. The GPU implementation, whose performance improves with sequence length, may become more efficient for these types of analyses. Thus, further work needs to be conducted to improve performance on these and upcoming hardware architectures, as well as on the design of the aforementioned hybrid CPU/GPU system.

Furthermore, the grand challenge of inferring the tree of life, that is, a phylogeny comprising all described species, remains unsolved. Inferring larger trees comprising more species is required to answer biological questions, such as species diversification [67, 80], that cannot be addressed by analysing smaller phylogenies. The inference of such large-scale phylogenies, however, poses challenges related to the alignment size. On the one hand, memory requirements increase linearly with the sequence length and the number of species. These requirements can be reduced by simultaneously applying the orthogonal memory-saving techniques (Subtree Equality Vectors and recomputation of ancestral vectors). On the other hand, adding more taxa to the alignment increases the number of possible topologies in tree space. Thus, running time will remain a limiting factor, and exploring aggressive heuristics to reduce the tree space may, therefore, be required. To this end, approaches like the backbone algorithm may be a good starting point for future research.

In general, the techniques presented in this thesis can help to improve the scalability of current state-of-the-art phylogenetic codes, and thus enable the analysis of large phylogenies at the genome scale.


List of Figures

2.1 From sequences to alignment . . . 9
2.2 Concatenated multiple sequence alignment . . . 10
2.3 Protein . . . 12
2.4 Unrooted Phylogenetic Tree . . . 13
2.5 Unrooted Phylogenetic Tree with branch lengths . . . 13
2.6 Unrooted Phylogenetic Tree with a virtual root . . . 17
2.7 Ancestral Probability Vectors . . . 18
2.8 Density function for the Γ distribution . . . 20
2.9 A lazy SPR move . . . 22
2.10 Node records . . . 26

3.1 Standard Memory Layout of an ancestral probability vector . . . 32
3.2 Vectors stored on memory and disk . . . 36
3.3 Miss rates for different replacement strategies . . . 40
3.4 Effect of Read skipping . . . 41
3.5 Miss rates for fractions of memory . . . 41
3.6 Execution times for full tree traversals . . . 43
3.7 Recomputation with a balanced subtree . . . 46
3.8 Recomputation with an unbalanced subtree . . . 52
3.9 Replacement Strategies . . . 55
3.10 Overall RAM usage with partial allocation . . . 56
3.11 Generic SEVs . . . 59
3.12 SEV savings in computations . . . 61
3.13 SEV savings in computations and memory . . . 62

4.1 Consistency of labels at the backbone boundaries . . . 72
4.2 Increase of backbone tips due to topology conflicts . . . 73
4.3 Log Likelihood scores for the 38K dataset . . . 75
4.4 Log Likelihood scores for the 56K dataset . . . 76

5.1 The Fermi Architecture . . . 81
5.2 Architecture of a streaming multiprocessor . . . 81


5.3 CUDA memory hierarchy . . . 83
5.4 Coalesced access to global memory . . . 84

6.1 Standard Memory Layout . . . 89
6.2 Memory Layout for vector width of 2 . . . 89
6.3 Memory Layout for vector in GPU . . . 96
6.4 GPU implementation for newview() . . . 97
6.5 GPU speedups (full run) . . . 99
6.6 GPU speedups (functions) . . . 100

7.1 Initial Iteration . . . 104
7.2 Update Iteration . . . 105
7.3 A perpetually updated tree for the 18S gene . . . 114


List of Tables

3.1 Frequency of ancestral vector cases for different strategies . . . 57
3.2 Average run times for full traversals . . . 57
3.3 SEV Evaluation for the 38K dataset . . . 63
3.4 SEV Evaluation for the 56K dataset . . . 63

4.1 Evaluation of methods to identify the innermost node . . . 70
4.2 Average number of computed backbone tips . . . 72
4.3 Average Performance for the 38K dataset . . . 77
4.4 Average Performance for the 56K dataset . . . 77
4.5 Average Symmetric Difference . . . 78

6.1 GPU evaluation . . . 98
6.2 GPU scaling evaluation . . . 99

7.1 PUmPER iterations of the rbcL dataset . . . 115
7.2 PUmPER iterations of the 18S dataset . . . 116
7.3 AU test for iteration 2010 of 18S dataset . . . 118
7.4 PUmPER iterations on simulated dataset . . . 120


List of Acronyms

AA - Amino acid
API - Application Programming Interface
APV - Ancestral Probability Vector
AS - Ancestral State
AVX - Advanced Vector Extensions
BS - Bootstrap Support
CPU - Central Processing Unit
CUDA - Compute Unified Device Architecture
DNA - Deoxyribonucleic acid
EM - External Memory
GPGPU - General Purpose computation on GPU
GPU - Graphics Processing Unit
GTR - General Time Reversible
LGT - Lateral Gene Transfer
NJ - Neighbour Joining
NNI - Nearest Neighbour Interchange
OTU - Operational Taxonomic Unit
PLF - Phylogenetic Likelihood Function
PLL - Phylogenetic Likelihood Library
PSR - Per-Site Rate (model of rate heterogeneity)
RNA - Ribonucleic acid
SDK - Software Development Kit
SIMD - Single Instruction, Multiple Data
SIMT - Single Instruction, Multiple Thread
SPR - Subtree Pruning and Regrafting
SSE3 - Streaming SIMD Extensions 3
TBR - Tree Bisection and Reconnection
TU - Taxonomic Unit
VM - Virtual Machine
VW - Vector (Unit) Width


Bibliography

[1] N. Alachiotis, S. A. Berger, and A. Stamatakis. Coupling SIMD and SIMT architectures to boost performance of a phylogeny-aware alignment kernel. BMC Bioinformatics, 13:196, 2012.

[2] Altera. White paper: Implementation of the Smith-Waterman algorithm on a reconfigurable supercomputing platform, 2007.

[3] S. Altschul, T. Madden, A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25(17):3389, 1997.

[4] D. L. Ayres, A. Darling, D. J. Zwickl, P. Beerli, M. T. Holder, P. O. Lewis, J. P. Huelsenbeck, F. Ronquist, D. L. Swofford, M. P. Cummings, A. Rambaut, and M. A. Suchard. BEAGLE: An Application Programming Interface and High-Performance Computing Library for Statistical Phylogenetics. Systematic Biology, 61(1):170–173, 2012.

[5] E. Barkan, E. Biham, and A. Shamir. Rigorous bounds on cryptanalytic time/memory tradeoffs. In Advances in Cryptology - CRYPTO 2006, volume 4117 of LNCS, pages 1–21. Springer-Verlag, 2006.

[6] D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and E. W. Sayers. GenBank. Nucleic Acids Research, 37(Database issue):D26–31, Jan. 2009.

[7] S. A. Berger and A. Stamatakis. Accuracy and performance of single versus double precision arithmetics for maximum likelihood phylogeny reconstruction. In R. Wyrzykowski, J. Dongarra, K. Karczewski, and J. Wasniewski, editors, Parallel Processing and Applied Mathematics, volume 6068 of Lecture Notes in Computer Science, pages 270–279. Springer Berlin Heidelberg, 2010.


[8] S. A. Berger and A. Stamatakis. PaPaRa 2.0: A vectorized algorithm for probabilistic phylogeny-aware alignment extension; Exelixis-RRDR-2012-5; http://sco.h-its.org/exelixis/pubs/Exelixis-RRDR-2012-5.pdf. Technical report, Heidelberg Institute for Theoretical Studies, 2012.

[9] O. Bininda-Emonds and A. Stamatakis. Reconstructing the Tree of Life: Taxonomy and Systematics of Species Rich Taxa, volume 72, chapter Taxon sampling versus computational complexity and their impact on obtaining the Tree of Life, pages 77–95. CRC Press, 2006.

[10] F. Blagojevic, D. S. Nikolopoulos, A. Stamatakis, and C. D. Antonopoulos. RAxML-Cell: Parallel phylogenetic tree inference on the Cell Broadband Engine. In Proc. of the International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007.

[11] A. R. Brodtkorb, T. R. Hagen, and M. L. Saetra. Graphics processing unit (GPU) programming strategies and trends in GPU computing. Journal of Parallel and Distributed Computing, 73(1):4–13, 2013.

[12] M. Charalambous, P. Trancoso, and A. Stamatakis. Initial experiences porting a bioinformatics application to a graphics processor. In Proc. of the 10th Panhellenic Conference on Informatics (PCI 2005), pages 415–425, 2005.

[13] M. W. Chase, D. E. Soltis, R. G. Olmstead, D. Morgan, D. H. Les, B. D. Mishler, M. R. Duvall, R. A. Price, H. G. Hills, Y.-L. Qiu, K. A. Kron, J. H. Rettig, E. Conti, J. D. Palmer, J. R. Manhart, K. J. Sytsma, H. J. Michaels, W. J. Kress, K. G. Karol, W. D. Clark, M. Hedren, B. S. Gaut, R. K. Jansen, K.-J. Kim, C. F. Wimpee, J. F. Smith, G. R. Furnier, S. H. Strauss, Q.-Y. Xiang, G. M. Plunkett, P. S. Soltis, S. M. Swensen, S. E. Williams, P. A. Gadek, C. J. Quinn, L. E. Eguiarte, E. Golenberg, G. H. Learn Jr., S. W. Graham, S. C. H. Barrett, S. Dayanandan, and V. A. Albert. Phylogenetics of seed plants: An analysis of nucleotide sequences from the plastid gene rbcL. Annals of the Missouri Botanical Garden, 80(3):528–548+550–580, 1993.

[14] K. G. Consortium. OpenCL: The open standard for parallel programming of heterogeneous systems.

[15] D. Darriba, A. Aberer, T. Flouri, T. Heath, F. Izquierdo-Carrasco, and A. Stamatakis. Boosting the performance of Bayesian divergence time estimation with the phylogenetic likelihood library. In Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), 2013 IEEE International Symposium on, 2013.

[16] T. DeSantis, P. Hugenholtz, N. Larsen, M. Rojas, E. Brodie, K. Keller, T. Huber, D. Dalevi, P. Hu, and G. Andersen. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl Environ Microbiol, 72(7):5069–72, 2006.

[17] A. Drummond and A. Rambaut. BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evolutionary Biology, 7(1):214, 2007.

[18] P. Ďuriš and Z. Galil. A time-space tradeoff for language recognition. Theory of Computing Systems, 17:3–12, 1984. doi:10.1007/BF01744430.

[19] R. C. Edgar. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research, 32(5):1792–1797, 2004.

[20] A. W. F. Edwards and L. L. Cavalli-Sforza. Reconstruction of evolutionary trees. Systematics Association Publication, 6:67–76, 1964.

[21] I. Elias. Settling the intractability of multiple alignment. J. Comput. Biol., 13(7):1323–1339, Sep 2006.

[22] J. Fang, A. Varbanescu, and H. Sips. A comprehensive performance comparison of CUDA and OpenCL. In Parallel Processing (ICPP), 2011 International Conference on, pages 216–225, 2011.

[23] J. Felsenstein. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol., 17:368–376, 1981.

[24] J. Felsenstein. Confidence limits on phylogenies: An approach using the bootstrap. Evolution, 39(4):783–791, 1985.

[25] J. Felsenstein. Inferring Phylogenies. Sinauer Associates, Inc., 2004.

[26] W. Fletcher and Z. Yang. INDELible: a flexible simulator of biological sequence evolution. Molecular Biology and Evolution, 26(8):1879–1888, 2009.

[27] T. Flouri, F. Izquierdo-Carrasco, A. Stamatakis, et al. PLL: Phylogenetic Likelihood Library; http://www.libpll.org. Work in progress, 2013.

[28] P. A. Goloboff, S. A. Catalano, J. M. Mirande, C. A. Szumik, J. S. Arias, M. Källersjö, and J. S. Farris. Phylogenetic analysis of 73060 taxa corroborates major eukaryotic groups. Cladistics, 25:1–20, 2009.

132

Page 143: sco.h-its.orgsco.h-its.org/exelixis/pubs/dissFernando.pdf · Die dritte Methode wird SEV (Subtree Equality Vectors) genannt und reduziert den Speicherbedarf nahezu proportional zum

[29] S. Guindon, J. Dufayard, V. Lefort, M. Anisimova, W. Hordijk, and O. Gascuel. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Systematic Biology, 59(3):307, 2010.

[30] J. Hauser, K. Kobert, F. Izquierdo-Carrasco, K. Meusemann, B. Misof, M. Gertz, and A. Stamatakis. Heuristic algorithms for the protein model assignment problem. In Z. Cai, O. Eulenstein, D. Janies, and D. Schwartz, editors, Bioinformatics Research and Applications, volume 7875 of Lecture Notes in Computer Science, pages 137–148. Springer Berlin Heidelberg, 2013.

[31] T. Heath, M. Holder, and J. Huelsenbeck. A Dirichlet process prior for estimating lineage-specific substitution rates. Molecular Biology and Evolution, 29(3):939–955, 2012.

[32] D. S. Hibbett, R. H. Nilsson, M. Snyder, M. Fonseca, J. Costanzo, and M. Shonfeld. Automated phylogenetic taxonomy: An example in the Homobasidiomycetes (mushroom-forming fungi). Systematic Biology, 54(4):660–668, 2005.

[33] D. H. Huson, R. Rupp, and C. Scornavacca. Phylogenetic Networks. Cambridge University Press, Cambridge, 2010.

[34] Lockless Inc. Lockless memory allocator; http://locklessinc.com.

[35] F. Izquierdo-Carrasco, N. Alachiotis, S. Berger, T. Flouri, S. P. Pissis, and A. Stamatakis. A generic vectorization scheme and a GPU kernel for the phylogenetic likelihood library. In Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), 2013 IEEE International Symposium on, 2013.

[36] F. Izquierdo-Carrasco, J. Cazes, S. Smith, and A. Stamatakis. PUmPER: Phylogenies Updated Perpetually. Bioinformatics, 30(10):1476–1477, 2014.

[37] F. Izquierdo-Carrasco, J. Gagneur, and A. Stamatakis. Trading running time for memory in phylogenetic likelihood computations. In J. Schier, C. M. B. A. Correia, A. L. N. Fred, and H. Gamboa, editors, BIOINFORMATICS, pages 86–95. SciTePress, 2012.

[38] F. Izquierdo-Carrasco, S. Smith, and A. Stamatakis. Algorithms, data structures, and numerics for likelihood-based phylogenetic inference of huge trees. BMC Bioinformatics, 12(1):470, 2011.

[39] F. Izquierdo-Carrasco and A. Stamatakis. Computing the phylogenetic likelihood function out-of-core. In Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), 2011 IEEE International Symposium on, pages 444–451, 2011.

[40] F. Izquierdo-Carrasco and A. Stamatakis. Inference of huge trees under maximum likelihood. In Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2012 IEEE 26th International, pages 2490–2493. IEEE, 2012.

[41] T. Jukes and C. Cantor. Evolution of Protein Molecules, chapter III, pages 21–132. Academic Press, New York, 1969.

[42] K. Katoh and H. Toh. Recent developments in the MAFFT multiple sequence alignment program. Briefings in Bioinformatics, 9(4):286–298, 2008.

[43] H. Kishino and M. Hasegawa. Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea. Journal of Molecular Evolution, 29(2):170–179, 1989.

[44] A. Kupczok, H. Schmidt, and A. von Haeseler. Accuracy of phylogeny reconstruction methods combining overlapping gene data sets. Algorithms for Molecular Biology, 5(1):1–17, 2010.

[45] C. Lanave, G. Preparata, C. Saccone, and G. Serio. A new method for calculating evolutionary substitution rates. Journal of Molecular Evolution, 20:86–93, 1984.

[46] N. Lartillot, S. Blanquart, and T. Lepage. PhyloBayes v2.3, 2007.

[47] D. Lipman and W. Pearson. Rapid and sensitive protein similarity searches. Science, 227(4693):1435–1441, 1985.

[48] P. Liò and N. Goldman. Models of molecular evolution and phylogeny. Genome Research, 8:1233–1244, 1998.

[49] B. Maidak, J. Cole, T. Lilburn, C. Parker Jr., P. Saxman, R. Farris, G. Garrity, G. Olsen, T. Schmidt, and J. Tiedje. The RDP-II (Ribosomal Database Project). Nucleic Acids Research, 29(1):173–174, 2001.

[50] S. Manavski and G. Valle. CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment. BMC Bioinformatics, 9(Suppl 2):S10, 2008.

[51] E. R. Mardis. Next-generation sequencing platforms. Annual Review of Analytical Chemistry, 6:287–303, 2013.

[52] M. Holder and P. O. Lewis. Phylogeny estimation: traditional and Bayesian approaches, 2003.

[53] B. Minh, L. Vinh, A. von Haeseler, and H. Schmidt. pIQPNNI: parallel reconstruction of large maximum likelihood phylogenies. Bioinformatics, 21(19):3794–3796, 2005.

[54] B. Q. Minh, M. A. T. Nguyen, and A. von Haeseler. Ultrafast approximation for phylogenetic bootstrap. Molecular Biology and Evolution, 2013.

[55] M. J. L. de Hoon, S. Imoto, J. Nolan, and S. Miyano. Open source clustering software. Bioinformatics, 20(9):1453–1454, 2004.

[56] B. Moret, U. Roshan, and T. Warnow. Sequence-length requirements for phylogenetic methods. Algorithms in Bioinformatics, pages 343–356, 2002.

[57] S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[58] T. P. Niedringhaus, D. Milanova, M. B. Kerby, M. P. Snyder, and A. E. Barron. Landscape of Next-Generation Sequencing Technologies. Anal. Chem., 83(12):4327–4341, May 2011.

[59] NVIDIA. NVIDIA's next generation CUDA compute architecture: Fermi; http://www.nvidia.de/content/pdf/fermi_white_papers/nvidia_fermi_compute_architecture_whitepaper.pdf, 2009.

[60] NVIDIA. NVIDIA's next generation CUDA compute architecture: Kepler GK110; http://www.nvidia.com/content/pdf/kepler/nvidia-kepler-gk110-architecture-whitepaper.pdf, 2012.

[61] NVIDIA. CUDA C programming guide (design guide, version 5.5); http://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf, 2013.

[62] J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A. Lefohn, and T. J. Purcell. A survey of general-purpose computation on graphics hardware. Computer Graphics Forum, 26(1):80–113, 2007.

[63] H. Philippe and M. Blanchette. Overview of the first phylogenomics conference. BMC Evolutionary Biology, 7(Suppl 1):S1, 2007.

[64] S. Pond and S. Muse. Column sorting: Rapid calculation of the phylogenetic likelihood function. Systematic Biology, 53(5):685–692, 2004.

[65] F. Pratas, P. Trancoso, L. Sousa, A. Stamatakis, G. Shi, and V. Kindratenko. Fine-grain parallelism using multi-core, Cell/BE, and GPU systems. Parallel Computing, 38(8):365–390, 2012.

[66] M. Price, P. Dehal, and A. Arkin. FastTree 2 – Approximately Maximum-Likelihood Trees for Large Alignments. PLoS ONE, 5(3):e9490, 2010.

[67] R. A. Pyron and J. J. Wiens. Large-scale phylogenetic analyses reveal the causes of high tropical amphibian diversity. Proceedings of the Royal Society B: Biological Sciences, 280(1770), 2013.

[68] C. Quast, E. Pruesse, P. Yilmaz, J. Gerken, T. Schweer, P. Yarza, J. Peplies, and F. O. Glöckner. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Research, 41(Database issue):590–596, 2013.

[69] D. Robinson and L. Foulds. Comparison of phylogenetic trees. Math. Biosci., 53(1–2):131–147, 1981.

[70] S. Roch. A Short Proof that Phylogenetic Tree Reconstruction by Maximum Likelihood Is Hard. IEEE/ACM Transactions on Computational Biology and Bioinformatics, pages 92–94, 2006.

[71] T. Rognes. Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation. BMC Bioinformatics, 12(1):221, 2011.

[72] F. Ronquist and J. Huelsenbeck. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics, 19(12):1572–1574, 2003.

[73] F. Ronquist, M. Teslenko, P. van der Mark, D. L. Ayres, A. Darling, S. Höhna, B. Larget, L. Liu, M. A. Suchard, and J. P. Huelsenbeck. MrBayes 3.2: Efficient Bayesian phylogenetic inference and model choice across a large model space. Systematic Biology, 2012.

[74] B. Roure, D. Baurain, and H. Philippe. Impact of missing data on phylogenies inferred from empirical phylogenomic data sets. Molecular Biology and Evolution, 30(1):197–214, 2013.

[75] J. Shendure and H. Ji. Next-generation DNA sequencing. Nature Biotechnology, 26(10):1135–1145, Oct. 2008.

[76] H. Shimodaira. An Approximately Unbiased Test of Phylogenetic Tree Selection. Systematic Biology, 51:492–508, 2002.

[77] H. Shimodaira and M. Hasegawa. Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Molecular Biology and Evolution, 16(8):1114, 1999.

[78] H. Shimodaira and M. Hasegawa. CONSEL: for assessing the confidence of phylogenetic tree selection. Bioinformatics, 17(12):1246–1247, 2001.

[79] M. Simonsen, T. Mailund, and C. N. S. Pedersen. Building very large neighbour-joining trees. In A. L. N. Fred, J. Filipe, and H. Gamboa, editors, BIOINFORMATICS, pages 26–32. INSTICC Press, 2010.

[80] S. Smith, J. Beaulieu, A. Stamatakis, and M. Donoghue. Understanding angiosperm diversification using small and large phylogenetic trees. American Journal of Botany, pages ajb–1000481v1, 2011.

[81] S. A. Smith, J. M. Beaulieu, and M. J. Donoghue. Mega-phylogeny approach for comparative biology: an alternative to supertree and supermatrix approaches. BMC Evolutionary Biology, 9(37), 2009.

[82] S. A. Smith and C. W. Dunn. Phyutility: a phyloinformatics tool for trees, alignments and molecular data. Bioinformatics, 24(5):715–716, 2008.

[83] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[84] S. Smits and C. Ouverney. Phylometrics: a pipeline for inferring phylogenetic trees from a sequence relationship network perspective. BMC Bioinformatics, 11(Suppl 6), 2010.

[85] A. Stamatakis. Parsimonator; https://github.com/stamatak/parsimonator-1.0.2.

[86] A. Stamatakis. Phylogenetic models of rate heterogeneity: A high performance computing perspective. In Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS), 2006.

[87] A. Stamatakis. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics, 22(21):2688–2690, 2006.

[88] A. Stamatakis. Phylogenetic Search Algorithms for Maximum Likelihood, pages 547–577. John Wiley & Sons, Inc., 2011.

[89] A. Stamatakis. Orchestrating the phylogenetic likelihood function on emerging parallel architectures. In B. Schmidt, editor, Bioinformatics: High Performance Parallel Computer Architectures, pages 85–115. CRC Press, 2012.

[90] A. Stamatakis and A. J. Aberer. Novel parallelization schemes for large-scale likelihood-based phylogenetic inference. In Parallel Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on, pages 1195–1204, 2013.

[91] A. Stamatakis, A. J. Aberer, C. Goll, S. A. Smith, S. A. Berger, and F. Izquierdo-Carrasco. RAxML-Light: a tool for computing terabyte phylogenies. Bioinformatics, 28(15):2064–2066, 2012.

[92] A. Stamatakis and N. Alachiotis. Time and memory efficient likelihood-based tree searches on phylogenomic alignments with missing data. Bioinformatics, 26(12):132–139, 2010.

[93] A. Stamatakis, F. Blagojevic, D. Nikolopoulos, and C. Antonopoulos. Exploring new search algorithms and hardware for phylogenetics: RAxML meets the IBM Cell. The Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, 48(3):271–286, 2007.

[94] A. Stamatakis and F. Izquierdo-Carrasco. Result verification, code verification and computation of support values in phylogenetics. Brief. Bioinformatics, 12(3):270–279, May 2011.

[95] A. Stamatakis, T. Ludwig, and H. Meier. RAxML-III: A Fast Program for Maximum Likelihood-based Inference of Large Phylogenetic Trees. Bioinformatics, 21(4):456–463, 2005.

[96] A. Stamatakis, T. Ludwig, H. Meier, and M. J. Wolf. AxML: A Fast Program for Sequential and Parallel Phylogenetic Tree Calculations Based on the Maximum Likelihood Method. In Proceedings of the 1st IEEE Computer Society Bioinformatics Conference (CSB2002), pages 21–28, 2002.

[97] A. Stamatakis and M. Ott. Efficient computation of the phylogenetic likelihood function on multi-gene alignments and multi-core architectures. Phil. Trans. R. Soc. series B, Biol. Sci., 363:3977–3984, 2008.

[98] A. Stamatakis and M. Ott. Load Balance in the Phylogenetic Likelihood Kernel. In Proceedings of ICPP 2009, 2009. Accepted for publication.

[99] M. A. Suchard and A. Rambaut. Many-core algorithms for statistical phylogenetics. Bioinformatics, 25(11):1370–1376, 2009.

[100] J. Sumner and M. Charleston. Phylogenetic estimation with partial likelihood tensors. Journal of Theoretical Biology, 262(3):413–424, 2010.

[101] S. Sunagawa et al. Metagenomic species profiling using universal phylogenetic marker genes. Nat. Meth., in press.

[102] C. J. Thompson, S. Hahn, and M. Oskin. Using modern graphics architectures for general-purpose computing: a framework and analysis. In MICRO, pages 306–317. ACM/IEEE, 2002.

[103] J. Thompson, T. Gibson, and D. Higgins. Multiple sequence alignment using Clustal W and Clustal X. Curr. Protoc. Bioinformatics, Chapter 2, 2002.

[104] L. S. Vinh, H. A. Schmidt, and A. von Haeseler. PhyNav: A novel approach to reconstruct large phylogenies. In C. Weihs and W. Gaul, editors, GfKl, Studies in Classification, Data Analysis, and Knowledge Organization, pages 386–393. Springer, 2004.

[105] J. S. Vitter. Algorithms and data structures for external memory. Found. Trends Theor. Comput. Sci., 2(4):305–474, January 2008.

[106] T. Wheeler. Large-scale neighbor-joining with NINJA. In S. Salzberg and T. Warnow, editors, Algorithms in Bioinformatics, volume 5724 of Lecture Notes in Computer Science, pages 375–389. Springer Berlin / Heidelberg, 2009.

[107] S. Whelan, P. I. W. de Bakker, and N. Goldman. PANDIT: a database of protein and associated nucleotide domains with inferred trees. Bioinformatics, 19(12):1556–1563, 2003.

[108] D. Wu, A. Hartman, N. Ward, and J. Eisen. An automated phylogenetic tree-based small subunit rRNA taxonomy and alignment pipeline (STAP). PLoS ONE, 3(7):e2566, 2008.

[109] J. Xu and R. J. Lipton. On fundamental tradeoffs between delay bounds and computational complexity in packet scheduling algorithms. IEEE/ACM Trans. Netw., 13:15–28, February 2005.

[110] Z. Yang. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites. J. Mol. Evol., 39:306–314, 1994.

[111] Z. Yang. Among-site rate variation and its impact on phylogenetic analyses. Trends in Ecology & Evolution, 11(9):367–372, 1996.

[112] Z. Yang. Computational Molecular Evolution. Oxford University Press, Oxford, 2006.

[113] J. Zhang and A. Stamatakis. The multi-processor scheduling problem in phylogenetics. In Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2012 IEEE 26th International, pages 691–698. IEEE, 2012.

[114] Y. Zhang, M. Sinclair II, and A. Chien. Improving performance portability in OpenCL programs. In J. Kunkel, T. Ludwig, and H. Meuer, editors, Supercomputing, volume 7905 of Lecture Notes in Computer Science, pages 136–150. Springer Berlin Heidelberg, 2013.

[115] J. Zhou, X. Liu, D. Stones, Q. Xie, and G. Wang. MrBayes on a graphics processing unit. Bioinformatics, 27(9):1255–1261, 2011.

[116] D. Zwickl. Genetic Algorithm Approaches for the Phylogenetic Analysis of Large Biological Sequence Datasets under the Maximum Likelihood Criterion. PhD thesis, University of Texas at Austin, April 2006.

[117] D. Zwickl and D. Hillis. Increased taxon sampling greatly reduces phylogenetic error. Systematic Biology, 51(4):588–598, 2002.