Tuesday, June 20, 2017

Cichlids, species and trees

Lake Malawi, in south-eastern Africa, is famous for its large diversity of cichlid fishes. Indeed, it sometimes seems to have more biologists studying these fish than there are actual fish in the lake, even though there are allegedly hundreds of cichlid fish species in that lake. In this sense, it is somewhat similar to Lake Baikal, in southern Siberia, home to the sole species of freshwater seals.

The cichlid biologists are interested in describing the extensive fish diversity, pondering its origin, and thus its contribution to the study of speciation. After all, we are talking about what is usually claimed to be "the most extensive recent vertebrate adaptive radiation". So, we are talking here as much about population genetics as we are about ichthyology.

Inevitably, the genome biologists have been spotted in the vicinity of the lake; and we now have a preliminary report from them:
Milan Malinsky, Hannes Svardal, Alexandra M. Tyers, Eric A. Miska, Martin J. Genner, George F. Turner, Richard Durbin (2017) Whole genome sequences of Malawi cichlids reveal multiple radiations interconnected by gene flow. BioRxiv 143859.
These authors summarize the situation like this:
We characterize [the] genomic diversity by sequencing 134 individuals covering 73 species across all major lineages. Average sequence divergence between species pairs is only 0.1-0.25%. These divergence values overlap diversity within species, with 82% of heterozygosity shared between species. Phylogenetic analyses suggest that diversification initially proceeded by serial branching from a generalist Astatotilapia-like ancestor. However, no single species tree adequately represents all species relationships, with evidence for substantial gene flow at multiple times.
The last sentence seems to be somewhat disingenuous. How could a single tree be expected to describe this scale of biodiversity? Any rapid radiation of diversity is unlikely to be completely tree-like. The increase in diversity can be modeled as a tree, sure, but it is very unlikely that there will be instant separation of the taxa, and so the tree model will be ignoring a large part of the evolutionary action. There will, for example, be ongoing introgression between the diverging taxa, as well as hybridization due to incomplete breeding barriers. These avenues for gene flow can best be modeled as a network, not a tree.

The issue here is that the authors write the paper solely from the perspective of an expected phylogenetic tree, and then feel compelled to explain why they do not produce such a tree. Indeed, the authors present their paper as a study of "violations of the species tree concept".

For data analysis, they proceed as follows:
To obtain a first estimate of between-species relationships we divided the genome into 2543 non-overlapping windows, each comprising 8000 SNPs (average size: 274kb), and constructed a Maximum Likelihood (ML) phylogeny separately for each window, obtaining trees with 2542 different topologies.
So, only two sequence blocks produced the same tree, presumably by random chance. An example "tree" for 12 OTUs is shown in the diagram. It superimposes a possible mitochondrial tree on a summary of the "genome tree".
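The windowing step and the tally of distinct topologies can be sketched in Python. This is only an illustration under stated assumptions, not the authors' pipeline: the function names, the representation of trees as rooted nested tuples, and the toy data are all hypothetical, and the ML inference for each window is left out entirely.

```python
from collections import Counter

def snp_windows(n_snps, window_size):
    """Return (start, end) index pairs for non-overlapping windows of
    window_size SNPs; a trailing partial window is dropped."""
    return [(s, s + window_size)
            for s in range(0, n_snps - window_size + 1, window_size)]

def canonical(tree):
    """Order-independent form of a rooted tree given as nested tuples of
    leaf names, so that ('A', 'B') and ('B', 'A') compare equal."""
    if isinstance(tree, str):
        return tree
    return tuple(sorted((canonical(child) for child in tree), key=repr))

def count_topologies(trees):
    """Count how many windows yielded each distinct topology."""
    return Counter(canonical(t) for t in trees)

# Toy data (hypothetical): three window trees, two of them identical
# apart from the order in which the children are written.
window_trees = [
    (("A", "B"), ("C", "D")),
    (("D", "C"), ("B", "A")),
    (("A", "C"), ("B", "D")),
]
topology_counts = count_topologies(window_trees)
n_distinct = len(topology_counts)
```

With real data, the number of windows and the number of distinct topologies would be compared in the same way as the 2543 and 2542 quoted above.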

Example phylogeny from Malinsky (2012)

The authors continue:
The fact that we are using over 25 million variable sites suggests these differences are not due to sampling noise, but reflect conflicting biological signals in the data. For example, gene flow after the initial separation of species can distort the overall phylogeny and lead to intermediate placement of admixed taxa in the tree topology.
Note that gene flow is seen to "distort" the phylogeny rather than being an integral part of it. In this case, "phylogeny" apparently refers solely to the diversification part of evolutionary history, rather than to the whole history.

The ultimate questions from this paper are: "what is a species concept?", and "what is a species tree?". The authors write a lot about species and trees, and yet their data provide very clear evidence that both "species" and "tree" are very restrictive concepts for studying the cichlids of Lake Malawi.

Tuesday, June 13, 2017

Bayesian inference of phylogenetic networks

Over the years, a number of methods have been explored for constructing evolutionary networks, starting with parsimony criteria for optimization, and moving on to likelihood-based inference. However, the development of Bayesian methods has been somewhat delayed by the computational complexities involved.

Network from Radice (2012)

The earliest work on this topic seems to be the thesis of:
Rosalba Radice (2011) A Bayesian Approach to Phylogenetic Networks. PhD thesis, University of Bath, UK.
Apparently, the only part of this work to be published has been:
Rosalba Radice (2012) A Bayesian approach to modelling reticulation events with application to the ribosomal protein gene rps11 of flowering plants. Australian & New Zealand Journal of Statistics 54: 401-426.
The method described requires the prior specification of the species tree (phylogeny), and the position and number of the reticulation events. The algorithm was implemented in the R language.

More recently, methods have been developed that infer phylogenies by using (i) incomplete lineage sorting (ILS) to model gene-tree incongruence arising from vertical inheritance, and (ii) introgression / hybridization to model gene-tree incongruence attributable to horizontal gene flow. ILS has been addressed using the multispecies coalescent.

The first of these publications was:
Dingqiao Wen, Yun Yu, Luay Nakhleh (2016) Bayesian inference of reticulate phylogenies under the multispecies network coalescent. PLoS Genetics 12(5): e1006006. [Correction: 2017 PLoS Genetics 13(2): e1006598]
The method requires the set of gene trees as input, along with the number of reticulations. The algorithm was implemented in the PhyloNet package.

In the past few months, two manuscripts have appeared that try to co-estimate the gene trees and the species network, using the original sequence data (assumed to be without recombination) as input:
Dingqiao Wen, Luay Nakhleh (2017) Co-estimating reticulate phylogenies and gene trees from multi-locus sequence data. bioRxiv 095539. [v.2; v.1: 2016]
Chi Zhang, Huw A Ogilvie, Alexei J Drummond, Tanja Stadler (2017) Bayesian inference of species networks from multilocus sequence data. bioRxiv 124982.
The algorithm for the first method has been implemented in the PhyloNet package, while the second has been implemented in the Beast2 package.

Finally, another manuscript describes a method utilizing data based on single nucleotide polymorphisms (SNPs) and/or amplified fragment length polymorphisms (AFLPs), which thus sidesteps the assumption of no recombination:
Jiafan Zhu, Dingqiao Wen, Yun Yu, Heidi Meudt, Luay Nakhleh (2017) Bayesian inference of phylogenetic networks from bi-allelic genetic markers. bioRxiv 143545.
This method has also been implemented in PhyloNet.

Due to the computational complexity of likelihood inference, all of these methods are currently severely restricted in the number of OTUs that can be analyzed, irrespective of whether these involve multiple samples from the same species or not. In this sense, parsimony-based inference or approximate likelihood methods are still useful for constructing evolutionary networks of any size. However, progress is clearly being made to alleviate the computational restrictions.

Tuesday, June 6, 2017

Bears, genomes and gene flow

It has traditionally been assumed that speciation occurs when gene flow between populations ceases. However, nothing in biology ever remains simple — the more we study any biological phenomenon the more complex it becomes. So, speciation with gene flow is becoming a more commonly discussed topic. This is especially so with the advent of genome sequencing, which allows us to study the extent of gene flow in the past, rather than solely in the present.

A case in point is the recent paper by:
Vikas Kumar, Fritjof Lammers, Tobias Bidon, Markus Pfenninger, Lydia Kolter, Maria A. Nilsson and Axel Janke (2017) The evolutionary history of bears is characterized by gene flow across species. Nature Scientific Reports 7: 46487.
This paper considers the evolutionary relationships among seven species of bears, with multiple genome samples from four of those species. The coalescent species tree (based on 18,621 genome fragments > 25 kb), which accounts for incomplete lineage sorting (ILS), is well supported, as shown here.

However, numerous individual genome-fragment trees support alternative topologies. For example, 38% of the trees support a topology where the Asiatic black bear is the sister to the American black - Brown - Polar bear clade. This suggests that there is more than simply ILS that creates the conflicting genome trees.

The authors applied several different data analyses to investigate the possibility of gene flow among the species. They found considerable evidence for gene flow, as shown in the network (the arrow colors represent different analyses).

Indeed, each of the six in-group species could conceivably be connected by gene flow to each of the other five species. The network shows evidence that the Brown, Asiatic and Sloth bears might have all five connections, while the Polar and Sun bears have four, and the American bear has three.

As the authors note, some of this potential gene flow cannot have occurred directly between species, because they live in different habitats. Instead, it may be remnants of ancestral gene flow, or gene flow through a vector species. In particular, the strongest signal of gene flow connects the Asiatic black bear with the ancestor of the American black - Brown - Polar bear clade.

Ancestral gene flow is of considerable importance when studying evolution. Charles Darwin was perhaps the first to note (in his notebooks) that we should always treat ancestors as species, not as taxonomic groups, no matter how big the groups of descendants now are. Whole kingdoms and phyla were once a single species, if the contemporary groups are monophyletic.

Tuesday, May 30, 2017

Killer arguments and the nature of proof in historical sciences

Some long time ago, somebody told me this joke, which I just found again on the internet in an English version (following jokes.cc.com, with modifications based on my memory):
Teacher: "Four crows are on the fence. The farmer shoots one. How many are left?"
Little Johnny: "None."
Teacher: "Listen carefully: Four crows are on the fence. The farmer shoots one. How many are left?"
Little Johnny: "None."
Teacher: "Can you explain that answer?"
Little Johnny: "One is shot, the others fly away. There are none left."
Teacher: "Well, that isn't the correct answer, but I like the way you think."
Little Johnny: "Teacher, can I ask a question?"
Teacher: "Sure."
Little Johnny: "There are three women in the park. The first one reads a love novel, the second one reads the newspaper, and the third one updates her Facebook profile. Which one of them is married?"
Teacher: "The one reading the newspaper?"
Little Johnny: "No. The one with the wedding ring on, but I like the way you think."
Given the title of this post, you may wonder why I tell you that joke. The reason is that for me, the essence of the joke is expressing the situation we often have in the historical sciences when we talk about "proof", be it of the closer relationship of different species, or the ultimate relationship of languages. Given the evidence we are given, we can reach an awful lot of conclusions in order to arrive at a convincing story, but if we see the wedding ring on somebody's hand, we know the true story no matter what other evidence we are given. The wedding ring in the joke serves as a killer argument — no matter what other evidence we consider, it is much more likely that the person who is married is the one with the ring than anybody else.

We often face similar situations in the historical sciences where we seek some kind of true story behind a couple of facts, when we are given external evidence that is just pointing to the right answer, or — let's be careful — the most probable answer, independent of where the other evidence might point to. We can think of similar situations in crime investigations, where we may think that a large body of evidence convicts some person as a murderer until we see some video proof that reveals the real offender.

That crime investigations have a lot in common with research in the historical sciences has been noted before by many people, notably the famous Umberto Eco (1932-2016), who edited a whole anthology on the role of circumstantial evidence in linguistics, semiotics, and philosophy (Eco and Sebeok 1983) where scholars compared the work of Sherlock Holmes with the work of people in the historical sciences. What Sherlock Holmes and historical linguists (and also evolutionary biologists) have in common is the use of abduction as their fundamental mode of reasoning. The term itself goes back to Charles Sanders Peirce (1839-1914), who distinguished it from deduction and induction:
Accepting the conclusion that an explanation is needed when facts contrary to what we should expect emerge, it follows that the explanation must be such a proposition as would lead to the prediction of the observed facts, either as necessary consequences or at least as very probable under the circumstances. A hypothesis then, has to be adopted, which is likely in itself, and renders the facts likely. This step of adopting a hypothesis as being suggested by the facts, is what I call abduction. I reckon it as a form of inference, however problematical the hypothesis may be held. (Peirce 1931/1958: 7.202)
Our problem in the historical sciences is that we are searching for an original situation: what was the case a long time ago, based on general knowledge about (evolutionary or historical) processes and the results of this situation. When Sherlock Holmes looks at a crime scene, he sees the results of an action and uses his knowledge of human behaviour to find the one who was responsible for the crime. When doctors listen to the heartbeat of patients who are short of breath, they try to find out what causes their disease by making use of their knowledge about symptoms and the diseases that could have caused them. When linguists look at words from different languages, they make use of their knowledge of processes of language change and language contact in order to work out why those languages are so similar.

As do medical practitioners or crime investigators, we have our general schema, our protocol, which we use to carry out our investigations. Biologists search for similar DNA sequences, linguists look for similar sound sequences. In most cases, this works fine, although we are usually left with uncertainties and things that do not really seem to add up. As long as we can quietly follow the protocol, we are fine; and even if the results of our research do not necessarily last for a long time, being superseded by more recent research, we usually have the impression that we did the best we could, given the complex circumstances with their complex circumstantial evidence. But once in a while, we uncover evidence similar to video proofs in crime investigation, or wedding rings as in the Little Johnny joke: evidence that is so striking that we have to put our protocol to one side and just accept that there is only one solution, no matter what the rest of our evidence or our protocol might point to.

In 1879, Ferdinand de Saussure (1857-1913) predicted two consonantal sounds in Proto-Indo-European based on circumstantial evidence (Saussure 1879). In 1927, Jerzy Kuryłowicz (1895-1978) could show that one of the sounds was still pronounced in Hittite, an Indo-European language that was not known during Saussure's time (Lehmann 1992: 33), and had just been deciphered. While Saussure followed protocol in his investigation, Kuryłowicz provided the video proof, and only since then has Saussure's hypothesis become communis opinio in historical linguistics.

I assume that nobody will doubt the existence of different kinds of proof, different qualities of proof, in historical disciplines. If we are left with nothing else but our protocol, we can derive certain conclusions, but we can easily abandon our protocol once we have been presented with those killer arguments, that specific kind of proof that is so striking that we do not need to bother to have a look at any alternative facts again. I do not know of any similar examples in biology, but in linguistics (and in crime investigation, at least judging from the crime novels I have read), it is obvious that our evidence can not only be ranked, but that we also have a steep gradient between the standard evidence we use to make most of our arguments and those killer arguments that are so striking that no doubt is left.

In the short story The Adventure of the Beryl Coronet, Sherlock Holmes says:
[When] you have excluded the impossible, whatever remains, however improbable, must be the truth.
But this is only partially true, as in Sherlock Holmes' cases the truth is usually (but not always!) presented in such a form that it does not leave any place for doubt. Sherlock Holmes is a genius at finding the wedding rings on the fingers of his witnesses. As historical scientists, we are often much less lucky, but probably also less talented than Mr. Holmes. We are thus left with the fundamental problem of not knowing how to find the killer evidence, or how to quantify the doubt in those cases where we just follow the general protocol of our discipline.

  • Eco, U. and T. Sebeok (1983) The Sign of Three. Dupin, Holmes, Peirce. Indiana University Press: Bloomington.
  • Lehmann, W. (1992) Historical linguistics. An Introduction. Routledge: London.
  • Peirce, C. (1931/1958) Collected Papers of Charles Sanders Peirce. Harvard University Press: Cambridge, Mass.
  • Saussure, F. (1879) Mémoire sur le système primitif des voyelles dans les langues indo-européennes. Teubner: Leipzig.

Tuesday, May 23, 2017

A test case for phylogenetic methods and stemmatics: the Divine Comedy

In a previous post I gave an outline of stemmatics, and briefly touched on the adoption and advantages of phylogenetic methods for textual criticism (On stemmatics and phylogenetic methods). Here I present the results of an empirical investigation I have been conducting, in which such methods are used to study some philological dilemmas of a cornerstone work in textual criticism, Dante Alighieri's Divine Comedy. I am reproducing parts of the text and the results of a paper still under review; the NEXUS file for this research is available on GitHub.

Before describing the analysis, I discuss the work and its tradition, as well as some of the open questions concerning its textual criticism. This should not only allow the main audience of this blog to understand (and perhaps question) my work, but it is also a way to familiarize you with the kind of research conducted in stemmatics. After all, the first step is the recensio, a deep review of all information that can be gathered about a work.

The Divine Comedy

The Divine Comedy is an Italian medieval poem, and one of the most successful and influential medieval works. It is written in a rigid structure that, when compared to other works, guaranteed it a certain resistance to copy errors, as most changes would be immediately evident. Composed of three canticas (Inferno, Purgatory, and Paradise), the first of its 100 cantos were written in 1306-07, with the work completed not long before the death of the author in 1321. Written mostly during Dante's exile from his home city, Florence (Tuscany), like many works of the time it was published as the author wrote it, and not only upon completion. In fact, it is even possible, while not proven, that the author changed some cantos and published revisions, thus being himself the source of unresolvable differences.

No original manuscript has survived, but scholarship has traced the development of the tradition from copies and historical research. The poem is one of the most copied works of the Middle Ages, with more than 600 known complete copies, besides 200 partial and fragmentary witnesses. For comparison, there are around 80 copies of Chaucer's Canterbury Tales, which is itself a successful work by medieval standards.

Commercial enterprises soon developed to meet the market demand created by its success. In terms of geographical diffusion, quantitative data suggests that, before the Black Death that ravaged the city of Florence in 1348, scribal activity was more intense in Tuscany than in Northern Italy, where the author had died. Among the hypotheses for its textual evolution, the results of my investigation support the widespread hypothesis that Dante published his work with Florentine orthography in Northern Italy. That is, the first copies adopted Northern orthographic standards, which would then revert to Tuscan customs, with occasional misinterpretations, when the work found its way back to Florence. These essentials of the transmission must be considered when curating a critical edition, as the less numerous Northern manuscripts, albeit with an adapted orthography, can in general be assumed to be closer to the archetype (if there ever was one to speak of) than Florentine ones.

The tradition is characterized by intentional contamination, as the work soon became a focus of politics and grammar prescriptivism. Errors and contamination have already been demonstrated in the earliest securely dated manuscript, the Landiano of 1336 (cf. Shaw, 2011), and can be already identified in the first commentaries dating from the 1320s (such as in the one by Jacopo Alighieri, the author's son).

Critical studies

Here are some details about previous studies. I have included considerable stemmatic information, but I also include a biological analogy to help it make sense to non-experts.

The first critical editions date from the 19th century, but a stemmatic approach would only be advanced at the end of that century, by Michele Barbi. Facing the problem of applying Lachmann's method to a long text with a massive tradition, in 1891 Barbi proposed his list of around 400 loci (samples of the text), inviting scholars to contribute the readings in the manuscripts they had access to. His project, which intended to establish a complete genealogy without the need for a full collatio, had disappointing results, with only a handful of responses. Mario Casella would later (1921) conduct the first formal stemmatic study on the poem, grouping some older manuscripts in two families, α and β, of unequal number of witnesses but equal value for the emendatio. His two families are not rooted at a higher level, but he observed that they share errors supporting the hypothesis of a common ancestor, likely copied by a Northern scribe.

Casella's stemma, reproduced from Shaw (2011).

Forty years later, Giorgio Petrocchi proposed to overcome the large stemma by employing only witnesses dating from before the editorial activity of Giovanni Boccaccio, as his alterations and influence were considered to be too pervasive. Petrocchi defended a cut-off date of 1355 as being necessary for a stemmatic approach that would otherwise have been impossible, given the level of contamination of later copies. The restriction in the number of witnesses was contrasted by his expansion of the collatio to the entire text, criticizing Barbi's loci as subjective selections for which there was no proof of sufficiency.

Making use of analogies with biology, we may say that Barbi proposed to establish a tree from a reduced number of "proteins" for all possible "taxa". Casella considered this to be impracticable and, selecting a few representative "fossils", built a tree from a large number of phenotypic characteristics. Finally, Petrocchi produced a network while considering the entire "genome" for all "fossils" dated from before an event that, while well-supported in theory (we could compare its effects to a profound climate change), was nonetheless arbitrary.

Petrocchi's stemma, reproduced from Shaw (2011).

Questions about Petrocchi's methodology and assumptions were soon raised, particularly regarding the proclaimed influence of Boccaccio, without quantitative proofs either that his editions were as influential as asserted or that all later witnesses were superfluous for stemmatics. Later research focused on questioning his stemma: for example, the absence of consensus about the relationship between the Ash and Ham manuscripts, the supposedly weak demonstration of the polytomy of Mad, Rb, and Urb (the "Northern manuscripts"), and the dating of Gv (likely copied fifty to a hundred years after Petrocchi's assumption). Evidence was presented that Co, a key manuscript in his stemma, could not be an ancestor of Lau (its copyist was still active in the 15th century), and that Ga contained disjunctive errors not found in its supposed descendants. Abusing the biological analogy once more, the dating of his "fossils" was in some cases plainly wrong.

Federico Sanguineti presented an alternative stemma in 2001, arguing that a rigorous application of stemmatics would evidence errors in Petrocchi. To that end, he decided to resurrect Barbi's loci and trace the first complete genealogy, without arbitrary and a priori decisions about the usefulness of the textual witnesses. Sanguineti defended the suggestion that, after this proper recensio, a small number of manuscripts (which he eventually set to seven) would be sufficient for emendation. His stemma, described as "optimistic in its elegance and minimalism" (Shaw 2011), resulted in a critical edition that relied heavily on a single manuscript, Urb, the only witness of his β family (as Rb was displaced from the proximity it had in Petrocchi's stemma, and Mad was excluded from the analysis). Keeping with the biological analogy, he proposed building a tree from an extremely reduced number of "proteins", but for all "taxa". In the end, however, the reduced number of "proteins" was considered only for seven "taxa", selected mostly due to their age.

Sanguineti's stemma, reproduced from Shaw (2011).

The edition of Sanguineti was attacked by critics, who confronted the limited number of manuscripts used in the emendatio, the position of Rb, the high value attributed to LauSC, and the unparalleled importance of Urb, all resulting in an unexpected Northern coloring to the language of a Florentine writer. Regarding his methodology, reviewers pointed out that stemmatic principles had not been followed strictly, as the elimination was not restricted to descripti, but extended to branches that were considered to be too contaminated.

The digital edition of Prue Shaw (2011) was developed as a project for phylogenetic testing of Sanguineti's assumptions. Her edition includes complete manuscript transcriptions, and the transcriptions include all of the layers of revision of each manuscript (original readings and corrections by later hands), and are complemented by high-quality reproductions of the manuscripts. After testing the validity of Sanguineti's method and stemma, Shaw concluded that his claims do not "stand up to close scrutiny", and that the entire edition is compromised, because Rb "is shown unequivocally to be a collaterale of Urb, and not a member of α as [Sanguineti] maintains".

Applying phylogenetic methods

With the goal of following and, to a large part, replicating Shaw (2011), I have analyzed signals of phylogenetic proximity for validating stemmatic hypotheses, produced both a computer-generated and a computer-assisted phylogeny (equivalent to a stemma), and evaluated the performance of such phylogenies with methods of ancestral state reconstruction.

I wanted to investigate the proximity of witnesses and the statistical support for the published stemmas. After experiments with rooted graphs, I made a decision to use NeighborNets, in which splits are indicative of observed divergences and edge lengths are proportional to the observed differences. These unrooted split networks were preferable because they facilitated visual investigation, and also provided results for the subsequent steps. These involved exploring the topology and evaluating potential contaminations, guiding the elimination of taxa whose data would be redundant for establishing prior hypotheses on genealogical relationships. Analyses were conducted using all manuscript layers and critical editions, both with and without bootstrapping, thus obtaining results supported in terms of inferred trees as well as of character data.
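NeighborNet works from a matrix of pairwise distances between the witnesses. As a minimal sketch of how such a matrix could be derived from coded readings, the Python below computes the proportion of differing characters for each pair. Everything here is an assumption made for illustration: the choice of distance, the "?" code for missing readings, the function names, and the sigla W1 to W3 with their invented character states. The network itself would still be computed in SplitsTree.

```python
from itertools import combinations

def p_distance(a, b, missing="?"):
    """Proportion of characters that differ between two witnesses,
    ignoring positions where either reading is missing."""
    compared = [(x, y) for x, y in zip(a, b)
                if x != missing and y != missing]
    if not compared:
        return 0.0
    return sum(x != y for x, y in compared) / len(compared)

def distance_matrix(witnesses):
    """Symmetric distance lookup keyed by (name1, name2)."""
    names = sorted(witnesses)
    dist = {(n, n): 0.0 for n in names}
    for n1, n2 in combinations(names, 2):
        d = p_distance(witnesses[n1], witnesses[n2])
        dist[(n1, n2)] = dist[(n2, n1)] = d
    return dist

# Invented witnesses, each coded as one state per variant location.
witnesses = {
    "W1": "00101",
    "W2": "0011?",
    "W3": "11001",
}
dist = distance_matrix(witnesses)
```

Skipping positions with missing readings keeps fragmentary witnesses comparable, at the cost of basing their distances on fewer characters.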

NeighborNet of the manuscripts and revisions from my data, generated with SplitsTree
(Huson & Bryant 2006)

The analysis confirmed most of the conclusions of Shaw (2011): there are no doubts about the proximity and distinctiveness of Ash and Ham, with Sanguineti's hypothesis (in which they are collaterals) better supported than Petrocchi's hypothesis (in which the first is an ancestor of the second). The proximity of Mart and Triv was confirmed; but the position of the ancestors postulated by Petrocchi and Sanguineti should be questioned in the face of the signals they share with LauSC, perhaps because of contamination. The most important finding, in line with Shaw and in contrast with the fundamental assumption of Sanguineti, is the clear demonstration of the relationship between Rb and Urb.

The relationship analyses allowed the generation of trees for further evaluation. Despite the goal of a full Bayesian tree-inference, I discarded that option because, without a careful and demanding selection of priors, it would yield flawed results. As such, I made the decision to build trees using both stochastic inference and user design (i.e. manually). This postponed more complex topology analyses for future research, but generated the structures needed by the subsequent investigation steps; both trees are included in the datafile.

The second tree (shown below), allowing polytomies and manually constructed by myself, tries to combine the findings of Petrocchi and Sanguineti by resolving their differences with the support of the relationship analyses. Using Petrocchi's edition as a gold standard, and considering only single hypothesis reconstructions, parsimonious ancestral state reconstruction agrees with 9,016 characters (79.9%). When considering multiple hypotheses, instead, reconstructions agree with 10,226 characters (90.7%). Cases of disagreement were manually analyzed and, as expected, most resulted from readings supported by the tradition but refuted by Petrocchi on exegetic grounds.
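The kind of comparison described here can be sketched with a Fitch-style bottom-up parsimony pass, generalised to polytomies by keeping the most frequent states among a node's children (in the manner of Hartigan's method). This is a hedged illustration, not the software used for the analysis: the tree, sigla, readings and gold-standard states are invented, and reading "single hypothesis" as a root set with exactly one state is my assumption.

```python
from collections import Counter

def root_states(tree, readings):
    """Bottom-up parsimony pass. tree is a leaf name (str) or a tuple of
    subtrees; readings maps leaf name -> state. Returns the set of
    equally parsimonious states at the root of tree."""
    if isinstance(tree, str):
        return {readings[tree]}
    counts = Counter()
    for child in tree:
        counts.update(root_states(child, readings))
    best = max(counts.values())
    return {state for state, n in counts.items() if n == best}

def agreement(tree, characters, gold):
    """Count characters whose reconstruction matches the gold reading:
    'single' when the root set is exactly the gold reading, 'multiple'
    when the gold reading is at least one of the candidates."""
    single = multiple = 0
    for readings, target in zip(characters, gold):
        states = root_states(tree, readings)
        if states == {target}:
            single += 1
        if target in states:
            multiple += 1
    return single, multiple

# Hypothetical tree with one polytomy and three invented characters.
tree = (("W1", "W2"), ("W3", "W4", "W5"))
characters = [
    {"W1": "a", "W2": "a", "W3": "a", "W4": "b", "W5": "a"},
    {"W1": "a", "W2": "b", "W3": "c", "W4": "c", "W5": "b"},
    {"W1": "a", "W2": "a", "W3": "b", "W4": "b", "W5": "b"},
]
gold = ["a", "b", "a"]
single, multiple = agreement(tree, characters, gold)
```

Dividing each count by the number of characters gives percentages of the kind reported above; the multiple-hypothesis figure can never be lower than the single-hypothesis one.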

My proposed tree for the manuscripts selected by Sanguineti,
generated with PhyD3 (Kreft et al., 2017).

This tree suggests that, in general, Petrocchi's network is better supported than the tree by Sanguineti, as phylogenetic principles lead us to expect: the first was built considering statistical properties and using all of the available data, while the second relied on many intuitions and hypotheses never really tested. In particular, it supports the findings of Shaw and, as such, allows us to indicate the critical edition of Petrocchi as the best one. Even more important, however, it is further evidence of the usefulness of phylogenetic methods, when appropriately used, in stemmatics.


Alagherii, Dantis (2001) Comedìa. Edited by Federico Sanguineti. Firenze: Edizioni del Galluzzo.

Alighieri, Dante (1994) La Commedia Secondo L’antica Vulgata: Introduzione. Edited by Giorgio Petrocchi. Opere di Dante Alighieri v. 1. Firenze: Le Lettere.

Huson, Daniel H.; Bryant, David (2006) Application of phylogenetic networks in evolutionary studies. Molecular Biology and Evolution 23: 254–267.

Inglese, Giorgio (2007) Inferno, Revisione del testo e commento. Roma: Carocci.

Kreft, Lukasz; Botzki, Alexander; Coppens, Frederik; Vandepoele, Klaas; Van Bel, Michiel (2017) PhyD3: a Phylogenetic Tree Viewer with Extended PhyloXML Support for Functional Genomics Data Visualization. BioRxiv. Doi: 10.1101/107276.

Leonardi, Anna M.C. (1991) Introduzione. In: La Divina Commedia, by Dante Alighieri. Milano: Arnoldo Mondadori Editore.

Shaw, Prue (2011) Commedia: a Digital Edition. Birmingham: Scholarly Digital Editions.

Trovato, Paolo (2016) Metodologia editoriale per la Commedia di Dante Alighieri. Ferrara. https://www.youtube.com/watch?v=BfKUOAR9PXA. Date of access: March 19, 2017.

Tuesday, May 16, 2017

Connecting tree and network edges

I have struggled over the years to try to understand the relationship between trees and networks. In one sense, networks are generalizations of trees, and in another sense a tree is just a simplified network. But it is not always that simple.

For example, not all networks can be created by adding edges to a tree (see Networks vs augmented trees); so the connection between trees and networks is not always obvious. Moreover, it is not always easy to determine which tree edges are present in any given network, or which network edges are present in a given tree.

Nevertheless, this should be basic information in phylogenetics — otherwise, how can we know when a tree is adequate for our purposes, or when a network is needed?

It turns out that I have not been alone in struggling to connect trees and networks. Fortunately, some of these other people decided to actually do something about it, rather than simply struggling on. As a result, a computerized way to relate much of the important information connecting trees with networks now exists.
Klaus Schliep, Alastair J. Potts, David A. Morrison and Guido W. Grimm
Intertwining phylogenetic trees and networks.
Methods in Ecology and Evolution (Early View)
To quote the authors:
Here we provide a framework, implemented in the PHANGORN library in R, to transfer information between trees and networks. This includes: (i) identifying and labelling equivalent tree branches and network edges, (ii) transferring tree branch-support to network edges, and (iii) mapping bipartition support from a sample of trees (e.g. from bootstrapping or Bayesian inference) onto network edges.
These three functions are illustrated in this figure, taken from the paper. It should be self-explanatory to anyone who has tried to relate the edges of trees and networks; but if it is not, then you can read an explanation in the paper.
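
The first two of these tasks rest on a simple idea: every tree branch, and every edge (or band of parallel edges) in a split network, corresponds to a bipartition of the taxa, so equivalent edges can be found by comparing bipartitions. The sketch below illustrates that idea in plain Python. It is not the PHANGORN implementation; the nested-tuple tree, the taxon names, the network splits and the support values are all invented for the example.

```python
def normalize(side, all_taxa):
    """Represent a bipartition by the side that lacks the alphabetically first taxon."""
    anchor = min(all_taxa)
    return frozenset(all_taxa - side) if anchor in side else frozenset(side)

def tree_splits(tree, all_taxa):
    """Collect the non-trivial bipartitions implied by a nested-tuple tree."""
    splits = set()

    def leaves_below(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves_below(child) for child in node))
        # Ignore trivial splits (one taxon versus the rest, or the full set).
        if 1 < len(below) < len(all_taxa) - 1:
            splits.add(normalize(below, all_taxa))
        return below

    leaves_below(tree)
    return splits

taxa = frozenset("ABCDE")
tree = (("A", "B"), "C", ("D", "E"))
# Hypothetical splits displayed by a network: AB|CDE, DE|ABC and BC|ADE.
network_splits = {normalize(frozenset(s), taxa) for s in ("AB", "DE", "BC")}

shared = tree_splits(tree, taxa) & network_splits
network_only = network_splits - tree_splits(tree, taxa)

# Transfer (invented) branch support from the tree to matching network edges.
tree_support = {normalize(frozenset("AB"), taxa): 95,
                normalize(frozenset("DE"), taxa): 70}
network_support = {s: tree_support.get(s) for s in network_splits}
```

Network edges whose split has no counterpart in the tree simply receive no support value (`None` here), which is one way of flagging the signal that the tree leaves out.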

The R library referred to, including the source code, along with some examples and vignettes, can be accessed on the PHANGORN CRAN page.

Note that PHANGORN (originally created by Klaus Schliep) also contains other functions related to estimating phylogenetic trees and networks, using maximum likelihood, maximum parsimony, distance methods and Hadamard conjugation. Specifically, it allows you to estimate phylogenies, compare trees and models, explore tree space, and visualize phylogenetic trees and split graphs.

Tuesday, May 9, 2017

Dante and the tree model

I was preparing a blog post on phylogenetic methods for the study of the Divine Comedy, by Dante Alighieri (1265-1321), and it occurred to me that a note on Dante's contribution to the tree model might also be worthwhile. This medieval poet cannot, of course, be described as the father of the Stammbaum, but he should probably be listed among the many sources for the development of the model, and of the linguistic theories that it supported in the 19th century.

The study of Dante's works became almost an international mania with the rise of Romanticism in the 18-19th centuries, and scholars are not strangers to his more obscure works. One of these works is an abandoned linguistic essay entitled De vulgari eloquentia ("On the eloquence in the vernacular language", circa 1305). In short, it is an unfinished manual on composition, a "poetics", with an introductory chapter discussing the appropriate language for poetry. The essay is written in Latin, but from the first paragraphs the author declares that the same language is not suitable for literature, as it is not a living language. Latin is then reserved for scientific and philosophical matters, and the author "ventures in a quest" for a good literary vernacular.

The first paragraphs are full of medieval opinions on language, such as the confusion arising from the Tower of Babel, and a discussion about how a linguistic ability would be superfluous for demons and angels alike. Towards the end the author starts to favor an artificial vernacular language, concluding that no living language (among the 14 dialects of the Italian peninsula) would be good enough. This latter idea was not followed when he wrote the Divine Comedy (which was written in Tuscan, Dante's native dialect), and this probably explains why the essay was abandoned just when the composition of that poem was begun.

However, between the biblical linguistics and the poetic formalism, Dante explores linguistic matters with an almost modern (and sometimes surprising) mindset. For example, he discusses how birds don't talk but simply repeat air movements, he discusses how grammar (i.e. Latin and Greek) is a codification, he provides a detailed, while subjective, map of the Italian vernaculars of the 12th century, and, what matters for us here, explains that not all linguistic differences are due to the "vengeful confusion" arising from the Tower of Babel. Being human constructions, he says, languages are unstable and, as such, change, as proved by many similarities that can't be random and don't really add much confusion (i.e. their differences are too feeble to be a consequence of the punishment of an almighty god). Our problem, he continues, is that changes are gradual and subtle, and as such we don't perceive them; but they do exist, as someone who returns to a city after many years can confirm, or as can be recorded when moving from city to city.

The (genealogical) tree model is implicit but undeniable in the eighth chapter of the first book, when the author uses words such as "root", "planted", and "branches". Here, I also report the original word in Latin, along with a translation adapted from Botterill (2006):
The confusion of languages [after the Tower of Babel] leads me [...] to the opinion that it was then that human beings were first scattered throughout the whole world, into every temperate zone and habitable region, right to its furthest corners. And since the principal root [radix] from which the human race has grown was planted [plantata] in the East, and from there our growth has spread, through many branches [palmites] and in all directions, finally reaching the furthest limits of the West [...]. [...] these people brought with them a tripartite language. Of those who brought it, some found their way to southern Europe and some to northern; and a third group, whom we now call Greeks, settled partly in Europe and partly in Asia. Later, from this tripartite language (which had been received in that vengeful confusion), different vernaculars developed, as I shall show later. For in that whole area that extends from the mouth of the Danube (or the Meotide marshes) to the westernmost shores of England, and which is defined by the boundaries of the Italians and the French, and by the [Atlantic] ocean, only one language prevailed, although later it was split up into many vernaculars by the Slavs, the Hungarians, the Teutons, the Saxons, the English, and several other nations. Only one sign of their common origin remains in almost all of them, namely that nearly all the nations listed above, when they answer in the affirmative, say iò [see the map above, from Elisabeth Burr]. Starting from the furthest point reached by this vernacular (that is, from the boundary of the Hungarians towards the east), another occupied all the rest of what, from there onwards, is called Europe; and it stretches even beyond that.
All the rest of Europe that was not dominated by these two vernaculars was held by a third, although nowadays this itself seems to be divided in three: for some now say oc, some oïl, and some sì, when they answer in the affirmative; and these are the Hispanic, the French, and the Italians. Yet the sign that the vernaculars of these three peoples derive from one and the same language is plainly apparent: for they can be seen to use the same words to signify many things, such as 'God', 'heaven', 'love', 'sea,' 'earth', 'is', 'lives', 'dies', 'loves', and almost all others. Of these peoples, those who say oc live in the western part of southern Europe, beginning from the boundaries of the Genoese. Those who say sì, however, live to the east of those boundaries, all the way to that outcrop of Italy from which the gulf of the Adriatic begins, and in Sicily. But those who say oïl live somewhat to the north of these others, for to the east they have the Germans, on the west and north they are hemmed in by the English sea and by the mountains of Aragon, and to the south they are enclosed by the people of Provence and the slopes of the Apennines.
The De vulgari eloquentia has routinely been printed alongside the Divine Comedy, and was studied, to give some examples, by Thomas Warton in his History of English Poetry (London, 1775), by Johann Gottfried Eichhorn in his Allgemeine Geschichte der Cultur und Litteratur des neueren Europa (Göttingen, 1796), and by August Pott (a student of Franz Bopp) in his Indogermanischer Sprachstamm (1840). The essay was copied in Germany even before the introduction of the printing press; and a German translation, Über die Volkssprache (K. L. Kannegießer, 1845), was published in Leipzig when August Schleicher was already active in linguistic studies.

By this time, it seems, the work was almost a commonplace topic of discussion. Around 1830, when defending his model for the Italian language and complaining about people who proposed a 12th century language for a 19th century nation state, Alessandro Manzoni jokingly remarked that it was "one of those books which nobody actually read, but everybody discusses".

This is one more little note to our narrative on the evolution of the tree model.


Alighieri, D. De Vulgari Eloquentia. Edited and translated by Steven Botterill. Cambridge University Press, 2006.

Elisabeth Burr. Klassifizierung der romanischen Sprachen.

Wednesday, May 3, 2017

On stemmatics and phylogenetic methods

No se publica un libro sin alguna divergencia entre cada uno de los ejemplares. Los escribas prestan juramento secreto de omitir, de interpolar, de variar. [No book is published without some divergence between each of the copies. Scribes take a secret oath to omit, to interpolate, to change.] (Jorge Luis Borges, La lotería en Babilonia, in Ficciones, 1962)
This is the first in a series of posts on stemmatics, a field just as much in love with trees and networks as are phylogenetics and historical linguistics. Being an introduction, it explains what the field does, presents the most important jargon, and offers a list of references that, while suitable for the audience of this blog, is denser than what one might expect for a blog post.

Thank you to Mattis and David for inviting me to write!

Textual criticism

Textual criticism (or, less precisely, "philology") is a discipline concerned with investigating the history of literary, legal, and religious texts in order to explain how differences among the copies of a text (its "witnesses") arose, and with producing "critical editions": scholarly curated versions of a text that either aim to reconstruct the lost original or correct an existing copy.

The problem of divergence between copies of text, with the accumulation of involuntary and deliberate errors, as well as the need for a systematic study of such differences, is as old as writing itself. For example, our current editions for the epic poems of Homer descend from Ancient philological attempts to restore an uncontaminated original (see the first two figures). These include the edition of Pisistratus (VI century BCE, which determined what was to be sung at the Panathenaic Games), and the so-called VMK (Viermännerkommentar, "commentary of the four men") of the Alexandrian School (I-II century BCE), which is generally assumed to be the root of the witnesses that we have.

Van der Valk's reconstruction of the sources for Venetus A, one of the most
important manuscripts of Homer's Iliad (source: Wikipedia).

Erbse's reconstruction of the sources for Venetus A, one of the most important
manuscripts of the Iliad (source: Wikipedia).

Before stemmatics, an edition could be based on a "good copy" (a version considered to be less contaminated or more faithful than others), on a "majority reading" (in which the most attested variant would be chosen), or on a principle of "eclecticism" (with each best reading individually selected by the editor's judgment). Each new version, as expected, contributed even more to the confusion, particularly when changes were deliberate.

Among the texts with long and complex traditions, objects of countless and sometimes bloody disputations on the "correct" readings, are the Bible and codes of laws, for which it was not uncommon to have a different version in each city, with predictable consequences. For example, the first published textual tree, as already covered in this blog (The first Darwinian evolutionary tree), was authored by Carl Johan Schlyter in 1827 in a study precisely on the multiple and conflicting copies of Swedish law.

As such, it is no surprise that objective approaches were soon developed (Homer's VMK edition being one of the first examples), culminating with the development of stemmatics, with its study of the genealogical relationship between witnesses, and its representation of such relationships by means of trees.


As a scientific approach to textual criticism, stemmatics established itself from the beginning of the 19th century as an alternative to emendations based on the opinions and wishes of editors, possibly inspiring both Charles Darwin and August Schleicher (for a general discussion on the development and significance of this method, see Timpanaro 2005). However, more than a "source", we should consider it a branch equally stemming from the "cultural framework" (Macé and Baret 2006: 91) that also gave us Darwinism and historical linguistics.

As was true for these latter disciplines, stemmatics was at first opposed, because of the revolution it brought to its field, along with its genealogical trees. However, just as in these sister disciplines, the results of the new mindset introduced by the explanation of evolution with trees could not be ignored, and this approach is so central to textual criticism that the latter can be divided into periods before and after the work of Karl Lachmann, the "father" of stemmatics, in particular the publication of his edition of Lucretius' De rerum natura (1850). In his commentaries, besides demonstrating the number of lines per page in the lost manuscript at the root of the tradition, Lachmann was even able to demonstrate the kind of script used to write it (Lachmann 1850).

The work he chose is unlikely to have been an accident, given the importance of Lucretius in the development of the scientific mindset (and, as we should remember when dealing with cultural evolution, of Darwin's theories); but this is a matter for a different blog post.


Genealogical trees are so central to the stemmatic method that the field itself is actually named after them. The main goal of an editor is to produce a stemma codicum ("family tree of manuscripts"), or simply stemma, a tree-like structure that supports the textual emendation and represents the "tradition" (the witnesses' genealogy), in analogy with the family trees of Roman families that figured in many texts reviewed by 19th century philologists. Stemma, in fact, is a Greek word meaning garland or wreath, that was incorporated in Imperial Latin to designate a family tree (and, figuratively, nobility itself), as family trees were drawn with a stemma at their top.

In short, stemmatics begins with a recensio, which is an investigation of all total and partial copies of a work. This review is followed by a collatio, a systematic scrutiny of the manuscripts' contents, when readings are aligned and compared. The results of this alignment are used to produce the stemma, following the principle that "community of errors implies community of origin". By analyzing the stemma and the errors, editors finally proceed to the emendatio, which is a reconstruction that explains the known variants, and is intended to represent the "archetype" (a lost witness at the root of the ramification, assumed to be closer to the original than any other copy).
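
The principle that "community of errors implies community of origin" can be made concrete with a small sketch. After a collatio, each witness can be described by the set of loci at which it departs from the reading the editor judges to be correct; witnesses that share many such errors are candidates for the same branch of the stemma. The sigla and error sets below are invented, and simply counting shared errors is only a crude stand-in for an editor's judgment (it ignores, for instance, errors that may have arisen independently).

```python
from itertools import combinations

# Hypothetical collation: witness siglum -> loci where it carries an error.
errors = {
    "A": {3, 17, 42},
    "B": {3, 17, 58},
    "C": {8, 23},
    "D": {8, 23, 42},
}

def shared_errors(errors):
    """Count the errors shared by every pair of witnesses."""
    return {
        (a, b): len(errors[a] & errors[b])
        for a, b in combinations(sorted(errors), 2)
    }

def closest_pairs(errors):
    """Return the pairs of witnesses with the largest number of shared errors."""
    counts = shared_errors(errors)
    top = max(counts.values())
    return sorted(pair for pair, n in counts.items() if n == top)

print(shared_errors(errors))
print(closest_pairs(errors))
```

Here A and B, and likewise C and D, would be grouped first, while the single error shared by A and D is the kind of isolated agreement an editor would have to weigh separately.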

A stemma is conventionally drawn top-to-bottom, with vertical placements roughly indicating the date of the manuscript (the higher, the older). Solid edges ("arrows") indicate descent, while dashed ones imply contamination (scribes using more than one source). Witnesses are usually labeled with abbreviated names or Latin letters, when the manuscript is available, or with Greek letters, when it is missing (with α usually reserved for the archetype and ω for the original). Below is a reproduction of Petrocchi's partial stemma for the tradition of Dante Alighieri's Divine Comedy, which I will cover in a future post. Note that the genealogy is actually a reticulating network rather than a simple tree.

Petrocchi's partial stemma for the Divine Comedy, presented in the
introduction to his critical edition (1965).

The example stemma offered by Maas (1958), adapted below, is still useful to demonstrate the principles of stemmatics. In this example, for a textual emendation manuscript H should be eliminated (as it descends from F), as well as I and J (copies of G). Manuscript C shows a contamination from its collateral D, something which should be considered when weighing errors. Sub-archetypes β and γ are to be inferred from the available witnesses of their branches, and their readings will have the same weight as K, the only member of the third family branching from the archetype (even though it is a recent manuscript), in establishing the "lesson" of α. Errors might be presumed in α itself, or even in the original ω, and in both cases a corrected "lesson" might be offered by the editor on the basis of internal and external evidence.

Exemplary stemma adapted from Maas (1958).
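
The elimination step described for this example can be sketched as a small graph computation. Only the links stated in the text are taken from the example (H from F; I and J from G; C contaminated by D; β, γ and K below α). Which of the witnesses A to G sit under β or γ is a guess made purely for illustration, and the function names are hypothetical.

```python
# Each witness maps to its source(s); lost witnesses carry Greek sigla.
# NOTE: the placement of A-G under β and γ is assumed, not taken from Maas.
parents = {
    "β": ["α"], "γ": ["α"], "K": ["α"],
    "A": ["β"], "B": ["β"], "C": ["β", "D"],   # second source = contamination
    "D": ["γ"], "E": ["γ"], "F": ["γ"], "G": ["γ"],
    "H": ["F"], "I": ["G"], "J": ["G"],
}
lost = {"α", "β", "γ"}

def descripti(parents, lost):
    """Witnesses copied solely from a preserved witness (codices descripti)."""
    return sorted(
        witness for witness, sources in parents.items()
        if len(sources) == 1 and sources[0] not in lost
    )

def retained(parents, lost):
    """Preserved witnesses that remain useful for reconstructing the archetype."""
    drop = set(descripti(parents, lost))
    return sorted(w for w in parents if w not in lost and w not in drop)

print(descripti(parents, lost))
print(retained(parents, lost))
```

The contaminated witness C is kept, since it has more than one source and may preserve readings from either of them; that is why contamination has to be considered when weighing its errors rather than used as grounds for elimination.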

Adoption and practice

Stemmatics has been criticized and confronted since Lachmann's time. It requires very specialized knowledge, for example in distinguishing between monogenetic and polygenetic errors, i.e. those that arose once and those that emerged independently more than once (and that, as such, are not disjunctive). A number of its suppositions are routinely called into question, such as the idea that each copy always derives from a single source (accepting contamination, at most), that each copy has at least as many errors as its source, and, fundamentally, that traditions have one and only one archetype.

Many measures tend to be adopted to reduce the editorial effort. These include eliminating manuscripts considered to be descripti (i.e. proved to descend from a preserved witness, in theory sharing all the errors of their sources), and only performing the collatio in a set of critical passages (loci critici). While a complete stemma and a full collatio are desirable, such compromises might be unavoidable for long texts with ample traditions. For example, in the case of Dante Alighieri's Divine Comedy, after considering the time employed by scholars such as Petrocchi, Sanguineti, and Shaw for their editions, Trovato (2016) estimated that a full stemmatic approach would take 400 man-years.

An alternative to stemmatic methods and suppositions, which also reduces the editorial effort, is found in scholars who follow the work of Joseph Bédier, who successfully challenged the limits of stemmatics by adopting a renewed version of the method of the "good copy" for his editions of medieval texts. The Bédierian method does not refute a scientific approach or methods such as the recensio, the collatio, or even the production of a stemma, but these are used to support the editor's judgment in selecting and curating a bon manuscrit: a good version of the text, to be corrected only where errors can be proved beyond reasonable doubt. In short, trees (and networks) have been central to textual criticism even when stemmatics itself, as a method, is being challenged.

Considering the editorial effort and the analogies with linguistics and biology, it is no surprise that digital workflows have been proposed, along with the development of computer resources and phylogenetic methods. Ideas for new approaches were explored by Froger (1969), and formal phylogenetic methods were attempted by Platnick and Cameron (1977). Recently, the number of editions supported by formal phylogenetic methods and software has increased (see, for example, Barbrook et al. 1998; Stolz 2003; and Lantin, Baret and Macé 2004), also in the face of scientific evaluations of performance (Roos and Heikkilä 2009).

Besides advances in speed and replicability, the new technologies are allowing us to expand the goals of the discipline, moving from electronic editing to computational philology. In fact, while the field has for centuries been defined by the production of critical editions, digital approaches have been shown to support a reduction in the importance of "authorial intention", allowing researchers to focus on the reception of texts by the public, in line with developments of literary theory (Jauss 1982), and with the goals established by the "New Philology" (Cerquiglini 1989). Manuscripts with readings that differ from a supposed original, traditionally described as "corrupted", are changing from copies that were meant to be discarded into data points that contribute to an investigation of human history that is assisted by quantitative data and methods.


Barbrook A.C., Howe C.J., Blake N., Robinson P. (1998) The phylogeny of the Canterbury Tales. Nature 394 (6696): 839.

Cerquiglini B. (1989) Éloge de la variante: histoire critique de la philologie. Aux Travaux. Paris: Éditions du Seuil.

Froger J. (1969) La critique des textes et son automatization. Bulletin De L’Association Guillaume Budé 1(1): 125–129.

Jauss H.-R. (1982) Toward an Aesthetic of Reception. Minneapolis: University of Minnesota Press.

Lachmann C. (1850) De Rerum Natura. Commentarius. Berolini: Imprensis Georgii Reimeri.

Lantin A.-C., Baret P.V., Macé C. (2004) Phylogenetic analysis of Gregory of Nazianzus’ Homily 27. 7èmes Journées Internationales d’Analyse statistique des Données Textuelles, pp. 700-707.

Maas P. (1958). Textual Criticism. Translated by Barbara Flower. Oxford: Oxford University Press.

Macé C.; Baret P.V. (2006) Why phylogenetic methods work: the theory of evolution and textual criticism. Linguistica Computazionale. The Evolution of Texts: Confronting Stemmatological and Genetical Methods 24: 89–108.

Platnick N.I., Cameron H.D. (1977) Cladistic methods in textual, linguistic, and phylogenetic analysis. Systematic Zoology 26: 380–385.

Roos T., Heikkilä T. (2009) Evaluating methods for computer-assisted stemmatology using artificial benchmark data sets. Literary and Linguistic Computing fqp002.

Stolz, M. (2003) New philology and new phylogeny: aspects of a critical electronic edition of Wolfram’s Parzival. Literary and Linguistic Computing 18(2): 139–150.

Timpanaro S. (2005) The Genesis of Lachmann's Method. Translated and edited by G. W. Most. Chicago: University of Chicago Press.

Trovato P. (2016) Metodologia editoriale per la Commedia di Dante Alighieri. Ferrara. See Youtube; date of access: March 19, 2017.

Tuesday, April 25, 2017

The siteswap annotation in juggling, and the power of annotation and modeling

I have been a juggler for more than 20 years now. It started when I was thirteen, and primarily interested in doing magic tricks, but I quickly realized that there are more transparent ways of presenting one's manipulation skills. About 15 years ago, when I was starting my studies in Berlin, there was a booming juggling scene in that city, with many young people, including many geeks who studied mathematics, programming, or physics. I, myself, was studying Indo-European linguistics by then, a field deprived of formalisms and formulas, devoted to the implicit as reflected in scientific prose that is not amenable to formalization, modeling, or transparent annotation.

It was at that time that some jugglers began to develop an annotation system for juggling patterns. The system was very simple, using numbers to denote the height and the direction of balls (or other objects) flying around from hand to hand. The 1 denoted the transfer of one ball from hand to hand without tossing it, the 2 denoted holding one ball in one hand, the 3 a throw from one hand to the other at the height required to juggle three balls, the 4 a throw up in the air so that one would catch the ball with the same hand, and the 5 a crossing throw from one hand to the other, but this time slightly higher, as required when juggling five balls. Some of these numbers are indicated in these animated GIFs.

They called this system siteswap, and they claimed that formalizing juggling was a good way to increase creativity, since one was not required to throw all of the balls with the same number, but could combine different numbers, following some basic mathematical ideas.
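
Two of those basic mathematical ideas are easy to state for plain siteswaps (one throw per beat). The average of the numbers gives the number of balls, and a sequence can only be juggled if no two throws land on the same beat, which means that the values (beat + throw) modulo the pattern length must all differ. These are standard properties of the notation rather than something spelled out in this post, and the function names below are made up for this sketch, which also assumes single-digit throws.

```python
def ball_count(pattern):
    """Average throw value; a whole number for every valid siteswap."""
    throws = [int(ch) for ch in pattern]
    return sum(throws) / len(throws)

def is_valid(pattern):
    """True if no two throws of the repeating pattern land on the same beat."""
    throws = [int(ch) for ch in pattern]
    n = len(throws)
    landings = {(beat + throw) % n for beat, throw in enumerate(throws)}
    return len(landings) == n

for name in ["3", "441", "531", "744", "97531", "432"]:
    print(name, is_valid(name), ball_count(name))
```

The last sequence, 432, is included as a counterexample: its numbers average to three, but the 4 and the 3 would land at the same time, so it is not a jugglable pattern.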

When people told me about this, I was extremely skeptical, probably due to my classical education, which gave me the conviction that juggling is an art, and an art cannot be described in numbers. When people tried to teach me siteswaps, I ridiculed them, showing them some complicated patterns involving body movements (see the next GIFs), and told them they would never be able to describe all the creativity of all the jugglers in the world in numbers.

Only a couple of years later, I realized that the geeks had proven me wrong, when, after a longer break, I was again participating in one of the many juggling conventions that take place throughout the year, in different locations in Europe and the whole world. I saw people doing tricks with three balls that I had never thought of before, and I asked them what they were doing. They answered that these were siteswaps, and that they were juggling patterns they called 441 and 531, respectively, as shown in these GIFs:

I gave in completely when I saw how they applied the same logic to routines with five and more balls, which they called 645, 97531, or 744, respectively. The 97531 especially fascinated me. During this routine, all of the balls end up in one vertical line in the air, for just a moment, but enough even for laymen to see the vertical line, which then immediately breaks down to a normal five-ball pattern, as shown here.
I realized how wrong it was to take the un-annotability of something for granted. But even more importantly, I also understood that models, as restrictive as they may seem to be at first sight, may open new pathways for creativity, showing us things we had been ignoring before.

Only recently, when I promised colleagues to juggle during a talk on linguistics, I detected the parallel with my own studies in historical linguistics. For a long time, the field has been held back by people claiming that things could not be handled formally, for various reasons.

But I am realizing more and more that this is not true. We just need to start with something, some kind of model, which may not be as ideal and as realistic as we might wish it to be, but that may eventually help us to detect things we did not see before. We just need to start doing it, walking in baby-steps, improving our models and our annotation, as well as improving our understanding of the limits and the chances of a given formalization.

Needless to say, the patterns that I deemed to be un-annotatable 10 years ago in juggling can now easily be handled by my colleagues. They did not stop with the normal number system, but kept (and keep) developing it, and they take a lot of inspiration from this.

Tuesday, April 18, 2017

Multimedia phylogeny?

Evolutionary concepts have often been transferred to other fields of study, or derived independently in them, especially in anthropology in the broadest sense, covering all cultural products of the human mind. This includes phylogenetic studies of languages, texts, tales, artifacts, and so on — you will find many examples of such studies in this blog. One of the more recent applications has been to what is sometimes called multimedia phylogeny — the research field that "studies the problem of discovering phylogenetic dependencies in digital media".

I have noted before that phylogenetics in the biological sense is an analogy when applied to other fields, because only in biology is genetic information physically transferred between generations — in the other fields, cultural information transfer is all in the minds of the people, not in their genes (see False analogies between anthropology and biology). This analogy often becomes problematic when applied to other fields, because the practical application of bioinformatics techniques separates the informatics from the bio, and the mathematical analyses focus on trying to implement the informatics without any biological justification.

A recent paper that discusses the application of bioinformatics to multimedia phylogeny exemplifies the potential problems:
Guilherme D Marmerola, Marina A Oikawa, Zanoni Dias, Siome Goldenstein, Anderson Rocha (2017) On the reconstruction of text phylogeny trees: evaluation and analysis of textual relationships. PLoS One 11(12): e0167822.
The authors described their background information thus:
Articles on news portals and collaborative platforms (such as Wikipedia), source code, posts on social networks, and even scientific publications or literary works, are some examples in which textual content can be subject to changes in an evolutionary process. In this scenario, given a set of near-duplicate documents, it is worthwhile to find which one is the original and the history of changes that created the whole set. Such functionality would have immediate applications on news tracking services, detection of plagiarism, textual criticism, and copyright enforcement, for instance.
However, this is not an easy task, as textual features pointing to the documents' evolutionary direction may not be evident and are often dataset dependent. Moreover, side information, such as time stamps, are neither always available nor reliable. In this paper, we propose a framework for reliably reconstructing text phylogeny trees, and seamlessly exploring new approaches on a wide range of scenarios of text reusage. We employ and evaluate distinct combinations of dissimilarity measures and reconstruction strategies within the proposed framework.
So, their solution to the separation of bio from informatics is to try a range of techniques, none of which are based on any particular model of how phylogenetic changes might occur in text documents. All of these methods involve distance-based tree-building.

The essential problem, as I see it, is that without a model of change there is no reliable way to separate phylogenetic information from any other type of information. For example, similarity can arise from many sources, only some of which provide information about phylogenetic history; phylogenetic similarity is a form of "special similarity". In biology, other sources of similarity are usually lumped together as chance similarities, such as convergence, parallelism, etc. Without this basic separation of phylogenetic and chance similarity, it does not matter how many distance measures you use, or how many tree-building methods you employ: if you can't separate phylogeny from chance then you are wasting your time constructing a hypothetical evolutionary history.

The authors' only saving grace is their claim that: "In text phylogeny, unlike stemmatology [the analysis of hand-written rather than digital texts], the fundamental aim is to find the relationships among near-duplicate text documents through the analysis of their transformations over time." The expectation, then, is that the phylogenetic similarity of the texts will be high, which will thus reduce the possibility of chance similarities. Sadly, it will also reduce the probability that the similarities will contain any phylogenetic information at all — this is the classic short-branches-are-hard-to-reconstruct problem in phylogenetics.

For digital texts, the authors employ three distance measures: edit distance, normalized compression distance, and cosine similarity. None of these are model-based in any phylogenetic sense (although the first one is used in alignment programs such as Clustal) — I have discussed this in the post on Non-model distances in phylogenetics. Their tree-building methods include: parsimony, support vector machines (a machine-learning form of classification), and random forests (a decision-tree form of classification). Once again, none of these is model-based in terms of textual changes.
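To make the point concrete, here is a minimal Python sketch of the three distance measures named above. It is my own illustration, not the authors' implementation: the function names are hypothetical, the compression distance uses zlib as a stand-in compressor, and the cosine similarity is computed on simple word counts. Note that each function returns a single number describing how alike two texts are; none of them carries any notion of which text was derived from which.

```python
import math
import zlib
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def ncd(a: str, b: str) -> float:
    """Normalized compression distance, with zlib as an assumed compressor."""
    def size(s: str) -> int:
        return len(zlib.compress(s.encode("utf-8")))
    ca, cb, cab = size(a), size(b), size(a + b)
    return (cab - min(ca, cb)) / max(ca, cb)


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between word-count vectors (assumed tokenization:
    lower-case, split on whitespace)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0
```

All three are symmetric in their two arguments (the compression distance approximately so), which is exactly why any evolutionary direction has to come from somewhere else.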

A final issue is the insistence on trees as the model of a phylogeny. In stemmatology, for example, a network is a more obvious phylogenetic model, because hand-written texts can be copied from multiple sources. Indeed, this distinction plays an important role in the first application of phylogenetics to stemmatology (see the post on An outline history of phylogenetic trees and networks). Perhaps this is not an issue for "near-duplicate text documents", but it does seem like an unnecessary restriction. Moreover, one of the empirical examples used in the paper actually has a network history, which therefore does not match the authors' reconstructed tree.

Tuesday, April 11, 2017

Morgan Colman and English royal genealogies

I noted in an earlier post (Drawing family trees as trees) that from 1576 CE Scipione Ammirato, an Italian writer and historian, set up a cottage industry producing family trees for the nobility. Over the years, he was not the only person to try to make money this way.

In the English-speaking world, one of these was Morgan Colman (or Coleman), who produced an impressively large genealogy of King James I and Queen Anne, in 1608. Nathaniel Taylor has commented: "Of all the congratulatory heraldic and genealogical stuff prepared early in James’s reign, this might be the most impressive piece of genealogical diagrammatic typography."

Unfortunately, we do not have a complete copy of this family tree. It was published as a set of quarto-sized bifolded sheets that needed to be joined together. Below is a small image of the copy in the British Library, which gives you an idea of the intended arrangement, and its incompleteness. Taylor has a larger PDF copy available here.

The WorldCat library catalog lists the work as "Most noble Henry ; heire (though not son)", which is the first line of the dedicatory verse at the top left. Elsewhere, I have seen it referred to as "The Genealogies of King James and Queen Anne his wife, from the Conquest".

It is usually described as "a genealogy of James I and Anne of Denmark in 10 folio sheets [sic], with their portraits in woodcut, accompanied by complimentary verses to Henry Prince of Wales, the Duke of York (Prince Charles) and Princess Elizabeth, and with the coats-of-arms of the nobles living in 1608 and of their wives."

A Christie's auction notes the sale of an illuminated manuscript of the "Genealogy of the Kings of England, from William the Conqueror to Elizabeth I", produced by Colman in 1592. The accompanying text reads (in part):
Colman, a scribe and heraldic painter, was steward and secretary to various eminent public figures, including successive Lord Keepers of the Great Seal, Sir John Puckering (1592-96) and Sir Thomas Egerton (1596-1603) who caused his election as MP for Newport, Cornwall in 1597. Heraldic and genealogical compositions were his speciality and in 1608 he had composed, and prepared for printing, genealogies of King James and his Queen published as ten large quarto sheets; in 1622 a payment records his work for James I in producing two large and beautiful tables for the King's lodgings in Whitehall and for making many of the genealogical tables for 'His Majesty's honour and service'. But these successes were a distant prospect in 1592 when he produced the present manuscript: in that year he petitioned for the post of York Herald and a second petition at about this date, possibly to Sir John Puckering, solicits the addressee's continued support for his advancement. This genealogy appears therefore to be part of a campaign to secure employment: the writer ends his summary of contents 'Wherein if the simplicity of well-meaning purpose, maie procure desired accept'on then rest persuaded that the industrious hand is fullie prepared spedelie to produce matter for more ample contentment.' The inclusion of Francis Bacon's arms at the end of his work shows that Colman had hopes of securing Bacon's patronage: by 1592 Bacon's political and legal career was well established, he was confidential adviser to the Earl of Essex, the Queen's favourite, and had hopes of high office. Colman, however, hedged his bets; another copy of this genealogy survives, though incomplete and lacking the arms of a recipient.
Colman apparently petitioned for the office of herald in the latter part of the reign of Queen Elizabeth I, but never obtained it.

Tuesday, April 4, 2017

Terry Gilliam's film career

Terence Vance Gilliam, the well-known film director, has been in the news recently, for trying yet again to film his movie The Man Who Killed Don Quixote. This movie started back in the early 1990s, and has now been up and down like a yo-yo for more than 25 years. Maybe he will complete it this time, which he didn't last year, or in 2010 or 2008 — and it is cinema legend what happened back in 2000 (as shown in the documentary Lost in La Mancha).

It has been said of Gilliam that "his directorial vision has secured his rightful place within the pantheon of substantive filmmakers as well as appreciative, if selective, audiences throughout his career." This means that his films often do well, but not all that well; he is more than an art-film maker, but not quite a mainstream director. You either love his movies or you don't — there is little or no middle ground.

Gilliam is probably best known for wanting to make what are called "independent" films but which require studio-scale funding, and then fighting with the studio executives over the finished product. He clearly wants to be an independent auteur but without the tight budget that normally goes with it. In other words, he makes his own bed and then has trouble lying in it.

Being a director of some renown, there are plenty of people who have been interested in providing retrospectives and commentaries on Gilliam's career. After all, that sort of thing seems to be the principal activity in the arts world — you are either a creator or a commentator, or sometimes both (such as film commentator turned film director Peter Bogdanovich).

So, it might be worthwhile to look at what some of these commentators have thought about Gilliam's career, as represented by his directorial repertoire of completed films. This ignores his involvement with television animations and various commercials.

To date, the Gilliam directorial oeuvre consists of 12 feature-length movies:
  • Monty Python and the Holy Grail (1975)
  • Jabberwocky (1977)
  • Time Bandits (1981)
  • Brazil (1985)
  • The Adventures Of Baron Munchausen (1988)
  • The Fisher King (1991)
  • Twelve Monkeys (1995)
  • Fear And Loathing In Las Vegas (1998)
  • The Brothers Grimm (2005)
  • Tideland (2005)
  • The Imaginarium of Dr Parnassus (2009)
  • The Zero Theorem (2013)
and 5 short films:
  • Storytime (1968)
  • The Miracle of Flight (1974)
  • The Crimson Permanent Assurance (1983)
  • The Legend of Hallowdega (2010)
  • The Wholly Family (2011)
In the modern world, arts commentators tend to provide rankings of works of art, telling us which work is "best" and which "worst". If nothing else, this allows a mathematical analysis, although I am never quite sure how one goes about actually ranking works of art in some linear series.

The available commentaries that contain ranked lists of Gilliam's films include some personal choices, some compilations from members of the public, and some compilations from professional critics.
There is also a list based on the adjusted US box office grosses (Box Office Mojo); there is a combined score from multiple sources (Ultimate Movie Rankings); and the Top 10 Films site does not rank three of the films. I will ignore these latter three lists, since they are not directly comparable to the other lists.

Few commentators have included the short films in their discussion, and so I will start my analysis with the two sources who have done so. Here is a time-course graph of the 17 films as ranked independently by both IndieWire and IMDB.

Note that both lists agree that Gilliam was at his best (i.e. he produced the top third of his works) during the middle period of his career, and that he hasn't produced anything of note this century. This does not bode well for the future success of The Man Who Killed Don Quixote. [Note: The failure of this movie to be made is responsible for the large gap between films from 1998 to 2005.]

We could now use a phylogenetic network as an exploratory data analysis to display the consensus rankings of the feature films (only), from all of the commentators listed above. As usual, I first used the Manhattan distance to calculate the similarity of the different films based on their rankings. This was followed by a neighbor-net analysis to display the between-film similarities as a network. Films that are closely connected in the network are similar to each other based on their critic rankings, and those that are further apart are progressively more different from each other.
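For anyone unfamiliar with the first step, the Manhattan distance between two films is simply the sum of the absolute differences between their ranks, taken across the commentators. Here is a minimal Python sketch of that calculation. The film labels and the ranks are invented purely for illustration (they are not the data used for the network), and the function names are my own.

```python
from itertools import combinations

# Hypothetical ranks (1 = best) given by three imaginary commentators.
# These are placeholder values, not the rankings analysed in this post.
rankings = {
    "Film A": [1, 1, 2],
    "Film B": [2, 2, 1],
    "Film C": [3, 3, 3],
}


def manhattan(u, v):
    """Sum of absolute rank differences across commentators."""
    return sum(abs(x - y) for x, y in zip(u, v))


def distance_matrix(ranks):
    """Pairwise Manhattan distances, keyed by (film, film) pairs."""
    return {(a, b): manhattan(ranks[a], ranks[b])
            for a, b in combinations(ranks, 2)}


distances = distance_matrix(rankings)
for pair, d in distances.items():
    print(pair, d)
```

A matrix of this sort is what gets handed to a neighbor-net program (SplitsTree is one that implements it) to produce the network; the sketch does not attempt the network construction itself.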

The network shows a straightforward pattern from the highest ranked films at the top-right to the lowest at the bottom-left. In the graph, the films are numbered in the order of their production (not their ranking!). So, six of Gilliam's first seven films as director are the highest-ranked ones, by consensus, with Jabberwocky plus his final five films as the lowest-ranked.

Most of the commentators selected Brazil as their number one film, with occasional votes for Monty Python and the Holy Grail. More than half of the commentators selected The Brothers Grimm as the worst film, with Tideland running a strong second.

There is nothing unusual about any of this, of course. It is a truism of social history that most people, whether they are artists or scientists, do their most interesting and influential work during the earlier part of their career. From Isaac Newton to Albert Einstein, most scientists coast through their careers after age 35, sometimes in their later years still collecting awards for the useful work they did 20 years before. The best-known exception was Louis Pasteur, who made significantly different major contributions to chemistry and biology during his 20s, 30s and 40s.

Well, artists are no different. Very few of them become famous during their later life, but instead continue to be "interesting" without being either as original or influential as they were in their earlier career. They are often well known and well respected, although just as often completely forgotten, or even unknown to later generations. Gilliam, at least, has not suffered the latter fate.