Tuesday, September 26, 2017

Some desiderata for using splits graphs for exploratory data analysis


This is the 500th post from this blog, making it one of the longest-running blogs in phylogenetics, if not the longest. For example, among the phylogenetics blogs that I have previously listed, there has been only one post so far this year that has not been about a specific computer program.

Our first blog post was on Saturday 25 February 2012; and most weeks since then have had one or two posts. We have covered a lot of ground during that time, focusing on the use of network graphs for phylogenetic data, broadly defined (ie. including biology, linguistics, and stemmatology). However, we have not been averse to applying what are known as "phylogenetic networks" to other data, as well; and to discussing phylogenetic trees, when appropriate.


For this 500th post, I thought that I should focus on what seems to me to be one of the least appreciated aspects of biology: the need to look at data before formally analyzing it.

Phylogeneticists, for example, have a tendency to rush into some specified form of phylogenetic analysis, without first considering whether that analysis is actually suitable for the data at hand. It is therefore wise to investigate the nature of the data first, before formal analysis, using what is known as exploratory data analysis (EDA).

EDA involves getting a picture of the data, literally. That picture should be clear, as well as informative. That is, it should highlight some particular characteristics of the data, whatever they may be. Different EDA tools are likely to reveal different characteristics, and there is no single tool that does it all. That is why it is called "exploration", because you need to have a look around the data using different tools.

This is where splits graphs come into play; they are perhaps the most important tool developed for phylogenetics over the past 50 years.

Splits graphs

Splits graphs are the best current tools for visualizing phylogenetic data. They were developed back in 1992, by Hans-Jürgen Bandelt & Andreas Dress. These graphs had a checkered career for the first 15 years, or so, but they have become increasingly popular over the past 10 years.

It is important to note that splits graphs are not intended to represent phylogenetic histories, in the sense of showing the historical connections between ancestors and descendants. This does not mean that they cannot do so, but it is not their intended purpose. Their purpose is to display phenetic data patterns efficiently. In this sense, calling them "phylogenetic networks" may be somewhat misleading: they are data-display networks, not evolutionary networks.

A split is simply a partitioning of a group of objects into two mutually exclusive subgroups (a bipartition). In biology, these objects can be individuals, populations, species, or even higher taxonomic groups (OTUs); and in the social sciences, they might be languages or language groups, or they could be written texts, or verbal tales, or tools or any other human artifacts. Any collection of objects will contain a set of such splits, either explicitly (eg. based on character data) or implicitly (eg. based on inter-object distances). A splits graph simultaneously displays some subset of the splits.
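To make the idea of explicit splits concrete, here is a minimal Python sketch that reads the splits off a small binary character matrix and counts how many characters support each one. The function name `splits_from_characters`, the data layout, and the four-object matrix are all made up for illustration; this is not how any particular program stores its splits.

```python
from collections import Counter


def splits_from_characters(matrix):
    """Count the characters supporting each split.

    `matrix` maps each object name to a string of binary states ('0'/'1').
    A split is stored as a frozenset of its two sides, so A|B equals B|A.
    (Hypothetical helper, for illustration only.)
    """
    taxa = frozenset(matrix)
    n_chars = len(next(iter(matrix.values())))
    support = Counter()
    for i in range(n_chars):
        side_a = frozenset(t for t in taxa if matrix[t][i] == "0")
        side_b = taxa - side_a
        if side_a and side_b:  # a constant character defines no split
            support[frozenset((side_a, side_b))] += 1
    return support


# Made-up data: four objects scored for four binary characters.
example = {
    "A": "0011",
    "B": "0010",
    "C": "1100",
    "D": "1101",
}
splits = splits_from_characters(example)
for split, count in splits.items():
    left, right = (sorted(side) for side in split)
    print(left, "|", right, "supported by", count, "character(s)")
```

In this toy matrix, three characters support the split AB|CD and one supports AD|BC. These two splits conflict with each other, which is exactly the kind of pattern that a splits graph is meant to show.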

Ideally, a splits graph would display all of the splits; but for realistic biological data this is not likely to happen — the graph would simply be too complex for interpretation. So, a series of graphing algorithms have been developed that will display different subsets of the splits. That is, splits graphs actually form a family of closely related graphs. Technically, the Median Network is the only graph type that tries to display all of the splits; however, the result will usually be too complicated to be useful for EDA.

So, these days there is a range of splits-graph methods available for character-based data (such as Median Networks and Parsimony Splits), distance-based data (such as NeighborNet and Split Decomposition), and tree-based data (such as Consensus Networks and SuperNetworks). In population genetics, haplotype networks can be produced by methods that conceptually modify Median Networks (such as Reduced Median Networks and Median-Joining Networks).

The purpose of this post, however, is not to discuss all of the types of splits graphs, but to consider what computer tools we would need in order to successfully use this family of graphs for EDA in phylogenetics.


Desiderata

The basic idea of EDA is to have a picture of the data. So, any computer program for EDA in phylogenetics needs to be able to quickly and easily produce the splits graph, and then allow us to explore and manipulate it interactively.

To do this, the features listed below are the ones that I consider to be most helpful for EDA (and thanks to Guido Grimm and Scot Kelchner for making some of the suggestions). It would be great to have a computer program that implements all of these features, but this does not yet exist. SplitsTree has some of them, making it the current program of choice. However, there is quite some way to go before a truly suitable program could exist.

Note that these desiderata fall into several groups:
  1. evaluating the network itself
  2. comparing the network to other possible representations of the data
  3. manipulating the presentation of the network
It is desirable to be able to interactively:
  • specify which supported splits are shown in the graph; eg. show only those explicitly supported by characters
  • list the split-support values
  • highlight particular splits in the graph — eg. by clicking on one of the edges
  • identify splits for specified taxon partitions (if the split is supported) — this is the complement to the previous one, in which we specify the split from a list of objects, not from the graph itself
  • identify which splits are sensitive to the model used — eg. different network algorithms
  • identify which edges are missing when comparing a planar graph with an n-dimensional one — this would potentially be complex if one compares, say, a NeighborNet to a Median Network
  • map support values onto the graph (ie. other than split support, which is usually the edge length) — eg. bootstrap values
  • evaluate the tree-likeness of the network — ie. the extent of reticulation needed to display the data
  • map edges from other networks or trees onto the graph — this allows us to compare graphs, or to superimpose a specified tree onto the network
  • find out if the network is tree-based, by breaking it down into a defined number of trees, along with a measure of how comprehensively these trees capture the network
  • create a tree-based network by having the network be the super-set of some specified tree — eg. the NeighborNet graph could be a superset of the Neighbor-Joining tree
  • manipulate the presentation of the graph — eg. orientation, colours, fonts, etc
  • remove trivial splits — eg. those with edges shorter than some specified minimum, assuming that edge length represents split support
  • plot characters onto the graph — possibly next to the object labels, but preferably on the edges if they are associated with particular partitions
  • examine which subsets of the data are responsible for the reticulations; eg. for character-based inputs this might be a sliding window that updates the network for each region of an alignment, or for tree-based inputs it might be a tree inclusion-exclusion list.
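A few of the items above (removing weakly supported splits, looking up the split for a specified partition, and getting a rough idea of tree-likeness) can be sketched on a plain table of weighted splits. The Python below is only an illustration under stated assumptions: the function names and the weights are hypothetical, nothing here is the interface of SplitsTree or any other program, and counting incompatible pairs of splits is a crude stand-in for a proper tree-likeness measure.

```python
from itertools import combinations


def make_split(side, taxa):
    """Build the split side | (everything else) as a frozenset of two sides."""
    side = frozenset(side)
    return frozenset((side, frozenset(taxa) - side))


def filter_by_weight(weights, minimum):
    """Keep only the splits whose weight is at least `minimum`."""
    return {s: w for s, w in weights.items() if w >= minimum}


def find_split(weights, side, taxa):
    """Return the weight of the split defined by `side`, or None if absent."""
    return weights.get(make_split(side, taxa))


def compatible(s1, s2):
    """Two splits are compatible if some pair of their sides does not overlap."""
    return any(not (a & b) for a in s1 for b in s2)


def incompatible_pairs(weights):
    """List the pairs of splits that cannot be shown together on one tree."""
    return [(s1, s2) for s1, s2 in combinations(weights, 2)
            if not compatible(s1, s2)]


# Made-up weighted splits for five objects.
taxa = ["A", "B", "C", "D", "E"]
weights = {
    make_split("AB", taxa): 3.0,
    make_split("ABC", taxa): 2.0,
    make_split("AC", taxa): 0.5,
    make_split("A", taxa): 1.0,
}

print(find_split(weights, "DE", taxa))   # same split as ABC|DE
print(len(incompatible_pairs(weights)))  # conflicts among all splits
kept = filter_by_weight(weights, 1.0)
print(len(incompatible_pairs(kept)))     # conflicts after filtering
```

A set of splits with no incompatible pairs can be drawn as a tree, so the count drops to zero here once the weak split AC|BDE is filtered out. In an interactive program, each of these steps would ideally be a click or a slider rather than a function call.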
Other relevant posts

Here are some other blog posts that discuss the use of splits graphs for exploring genealogical data.

How to interpret splits graphs

Recognizing groups in splits graphs

Splits and neighborhoods in splits graphs

Mis-interpreting splits graphs
