First, during the GEBA pilot project, comparatively closely related organisms were sometimes targeted. Second, a genome of at least one member of a densely sampled group of organisms is more likely to be targeted by a genome project other than GEBA than is the genome of an isolated organism or group of organisms. The final goal of the novel algorithm was that the scores, when summed over all leaves (i.e., species or subspecies present; see below) of the underlying phylogenetic tree, should yield a value that serves as the score of the entire tree in a biologically sensible manner. This feature allows for estimates of the number of genome projects needed to cover a certain percentage of the total phylogenetic diversity.
If phylogenetic diversity is measured using a sum-of-branch-length approach, it should be possible to simply add together the scores of distinct subtrees, including the scores of distinct leaves, to obtain the scores of their parent subtrees or of the entire underlying phylogenetic hypothesis. With such an approach, saturation effects caused by the inclusion of suitable targets could be assessed easily.
Algorithm
We devised a scoring system for the leaves in a rooted topology with branch lengths. To comply with the second design goal, the branch lengths between each leaf and the root node had to be added up in some manner. To agree with the first design goal, this had to be done irrespective of whether any leaves were already marked in some way (e.g., as already targeted for genome projects).
That is, none of the leaves themselves could be downweighted or even deleted. For compliance with the fourth design goal, however, some downweighting had to be applied to avoid counting branches several times and thus overestimating overall phylogenetic diversity. For this reason, we considered scores, henceforth called Relative Phylogenetic Diversity (RPD), which proportionally downweight the lengths of shared (i.e., internal) branches. Two versions were examined, a balanced (bRPD) and an unbalanced (uRPD) version. uRPD weights each pair of sister clades equally, irrespective of their respective numbers of leaves, whereas bRPD takes the subtree sizes into account. Probabilistic interpretations come into play here. For example, consider leaf A in Figure 1.
The branch between nodes A and AB is not shared with another leaf; character changes that occurred on it (whose number is proportional to the branch length) may have led to, e.g., novel sets of proteins in A [10], but not in any other leaf. Changes on the branch between nodes AB and ABC, however, have affected both A and B, whereas those on the branch between ABC and ABCDE have influenced the leaves A, B and C.
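
The sharing of branches described above can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: the tree reuses the node names A, B, C, AB, ABC and ABCDE from the text, the clade DE and all branch lengths are invented for illustration, and the two functions are only one plausible reading of the verbal descriptions of bRPD (each branch divided by the number of leaves below it) and uRPD (each branch split equally between sister clades at every node). The names `PARENT`, `size_weighted_scores` and `equal_split_scores` are hypothetical.

```python
from collections import defaultdict

# Hypothetical rooted tree: node -> (parent, length of the branch to the parent).
# Node names follow the text; the clade DE and all lengths are assumptions.
PARENT = {
    "A": ("AB", 1.0), "B": ("AB", 2.0), "AB": ("ABC", 3.0),
    "C": ("ABC", 4.0), "ABC": ("ABCDE", 6.0),
    "D": ("DE", 1.5), "E": ("DE", 2.5), "DE": ("ABCDE", 5.0),
}


def children_of(parent):
    """Invert the child -> parent map into a parent -> children map."""
    kids = defaultdict(list)
    for node, (par, _) in parent.items():
        kids[par].append(node)
    return dict(kids)


def leaves_below(node, kids):
    """Return all leaves in the subtree rooted at `node`."""
    if node not in kids:
        return [node]
    found = []
    for child in kids[node]:
        found.extend(leaves_below(child, kids))
    return found


def size_weighted_scores(parent):
    """Assumed bRPD-like score: divide each branch by the leaf count below it."""
    kids = children_of(parent)
    scores = defaultdict(float)
    for node, (_, length) in parent.items():
        leaves = leaves_below(node, kids)
        for leaf in leaves:
            scores[leaf] += length / len(leaves)
    return dict(scores)


def equal_split_scores(parent):
    """Assumed uRPD-like score: split each branch equally between sister clades."""
    kids = children_of(parent)
    scores = defaultdict(float)

    def distribute(node, amount):
        if node not in kids:  # a leaf keeps whatever share reaches it
            scores[node] += amount
            return
        for child in kids[node]:
            distribute(child, amount / len(kids[node]))

    for node, (_, length) in parent.items():
        distribute(node, length)
    return dict(scores)


if __name__ == "__main__":
    total = sum(length for _, length in PARENT.values())
    for name, func in [("size-weighted", size_weighted_scores),
                       ("equal-split", equal_split_scores)]:
        scores = func(PARENT)
        print(name, scores, "sum:", sum(scores.values()), "tree length:", total)
```

Under these assumptions, leaf A receives its own branch in full, half of the branch between AB and ABC, and either a third (size-weighted) or a quarter (equal-split) of the branch between ABC and ABCDE. In both variants the leaf scores sum to the total branch length of the tree, which is the additivity property required by the final design goal.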