A complaint that often comes up in introductions to studies on computational phylogenetics is that the number of possible binary trees grows very fast (loosely factorialesquely; for rooted trees on n entities it is the double factorial (2n − 3)!!) as a function of the number of entities we are attempting to relate. This means that, regardless of the method being used, for larger datasets it is usually not possible to evaluate every possible tree for how well it fits the data.
I wonder, though, if there might be a shortcut that would simplify linguistic analyses in particular a fair bit: geographical restrictions on branching. Languages, as a rough generalization, occupy territories on the 2D surface of the Earth; they have a number of neighbors, and usually have a neighboring language as their closest relative. Languages far separated from their relatives tend to be exceptional cases, often confirmable from history as recent intrusions.
So, instead of investigating arbitrary binary trees, we perhaps ought to only investigate binary trees where each division leaves two currently or historically contiguous groups of daughter languages. This restriction seems likely to be quite powerful in paring down the seemingly intractably vast hypothesis space.
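To make the restriction concrete, here is a minimal sketch of a check for it, assuming a tree is written as nested pairs of language names and geography as a symmetric table of neighbors. The names `leaves`, `is_connected` and `respects_geography` are my own labels for illustration, not anything standard.

```python
def leaves(tree):
    """Set of language names under a node; a leaf is a plain string."""
    if isinstance(tree, str):
        return {tree}
    left, right = tree
    return leaves(left) | leaves(right)


def is_connected(group, neighbours):
    """True if the languages in `group` form one contiguous block."""
    group = set(group)
    start = next(iter(group))
    seen = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        for other in neighbours.get(current, ()):
            if other in group and other not in seen:
                seen.add(other)
                stack.append(other)
    return seen == group


def respects_geography(tree, neighbours):
    """True if every division in the tree leaves two contiguous groups."""
    if isinstance(tree, str):
        return True
    left, right = tree
    return (is_connected(leaves(left), neighbours)
            and is_connected(leaves(right), neighbours)
            and respects_geography(left, neighbours)
            and respects_geography(right, neighbours))


# Hypothetical three-dialect chain A - B - C (assumed symmetric).
chain = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
print(respects_geography((("A", "B"), "C"), chain))  # True
print(respects_geography((("A", "C"), "B"), chain))  # False: A and C do not touch
```

This only covers current contiguity; historically attested contiguity would have to be added to the neighbor table by hand.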
Consider the simplest possible class of examples, the linear dialect continuum:
- If given a western dialect A, a central dialect B, and an eastern dialect C, the only sensible phylogenetic hypotheses are ((A B) C) and (A (B C)); the tree with B as an outgroup is rejectable.
- if given four dialects A B C D, we will accept five trees, (A (B (C D))), (A ((B C) D)), ((A B) (C D)), ((A (B C)) D), (((A B) C) D), and reject ten.
- if given five dialects, we will accept 14 trees and reject 91.
- if given six dialects, we will accept 42 trees and reject 903.
- if given seven dialects, we will accept 132 trees and reject 10 263.
- if given eight dialects, we will accept 429 trees and reject 134 706.
- etc; as can be seen, the efficiency of this pruning method grows quite fast.
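The counts in this list can be checked by brute force. The following is a small sketch that assumes the accepted trees are exactly the bracketings of the dialects in their west-to-east order, and that the total is the number of rooted binary trees on n labeled leaves, (2n − 3)!!. The function names are made up for the example.

```python
def bracketings(dialects):
    """All binary bracketings of a sequence, keeping its left-to-right order."""
    if len(dialects) == 1:
        return [str(dialects[0])]
    found = []
    # Cut the row at every possible point and bracket each side recursively.
    for cut in range(1, len(dialects)):
        for left in bracketings(dialects[:cut]):
            for right in bracketings(dialects[cut:]):
                found.append(f"({left} {right})")
    return found


def rooted_binary_trees(n):
    """(2n - 3)!!, the number of rooted binary trees on n labeled leaves."""
    total = 1
    for k in range(3, 2 * n - 2, 2):
        total *= k
    return total


for n in range(3, 9):
    accepted = len(bracketings("ABCDEFGH"[:n]))
    rejected = rooted_binary_trees(n) - accepted
    print(n, accepted, rejected)
```

For three to eight dialects this prints the same accepted and rejected counts as listed above; the accepted counts are the Catalan numbers.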
As I’ve seen recently remarked, 50 languages suffice to generate about 3·10⁷⁶ binary trees, almost as many as there are atoms in the universe (not actually more, but close enough). Meanwhile string bracketings grow rather more slowly: counting the same way as in the list above (2 for three dialects, 5 for four, 14 for five), with 50 dialects we only reach about 5·10²⁶.
This is still huge enough (i.e. about 500 000 000 000 000 000 000 000 000 trees) that I don’t think my method here will help much in studies of Austronesian phylogeny in particular. But in smaller cases, with on the order of 10-20 language varieties under investigation, the difference will be quite significant.
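For larger numbers of varieties, enumeration is out of the question, but both counts have closed forms. This sketch assumes the same two formulas as before: (2n − 3)!! for all rooted binary trees, and the Catalan number C(n − 1) for bracketings of a linear continuum. The function names are again just illustrative.

```python
from math import comb


def rooted_binary_trees(n):
    """(2n - 3)!!: all rooted binary trees on n labeled leaves."""
    total = 1
    for k in range(3, 2 * n - 2, 2):
        total *= k
    return total


def linear_bracketings(n):
    """Catalan number C(n - 1): bracketings of n dialects in a row."""
    return comb(2 * (n - 1), n - 1) // n


for n in (10, 15, 20, 50):
    every = rooted_binary_trees(n)
    kept = linear_bracketings(n)
    print(f"{n:>2} varieties: {every:.2e} trees in total, "
          f"{kept:.2e} contiguous, about 1 in {every // kept:.1e} kept")
```

With ten varieties, for instance, this gives 34 459 425 trees in total against 4 862 contiguous ones.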
In a real-world situation, dialect geography rarely works quite this easily though. I suppose a slightly better test model might be polyhex connectivity graphs — these allow up to 6 neighbors for each individual language, something that ought to suffice for most cases. I might be tinkering with these in the near future.
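As a starting point for that tinkering, here is a rough sketch that counts the admissible trees for an arbitrary neighbor graph rather than a line. It assumes a tree is admissible when every division splits a contiguous group into two contiguous groups, and it takes time exponential in the number of languages, so it is only usable for small cases. The seven-cell "flower" of hexes at the end is an invented test layout, and `count_contiguous_trees` is a hypothetical name.

```python
from functools import lru_cache


def count_contiguous_trees(adjacency):
    """Count rooted binary trees in which every clade is connected.

    `adjacency` maps each language to the set of its neighbours;
    every language must appear as a key.
    """
    nodes = sorted(adjacency)
    index = {name: i for i, name in enumerate(nodes)}
    nbr = [0] * len(nodes)  # neighbour sets as bitmasks
    for name, others in adjacency.items():
        for other in others:
            nbr[index[name]] |= 1 << index[other]
            nbr[index[other]] |= 1 << index[name]

    @lru_cache(maxsize=None)
    def connected(mask):
        seen = mask & -mask  # start flood fill from the lowest set bit
        frontier = seen
        while frontier:
            i = frontier.bit_length() - 1
            frontier &= ~(1 << i)
            new = nbr[i] & mask & ~seen
            seen |= new
            frontier |= new
        return seen == mask

    @lru_cache(maxsize=None)
    def count(mask):
        if mask & (mask - 1) == 0:
            return 1  # a single language
        rest = mask ^ (mask & -mask)
        total = 0
        sub = rest
        # `sub` is the side without the lowest language, so each
        # division is visited exactly once.
        while sub:
            if connected(sub) and connected(mask ^ sub):
                total += count(sub) * count(mask ^ sub)
            sub = (sub - 1) & rest
        return total

    return count((1 << len(nodes)) - 1)


# Invented layout: one central hex (0) surrounded by a ring of six.
flower = {0: set(range(1, 7))}
for i in range(1, 7):
    flower[i] = {0, i % 6 + 1, (i - 2) % 6 + 1}
print(count_contiguous_trees(flower))
```

As a sanity check, a plain chain of languages should give back the bracketing counts of the linear continuum, and a graph where everything neighbors everything should give back the unrestricted total.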
Obviously this constraint can still be violated by occasional majorly expansive languages, e.g. Persian. I suspect we do not have sufficient information to tell right away which of the dozens of minority Iranian languages were its original neighbors. Though in these cases a solution that’s in principle available is to divide such a language itself into a large number of tiny dialects, and treat each of them separately…