Long-Distance Comparisons As Butterflies

One of the rationality-cluster blogs here on WordPress, Aceso Under Glass, a while ago posted about a concept I find immediately useful: “Butterfly Ideas”. Roughly speaking, these are hypotheses that need further development and are probably not ripe for serious criticism as they stand, but that could benefit from preliminary discussion (read the full post for more).

On this blog and elsewhere, I have repeatedly entertained a variety of “long-distance” linguistic relationships: Nostratic, Uralo-Yukaghir, Uralo-Eskimo, the works, despite not being so far highly committed to any of them. One idiom I’ve previously used to defend this is “big fish are worth angling even if you don’t catch any”; that there are major potential gains for our understanding of history (both intra-linguistic and extra-linguistic) if any of these theories start to prove themselves in more detail. Or as the more succinct modern spin goes, “big if true”. A second motivation is provided by what I have called the “cell theory of language”: spoken natural languages only come from other natural languages, never out of nothing. [1] This gives a strong prior that all natural languages are, indeed, related, even if we currently lack the knowledge of the details. Factoring in anthropology further gives strong reasons to believe also in the existence of a number of “bottleneck proto-languages”, such as Proto-Australian, Proto-Amerind or Proto-Exo-African. So big fish are very likely indeed out there, even if we are not sure if our lures are working. Then again, these are weaker boundary conditions that do not establish exactly which currently known families would be the daughters of such a proto-language. E.g. who knows if some American languages might be not Amerind ≈ Beringian, but something else, like para-Na-Dene, pre-Clovis-coastal, Solutrean…? Continuing the metaphor, this would mean we don’t even know how big the fish are exactly, and so we also might not know (yet?) what the best ways to catch them are.

But there’s also a sense in which I think long-distance relationships would be better seen as butterflies than big fish. We do not find relationships in an instant, as sudden flashy discoveries (by “bites” on a “lure”). All spoken languages are in principle comparable, with known typological differences but also universal family resemblance. [2] The universality of basic phonological categories in particular makes it possible to find some resemblances between any two languages that plausibly could be indicative of some etymological or indeed genealogical relationship. Whether they actually are depends on additional work on fine-tuning details. Are they above the level of pure chance, and independent of known onomatopoetic and nursery word trends? Are they in conflict with other data of equal value? Do they show recurring sound correspondences, at least some of them nontrivial? These are questions for which we cannot expect to have every answer in place immediately. Any relationship must always begin from observing some similarities that are not probative in themselves, and then pursuing this as a hypothesis and seeing if it guides us to more similarities, ones that will not require further costly assumptions to justify.

If all we knew about Finnish and Hungarian were that their verbs for ‘to live’ are, respectively, elää and él, this would not be sufficient evidence to establish them as related languages. But they are, indeed, cognates. Insufficiency or statistical insignificance does not in any way refute cognacy per se. And it is true that checking for more examples of the correspondences e ~ é and l ~ l turns up more evidence such as pelätä ~ fél ‘to fear’. Now with a new correspondence p ~ f, but this does not mean we turn up our nose and declare the hypothesis unworkable: it’s possible to continue and maybe discover, say, pesä ~ fészek ‘nest’. It always takes several steps like this to assemble e.g. a phonological core that will be self-evidently non-accidental. Same for other “evidential cores”, such as partial common morphological paradigms. There is no immediate bite that instantly proves a relationship, but rather, a first weak signal, which will rise in importance once combined with a proper selection of other datapoints.
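The bookkeeping in this example can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the segment alignments are entered by hand for the three word pairs cited above, the names `ALIGNED_PAIRS`, `tally_correspondences` and `recurring` are hypothetical, and a raw tally like this says nothing yet about chance resemblance, loanwords or conditioning environments.

```python
from collections import Counter

# Hand-aligned (Finnish, Hungarian) segments for the word pairs cited above.
# The alignments are an illustrative assumption, not an established analysis.
ALIGNED_PAIRS = {
    ("elää", "él"): [("e", "é"), ("l", "l")],
    ("pelätä", "fél"): [("p", "f"), ("e", "é"), ("l", "l")],
    ("pesä", "fészek"): [("p", "f"), ("e", "é"), ("s", "sz")],
}


def tally_correspondences(aligned_pairs):
    """Count how many word pairs support each sound correspondence."""
    counts = Counter()
    for segments in aligned_pairs.values():
        # A set ensures one word pair supports a correspondence at most once.
        counts.update(set(segments))
    return counts


def recurring(counts, minimum=2):
    """Keep only correspondences attested in at least `minimum` word pairs."""
    return {pair: n for pair, n in counts.items() if n >= minimum}


counts = tally_correspondences(ALIGNED_PAIRS)
for (fi, hu), n in sorted(recurring(counts).items()):
    print(f"Fi. {fi} ~ Hu. {hu}: {n} word pairs")
```

With only these three comparisons, e ~ é, l ~ l and p ~ f each recur, while s ~ sz is so far attested once and would need further examples before it counts for anything.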

Any “minimum convincing argument” will not be dozens of steps deep necessarily, but where patience is especially needed is that at any stage there will be plenty of false paths of expansion that will not lead to a workable theory. If at some early point, we had formed a hypothesis of a ~ a, and then run into vapaa ~ szabad ‘free’ (without realizing that both are loanwords from Slavic) — we could still find more evidence also for p ~ b (e.g. by misanalyzing the correspondence Fi. mp ~ Hu. b), but no additional good evidence would be turning up for v ~ sz. At some point we might end up concluding that, yes, this is going nowhere and should be discarded. But then only this comparison! Finnish and Hungarian are still ultimately related, even if their words for ‘free’ are not cognate. Discarding this one comparison does not (should not) mean discarding also any other adjacent comparisons. A burgeoning comparative edifice needs to be open for exploration and individual mistakes, if it is to ever reach any particular rank like “a probable relationship” or “a proven relationship”.

This plea of course has also a corresponding inverse. Anyone who wants a “butterfly” treatment of their ideas has to have enough intellectual humility to recognize that it is, indeed, a tentative first-pass version. All too often I see also people who have a new language relation hypothesis in hand double down on their speculation, and not be open to even constructive criticism. Perhaps in some part there is a misunderstanding where people do not recognize the proposal of better, non-cognate etymologies (borrowing, onomatopoeia, internal derivation) as progress. But certainly also lone-wolf-genius-ism, and its attached incapacity to admit mistakes, is a problem that exists.

On the other hand, I don’t think this side of the problem needs to be focused on too much. In historical linguistics, the exploration of linguistic relationships is already a known research programme, a goal that many people agree to pursue even if we tend to disagree on quite a lot of details. With this in mind, if a K. Kookenstein puts out a paper allegedly showing how English is related to Arabic, but then refuses to consider these comparisons in light of what Indo-European or Semitic linguistics has to say on this: we don’t actually need his approval on this! Language data is not locked, copyrighted, or in any other way tied down to one person, and if desired, it will be possible in any case to check such papers for insights relevant also to better situated IE–Semitic comparison. I know I at least keep a few “Hungarian is too a Turkic language” type works around for this purpose. The intended main thesis is not going to pan out; but any data cited to this end could regardless still prove valid. Usually anything of this sort mostly relies on word comparisons (appeals to typology are strangely rare), and these might remain valid as etymologies of any imaginable type… not just Turkic loans in Hungarian, but maybe also old Hu. loans in Tk.; Hu. cognates of Khanty or Samoyedic loans in Tk.; common loans from some third source like Iranian or Yeniseian or Mongolic; some could even end up being evidence for a general Turkic–Uralic relationship. None of this is a priori ruled out, and in this way it may well be possible, with patience, to find meaningful building blocks even within theories that don’t hold up in their entirety. Such is a nifty property of historical linguistics, something that definitely doesn’t generalize to every science.

The two animal metaphors from the start of this post, though, no longer work very well at this point. Some butterflies … may grow up to be big fish, even though most probably don’t? Moreover, I have been mostly illustrating this discussion with disputed-but-definitely-published ideas. More nascent ideas that are simply brought up in a discussion are a different beast for sure. Of course there’s a selection bias here too: the actual butterfly ideas I do have, you will probably not be seeing on this blog as such (and you might have to watch closely to catch any even on my side channels). [3] Arguably also scientific publishing is “a conversation”… especially any ideas that can be so far found only in some paper draft posted for comments online (in linguistics they’re not even concentrated yet on any arXiv analogue). For these, the original reading of a butterfly idea seems to still work fairly well. This may hopefully help (e.g.) various long-distance proposals to develop better in the end, before they end up with one of two common fates: shelved as not having passed the judgement of Reviewer #2, or self-published with excessive confidence. For this goal, yes, the ball very much is first in the court of people who do have an idea and want to develop it; but it is also in the hands of the rest of us, in being willing to offer first criticism that’s not a complete dismissal. Thirdly, worth noting, all this also depends on a social milieu where people even can find parties interested in discussing some out-there idea.

A further aspect of AUG’s original concept — avoiding unnecessary emotional stress upon people presenting a new idea — I haven’t really even touched here yet. This would be a whole other jar of larvae, but suffice to say I agree that academic discussion, for all its standards of civility, fairly often can have undertones all the way to hostility. This probably scares away many people without a thick skin who might otherwise have had a few interesting things to say; and those of us who do stay engaged, to whatever degree, it may leave with more stress than is necessary.

Some of it, I’m sure, does not even come from a particular need to be prickly, but from limited time… Sufficiently well-known figures in a field tend to get approached by a disproportionate amount of amateurs with A Revolutionary Discovery, unless they specifically keep themselves hard-to-contact, or, perhaps, maintain an aura of not suffering fools gladly. Again a problem that might be softened with other people being open and approachable enough. But this also starts edging towards the general area of science communication and public relations, a bigger fish still to fry that I’m not going to pretend to already have big original ideas for right now (and the butterflies, they will have to wait for other channels).

[1] The famous case of Nicaraguan Sign Language does not seem to have spoken analogues. In principle there is little directly preventing such a case (and something of the sort, maybe in several gradual episodes, will have to be assumed as the ultimate origin of human language too), but the conditions are unlikely to ever come about. A community of children who are capable of speech but do not have access to any pre-existing spoken language? Sorry, language in general is too adaptive to have been ever abandoned after its first introduction. I will go as far as to suggest that all known human cultures depend strongly enough on language for the transmission of cultural knowledge that any sudden failure of language skills across an entire human group (say, a transmissible disease that induces deafness, fast enough that a signed language does not have time to develop) would not lead to an all-new language being developed a few generations later; it would lead to the group’s extinction.
[2] In the philosophical sense, not the genealogical one. E.g. despite some exceptions, most languages still have nasal or labial or velar consonants; all but the most impoverished and unbalanced phonological inventories or even just consonant inventories are going to have substantial overlap between them. And even if we did find languages that somehow have completely disjoint phoneme inventories (lazy example: one has only stop consonants and front vowels, the other only continuant consonants and back vowels?), they will not be unbridgeably far apart: the known typology of sound change allows hypotheses relating basically any two speech sounds. Grammatical categories, too, can be quite different but still only finitely far apart, where the details of known language histories likewise give us ways to relate non-identical categories to each other (or to derive them de novo language-internally, etc.).
[3] A freebie for the sake of example though: cf. some very loose thoughts about the subclassification of Oceanic as floated on Tumblr just a few days ago (also already with some, though not highly severe, critique from a regular correspondent over there).

14 comments on “Long-Distance Comparisons As Butterflies”
  1. Y says:

    “A strong prior that all natural languages are, indeed, related, even if we currently lack the knowledge of the details”: I agree, but the rest doesn’t follow. The other prior that we have is that for decades now, no new higher-level groupings have been proposed and widely accepted, despite having many people look at the data (maybe there’s one or two I’m not thinking of. I don’t know what the consensus is on Dené-Yeniseian.) Moreover, older hypotheses, examined with more data and with better analysis, have been falling left and right (many in Africa), while others (like Uralo-Yukaghir) look at least a lot harder to prove than they’d seemed at first. Of course there’s a lot yet to be done, but it’s not outlandish to guess that all demonstrable higher-level groupings have already been discovered.

    • sansdomino says:

      On the tier of proposed all-new groupings, plenty of new ideas do keep coming along; say e.g. Basque–IE, Hurrian–IE, Kartvelian–Burushaski, Nivkh–Chukotko-Kamchatkan, Nivkh–Wakashan, Austronesian–Ongan, Great Andamanese–Austroasiatic. Expecting them to be also accepted or correct right away, though, is kind of a “butterfly-crushing” approach. Even expecting to hear of them immediately is the wrong assumption: this is stuff that goes around among specialists first, long before you might hear of it in general circles. I do not think these are even mostly generally correct, but do think that probably most of these are onto something (can be eventually improved in some form). To Africa, Papua or North America I have not paid too much attention, but have seen at least a few interesting suggestions in these milieus too: e.g. there seem to be first hints out in a direction “most of Omotic is not Afrasian, but Dizoid is”.

      Looking at groupings that have in recent times risen to become accepted widely or increasingly, that then mostly includes examples like Austro–Tai or Je–Tupi(–Carib?) that have existed as proposals for ages, but only recently have been well substantiated with data. Also on the “big fish caught” front, there is at least the discovery that Khitan is para-Mongolic.

      No grounds for defeatism that I see. “Everything discoverable has been discovered” has so far never been true in any branch of science.

    • no new higher-level groupings have been proposed and widely accepted

      A counterexample that immediately comes to mind is Usher & Suter’s Anim family, proposed in 2015, and already accepted by such a conservative resource as Glottolog.

    • despite having many people look at the data

      This is where I can’t agree with you. “Many people” with knowledge of historical linguistics have perhaps looked at the data for Indo-Uralic or Altaic, but hardly for any other proposal. Many, or even most accepted language families have two or three historical linguists who can assess the evidence, but very few of them have specialist knowledge of several families.
      Moreover, your phrasing implies that there are raw “data” out there, and all it takes to verify a hypothesis is to take a look at them. This is an essentially Greenbergian approach: Greenberg thought that one can classify languages by eyeballing raw data. For those who believe in the power of the Comparative Method, it is evident that a reconstructed proto-language may look rather unlike its attested descendants. Indeed, this is what we see in cases where reconstruction is sufficiently sophisticated: Proto-Indo-European with laryngeals, Proto-Uto-Aztecan with syllable-final consonants, etc.
      The assessment of distant relationship hypotheses depends on what exactly we compare, i.e., on the reconstructions of protolanguages of uncontroversial families. For example, Austro-Tai is now increasingly widely, if not universally, accepted. What led to this acceptance was the changes in the reconstructions on the Kra-Dai side. When the hypothesis was first proposed by Paul Benedict, it rested on some striking resemblances, but it was not clear at the time that these resemblances are more likely to reflect genetic relationship rather than contact. Graham Thurgood argued that the sound correspondences in the Kra-Dai words compared to Austronesian (that is sound correspondences within Kra-Dai, not between Kra-Dai and Austronesian) are irregular, and this irregularity points to borrowing. Later, Weera Ostapirat has shown that the sound correspondences in these words are actually regular, and, moreover, that on inner-Kra-Dai (or even inner-Hlai) grounds one can reconstruct penultimate vowels, lost in daughter languages (thus, another case like IE laryngeals), which regularly correspond to Austronesian penultimate vowels. Here is a case where a proper assessment of a hypothesis completely depends on the progress in the reconstruction of an uncontroversial family (Kra-Dai).
      Now, the main question is: how many uncontroversial families already have a mature reconstruction which will not change significantly with new data or new analyses? I suppose that this can be said only about a handful of lucky exceptions such as Indo-European, Uralic, Austronesian, Algonquian and Bantu. One can be almost certain that, e.g., properly reconstructed Proto-Sino-Tibetan or Proto-Otomanguean will be surprisingly different from existing attempts at reconstruction.
      Finally, I think that the all too frequent assertions that everything demonstrable has been already discovered work as a self-fulfilling prophecy: they discourage historical linguists from working on anything but universally accepted families, thus lessening the likelihood of future discoveries.

      • sansdomino says:

        Agreed about this as well. If we measure progress by how far Indo-Uralic or Altaic specifically have gone, then fairly little has happened over the last half a century (towards a relationship anyway). Most openings are going to be elsewhere, already purely by numbers, and have not been looked at by several people. If by anyone! Most of the world’s languages are still not by any measure well-documented, sometimes even across entire families.

        • Howl says:

          The Austro-Tai example also shows that if you have loanword etymologies which were argued on good grounds, that still does not exclude inheritance. Irregular correspondences could mean sound correspondences which were not understood well enough at the time.

          The reconstruction of Proto-Uralic has become much better. Once Aikio’s UED is finished, it will all be accessible and in one place. I think any Uralic-Whatever proposal could benefit from that.

          Altaic has much room for improvement IMO. It is clear to me that better reconstructions are needed there.

  2. David Marjanović says:

    Ah, the basic dilemma of science. In order to be able to ever figure out anything really new, you need to be able to… seriously entertain the possibility that every single one of your predecessors and contemporaries, all the finest minds and greatest scholars, were wrong and you are the first ever to get this right. But of course if you overdo that, you become a crackpot. :-|

    I know I at least keep a few “Hungarian is too a Turkic language” type works around for this purpose.

    Likewise, Guillaume Jacques keeps a “what, Rgyalrong is totally Tibetan” work around as a nice handy compendium of Tibetan loans.

    Sufficiently well-known figures in a field tend to get approached by a disproportionate amount of amateurs with A Revolutionary Discovery, unless they specifically keep themselves hard-to-contact, or, perhaps, maintain an aura of not suffering fools gladly.

    Larry Trask was inundated with Basque–Anything proposals. As far as I can tell, he ended up convinced that any attempt to find any relatives for Basque was ipso facto crackpottery rather than merely wrong.

    • sansdomino says:

      I wonder if anyone would like to see a typology of Basque–Whatever comparisons and maybe some way to measure how crackpotty they are exactly.

      My own opinion so far on that topic is that, even if not by definition unworkable, attempting to tackle the question head-on is probably not a good use of time… actually getting generally agreeable results is probably going to require not creating a Proto-Basque-Whatever reconstruction, but reconstructing some other Proto-XYZ and then demonstrating that Basque is, btw, also a descendant, the way e.g. Albanian was classified. (Same also for, say, Sumerian.) Even then this is not automatically guaranteed to work, when already “Basque is Sino-Caucasian” and all the variants of “Basque is Indo-European” kind of do follow that and give very mutually incompatible results.

      • David Marjanović says:

        The “Basque is IE or its sister-group” proposals I’ve seen are all pretty bad, but I do agree they aren’t crackpottery.

  3. David Marjanović says:

    K. Kookenstein

    Not to be confused with Ernst von Koken, who may or may not also have been named Karl.

  4. sparkles says:

    no new higher-level groupings have been proposed and widely accepted

    That depends what is meant by “new”. Does 1976 count? Effectively 2005? An Austro-Tai connection was first made by Schlegel in 1901 and 1902. The initial idea was popularised by Paul Benedict in 1942, who expanded Wilhelm Schmidt’s 1906 “Austric” proposal (linking Austronesian to Austroasiatic), accepting a close relationship between Austronesian and Kra-Dai within Austric. While largely ignored by specialists (Sebeok 1942, Thomas 1964), an Austric family was accepted by the Moscow school and connected circles. Following the publication of Pou and Jenner 1974, Benedict abandoned Austric in 1976 (with further arguments in 1991), but he continued to maintain Austro-Tai (adding Japonic in 1990, a proposal seen more than once but without much success; see Vovin). Too little work had been done reconstructing individual nodes, and as a result these early rushed attempts were thoroughly criticised in works like Gedney 1976, Diffloth 1976, Short 1976, Egerod 1976, Diffloth 1977, Reid 1984, Thurgood 1985, Hartmann 1986, Reid 1988, Diffloth 1990, Matisoff 1990, Solnit 1992, Thurgood 1994, Ross 1994, Reid 1994, Diffloth 1994, Schumacher 1998, Diller 1998, Diller 2000, Ostapirat 2004, and a number of others, which varied from complete rejection of relationship to acceptance of a very remote relationship. And while Austric continued to see some support from more long-range-prone linguists (e.g. Bengtson 2006), the more formidable Austro-Tai remained generally unaccepted.

    This was until 2005, when the Proto-Kra-Dai reconstruction specialist Ostapirat published a paper showing regular sound correspondences between words in the basic vocabulary of both families, arguing for a genetic relationship. The same year, Sagart (mostly a Sinologist and previously an adherent of Sino-Tibetan-Austronesian alongside Reid and Xing i.e. 1999) initiated a linguistic controversy by classifying Kra-Dai as a subfamily within Austronesian that underwent typological change upon contact with an unknown family. The first prominent review of their work might have been Reid 2006, who accepted some sort of Austro-Tai relationship but remained undecided as to what sort of relationship it might be. A number of Austronesian specialists began to open up to the idea, but it would be several years before any joined one of the two parties.

    As the reconstruction of both families progressed, more and more linguists started to give their opinions. Robert Blust and Lawrence Reid, who had both rejected Benedict’s proposal, both ended up accepting the Austro-Tai hypothesis in the form proposed by Ostapirat, and their student Alexander D. Smith has taken up the task of its reconstruction. It is safe to say that most Proto-Austronesian reconstruction specialists now accept a genetic Austro-Tai relationship. From the Proto-Kra-Dai side, James Chamberlain accepts the Ostapirat hypothesis, and Peter Norquest accepted a relationship in 2013 but has not been very public on what sort of relationship he thinks it might be.

    As noted in his 2017 monograph, Yongxian Luo summarised the situation in China, where Deng and Wang have come out to support a genetic Austro-Tai, while Luo himself prefers the nationally popular hypothesis of a genetic relationship between Kra-Dai and Sino-Tibetan and a contact scenario for the Austro-Tai similarities. Sino-Tai is very strong in the Chinese camp, and will continue to be for some time. Support is understandably strong thanks to its age, with works like De Lacouperie 1883, Wulff 1934, Li 1938, Haudricourt 1974, Shafer 1974, Manomaivibool 1975, Li 1976, Denlinger 1989, Zhengzhang 1995, Gong 2002, Ting 2002, Pan 2002, Nie 2002, Lan 2003, Mei 2003, Zheng 2004, and so on, including many works by Luo. But even in China, some works have criticised a “genetic” connection since Luo Meizhen 1992 and 1994, and Dai and Fu 1995.

    A few elderly specialists in the relevant fields (i.e. Diller, Edmondson, Solnit, Ross, Pawley) have yet to discuss Ostapirat’s opinion in any conventional publication I think, and there are younger linguists who have yet to publish anything on the topic. But outside China a consensus seems to have been reached that there is a genetic connection between Austronesian and Kra-Dai. As recently as 2015, Jeremy Collins (an Austroasiaticist) expressed doubt, but he also cautiously held that Robbeets’ Transeurasian hypothesis looked promising, so I don’t know how much experience he had.

