Language Family Tectonics

Basic research in historical linguistics is mostly done within individual families: we take a swath of attested (in most cases modern) languages, and work towards the past to figure out their development from a common origin, one group at a time. Any knowledge of languages outside the family only really factors in as correction terms: filtering out loanwords and other contact influence, as data that the family’s overall internal history will not need to account for.

What the big picture of this looks like once we consider also geography is that we end up with a series of dots — “homelands” (though not to be understood as points of creation, but simply the last uncoverable phase of earlier processes) — somewhere in the past; some of which have then expanded, to cover the whole world by today. Just a few millennia ago, much of the world would have been an uncharted area, full of regions from which no knowledge of their languages has survived to us. The ones that do survive would, even, have been largely isolated dots. Most language contacts must eventually end (or rather, begin) at some point in the past. Languages of different families, that are today next to each other, cannot all have had their parents too as neighbors. Perhaps some individual cases were: Proto-Germanic seems to have been about as much of a neighbor of Proto-Finnic as Swedish and Finnish are still today; even further back, something like Proto-Kartvelian as a neighbor of Proto-Northwest Caucasian could be possible too. But once we consider highly expansive families, it is self-evidently absurd to propose that Proto-Indo-European could have been simultaneously a neighbor to all of (pre-)Proto-Kartvelian in the Caucasus, (pre-)Proto-Uralic in the taiga zone, (pre-)Proto-Dravidian in South Asia, pre-Basque in Iberia…

This already implies that most borders of today’s language families are collision zones: where two lineages have come to meet that were not in contact at some point in the past. (Same also for some, though fewer, language borders within them.) I’d like to think that we can probably divide them further into subtypes. This will have to include their history, not just their current but also past dynamics. One reasonable analogy might be plate tectonics. Geologists are not content to simply locate the current boundaries of the world’s tectonic plates, but ever since the rise of continental drift to a mainstream theory, already introductory maps will also aim to identify boundaries as either constructive, destructive or conservative. Often longer-term history or future, too, could be extrapolated from arrows of movement (of, yes, actual movement right now: as per the classic example and the mid-ocean ridge closest to me, the Atlantic Ocean is growing some three micrometers wider every hour, already a perfectly visible amount of maybe 0.3 millimeters since I began to write this blog post).

Of course this is not to be aped too closely. The social “forces” that drive linguistic expansions can be rather fickle, nowhere near as stable and predictable as the physical forces of geology in e.g. continental drift. No responsible linguist is going to be putting a predicted specific time of death on any but, perhaps, an already moribund language (those where all transmission to new generations has already ceased, and the only question is whether the last few speakers have 5 or 50 years left to live); and predictions on what languages will be gaining new ground entirely I have not really seen anywhere at all. If anyone wants to register particular predictions, be my guest, but currently these are really only going to be educated guesses, not derived from a theory with known predictive power.

So maybe let’s not draw any future-pointing arrows on linguistic fault zones just yet. Drawing past-originating ones, though, seems like a much more doable task, first of all in cases where (some) history is already known. And this I think also gives us anyway some analogues of geologists’ “constructive, destructive, conservative”. A look at known history actually suggests that just two types might be enough to get started. Of course we can have conservative boundaries, where languages have stayed each on their own side for a while. This often coincides with also geographic boundaries of some sort (e.g. the northern boundary of Indic has been, broadly, at the Himalaya for millennia, and it’s no wonder that the Korean / Japonic boundary has stabilized between the Korean peninsula and the Japanese archipelago). Then we have collision zones, where two lineages come head to head —

But wait. Head to head? No, actually, the most typical case we see anywhere in the world’s known history is not quite this. Where we find e.g. a Germanic / Celtic boundary in the British Isles, a Finnic / Samic boundary in northern Finland, a Turkic / Iranic boundary north of Iran, a Bantu / Khoe boundary in Botswana: these do not represent cases of two spread events that finally arrived at some common ground simultaneously, running out of no speaker’s land to claim. Almost always such a border represents one newer (Germanic, Finnic, Turkic, Bantu) and one older family (Celtic, Samic, Iranic, Khoe), with the latter’s historical range extending far into the former’s current-day one. The geological analogy happens to continue working here too to some extent: when two plates collide, for all the mountains that result, these still are not zones where both plates indefinitely squish and crumple without crossing. Instead one plate will be pushed underneath another, into the mantle (and mainly the topmost one will jut up as mountains). Now the distribution of language families does not really have a Z-axis, but the time axis does similar duty here. We already routinely speak of e.g. English expanding (having expanded) “over” Brittonic; and call the latter a “substrate”, the former a “superstrate”, again employing terms from geology that strictly speaking refer to vertical location. I’m sure a part of the motivation is also one of geology’s core findings: that, by default, vertical order reflects historical order!

To fully derive an understanding of this situation, the naive zeroth-order model of language family expansion (they start in some compact area in the past and begin expanding) moreover needs to be amended by the fact that expansions are not infinitely powerful: they can run out of steam even without encountering another expansion in their path. Not only does Finnish supersede various lost Sami varieties; it is also not the case that Samic started somewhere in the north and expanded south until running into Finnic. Rather, Samic itself also originally expanded mainly northwards, probably much along the same geographic routes. There was no southward expansion front of Samic for Finnic to collide with; nor an eastward expansion of Celtic by the time of the Germanic expansions, etc. In this way linguistic expansions might have a better geological analogy still in lava flows in a volcanic field: they will layer on top of one another, not by virtue of which one expands faster or more strongly, but by simple virtue of which one has already stopped, at least in a particular area, and which one is still going.

In those cases where two expansions do happen to be going on simultaneously, this is maybe indeed more likely to end up with something resembling a conservative boundary. Even among these, though, many will prove not quite entirely stable if we look closely enough. They can turn out to be series of small advances on either side, just not spilling out to outright conquest of the other family (and likewise, mostly not inherently one-dimensional lines anyway, but a crossfade in the proportion of speakers of X versus Y). Again more like lava flows than continents.

Still, I will keep the term “tectonics” here anyway. Etymologically speaking, it is not a term that by itself implies the details of plate tectonics, but simply refers to the largest-scale analyzable units.

What can we do with this then? If we recognize that the world’s major language family boundaries are mostly collision zones (where one family is or has been in the process of expanding at the cost of another, not currently expanding one), this gives us first of all convenient rules of thumb about linguistic substrates. Anywhere near a language family boundary, the substrate of an expanding family X is probably primarily the non-expanding language family Y next to it. At least in the wide definition of “substrate”, that is, “the language spoken there before the expansion of the current family”. Whether it has left any discernible substrate influence, structural or lexical or toponymic, would be another discussion entirely. Conversely, locations where we might be able to fruitfully hypothesize completely extinct substrates will instead be:

  1. more towards the geographic or expansion centers of recently expansive families (thus e.g. the Paleoeuropean substrates of Germanic);
  2. underlying not-most-recently expansive families that have few or no leading edges over anything anymore (thus e.g. the Paleolaplandic substrate in Samic).

Or further yet. The facts that language families expand from small origins, readily take over other languages in the process, and are also generally just some thousands of years old, lead us also to a more powerful rule of thumb: There Was Some Other Language There Before. Almost no language is the absolute first language to have been spoken in “its” territory. The main exceptions would be a few cases of recent seafarers, above all in Polynesia; several more scattered cases also in the Atlantic, of which I think only Icelandic and Cape Verde Creole have been established as their own languages. [1] At any other ends of the Earth, Inuit is a known newcomer in the American high arctic, Pama-Nyungan is a known newcomer in the Australian interior desert (even if the languages preceding them are not attested)… and in places with long written history, we may find quite extensive known successions, to the effect of Hattic replaced by Hittite replaced by Luwian replaced by Aramaic replaced by Greek replaced by Arabic replaced by Turkish. Maybe some Assyrian or Kurdish phase in there somewhere too, depending on what point we’re considering here exactly. More importantly, over the remaining at least 60,000 years of modern human presence in West Asia without written records, obviously much much more of this still. Not all of this leaves major genetic or archeological fingerprints, either, and some specific cases might be very hard to identify if we didn’t have linguistics itself as a source of evidence.

For two, it will be generally beneficial to work out which of any two language families in contact at a particular border has been the more recently expansive one. [2] Or to make this known more widely, at least. I’m not sure if there actually are many cases where this would be a mystery entirely. I could think of some hard-to-tell cases once we’re talking about subfamily borders (Mari / Udmurt? Celtic / pre-Latin Italic?), but even here probably some dedicated experts would have an opinion. Maps of individual language families, especially in historical contexts, often enough also have some spread lines or historical distributions marked. But large-scale summary maps still trend towards presentations like this, seemingly entirely static, even though the process of restricting language families to complementary areas necessarily elides some current-day detail in favor of historical idealization (denoting where a language family “is native” or “is traditionally spoken”). I’ve seen sociolinguists criticize this whole genre of language distribution maps repeatedly already, for not really capturing synchronic reality. The response though might not need to be to abandon them entirely, as much as admit that, yes, they are maps that display some historical information too, and adjust accordingly for more history-informed design. If the knowledge on this is mostly out there, why not?

For three, a concept of family tectonics readily draws attention to the point that there’s work to be done not just on charting language families’ “current” or “traditional” distribution, but also their past distribution. “Beneath” (before) any current language family there “is” (was) some different distribution of other languages. Some of them maybe belonging to its still extant neighboring families, some maybe its own lost relatives, some maybe unknown entirely.

The first possibility I find the most interesting for the sake of further work. The closest example to my work comes from central and eastern Siberia. An important but I think largely open question would be what was spoken in the area before the expansion of the relative newcomers? Russian is of course the newest layer all over the place, but Siberian Turkic (Yakut, Tuvan, etc.) and Northern Tungusic (Evenki, Even, etc.) are both parts of relatively recent families too. What have they ended up displacing? Early Russian explorers report, and rudimentarily attest to, first of all a formerly wider distribution of the Yukaghir family, today known only in two small islets; and a variety of Samoyedic and Yeniseic varieties in the southwest of this area. Still, the main Turkic and Tungusic expansions must have been early enough to predate all historical records in the region, so this cannot be the whole picture either. One hypothesis I keep coming back to is the possibility of a lost “tenth” Uralic branch — perhaps para-Samoyedic, perhaps an independent branch entirely. This might have some benefits to it in explaining a variety of known but not especially substantial similarities between Uralic and all the other families further east. Turkic of course has been in direct contact with (branches of) Uralic anyway, but various parallels continue sporadically into Yukaghir, Tungusic, Chukotkan, Nivkh, Eskaleut. All of them seem more likely to originate from the Uralic side, due to it being the Siberian family with the most known time-depth. Yeniseian is sometimes approximated as rather old as well, but otherwise both “Neosiberian” and “Paleosiberian” are all families without too much time-depth. [3]

Most notably, Uralic parallels in eastern Siberia include even basic words for ‘reindeer’, an all-important livelihood animal for many groups these days, especially Chukotkan *qora (whence the ethnonym Koryak), Tungusic ⁽*⁾oron (or probably *xoron, with further diffusion after *x > ∅ in NTg) (whence the ethnonym Oroqen). Kolyma Yukaghir qoroj ‘two-year-old male reindeer’ is usually adduced here too, as well as loanwords further into Siberian Yupik. This has been already identified in earlier research as a Wanderwort originating in Proto-Uralic *kojəra ‘male [domestic?] animal’ > Proto-Samoyedic *korå ‘id.; bull reindeer’, which might have already had an allophonic [q-] in Proto-Samoyedic or even earlier. But we seem to lack especially clear evidence on who is to be credited for the original diffusion of this word. Yakut, as far as I know, has no reflex of it, splitting the Eastern Siberian region off from Samoyedic, and thus probably suggesting a pre-Turkic movement eastward. If so, then maybe even already at the time of the original Uralic expansion (which I think must have been partly eastwards too in any case)? Who knows. Maybe someone will eventually though, if we get e.g. some additional toponym data for guidance and keep inter-family comparative research going.

Elsewhere in the world, I’m wondering also about e.g. how far Africa’s other language families might have reached before the Niger-Congo and particularly Bantu expansion. The case of possible contact between Khoe and Cushitic is already preliminarily discussed in a 2009 paper from Blench, though I’ve been unable to verify his interesting claim that Khoe #goe for ‘cow’ would be comparable with similar “widespread terms” in Cushitic. [4] The quite tattered Central Sudanic looks like another good candidate for a family that might have been more widespread earlier (but might have been also encroached upon by Chadic and the various branches of Eastern Sudanic). In the Americas, too, I could wonder especially what preceded the large continuous spreads of Athabaskan and Algonquian in most of Canada and the northern US? (And also which of them is the newer one?) Was there ever anything to the effect of “Inland Tsimshianic” or “Inland Tlingit”, “Plains Iroquoian” or “Forest Caddoan”? Or turning to Oceania: how far west and east did the various “Papuan” language families (many of them even today not confined to just New Guinea) extend before the Austronesian / Malayo-Polynesian expansion? For that matter has anyone even tried comparing any of these with the other continental SEA languages in any capacity, or just assumed that they must have been in splendid isolation amongst each other linguistically effectively forever?

These are questions that, again, some experts might already know answers to or at least have hypotheses for. But nowhere is this information available in centralized geographic form, even though it would be surely possible to represent so, giving a kind of a bird’s eye view of what are the major ethnohistorical results achieved or confirmed by historical linguistics, and what questions still remain open.

[1] The Faroe Islands seem to be better established than Iceland as having had a pre-Norse population (at least as of the Nature study just last December). A longer list of cases without a distinct local ethnicity includes e.g. the Azores, Bermuda, Falkland Islands, Svalbard, Tristan da Cunha (and also remote islands in the other oceans, e.g. Kerguelen). There are some more within-reach cases like the Andamans, Maldives or Nicobars, for which I’m not sure what’s known of their prehistory (though then already the existence of two Andamanese language families suggests that one of them is very likely older than the other).
[2] Not always the same family on top in all interactions: Turkic has been expansive over Iranic, while Russian has been expansive over Turkic … and yet Russian and Iranic are both Indo-European. It should be no surprise at all either when we find e.g. language shift from Swedish into Finnish in Finland, vs. from Finnish into Swedish in Sweden.
[3] Really if “Neosiberian” is taken to mean “the recent but pre-Russian arrivals”, and “Paleosiberian” as everything else in the area — then we ought to be counting Uralic as the largest representative of the latter, not as some European family that somehow just happens to be also present. By now we do know the westernmost expansions of Finnic, Samic and especially Hungarian to be relatively recent, while Uralic or pre-Uralic presence in western Siberia has no established terminus post quem (short of the hard geological limit of the last ice age). — I suppose the usual exclusion of Uralic from “Paleosiberian” has been instead more informed by its typological similarity with Turkic and Tungusic. But then this seems improper when the term is Paleosiberian, not “Non-vowel-harmonic-siberian” or anything else of that sort.
[4] Checking with a recent monograph from Bender instead shows some terms that do not look comparable at all in most of Cushitic, such as Oromo /saʔa/, Konso /lawaa/, Agaw (North Cushitic) *lɨw-, South Cushitic *ɬee; or does Blench have some supposition about a Northeast Caucasian-esque *ɬ > *g?! Further north, *gʷow- ‘cow’ in Indo-European does look amusingly similar to Khoe, but Afrasian is a bit too wide and old of a family (definitely older than the domestication of cattle, which “only” dates to ~10,000 years BP) for me to think that there could be a connection entirely without it. Even something like the mysterious Y-DNA haplogroup R-V88, common in central Africa around Lake Chad yet seemingly derived from Eurasia, doesn’t really allow any connection that would reach all the way to southern Africa.

10 comments on “Language Family Tectonics”
  1. David Marjanović says:

    For that matter has anyone even tried comparing any of these with the other continental SEA languages in any capacity, or just assumed that they must have been in splendid isolation amongst each other linguistically effectively forever?

    Blench has this idea that generic Southeast Asia spoke Austroasiatic before Austronesian (and Indic) spread over it; but I can’t remember if he has compared any “Papuan” languages to it, and I’m not going to dive into that timesink again soon. :-)

    • sansdomino says:

      Yeah, if anyone, it is Blench that I would guess to have pondered a topic like this already — he can be definitely trusted to shake assumptions about how known linguistic pieces fit together in prehistory (even if anything new he proposes in connection will likely need to be double-checked in its details). I may even have seen something like that that was mostly about continental SEA. An AA substrate behind mainland Chamic, at least, would be extremely sensible.

      I did in fact take a brief look over Zompist’s numbers list, but did not catch any Papuan languages that could plausibly adjoin the distinctive AA or also Hmong-Mien pattern where both ‘3’ and ‘4’ begin with *p… no easy outs here I’m sure.

    • In this Greenberg-style paper (indeed, one of its authors is M. Ruhlen) Kusunda is compared to an assortment of “Indo-Pacific” languages including Great Andamanese and various Papuan languages. The evidence is mainly pronominal and not very strong, but I think that the most interesting comparison here may be that between Great Andamanese and West Papuan. Akabea (Great Andamanese) has the following possessive prefixes on nouns, also used to form personal pronouns: 1sg d-, 1pl m-, 2 sg/pl ŋ-, 3 sg/pl Ø-. Now, “West Papuan” languages, including North Halmaheran languages, Moi and Tehit, have person prefixes 1sg t-, 1pl (excl.) m- and 2 sg/pl n-. Since this is a three-member paradigm, it is less likely to be a complete coincidence than two-member “Mitian” M-T or “Amerind” N-M. Incidentally, the paper on Kusunda does not mention the 1 pl m-, since it is not found in Kusunda. From a geographical point of view, West Papuan languages are more likely to have some discernible relatives in Eurasia than other Papuan families.

  2. Y says:

    Alexandre François published a paper just a few weeks ago, titled “Lexical tectonics: Mapping structural change in patterns of lexification”. Is the Collective Unconscious at work here?

    (François uses the metaphor in a very different sense. He uses it to describe shifting boundaries within semantic maps.)

    Aside from that, geology has closer metaphors to unrecorded languages known only through borrowings. Bits of the lower crust and the mantle of the earth are brought up as xenoliths in igneous rocks, and traces of older continents survive as mineral grains in younger sedimentary rocks (most famously in the Jack Hills of Western Australia).

  3. recent seafarers, above all in Polynesia

    Or rather generally in Remote Oceania, including also Micronesia, Vanuatu and New Caledonia.

    • sansdomino says:

      Yes, fair, I’m not well-versed in the finer divisions of Oceania and mainly know that it’s not all of it that was first settled by Oceanic speakers (i.e. “Papuan” languages extending also further out to at least Bougainville Island).

      • The terms Near vs. Remote Oceania refer precisely to parts that were settled tens of thousands of years ago vs. islands first settled by Oceanic speakers.
        The boundary lies between Solomon Islands (Near Oceania) and Santa Cruz Islands (Remote Oceania). “Papuan” languages are found not only on Bougainville, but also on Central Solomon Islands, but Äiwoo on Santa Cruz, formerly classified as Papuan, turned out to be Oceanic.

  4. Andreas Johansson says:

    It’s my understanding that most of Siberia, incl much of western Siberia, was ice-free during the last ice age. So I don’t think you can really exclude that pre-proto-Uralic was spoken there back then.

    • sansdomino says:

      The problem in western Siberia is not just the ice, but also the periglacial lake(s). Most of today’s Khanty-Mansi Autonomous Okrug, where it was not covered by ice, would’ve been covered by what seems to be known as Lake Mansi.

      Higher-elevation central Siberia would’ve been open tundra or mammoth steppe, yes.

