Long-Distance Comparisons As Butterflies

One of the rationality-cluster blogs here on WordPress, Aceso Under Glass, a while ago posted about a concept I find immediately useful: “Butterfly Ideas”. Roughly speaking, these are hypotheses that need further development and are probably not ripe for serious criticism as they stand, but that could benefit from preliminary discussion (read the full post for more).

On this blog and elsewhere, I have repeatedly entertained a variety of “long-distance” linguistic relationships: Nostratic, Uralo-Yukaghir, Uralo-Eskimo, the works, despite not being so far highly committed to any of them. One idiom I’ve previously used to defend this is “big fish are worth angling even if you don’t catch any”: there are major potential gains for our understanding of history (both intra-linguistic and extra-linguistic) if any of these theories start to prove themselves in more detail. Or as the more succinct modern spin goes, “big if true”. A second motivation is provided by what I have called the “cell theory of language”: spoken natural languages only come from other natural languages, never out of nothing. [1] This gives a strong prior that all natural languages are, indeed, related, even if we currently lack the knowledge of the details. Factoring in anthropology further gives strong reasons to believe also in the existence of a number of “bottleneck proto-languages”, such as Proto-Australian, Proto-Amerind or Proto-Exo-African. So big fish are very likely indeed out there, even if we are not sure if our lures are working. Though then these are weaker boundary conditions that do not establish which currently-known families exactly would be the daughters of such a proto-language. E.g. who knows if some American languages might be not Amerind ≈ Beringian, but something else, like para-Na-Dene, pre-Clovis-coastal, Solutrean…? Continuing the metaphor, this would mean we don’t even know how big the fish are exactly, and so we also might not know (yet?) what the best ways to catch them are.

But there’s also a sense in which I think long-distance relationships would be better seen as butterflies than big fish. We do not find relationships in an instant, as sudden flashy discoveries (by “bites” on a “lure”). All spoken languages are in principle comparable, with known typological differences but also universal family resemblance. [2] The universality of basic phonological categories in particular makes it possible to find some resemblances between any two languages that plausibly could be indicative of some etymological or indeed genealogical relationship. Whether they actually are depends on additional work on fine-tuning details. Are they above the level of pure chance, and independent of known onomatopoetic and nursery word trends? Are they in conflict with other data of equal value? Do they show recurring sound correspondences, at least some of them nontrivial? These are questions for which we cannot expect to have every answer in place immediately. Any relationship must always begin from observing some similarities that are not probative in themselves, and then pursuing this as a hypothesis and seeing if it guides us to more similarities, ones that will not require further costly assumptions to justify.

If all we knew about Finnish and Hungarian were that their verbs for ‘to live’ are, respectively, elää and él, this would not be sufficient evidence to establish them as related languages. But they are, indeed, cognates. Insufficiency or statistical insignificance does not in any way refute cognacy per se. And it is true that checking for more examples of the correspondences e ~ é and l ~ l turns up more evidence such as pelätä ~ fél ‘to fear’. Now with a new correspondence p ~ f, but this does not mean we turn up our nose and declare the hypothesis unworkable: it’s possible to continue and maybe discover, say, pesä ~ fészek ‘nest’. It always takes several steps like this to assemble e.g. a phonological core that will be self-evidently non-accidental. Same for other “evidential cores”, such as partial common morphological paradigms. There is no immediate bite that instantly proves a relationship, but rather, a first weak signal, which will rise in importance once combined with a proper selection of other datapoints.
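
The bookkeeping behind “checking for more examples of the correspondences” can be sketched in a few lines of Python. This is only a toy illustration: the segment alignments below are my own hand-made assumptions covering just the beginning of each word, the names `COMPARISONS`, `count_correspondences` and `recurring` are hypothetical, and a raw count of this sort says nothing yet about statistical significance.

```python
from collections import Counter

# Hand-aligned (Finnish, Hungarian) segment pairs for the start of each word.
# These alignments are illustrative assumptions, not a full etymological analysis.
COMPARISONS = {
    "elää ~ él 'to live'": [("e", "é"), ("l", "l")],
    "pelätä ~ fél 'to fear'": [("p", "f"), ("e", "é"), ("l", "l")],
    "pesä ~ fészek 'nest'": [("p", "f"), ("e", "é"), ("s", "sz")],
}


def count_correspondences(comparisons):
    """Count in how many word pairs each sound correspondence occurs."""
    counts = Counter()
    for pairs in comparisons.values():
        # set() ensures a correspondence is counted once per word pair
        counts.update(set(pairs))
    return counts


def recurring(counts, minimum=2):
    """Return the correspondences attested in at least `minimum` word pairs."""
    return {pair for pair, n in counts.items() if n >= minimum}


counts = count_correspondences(COMPARISONS)
for (fi, hu), n in counts.most_common():
    print(f"{fi} ~ {hu}: attested in {n} comparison(s)")
print("recurring:", sorted(recurring(counts)))
```

With only these three comparisons, e ~ é, l ~ l and p ~ f each recur, while s ~ sz is so far supported by a single word pair and still waits for further examples.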

Any “minimum convincing argument” will not be dozens of steps deep necessarily, but where patience is especially needed is that at any stage there will be plenty of false paths of expansion that will not lead to a workable theory. If at some early point, we had formed a hypothesis of a ~ a, and then run into vapaa ~ szabad ‘free’ (without realizing that both are loanwords from Slavic) — we could still find more evidence also for p ~ b (e.g. by misanalyzing the correspondence Fi. mp ~ Hu. b), but no additional good evidence would be turning up for v ~ sz. At some point we might end up concluding that, yes, this is going nowhere and should be discarded. But then only this comparison! Finnish and Hungarian are still ultimately related, even if their words for ‘free’ are not cognate. Discarding this one comparison does not (should not) mean discarding also any other adjacent comparisons. A burgeoning comparative edifice needs to be open for exploration and individual mistakes, if it is to ever reach any particular rank like “a probable relationship” or “a proven relationship”.

This plea of course has also a corresponding inverse. Anyone who wants a “butterfly” treatment of their ideas has to have enough intellectual humility to recognize that it is, indeed, a tentative first-pass version. All too often I see also people who have a new language relation hypothesis in hand double down on their speculation, and not be open to even constructive criticism. Perhaps in some part there is a misunderstanding where people do not recognize the proposal of better, non-cognate etymologies (borrowing, onomatopoeia, internal derivation) as progress. But certainly also lone-wolf-genius-ism, and its attached incapacity to admit mistakes, is a problem that exists.

On the other hand, I don’t think this side of the problem needs to be focused on too much. In historical linguistics, the exploration of linguistic relationships is already a known research programme, a goal that many people agree to pursue even if we tend to disagree on quite a lot of details. This in mind, if a K. Kookenstein puts out a paper allegedly showing how English is related to Arabic, but then refuses to consider these comparisons in light of what Indo-European or Semitic linguistics has to say on this: we don’t actually need his approval on this! Language data is not locked, copyrighted, or in any other way tied down to one person, and if desired, it will be possible in any case to check such papers for insights relevant also to better situated IE–Semitic comparison. I know I at least keep a few “Hungarian is too a Turkic language” type works around for this purpose. The intended main thesis is not going to pan out; but any data cited to this end could regardless still prove to be valid. Usually anything of this sort mostly relies on word comparisons (appeals to typology are strangely rare), and these might remain valid as etymologies of any imaginable type… not just Turkic loans in Hungarian, but maybe also old Hu. loans in Tk.; Hu. cognates of Khanty or Samoyedic loans in Tk.; common loans from some third source like Iranian or Yeniseian or Mongolic; some could even end up being evidence for a general Turkic–Uralic relationship. None of this is a priori ruled out, and in this way it may well be possible, with patience, to find meaningful building blocks even within theories that don’t hold up in their entirety. Such is a nifty property of historical linguistics, something that definitely doesn’t generalize to every science.

The two animal metaphors from the start of this post, though, no longer work very well at this point. Some butterflies … may grow up to be big fish, even though most probably don’t? Moreover, I have been mostly illustrating this discussion with disputed-but-definitely-published ideas. More nascent ideas that are simply brought up in a discussion are a different beast for sure. Of course there’s a selection bias here too: the actual butterfly ideas I do have, you will probably not be seeing on this blog as such (and you might have to watch closely to catch any even on my side channels). [3] Arguably also scientific publishing is “a conversation”… especially any ideas that can be so far found only in some paper draft posted for comments online (in linguistics they’re not even concentrated yet on any arXiv analogue). For these, the original reading of a butterfly idea seems to still work fairly well. This may hopefully help (e.g.) various long-distance proposals to develop better in the end, before they end up with one of two common fates: shelved as not having passed the judgement of Reviewer #2, or self-published with excessive confidence. For this goal, yes, the ball very much is first in the court of people who do have an idea and want to develop it; but it is also in the hands of the rest of us, in being willing to offer first criticism that’s not a complete dismissal. Thirdly, worth noting, all this also depends on a social milieu where people even can find parties interested in discussing some out-there idea.

A further aspect of AUG’s original concept — avoiding unnecessary emotional stress upon people presenting a new idea — I haven’t really even touched here yet. This would be a whole other jar of larvae, but suffice to say I agree that academic discussion, for all its standards of civility, fairly often can have undertones all the way to hostility. This probably scares away many people without a thick skin who might otherwise have had a few interesting things to say; and those of us who do stay engaged, to whatever degree, it may leave with more stress than is necessary.

Some of it, I’m sure, does not even come from a particular need to be prickly, but from limited time… Sufficiently well-known figures in a field tend to get approached by a disproportionate amount of amateurs with A Revolutionary Discovery, unless they specifically keep themselves hard-to-contact, or, perhaps, maintain an aura of not suffering fools gladly. Again a problem that might be softened with other people being open and approachable enough. But this also starts edging towards the general area of science communication and public relations, a bigger fish still to fry that I’m not going to pretend to already have big original ideas for right now (and the butterflies, they will have to wait for other channels).

[1] The famous case of Nicaraguan Sign Language does not seem to have spoken analogues. In principle there is little directly preventing such a case (and something of the sort, maybe in several gradual episodes, will have to be assumed as the ultimate origin of human language too), but the conditions are unlikely to ever come about. A community of children who are capable of speech but do not have access to any pre-existing spoken language? Sorry, language in general is too adaptive to have been ever abandoned after its first introduction. I will go as far as to suggest that all known human cultures depend strongly enough on language for the transmission of cultural knowledge that any sudden failure of language skills across an entire human group (say, a transmissible disease that induces deafness, fast enough that a signed language does not have time to develop) would not lead to an all-new language being developed a few generations later; it would lead to the group’s extinction.
[2] In the philosophical sense, not the genealogical one. E.g. despite some exceptions, most languages still have nasal or labial or velar consonants; all but the most impoverished and unbalanced phonological inventories or even just consonant inventories are going to have substantial overlap between them. And even if we did find languages that somehow have completely disjoint phoneme inventories (lazy example: one has only stop consonants and front vowels, the other only continuant consonants and back vowels?), they will not be unbridgeably far apart: the known typology of sound change allows hypotheses relating basically any two speech sounds. Grammatical categories, too, can be quite different but still only finitely far apart, where the details of known language histories likewise give us ways to relate non-identical categories to each other (or to derive them de novo language-internally, etc.).
[3] A freebie for the sake of example though: cf. some very loose thoughts about the subclassification of Oceanic as floated on Tumblr just a few days ago (also already with some, though not highly severe, critique from a regular correspondent over there).

Posted in Methodology

Language Family Tectonics

Basic research in historical linguistics is mostly done within individual families: we take a swath of attested (in most cases modern) languages, and work towards the past to figure out their development from a common origin, one group at a time. Any knowledge of languages outside the family only really factors in as correction terms: filtering out loanwords and other contact influence, as data that the family’s overall internal history will not need to account for.

What the big picture of this looks like once we consider also geography is that we end up with a series of dots — “homelands” (though not to be understood as points of creation, but simply the last uncoverable phase of earlier processes) — somewhere in the past; some of which have then expanded, to cover the whole world by today. Just a few millennia ago, much of the world would have been an uncharted area, full of regions from which no knowledge of their languages has survived to us. The ones that do survive would, even, have been largely isolated dots. Most language contacts must eventually end (or rather, begin) at some point in the past. Languages of different families, that are today next to each other, cannot all have had their parents too as neighbors. Perhaps some individual cases were: Proto-Germanic seems to have been about as much of a neighbor of Proto-Finnic as Swedish and Finnish are still today; even further back, something like Proto-Kartvelian as a neighbor of Proto-Northwest Caucasian could be possible too. But once we consider highly expansive families, it is self-evidently absurd to propose that Proto-Indo-European could have been simultaneously a neighbor to all of (pre-)Proto-Kartvelian in the Caucasus, (pre-)Proto-Uralic in the taiga zone, (pre-)Proto-Dravidian in South Asia, pre-Basque in Iberia…

This already implies that most borders of today’s language families are collision zones: where two lineages have come to meet that were not in contact at some point in the past. (Same also for some, though fewer, language borders within them.) I’d like to think that we can probably divide them further in subtypes. This will have to include their history, not just their current but also past dynamics. One reasonable analogy might be plate tectonics. Geologists are not content to simply locate the current boundaries of the world’s tectonic plates, but ever since the rise of continental drift to a mainstream theory, already introductory maps will also aim to identify boundaries as either constructive, destructive or conservative. Often longer-term history or future, too, could be extrapolated from arrows of movement (of, yes, actual movement right now — as per the classic example and the mid-ocean ridge closest to me, the Atlantic Ocean is growing some three micrometers wider every hour, already a perfectly visible amount of maybe 0.3 millimeters since I began to write this blog post).
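
As a quick sanity check of that back-of-the-envelope figure, here is a minimal Python sketch. The rate of three micrometers per hour is the rough number quoted above, not a measured value, and the function names are just for illustration.

```python
# Rough spreading rate quoted in the text (an approximation, not a measurement).
MICROMETERS_PER_HOUR = 3.0
HOURS_PER_YEAR = 24 * 365.25


def widening_micrometers(hours, rate=MICROMETERS_PER_HOUR):
    """Total widening in micrometers after the given number of hours."""
    return rate * hours


def hours_to_widen(millimeters, rate=MICROMETERS_PER_HOUR):
    """Hours needed to widen by the given amount (1 mm = 1000 micrometers)."""
    return millimeters * 1000 / rate


# 1 cm = 10,000 micrometers
cm_per_year = widening_micrometers(HOURS_PER_YEAR) / 10_000
print(f"per year: about {cm_per_year:.1f} cm")
print(f"hours for 0.3 mm: about {hours_to_widen(0.3):.0f}")
```

At this rate the ocean widens by roughly 2.6 cm a year, and 0.3 millimeters corresponds to about a hundred hours of elapsed time.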

Of course this is not to be aped too closely. The social “forces” that drive linguistic expansions can be rather fickle, nowhere near as stable and predictable as the physical forces of geology in e.g. continental drift. No responsible linguist is going to be putting a predicted specific time of death on any but, perhaps, an already moribund language (those where all transmission to new generations has already ceased, and the only question is whether the last few speakers have 5 or 50 years left to live); and predictions on what languages will be gaining new ground entirely I have not really seen anywhere at all. If anyone wants to register particular predictions, be my guest, but currently these are really only going to be educated guesses, not derived from a theory with known predictive power.

So maybe let’s not draw any future-pointing arrows on linguistic fault zones just yet. Drawing past-originating ones, though, seems like a much more doable task, first of all in cases where (some) history is already known. And this I think also gives us anyway some analogues of geologists’ “constructive, destructive, conservative”. A look at known history actually suggests that just two types might be enough to get started. Of course we can have conservative boundaries, where languages have stayed each on their own side for a while. This often coincides with also geographic boundaries of some sort (e.g. the northern boundary of Indic has been, broadly, at the Himalaya for millennia, and it’s no wonder that the Korean / Japonic boundary has stabilized between the Korean peninsula and the Japanese archipelago). Then we have collision zones, where two lineages come head to head —

But wait. Head to head? No, actually, the most typical case we see anywhere in the world’s known history is not quite this. Where we find e.g. a Germanic / Celtic boundary in the British Isles, a Finnic / Samic boundary in northern Finland, a Turkic / Iranic boundary north of Iran, a Bantu / Khoe boundary in Botswana: these do not represent cases of two spread events that finally arrived at some common ground simultaneously, running out of no speaker’s land to claim. Almost always such a border represents one newer (Germanic, Finnic, Turkic, Bantu) and one older family (Celtic, Samic, Iranic, Khoe), with the latter’s historical range extending far into the former’s current-day one. The geological analogy happens to continue working here too to some extent: when two plates collide, for all the mountains that result, these still are not zones where both plates indefinitely squish and crumple without crossing. Instead one plate will be pushed underneath another, into the crust (and mainly the topmost one will jut up as mountains). Now the distribution of language families does not really have a Z-axis, but the time axis does similar duty here. We already routinely speak of e.g. English expanding (having expanded) “over” Brittonic; and call the latter a “substrate”, the former a “superstrate”, again employing terms from geology that strictly speaking refer to vertical location. I’m sure also a part of the motivation is one of geology’s core findings that, by default, vertical order reflects historical order!

To fully derive an understanding of this situation, the naive zeroth-order model of language family expansion (they start in some compact area in the past and begin expanding) moreover needs to be amended by the fact that expansions are not infinitely powerful: they can run out of steam even without encountering another expansion in their path. Not only does Finnish supersede various lost Sami varieties, it is also not the case that Samic started somewhere in the north and expanded south until running into Finnic. Rather, Samic also itself originally expanded mainly northwards, probably much along the same geographic routes. There was no southward expansion front of Samic for Finnic to collide with; nor an eastward expansion of Celtic by the time of the Germanic expansions, etc. In this way linguistic expansions might have a better geological analogy still in lava flows in a volcanic field: they will layer on top of one another, not by virtue of which one expands faster or more strongly, but by simple virtue of which one has already stopped, at least in a particular area, and which one is still going.

In those cases where two expansions do happen to be going on simultaneously, this is maybe indeed more likely to end up with something resembling a conservative boundary. Even among these, though, many will prove not quite entirely stable if we look closely enough. They can turn out to be series of small advances on either side, just not spilling out to outright conquest of the other family (and likewise, mostly not inherently one-dimensional lines anyway, but a crossfade in the proportion of speakers of X versus Y). Again more like lava flows than continents.

Still, I will continue to keep the term “tectonics” here anyway. Etymologically looking, it is not a term that by itself implies the details of plate tectonics, but simply refers to the largest-scale analyzable units.


What can we do with this then? If we recognize that the world’s major language family boundaries are mostly collision zones (where one family is or has been in the process of expanding at the cost of another, not currently expanding one), this gives us first of all convenient rules of thumb about linguistic substrates. Anywhere near a language family boundary, the substrate of an expanding family X is probably primarily the non-expanding language family Y next to it. At least in the wide definition of “substrate”, that is, “the language spoken there before the expansion of the current family”. Whether it has left any discernible substrate influence, structural or lexical or toponymic, would be another discussion entirely. Conversely, locations where we might be able to fruitfully hypothesize completely extinct substrates will be instead:

  1. more towards the geographic or expansion centers of recently expansive families (thus e.g. the Paleoeuropean substrates of Germanic);
  2. underlying not-most-recently expansive families that have few or no leading edges over anything anymore (thus e.g. the Paleolaplandic substrate in Samic).

Or further yet. The facts that language families expand from small origins, readily take over other languages in the process, and are also generally just some thousands of years old, lead us also to a more powerful rule of thumb: There Was Some Other Language There Before. Almost no language is the absolute first language to have been spoken in “its” territory. The main exceptions would be a few cases of recent seafarers, above all in Polynesia; several more scattered cases also in the Atlantic, of which I think only Icelandic and Cape Verde Creole have been established as their own languages. [1] At any other ends of the Earth, Inuit is a known newcomer in the American high arctic, Pama-Nyungan is a known newcomer in the Australian interior desert (even if the languages preceding them are not attested)… and in places with long written history, we may find quite extensive known successions, to the effect of Hattic replaced by Hittite replaced by Luwian replaced by Aramaic replaced by Greek replaced by Arabic replaced by Turkish. Maybe some Assyrian or Kurdish phase in there somewhere too, depending on what point we’re considering here exactly. More importantly, over the remaining at least 60,000 years of modern human presence in West Asia without written records, obviously much much more of this still. Not all of this leaves major genetic or archeological fingerprints, either, and some specific cases might be very hard to identify if we didn’t have linguistics itself as a source of evidence.

For two, it will be generally beneficial to work out which of any two language families in contact at a particular border has been the more recently expansive one. [2] Known more widely, at least. I’m not sure if there actually are many cases where this would be a mystery entirely. I could think of some hard-to-tell cases once we’re talking about subfamily borders (Mari / Udmurt? Celtic / pre-Latin Italic?), but even here probably some dedicated experts would have an opinion. Maps of individual language families, especially in historical contexts, often enough also have some spread lines or historical distributions marked. But large-scale summary maps still trend towards presentations like this, seemingly entirely static, even though the process of restricting language families to complementary areas necessarily elides some current-day detail in favor of historical idealization (denoting where a language family “is native” or “is traditionally spoken”). I’ve seen sociolinguists criticize this whole genre of language distribution maps repeatedly already, for not really capturing synchronic reality. The response though might not need to be to abandon them entirely, as much as admit that, yes, they are maps that display some historical information too, and adjust accordingly for more history-informed design. If there is knowledge on this mostly out there, why not?

For three, a concept of family tectonics readily draws attention to the point that there’s work to be done not just on charting language families’ “current” or “traditional” distribution, but also their past distribution. “Beneath” (before) any current language family there “is” (was) some different distribution of other languages. Some of them maybe belonging to its still extant neighboring families, some maybe its own lost relatives, some maybe unknown entirely.

The first possibility I find the most interesting for the sake of further work. The closest example to my work comes from central and eastern Siberia. An important but I think largely open question would be what was spoken in the area before the expansion of the relative newcomers? Russian is of course the newest layer all over the place, but Siberian Turkic (Yakut, Tuvan, etc.) and Northern Tungusic (Evenki, Even, etc.) are both parts of relatively recent families too. What have they ended up displacing? Early Russian explorers report, and rudimentarily attest to, first of all a formerly wider distribution of the Yukaghir family, today known only in two small islets; and a variety of Samoyedic and Yeniseic varieties in the southwest of this area. Still, the main Turkic and Tungusic expansions must have been early enough to predate all historical records in the region, so this cannot be the whole picture either. One hypothesis I keep coming back to is the possibility of a lost “tenth” Uralic branch — perhaps para-Samoyedic, perhaps an independent branch entirely. This might have some benefits to it in explaining a variety of known but not especially substantial similarities between Uralic and all the other families further east. Turkic of course has been in direct contact with (branches of) Uralic anyway, but various parallels continue sporadically into Yukaghir, Tungusic, Chukotkan, Nivkh, Eskaleut. All of them seem more likely to originate from the Uralic side, due to it being the Siberian family with the most known time-depth. Yeniseian is sometimes approximated as rather old as well, but otherwise both “Neosiberian” and “Paleosiberian” are all families without too much time-depth. [3]

Most notably, Uralic parallels in eastern Siberia include even basic words for ‘reindeer’, an all-important livelihood animal for many groups these days, especially Chukotkan *qora (whence the ethnonym Koryak), Tungusic ⁽*⁾oron (or probably *xoron, with further diffusion after *x > ∅ in NTg) (whence the ethnonym Oroqen). Kolyma Yukaghir qoroj ‘two-year-old male reindeer’ is usually adduced here too, as well as loanwords further into Siberian Yupik. This has been already identified in earlier research as a Wanderwort originating in Proto-Uralic *kojəra ‘male [domestic?] animal’ > Proto-Samoyedic *korå ‘id.; bull reindeer’, which might have already had an allophonic [q-] in Proto-Samoyedic or even earlier. But we seem to lack especially clear evidence on who is to be credited for the original diffusion of this word. Yakut, as far as I know, has no reflex of it, splitting the Eastern Siberian region off from Samoyedic, and thus probably suggesting a pre-Turkic movement eastward. If so, then maybe even already at the time of the original Uralic expansion (which I think must have been partly eastwards too in any case)? Who knows. Maybe someone will eventually though, if we get e.g. some additional toponym data for guidance and keep inter-family comparative research going.

Elsewhere in the world, I’m wondering also about e.g. how far Africa’s other language families might have reached before the Niger-Congo and particularly Bantu expansion. The case of possible contact between Khoe and Cushitic is already preliminarily discussed in a 2009 paper from Blench, though I’ve been unable to verify his interesting claim that Khoe #goe for ‘cow’ would be comparable with similar “widespread terms” in Cushitic. [4] The quite tattered Central Sudanic looks like another good candidate for a family that might have been more widespread earlier (but might have been also encroached upon by Chadic and the various branches of Eastern Sudanic). In the Americas, too, I could wonder especially what preceded the large continuous spreads of Athabaskan and Algonquian in most of Canada and the northern US? (And also which of them is the newer one?) Was there ever anything to the effect of “Inland Tsimshianic” or “Inland Tlingit”, “Plains Iroquoian” or “Forest Caddoan”? Or turning to Oceania: how far west and east did the various “Papuan” language families (many of them even today not confined to just New Guinea) extend before the Austronesian / Malayo-Polynesian expansion? For that matter has anyone even tried comparing any of these with the other continental SEA languages in any capacity, or just assumed that they must have been in splendid isolation amongst each other linguistically effectively forever?

These are questions that, again, some experts might already know answers to or at least have hypotheses for. But nowhere is this information available in centralized geographic form, even though it would be surely possible to represent so, giving a kind of a bird’s eye view of what are the major ethnohistorical results achieved or confirmed by historical linguistics, and what questions still remain open.

[1] The Faroe Islands seem to be better established than Iceland as having had a pre-Norse population (at least as of the Nature study just last December). A longer list of cases without a distinct local ethnicity includes e.g. the Azores, Bermuda, Falkland Islands, Svalbard, Tristan da Cunha (and also remote islands in the other oceans, e.g. Kerguelen). There are some more within-reach cases like the Andamans, Maldives or Nicobars, for which I’m not sure what’s known of their prehistory (though then already the existence of two Andamanese language families suggests that one of them is very likely older than the other).
[2] Not always the same family on top in all interactions: Turkic has been expansive over Iranic, while Russian has been expansive over Turkic … and yet Russian and Iranic are both Indo-European. It should be no surprise at all either when we find e.g. language shift from Swedish into Finnish in Finland, vs. from Finnish into Swedish in Sweden.
[3] Really if “Neosiberian” is taken to mean “the recent but pre-Russian arrivals”, and “Paleosiberian” as everything else in the area — then we ought to be counting Uralic as the largest representative of the latter, not as some European family that somehow just happens to be also present. By now we do know the westernmost expansions of Finnic, Samic and especially Hungarian to be relatively recent, while Uralic or pre-Uralic presence in western Siberia has no established terminus post quem (short of the hard geological limit of the last ice age). — I suppose the usual exclusion of Uralic from “Paleosiberian” has been instead more informed by its typological similarity with Turkic and Tungusic. But then this seems improper when the term is Paleosiberian, not “Non-vowel-harmonic-siberian” or anything else of that sort.
[4] Checking with a recent monograph from Bender instead shows some very incomparable-looking terms in most of Cushitic, such as Oromo /saʔa/, Konso /lawaa/, Agaw (North Cushitic) *lɨw-, South Cushitic *ɬee; or does Blench have some supposition about a Northeast Caucasian-esque *ɬ > *g?! — Further north, *gʷow- ‘cow’ in Indo-European does look amusingly similar to Khoe, but Afrasian is a bit too wide and old of a family (definitely older than the domestication of cattle, which “only” dates to ~10,000 years BP) for me to think that there could be a connection entirely without it. Even something like the mysterious Y-DNA haplogroup R-V88, common in central Africa around Lake Chad yet seemingly derived from Eurasia, doesn’t really allow any connection that would reach all the way to southern Africa.

Posted in Methodology

Reviewing UraLex

Nerdsnipe of the day: the BEDLAN team, researching diversification of the Uralic languages interdisciplinarily, mentioned earlier today that they will be soon uploading version 3 of their UraLex dataset of basic vocabulary across Uralic. I thought this might be a good time to do a look-over of the data, from a not-that-computational historical linguist’s point of view (i.e. mostly on the contents, not the technical details). Maybe these comments will be helpful either to the team or to other people aiming at similar projects.

Data sources

The selection / definition of languages looks mostly good already to me, with varieties being specified fairly closely, including details like “Sosva Mansi” rather than just “Northern Mansi”. Unmarked “Selkup” is however questionable at least. This is claimed in the documentation to be more specifically Taz Northern Selkup, the currently most vital dialect [1] and the basis of current written Selkup. The listed forms, though, often look more like the Proto-Selkup reconstructions from Sölkupisches Wörterbuch, e.g. in retaining PSk *č (> modern NSk /t/) and *uə (> *Cʷë > modern NSk /Cɤ/, /wɤ/). A similar issue is the database’s “Karelian Proper”. This too does not appear to be any real variety of Karelian, but rather the interdialectal lemma forms of Karjalan kielen sanakirja, which are frankly overly Finnishized (not really actual Proto-Karelian), and elide many important contrasts, especially voiced obstruents and, mostly, the s / š contrast. E.g. rasva for ‘fat’ only appears as such in the Oulanka dialect. Most northern Karelian has rašva, much of southern Karelian razva, some intermediate southern dialects ražva.

The KKS and SkWb lemmas are probably tolerable as lexicostatistic indices to Karelian and Selkup, but I hope some future update might fix this in favor of actually-recorded language varieties — and certainly before anyone tries to do phonological analysis with this data!

I would have some desiderata myself on what varieties’ classification would be interesting to gauge by their lexicon. Foremost maybe transitional varieties, such as Karelian Isthmus Finnish; NE Erzya and Shoksha; Pelym, Lozva & Eastern Mansi; Berezovo, Nizyam, Salym & Vartovskoe Khanty; anything really among the Selkup dialects. But it’s possible that this is too fine detail for a Uralic-wide dataset and would call for within-language-group studies instead, similar to Rydving (2013) on Sami. And it appears that the most important additions for within-Uralic study have already been planned: adding Moksha besides the currently represented Erzya; Hill Mari besides Meadow Mari; Obdorsk (Northern) Khanty and Pelym (Western) Mansi varieties besides EKh and NMs; Kamassian and Mator within Samoyedic. These should cover many bases. E.g. the well-known Mansi cognate(s) of Hung. tűz, EKh tö̆ɣət ‘fire’ are not recorded from NMs, but do appear in WMs (Pelym toåwt, Upper Lozva töät, North Vagilsk tüöwt, etc.).

A different point entirely is that attempts to study specifically the interrelationships of the nine basic Uralic branches would, I think, function the best if using their protolanguages as the basic data points. There are also a few gotcha cases where no coverage of modern-day languages is sufficient: occasional native Uralic terms might be reconstructible for Proto-Mansi only from early 19th century wordlists, for Proto-Samoyedic only from Castrén’s mid-19th century records, for Proto-Mordvinic only from Witsen’s 18th century records, for Proto-Hungarian only from early medieval records, etc. Comparative-historical Uralistics is maybe not particularly philology-centered, but has never been able to afford overlooking philology entirely. [2]

The selection of semantic concepts to cover is generally reasonable, pulled from major basic vocabulary lists like various Swadesh lists and the Leipzig-Jakarta list. Some of the items on these do break up completely to noise within Uralic, but that’s a good point to have on record as well. I do not think the classic Swadesh list was assembled very rigorously, and at some point it would be good to know not just something about the relative average stability of concepts on it, but also their variance in stability across different language families. An example I have often mentioned in discussions related to this is how in Uralic, ‘fish’ and ‘moon’ are highly stable, while ‘cow’ is unreconstructible and ‘sun’ is highly unstable; while in Indo-European, ‘cow’ and ‘sun’ are highly stable, vs. ‘fish’ unstable and ‘moon’ just about unreconstructible. (This phenomenon e.g. already constitutes a fairly strong critique of glottochronology or any models resembling it, which would rather predict average variance to be a monotonic function of average stability.) — Many of the more unstable and entirely unreconstructible concepts seem to be from the LJ list. This is basically what we should expect I think, since these have been selected only by their stability vs. loaning, not vs. all the other lexical innovation processes out there like derivation, semantic shifts, onomatopoeia, a priori coinages (and also not even vs. the likelihood of synchronic synonymy).

There are regardless still many semantic concepts or etymological groups that I think would have a bunch to say about the diversification of Uralic, but which haven’t made the mark. These are I suspect typically more Uralic-specific, and they could not be easily located by general cross-linguistic considerations. Simple examples include e.g. terms for local fauna (*śixələ ‘hedgehog’, *onča ‘nelma, Stenodus’), flora (*ďëmə ‘bird cherry’, *pečä ‘pine’) and technology (*joŋsə ‘bow’, *ńëlə ‘arrow’). More involved examples tend towards etyma that Helimski (2001) has called core vocabulary as distinct from basic vocabulary: often verb roots, relational terms, or incipiently grammaticalizing body part terms, that may not have strong semantic stability but do have decent etymological stability. In Uralic thus e.g. *kixə- ‘to rut, lek, be excited, lustful, want’, *kulə- ‘to go out, run out, wear, end’; *pučkə ‘hollow, tube, inside, marrow’; *pončə ‘tail, hem, back part’ (glosses not meant as PU but indicating the range of variation in reflexes). Most regular lexicostatistic methods run poorly however if matched against etyma that don’t have stable or well-defined proto-meanings, e.g. we can’t really ask what is “the” replacement of such an item in a language that has lost it. Down the line, some new techniques entirely will be required for making use of this kind of data instead.

Phonetics & Phonology

I do not know what use, if any, is planned for this part of the data, but especially inconsistent IPA transcription seems to remain a major problem, as many other times in Uralic studies.

  • v is transcribed as a fricative /v/ rather than the approximant /ʋ/ for Estonian, Votic and Ingrian (though correct in Finnish).
  • A phenomenon I’ve seen in many online sources over the last ~10 years, Finnish h is given superfluous and partly incorrect transcription as /ç/, /x/ in many clusters and /ɦ/ in many medial positions. E.g. karhea ‘rough’ as “/karçe̞a/”, though fricative allophones only appear with any systematicity in the syllable coda. Even then these have enough variability that I would think leaving this as phonological /h/ would be surely the safest choice.
  • Some Finnish falling diphthongs are transcribed with glides as the 2nd component (aurinko ‘sun’ /ɑwriŋko̞/, koira ‘dog’ /ko̞jrɑ/), others with close vowels (jauhaa ‘crush’ /jɑuɦɑː/, oikea ‘right’ /o̞ike̞ɑ/).
  • Estonian length marking is a mess. -p- -t- -k- appear seemingly at random as both /p t k/ (thus also -b- -d- -g-) or /pː tː kː/ (thus also -pp- -tt- -kk-); sometimes even in the same word, e.g. lükata ‘to push’ as “/lykɑtːɑ/” (as if ˣlügatta ?)! I don’t have strong opinions on if it’s more proper to use /pˑ tˑ kˑ/ for transcribing grade 2, or maybe /pːː tːː kːː/ for grade 3, but please at least make the distinction. — I’m not even going to start on long/short clusters or overlong vowels, which are maybe less phonologically relevant anyway.
  • Estonian palatalization has also gone absent, e.g. lill ‘flower’ as /lilː/ and not /lilʲː/. Also, four slip-ups of õ turning up as IPA /ɣ/ rather than /ɤ/: “/hɣːrutɑ/” ‘rub’, “/kɣvɑ/” ‘loud’ (but correct in /kɤvɑ/ ‘hard’!), “/lɣkːs/” ‘trap’, “/mɣmisetɑ/” ‘mumble’.
  • Votic transcription includes some allophones like [d̥ g̊ vʲ ɑˑ], but leaves unmarked maybe the most prominent allophone in the language, л = [ɫ], “dark L”. I did not catch any ˣ/ɣ/ pro /ɤ/ mistakes.
  • I’m happy to see that most languages’ palato-alveolar ľ, ń, ś etc. have been transcribed as /ʎ/, /ɲ/, /ɕ/ etc. rather than the incorrect /lʲ/, /nʲ/, /sʲ/ seen in many naive attempts to IPA-fy Finno-Ugric transcription. But this has been overdone to also include Erzya, for which palatalized alveolars are correct. Not a major issue ultimately, but still an inconsistency.
  • Meadow Mari ə̑ has been transcribed as /ə̱/, which is a bit superfluous; /ə/ would be sufficient. (It is rather Hill Mari ə (= reduced e) that would call for a diacritic in IPA, probably /ĕ/ or /ə̟/.) — The Ob-Ugric data has had the ə / ə̑ distinction phonologized away entirely, though if desired, it could be maintained phonetically at least in Eastern Khanty.
  • Komi and Udmurt: FUT / literary ‹ы› is given as /ɯ/, rather than the more correct /ɨ/, and / ‹ӧ› has been rendered as /ɤ/ though probably /ə/ or /ɘ/ would be likewise more consistent (as in the Oxford handbook of Uralic from this spring). Even a / ‹а› might be for the Permic languages better rendered as IPA /a/ (unlike most of Uralic, where a contrasts with /æ/ and is thus indeed better rendered as IPA /ɑ/).
  • Hungarian uses tie bars for some of its affricates, /t͡s t͡ʃ/ etc. Not incorrect in any way, but this is used nowhere else in the data and not even entirely consistent within Hungarian. I also notice a straggling flap /ɾ/ appearing in erdő ‘woods’, féreg ‘worm’ that seems like an error.
  • Uvulars in Khanty aren’t dealt with very consistently at all. [q ʁ] as back-vocalic allophones of /k ɣ/ go unremarked, but /χ/ is indeed transcribed as uvular (ditto in Mansi). Worse, some data with /χ/ has been incorrectly entered for Vakh-Vasyugan Khanty, e.g. jŏχət- ‘come’, koχ ‘long’ (the actual VVj forms are jŏɣət-, koɣ). Only western Khanty ever has χ!
    I suspected data mix-up initially, but this clearly must be a processing problem instead, given even e.g. köχ ‘stone’: no such form appears anywhere in Khanty (it’s VVj köɣ, Jugan kä̆w, other Surgut kä̆ɣʷ, all western kew). Are these words derived from some orthographic source that spells VVj /ɣ/ as Cyrillic ‹х›, by any chance? (But still correct forms in many other cases like ‘head’, soɣ ‘worm’, wajəɣ ‘bird’.)
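
Slips like ɣ for ɤ are the kind of thing a mechanical check could catch before release. A minimal sketch, assuming a hand-made and deliberately incomplete segment inventory for Estonian (the inventory and the function name are my own illustration, not anything from UraLex): any symbol outside the expected inventory gets flagged for a human to look at.

```python
# Hypothetical, deliberately incomplete inventory: just enough for the
# Estonian examples quoted above. A real check would need a vetted list.
ESTONIAN_SEGMENTS = set("ɑeiouyæøɤptkmnlrsvhjfʃ")
# Length and palatalization marks are allowed after any segment.
MODIFIERS = set("ːˑʲ")


def unexpected_symbols(ipa, inventory=ESTONIAN_SEGMENTS):
    """Return the symbols in an IPA string that the inventory does not expect."""
    return [ch for ch in ipa if ch not in inventory and ch not in MODIFIERS]


if __name__ == "__main__":
    for form in ["hɣːrutɑ", "kɣvɑ", "kɤvɑ", "lɣkːs", "mɣmisetɑ"]:
        print(form, unexpected_symbols(form))
```

Since Estonian has no /ɣ/ at all, every hit here is a candidate error. For Khanty, where /ɣ/ is a real phoneme, the same check would need a different inventory per variety.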

Looking over these issues, I could formulate a Rule #1 for IPA-fying FUT: the transcription systems do not correspond 1:1 and several details must be, alas, checked on a language-by-language basis. Especially vital is understanding your source data: whether whatever you are IPA-fying is pre-WW2 “hyperphonetic” FUT; mid-century “major-allophonic” FUT; or post-70s “phonological” FUT. IPA comes with its bracket notation [d͇], /d/, //ð// etc. to warn what level of transcription you might be dealing with… FUT does not, perhaps its biggest flaw. A related Rule #2 might be that it’s similarly important to understand what you are trying to do with IPA: phonological, broad phonetic or narrow phonetic transcription? Most of the time, there is no One Correct IPA Representation either.
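
Rule #1 could be sketched in code as a default table plus per-language overrides. This is only an illustration under my own assumptions: the table covers just the three palatal letters discussed above, the language labels are plain strings I made up for the example, and everything not in the table is passed through unchanged (which a real converter of course could not do).

```python
import unicodedata

# Default FUT -> IPA choices for the palatal series (assumed default only).
DEFAULT = {
    "\u015b": "ɕ",  # ś
    "\u0144": "ɲ",  # ń
    "\u013e": "ʎ",  # ľ
}
# Language-specific exceptions: Erzya has palatalized alveolars instead.
OVERRIDES = {
    "Erzya": {"\u015b": "sʲ", "\u0144": "nʲ", "\u013e": "lʲ"},
}


def fut_to_ipa(word, language):
    """Convert the palatal letters of a FUT form, language by language."""
    table = {**DEFAULT, **OVERRIDES.get(language, {})}
    # NFC so that letter + combining accent matches the precomposed keys.
    word = unicodedata.normalize("NFC", word)
    return "".join(table.get(ch, ch) for ch in word)


if __name__ == "__main__":
    print(fut_to_ipa("veśe", "Erzya"))
```

The point of the design is that the exceptions live in data rather than in code, so that adding one more language-by-language check means adding one more row.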

In the base FUT data I do not see any further major issues. It would be probably good to make sure to distinguish ´ (the suprasegmental palatalization sign) and ˈ (the overlength / strong-grade cluster sign) in the Samic data though. Currently both seem to be much of the time encoded as a simple apostrophe; e.g. Inari Sami kyevˈđi ‘snake’, Skolt ku´vdd ‘id.’ are given as “kyev’di”, “ku’vdd”. Occasionally even opening or closing single quotes appear (thanks, Microsoft). Apostrophes do actually even triple duty in marking palatalized ľ in other languages, but this seems unlikely to do any real harm.
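
A minimal sketch of how the ambiguous marks could be located, assuming the intended signs are encoded as U+00B4 (´) and U+02C8 (ˈ) as I have typed them above; the function name is my own. It only reports where a plain apostrophe or a curly single quote occurs, since deciding which sign was meant still requires checking the source.

```python
import unicodedata

# Characters that could stand for either the palatalization sign or the
# overlength sign, and so need a manual check against the source.
AMBIGUOUS = {"\u0027", "\u2018", "\u2019"}


def ambiguous_marks(form):
    """Return (position, Unicode name) for each ambiguous mark in a form."""
    return [
        (i, unicodedata.name(ch))
        for i, ch in enumerate(form)
        if ch in AMBIGUOUS
    ]


if __name__ == "__main__":
    for form in ["kyev'di", "kyevˈđi", "ku´vdd", "ku’vdd"]:
        print(form, ambiguous_marks(form))
```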

Protoforms

The dataset is of course primarily about attested lexical data, so I maybe should not spend too much time on examining the proto-language reconstructions included (only Proto-Uralic, no intermediate reconstructions). Still, this is protouralic dot wordpress I am blogging at, so some observations on that topic too.

The transcription scheme seems to closely follow Janhunen 1981, Sammallahti 1988. The *i/i̮ reconstruction for noninitial syllables is used almost thruout; an *-e- has slipped in only in *koje-mV ‘husband’. *i̮ rather than *e̮ is used in initial syllables too, however still an **a in at least a few lexemes like *maksa ‘liver’, *maɣi̮ ‘earth’ (= J *mi̮kså; S *mɨkså, *mɨxi); also *ńś rather than *ńć, though a traditional *ć is still retained in some cases. Different transcription schemes are more inconsistently mixed for the “voiced spirants”, including ‹δ› in *śaδa- ‘rain’, but ‹ð› in *wuði̮- ‘new’; ‹x› in *juxi̮- ‘to drink’, but ‹ɣ› in *miɣi- ‘to give’.
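
The mixing of spirant notations is easy to detect mechanically. A minimal sketch, assuming that ‹δ› / ‹ð› and ‹x› / ‹ɣ› are each meant as notational variants of one proto-segment (the grouping and the names are my own illustration):

```python
# Groups of symbols assumed to be notational variants of one proto-segment.
VARIANT_SETS = {
    "dental spirant": {"\u03b4", "\u00f0"},  # δ, ð
    "velar spirant": {"x", "\u0263"},        # x, ɣ
}


def mixed_notations(forms):
    """Report each variant group for which more than one symbol is in use."""
    report = {}
    for label, variants in VARIANT_SETS.items():
        used = {v for v in variants if any(v in form for form in forms)}
        if len(used) > 1:
            report[label] = sorted(used)
    return report


if __name__ == "__main__":
    print(mixed_notations(["śaδa-", "wuði̮-", "juxi̮-", "miɣi-"]))
```

An empty result would mean the list of reconstructions is at least internally consistent on this point, whichever symbols were chosen.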

A possible consequence of the dataset’s original compilation for a lexicostatistic review of the traditional Uralic classification is also that some meanings are marked as “[Not reconstructible]”, although they would have well-established though western-leaning proto-forms, e.g. *külmä ‘cold’ (maybe debatable; an IMO poor loan etymology from Balt(o-Slav)ic remains marked for the reflexes), *mälə ‘mind’ (clearly PU; this is reflected in derived verbs in Ob-Ugric), *läwlə ‘heavy’ (EKh ‘cold’ probably doesn’t belong). Some items reconstructed in recent literature are missing too, e.g. Aikio’s revamped *këččə ‘bitter’, *widä- ‘to kill’. More worrying for me is how also many long-known proto-forms are left absent, such as *küsə ‘thick’, *näkə- ‘to see’ (admittedly most reflexes derivatives w/o this meaning), *lükkä- and *puskə- both ‘to push’, *śepä ‘neck’, *sańća- ‘to stand’, *wëlkə ‘white’. I don’t think this can be just due to later semantic divergence in some reflexes, when e.g. *jelä ‘day’ has been admitted as a PU form only from Samoyedic direct evidence (parallels also at minimum in Samic); and *śilä ‘fat’ from no direct evidence at all? Yet also some poor comparisons from UEW seem to remain around, e.g. “*čočV-” ‘to wipe’; actually its only reflex meaning ‘wipe’ is Finnish huosi-, which I don’t think can belong here. [3] — These types of issues may even combine for more involved cases. E.g. the PU word for ‘full’ is given as *türə, a narrowly distributed Finnic–Permic etymon, and not the better-distributed *täwdə. This is again probably per UEW, which maintains Selkup tīr as reflecting the former and not, as recognized since Aikio 2002, the latter. [4] Or, the word for ‘year’ is given as *ärV; but this reconstruction was in effect already refuted by Aikio 2012, who points out that the Samoyedic forms (meaning ‘fall’) go back to PSmy back-vocalic *ër-, which continues rather the already better-distributed PU form *ëdə. [5]

A methodological choice also seems to have been that no synonyms are admitted for PU, although there probably are a few concepts in the data for which they existed; e.g. besides *śilä for ‘fat’, we can reconstruct also *wajə, *koja (both already alluded to in the database; the former though specializing to ‘butter’ in most Uralic languages familiar with agriculture).

(All my Uralonet links above show what I think of as their most reliable reconstructions, but defending those would be at times quite a debate that I don’t intend to get into in detail here — I’ll be happy as long as the reconstruction system chosen is at least internally consistent enough.)


Since following newer literature adequately appears to have given some difficulty for the team, I would like to note here (I think for the first time on this blog) that I’ve already a few years ago started a little repository of new results in Uralic etymology, currently keeping track of

  • newly proposed PU reconstructions;
  • newly found reflexes of known reconstructions;
  • newly found loan etymologies for what have previously been thought of as native Uralic etyma.

The list(s) can be found at the Sanat wiki, as a part of / appendix to our etymological database of Proto-Finnic. [6]

Currently pending updates include, besides better coverage of several earlier but post-UEW sources, especially several new native and loan etymologies for Mari and Permic from Metsäranta’s PhD thesis from 2020. I have also been thinking of starting an “antietymological” sister repository, tracking PU reconstructions that have been clearly disproven by better etymologies being published for all or all-but-one of their reflexes, of which there are quite a few by now too.

Etymological marking

Maybe the core content of the dataset. Standard literature has been followed quite faithfully here and I see no major flaws (even where etymological relationships have not been seen fit to be promoted to Proto-Uralic status). Mostly I can just point out some recent and overlooked results. Besides cases already mentioned:

  • The Hungarian word for ‘claw, nail’ has been unfortunately given as the less basic karom ‘claw, talon’ rather than köröm ‘claw, nail’; which was, even, recently argued by Aikio to be indeed a reflex of PU *künčə.
  • The Samoyedic words for ‘to scratch’ derive from PSmy *kətå ‘nail’, much as also e.g. Khanty *kö̆nč- does double duty as ‘nail; scratch’. The base noun is, probably correctly, not admitted as a cognate of rest-Uralic *künčə. The verb entry however inconsistently does encode them as cognates.
  • The most notable loan etymology missing entirely is probably the derivation of Erzya veśe and Hungarian össze- ‘all’ from earlier *wiśwV- ← Proto-Indo-Iranian *wićwa- > Sanskrit víśva, etc. (an etymology due to Katz 2003 that was unfortunately overlooked by Holopainen 2019). Both are regular: for Hungarian *wi- > *wü- > *ü- also in IIr. loans (besides native ones like *widä > öl- ‘to kill’), cf. the already long-known özvegy ‘widow’ < *wiðVwädźV ← Scythian / pre-Alanian *widawa-čī.
  • There is probably room to adjust many of the individual loanword etymologies, e.g. Kildin Sami sūll´ ‘salt’ is not borrowed from Russian соль but, as maybe the palatalization best reveals, from Finnic *soola (thus also UEW, SSA). This would regularly continue a Proto-Samic *sōlē > Peninsular Eastern Sami *suəllʲe, also present in Skolt suõ´ll. It would be way too much work for me to start digging into these with any consistency on my own, though.
  • There are, on the other hand, still several Proto-Indo-European loanword etymologies advanced that do not seem very reliable (were they ever widely accepted?), e.g. *pelə- ‘to fear’ ~ PIE *pelh₁- ‘to shake’ (which only gives ‘fearful’ in derivatives in Gothic and Slavic); *śalə ‘gut’ ~ PIE ? *ḱolH- ‘turn’ (which only gives ‘gut’ in Greek). These are though only marked as “probable”, not “clear” — is this basically a euphemism for “not that likely”?

I suppose this is by now enough comments for one day. I know that assembling and curating datasets this big is quite the task, and I could probably also spend a week more reading this in further detail. Hopefully I’ve already pointed out some productive directions for future improvement though. (And if you were thinking of otherwise releasing 3.0 just tomorrow: sure, don’t mind me, there will be time in the future too to improve things.)

Edit 2022-06-27: See also some brief responses from Outi Vesakoski (and further from me) at Twitter!

[1] Very relatively so: at triple rather than double or single digits of speakers.
[2] So far the biggest gap in philological coverage are probably the old Swedish “Biblical Sami” records, substantial already in the 18th century, but to my knowledge they have never been looked over in detail etymologically.
[3] Has been further etymologized as being maybe from Proto-Finnic *hosja ~ *hoosja ‘horsetail, Equisetum’ (traditionally used to make scrubs), which I don’t think has itself any etymology yet. By its phonological structure it obviously cannot be native Uralic as is. Inverting the semantic derivation though, an irregular (?) contraction from an agent noun *hosija < *hose/i-ja ‘sweeper, scrubber’ might be possible (cf. also Fi. hos-u- ‘to work carelessly, in a rush’). Or if this is, as UEW’s etymology would imply, really assibilated *hocija… a root that looks somewhat comparable to me is Samic–Mordvinic *šodə- ‘to let out, run out’ (maybe first derived to *šodə-j- > *hoci- ‘to throw/sweep things out’). A PU *čočV-, on the other hand, should not give Finnic *h- but *s-, via the affricate dissimilation seen also in e.g. *čečä ‘uncle’ > *ćečä > PF *setä.
[4] Worth noting, besides Aikio’s argument that cognates elsewhere in Samoyedic require a protoform with *ä-ə, is also that *türə would be expected to give Sk. **tir with a short vowel. tīr shows Helimski’s Law = Proto-Selkup vowel lengthening in Proto-Samoyedic *ə-stems, < PU *CVCCə stems and some *CV(C)CA stems (a relatively recent discovery from 2007).
[5] This does still leave Permic *ar ~ (Core) Mansi *ārmə (closed syllable per Pelym årəm with a short vowel), but the latter should clearly be analyzed a loan from the former; more specifically, from derived *arm as reflected in Udmurt. Permic *a has no well-established native source at all and even some more dubious cases only really point to some possible origin from *ä.
[6] “Us” being myself, Santeri Junttila, Sampsa Holopainen & Juha Kuokkala, plus original data assembly by Kallio.

Posted in Commentary, Links

A Finnic Family Tree

I was recently asked on Twitter about the history and subclassification of Finnic. [1] Whipping up a full-length discussion paper or even a polished nice-looking family tree would be more work than I can produce on short notice or on free time (and probably something that might warrant wider publication still), but since I actually do have several opinions about this, that are probably either scattered in several places or that I haven’t mentioned anywhere yet, here is a summary of my current thinking.

I’ve given datings of proto-languages and extinction dates only where I can pretend to have any sense of accuracy to them. Error ranges are at least ±100 years for the former, at least ±10 for most of the latter.

┐ Proto-Finnic (ca. 500 BCE, ? middle Daugava)
├─┐ Proto-South Estonian (ca. 500 CE, ? upper Gauja)
│ ├── † Leivu (northern Latvia; extinct 1988)
│ └─┐ Mainline South Estonian
│   ╞══ Mulgi South Estonian
│   ╞══ Tarto South Estonian
│   │   [basis of Old Literary South Estonian]
│   └── East South Estonian (Võro–Seto)
├───┐ Proto-Livonian (ca. 1000 CE, lower Daugava)
│   ├── † Salaca Livonian (northwest Latvia; extinct ca. 1870)
│   ├── † Riga Livonian (unattested, extinct in the 13th C?)
│   └─┐ Courland Livonian
│     ├── † Eastern Livonian
│     ├── † Central Livonian
│     └── (†) West Livonian
└─┐ Proto-Core Finnic (location?)
  ├─┐ Proto-Central Finnic (location?)
  │ ├─┐ Estonian proper
  │ │ ╞══ Insular Estonian
  │ │ ╞══ East Estonian dialects
  │ │ ╞══ West Estonian dialects
  │ │ ╞══ Central Estonian dialects
  │ │ ╘══ North Estonian proper
  │ │     [basis of Modern Standard Estonian]
  │ └─┐ Proto-Votic (inland Ingria)
  │   ├── † Eastern Votic (extinct 1976)
  │   ╞══ † Central Votic (extinct > 1950)
  │   ╞══ Lower Luga Votic
  │   └── † Krevinian (Southern Latvia; extinct ca. 1850)
  └─┐ Proto-North Finnic (ca. 0 BCE, ? coastal Estonia)
    ├─┐ Proto-Northwest Finnic (? coastal Estonia?)
    │ ├── Northeast Coastal Estonian
    │ ╞══ Taivassalo / Very Southwestern Finnish
    │ ├─┐ Southwesternish Finnish
    │ │ │ [main basis of Old Literary Finnish]
    │ │ ╞══ North SW dialects
    │ │ ╞══ South SW dialects
    │ │ ╞══ Western Uusimaa dialects
    │ │ ╘══ probably other dialects in the SW transitional zone
    │ └─┐ Mainline Finnish (ca. 200 CE, Kumo River)
    │   │ [main basis of Modern Standard Finnish]
    │   ╞══ Lower Satakunta dialects
    │   ╞═╤ West Upper Satakunta dialects
    │   │ └── Austrobothnian Finnish
    │   ╞══ Ostrobothnian dialect chain
    │   ├── Kemi Finnish
    │   ├─┐ Torne Valley Finnish
    │   │ ╞══ Lower Torne Valley dialects
    │   │ ╘══ Upper Torne Valley dialects
    │   │     [incl. Meänkieli & Kven]
    │   ├─┐ Kalix Valley Finnish
    │   │ ├── † Lower Kalix Valley Finnish (unattested)
    │   │ └── Jällivaara Finnish
    │   └─┐ Core Tavastian (ca. 300 CE)
    │     ╞══ East Upper Satakunta dialects
    │     ╞══ Heartland Tavastian dialects
    │     ╞═╤ South Tavastian dialects
    │     │ └── colloquial Helsinki Finnish
    │     └─┐ East Tavastian
    │       ╞══ Southeast Tavastian dialects
    │       └─┐ Northeast Tavastian
    │         ╞══ Päijät-Häme dialects
    │         └─┐ Karelid Finnic
    │           ╞══ Savo dialects
    │           ╞══ Karelian Isthmus / Southeast Finnish dialects
    │           ╞══ Ingrian
    │           └─┐ Old Karelian (ca. 700 CE, NW Ladoga)
    │             ╞══ Olonets Karelian
    │             ├── † Sortavala Karelian (unattested)
    │             │   [substratal to Sortavala Finnish]
    │             └─┐ Karelian proper
    │               ╞══ Viena / Northern dialects
    │               ╘═╦ Southern dialects
    │                 ╚══ Central Russian dialects
    │                     (Tver, Tikhvin, Valdai)
    └─┐ Ludian–Veps (ca. 600 CE, SE of Ladoga)
      ├── † Olonets Ludian (unattested)
      │     [substratal to Olonets Karelian]
      ╞══ North Ludian dialects
      ╞══ Central Ludian dialects
      ╞══ South Ludian dialects
      ╞═╗ North-Central Veps
      │ ╠══ Northern Veps dialects
      │ ╚══ Central Veps dialects
      ╞══ Southern Veps dialects
      └── † North Chudian (unattested)
          [substratal to some Northern Russian,
           in contact with Proto-Komi]

The South Estonian sub-tree here is the part that has been published the most recently, basically from Kallio (2021, 2018); though I’d like to see more detail on the suggested Tarto–VS group still.

Some other divergences of note from earlier Finnic family trees include:

  • No Coastal Finnic (Livonian + Core), contra Kallio. I will be arguing for this in detail in a future paper. Among the early branches, Core Finnic and Central Finnic seem to hold up better so far, though I’m open to the possibility that some North Estonian dialects may eventually prove to have some fairly deep archaisms to them too. North Finnic I have several suspicions about, but Ludian–Veps still has nowhere better to go in the tree than with my “Northwest Finnic”.
  • No East Central Finnic (East Estonian + Votic), contra Viitso. These are united only by some cases of õ, which I however consider to be archaisms already from common Central Finnic. [2] This also allows for (re)introducing a non-paraphyletic Estonian sensu stricto.
  • Paraphyletic Western Votic, directly following Kuznetsova, Muslimov & Markus (2015).
  • Paraphyletic Western Finnish and Tavastian Finnish, generalizing further from Kallio (2013). Purely by linguistic evidence, the traditional “Western Finnish” grouping would be about as well-supported as my “Mainline Finnish”, but settlement history to me seems to strongly favor the latter: the Karelid group can’t just drop out of nowhere, it needs to be derived from somewhere at the time in the early 1st millennium when there simply wasn’t any Finnic presence yet in eastern Finland (but parts of western Finland had already been Finnic-speaking for some centuries, with presumable incipient diversification). Archeology so far does not favor an independent expansion from the south; the river Kymi would look like a good route candidate for that at first, but it might have been simply too non-navigable with its several major rapids. Hence, Karelid Finnic must be nested not just within “Finnish”, as has been known already for long, but indeed within “Western Finnish”.
  • Polyphyletic Ostrobothnian Finnish. Some of these lineages may eventually prove to be offshoots of specific Western dialect groups further south, but current research really hasn’t even started that line of investigation (though see next item).
  • Austrobothnian (= my term for South Ostrobothnian) as a West Upper Satakunta offshoot specifically. This is a well-known fact of settlement history, but has some implications for analyzing what is areal and what is old inheritance across the Western Finnish dialect continuum that I don’t think have been fully appreciated in the past.
  • No Karelian–Veps group. This seems like a no-brainer to me: there are practically zero common innovations (some lexical evidence has been claimed but without ruling out common archaisms or loanwords) vs. quite abundant Finnish–Karelian = Northwest Finnic innovations, even beyond the Karelid group. Some more narrowly distributed, e.g. Ludian–Karelian, innovations exist, but their absence from Veps or eastern Finnish I think immediately shows them to be areal rather than genealogical.
  • Paraphyletic Ludian and perhaps Veps. The latter above all due to the fact that most innovations in Veps could be attributed to Russian influence or at least are downstream of changes due to this. Not tying down the assumption that Veps must be monophyletic seems like the safer bet so far.
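
For readers less used to the terminology, claims like “no Karelian–Veps group” can be read directly off a tree as a clade test. A minimal sketch on a heavily pruned fragment of the tree above; the parent table, the function names and the choice to treat ╞══ members as ordinary children are all my simplifications for illustration.

```python
# Child -> parent, for a pruned fragment of the tree (most nodes omitted).
PARENT = {
    "Proto-Northwest Finnic": "Proto-North Finnic",
    "Ludian–Veps": "Proto-North Finnic",
    "Mainline Finnish": "Proto-Northwest Finnic",
    "Core Tavastian": "Mainline Finnish",
    "East Tavastian": "Core Tavastian",
    "Northeast Tavastian": "East Tavastian",
    "Päijät-Häme dialects": "Northeast Tavastian",
    "Karelid Finnic": "Northeast Tavastian",
    "Savo dialects": "Karelid Finnic",
    "Old Karelian": "Karelid Finnic",
    "Karelian proper": "Old Karelian",
    "Southern Veps dialects": "Ludian–Veps",
}


def ancestors(node):
    """The node itself followed by its ancestors, nearest first."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def mrca(group):
    """Most recent common ancestor of a group of nodes."""
    members = list(group)
    common = set(ancestors(members[0]))
    for member in members[1:]:
        common &= set(ancestors(member))
    return next(a for a in ancestors(members[0]) if a in common)


def leaves_under(node):
    """All leaves of the fragment that descend from the node."""
    nodes = set(PARENT) | set(PARENT.values())
    leaves = nodes - set(PARENT.values())
    return {leaf for leaf in leaves if node in ancestors(leaf)}


def is_clade(group):
    """True if the group is exactly the set of leaves under its MRCA."""
    return leaves_under(mrca(group)) == set(group)


if __name__ == "__main__":
    print(is_clade({"Karelian proper", "Southern Veps dialects"}))
    print(is_clade({"Karelian proper", "Savo dialects"}))
```

A caveat on pruning: a group that fails the test in the fragment also fails it in the full tree, but a group that passes here may only do so because its other relatives (for the Karelid group e.g. Ingrian and Olonets Karelian) were left out of the table.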

I take no stance here on the still gradually ongoing debate on if the Kukkuzi dialect is Votic-with-Ingrian-superstratum or Ingrian-with-Votic-substratum or a mixed variety entirely. [3]

Last, don’t take the rather fine detail of Finnish dialects as meaning that they’re actually more different from each other than what we find within other groups — they’re just a) more numerous (even 100 years ago Finnish had 2× the speakers of Estonian, 60× the speakers of Karelian, 200× the speakers of Veps…) and b) better known to me. If I had been looking into e.g. Estonian dialectology in as much detail, I would probably have some opinions also on how to re-tool things around there.

[1] Yes, I am on Twitter as of the start of this year. Not explicitly announced on the blog before, though you may have noticed if you’ve checked my About page recently.
[2] One intriguing example is PF *kota : *koda- ‘house’, giving in my view early PCF *këta : *këða- > later PCF *këta : *kë.a-, whence Vt. kõta : kõa-; EEst. kõda : kõja-; NEst. koda : koja-. What is telling here is that Estonian -j- as a hiatus filler only seems to be regular after illabial vowels, thus showing that NEst. koda does not retain PF *o; it has instead undergone the development *kë-a > ko-a that also appears in cases like *këldajnën > *këllainë(n) > kollane ‘yellow’ (which has “primary” *ë < *e, not “secondary” *ë < *o; cf. Finnish keltainen).
[3] For general historical Fennistics purposes it’s in any case sufficient to know that any attestations in Kukkuzi but not in “normal” Votic can be always from Ingrian, be it by loaning or descent, i.e. not requiring reconstructing anything all the way to Core Finnic.

Posted in Commentary, Reconstruction

*-ətA adjectives in Mordvinic

Across Finnic and Samic, one of the more characteristic adjective endings is *-əta ~ *-ətä; yielding e.g. Finnish -ea ~ -eä, Estonian -e, Northern Sami -at. The Permic cognate *-i̮t is also at least relatively common. Because Of Reasons I have gone for a hunt for reflexes in Mordvinic, where no productive reflex survives. More specifically I’ve gone over Paasonen’s Mordwinisches Wörterbuch (a few more could be probably found in other sources). The scoop is as follows.

First, some cases well-known in the comparative literature. (Noticeably often these have exact equivalents in Finnic, or indeed specifically Finnish).

  • *kalgədə ‘hard’ (> Er. kalgodo, Mk. kalgəda) < WU *ka/ëlkəta > Fi. kalkea [1]
    (MWB unwarrantedly lists this as a derivative of *kalgə ‘sheaf, etc.’, which is rather < WU *këlkə ‘haulm’)
  • *śejəďə ‘thick’ (> Er. śejeďe, Mk. śiďä) < WU *śikətä > Fi. sikeä ‘sound (of sleep)’
    (the Moksha form miscited in Uralonet as śäjiďä — a real form, but rather from some Erzya dialect that has *e > ä)
  • *taŋgədə ‘firm, stiff’ (> Mk. taŋgəda) < WU *taŋkəta > Fi. tankea ‘id.’
  • *valdə ‘light’ (> Er. valdo, Mk. valda) < WU *wëləta > Fi. vaalea ‘id.’
    (in UEW / Uralonet, Mordvinic incorrectly under the longer variant *wëlkəta)
  • *vijəďə ‘straight’ (> Er. vijeďe, Mk. viďä) < WU *wojkəta > Fi. oikea, NS vuoigat etc. ‘right’

We see here reflexes as *-ədə / *-əďə after a consonant cluster, syncopated *-də after a PU sonorant (but apparently not after single *k). Moksha śiďä, viďä are probably due to secondary post-Proto-Mordvinic syncope (unclear to me if with fusion *jď > ď or, as might be suggested with *ej > i in the former, with vocalization of the glide). Not many other cases follow this exactly, though. I find only one other clear example + one possible example in *-ədə:

  • ? *ľifčədə ‘loose’ > Mk. ľifčəda; from a stem common with e.g. *ľifčańa ‘pliable’, Mk. ľifčəm- ‘to relax’. Attested as both an ə-stem ľifčədə- and an a-stem ľifčəda- though, hard to tell which might be primary.
  • *vačədə ‘hungry’ > Er. vačodo, Mk. vačəda; from *vačə ‘hunger; hungry’

For *-də after CVR-, I find two more examples, and also two nouns that might derive from former adjectives:

  • Er. boďo ‘obese’. Perhaps distorted from *vojdo, and thus a derivative from *vaj ‘butter, fat’ (which in Erzya develops as > *voj > oj)? Still would have expected *-ďə, but there’s no possible soundlawful origin for an Erzya word ending in -ďo anyway…
  • *naŕďə ‘firm, tough’ > Er. naŕďe, Mk. naŕďä (no base root that I can identify)
    (update: or maybe from PU *ńërə ‘cartilage’??)
  • *śardə ‘elk, reindeer, deer’ > Er. śardo, Mk. śarda. Has clear cognates at least in Mari (*šårδə) and Khanty (*sūrtāj; Northern Mansi surti probably a loan from this), with the PU form usually reconstructed as *śarta. However I suspect this was originally rather an adjective *śarwəta ‘horned’ ← *śarwə ‘horn’. Loss of *-w- in clusters may have been early enough in Mordvinic and Mari to allow common syncope from *śarəda to *śarda. [2]
  • Mk. šoľďä ‘crazy person, crybaby’. Could this be from a common root with Finnic *hullu ‘crazy’ (both pointing to earlier *šul-)? The morphology of the Finnic word remains obscure though, and the palatalization in Moksha would be unexpected; maybe suggests something like *šuljəta. Alternately, maybe ‘crybaby’ is more original, and the Mk. word is instead from a common root with Erzya čoľeďe- ‘to chirp, trill’? Either way this would probably have been an original adjective.

There are however several adjectives ending in *-adə, derived mostly from stems already ending in *-a-. This contrasts with the suffix’s behavior in Finnic and Samic, where it always carries a 2nd-syllable *ə even when attaching to *a-stems (e.g. Fi. lauha ~ lauhea ‘mild (of weather)’, notka ~ notkea ‘pliable’). I suppose the widespread Proto-Mordvinic reduction of 2nd-syllable vocalism led to a reanalysis of *-ədə as just *-də, and then later on the rise of new cases attaching to different stems.

  • *kaladə ‘broken’ > Er. kalado, Mk. kalada; from a stem *kala- common with e.g. *kaladə- ‘to break (intr.)’, *kalaftə- ‘to break (tr.)’
  • *komadə ‘turned over’ > Er. komado, Mk. komada; from *koma- ‘to turn over’ (< PU *kuma-)
  • *naksadə ‘rotten’ > Er. naksado, Mk. naksada; from a stem *naksa- common with e.g. *naksaftə- ‘to let rot’, *naksalgadə- ‘to begin to rot’
  • *ozadə ‘sitting’ > Er. ozado, Mk. ozada; from *oza- ‘to sit’
  • *panžadə ‘opened’ > Er. panžado, Mk. panžada; from *panžə- (!) ‘to open’ (< PU *panča-)
  • *śťadə ‘straight, standing’ > Er. śťado, Mk. śťada; from *śťa- ‘to stand’
  • *štadə ‘naked’ > Er. štado, Mk. štada; from *šta- ‘to be exposed, cold’
  • *tajadə ‘stupid, grumpy’ > Er. tajado; from a stem *taja- common with e.g. *tajardə- ‘to be timid, dejected’, *tajaskadə- ‘to become grumpy’

Itkonen (1963, CIFU 1) has proposed to consider a chunk of these to be instead primarily adverbs, formed with the homophonic ablative suffix *-də, but I’m not sure if this is a good analysis: Mordvinic infinitives and participles are generally marked, not formed by appending case endings to a bare verb stem. Also, I would analyze *kala, *naksa, *taja to be primarily noun roots ‘brokenness’, ‘rottenness’, ‘unsatisfiedness’.

Still more interestingly, I can also find adjectives where the final vowel looks to have escaped vowel reduction.

  • Mk. aluda ‘underlying; under’. Another adverb/adjective, seemingly pleonastic from an unattested *aləŋ > *alu ‘underlying, undery’ (maybe ousted by the homophonic lative adverb: Er. alov, aloŋ, Mk. alu ‘(to) under’).
  • Er. čando, čonda ‘pricey; price’. Probably not a cognate of Fi. hinta ‘price’ as traditionally compared. MWB hesitantly but I think more likely correctly suggests a connection with Er. Mk. čana ‘price’, which is ← Ru. цена.
  • *pärda > Er. ala-berda, Mk. ala-pärda ‘misshapen’ (“under-pärda“). Probably still an independent word in PMo., given how Erzya and Moksha differ in if they adopt compound-medial stop voicing (“rendaku”, we might call it).
  • *säŕďa ‘fragile of old age’ > Er. seŕďa, Mk. śäŕďä; evidently from a common root with *säŕəďə- ‘to hurt, be sick’ (? < PU *särä-, though intriguing resemblance also with Finnic *särke- ‘to hurt’).
  • *šopəda ‘dark; darkness’ > Er. čopoda, čobda, Mk. šobda, šovda; from *šop ‘in a day, for a day’
  • *topəda ‘dark (of color), maroon’ > Er. topoda, Mk. tobda; from *topə ‘full’, the meaning apparently thru expressions like *topəda_seń ‘full blue’ = ‘dark blue’.
  • #ťožda ‘light’ > Er. čožda ~ Mk. ťožďä (no base root that I can identify). Reconstruction difficult due to several irregularities. Is Er. č- maybe by contamination with čova ‘thin, fine’?

In three of these, we find a similar environment to where PU 2nd-syllable *a survives: after a 1st-syllable *o < PU *u. Maybe the same would have originally allowed even retention of a 3rd syllable *a? — By contrast the disharmonic *pärda, *säŕďa pretty much have to be Mordvinic-internal formations. Could an adjective suffix *-da have been generalized / extracted just from cases like *topəda?

No further answers today; just a look at what other etymological candidates we might have in Mordvinic for residues of this ending.

[1] Close to a ghost word, though; kalkee ‘poor, low-quality’ is only known from one Finnish dialect. This can only really link to ‘hard’ thru kalki ‘poor, unlucky’ (“having a hard time”) from one early dictionary. The reported “dialect variant” kalkkea ‘loud, talkative, lively’ seems likely to be unrelated and instead from the verb kalkkaa ‘to ring (bell), make loud noise’ (many similar derivatives from this, also e.g. kalkas ‘lively’, kalkatti ‘blabbermouth’). — Estonian kalk, kalge ‘hard, brittle’ is a more reliable cognate in any case at least.
[2] In Mo. loss of *w probably postdates medial voicing though: by a few examples, *-tw- *-sw- seem to yield PMo. *-t- *-s-, not **-d- **-z- (at least *latə ‘shelter, roof’ ~ Finnic *latva ‘canopy’, *kas- ‘to grow’ ~ Finnic *kasva-).

Posted in Etymology

Will Someone Please Reconstruct Proto-Kurdish Already

Some things about comparative linguistics you might just take for granted in your own little corner of a particular language family, until you start looking at how they do things in others. In Uralic studies, we’ve known for 200+ years, and put into explicit practice since 150+ years ago, that progress requires documenting unwritten language varieties (just comparing literary Hungarian / Finnish / Estonian / Sami runs out of steam fast [1]). For 120+ years, even, that it’s additionally good practice to get detailed interdialectal comparison of such languages started sooner rather than later, not just rely on one well-known doculect.

The big dog of our Eurasian linguistic region, Indo-European studies, has of course an enviable access to a good bunch of attested Old Indians, Old Church Slavonics and Old High Germans, which are a lot more directly comparable with each other. But you’d think the field would have somewhere during the 20th century understood at least that, yes, newer-attested languages will have contributions to make to the overall picture too. Remember e.g. how Nuristani, a little bunch of languages up in the mountains of Afghanistan, turned out to have the key evidence for affricate reflexes of *ḱ *ǵʰ *ǵ in Proto-Indo-Iranian, preserved several millennia longer than in Avestan or Sanskrit?

Where Slavistics, Baltistics, Germanistics, Armenistics, Romanistics have all still gotten their general comparative programmes rolling pretty well, Indo-Iranian keeps being a rock that drags behind pretty badly. Considering extra-scientific causes, this is not a giant surprise / is clearly in some amount thanks to these other sub-fields’ status as National Sciences in the various nation-states of Europe. Still, this would not have to be the case, it’s not like Celtistics has been left in the dust. Comparative linguistics also seems like something with sufficiently little direct political valence that it should be doable enough even e.g. under Iran’s current theocratic administration, let alone by the sizable and somewhat intellectual-leaning Iranian diaspora(s). Indian fans of the Out-of-India theory also demonstrate an existing if unorganized interest in linguistic history.

But indeed. Indo-Iranian is not just any random branch of Indo-European; it is today the largest branch (e.g. Glottolog counts 319 varieties, out of 581 Indo-European varieties altogether), and also the only one to preserve all of its known main branches since antiquity. Reflects more branches today than in history, really: already Nuristani is nowhere to be seen until the 19th century. By contrast, in Europe East Germanic, West Baltic, Continental Celtic, Aeolian Greek etc. are long gone. If anywhere in IE, it is in Indo-Iranian that we should expect to be able to reach quite deep time depths by collecting data from modern varieties and applying comparative reconstruction efforts as usual. Yet this generally seems to have not been done, and approximations derived mainly from Sanskrit and Avestan end up making do as Proto-Indic, Proto-Iranian, and the main fodder for Proto-Indo-Iranian.

By now there is clear evidence that this is insufficient. One informative case from recent years is Martin Kümmel’s observation that “secondary” word-initial h- in several Iranian varieties — at least Khotanese and many western Iranian varieties including Middle to New Persian — actually seems to be a retention of PIE laryngeals (especially *h₂)! This may not have been completely out of the blue. Laryngeal hiatus in Vedic (*aHa > *a.a > ā in some cases still parsing as two syllables) has been known since the early decades of laryngeal theory, and Cheung’s Etymological Dictionary of the Iranian Verb from 2007 takes an extremely cautionary approach of projecting all PIE laryngeals into Proto-Iranian, including an implausible-looking contrast between this *H and secondary Iranian *h < *s (and implausible-looking clusters like *Hhauš- ‘to dry out’). [2] Regardless we do see that it is incorrect methodology to treat any divergences from attested Old Iranian as innovations, and that this will fail to connect archaisms in marginal new Indo-Iranian varieties back with the wider programme of Indo-European reconstruction. The same has been very patiently explained by Kümmel too, in a 2016 paper “Is ancient old and modern new? Fallacies of attestation and reconstruction“.


I’ve picked Kurdish here as a semi-random example of a modern Iranian language group that probably deserves closer investigation in this fashion, though its western peripheral location might indeed make it a more likely location for archaisms than smaller languages more fully encircled by Persian. It quite clearly shares at least the propensity of retaining *h₂. Even just looking over the lexicon of standard Kurmanji as listed at Wikipedia readily turns up cases like hêk ‘egg’ << PII *Hāwyam < PIE *h₂ōwyom; hirç ‘bear’ << PII *Hr̥ćšas < PIE *h₂r̥tḱos (~ Middle Persian xāyag, xirs). However also cases like hesp ‘horse’, where some kind of “aspiration throwback” could be considered (*asp- > *esʰp > hesp).

The outlines of Kurdish historical phonology are known, of course. Relatively detailed discussion is readily found in sources like Asarian & Livshits (1994), or at Iranica Online. What seems to be missing from these accounts, however, is any real integration of variation among the Kurdish “dialects” (by now widely thought to comprise at least 2–3 languages). They also spend much effort on lamenting difficulties in telling what might be native Kurdish words and what loanwords from Persian or Zazaki or some other neighboring Iranian variety; same as in many other studies on individual western Iranian varieties. But we — at least e.g. us Uralicists — know quite well that attention paid to dialectology is often able to resolve such issues! Maybe some Kurdish variety would turn out to display a form different from the others that would then need to be considered the native one; or to display a different loanword substitution, pointing in favor of relatively recent loaning, whether from Persian or not. Dialect differences could also help with relative chronology, in telling late areal changes (and across Iranian these are many) apart from what really are early Proto-Kurdish innovations. The retained laryngeals, too, are noted by Kümmel to not be entirely systematic. Conceivably it could be the case that e.g. Kurdish only gets them through Persian, at some older or newer date. Or inversely, maybe Kurdish might be in its native vocabulary more systematic about this than Persian is. No way to tell before looking.

Let me be clear here on the proposal. A reconstruction of e.g. Proto-Kurdish should not rely on just some handful of already available descriptions / dictionaries (though I’m sure their comparison, too, would already add up to several results), nor aim just for identifying phonological variation. The goal of such a project should be primarily lexicogeographic: to have detailed enough dialectological picture to be able to see the directions of vocabulary spread, to tell local innovations apart from local archaisms. In Uralic studies, when putting together an understanding of, or at least the data for understanding, Proto-Samic, Proto-Mari, Proto-Permic, Proto-Mansi, Proto-Khanty, Proto-Selkup, etc., we have routinely based this on low double digit numbers of varieties, each documented at least to low quadruple digits of vocabulary. And these are all smallish language groups, spoken by some tens or hundreds of thousands of people. The Kurdish languages have tens of millions of speakers altogether. Even if extensive fieldwork in Kurdistan were to look too dangerous or politically complex right now, already connecting with the diaspora communities worldwide should be easily able to provide data on some dozens of varieties.

I do not pretend that this would be a small or quick task (it is clearly beyond what I or anyone could accomplish as just one unattached researcher), but it seems like a very doable task, and likely fruitful, not just for the circles of Kurdish studies or Iranian studies, but for Indo-European studies altogether. And by no means is this a gigantic endeavor either. This could be all done in under a decade by one research group, if there was first of all the will for it to happen (be funded and prioritized).

Closing up this plea, let me also suggest one other hypothesis that could be up to something. In existing overviews, Kurdish is reported to “sometimes” show PIr. *x > kʰ, e.g. *xara- > /kʰer/ ‘mule’. The facts that this development (1) fails to be regular and (2) seems to be a regression (alleged PIr. free-standing *x is < PII *kʰ) should already suggest that it is perhaps an archaism rather than an innovation. The same might go for /tʰ/ from “PIr. *θ” < PII *tʰ, reported at least in *θaiwar- > /tʰiː/ ‘brother-in-law’. This interpretation is not airtight off the cuff by any means: both Armenian and Semitic influence could have encouraged secondary introduction of aspirated stops. But, interestingly, on a brief look-around I do not find cases where Persian /x/ ~ Kurdish /kʰ/ would derive from a secondary *x that continues *h₂, only cases with PII *kʰ. From the former, the result seems to be /h/, as above in e.g. ‘egg’. So did Kurdish regularly shift *x > /h/, while never shifting *kʰ? Again, detailed dialect evidence could perhaps swing this either way. One of these decades we will hopefully know better.

[1] Yes, written Sami already existed 200 years ago, indeed since the mid-17th century. The first variety to have been standardized to some practical extent was so-called “Old Swedish Sami”, a clergy-designed form from the mid-18th, based most closely on Ume Sami though aimed as a general western interdialectal standard. Standard Northern Sami took its first steps around the same time as well.
[2] Omniretained laryngeals are furthermore trouble also for e.g. RUKI. If we have *s > *š in e.g. *buHs- > *buHš ‘to endeavor’, as if triggered by a preceding *H and not *ū, why not also in e.g. *yaHs- > *yaHh- ‘to girdle’? Without the assumption of universal laryngeal preservation, though, this could be easily resolved by assuming *eH >> *ā as an independent vocalization from *iH/*uH > *ī/*ū. Note also a further but welcome corollary: if we do go with thinking that RUKI in *buHš- has been triggered by a long-retained *H, then also *Hhauš- will have to be simplified to just *hauš-, indeed already to a pre-RUKI late PIE *sews- < early PIE *h₂sews-.

Posted in Commentary, Methodology

Phonological Renormalization

A small definition of a concept.

Across the dialectology of various languages we very often find almost the same segment inventory despite various innovations. I call this phenomenon “phonological renormalization”. It seems somewhat mysterious at first: it is hard to see any way how a language’s status as a part of a large dialect continuum could outright prevent innovative phonological features from arising. However, it does seem to me that there could be an easy way out — by assuming a slight diachronic detour: suppose that new innovations do arise, they simply then afterwards change back into another segment already known in the language’s other dialects. A sufficiently homogeneous “sociophonetic environment” could probably motivate novel phonological segments to often merge with pre-existing close matches. Specifically, learners / innovating speakers faced with the prospects of (1) adopting an innovative form and (2) adopting an innovative segment might prefer the former, but still avoid or fail to manage the latter.

This kind of hammer-down-the-nail development can be sometimes directly attested. Both stages are historically recorded e.g. for the fate of *ð [1] in some western dialects of Finnish: first flapping, to create a new phoneme /ɾ/, then merging with the usual trilled rhotic /r/. Or, in Eastern dialects of Finnish: early loss of medial *-n- in the allegro forms of the 1PS and 2PS pronouns minä, sinä evidently first created a rare transient diphthong /iä/, recorded only in a narrow southwestern corridor from Mikkeli to Hamina; but elsewhere renormalized to /ie/. Even in Mikkeli it has been fortified by the late Savonian–Karelian diphthongization of *ää to /iä/. I also wonder if the Western Finnish allegro forms actually continue older miä, siä with different renormalization (but before the lowering /ie/ > /iä/).
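
As a toy illustration of the flap example above, renormalization can be sketched as a nearest-neighbour merger: a novel segment is replaced by whichever segment of the surrounding dialects' inventory shares the most features with it. Everything in this sketch is an assumption made for illustration: the feature sets are ad hoc labels rather than a worked-out feature theory, and the function name `renormalize` and the distance measure (size of the symmetric difference) are hypothetical choices, not an established model.

```python
# Toy model: a novel segment merges with the closest segment that is
# already known from the language's other dialects.
# The feature labels below are illustrative assumptions only.
FEATURES = {
    "r": {"alveolar", "voiced", "sonorant", "rhotic", "trill"},
    "ɾ": {"alveolar", "voiced", "sonorant", "rhotic", "tap"},
    "l": {"alveolar", "voiced", "sonorant", "lateral"},
    "d": {"alveolar", "voiced", "stop"},
    "t": {"alveolar", "stop"},
}


def renormalize(novel, areal_inventory, features=FEATURES):
    """Return the segment of areal_inventory closest to novel.

    A segment already present in the areal inventory is left alone.
    Distance is the number of feature labels not shared by the two segments.
    """
    if novel in areal_inventory:
        return novel
    return min(
        areal_inventory,
        key=lambda seg: len(features[novel] ^ features[seg]),
    )


# The new flap from *ð is not in the areal inventory, so it merges
# with its closest match, the trilled rhotic.
print(renormalize("ɾ", ["t", "d", "l", "r"]))  # prints "r"
```

With these particular labels the flap lands on /r/ (2 unshared labels) rather than /l/ (3) or /d/ (4). A different choice of features could just as well favor /l/, so the sketch only shows the mechanism, not why a given dialect area picks one target over another.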

My aim is not to list extensive amounts of evidence here. Renormalization is pretty easy to notice once you start looking, maybe especially among consonants and/or small phonological inventories? But just off the top of my head, a few other conspicuous examples that come to mind include the following:

  • Glide epenthesis in Finnic and Samic. Both language groups generally turn *w into labiodental /v/ = usually an approximant [ʋ]. Whenever any kind of a labial glide develops later on, e.g. before word-initial /oː/, or in hiatus following any labial vowels, or from lenited *b, this likewise tends to produce /v/, not [w]. Occasionally a [w] can be attested, e.g. in Standard Finnish in cases like *kauɣan > kauan [kau.an ~ kauwan] ‘for long’; for which maybe most dialects however show instead kauvan.
  • Preaspiration *TT > *ʰTT in Samic. This affects first the native Uralic geminates already in Proto-Samic; slightly after that, across western Sami, also secondary geminates introduced by consonant gradation.
  • Lenition-plus-fronting *ɣ > /v ~ j/ in Mordvinic (/v/ in back vowel environments, /j/ in front). This affects first *ɣ from PU medial *k across the whole group, slightly after that also *ɣ from the lenition of *ŋ, across all of Moksha and most of Erzya.

At least the second has been even proposed to actually represent a single innovation, which would postdate *Cˑ > *Cː in western Sami but predate it in eastern. An equivalent scenario, with old *w and *ɣ preserved until the rise of secondary cases, could be sketched from the other two too. But perhaps the phenomenon of renormalization should be a sufficient explanation.

More generally, I suspect this phenomenon could perhaps end up accounting for much of what is normally called “cyclicity” of phonological rules. In a language that e.g. aspirates its initial stops (say, English) it’s not that a newly born /t/ from a source that does not already have aspiration (say, from /θ/) would have to be instantly aspirated, since we do sometimes find allophonic contrasts phonemicizing in this fashion. But if this did create a three-fold contrast /tʰ/ : /t/ : /d/, I predict that this should prove unstable and be quickly followed by an additional merger [t] > [tʰ]. After this the innovation could also no longer spread to other speakers as a “phonetic” change [θ] > [t], only in a purely phonological form /θ/ > /t/ [tʰ : t], potentially quickly leaving any small original areas of /tʰ/ : /t/ in obscurity.

And I wonder further… other kinds of reiterated sound changes could find similar explanations for them too. Juliette Blevins, for example, has a recent paper observing that Austronesian *q has disproportionately few reflexes that are actually /q/. She proposes that this should be taken as evidence that the phonological stability of uvulars is conditional on a language’s vowel system, which might not be all wrong. (Cf. some further comments from me @ Tumblr.) But could it be additionally the case that an area having many languages with /ʔ/ creates a pressure for other languages to “normalize” (hardly “re-“) even an inherited, native *q into /ʔ/…? Or, indeed, even a *k, thus setting the stage for the famously common-in-Oceanic chainshift *t *k > /k ʔ/?

Of course though, this goes both ways and it’s also possible that many cases could be accounted for simply by internal phonological instability. [2] Already in the above examples: e.g. a contrast [w] : [ʋ] would be itself pretty rare, ditto the western Finnish system with all three of /ɾ r rː/. Still, at least the specific direction in which these systems seem to collapse does look to be largely determined areally, and e.g. no cases of the chainshift *r *rː > /ɾ r/, known from places like Ibero-Romance and Albanian, have been described from any western Fi. dialect.

[1] Reminder for the map reader: Kettunen’s ð is usual Finno-Ugric transcription for the alveolar tap = IPA /ɾ/, while δ is the voiced dental spirant = IPA /ð/.
[2] Reminder thanks to my wife Sara Carrier-Bordeleau.

Posted in Methodology

Revisiting Setälä’s *pk

In 1907, E. N. Setälä published one of his last comparative linguistic works: [1] “Finnisch-ugrisches pk (~ βk)” (in FUF 6; nominally dated to 1906), on a minor addition to the cluster canon of Proto-Finno-Ugric. This was a follow-up to some discussion in early 1907 in Virittäjä by Paasonen and Setälä. [2] The idea has since then gone without much attention, either for or against. At least one of the proposed comparisons, supported also by Paasonen — Finnic *tukka ‘hair’ ~ Mari tupka, təpka (*tŭpka) ‘tuft, bunch’ — survives as late as Collinder’s Fenno-Ugric Vocabulary (1955, 1977: 63) and Comparative Grammar (1960: 87–88). Even this last case is, however, quietly dropped in later references, I think starting with Suomen kielen etymologinen sanakirja (tukka in vol. 5 from 1975) and absent also in the UEW. A look at the original work reveals that also cognates from Komi were proposed by Setälä: tup-jura ‘tuft-haired’, tup-jur ‘owl’ (= “tup-head”), tupka ‘owl’. Its removal does make sense, as by Collinder’s time it was already known that Komi /u/ normally does not correspond with Finnic *u ~ Mari *ŭ < PU *u (we would rather expect /ɨ/).

A cluster occurring only in one word could be surely deemed fairly uncertain, and other etymological directions seem to exist also for both the Finnic and Mari words. Before looking more into these though: what of the rest of Setälä’s data? He presents no less than 9 examples in his articles, which would be already more than there are examples of some nonetheless generally accepted PU clusters. I summarize the data below in a table (reordered, glossing simplified somewhat, some variant forms omitted):

| gloss | Finnic | Samic | Mordv. | Mari | Permic | Hung. |
|---|---|---|---|---|---|---|
| ‘to beat, chop’ | *hakkat- | *cōvkkē- | *čaka-, *čuka- | | *ćapkɨ- | csap- |
| ‘(to) kiss’ | Fi. suukko, SE tsiuku | N cuvkit ‘to smack’ | | | *ćup, *ćupked- | csók |
| | *hukka ‘loss’ | (S–N *hāvkkë- ‘to suffocate’) | | | K. /šupkɨ-/ ‘to throw’ | |
| ‘to block’ | *tukkë- | Lu–I *tëvkkë- | | | K. /tupkɨ-/ | |
| ‘to drip’ | Fi. tiukku- | | | | Ud. /ťopkal-/, /ťopkat-/ | |
| ‘to beat’ (of heart) | Fi. tykki- | | Er. tykno- | | K. /ťopkɨ-/ | |
| | *öökkät- ‘to vomit’ | | | | K. /ɨpkɨ-/ ‘to sigh’ | |
| ‘hair, tuft’ | *tukka | | | *tŭpka | K. /tup/ | |
| | *kokka ‘hoe’ | | | kopka ‘plough’ | | |

The consonant center representation indeed looks fairly regular, especially Finnic *kk and Permic medial *-pk-. Reflexes elsewhere are more scanty, and in particular no Ob-Ugric data appears at all. Unfortunately, even besides this we have several reasons right off the cuff to suspect that these are not reliable etymologies.

  • No regular reflex is established for Hungarian. We have instead one case of p, one of k.
  • An abundance of onomatopoeia / ideophones or at least meanings susceptible to this kind of origin: ‘beat’, ‘kiss’, ‘drip’, ‘vomit’. Many would have parallel variants, e.g. Fi. sykkiä besides tykkiä.
  • Poor within-branch distribution is common: we have just Finnish in two of the Finnic cases, just Northern Sami in one of the Samic cases, and just Komi in five and just Udmurt in one of the Permic cases. Some could be supplemented by newer data though, e.g. Moksha does seem to have /təkna-/ ‘to beat (heart)’, and Komi /ťopkɨ-/ ‘to drip’.
  • Some lax semantics. In the 3rd I have no idea what the basis for the comparison between Finnic and Komi is supposed to be. The ninth is not very promising at all either: a bit off to begin with, and per the more detailed data of Moisio & Saarinen, kopka does not mean simply ‘plough’, but rather ‘flat center part of the plough, where the ploughshares are attached’, rather further away from ‘hoe’.
  • Onset mismatches; at least S. *c- and Mo. *č- (suggests PU *č) versus Permic *ć- and Hu. cs- (suggests PU *ć) in the first and second; Fi. t- ~ Permic /ť-/ in the fifth and sixth. General Samic /h-/ in the 3rd is also strictly a loanword consonant, and Setälä does proceed to propose borrowing from Finnic, but still before his proposed assimilation *pk > *kk.

More trouble still comes indirectly from finer details. For one, Permic morphophonology: though consonant clusters often simplify to their first member word-finally, there are no known cases with an alternation /-p/ : /-pk-/ as would be predicted to exist from this (including not in *ćup ‘kiss’). For two, the Mari stem structure CVCCA looks suspicious: 2nd syllable vowels are usually lost in nouns, and even when they do survive, it’s usually as Proto-Mari *-ə, not as *-a. Same in a few cases in Finnic: overheavy syllable structures like *suukko, *tiukku-, *öökkä-t- are not typical for native vocabulary. And even Mordvinic: unpalatalized /t/ + front vowel /i/ (with an allophone [ɨ] that Setälä notates as y) has no known native origin. So altogether a bunch of this data does not even look native.

Even after these observations, some basis for *pk could be perhaps still salvaged. But the death knell I think is the near-complete absence of any regularity in first-syllable vowels. Only one of the eight load-bearing Finnic / Permic comparisons has good parallels: uu ~ *u, regular from PU *ow. A few cases of y ~ K. /o/ are known too, but these have a conditional explanation: assimilation *e-ü > *ü-ü early on in Proto-Finnic. [3] Many other correspondences like Finnic *u ~ Samic *ā are also firmly irregular. Hence I will be happy to think that, yes, this article and its etymologies were in error and no cluster **pk is to be reconstructed for P(F)U.

What then of the existence of a cluster /pk/ in Mari and Permic? I think this is explainable, just by morphology rather than phonology. In Mari these should be clearly considered derivatives *tŭp-ka, *top-ka, with a reflex of the common PU diminutive suffix *-kka. This source of /pk/ is already clearly evident in other cases, e.g. lap : lapka ‘low(-lying)’, šapə : šapka ‘faded’. In Permic, then, note first that /pk/ is primarily attested in verbs. I would similarly segment here roots ending in /-p/, plus the PU momentane suffix *-kə-. This is not generally productive in Permic, but traces of it have already been identified in various cases (UEW even derives *ćapkɨ- from its *ćappɜ- ‘to hit’… *a > *a remains irregular though). This seems clear at least for ‘to drip’, where even /ťop/ ‘drop’ has been attested. The involved word roots, as mentioned above, do seem to be largely simply onomatopoetic.

One more Uralic variety is also known to have /pk/: Southern Sami, among this data only in hapkedh ‘to choke’, but other cases exist too. In Setälä’s view, /pk/ ~ /vhk/ (< *vkk) would be different generalized grades of an old alternation pattern *pk ~ *βk, but no direct evidence whatsoever exists of such an alternation. I wonder if a phonological solution could be still sought: *vkk > /pk/ might be an old regular sound change in SS.

One clear loanword, SS haapkie ‘hawk’, is alas not evidence for such a change. Other Samic reflexes like Lule hábak point to loaning already from Proto-Scandinavian *habukaz (→ PS *hāpëkkē + later syncope in SS), [4] not from attested Old Norse haukr (which could have yielded PS **hāvkkē). However, I suppose that also western Samic *hāvkkë- is still a loanword from Finnic; the source is just not Setälä’s *hukku- ‘to disappear, drown’, but rather *haukki- ‘to gasp for breath’ (+ other meanings). This has been attested from most of Eastern Finnic, e.g. Karelian haukkie, also dialectal Finnish haukkia; its standard Finnish variant haukkoa seems to be actually more narrowly distributed altogether (and thus younger?).

Still on the other hand, a development *pk > *vkk or maybe straight to /vhk/ would make more sense within the general dialectology of Samic, where we also have innovations like *šk > /jhk/ across all western varieties. [5] A likely intermediate looks to be *fk, which is actually the normal Kola Sami reflex of *vkk. Lehtiranta’s Proto-Samic reconstruction already takes this stance, giving PS *cōpkë- for a similar correspondence in SS tsuopkenidh ‘to break (intr.)’ ~ other Samic *-vkk-, e.g. NS cuovkut ‘to break (tr.)’. This would also allow supposing that attested cases of /vk/ in Southern Sami do continue PS *vkk and are not newer loans from other Sami varieties: from Lehtiranta we have jaavk-udh ‘to appear’ ← *jāvkkë- ‘to disappear’, raavkedh < *rāvkkë- ‘to demand (back)’. But then we face again the question of explaining the origin of SS /pk/. For *cōpkë- ‘to break’, the same approach as in Permic is perhaps not impossible: is it also a relict *-kə-momentane from an onomatopoetic root *cōp- < *čap(ə)-? Alas for hapkedh this will not readily work. Still for that matter it also shows short /a/ which does not match the cognates I’ve proposed to reflect Finnic ⁽*⁾haukki-… Extremely speculatively I could entertain an idea of this to be instead from a PS *θëp(pë)-, as a cognate of Finnish–Karelian *tüppe-htü- ‘to be extinguished, out of breath’ [6] that has indeed been amended with *-kə-; i.e. pseudo-PU *ďüppə-kə-?! Usually though this Finnic verb has been considered a parallel derivative to *tüpp-i- ‘to block, close’ which furthermore also has a known Samic cognate *tëppë- ‘id.’ > SS dahpedh (with /t-/, not /h-/ < *θ- < *ď-). I am not sure if the existence of words like Es. läppama ‘to choke’, NS lahppasit ‘to be out of breath’, Mordv. *ľäpija- ‘to choke’ is worth anything: they don’t correspond well with each other, but they could suggest an old ideophone of lateral + *pp for ‘to choke’, and my *ďüppə- could also fit under this pattern if *ď- had been originally lateral. But for now this is at best a stretch.


This post was originally inspired by some observations on a possible different etymological origin of one of the involved words… it would be, by now, however an entirely different tangent, and I may return to that topic instead later.

[0] In case this post seems like an excessive amount of effort to spend on forgotten crappy etymologies from 115 years ago, cf. further my older discussion of “anti-etymologies”. It is very possible that the poorness of these comparisons would not be apparent to some people happening upon Setälä’s work! They also have given me an opportunity to talk a little about some other topics that have been on my mind, such as /-kɨ-/ as a Permic verb suffix.
[1] In his later years he would be much more involved instead in the politics of newly independent Finland.
[2] “Alkuperäisestä -pk-sta on suomessa tullut -kk-”; “Alkuperäistä -pk-ta ja sen heikkoa astetta edustaa suomessa -kk- ja -uk-.” (Neither currently available online, but perhaps in the future.) Setälä’s article also reports that the editing of FUF 6 had been finished, but the issue had not yet gone to print, by the time Vir. 11/1 appeared in late February 1907.
— TBH, to me it would seem like an amazing coincidence that both scholars had been planning to publish on the same minor sound change at almost exactly the same time. Since in our time it is known that Setälä in his later years had a track record of stealing discoveries from other scholars, I do have to wonder if he is here too trying to claim priority from Paasonen on the three comparisons he advances (those of tukka, tukkia, kokka) by sneaking a small article in at the last minute into his own journal. He did clearly come up with the idea of the correspondence F *-kk- ~ P *-pk- though: the comparison of suukko ~ cuvkit ~ K. ćupköd- appears already in his 1896 article on consonant gradation (SUSA 14).
[3] PF *lülü ~ K. /lol-/ ‘hard heartwood’; PF *süntü- ‘to be born’ ~ K. /sod-/ ‘to multiply’; PF *süttü- ‘to be ignited’ ~ PP *sɔtɨ- ‘to burn’; see recently Aikio (2021). To be fair, in the cluster of tykki- we do have Fi. tykyttää with cognates also in Karelian and Ludian; but also a morphologically primary-looking variant tykkä- is attested, stretching wider still to Ingrian, Veps and Estonian.
[4] Perhaps also not directly from Scandinavian, but thru Finnic *habukka (> standard & western Fi. haukka but e.g. eastern Fi.–Krl. havukka, Lu.–Veps habuk).
[5] Traditionally considered the defining innovation of a Western Samic subgroup, but I would agree more with a division into South–Ume versus Rest being older, as argued in recent times (future blog post on this perhaps coming).
[6] Inspiring also modern Finnish typpi ‘nitrogen’ as a back-derived coinage.

Posted in Commentary, Reconstruction

“All swans are underlyingly white”

An allegory that I started writing for something else, but which upon reflection should probably stand on its own.

Once upon a time, in a world much like our own, a biologist postulates a generalization: “All swans are white”. The hypothesis performs admirably for a time and appears to predict quite strongly the coloration of swans.

Small updates to the emerging theory are gradually accepted: for example, the impact of diseases, oil spills or Diogenean paint-set-wielding jokers on the coloration of swans is admitted via an adjustment “All swans are naturally white”. Likewise, objections drawn from the study of the beaks, flesh, skin, etc. of swans are admitted via a further adjustment “All swans’ mature plumage is naturally white”. Regardless, the original formulation remains in circulation, even if by now generally understood to be shorthand for the more nuanced version; and all remains well in the field of Generalizative Biology.

One day, however, a serious disaster appears to strike, as reports of black swans arrive from the far-away land of Australia. Some initial ways of explaining away this pesky conflicting evidence are explored: perhaps the birds in question have a peculiar habit of taking dust baths in coal beds; perhaps the birds possess not true plumage but a neotenous extension of the downy (and non-white) body covering of young swans; or perhaps their blackness indeed disqualifies them from being considered “swans”. However, detailed study of their behavior, genetics, etc. eventually proves these approaches untenable. The birds in question are to all appearances just another species of swan, except for its atypical plumage. Generalizative Biologists remain irked by this blemish upon their valued theoretical stance that all swans are white; thus far one of their most strongly established results in the field of animal coloration. It is proposed by some naïve outsiders that the theory has simply been falsified, but such busybodies universally fail to propose any new, better theory in its stead. How else would one explain the repeated and well-replicated observation that swans everywhere else appear white? A weakening to a mere “some swans are white” would have no predictive power, and it would clearly be a great failure of parsimony to posit that dozens of swan species are individually white, but with nothing common between these individual facts.

In the minor field of historical biology, it is pointed out that black and white swans alike likely descend from a white ancestor. This does not generate much disagreement as such (by definition the proto-swan has been a swan, and hence is clearly predicted to have been white), but at the same time is agreed to not be an answer to the problem of black swans. Swans must be describable as synchronic biological systems! Their identity as swans cannot hinge on evolutionary theory. After all, has not the concept of swans — even including the problem of black swans — been already defined long before anyone had heard of evolutionary biology?

Finally a new promising theory is found. Informed by close study of the developmental history of swans, it appears that black swans’ plumage only gains its coloration by a pigment. It is shown in careful experiments that, without this pigment, the plumage would rather turn out white! This immediately gives a new, seemingly paradoxical result. Even black swans are white after all: that is to say, white in (what comes to be called) their underlying anatomy, and only seemingly black due to (what comes to be called) their individual bodily realization. All biology, of course, has for long recognized that members of a species often differ in their individual bodies: taller, shorter, missing digits or having additional ones, indeed sometimes albino; and that this should not be seen as invalidating their identity as members of a species with a certain typical height, number of digits, or coloration. But the recognition that such individual processes can be highly widespread across a species proves revolutionary.

The theory of underlying anatomy quickly finds applications to several seemingly unrelated problems. The hooded crow, for example, turns out to be not a true counterexample of the old but contentious theory that “all crows are black”: it can be treated as a completely typical, underlyingly black crow (instead of e.g. the popular earlier theory to identify it as being actually a remarkably large and crow-like magpie). Likewise, the old suggestion that “all birds have wings” proves to be underlyingly true and only superficially contradicted by species such as the kiwi or, by some views, the ostrich. Even the older hedge about swans being only naturally white proves to be but a trivial special case of the new theory: e.g. any swan spraypainted green remains, of course, still underlyingly white. All this is widely seen as strong evidence for the validity of the concept; and thus, the problem of black swans has, in the end, only made Generalizative Biology stronger.

Some still occasionally express confusion about the nature of underlying anatomy, mostly people without proper training in theoretical anatomy (lamentably including even several experimental anatomists). It is admitted, of course, that experimental correlates of underlying anatomy remain difficult to identify. However, even if one for some reason sees fit to completely ignore e.g. the original pigmentation studies among the black swans, what of other case studies such as the embryological evidence for humans as an underlyingly tailed species? the fossil and written evidence for lions as an underlyingly European species? or any other number of such demonstrations? No, all accumulated evidence must surely be taken in favor of underlying anatomy as the main cause behind organisms’ observable biology. And the theory clearly advances by the day — why, just last year it was argued that even horses, too, share the important mammalian universal of five underlying digits.

Posted in Methodology

Sami ruoŧŧa ‘Swedish’, ruošˈša ‘Russian’

The ethnonym and state name Russia(n) traces its origin back to older Rus’ (Русь). As the current standard etymology goes, this is thought to then derive, via the Varangian ruling class of pre-Slavic Russia, from Finnic *roocci ‘Sweden’, in derivatives ‘Swedish’; which is itself considered a loan ultimately from Germanic *rōþaz ‘rower’, in most versions via the name of the area of Roslagen on the eastern coast of Sweden, or perhaps various compound names in *rōþs- for its inhabitants. There are various details in this chain of hypotheses that are not exactly straightforward, and currently a lively session is ongoing at academia.edu, around a recent discussion paper by Viacheslav Kuleshov. [1]

Most discussion has focused only on the Scandinavian – Finnic – Slavic main chain. There are old offshoots also in other Uralic languages though, and on closer consideration I find interesting the existence of two separate groups of reflexes in Sami.

In this blog post’s title I’ve given the standard Northern Sami forms. The first of them is etymologically no trouble at all: it is simply a transparent loan from Old Finnish †Ruodzi /ruoθθi/ that probably does not need to be projected deeper back in Sami than the 17th century (around the time when the Swedish state itself, i.e. not just Swedish-affiliated Finnish peasants, begins to have a stable presence in the Northern Sami areas). Inari Sami ruátálâš ‘Swede’, Ruotâ ‘Sweden’ and Skolt Sami Ruõtˈt ~ Ruõcˈc ‘Sweden’ are transparently newer loans still. The first two come from spoken northern Finnish ruottalaine(n) ‘Swede’, Ruotti ‘Sweden’, [2] the last from standard Finnish Ruotsi. Lagercrantz in Lappischer Wortschatz documents newer-looking loan variants from Northern Sami dialects too, e.g. (transposed to modern orthography) ruoha from Talma in Kiruna, ruohta from Gratangen and Nesseby, ruoha ~ ruohta from Parkalompolo in Pajala. I would take this variation similarly as evidence that there was no name of Sweden known even in Common Northern Sami (which very well might be older than a unified Kingdom of Sweden at all; putative Proto-NS seems to be closer to Proto-Samic than to modern NS dialects). Northern, Inari and (since WW2) Skolt Sami are of course also the three most Finnish-influenced Sami varieties. From Lule Sami on south I would presume ‘Sweden’, ‘Swede / Swedish’ to be instead direct loans from Swedish (Sverige, svensk). I have not checked primary lexicographic sources beyond drawing a blank for cognates of Ruoŧŧi in the Álgu database, but a quick look at Wikipedia incubators seems to confirm this, attesting Lule Sami gen.sg. Svieriga (nom.sg. Svierihka?), Pite Sami Sverji, Southern Sami Sveerje and Svïenske.

Explaining ruošˈša from Finnic is however harder. Already the sound correspondence *šš ~ *cc looks mysterious to me: all Sami languages would have the voiceless affricates c, č available (both plain and geminate, add preaspiration to taste), and substituting later Finnish *θθ as a palato-alveolar sibilant wouldn’t really make sense either. I have not found any real explanation or reference to one in the handbooks of Korhonen or Sammallahti. There seems to be at least one parallel though: Fi. viitsiä ~ NS višˈšat ‘to bother’. I could imagine this reflecting an intermediate early Finnish stage with a lamino-dental sibilant *s̪s̪, which is phonetically expected and perhaps directly ancestral to some of the small Finnish dialect areas with *cc > ss. There would be a better match if we assumed that the palatalization of Karelian čč was earlier found across Finnish as well (perhaps even as a retention: PF *cc after all comes from palatalized *ćć), and that assibilation took place already before any fronting, so that “very old Finnish” had a reflex *śś that could be adopted intact in Sami. But for this there is zero evidence across the Finnish dialect reflexes.

Kuleshov however has, in an earlier paper on the same topic, a promising suggestion that is new to me: borrowing not from Finnic but Slavic. This for starters fits the meaning better. Nowhere in Finnic does *Roocci mean ‘Russian’, and even the possible loan etymology into Slavic would seem to involve the semantic shift ‘Swedish’ > ‘Russian’ coming about within Slavic speakers. As an intermediate stage I would assume the word referring to Slavs affiliated with or ruled by still-Scandinavian Rus’. There is also common Permic *Roć ‘Russian’, usually analyzed as a loan from Finnic. I guess Permians should be predicted to follow roughly the same semantic trajectory as in Slavic however, once Scandinavian Rus’es cease to exist and are replaced by or assimilated into Slavs.

(Actually this also gets me wondering about Norwegians running trade connections to Old Perm via a northwestern oceanic route. They must have been known by some Permic name, and I wonder what… This is not required to have been the same as that for the inland Rus’, of course.)

Phonologically, NS uo from Slavic u does not immediately look good. Kuleshov’s suggestion is loaning already very early, before the Middle Slavic raising †ō > u. In this particular word this stage actually happens to be clearly attested even, in a Byzantine Greek transliteration Ῥῶς. On some thinking I have developed a different idea though. Any contacts with (pre-)Russians must have started at the eastern end of the Samic dialect continuum. And if we look at the other Sami varieties’ cognates of ruošˈša, we find not only identical Inari ruošˈša and roughly the same Skolt ruõšˈš, but also Kildin Sami rūšš. The correspondence “mainline” uo-a ~ Kildin ū is regular, probably regular enough that ū could be etymologically nativized back to uo, if the loan was transmitted thru Kildin or a similar Kola Sami variety (just as Rūʰt̀s = Rūʰcc for ‘Sweden’, mentioned by Kuleshov, must be etymologically nativized, either straight from Fi. Ruotsi or more likely from Skolt Ruõcˈc). This gives more flexibility in absolute chronology, which would be handy: the Middle Slavic era is usually dated somewhere in the mid 1st millennium, while the Russian Pomors arrive on the coasts of Kola peninsula a fair bit later, in the first centuries of the 2nd millennium. I do not think occasional reports of people further southeast by more trade-minded Norse or Karelians would be sufficient to establish a Sami name for the people who would eventually become Russians.

A different issue appears in the consonantism, but this too seems to work out by the assumption of borrowing initially into Kildin Sami. Samic palatoalveolar š from Slavic palatalized s’ is as expected; but why overlong *šš? In discussion Kuleshov has pointed out as a parallel the substitution of Russian medial с, ш as mostly geminate ss, šš even in recent loans into Finnic, a correspondence that is upon checking known to be fairly systematic but which I had not realized before. But so indeed: jorssi ‘ruff’, kassa ‘hair’, kassara ‘billhook’, kasseli ‘backpack’, kiisseli… Thinking a bit more, while this substitution looks unnecessary, it would be not so in varieties like Ludian or southern Karelian, where medial *-s- has been voiced to -z- or -ž- and the only native voiceless sibilants are therefore -ss-, -šš-. Browsing SSA, there are even cases with a geminate in the eastern languages but a singleton attested in Finnish, e.g. EFi. †kaasa (19th c.) ~ Krl. koašša etc. ‘porridge’; EFi. kosinkka ~ Krl. kossinkka ‘scarf’; EFi. ko(s)suli ~ Ingrian kossula ‘type of plough’; EFi. ku(s)sakka etc. ~ Krl. kuššakka ‘woven belt’. (However this pattern is weakened by forms kosinkka, kušakka even in southern Karelian, with singleton voiceless sibilants apparently re-established in loans.) Makes of course also a nice parallel to the long-running sound substitution strategy that Indo-European -p-, -t-, -k- are borrowed as Finnic -pp-, -tt-, -kk-, but -b-, -d-, -g- as -p-, -t-, -k-. This is by now regardless a strategy, a kind of a cousin of etymological nativization, and not a mechanical phonetic substitution. [3] I think this is also what allows the gemination pattern to turn up even in Russian loans into eastern Finnish, despite the availability of unvoiced -s-. Further in western and standard Finnish such loans have been of course mainly mediated by the eastern dialects.
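As a rough illustration of why I call this a strategy rather than a phonetic substitution, the stop pattern above can be written down as a plain lookup table. This is only a sketch: the names `MEDIAL_STOPS` and `adapt_medial` are made up for the purpose, and the table works on single medial consonant symbols, not on whole words or real loan etymologies.

```python
# Illustrative only: the medial-stop substitution pattern described above,
# treated as a lookup table. MEDIAL_STOPS and adapt_medial are hypothetical
# names, and the function handles one consonant symbol at a time.
MEDIAL_STOPS = {
    "p": "pp", "t": "tt", "k": "kk",  # voiceless source stop -> Finnic geminate
    "b": "p", "d": "t", "g": "k",     # voiced source stop -> Finnic singleton
}


def adapt_medial(consonant: str) -> str:
    """Return the Finnic substitute for a medial source stop.

    Consonants not covered by the pattern are returned unchanged.
    """
    return MEDIAL_STOPS.get(consonant, consonant)


# Source/substitute pairs for the six stops mentioned in the text.
pairs = [(c, adapt_medial(c)) for c in "ptkbdg"]
```

The point of the table form is that nothing in it refers to phonetic detail: once the voicing contrast of the source is mapped onto the length contrast of the recipient, the mapping can keep applying even where a closer phonetic match would be available.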

Now what has this to do with a loanword into Sami? Directly nothing, I think: the Sami languages are not bound to Finnic and should be free to develop their own patterns of loanword adaptation. But in eastern Sami, from Skolt on, we also find medial voicing of sibilants, in the weak grade that is (the strong grade is a regular / “short” geminate, as everywhere in consonant-gradating Samic). This would have created the opportunity to innovate the same loanword nativization strategy: Russian -з- gets taken over as the paradigmatically voiced -ss- : -z-, versus Russian -с- as the consistently voiceless -sˈs- : -ss-. I have no data easily on hand on whether this actually happens in Russian loanwords into Kildin Sami, but a draft paper by my colleague Markus Juutinen, on Russian loanwords into Skolt Sami, proves helpful. Some geminates from Russian -с- indeed turn up there: bie´sˈs ‘devil’ ← бес, [ki̮s̄sa̮] ‘bag made of sealskin’ ← киса, pleäsˈsjed ‘to dance’ ← плясать. All three must be fairly recent (note retained b-, pl- and lack of *ki > ǩi), but this is regardless evidence that the same adaptation-by-gemination strategy has been innovated in eastern Sami. Within a younger chronology, where rūšš is first loaned from Pomors in maybe the 12th century and perhaps reinforced as geminate during ongoing contacts, it does not seem outlandish to assume that medial voicing and also vowel raising to ū (IMO a part of an already Proto-Kola Sami chainshift) could have been in place already.
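The proposed eastern Sami counterpart of the strategy can be sketched the same way, now with a strong and a weak grade per source sibilant. Again this is only a toy model under my assumptions above: `Grades`, `EASTERN_SAMI_SIBILANTS` and `is_consistently_voiceless` are hypothetical names, and the two rows simply restate the hypothesized pattern rather than any checked Kildin data.

```python
from typing import NamedTuple


class Grades(NamedTuple):
    """Strong and weak grade of a medial consonant in a gradating paradigm."""
    strong: str
    weak: str


# Hypothetical table restating the pattern proposed in the text:
# Russian з -> paradigmatically voiced ss : z,
# Russian с -> consistently voiceless sˈs : ss.
EASTERN_SAMI_SIBILANTS = {
    "з": Grades(strong="ss", weak="z"),
    "с": Grades(strong="sˈs", weak="ss"),
}


def is_consistently_voiceless(source: str) -> bool:
    """True if neither grade of the adapted sibilant is the voiced z."""
    grades = EASTERN_SAMI_SIBILANTS[source]
    return "z" not in grades.strong and "z" not in grades.weak
```

Seen this way, the overlong geminate for Russian -с- is simply the only slot in the paradigm that stays voiceless in both grades.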

I do not know what to think of the absence of a known Ter Sami cognate of rūšš (we would predict rī̮šš = /rɨːɕː/). Is this merely a documentation gap? The main word for ‘Russia(n)’ in Ter Sami is however instead Tārra, cognate to the words for ‘Norwegian, Scandinavian’ elsewhere in Samic (NS dárru, etc.). While this is a neat parallel to the presumed Rus’ > Russian shift, it does raise questions. Were Pomors first interested in areas further north and west before showing much attention to the Ter Sami? Was there earlier an inland / coastal split within Kola Sami instead of the current west / east one?


A further interesting aspect of what this etymology adds up to is, I think, that although ruošˈša and ruoŧŧa still are likely doublets from a common ultimate source, their identical vocalism would be kind of accidental: the latter gets its uo straight from Finnish < Proto-Finnic *oo, the former develops thru Middle Slavic ō > (Old) Russian u → Kildin ū, which is only incidentally also < Common Samic uo. Under my current hypothesis, probably not even the shifts *ō > *ū in Slavic and Kola Sami can be considered connected: they merely reflect the universal tendency towards long vowel raising (besides, one is conditional, the other unconditional). Thus can typology of sound change conspire to create similar sound patterns via different routes. A kind of a second dimension to my earlier typology as parallel loanwords being somewhere between “diverging” (adopted in forms more different than they should be natively) or “converging” (adopted in forms more similar than they should be natively): they can also show correspondences “regular due to internal development” versus “regular due to unrelated developments”.

[1] Itself in response to a recent proposal that tries to sketch a novel Balto-Slavic origin for *Roocci; which I find so far still worse phonologically, semantically and sociolinguistically, so enough about that for this blogpost.
[2] Note also the Finnish second-syllable alternation between -a- in the ethnonym and -i- in the toponym, copied also in Inari Sami -á- ~ -â- (the latter with usual etymological nativization of stem vowels in F/S loans). This phenomenon remains without a clear known origin but is paralleled by Suomi ‘Finland’ : suomalainen ‘Finn’ and Lappi ‘Lapland’ : lappalainen ‘Lapp’ (≈ Sami). Presumably some two of these are analogous to the third, but I have no solid idea which way around. Slightly different is Häme ‘Tavastia’ : hämäläinen ‘Tavastian’ where the first seems to be a simple derivative in -e (within just Finnish we cannot tell very well if from *-eh or *-ek) from an earlier *Hämä.
[3] It has likely been originally phonetic though, way back when Proto-Norse and Proto-Germanic *-p-, *-t-, *-k- were still preaspirated or preglottalized. At least one example could show a similar development in an old loanword from Baltic: PF *rattas ‘wheel’ ← *ratHas < PIE *HrótHos.

Posted in Commentary, Etymology
