Trees within trees: the Bundle Model


Reposting here, an illustration I whipped up a few days before Christmas, for a debate on the validity of the tree model in linguistics, held in an article draft session by fellow historical linguists and linguistics bloggers Guillaume Jacques and Johann-Mattis List. They argue against recent papers by Alexandre François and Siva Kalyan, who have proposed “freeing” historical linguistics from the tree model, and moving to an updated wave-model-esque approach they call “historical glottometry”.

I will not cover the debate here in detail, especially as the comments have been made publicly available by now (see also the link above thru to Jacques’ blog for some set-up details and further links). One major observation that I think emerges, however, is that there are multiple different senses in which we can speak of the “splitting” of languages, and that how the relationships between languages should be represented therefore often depends on the level of analysis.

My diagram above says nothing directly about linguistics, and is simply an abstract interleaving of two disparate tree structures: a macro-level, represented by branch distances; and a micro-level, represented by the graph topology. If you look closely, you can also see that there are indeed two micro-trees in the graph, unconnected to each other. (They likely would join paths sometime further down in history, had I continued drawing.)

There are 12 leaf nodes in this “double-tree”, which we may call A, B, C, …, L. Depending on which level of analysis we are looking at, there are two possible taxonomies generated by the two tree structures:

  • a “macro classification”:
    • [[A, [B, [C, [D, E]]]], [F, [G, H]], [[[I, J], K], L]]
  • a “micro classification”:
    • {{A, {{B, C}, D}}, {{E, {{F, G}, H}}, I}}
    • {{J, K}, L}

There are not many subgroups that would occur in both structures! The only such one is the triplet {F, G, H}… and even the subgrouping of this again diverges. There is moreover an interesting chronological complication with the splitting of this group: the micro-level branching occurs in its entirety substantially earlier than the macro-level branching.
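The overlap between the two classifications can also be checked mechanically. What follows is only a minimal sketch, assuming the bracketings above are transcribed as nested Python tuples; the function name `clades` and the variable names are made up for this illustration and do not come from any library.

```python
def clades(tree):
    """Return (leaves of `tree`, set of all subgroups with 2+ leaves)."""
    if isinstance(tree, str):
        # A single leaf: no subgroup to report.
        return frozenset([tree]), set()
    members = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        members |= child_leaves
        found |= child_clades
    found.add(members)  # the node itself is a subgroup
    return members, found

# Transcribed from the macro classification above.
macro = (
    ("A", ("B", ("C", ("D", "E")))),
    ("F", ("G", "H")),
    ((("I", "J"), "K"), "L"),
)

# Transcribed from the micro classification above (two unconnected trees).
micro = [
    (("A", (("B", "C"), "D")), (("E", (("F", "G"), "H")), "I")),
    (("J", "K"), "L"),
]

macro_groups = clades(macro)[1]
micro_groups = set().union(*(clades(tree)[1] for tree in micro))
shared = macro_groups & micro_groups

print(sorted("".join(sorted(group)) for group in shared))  # ['FGH']
```

Only leaf sets are compared here, so the sketch says nothing about the internal branching of {F, G, H}, nor about the relative timing of the splits.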

In principle, it would be also possible to nest a third tree yet, of arbitrary structure, deeper inside the picture — so that upon zooming in, the graph representing microstructure again resolves into a set of unconnected nanostructures, branching and turning in tandem. And so on, ad libitum: fit then in an additional picostructure inside the nanostructure, or perhaps: use the current macro-division as a base for a megastructure with another geometry again entirely. (Moving from two dimensions to three or more will be required, if we wanted to fit in “non-contiguous” subgroups such as {A, C} or {E, F, J}.)

My approach here is also but one of various possibilities for “mixing” trees together. It does have one interesting constraint: in all cases, a macro-branching between two leaves takes place later, or at most at the same time (e.g. E | F), as their micro-branching. — But we could also imagine e.g. a single three-dimensional tree, whose 2D projections in a number of different directions each form a new tree of a different shape. In this case, branchings visible e.g. in the XZ-plane could be equally well earlier or later than the corresponding branchings visible in the YZ-plane.

If we imagined the above tree to indicate language relationships, perhaps linguist fieldworkers’ initial instinct would be to group the 12 varieties as 4 languages, according to the macro-structure:

  1. {A}, clearly a variety of its own;
  2. {B, C, D, E} as a set of “closely related” varieties;
  3. {F, G, H} as a more diffuse dialect continuum;
  4. {I, J, K, L} as an intermediate case.

But at some point, a closer look into the dialect diversification of these varieties might indicate e.g. that the features separating A from B-E include some traits that go quite far back, already before the B-E / F-H split. Other troubling isoglosses might also surface, where A thru I shared one value, J thru K another, and where we were regardless unable to show that the latter, “more closely related” varieties truly have innovated, and not the diverse remainder. At some point “language 2” might end up renamed a “dialect continuum” or a “linkage”, while the “more diffuse” language 3 might firmly retain its clade status. Whether “language 4” would also end up analyzed as a linkage is less obvious. Perhaps linguists would still hang on to analysing at least the split that distinguishes A-D from E-I as multiple unconnected events (one for E, one for F-H, one for I?)

Commenters in the session soon pointed out that my illustration reminds them of the concept of incomplete lineage sorting (ILS) from evolutionary biology. This is, roughly speaking (and any readers with more evobio under their belt than I have, feel free to correct me if this is inexact), the phenomenon that while speciation takes a parent species’ entire gene pool with it, some diversity may later end up being lost in daughter species. And if a species S with two alleles of a gene G splits into two daughter species, and allele G₁ eventually survives only in daughter S₁ while allele G₂ survives only in daughter S₂, we might end up wrongly concluding that the distinct alleles only developed in the daughter species. Moreover, if this kind of a situation takes place a couple of times, a gene may further seem to have split into alleles in the “wrong” order, compared to the actual family tree of the species.

This is however not quite the same phenomenon that I am attempting to point at.

The exact linguistic counterpart of ILS is levelling: if we reconstruct a morphophonological alternation pattern in a proto-language, let’s say *a ~ *b, it will be possible for descendants to analogically eliminate one or the other alternant, and to end up with unvarying *a or unvarying *b. I have many opinions on levelling (most of them critical of reconstructing alternation from non-alternating reflexes; or of projecting attested alternation patterns deeper than necessary)… but that would be an overly large tangent to go on right now. Suffice to note that yes, levelling indeed also creates counter-tree-like isogloss configurations.

We could also define “lexical levelling”, brought about by the loss of inherited vocabulary. Mechanistically, this might look like a different phenomenon from morphological levelling, [1] but in terms of isogloss patterning, it often ends up looking exactly the same. An ancient proto-word might survive only in one group of descendant languages (and end up looking like an innovation particular to it); or it might be lost in a few descendants quite early on (and end up making the other descendants look like a subgroup defined by the introduction of this word); or it might survive in a ragtag assortment of not especially closely related descendants (and make it very clear that the occurrence or non-occurrence of a given word is not a strong genetic signal).

There is however a key difference between lineage sorting and my meta-trees. The “proto-variation” I’m trying to indicate by this meta-tree is not internal to a language variety. It is instead built from variation between the idiolects (topolects, etc.) that a given language is composed of.

Genes are obviously different entities from species, and likewise allomorphs (words) are different entities from languages, so it’s not a huge surprise that their family trees might not match each other; perhaps not even resemble. Two seemingly unrelated genes could turn out to be related, once you look a couple billion instead of just a couple million years back. It is hard to tell how common the same might be for seemingly unrelated words, given that our knowledge of linguistic history remains far shallower than our knowledge of evolutionary history… but even if we assumed that no such cases exist at all (which is, by the way, demonstrably untrue), loaning still often enough suffices to generate completely opaque doublets such as wool and flannel, or atoll and esoteric.

Language contrasts, dialect contrasts and idiolect contrasts meanwhile are only qualitative variations of the one and same thing: linguistic variation between speakers. And yet we can also sketch a situation where a “language split” ends up taking place along different fault lines than an earlier “dialect split” did.

This observation is by no means my own invention. For example, my Helsinki colleague J. Häkkinen calls this phenomenon “boundary shift” in a paper published a few years ago. [2] The particular example he refers to (certain divergences in vowel history in the common West Uralic era) has by now been explained otherwise, [3] but other candidates could easily be located as well. A few that spring to mind within western Uralic would be the numerous isoglosses connecting Votic with the Eastern Finnic (Savonian-Ingrian-Karelian-Veps) language group, e.g. the innovative 1st and 2nd person plural pronouns *möö, *töö, [4] rather than with Estonian, generally considered the closest relative of Votic; or the treatment of initial *d₂- in Samic, where Southern and partly Ume Sami show a development to *θ- > /h-/, but most languages show instead a development to /t-/, which happens to be also found in Finnic. [5] It is likely that many such conflicting isoglosses simply represent secondary contacts, much after the initial separation of the language groups, or even independent developments altogether, but I indeed see no reason to assume that they must all be somehow secondary. Many examples could well have taken root already during the initial dialect divergence of the involved language groups.

We know from dialectology and sociolinguistics that linguistic innovations almost always have a “width”. Instead of taking place in a single isolated variety, with inheritance from there to a set of descendants, they rather spread across some number of related-but-distinct varieties. (This is a point that François and Kalyan justly stress in their papers, if with different terminology.) A boundary shift is, then, nothing more than a change in how far exactly isoglosses coming in from a given direction end up spreading. The conventional usage of “language area” or “language contact” mainly comes up when new innovations extend wider than older ones did, and we often speak of dialect area X extending some influence to dialect area Y. But the opposite is possible as well: if new innovations “shrink” — they stop reaching a particular group of varieties — then not only does this lead to these varieties “splitting away” as a relict area from an earlier group of related varieties: it also leads to their earlier sibling varieties now “changing course” to instead align with some other adjacent “cousin” varieties.

This is the phenomenon that I attempt to capture by the various bunched right-angle turns in my opening graphic. For example, the split between “language 1” and “language 2” involves three micro-lineages (B-C, D and E) turning away in unison from the micro-lineage of variety A — even though the micro-lineage of E has already much earlier split away from that of A-D, and also the split between A and B-D is already well enough in effect. There is therefore a boundary shift here: the macro-lineage formed by A, B-D and E is broken, and only the latter two continue on together (B-D now moreover split into B-C and D). After this, new innovations again continue to accrue across the macro-lineage for a while, as represented by the linear “branch” section.

This situation does not amount to a “unitary protolanguage”, since the three lineages are, in fact, already micro-separate. An attempt at reconstructing a unitary Proto-BCDE would have to reach much deeper than this period to be able to unify also the deepest micro-divergences.

But, just about equally importantly — a single unified Proto-BCDE regardless exists, if way back there (in this case it is, in fact, simultaneously also the proto-variety behind everything from A to I). “Boundary-shrinking” in this sense can thus only operate on closely related varieties; and it can only decrease the similarity of some varieties from their earlier siblings. It is not capable of leading to the “convergence” of unrelated languages. Whatever macro-group ends up being formed by some separate lineages is not in any way converging: it is merely maintaining its pre-existing divergences at a given level, while language varieties outside the group are free to diverge further off. (Of course other processes, such as loss of archaic vocabulary, can well lead to actual linguistic convergence.)

The distinction I draw here between micro-lineages and macro-lineages however also has a different readily applicable interpretation in linguistics: genealogy vs. typology. We find no problem in stating something to the effect that Finnish and Turkish are agglutinative vowel harmony languages, while Livonian and German are fusional vowel-reduction languages: this is taken as nothing more than a relatively superficial system of classification, separate from the “true”, i.e. genetic classification (according to which Finnish and Livonian are both Finnic, while Turkish and German are not even Uralic). But regardless, just as (proto-)languages can split into multiple descendants, language areals can similarly over time split into multiple typologies. Starting from a single point far enough back in time, we should be again able to trace a tree of diverging typologies, which is also again 1) likely to diverge in structure from any genealogical tree, and 2) likely to have all of its splits located later than the corresponding genealogical splits.

Typological divergences definitely also often involve boundary shifts of their own. If Livonian at some point in its history has taken a turn towards fusional typology, then it also has to have taken a turn away from agglutinating typology, and this quite well amounts to boundary shrinking of the “(core) Finnic macro-lineage of agglutinative typology”. Or, inversely: the relatively clean agglutinative morphology of common Finnic, still preserved in e.g. standard Finnish and Karelian, has in many later descendants been muddled by various processes of apocope and syncope: such is the case at least in Livonian, Estonian, Southwestern Finnish, Veps, and partly Ludic; more recently also in some dialects of Ingrian and Votic. This has the effect of turning inherited polysyllabic vocalic stems into “thematic stems”, arguably a step towards a more fusional typology (and at least in Livonian and Estonian, this has been a basic building block for many other innovations in morphology). Regardless, looking from the perspective of early dialect divisions in the Proto-Finnic era, the varieties involved are just about a scattershot. [6]

There also seems to be deeper similarity in here to dialect diversification, not only in the resulting tree structures, but also in the actual details of linguistic change. “Genetic macrostructural”, or “linkage-defining” wide-spreading innovations indeed have various features in common with “typological” wide-spreading ones:

  • They may ignore the microstructure of the dialect continuum;
  • They may spread in phases, taking root in different micro-lineages at different times;
  • Where independent, they may spread also over each other, forming patchwork-like rather than concentric isogloss patterns;
  • They may end up being reversed, if a counterinnovation arises;
    (I’m thinking here principally about “isomorphic” sound changes, that only affect the phonetic realization of a phoneme or a phoneme sequence, not its relation to the rest of the phonology; innovations in syntax may be applicable as well)
  • And finally, they can take the leap to “fully areal”, and spread also to “unrelated”, or at least not at all closely related language varieties.

Due to the lack of clear distinction on which linguistic innovations count as “macro” and which as “micro”, François & Kalyan have suggested roughly that we should treat them all as equally genetic. But I would claim that an opposite approach is just as well possible: since there is also no clear distinction between innovations that count as “macro” and innovations that count as “typological”, perhaps we should treat them as equally non-genetic.

So how do we reconcile these two extremes? A trivial solution would be to claim that no genetic relatedness between language varieties exists, but this obviously gets us into other conceptual problems quite fast (not to mention the troubling echoes of Marrism). Another option might be to instead deny the idea that we can speak of “the” genealogy of a language. Whenever many different and contradictory tree structures emerge, it may be worth checking if we could consider each of them to represent the descent of a different thing. A language’s nominal syntax does not have to have the same exact (areal or dialectological) origin as its vowel inventory, which does not have to have the same origin as its verb morphology, which does not have to have the same origin as its metalworking vocabulary; and perhaps it is a mistake to think that we can pick out the “One True Tree” from among the histories of these various subsystems.

But a third option yet, which I am growing increasingly fond of, would be to first grant that, yes, all usually recognized linguistic innovations are more or less “typological” or “areal” — but to then seek a deeper level yet that we could use as the rooting for the genetic origin of a language variety. My current contender for such a level is local continuity, forming what I call the bundle model.

In the absence of dialect levelling events (the introduction of expansive acrolects through e.g. migrations, mass media, or standardized schooling), a topolect specific to a given location has been primarily descending from the earlier topolect of that same village, as far back as language-level continuity gets us. A fundamental division of language varieties into topolects is also relatively unambiguous: just about any speaker either lives, or doesn’t live, in a particular village. No especially coherent division into topolects smaller than a village is possible either (at least as long as we’re talking about settled, non-urbanized, agricultural societies). [7]

A given linguistic innovation that forms an isogloss somewhere across a dialect continuum is, then, not what actually splits two topolects apart. Its existence is merely evidence that two topolects on different sides of the isogloss had already split from each other at the time. A primary splitting event instead corresponds to either the foundation of a new settlement altogether; or to the introduction of a novel language variety to a pre-existing settlement (no matter if as L2 or L1).

There is admittedly the complication that topolect monogeny is not ensured. Any new settlement could gain its speaker base from more than one pre-existing settlement; and the resulting new topolect can quite possibly end up taking on a mixture of its parents’ traits, instead of starting off as essentially a copy of one of its parents.

As for secondary splitting events, i.e. the actual language diversification, these could be instead said to form “bundles” of local micro-lineages: a category which includes as subtypes all three of “language areas”; “linkages” of related languages; and “subgroups” defined by common features. The differences between the three are, in the bundle model, considered differences in degree, not kind, with no sharp boundaries between them. However, it seems to be necessary to note that there are at least two gradual transitions here: half-a-continent-spanning language areas are still clearly different from local linkages, which in turn are also clearly different from small, tight bundles of topolects.

Also, amusingly enough, not only is it possible for a bundle to comprise language varieties of differing genetic backgrounds — it is also possible for a genetic group of languages to fail to be identified by a corresponding feature bundle. I expect many large-scale subfamilies to be indeed genetic subgroups, in addition to their unambiguous bundle status. But within any one such subfamily, it is easily possible for various smaller genetic groups to have formed, and then split up again, fast enough that no actual linguistic markers managed to establish themselves as characterizing the entire group (and only it).

What would be different for “secure” subfamilies (and “primary” language families) is moreover not their speed of formation. I would equally well expect that e.g. the main local-continuity genetic groups of Finnic had already split from each other before the vast majority of the innovations that today characterize the Finnic subfamily (no matter if one current primary branch would amount to half the Finnic language area; or to a single backwoods town somewhere in southern Estonia). It is the extinction of other early connecting varieties that allows me to be relatively sure that, yes, there was once a common genetic ancestor of the Finnic languages that was also not the genetic ancestor of e.g. any of the modern-day Samic languages. This common genetic ancestor could very well still predate various innovations that did spread to both the Finnic and Samic languages, putting it well within Proto-Uralic times, and thus looking distinctively non-Finnic. If we look for biological parallels, this “common genetic ancestor” thus functions the most like the identical ancestors point.

By contrast, reconstructible Proto-Finnic, no matter if we define this loosely by the last innovation common to all the languages (e.g. in phonology, the best candidate is *š > *h), or more strictly by the last innovation that is not predated by any innovation particular to a smaller set of varieties (in phonology I’d suggest for this something like the raising *aa > *oo, *ää > *ee), instead functions as the mere last common ancestor of the “population” of Finnic language varieties. In practice, this would mean something like the last language variety whose distinguishing linguistic characteristics were eventually uptaken by all other Finnic varieties known to us (either with or without allowing for the survival of additional earlier characteristics).

The bundle model also seems to have the benefit that we could make much closer use of archeology in determining when various micro-lineages originally split from each other. If a cultural wave that we identify as Finnic reaches Southwestern Finland already in 500 BCE, then very well, let us assume that the deepest distinctions between individual western Finnish dialects could have already taken root at the time (and not at whatever time distinctions first start turning up in phonology, or morphology, or vocabulary). After this, we expect to see the foundation of new Finnic-speaking settlements in quick gradual succession, followed by the slower bundling of linguistic innovations (and possibly isoglosses) on top. But just as dialectologists and “linkageists” have long observed, there is no reason to a priori expect these later innovations to form a clear nested tree-like structure.

I have thus ended up agreeing partly with both the Jacques-List and the François-Kalyan camps. As per the latter, yes, we should stop trying to force our analyses of linguistic innovations into a tree shape by default; but per the former, no, this does not mean that we should up-end the concept of “genetic relatedness” entirely, and start applying it also to what are obviously areal units joined only by relatively late innovations (and though I’ve barely even touched the topic in this discussion, also: no, F & K’s “historical glottometry” is not an especially illuminating way of demonstrating the historical development of language groups).

For closing, I present here another imaginary diagram, this time more heavily un-tree-like (highly dialect-continuumish), and with some specific features of the bundle model illustrated. — For credit, this is again not completely original work. My key convention of presenting isoglosses as horizontal lines connecting multiple varieties is inspired, foremost, by earlier articles by Sammallahti and Viitso. [8]

  • Solid lines indicate micro-lineages, just as before;
  • Wide-angle turns indicate spreading events;
  • Small-angle turns (mostly) indicate boundary shrinking events;
  • Dashed lines indicate (some) isoglosses, bundling micro-lineages together;
  • Dead ends in T indicate language replacement events;
  • Dead ends in X indicate abandoned settlements.


I leave it to you to explore the picture further, e.g. to figure out how many processes that I have discussed above you can find illustrated.

[1] They also do share some important mechanistic similarities. If we treat morphophonology as lexicalized rather than surface phonological — then “alternating stem variants” will be nothing more than lexically separate words altogether; and “morphological levelling” amounts to the loss of such “transparently suppletive” words from a paradigm. This is often showcased by morphophonological alternants that lose their original function, but remain in some specialized one.
— A simple example might be Finnish syöpä ‘cancer’. Originally this is simply the active present participle of syö- ‘to eat’; however, it has been ousted from this function by a newer form syövä ‘eating’. Here -vä is the most regular front-vocalic APP ending, analogically drafted in from much more common bisyllabic verb roots (e.g. elä-vä ‘living’, tietä-vä ‘knowing’, käänty-vä ‘turning’, pese-vä ‘washing’), where it is phonologically regular (due to lenition *p > *b > v between unstressed vowels). Hence, the history here involves three steps: 1) the semantic enrichment syöpä ‘eating’ > ‘eating; cancer’; 2) the introduction of the more regular form syövä into the paradigm of ‘to eat’; 3) the loss of the form syöpä ‘eating’.
[2] Häkkinen, Jaakko (2012): “After the protolanguage: Invisible convergence, fake divergence and boundary shift”. — Finnisch-Ugrische Forschungen 61: 7–28.
[3] The Erzya dialects in question seem to agree with Samic in suggesting (West) Uralic *we- in a couple of words, in contrast to forms suggesting *(w)o- in the other Mordvinic varieties. This though turns out to be merely a part of a more general late conditional sound change *u- > /vi-/ in these dialects; see Ante Aikio’s article in SUSA 95: 42.
[4] Discussed in some detail by Terho Itkonen (1983): “Välikatsaus suomen kielen juuriin”. — Virittäjä 2/87: 214–217.
[5] An example taken from the isogloss map of Finno-Ugric by Tiit-Rein Viitso (2000): “Finnic Affinity”. — Congressus Nonus Internationalis Fenno-Ugristarum I: Orationes plenariae & Orationes publicae: 153–178.
[6] This actually goes further yet. Also “Estonian” and “Finnish” have been known for long to be basically typological groupings formed in this fashion, both comprising multiple different genetic micro-lineages, some of which are not especially close in origin. Very roughly, if a Finnic variety is fully consonant-gradating, relatively archaic in its morphology otherwise, mostly nonpalatalizing and lexically Swedicized, it is “Finnish”; if it is consonant-gradating, fully syncopating and apocopating, and lexically Germanized, it is “Estonian”. Relaxing the definitions a bit might also allow us to call Karelian, Ingrian and Votic “typologically Finnish”, versus Livonian “typologically Estonian”. Constructing a definition of “typologically Veps” as a third areal is left as an exercise for the reader.
[7] A slightly modified model, allowing for “locations” to be territories rather than settlements, as well as for more fluid transitions and exchanges between tribal units, would seem to be required for nomadic and certain hunter-gatherer societies. This might also provide some degree of explanation for, and new tools for addressing, the difficulties in reconstructing the linguistic pre-history of areas characterized by heavy diffusion between “unrelated” or not closely related languages, such as Australia and Central Asia. I do not think I am quite going into reviving the punctuated-equilibrium paradigm of linguistic history here, which likewise denies the possibility of figuring out clear tree-like linguistic histories for mobile societies… but discussing the distinctions between that model and mine would be too much to chew on right now.
[8] See e.g. Sammallahti, Pekka (1977): “Suomalaisten esihistorian kysymyksiä”. — Virittäjä 2/81: 119–136.
– Viitso, Tiit-Rein (1999): “On Classifying the Selkup Dialects”. — Europa et Sibiria. Veröffentlichungen der Societas Uralo-Altaica 51: 441–451.

*wu > *u in Finnic

One minor phonological innovation in Finnish is mentioned in historical overviews far more often than could be expected from its lexical frequency: the loss of a palatal semivowel *j when preceding its vocalic counterpart *i. This is probably because the shift has been fossilized as a morphological alternation [1] in the word veli ‘brother’ (< *velji), stem velje-. The change also shows up in some old derivatives, e.g. nelikko ‘group of four’ (< *neljikko) from neljä ‘four’.

For phonological analysis, both synchronic and diachronic, a principle that I find valuable is back/front symmetry. This follows as a special case of what is perhaps the main result of featural phonology: phonemes are not atomic entities, but rather bundles of features. And so sound changes or phonological processes that are conditioned on vowel height tend to ignore vowel backness and roundedness. Here we would then expect to also find the corresponding shift involving labial (semi)vowels: pre-Finnic *-w- or proto-Finnic *-v- > ∅ before *u or *ü (= in shorthand: *U).

Yet it turns out that this question is barely discussed anywhere. I have e.g. found no mention of such a development in Lauri Hakulinen’s Suomen kielen rakenne ja kehitys. [2] Martti Rapola’s Suomen kielen äännehistorian luennot does not fare much better (as is perhaps predictable, though, since his focus is firmly on dialectal developments within Finnish, not on pan-Finnish innovations).

Let’s have a look at whether there is any evidence to be found on this matter.

In support

Given the absence of clear evidence for *U-stems in Proto-Uralic times, there are not many words where we can reasonably assume the sequence *-wU- to have existed in pre-Finnic times. Just one clear word-initial case of loss can be found: *wülä- > PF *ülä- ‘up(per)’ (cf. Permic *vɨl-). [3] Slightly odder is *wud₂ə ‘new’ (and even this, I believe, should be regardless derived from an even earlier *wod₂ə, though this is of no direct relevance for the current topic). This turns up as PF *uuci (Fi. uusi etc.), seemingly with vocalization, rather than loss, of the initial glide. We could also e.g. assume a metathesis *wu- > *uw- as an intermediate stage.

Still, Proto-Finnic clearly had *u-stems, whatever their origin. And it seems that there is still a decent amount of evidence for a simplification *-wU- > *-U- in these. Already within Finnish I can find three clear doublets involving word derivation:

  • kalvaa ‘to gnaw’ ~ kaluta ‘id.’ (< ? *kalvuta) [4]
  • kärventää ‘to scorch’ ~ käry ‘burnt smell, rancor’ (< ? *kärvü)
  • raivo ‘fury’ ~ raju ‘fierce’ (< ? *raivu)
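The proposed simplification is easy to state as a rewrite rule. The snippet below is only a rough sketch under my own simplifying assumptions: the starred pre-forms are the hypothetical reconstructions from the list above, the rule is applied to plain strings, and the helper names `drop_v_before_u` and `to_orthography` are invented for this illustration.

```python
import re

def drop_v_before_u(form):
    """Delete v directly before u or ü (the hypothesized *-vU- > *-U-)."""
    return re.sub(r"v(?=[uü])", "", form)

def to_orthography(form):
    """Respell ü as y, as in modern Finnish orthography."""
    return form.replace("ü", "y")

# Hypothetical pre-forms from the doublets above, paired with the attested form.
examples = {
    "kalvuta": "kaluta",
    "kärvü": "käry",
}

for pre_form, attested in examples.items():
    result = to_orthography(drop_v_before_u(pre_form))
    print(f"*{pre_form} > {result} (attested: {attested})")

# For *raivu the bare rule gives "raiu"; attested raju would need a further
# adjustment of i to the glide j, which this sketch does not model.
print(drop_v_before_u("raivu"))
```

Applied blindly to modern forms such as kasvu or kuivua, the same rule would wrongly delete the v, which is why the relative chronology of these formations matters.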

Comparison with Samic also turns up three likely cases.

  • Lule Sami iellvet ‘to note’ (< ? PS *ealvē-) ~ Fi. äly ‘intellect’, älytä ‘to realize’ (? < *älv-ü)
  • Proto-Samic *ocvē ‘wet snow’ (< *učwa) ~ Fi. utu ‘mist, fog’ (< ? *učw-u) [5]
  • Proto-Samic *toalvō- ‘to lead, to take somewhere’ (< *tolvo- < ? *talwəw-) ~ Fi. taluttaa ‘to lead, to walk someone’ (< ? *talvu-tta- < *talwəw-)

I hypothesize that a close scan of *U-stem roots and derivatives in the other Finnic languages would turn up further evidence as well.


Much as is the case with -ji-, Modern Finnish does however allow the sequence -vU-.

Many of these cases can be shown to have been formed secondarily, and could be hypothesized to have come about only after *-v-loss. E.g. some go back to earlier *-βu- < *-bu- (I give here only non-paradigmatically-alternating cases):

  • juovu-ttaa < *joobu-tta- ‘to get/make someone drunk’ (← juopua ‘to become drunk’)
  • taivu-ttaa < *taibu-tta- ‘to bend’ (← taipua ‘to bend’)
  • vaivu-ttaa < *vaibu-tta- ‘to sink (tr.), lull’ (← vaipua ‘to sink, to fall asleep’)
  • viivy-ttää < *viibü-ttä- ‘to delay’ (← viipyä ‘to be late’)
  • voivu-ttaa < *voibu-tta- ‘to tire (tr.)’ (← voipua ‘to tire (intr.)’)

some involve loaning:

  • laavu ‘lean-to’ ← Samic, cf. e.g. NS lávvu ‘id.’
  • siivu ‘slice’ ← Swedish skiv ‘id.’
  • laiv-uri ‘skipper’ (← laiva ‘ship’; -Uri is a loan suffix from Swedish)
  • päiv-yri ‘almanac’ (← päivä ‘day’)

and others yet result from a late assimilation of unstressed *-AU- to -UU-: [6]

  • arv-uuttaa < *arvautta- < *arvad-u-tta ‘to ask riddles’ (← arvata ‘to guess’)
  • raiv-uu < *raivau < *raivad-u ‘clearing’ (← raivata ‘to clear land, etc.’)
  • tavu ‘syllable’ < older †tavuu < *tavau < *tavad-u (← tavata ‘to spell’)

A few remaining derivative examples could be assumed to have been formed only after *-v-loss, or to have been reverted by analogy.

  • harv-uus < *harv-us ‘sparseness’ (← harva ‘sparse’) [7]
  • kaiv-u ‘digging, trench’ (← kaivaa ‘to dig’; this is an IMO unetymological doublet of *kajwa-w > kaivo ‘well’)
  • kasv-u ‘growth’ (← kasvaa ‘to grow’; the phonologically expected kasvo already means ‘face’)
  • kuiv-u- ‘to dry’ (← kuiva ‘dry’)

A soundlawful [8] doublet of the last one is possibly found in dialectal kujua ‘to wilt’.

Regardless, there remains a more problematic residue, which prevents me from simply assuming that *-vU- always > *-U- at some relatively early Finnic period. These are all basic noun roots with primary *-v-, where morphophonological alternation as a source of analogy cannot possibly be blamed for anything.

  • koivu ‘birch’. The only real excuse I could think up here is that in South Estonian the root is instead an o-stem, kõiv : kõivo-. So perhaps there has been here a later shift from *-vo to *-vu in North Finnic…? (The root has not been attested from North Estonian; in Votic it probably only occurs as an Ingrian loan; Livonian provides no evidence for the distinction between *-o and *-u.) This would still not be a regular sound change though, given aivo ‘brain’, arvo ‘value’, hieho < *hehvo ‘heifer’, kalvo ‘film, membrane’, etc. [9]
  • savu ‘smoke’ seems like it might actually be a positive example of the change, to an extent. On the basis of South Estonian sau ~ Votic and dialectal Olonets Karelian savvu [10] it would be possible to reconstruct PF *savvu; then, just as could be predicted, one *-v- is lost in Finnish. However, this only leads to the question: why does *-v-loss not occur in the previous three varieties as well? Its loss is still seen in e.g. ‘mist’: SEs udsu, NEs udu, Votic utu.
    An explanation may lie in the earlier history of this word. Samic *sōvë ‘smoke’ and Mordvinic *suf-ta- ‘to smoke’ indicate that the earlier form of the root was simply *sawə, not anything like **sawəw. Erkki Itkonen has supposed [11] that the Finnish word is not formed by suffixation, but rather by apocope-then-anaptyxis. In PF times, all former bisyllabic words ending in *-jə were contracted into diphthongs (e.g. *täjə > *täi ‘tick’, *wajə > *woojə > *voi ‘butter’); so in parallel, we would then expect also *sawə to have been contracted to *sau. But no nominal roots of the shape ˣCVU occur in the native lexicon of Finnish (and the scarce loanwords such as tau ‘tau’ or tiu ‘20 items’ are on the recent side as well). Itkonen therefore posits a back-development *sau > savu, to better conform to the canonical bisyllabic root structure. The South Estonian form could then be considered an archaism. Perhaps likewise also the identical monosyllabic reflexes in Southwestern Finnish; although since SW Finnish clearly has had contraction in secondary cases with *-Vbu- > -Vvu- > -Vu- (papu ‘bean’ : SW plural pau ~ standard pavut), this wouldn’t really provide any additional sound change economy.
  • vävy ‘son-in-law’ is almost entirely parallel to the above. We again have North Estonian väi, South Estonian väü, Olonetsian vävvy, suggesting PF *vävvü — although, this time Votic shows shorter vävü. We could well again follow Itkonen’s solution and assume PF *väü. On the other hand, Samoyedic *weŋü suggests to me that the proto-form could this time have been something like *wEŋəwə, predicting indeed PF *vävü < *wäwəw. [12]
  • havu ‘conifer branch’. This could again come from *hau > *havvu > havu, as per Itkonen, in light of Olonetsian havvu. On the other hand, a loan etymology from Baltic (cf. Lithuanian žabas ‘branch’) and Ludian/Veps habu suggest that the proto-form was actually *hapu (with exceptional widespread levelling to the weak-grade stem), or perhaps *habu (with an exceptional unalternating *b).
  • sivu ‘side’. This word definitely does not seem to go back to **sivvu / **siu, given Olonetsian sivu. It might be possible to derive this as a Germanic loanword, in which case this could again be analyzed as a late-comer, but there are several phonological difficulties (e.g. what Old Norse actually has is síða < *sīdǭ, not the seemingly required ˣsíð < **sīdu < **sīdō; western Finnish dialects do not have forms along the lines of ˣsiru or ˣsilu that would be predicted from earlier *siðu; vowel length would be expected to remain in a sufficiently recent loan).

This leads me to suggest that the shift *-vU- > -U- has only taken place following another consonant. Most of my six initial examples are compatible with this. In case of koivu, we’d need to assume this got its -u only after the phonologization of *-oj as the diphthong /oi/; while raju and kujua might need to be analyzed as having originated in western Finnish specifically and spread from there to other varieties. Itkonen’s account of savu and vävy continues to work too, since the key forms like savvu show a geminate -vv-, not a diphthong + glide ˣsauvu (as modern Finnish prefers in cases like this, e.g. sauva ‘pole’). But we could also take a slight shortcut, supposing that these never had a geminate in most of Finnic, and that -vv- in Olonetsian (and Votic?) is indeed a late local innovation rather than an archaism.

In one broad stroke, this conditioning also takes care of just about all of the counterexamples above that could perhaps involve secondary counterfeeding (the types of juovuttaa, laavu, raivuu, kaivu). Additionally, among the positive examples, in one case the involved -v- might indeed derive from earlier *-b-: kärventää ‘to scorch’ (tr.) seems like an affective/ideophonic variant of korventaa ‘id.’, which is derived from korveta (: korpeaa) ‘to scorch’ (intr.) < PU *korpə-.
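The proposed conditioning is simple enough to state mechanically. What follows is only a minimal sketch in Python, not a tested model of Finnic historical phonology: the function name, the vowel inventory and the selection of test forms are illustrative assumptions, and the inputs are the tentative pre-forms suggested above, written with plain v. The rule deletes v before a close rounded vowel only when a consonant precedes.

```python
import re

# Orthographic vowels of the forms used here. Assumption: the i of
# diphthongs such as ai, oi counts as a vowel, as in modern Finnish spelling.
VOWELS = "aeiouyäöü"

# v is dropped only between a preceding consonant and a following U (u, ü, y).
CVU = re.compile(rf"([^{VOWELS}])v([uüy])")


def drop_v_after_consonant(form: str) -> str:
    """Apply the hypothesized change *CvU > CU to a single form."""
    return CVU.sub(r"\1\2", form)


# Tentative pre-forms that the rule is meant to affect.
for pre_form in ["kalvuta", "kärvü", "älvü", "talvutta", "harvu"]:
    print(pre_form, ">", drop_v_after_consonant(pre_form))

# Forms where a vowel precedes v, so the rule leaves them untouched.
for form in ["savu", "sivu", "koivu", "laavu", "juovuttaa", "kaivu"]:
    print(form, ">", drop_v_after_consonant(form))
```

On this toy input the rule yields kaluta, kärü (= käry), älü (= äly), talutta- and haru, and leaves the savu and juovuttaa types alone. A pre-form such as *raivu is likewise left unchanged, so raju still needs a separate explanation.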

As a third line of evidence in favor of this approach, let’s note that *-ji- > *-i- also seems to not take place following a vowel (laji ‘kind, species’, lujin ‘hardest’ ← luja ‘hard’, nuijia ‘to clobber’ ← nuija ‘club’, ojittaa ‘to dig ditches’ ← oja ‘ditch’) and is probably a post-Proto-Finnic change (*velji ‘brother’ > Karelian veľľi ~ velli, Votic velli). Maybe even particular to Finnish! Es. veli can be derived just as well through apocopated *velj (compare e.g. *neljä > *nelj > neli ‘4’).

Tracing the implications further, I even suspect that cases like PU *täjə > PF *täi = Fi. täi ‘tick’; PU *wajə > *woojə > PF *voi = Fi. voi, as mentioned above, have probably not developed through a stage such as *täji, *vooji, but have instead involved the direct apocope of PU *-ə following a glide. In principle this predicts that words of the shape *CVji would perhaps have been possible already by Proto-Northern Finnic, from PF *CVjei < earlier *CVjA-j. Suitable roots for forming derivatives of this kind were rare, though.

This may seem to create problems for accounting for words of the shape CVvi : CVve-, like PF *kivi = Fi. Es. etc. kivi ‘stone’… but by now I have, also for other reasons, ended up with the hypothesis that these involve either the levelling of earlier alternation (*kiü : *kive- → *kivi : *kive-), or a geminate in Proto-Finnic that blocked this apocope (e.g. *povvi ‘bosom’ > Fi. povi, Votic põvvi, Es. *põvv > põu).

A second group — and more?

I have not exhausted above the examples known to me where a development *-vU- > -U- could be supposed for Finnish (or elsewhere in Finnic). However, all words remaining up my sleeve show some ambiguity: they involve syllable contraction *-VvU- > -VU-, and they could be analyzed also as cases of syncope followed by vocalization: *-VvU(C…) > *-Vv(C…) > -VU(C…)-. This hypothesis gains some support also from the fact that several examples could have involved the loss of some vowel other than close rounded *-u- or *-ü-. They also commonly enough involve secondary *-v- from *-b-.

The following clearly have involved earlier *-vU-:

  • haukka ‘hawk’ < havukka (attested in eastern Fi.!) < *habukka — cf. Veps habuk
  • hius ‘(single) hair’ < *hivus < *hibus — cf. Karelian hivus, Veps hibus
  • säyseä ‘tame’ < ? *sävüseä — cf. sävyisä ‘id.’; sävy ‘tone, hue’

The following may have had *-vU-, but other possibilities are reasonable as well:

  • auttaa ‘to help’ < ? *avu-ttaa / *avi-ttaa; aulis ‘willing to help’ < ? *avu-lis
    — cf. apu ‘help’, Veps abutada ‘to help’; or Western Fi. avittaa ‘to help’ (with counterparts in southern Finnic such as Es. aitama)
  • keuhko ‘lung’ < ? *kevu-hko / *keve-hko; köykäinen < köyhkäinen ‘light, feeble’ < ? *kevü-hkäinen / *keve-hkäinen
    — cf. kevyt ‘light’; or kepeä ‘light’
  • liukas ‘slippery’ < ? *livu-kas / *live-kas — cf. lipu ‘slipperiness’; or livetä ‘to slip’, lipeä ‘lye’ (liueta : liukenee ‘to dissolve’, pro ˣlipVeta, and liukua ‘to slide’ have to be analogical; the latter’s soundlawful doublet seems to be lipua ‘to glide’)
  • soukka ‘narrow’ < ? *sovu-kka / *sovi-kka — cf. sopukka ‘nook’; or sopia ‘to fit’

The following have no evidence specifically in favor of *-vU-:

  • aukko ‘hole’ < ? *ava-kko — cf. avata ‘to open’ (or < ? *auɣekko, cf. auki ‘open’, aueta : aukenee ‘to open’ (intr.); unlikely though given Livonian ouk)
  • kiukku ‘anger’ < ? *kiiva-kku — cf. kiivas ‘quick-tempered’
  • loukko ‘nook’ < ? *love-kko — cf. lovi : love- ‘cleft’
  • reuhtoa ‘to yank around’ < ? *revihtoa / *revehtoa — cf. repiä ‘to tear’ (tr.); revetä ‘to tear’ (intr.)
  • riuska ‘brisk’ < ? *rive-ska / *riva-ska — cf. ripeä ‘id.’, rivakka ‘id.’
  • saukko ‘otter’ < ? *sava-kkoi — cf. sapa ‘tail’ (but alternately from *sagukkoi, cf. *sagarma(s) ‘otter’ > Es. saarmas, Veps sagarm)
  • tiukka ‘tight’ < ? *tiivi-kka — cf. tiivis ‘compact’
  • tyyssija ‘abode’ < ? *tyve-s- — cf. tyvi : tyve- ‘base’ (even -yy- < *-yi- might be possible!)

General syncope after -v- however clearly cannot be assumed. Some examples that do not alternate with related bisyllabic forms, even through derivation, include: havista ‘to swish’, havitella ‘to strive for’, hävitä ‘to disappear, lose’, kavala ‘treacherous’, kivahtaa ‘to snap at’, kuvottaa ‘to be/make nauseous’, navakka ‘strong (of wind)’, ovela ‘shrewd’, ravistaa ‘to shake’, ravita ‘to nourish’, sivellä ‘to brush (paint etc.)’, suvanto ‘river pool’. To these could be also added an abundance of more or less transparent derivatives such as avuton ‘helpless’, kivittää ‘to stone’, kovasin ‘whetstone’, lävitse ‘thru’, savinen ‘clay-y’, syventää ‘to deepen’, tavallinen ‘normal’, toivomus ‘wish’, but I believe the point is made without going for completeness.

I could still see some patterns in favor of reconstructing at least conditional syncope. Most of the contracted examples involve following *-kk-; most involve a short first syllable (contrast the juovuttaa and laavu types earlier); most seem to be “weak grade” formations, where the 2nd syllable would originally have been always closed (including also hius : hiukse-).

But what this is also reminding me of is the pattern of modern colloquial Finnish “clipped” or “slang” derivatives. These are not formed by agglutination, but instead by taking the initial CV(V)C or CVCC sequence of a word, shortening a long vowel if necessary [13], and appending a suffix after that. Some examples of derivation of this kind include:

  • -(t)sa: kotitalous → kotsa ‘home economics (as a school subject)’, maantieto → mantsa ‘geography (as a school subject)’
  • -(t)si(-): fundeerata → funtsia ‘to think’, kannattaa → kantsia ‘to be worth doing’, miljoona → miltsi ‘million’ (of money), parveke → partsi ‘balcony’
  • -(t)su: fantastinen → fantsu ‘fantastic’, ranta → rantsu ‘beach’; common in nicknames, e.g. Anna, Anni, Annika (etc.) → Antsu, Milla → Miltsu, Valtteri → Valtsu
  • -(t)ska: juttu → jutska ‘thing(y)’, tietokone → tietska ‘computer’
  • -(t)ski: jäätelö → jätski ‘ice cream’, nuotio → notski ‘campfire, bonfire’
  • -(t)sku: banaani → bansku ‘banana’, materiaali → matsku ‘(reading) material’
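The clipping recipe described above can be made concrete with a rough sketch. The Python below is only an approximation under simplifying assumptions of my own: the clipped stem is the initial (C)V(V)C chunk of the source word, a long vowel is a doubled vowel letter, and a suffix-initial t merges with a stem-final t. The function names are invented for illustration, and not every form in the list comes out right (notski from nuotio, for one, needs a further adjustment that is not modelled here).

```python
VOWELS = set("aeiouyäö")


def clip(word: str) -> str:
    """Return the initial (C)V(V)C chunk of a word, shortening a long vowel."""
    w = word.lower()
    i = 0
    # Optional word-initial consonant(s).
    while i < len(w) and w[i] not in VOWELS:
        i += 1
    onset = w[:i]
    # The following run of vowel letters (short vowel, long vowel or diphthong).
    j = i
    while j < len(w) and w[j] in VOWELS:
        j += 1
    nucleus = w[i:j]
    # A doubled vowel letter counts as a long vowel and is shortened.
    if len(nucleus) == 2 and nucleus[0] == nucleus[1]:
        nucleus = nucleus[0]
    # Exactly one consonant after the vowel(s) closes the clipped stem.
    coda = w[j:j + 1]
    return onset + nucleus + coda


def slang(word: str, suffix: str) -> str:
    """Clip a word and append a slang suffix, merging t + t at the seam."""
    stem = clip(word)
    if stem.endswith("t") and suffix.startswith("t"):
        suffix = suffix[1:]
    return stem + suffix


examples = [
    ("kotitalous", "tsa"), ("maantieto", "tsa"), ("miljoona", "tsi"),
    ("juttu", "ska"), ("tietokone", "ska"), ("jäätelö", "ski"),
    ("banaani", "sku"), ("materiaali", "sku"),
]
for word, suffix in examples:
    print(word, "→", slang(word, suffix))
```

For these inputs the sketch returns kotsa, mantsa, miltsi, jutska, tietska, jätski, bansku and matsku. Whether the suffix carries its own t is simply supplied by hand in each call, since the conditions for the optional (t) are not spelled out above.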

And -kka is one of the more productive suffixes of this kind. E.g.

  • harjoitus → harkka ‘training’
  • jungle → junkka ‘jungle’ (the electronic music subgenre!)
  • linja-auto → linikka ‘bus’
  • liikunta → liikka ‘physical exercise (as a school subject)’
  • maisteri ‘Master (degree)’ → maikka ‘teacher’
  • purukumi → purkka ‘chewing gum’
  • Sörnäs → Sörkka ~ Sörkkä ‘district in Helsinki’

We also know some examples of this exact derivation pattern whose spread of cognates suggests fairly great age. Three good examples are the informal family terms eukko ‘woman, wife’ (< *emkko?) (cognate in Karelian), probably from emo / emä ‘mother’; ukko ‘man, husband’ (cognates in almost all Finnic languages), from uros ‘male’; veikka, veikko ‘brother, comrade’ (cognates in all Northern Finnic languages), from veli ‘brother’ (< *velji, as mentioned). I take it as probable that clipped derivation has been around for a good millennium or two in Finnic by now, even if it has never been very likely to leave lasting records.

As for cases that could bridge this handful of ancient-looking examples with 20th-century slang, I’m foremost thinking of adjectives showing “suffix alternation”. At least formally, it is possible to reanalyse a stem and agglutinate -kka to that. But nothing really precludes a “clipping” analysis either. E.g.:

  • jämeä ‘stiff’ ~ jämäkkä ‘sturdy’ (PU *jämä)
  • kimeä ‘high-pitched’ ~ kimakka ‘id.’ (*kima, √kima?)
  • kalpea ‘pale’, kalvas ‘id.’ ~ kalvakka ‘paleish’ (*kalpa, √kalpa?)

— But even if some of the examples above are indeed clipped derivatives (I would suggest kiukku and tiukka as probable cases, due to e.g. their proto-forms with long vowels), this is unlikely to be the full story either. In particular haukka is not a derivative of any kind, but rather a loan in its entirety (← Proto-Germanic *habukaz).

Since it seems futile to cover the remaining cases by a single rule, it is probably wise to not attempt this. I am therefore leaning towards the option that there are no less than three similar but distinct sound changes involved here:

  1. *V̆vU > VU, in western Finnish (the haukka and also pau, koju type)
  2. *CvU > CU, across all Finnish varieties, perhaps most of Finnic, though later than *b > *β > v (the käry, taluttaa type)
  3. *Vwə > *VU, in Proto-Finnic times under so far unclear conditions (a few e-stem derivatives such as loukko and tyys-; possibly the savu group).

Type 3 seems moreover likely to be identical to the rise of some Proto-Finnic instances of long *UU: e.g. PU *śowə > WU *śuwə > *śuw > PF *suu = Fi. Es. etc. suu ‘mouth’; PU *tiwənə > *tiwnə > *tiüni > PF *tüüni = Fi. tyyni ‘calm’. [14]

It remains to be seen how well an analysis of data also from outside Finnish will support this division. To reiterate, I would in particular predict being able to find some further examples of type 2 from the other Finnic languages, involving derivatives in -U that have no exact Finnish counterparts.

An initial blind test already turns up at least one candidate in confirmation. Taking at random one Finnic root of suitable shape: *harva ‘rare, sparse’, I could predict that a derivative *harv-u would later yield haru. A word of this shape indeed turns out to be attested from southern Karelian, in the reasonably suitable meaning ‘watered-down milk’. But a fuller derivative hunt will have to wait for later.

[1] I was going to say “morphophonological”, but really my view is that at least some 80% of all “processes” proposed by morphophonologists educated in generative phonology are not synchronic rules of phonology at all, but merely the still-visible historical residue of former diachronic sound changes. In this particular case, too, it’d take far more mental gymnastics or morphophonological epicycles to explain why underlying /velji/ would surface as [ˈʋeli], while e.g. in the plural genitive, apparent underlying /velj-i-en/ surfaces as [ˈʋeljien]  — than to simply assume that the nominative of ‘brother’ is stored as the separate lexeme /veli/.
(To be fair, I’ve seen recent generativist work taking the stance that a level of “lexical” phonology between “deep structure” and surface realization needs to be posited after all, e.g. Kiparsky, “Formal and Empirical Issues in Phonological Typology”. This will likely go a good way towards rectifying the situation, but it may still be a while before people will be willing to consider e.g. that most allomorphy can be modelled as simply a subtype of synonymy.)
[2] So far, anyway. Any book that has ~1200 footnotes will contain much information that is not in the expected place.
[3] Even here I am actually not fully sure that breaking *ü- > *vɨ- can be ruled out (similar to Mordvinic, where *ü- > *ve-). Reconstructing instead Finno-Permic *ülä- would make it slightly easier to reconcile this with East Uralic *ilə- (> Mansi *äl-, Khanty *eeL-, Samoyedic *i-). But the zero onset in the latter could perhaps also be explained as analogy from *ëla- ‘down’.
[4] The case of kaluta is mentioned by Rapola; he however entertains also the possibility that they would not involve suffixation, but rather a “Sievertian” development -lv- > -lu- (and, presumably, the resulting trisyllabic stem *kalua- being then reanalyzed as if it were an original contraction stem *kaluda-, hence the modern infinitive kaluta and not kaluaa). There are no exact parallels for such a change; southwestern Finnish has the relatively similar -sv- > -su- (kasuaa ‘to grow’, rasua ‘fat’), but kaluta is pan-Finnish.
[5] A comparison I have previously proposed in the comments section.
[6] It would be an interesting question how these derivational cases diverged from *-Abi > *-AU > *-AA in 3rd person singular forms (as in *aja-bi > *ajau > ajaa ‘drives’), but I would presume some analogy in some direction is involved.
[7] Vowel length in this suffix is, per the usual explanations, due to complicated multi-stage analogy.
[8] To coin a translation for the useful concept expressed by German lautgesetzlich / Finnish äännelaillinen.
[9] In southwestern Finnish dialects different forms, such as koju ‘birch’ or aju ‘brain’, can also be found. Influence from Estonian is very much not ruled out though.
[10] Karjalan kielen sanakirja lists the forms savvu, vävvy and havvu from the southernmost dialects of Olonetsian, in the villages of Kotkatjärvi, Nekkula and Riipuskala.
[11] Itkonen, Erkki. “Beiträge zur Geschichte der einsilbigen Wortstämme im Finnischen”. — Finnisch-Ugrische Forschungen 30: 1–54.
[12] With Lehtinen’s Law blocked by the third-mora element, hence not *veevü. — Samic *vīvë is very difficult to account for. The apparent development *-ŋ- > *-v- has previously inspired suggestions of loaning from early Finnic, but in light of also the stem vowel mismatch, something like *wäŋəwə > *weŋəwə > *weŋwə > *wēɣwə > *vējvë > *vīvë (where the original *-ŋ- isn’t what yields *-v-) could also be within the realms of possibility.
[13] Modern Finnish still disallows overheavy syllables containing a long vowel and a coda cluster. Pointti ‘(rhetorical or score) point’ and jointti ‘marijuana joint’ are possibly first heralds of the syllable structure CVVCC making a more general entrance, but e.g. tietska is rather syllabifiable as tiet.ska, with a word-internal onset cluster, much like we need to assume also for loanwords such as ekstra (= probably eks.tra).
[14] My account of *üü in here is tentative — it would have to pre-date *ti > *ci, and it’s possible that there are grounds to exclude this ordering. I’ll have to fiddle with my poset model of Proto-Finnic relative chronology to see if this can be made fit in…

12 + 1 old Indo-European loan etymology sketches

Most of the following are not-fully-polished thinking-out-loud analyses. Feel free to point out any inconsistencies, unadmitted weaknesses, and other general plotholes that you may spot.

1. peni

No clear Proto-Uralic root for ‘dog’ is known. We instead have one eastern and one western candidate: Ugric #ämpɜ on one hand (though close /e/ in Hungarian ëb raises suspicions about whether the involved words are common inheritance with each other after all), Finno-Permic *penä(j) on the other. Samoyedic has a third root yet, *wën, but this has been explained as an early loan from Tocharian.

The Finno-Permic root has been often incorrectly reconstructed as *penə (UEW: *pene); but Samic *peanëk and Mordvinic *pińɜ both indicate *penä, while the Finnic *i-stem *peni- (not **pene-!) can derive from either earlier *penə-j- or *penä-j- equally well.

IE loan origin seems possible to suggest for this as well. Getting from the usual PIE word *ḱwō : *ḱun- to the Uralic form may seem difficult, for one because the substitution *Kw → *p does not really have credible parallels (while examples with something like *Kʷe- → *ko-, *ku-, *kü- are better attested). We can however find secondary /p/ developing in a suitably close-by branch: Central Iranian [1], where *ḱw > *ćw (> ? *cβ) > *sp.

The front vowel in Uralic creates some problems. If I was called Jorma Koivulehto, this would be my cue to propose an alternate *e-grade protoform for Indo-Iranian and to propose postdating the common Indo-Iranian sound change *e > *a as at least this late; a manoeuver that he has previously used to account for some other II loanwords as well. Or, in principle, another option would be to assume an intermediate dialect group of Indo-European, featuring a mix of Iranian and more archaic features. [2]

These are not especially parsimonious lines of approach, though. Instead, I have begun to suspect that not all such “e-loans” are archaisms retaining PIE *e at all. They seem to be disproportionally western in distribution, contrary to what we’d expect from ancient loans acquired before *e > *a in II (at least if we still wanted to hold some later loans as essentially Proto-Uralic — though this is perhaps not warranted either).

An explanation could perhaps lie inside Uralic. One of the more heavily Iranian-influenced branches of Uralic is Permic. In here, PU *e and, under some conditions, PU *a happen to have the same reflex: *o (thus Komi /pon/ ‘dog’, but also e.g. *aśkəl > /vośkol/ ‘step’). Most accounts have assumed that the trajectory of the development *e > *o was straight backwards drift, something along the lines of *e > *ö > *ȯ > *o. It however seems difficult to find any precedents at all for an unconditional labialization *e > *ö (even if the later steps seem plausible). I therefore wonder if this was rather a centralization development along the lines of *e > *ə̈ > *ɜ > *a? which would then have been followed by a general shift *a > *o, as a part of the late Proto-Permic back chainshift (where also *o > *u, *u > *ʉ > /ɨ/). And then — perhaps pre-Permic Indo-Iranian loanwords with *a could have been by default nativized with *e in more western Uralic dialects: e.g. Iranian *spān- (accusative stem) → pre-Permic *panV → western Uralic *penä?

Even disharmonic *pana → *pena could be an option. As noted, in Finnic we only find the *j-derivative *penVj > *penej > *peni ‘dog’ (SSA mentions Savonian pena ‘brat’, but due to narrow distribution this seems more likely to be a late descriptive backformation than the original root); while Samic, Mordvinic and Mari fail to show the loanword-introduced distinction between *e-a and *e-ä.

Accounting for Hungarian fene ‘wild’, which in the past has occasionally been considered a reflex that has semantically drifted out of sync, seems more difficult under this scenario. I would be content to leave it out of this etymology.

2. kero

I’ve identified another new “e-loan” candidate as well. This is the root traditionally reconstructed as PU (PFP) *kerV ‘throat’, reflected in e.g. Finnish kero, Estonian kõri, Samic *kërës, Permic *gor. However, resemblance with PIE *gʷel- ‘throat’ is unavoidable, even more so once we factor in early Indo-Iranian sound changes to reach *ger-.

As also in a couple of other cases [3], the “sporadic” initial voiced stop in Permic appears to simply continue the initial voiced stop on the IE side. It follows that loaning into unitary Proto-Finno-Permic cannot be assumed: we’re probably rather dealing with separate loaning in Permic and Samic/Finnic. Perhaps then again in the latter through the former? The Finnic and Samic words seem to each point to different stem shapes too, namely preF *kera- vs. preS *kerəs — the latter retaining the characteristic IE masculine nominative singular ending, the former showing disharmony characteristic of loanwords. This would go well with a late date of the word’s introduction too.

3. äimä

A Proto-Uralic word *äjmä ‘needle’ has long been supposed, with reflexes in several branches (Samic, Finnic, Mari, Permic & Samoyedic). There are some reasons to be suspicious of this reconstruction, though, despite the seemingly perfect match between e.g. Finnic äimä (only attested in Finnish + Karelian) and Samoyedic *äjmä.

Firstly, this word constitutes one of the exceptions to *ä-backing in Finnic, as recently identified. An initial suggestion (Kallio 2012, Zhivlov 2014) has been that the change was blocked before syllable-final *j. The other relatively clear example of this (*päivä < ? *päjwä ‘sun, day’) has been suspected of being a possible derivative from a root of the shape *päjə, though, and I’ve proposed reconstructing original trisyllabic *päjəwä. The third example that could perhaps show blocking before coda *j is PF *äjjä ‘big; grandfather’ (with cognates only in Samic + Komi), but this can also be suspected to be secondary. Nowhere else is there evidence for geminate *-jj- in Proto-Uralic; moreover, the term’s distaff counterpart, PF *ämmä ‘grandmother’, seems to be derived from PF/PU *emä ‘mother’ by some kind of iconic intensifying gemination. [4] This could have been the case for *äjjä as well. Perhaps its pre-Finnic ancestor had only plain *-j-, and maybe also different vocalism altogether.

Since the evidence for this alleged exception development is starting to look questionable, it’s worth considering if the reasons for the absence of *ä-backing in äimä could lie elsewhere as well. Since this is a word root with a medial consonant cluster, a phonetically natural explanation would be to trace this, too, back to an earlier derivative *äj-mä < *äjə-mä.

A second reason to suspect that PU *äjmä might not have been a basic word root comes from the fact that the PU cluster *-jm- also seems to be otherwise unattested in primary word roots! Most examples are clear derivatives in *-mA; e.g. *kojma ‘man’ (in P, H, Ms + Selkup) ← *kojə ‘male’; *wajma ‘heart, spirit’ (in F, Mo) ← *wajŋə ‘breath, spirit’; alleged *kejmä ‘lust’ (in S, F, P) ← *kixə- ‘to rut’ (and thus better: *kixəmä); alleged *śajma ‘manger’ (in F, Mo) ← *sewə- ‘to eat’ (and thus better: *sewəmä).

Thirdly, a derivative analysis actually also makes good semantic sense. *äjmä is one of the clearest-reconstructible Proto-Uralic tool terms, and the suffix *-mA is regularly used to form instrumentals in Finnic (as *-in : *-imE-), with occasional cognates in or close to this function also elsewhere in Uralic (e.g. Mordvinic *kundamə ‘handle’; Tundra Nenets /sædoʔmā/ ‘thread’).

Altogether I therefore find it quite likely that the PU term for ‘needle’ was originally a derivative, and should perhaps be amended to *äjəmä. The basic root **äjə- does not appear to otherwise survive, but this analysis suggests a meaning such as ‘to pierce, (to be) sharp’.

— Unexpectedly, this exercise in internal reconstruction has now brought us quite close to the PIE root for ‘sharp’: *h₂aḱ-. The sound correspondences (*h₂ ~ ∅, *a ~ *ä, *ḱ ~ *j) do not suggest loaning directly from PIE, but Indo-Iranian *Hać- would make a more promising candidate for this (compare PIE *h₂aǵ- > PII *Hadź- → PU *aja- ‘to drive’).

One issue remains: we would expect PU to have rather substituted Indo-Iranian *ć by its own voiceless palatals, *ć or *ś (as also in previously known loanwords like *śëta ‘100’; *waśara ‘hammer’). Phonotactics may have interfered, though. There are almost no examples in widespread Uralic vocabulary of *-ć- or *-ś- as a single word-medial consonant; I only know of one truly good example (*kośəw or *kośəkV ‘long’), while most other cases that have been posited can be suspected to be instead from a cluster *-ńć-, from a geminate *-ćć-, or to be post-PU areal vocabulary. Perhaps this fact can have motivated a substitution *-ć- → *-j-.

4. kangertaa

Earlier this year I have, in a talk (slides in Finnish) at the XLIII Kielitieteen päivät conference, introduced a new model of the *ë/*ï split in Eastern Uralic. To summarize in brief, earlier research has supposed three essentially unrelated splits:

  • PU *ë > Samoyedic *ë in closed syllables, *ï in open ones (thus Janhunen)
  • PU *ë > Khanty *ïï, from which by the Khanty “ablaut” > *aa in several words (thus Steinitz); or, *aa by default and *ïï as an unexplained exception development (thus Sammallahti)
  • PU *ë > Hungarian i or a, with unclear conditioning (possibly initially *a, with i as a back-development in palatal environment)

My suggestion is that all three are in fact related, and conditioned by the original stem type:

  • PU *ë-a > Smy. *ï ~ Kh. *ïï ~ Hu. a (e.g. *ïlə- ~ *ïïL- ~ al- ‘under’ < PU *ëla, cf. Fi. ala)
  • PU *ë-ə > Smy. *ë ~ Kh. *aa ~ Hu. i (e.g. *ńëj ~ *ńaal ~ nyíl ‘arrow’ < PU *ńëlə, cf. Fi. nuoli)
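The two correspondence sets amount to a lookup keyed on the stem vowel. As a small sketch only (Python; the table layout and the function name are illustrative assumptions, and the values are taken directly from the two lines above), the expected reflexes of first-syllable *ë can be read off like this:

```python
import unicodedata

# Reflexes of first-syllable PU *ë, keyed by the stem (second-syllable) vowel,
# copied from the two correspondence sets stated above.
REFLEXES = {
    "a": {"Samoyedic": "*ï", "Khanty": "*ïï", "Hungarian": "a"},
    "ə": {"Samoyedic": "*ë", "Khanty": "*aa", "Hungarian": "i"},
}


def expected_reflexes(pu_root: str) -> dict:
    """Return the expected reflexes of *ë for a bisyllabic PU root."""
    # Normalize so that precomposed and combining diacritics compare equal.
    root = unicodedata.normalize("NFC", pu_root).lstrip("*")
    if "ë" not in root:
        raise ValueError(f"{pu_root!r} has no *ë")
    stem_vowel = root[-1]
    if stem_vowel not in REFLEXES:
        raise ValueError(f"no correspondence set for stem vowel {stem_vowel!r}")
    return REFLEXES[stem_vowel]


print(expected_reflexes("*ëla"))   # the *ë-a type, as in 'under'
print(expected_reflexes("*ńëlə"))  # the *ë-ə type, as in 'arrow'
```

The sketch deliberately knows nothing about later loanwords, which is exactly where the predicted and attested reflexes part ways.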

(A few facets of this model I have already mentioned in some earlier blog posts.)

The conditioning appears to have later been blurred by the introduction of Indo-European loanwords, which has introduced words that rather point to a development *ë-a > Kh. *aa. Four examples of this correspondence are known by earlier research:

  • alleged PU *śëta > Kh. *saat ‘100’ (cf. Fi. sata)
    ← Indo-Iranian *ćata-
  • alleged PU *śëlka(w) > Kh. *saaɣəL ‘pole’ (cf. Fi. salko)
    ← (pre-)Balto-Slavic *dźalga-
  • alleged PU *kënta(w) > Kh. *kaant ‘foundation for a storehouse on a post’ (cf. Fi. kanta ‘basis’, kanto ‘tree stump’)
    ← Indo-Iranian *skandʰa-
  • alleged PU *pëŋka > Kh. *paaŋk ‘fly agaric’ (cf. Smy. *pëŋkå- ‘to get drunk’)
    ← PIE *(s)pongo- ‘mushroom’; or Indo-Iranian *bʰanga- ‘hemp, ? intoxicant plant’ (only in Indo-Aryan)

I propose that all of these have simply been borrowed late enough to escape the *ë/*ï split in native vocabulary. They do not even seem to point to common East Uralic *ë: in Hungarian we find száz ‘100’ (not ˣszíz), and szálka ‘splinter’, szálfa ‘log’ (not ˣszílka, ˣszílfa).

A fifth case can be added to the tally. A recent etymological comparison from Aikio [5] connects Finnic *kangërta-, Samic *kōŋkërtē- ‘to crawl, move with difficulty’ with the long-known Ugric verb root *këŋkV-. We see here quite similar vowel correspondences as above: in particular, long á in Hungarian hág ‘to step (up on)’, *ëë in Mansi *këëŋk- ‘to climb’. In Western Khanty we find an “u-ablauted” reflex *xooŋx- ‘to climb’ (possibly < PKh *kɔɔŋk- ← ? #kaaŋku-), while Far Eastern /kɑŋət-/ and Western *xaaŋteep ‘stairs, ladder’ point to a stem variant *kaaŋt- (presumably < earlier *kaaŋk-t-). This time the West Uralic cognates do not require an earlier *a-stem, but they also do not necessarily speak against it. While *-ər- is a rather rare verbal derivational suffix, a well-attested precedent is *pu(ń)ća- (> Samic *počē- ‘to squeeze’ etc.) → *puć-ər- ‘id.’ (> Fi. pusertaa, Hu. facsar ‘id.’ etc.)

The various Uralic words appear likely to derive from the IE verb root *ǵʰengʰ- ‘to step’. Hungarian and the Khanty words for ‘stairs’ would remain semantically the most archaic, with ‘to climb’ developing as a later meaning (if within Uralic or in some loangiving IE variety is not obvious), ‘to crawl’ probably even later. To account for the lack of satemization, we would need to reckon with very early loaning from just about PIE; or, as seems a tad more likely to me, secondary diffusion to Ugric through early West Uralic and pre-Germanic.

UEW’s hesitant comparison of Komi /kaj-/ ‘to climb’ with this word group does not seem to be really feasible.

5. ilo

Finnic *ilo ‘joy, mirth’ has no accepted etymology. A few Samic counterparts are known, but these are limited to the central dialects, and can be easily analyzed as loans from Finnic. Possibly in more than one layer though; forms pointing to Proto-Samic *ë < *ɪ and showing a more divergent meaning, such as Pite âllo ‘inclination’, can plausibly have been earlier loans than forms retaining /i/, such as North illu ‘joy’.

Since the word has word-initial *i-, it’s possible to ask if this might go back to earlier *je-, as I’ve proposed to be the case for several other words in Finnic as well. This seems to allow finding a promising loan original in Indo-European: the root *ǵelh₂- ‘to laugh’. IE *ǵ⁽ʰ⁾ → Uralic *j is well enough attested in some early loanwords of both Indo-Iranian and Balto-Slavic origin. This particular root does not happen to be reflected in either branch, but perhaps the next best thing is still available, namely Armenian. [6] We are not limited to bare root comparison, either: it appears possible to match the ending in the derived noun *ǵélh₂-ōs ‘laughter’ (> Greek γέλως, ? Armenian ծաղր) with *-o in Finnic.

Another Finnic noun, *ilka ‘tease, (mean) trick, practical joke’ could be perhaps analyzed as a parallel loanword from this PIE root. This would then involve a seemingly more archaic sound substitution *h₂ → *k, though I’m sure this and *h₂ → ∅ can have coexisted for a while (compare etymology #10 below). On the other hand, the older explanation as some kind of a backformation from *ilkëda ‘bad, mean’ (of Germanic origin) remains entirely feasible as well, and perhaps semantically preferable. It also looks phonologically more straightforward, since in an old enough loanword an ä-stem **jelkä > **ilkä would be more expected than a disharmonic a-stem.

6. keev

One of the more obscure Finno-Samic etymological comparisons, though still well captured by the usual major sources, is an animal husbandry term surviving only in Livonian and Eastern Samic: Liv. keev ‘mare’ (borrowed also into Latvian: ķēve) ~ Inari Sami kiäváš, Skolt ǩiõvv etc. ‘reindeer cow’ (< PS *kēvë). The traditional reconstruction has been *keewe. Following the abandonment of vowel length in pre-Finnic reconstruction stages, this probably needs to be amended to *käwə, with lengthening *ä > *ää > *ee due to Lehtinen’s Law in Finnic (and as business as usual in Samic).

This adds up to an interestingly symmetric behavior of low vowel + glide roots in Finnic: “homorganic” *-äjə, *-awə apparently remain unaffected (as in Fi. täi ‘louse’, savi ‘clay’), while “heterorganic” *-äwə, *-ajə are lengthened.
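The symmetry is easy to state as a rule of thumb. Here is a minimal Python sketch of it, assuming schematic root shapes ending in low vowel + glide + *ə; the helper name `lengthens` and the test roots are illustrative inputs of my own, not a full formalization of Lehtinen’s Law.

```python
import re

# Hypothetical helper: matches a root-final sequence of
# low vowel (a or ä) + glide (j or w) + ə.
LOW_VOWEL_GLIDE = re.compile(r"([aä])([jw])ə$")

# "Homorganic" pairs as described in the text:
# front ä with palatal j, back a with labial w.
HOMORGANIC = {("ä", "j"), ("a", "w")}


def lengthens(root: str) -> bool:
    """Return True if the root ends in a heterorganic low vowel + glide + ə."""
    match = LOW_VOWEL_GLIDE.search(root)
    if match is None:
        return False  # the rule of thumb says nothing about other root shapes
    return (match.group(1), match.group(2)) not in HOMORGANIC


# Schematic inputs; only the final -VGə sequence matters here.
for root in ["käwə", "täjə", "sawə", "kajə"]:
    print(root, "lengthened" if lengthens(root) else "unaffected")
```

This only encodes the four-way pattern itself. It says nothing about relative chronology, which is exactly where the exceptions come from.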

One other example of *-äw- is known too though, without lengthening — and it’s a perfect minimal pair, even: *käü- ‘to go, walk’ (~ frequentative *käv-ele-), suggesting likewise earlier *käwə-. However, as this is nowadays normally considered a Germanic loanword (← *skēwjan-) [7], it could be assumed to have arrived only after inherited *-äwə- >> *-eewe-. Despite some searching, I know no clear examples of vowel lengthening due to LL among the Baltic and Germanic loanwords in Finnic. (It ranks as one of the earliest Finnic sound changes also in relative chronology, and I would presume it has taken place already during the initial dialect diversification of West Uralic, somewhere around the upper Volga watershed.)

Back to *käwə: as a cultural term with narrow distribution, loan origin is likely already a priori. And indeed, at this point, resemblance to Indo-Iranian starts again being apparent: cf. *gāwš ‘cow’ (< PIE *gʷōw-). The meaning ‘mare’ in Livonian is a little bit off, but surely no more of an issue than e.g. the long-accepted comparison Finnic *lehmä ‘cow’ ~ Mordvinic *ľišmɜ ‘horse’. We also know of at least one precedent of an II loanword from the same semantic field: the common western Uralic words for ‘reindeer’ (approx. *počaw, if we wanted to set up a single proto-form [8]) derive from PII *paću ‘cattle’ (< PIE *peḱu-).

It is not clear to me if *ā → *ä should be cause for worry. The typical frontness/backness development across Iranian appears to be for *a to front vs. *ā to back (including in Ossetian, which suggests that this split has taken root early). However, loaning from the oblique stem *gaw- would be possible as well.

7. seaibi

The common Samic word for ‘tail’ is reconstructed as *seajpē. For pre-Samic (≈ proto-West Uralic), *sejpä or *šejpä would be implied. The word sports an unusual medial cluster *-jp- and has no reliable cognates elsewhere in Uralic; it can be easily suspected to be a loanword.

Indo-Iranian again offers a good loan original candidate. Indeed, several of them… Late Avestan xšuuaēpā-, Sanskrit śepa- and Prakrit cheppā- (all ‘tail’) fail to point to any clear common proto-form (though some ad hoc cluster could surely be set up [9]). They all regardless suggest, at minimum, the same consonant skeleton *S-jp- as in Samic, which seems a bit too good to be a complete coincidence.

As we’re again dealing with an “e-loan”, but now without Permic cognates, initially the explanation options would seem to be positing early loaning (which however seems unlikely per inner-II irregularities), or a la Koivulehto, late retention of *e. However, the II diphthong *ai likely could have later developed separately to a form close enough to *ej. Indeed, *ai monophthongizes in most (if not all?) later Iranian languages, even though per Avestan and Old Persian we know this development to have been firmly post-Proto-Iranian.

8. oksi

Attempts at reconstructing a PU word for ‘bear’ are most likely futile, due to ubiquitous taboo circumlocutions being used for the animal even by several groups of modern-day Uralic speakers. In the southwesternmost branches, Finnic and Mordvinic, one common root is identifiable though: *oktə, giving F. *okci / *oht-o (> standard Fi. hypercorrect otso) and Mo. *ovtə (? *oftə).

PIE *h₂r̥tḱos ‘bear’ may at first glance look quite far-removed from this. Factor in laryngeal loss and *tK-metathesis though, to reach *r̥ḱtos: rather closer already. A three-consonant cluster **-rkt- could not have been retained in early Uralic, so substitution as simply *-kt- seems possible. Initial *o could represent a variety of histories — e.g. direct substitution for syllabic *r̥, an early IE dialectal feature (cf. Latin ursus?), or even a word-initial development *a- > *o- in Uralic.

Unexpected retention of *o in Mordvinic (compare e.g. *oksə-nta- > *uksnə- ‘to vomit’) might also receive an explanation through this etymology. Aikio (2013) (see again footnote 5) reports one apparent environment where the development *o > *o is regular: before *ŋ, as in e.g. *joŋsə > *joŋs ‘bow’, *poŋə > *poŋ(gə) ‘bosom’. This could be further generalized to the environment before a velar sonorant: *o > *o appears to be regular also before *w (*powa > *pov ‘knob’, *śawə > *śowa > *śovə-ń ‘clay’); and even before *lk (*olkə > *olgə ‘straw’, *ńolkə > *nolgə ‘snot, slime’), where *l may have been at the time realized as *[ɫ]. If so, then perhaps an early pre-Mordvinic *orktə was similarly realized with [rˠ], which could have triggered *o > *o, before the full nativization of the root as *oktə?

This is all fairly complicated though, and other explanations are surely possible: e.g. that by the time of loaning, PU *u had already been reduced to [ʊ] in pre-Mordvinic; and *[ʊr] was then used as a substitute for Indo-European *r̥. Assuming that epenthesis to [ər] had already taken place in the latter would help too.
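For what it’s worth, the retention environments suggested above (before *ŋ, *w and *lk, plus the speculative *rk) are simple enough to sketch as a pattern check. This is a rough Python illustration only: the function name `o_retained` and the `extra_environments` parameter are my own assumptions, and the check looks at any *o in the form, without verifying that it is in the first syllable.

```python
import re

# Environments in which *o is reported or suggested above to be retained
# in Mordvinic: before *ŋ, before *w, and before *lk.
BASE_ENVIRONMENTS = ("ŋ", "w", "lk")


def o_retained(form: str, extra_environments: tuple = ()) -> bool:
    """Return True if the form has *o directly before a retention environment."""
    environments = BASE_ENVIRONMENTS + tuple(extra_environments)
    pattern = "o(?:" + "|".join(re.escape(env) for env in environments) + ")"
    return re.search(pattern, form) is not None


# Reconstructions quoted in the text.
for form in ["joŋsə", "poŋə", "powa", "olkə", "ńolkə", "oksənta", "oktə"]:
    print(form, o_retained(form))

# The speculative pre-Mordvinic *orktə only qualifies if *rk is added
# as a further environment.
print("orktə", o_retained("orktə", extra_environments=("rk",)))
```

As written, *oktə itself falls outside the rule, which is the whole reason an intermediate *orktə would be needed.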

This time, loaning from Indo-Iranian seems to be out of the question, since I gather that nowadays the prevailing analysis is that Sanskrit kṣ in ṛ́kṣa- ‘bear’ does not result from metathesis, but from (hypercorrect?) dissimilation from *tś < *tć < *tḱ. This seems to be confirmed by how Prakrits have riccha ~ accha with cch, rather than expected kkh < *kṣ.

It may be somewhat of an issue that direct descendants of *h₂r̥tḱos have not been attested from our next most likely loangivers: Balto-Slavic and Germanic. However, as their attested words for ‘bear’ are analyzable as taboo circumlocutions as well (“brown one”, “honey-eater” etc.), it is probably reasonable to assume that the older word was still around as well up until some point, instead of self-destructing as soon as PIE split into dialects. The Finnic word later shows a rather similar history: *okci has been mostly eclipsed by its substitute *karhu (which has later been still felt strong enough to require circumlocution), and it only survives as diminutives in Finnish and Estonian; in some place names; and in Livonian okš.

Or indeed: we would seem to have little reason to assume *oktə having been the earlier main term for ‘bear’ on the Uralic side. It could also have spent its history mostly as a circumlocution term, and risen to a new neutral term only in Mordvinic and Livonian separately.

9. xaws

Northern Mansi /χɑws/ ‘ash-gray’ ~ Southern Khanty /χɑ̆wəs/ ‘gray-haired’ is a part of the common Ob-Ugric lexicon with no known Uralic or Ugric origin. There are also phonological reasons to assume that this is indeed an innovation: Southern Khanty word-medial /-w-/ in a back-vocalic environment is highly rare.

If you’ll bear with me for another historical phonology tangent: the canonical analysis by Steinitz is that no Proto-Khanty medial **-w- is to be reconstructed at all, and that medial *-ɣ- developed in Western (= Southern + Northern) Khanty to /-w-/, when stem-final and following either *o, *oo, or a front vowel (but not following other labial back vowels: *ɔɔ, *uu). The latter condition sounds awfully arbitrary, though. There seems to be no good reason why labialization should happen only after close-mid vowels specifically. The words reconstructed with his *-ooɣ or *-oɣ also fail to align with expected vowel correspondences. For regular examples, compare Southern /joχət/ ~ Far Eastern /joɣət/ ‘bow’ (< *jooɣət) or Southern /tŏχət/ ~ Far Eastern /tŏɣəl/ ‘feather’ (< *toɣəL). In the cases with /w/, we instead find correspondences such as Southern /taw/ (with a front vowel!) ~ Far Eastern /loɣ/ ‘horse’ (< ? *loɣ), or Southern /ŏw/ ~ Far Eastern /oɣ/ ‘stream’ (< ? *ŏɣ).

In Western Khanty, any exceptional vowel developments can in principle be explained as being conditioned by /-w-/, regardless of how this first arose. But if /-ɣ-/ in Eastern Khanty is supposed to be a retention, it would be rather bizarre for it to condition exceptional vowel developments exactly in those word roots where a WKh /-w-/ also exceptionally develops.

What I consider more likely is that a distinction between *-w- and *-ɣ- should be reconstructed for Proto-Khanty after all, although we can only clearly identify it in back-vocalic words in Western Khanty. [10] This finds support from etymology, too. In a few cases, (Western) Khanty words with /-w-/ derive from Proto-Uralic roots that also have *-w- (e.g. ‘stream’ above < PU *uwa; compare e.g. Northern Sami avvit ‘to leak’), and seem to have simply retained the consonant; while words of the shape /(C)OɣəC/ generally derive from words with an earlier cluster *-kC- or *-Ck- (compare e.g. NS juoksa ‘bow’, dolgi ‘feather’).

The ‘gray’ word seems to provide corroboration for this reanalysis of Proto-Khanty. The traditional reconstruction scheme cannot really accommodate Southern Khanty words of the shape /COwəC/; at best they could be secondary derivatives from a root of the shape *COɣ. And while Northern Mansi is known to have several loanwords from Northern Khanty, in this case no Northern Khanty reflex appears to exist. Hence the NMs cognate would seem to show that the word cannot be considered a late innovation in Southern Khanty: the word should be traced in its entirety at least back to the common Ob-Ugric period.

Going further back from there, though, runs into difficulties again. Reconstructible Proto-Uralic clusters of the shape *-wC- are in Khanty regularly simplified to just *-C- (e.g. *lewlə ‘spirit’ > PKh *liiL; *kowsə ‘spruce’ > PKh *kooL), while those of the shape *-Cw- seem to give *-Cəɣ (e.g. *tälwä > *teləɣ ‘winter’). This leaves us with no plausible inherited source for apparent Ob-Ugric *kaws ‘gray’.

There may be some grounds for attempting to set up a concrete loan etymology, as the adjective shows intriguing resemblance to PIE *ḱyeh₁wós ‘gray, dark’. Phonetics remain problematic though. Loaning from Indo-Iranian (Sanskrit śyāva- etc.) is again not an option, due to the retained initial velar: the routing would either have to be from just about PIE, or from a specifically Centum variety. Tocharian B kwele, with syncope of the original root vowel and an additional suffix, is however not really close enough either. — The second problem is the back vowel *a in Ob-Ugric, matching poorly with PIE *-ye-. I could of course speculate if this word was derived not directly from Indo-European, but instead from whatever substrate preceded Ob-Ugric in western Siberia… but this contributes nothing productive.

For the time being, in the absence of phonetic parallels or other clarifications, this comparison seems to be stuck in the limbo of “possible but not probable”.

10. aač

An alleged Proto-Uralic (Proto-Finno-Ugric) word for ‘sheep’ has been for long reconstructed as approximately *učə (UEW: *uče). The reflexes however show a tremendous amount of irregularities (more on this to come later in a separate post of its own), and I am convinced that this etymon is mostly erroneous: the words might be instead separate IE loans of varying ages.

The case seems to be the clearest for Ob-Ugric. Mansi *aaš ~ Khanty *aač is, in itself, a very regular comparison. This is however just about the only allegedly inherited word where the vowel correspondence *aa ~ *aa appears. Most others are either of unknown origin, Indo-Iranian loans, or even late Komi loans. The raising *aa > /oo/ in non-southern Mansi is as late as the 18th century, and the same change in Southern Khanty could be fairly recent as well. All the way up to this terminus ante quem, loanwords of any origin could easily have been adopted with *aa everywhere across Ob-Ugric.

A natural loan origin is provided by Proto-Iranian *adz- ‘goat’ (< PII *Hadź- < PIE *h₂aǵ-), whose unpalatalized *dz would have been substituted on the Uralic side by *č (as also e.g. in ‘reindeer’, tangentially mentioned above). The minor semantic difference seems like a lesser obstacle than the numerous phonetic difficulties in connecting these words to their western Uralic equivalents (such as Fi. uuhi, Erzya /uća/); and could be even related to sheep-rearing faring generally better than goat-rearing in the colder taiga zone.

In the absence of phonetic or other faultlines to dig into, I do not take any stance here on if we should assume loaning into already separated (pre-)Mansi and (pre-)Khanty, or into unitary (pre-)Proto-Ob-Ugric, which does not seem to make a difference on the viability of the etymology either way.

11. hajt

The Hungarian verb hajt comes with numerous meanings. Analyses normally break these into two homonymous groups, one with a rather polysemic range of meanings such as ‘to drive, to herd, to move, to repeat’; the other with the more restricted range ‘to fold’.

The first cluster has been equated with Mansi *kujt- ‘to chase’. As the correspondence *-t- : *-t- normally goes back to a cluster *-tt- or *-pt-, these verbs probably need to be analyzed as derivatives from a root *kajV- or *kojV-; indeed also UEW’s reconstruction approach.

This root however looks quite similar to the other, better-known and wider-distributed (S, F, P, Ms) Uralic root for ‘to drive, chase’, which is *aja-. I believe this is not an accident. The latter has been long since considered a loanword derived from, as mentioned above, PIE *h₂aǵ- ‘to drive’. The H-Ms root can be analyzed as a parallel loan from the same as well: the initial *k- is straightforwardly accountable by the reasonably well-attested word-initial substitution pattern *h₂ → *k. If this should be taken as chronologically earlier (it probably requires a relatively un-weakened sound value for *h₂ at the time) or simply a competing nativization strategy is not obvious, but will not create any significant trouble either way.

12. jam

The Proto-Samoyedic word for ‘sea’ has been reconstructed as *jam (yielding, among other reflexes, Old Nganasan jam, Nenets jām). An etymology suggested by Helimski derives this, through earlier *ľam < *lamə, as a loanword from Proto-Tungusic *lāmu ‘id.’

The notion of Proto-Tungusic loanwords in Proto-Samoyedic strikes me as unexpected, though. There are several thousand kilometers separating the Sayan mountains (the likely Samoyedic homeland, or at least close by to it) and the lower Amur (the likely Tungusic homeland). It might be possible to reckon with adjustments of various kinds of course, such as adoption from early Evenki (the only Tungusic variety that has clearly been in contact with most of the Samoyedic-speaking area), combined with pushing the pan-Samoyedic development *l- > *j- substantially forward.

However, another etymology seems to be available too. The Tocharian A word for ‘sea’ is lyam, which would work as a loan original about as well as the Tungusic word. Loaning from Samoyedic into Tocharian is apparently ruled out, since this is a word with a good Indo-European pedigree (akin to e.g. Greek λίμνη).

There are a few phonetic kinks to work out. Both the IE etymology (thru earlier *lim-, the zero-grade of √(s)leym- ‘slime etc.’) and Tocharian B lyäm /lʲɨm/ seem to get in the way of straightforward loaning from Proto-Tocharian into Proto-Samoyedic: we’d instead expect something like **ľïm > **jïm or **ľɪm > **jə̈m in that case. Even the Toch. A vowel transcribed ‹a› was likely something in the *[ɐ ~ ə] region, in contrast to ‹ā› being the cardinal /a/, and so we might instead expect to see PSmy **ľəm > **jəm?

The chronological point brought along by having to prefer loaning from Toch. A specifically may provide a solution, though. If we again assumed that *ľ- > *j- took place late across Samoyedic (a slightly weaker assumption than postdating both this and the earlier change *l- > *ľ-), it will be relevant that Southern Samoyedic regularly shifts *ə > *a. After this, ‘sea’ would presumably be loaned from Tocharian as *ľam; and upon diffusion of the term into more northern dialects, the vowel could well be retained. — Alternately, late loaning would also allow assuming that Tocharian */lʲ/ was substituted as *j.

It might even be possible to tie both etymological groups together, and to suggest a borrowing chain Tocharian → Samoyedic → Tungusic. [11] Tungusic has no palatal lateral **ľ, so early South Samoyedic *ľ- would be naturally substituted as *l-. (If the vowel correspondences check out in this direction, too, seems however like a more precarious question that I am not currently equipped to address.)

That’s all I have on early loanwords from Indo-European into Uralic, for the time being. I have one going in the opposite direction too, though:

1. blow

Germanic *blewwan- ‘to beat up’ has no known Indo-European etymology. Etymological dictionaries sometimes set up a PIE preform *bʰlewH-, but without any other comparative evidence backing this up.

This root shows clear similarity though to the widespread Uralic root for ‘to hit’, usually reconstructed as *lewə-. Being attested as far as Mansi and Samoyedic, loaning from Germanic is right out of the question. Loaning from PIE would be theoretically feasible, but this does not really seem like sufficient grounds for projecting the Germanic verb that far back, either. If this resemblance is on to something, we would seem to have to instead consider the direction Uralic → Germanic.

Initial *bl- may look like an obstacle. However, this could be accounted for by a fossilized prefix *b- < *bi- ‘at’ (much like can be seen in German bleiben, Swedish bli vs. dialectal English belive). Semantically this works perfectly: “to beat” is precisely “to hit at, to keep hitting at”. Loss of the prefix vowel would probably have to have happened here already in PGmc, though.

The geminate *-ww- looks a bit trickier to account for. Nothing would strictly speaking prevent taking this as evidence for instead reconstructing Uralic *lewwə-; but again, since there is no substantial evidence for geminate glides in PU otherwise, this would be firmly an obscurum per obscurius explanation. Perhaps the proposed pre-Germanic reconstruction with *-wH- is the key instead. It would be quite possible to also reconstruct Uralic *lexə-, and assume that *-wH- represents the substitution of the early Finnic reflex of *-x-, which I believe was at one point likely a back unrounded glide, roughly [ɰ] or [ɣ]. Pre-Germanic *-w- could continue the velar glide aspect of this sound, *-H- the fricative aspect.

All of this matches poorly though with my earlier hypothesis that we should instead reconstruct Uralic *lüwä- or *lüxä-, from which Germanic **(b)li- or **(b)lu- would surely be expected instead…

[1] I.e. all Iranian languages other than the Persid and Saka groups.
[2] This possibility is especially suggested by how Iranian and its closest surviving Western relative, Slavic, seem to share a decent number of characteristic innovations that are missing either from Indic or from Baltic: e.g. the alveolarization of palatals (*ḱ > *ć > *c), secondary palatalization of the common Satemic velars, the shift *kh₂ > *x, the *B / *Bʰ merger, the *ā / *ō merger, or monophthongization of all diphthongs. Some of these could be independent, but the number seems a bit high for none of these to have been areally transmitted from one to the other.
[3] I do not aim for a full review in this post, but cf. e.g. Udmurt /bord/, Komi /berd/ ‘wall’ < “PU *pärtä” ← PIE *bʰr̥dʰ- ‘board’.
[4] For “intensive gemination” in family terms in Finnic, cf. also *ukko ‘old man’, likely an irregular derivative from *uros ~ *uroi ‘male’.
[5] Mentioned tangentially in the recent paper “The Finnic ‘secondary e-stems’ and Proto-Uralic vocalism”, in SUSA 95, and findable even in the handouts of his associated talk in 2013. — I would however continue to derive Finnic *kankëda ‘stiff’ from the noun *kanki ‘bar’, as per the analysis in SSA.
[6] Given the modern theory that the PIE “palatovelars” and “plain velars” should be reanalyzed as plain velars and back velars / uvulars, and that the former were only ever fronted in the Satem languages, loaning from any Centum group would be unconvincing for sound correspondences such as this, I think. I do not think loaning from pre-Armenian specifically is feasible either, but attestation there seems to suggest that the root may have once existed in early Indo-Iranian or Balto-Slavic as well.
[7] Germanic long *ē being reflected as short *ä in this word may seem mysterious. This is still perfectly accountable though by the original account given by Koivulehto upon presenting this etymology: it likely indicates a stage of development in Finnic where *ää had already been raised to *ee, while pre-Northwest Germanic still had open front *ǣ (later > *ā). This leaves just short *ä as a qualitatively faithful substitution option. — A couple of cases with *ā → *a seem to show similar development as well: the main candidates are *apila ‘clover’, *lapida ‘spade’, from Baltic *ābilis, *lāpetā, where the appearance of medial *-i- indicates somewhat late loaning.
[8] Though *o ← *a < *e worries me somewhat. If Finnish poro (< *podoi?) were a very early loanword from Samic, we might be able to get away with *pačəw instead.
[9] Lubotsky in Indo-Aryan ‘six’ proposes *pćw-. Would this mean the word being originally a derivative or a compound based on *peḱu-?
[10] I believe some indirect evidence for this contrast in other positions can be uncovered as well, but that would be a discussion for another time.
[11] Also Mongolic *lamug ‘swamp’ (> literary namuɣ), which has been proposed as an Altaic cognate of the Tungusic word, might then belong in this cluster.

Alternations and “alternations”; with data from Finnish

A theoretical device in historical linguistics that I think can easily be abused is the basic morphophonological concept of “alternation”.

To lay some groundwork: an initial issue, on which I may expand more at some point, is that several grades of what is meant by “alternation” in the first place can be distinguished. All of them come with their own behavior, and trying to treat them as equal is a surefire way of going off-track in analysis.

Firstly, some archetypal examples of morphophonological alternation are easy to think of: systematic phenomena like consonant gradation in Finnic and Samic (and the less-known case of Nganasan), or consonant mutation in Celtic. These permeate a language’s lexicon on all levels, including neologisms and other newly gained vocabulary, and are often employed for specific grammatical functions.

It would be an obvious error to treat all morphophonology as having similar wide-reaching significance though. In what I would call the second category, even indubitably productive alternation patterns can be far more minor, applying perhaps only to a single morpheme. Consider e.g. the voicing and vocalization alternations in the English past tense morpheme -(e)d and the plural morpheme -(e)s; these require separate accounts to cover all corner cases (zero-suffix pasts like led, trod and found are not quite the same as zero-suffix plurals like fish, sheep and mice), and what similarities they show in their phonological behavior are easily seen to result from general phonetic constraints — not from them sharing abstract “suffix mutation rules”.

Thirdly, it is quite common for particular types of alternations to be at least partly lexicalized. Examples like English teach : taught (< *tǣk-ja- : *tǣx-t) or Finnish niellä ‘to swallow’ ~ nälkä ‘hunger’ (< *ńälə- ~ *ńäl-kä) have obviously long since ceased to be anything but fossilized relicts. This may quite well go also for “marginally productive” alternations that only replicate themselves by analogy. Nobody would claim that PIE ablaut is productive in English, and regardless this has not stopped people from creating new strong past tense forms like shit : shat (given here the analogy of sit : sat).

Unproductive alternations can also go deeper yet, involving loaning. English is a good source of examples: we can consider e.g. Germanic/Romance doublets, such as stand ~ statue (going all the way back to PIE), or for a slightly younger example, ward ~ guard (the latter originally a Frankish loan in French, and thus linked at the West Germanic level). Cases where both sides are of loan origin are possible as well, e.g. Latin/Greek doublets such as serpent ~ herpetology, Latin/French doublets such as regal ~ royal, or doublets originally derived within Latin, such as cause ~ excuse (← causa ~ *ex-causa). Such alternations might be impossible to identify at sight, and only a deeper knowledge of etymology and language history will end up demonstrating that they in fact go back to a common root. [1]

(There would be also a neurolinguistics blog post to be written on if “productive morpho(phono)logy”, as separate from phonology, (morpho)syntax and lexicon exists as its own phenomenon at all, or if it’s all simply an issue of more or less fossilized analogies — but that’s not my main topic, nor really even particularly within my expertise.)

There is additionally a fourth sense of “alternation” however, which I think goes the least appreciated: language-internal false cognates. Whenever alternations of some sort occur within the paradigm of a single word, it’s usually a good starting idea to suppose some kind of a historical divergence, rather than flat-out suppletion. Whenever two words aren’t directly related though, and only show some degree of semantic and phonetic resemblance, presuming a relationship is far more risky. A comparison of e.g. English beak with peak, though perhaps plausible on the face, does not suffice to allow us to infer the existence of “an alternation b ~ p” — except in the banal descriptive sense that these two semantically close-by words really do differ only in their initial consonant. Historically, this similarity appears to be entirely accidental.

I’ve avoided giving too many examples above to steer clear of feeding confirmation bias. While languages typically indeed contain numerous unproductive doublets and marginal alternations, these can be entirely indistinguishable from mere chance similarities. I would consider it methodologically invalid to claim that just because two words show similarity, they should be considered etymologically related through “some” unspecified means. This kind of a conclusion should always require specific confirmation from other comparative data.

Not necessarily language-external comparison, mind you. E.g. if an alternation can be attributed to a sound change in a particular context, it would be expected that the same change has affected other words as well, and therefore created multiple similar doublets. For a specific example, lexical doublets in Finnish of the type sortaa ‘to break down; to oppress’ ~ sorsia ‘to tease’ (the latter appearing to be derived from the former with the frequentative suffix -i-) can be put on a much firmer ground already as soon as compared with the existence of an alternation -t- : -si- also in inflection, as in e.g. kaartaa ‘to move in an arc, to go around’ : past tense kaarsi. [2]

Aside from “pure” chance similarity, another risk involved in doublet-hunting is semantic contamination: similar shape can lead two words to drift toward similar meanings, if given the chance. One cautionary example could be Finnish kastua ‘to become wet’ (also kastaa ‘to dip in’, kaste ‘dew’ etc.) ~ kostua ‘to become moist’ (also kostea ‘moist’), which at first sight appear to be some kind of a related doublet, perhaps comparable to other examples of an “a ~ o alternation” (say, kajo ‘shimmer’ ~ koi ‘dawn’ [3])?

However, we know from historical, dialect and other Finnic data that the original meaning of kostua has been ‘to return, to be returned’ (and compare in Modern Finnish still the expression kostua jostain ‘to benefit from (a scheme or deal)’ [4]) — which allows it to be regularly analyzed as a reflexive derivative of the base verb kostaa, whose main meaning nowadays is ‘to avenge’ (< ‘*to return something’), clearly unrelated to moisture or wetness. The meaning ‘to become moist’ seems to have developed through the stage ‘to return to usable condition’, which for traditional leather-based (and, to some extent, wooden) tools and items can well have meant remoisturization after drying. Also relevant is the culinary habit of softening dry preserved bread in broth, brine etc. before consumption. [5] But it seems odd that this very specific semantic development would have just accidentally happened to a verb with close similarity to kastua; and it is probably a good idea to analyze this similarity as having outright motivated the semantic development.

In any case, the conclusion still is that there is no morphophonological alternation a ~ o involved here: only two etymologically unrelated word groups, some of whose members have converged in meaning.

Since etymologists are usually mainly concerned with establishing connections between words, not with tearing them down, I would expect that there are people who fail to appreciate just how easy it is for words to show accidental or at least unetymological similarity, though. It is also often difficult or impossible to positively demonstrate that a given similarity definitely is accidental; and even producing calculations on the odds of accidental resemblance will be difficult, given how there are not really any “default hypotheses” about word origins that we could feed into these.

I believe there is however at least one method for demonstrating that accidents indeed happen: we can attempt seeking phonetically unlikely doublets, and see how easy it is to get these together, as compared with doublets that would seem to suggest some other, more phonetically expectable alternation.

Over some years, I have been collecting comparisons of this type from within modern Finnish, taking up cases of any imaginable phonetic variation (generally within the initial CV(C)C unit; anything going on in later syllables is often better considered “merely” morphology). Systematic surveying is difficult, and what I have so far is likely still biased in favor of alternations I have looked more into. Regardless, the results so far are clear: even without giving too much slack for semantics, it is possible to get together at least a few surface doublets for approximately any alternation pair imaginable, while alternations with some actual historical motivation behind them generally generate larger amounts of doublets.

Some examples:

  • a ~ e: lavea ‘wide’ ~ leveä ‘wide’
  • ai ~ ie : taitaa ‘to know (a skill)’ ~ tietää ‘to know (information)’
  • e ~ i: vehnä ‘wheat’ ~ vihne ‘awn’
  • e ~ ää: retikka ‘radish’ ~ räätikkä ‘rutabaga’
  • eu ~ uu: peuhata ‘to frolic, play rough’ ~ puuhata ‘to be busy, work on various small things’
  • h ~ m: houkka ‘fool’ ~ moukka ‘boor’
  • ha ~ e: harha ‘illusion, delusion’ ~ erhe ‘error’
  • ht ~ v: kuihtua ‘to wilt’ ~ kuivua ‘to dry’
  • i ~ ö: itikka ‘mosquito’ ~ ötökkä ‘bug’
  • iu ~ ui: hiukka ‘little bit’ ~ huikka ‘sip’
  • j ~ n: koje ‘machine’ ~ kone ‘machine’
  • k ~ l: äklö ‘sickeningly sweet’ ~ ällö ‘icky’
  • kk ~ pp: tukko ‘wad’ ~ tuppo ‘wad’
  • l ~ s: lingota ‘to sling’ ~ singota ‘to shoot off’
  • m ~ s: karmea ‘terrible’ ~ karsea ‘ghastly’
  • m ~ ∅: muhkea ‘grand, bountiful’ ~ uhkea ‘voluptuous’
  • n ~ ∅: nilja ‘slime’ ~ iljanne ‘slippery ice’
  • o ~ ä: vongata ‘to pester (esp. for sex)’ ~ vängätä ‘to pester (of children)’
  • p ~ r: pöyhkeä ‘snooty’ ~ röyhkeä ‘arrogant’
  • r ~ v: rako ‘cleft’ ~ vako ‘furrow’
  • r ~ ∅: varsa ‘foal’ ~ vasa ‘calf’
  • s ~ t: surma ‘death’ ~ turma ‘ruin, accident’
  • s ~ ∅: kaista ‘stripe, lane’ ~ kaita ‘narrow’
  • sk ~ v: rieska ‘flatbread’ ~ rievä ‘flatbread’
  • t ~ v: tai ‘or’ ~ vai ‘or’
  • uo ~ u: nuokkua ‘to nod off’ ~ nukkua ‘to sleep’
  • ää ~ äy: ääri ‘edge’ ~ äyräs ‘brim’

This is a relatively representative sample, in that more than one of the above examples have demonstrably unrelated origins; more than one are also demonstrably related; several can be suspected to be the product of contamination in some direction; most however have no particular known explanation.
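
For readers who would like to try a similar survey on their own word lists, the collection step can be roughly mechanized. The following is only a minimal sketch, not the procedure I actually used: it assumes plain orthographic comparison, it only finds pairs that differ by one substituted or one missing letter (so multi-letter alternations such as kk ~ pp or ai ~ ie are missed), and the function names `alternation` and `collect_doublets` are made up for illustration. Judging whether the meanings are close enough still has to be done by hand.

```python
from collections import defaultdict
from itertools import combinations


def alternation(a, b):
    """Return the alternation separating two spellings, or None.

    Only two cases are handled: exactly one substituted letter, or exactly
    one letter missing from the shorter word (written as x ~ ∅).
    """
    if len(a) == len(b):
        diffs = [(x, y) for x, y in zip(a, b) if x != y]
        if len(diffs) == 1:
            return tuple(sorted(diffs[0]))
        return None
    if abs(len(a) - len(b)) == 1:
        longer, shorter = (a, b) if len(a) > len(b) else (b, a)
        for i, letter in enumerate(longer):
            if longer[:i] + longer[i + 1:] == shorter:
                return (letter, "∅")
    return None


def collect_doublets(words):
    """Group all candidate pairs in `words` by their alternation."""
    groups = defaultdict(list)
    for a, b in combinations(words, 2):
        alt = alternation(a, b)
        if alt is not None:
            groups[alt].append((a, b))
    return dict(groups)


# Toy word list taken from the examples above.
words = ["houkka", "moukka", "rako", "vako", "surma", "turma",
         "tai", "vai", "varsa", "vasa"]
doublets = collect_doublets(words)
for alt, pairs in sorted(doublets.items()):
    print(" ~ ".join(alt), pairs)
```

Counting the pairs per alternation then gives a crude measure of which alternations are unusually productive, which is the comparison the argument above rests on.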

You can download the full list here (Unicode encoding; contents only in Finnish so far). In case you run into encoding woes, you can also try accessing this on pastebin. I have indicated a few etymological analyses so far, but most cases await fuller analysis. Further data can surely still be gathered as well. If anyone is interested in collaboration (analysis, adding in references to earlier literature, just adding in new potential doublets, etc.), feel free to get in touch with me.

For now I will not go into what kind of more detailed conclusions could be drawn from the data… though I imagine already simple eyeballing should be enough to highlight some features.

[1] On this topic, I often wonder how much of Latin we could in theory reconstruct from just the abundant loanwords it has left in modern western European languages. Or for that matter, if given no corroboration from the rest of Romance, would we be able to identify this reconstructed Latin as an early stage of French (rather than merely an extinct relative)?
[2] That the past tense of sortaa is typically sorti is then easily accountable as analogical, especially given that other verbs yet may show active vacillation, e.g. soutaa ‘to row’ : past tense souti ~ sousi.
[3] This example, for what it’s worth, is rewindable back to Pre-Finnic *kaja- ‘shine’ ~ *kajə ‘dawn’ with a slightly “weaker” alternation, and we could be dealing with some kind of an original derivation pattern in either direction; but this remains to be confirmed.
[4] I suppose an analysis as ‘to get so excited that you wet your pants’ might be theoretically possible, if we only knew of the modern sense of kostua.
[5] For further discussion of this word family’s history, see Hakulinen, Lauri (1940): Kostea ja kostua. Virittäjä 44.

Problems in Indo-European vocalism, part 1

Looking at Indo-European studies has for a while now been giving me an impression that the usual vowel system reconstruction has unnoticed flaws in it.

They are different issues from the long-running debate on the reconstruction of the stop system, though. The traditional *i *e *a *o *u, easily attestable around the world, surely has nothing wrong in it in terms of synchronic phonology. Adding in the syllabic resonants *m̥ *n̥ *r̥ *l̥ won’t be a major typological problem, either. Rather… weird things start to pile up once we instead survey the development of this vowel system in the IE languages.

For a starting point, let’s consider Anatolian. I claim no particular expertise in the area though, so instead of getting my hands dirty with data, my commentary here follows fairly closely some short overviews by H. Craig Melchert. [1] He ends up positing (in an update to earlier views about a simpler 4+4 system) a vowel system almost identical to PIE for Proto-Anatolian: five short vowels *i *e *a *o *u, their long counterparts including *ā < *eh₂, as well as an unpaired long vowel *ǣ < PIE *eh₁ (early on also a later redacted “*ẹ̄” < PIE *ey).

All of these yield their own distinct correspondence sets, and I would not try to claim that we need to merge or split some of these phonemes. But there are some imbalances in how some different contrasts develop. Melchert does not go into featural phonology, but if we are to trust his transcription, both *e and *o would be mid vowels. Their development tendencies however diverge.

There is one general similarity: most Anatolian languages seem to show a trend of qualitatively simplifying the vowel system, towards plain *i *a *u. This is completed only in Luwian, but elsewhere, too, the mid vowels have a tendency to merge with other stuff. Short *e often yields *i in some kind of raising contexts: e.g. following *j, or when pretonic (kind of resembling Germanic). In a few other positions, there are conditional developments to *a, such as before *n in Hittite and Lydian. However, by contrast, there seems to be no evidence for a raising development *o > **u. Most Anatolian languages have generally merged *o and *ō into *a and *ā. Melchert only reports three features that allow distinguishing *o and *a:

  • In Hittite, stressed *o in closed syllables yields long /ā/, while *a remains short /a/.
  • In Lydian, stressed /o/ is found next to a labiovelar (either a stop *Kʷ or the glide *w).
  • In Lycian, the general treatment is *o > /æ/ (transcribed e; no comment on what happens to *ō).

If, in a language family elsewhere, we were faced with two correspondence sets — one of them *a ~ *a ~ *a, the other *a/ā ~ *æ ~ *a/o — I would definitely not conclude that we are to reconstruct *a and *o respectively. And I would assume that Melchert, too, only ends up reconstructing a mid vowel *o, because this is what the second Anatolian vowel corresponds to in traditional PIE, not because the reflexes so demand. Even /o/ in Lydian looks like it might represent some kind of an assimilation from the adjacent labiovelars, rather than the preservation of original rounding.

The long vowel situation seems even more worrying. We would definitely expect to see a raising *ō > **ū at least somewhere, at minimum in languages like Lycian or Luwian where *ē > *ī, if these two had made up a similar class of long mid vowels. But apparently we only get /ā/ everywhere. Melchert reports for this contrast but a single distinguishing feature: apparently *dwō- yields /dā-/ in Hittite, versus no such loss of the glide for *dwā-. This seems to me much too iffy grounds for setting up a separate *ō.

Ignoring traditional PIE for the moment and instead reconstructing *a₁ (in place of *a) versus *a₂ (in place of “*o”), there would seem to be more promising options for phonological interpretation available. In terms of height, I’d assume that it was actually the latter that was the more open vowel *[a]. This is fairly directly suggested by the different treatment in Hittite: all other things being equal, more open vowels tend to be realized as longer. No clear evidence seems to exist for a difference in backness; *a₁ remains stable-ish (though I presume there would have been some variation on if a stands for central [a] or back [ɑ]), while *a₂ has both clearly fronted (Lycian) and backed (Lydian) reflexes. This, however, provides another reason to suspect a lower value for *a₂, given that backness contrasts tend to be more labile among lower vowels.

What this seems to leave available for *a₁, then, is some kind of a weaker vowel value still prone to lowering, like [ɐ], [ɜ] or [ə], probably both [-front] and [-round]. It seems a bit curious how this has not been retained as such anywhere, [2] but hardly any more so than the failure of *o to surface consistently anywhere (or any other family-wide “sweep” development, such as *s > *h in early Iranian or *a *ā *u *ū > *o *a *ъ *ɨ etc. in early Slavic).

Compare to this e.g. the development of English short a (Early Modern /a/) and laxed u (Early Modern /ə/): the former has split over the last few centuries into a variety of lengthened (BATH lexical set, father, “tense æ”), fronted (TRAP set) and/or backed (PALM and WATER sets) reflexes, while a new neutral short /a/ is in numerous varieties filled in from earlier †/ə/ > †/ʌ/. Similar vowel histories can be found moreover e.g. in Samic varieties (old *ā > á being more heavily split in allophones etc. versus lowered *ë > â/a remaining more neutral) or in Samoyedic (old *å *a yielding a large variety of reflexes versus *ə, often lowered, remaining more neutral).

Melchert’s most recent work also mentions the recent discovery of a “new” /o/, /ō/ for several Anatolian languages, in earlier work conflated with u, ū. The short version mainly evolves from labiovelar + syllabic resonant, the long version from *aw, *ow, both also from *u next to laryngeals (thus this /ō/ corresponds to late PIE *ū, from *uH; remaining cases of Hittite /ū/ are instead from *ew, or from stressed open-syllable lengthening of *u). These are therefore clearly distinct from traditional PIE *o, *ō. If this new *o could have been in place already in Proto-Anatolian (apparently plausible at least in non-final syllables), it’s all the more reason to not suppose also the simultaneous retention of old *o.

Given that Anatolian retains numerous archaisms, and the possibility of it being the earliest split-off of Indo-European entirely, we can also proceed to ask an important question: would Proto-Anatolian *ɜ *a or traditional PIE *a *o be the more archaic state of affairs? I would end up preferring the former: a chain shift a > ɑ > o, ə > ɜ > a is more typical than the opposite.

As soon as we’ve formed this hypothesis for a “skew triangular” (or perhaps even “square”? [3]) vowel system *i *e *ɜ *a *u for not only Proto-Anatolian, but also Early PIE altogether, there will be numerous immediate implications. I will not go into listing all of these just yet… But to mention one, this will nicely amount to addressing the now and then raised typological objections about the rarity (and possible absence entirely before laryngeal coloring) of traditional PIE *a. In the new system, this turns out to translate into the rarity of the more marked vowel *ɜ, while the proper cardinal open vowel *a is quite frequent indeed.

[1] 1992, “Relative Chronology and Anatolian: The Vowel System”, in Rekonstruktion und Relative Chronologie. Akten der VIII. Fachtagung der indogermanischen Gesellschaft, ed. Robert Beekes;
1993, “Historical Phonology of Anatolian”, Journal of Indo-European Studies 21/3-4;
and 2015, “Hittite Historical Phonology after 100 Years (and after 20 years)”, in . I have not yet seen his 1994 monograph Anatolian Historical Phonology, but the 2015 paper seems to summarize the main points.
[2] In writing at least. It should be kept in mind that epigraphic evidence does not actually constitute phonetic evidence.
[3] Since *e may have well been half-open [ɛ] rather than half-close [e].

Another Phonological Relict in South Estonian

Some days ago, I decided to go for a re-reading of Setälä’s classic Yhteissuomalainen äännehistoria (1891) (that’s “Common Finnic Historical Phonology”, for the non-Finnish-reading people in the audience). This proved a good idea, in yielding not just the confirmation of some issues I had been wondering about; but also various detail observations new to me that seem to support a theory of mine in the works.

I mean the thesis introduced at the end of my last post: the characteristic Finnic sound change *š > *h did not take place in unitary Proto-Finnic, or even in unitary Core Finnic (following the splitting-off of South Estonian and Livonian) but spread across the Finnic language area even later, after its splitting into dialects entirely.

One of these details appears in the Finnic word for ‘goose’, normally reconstructed as *hanhi (> e.g. Fi. hanhi, Es. hani). We are quite sure that this goes back to earlier *šanši, given that it’s a long-known loanword from PIE *ǵʰans- (most likely thru Baltic); and also given the recent observation that it could be traced back to even earlier *šänšä, allowing treating Erzya /šenže/ ‘duck’ as a “non-native cognate”.

Since the word fails to show up in Samic — or rather, shows up there in an entirely different form *ćōńëk, allegedly from a pre-Germanic alternative formation *ǵʰan-ut- according to an etymology from Koivulehto [1] — we probably still shouldn’t assume loaning into common West Uralic. Another point in favor of this seems to be given by the Finnic sound change *-ńć- ~ -ńś- > *-ć- ~ *-ś-: that this early denasalization only applies before a palatalized sibilant seems best explained by assuming that the clusters *-nš- and *-ns- had not yet even entered the language by this point (neither of them occurs in material inherited from Proto-Uralic). [2]

Denasalization before sibilants is a fairly natural sound change though. A second round of the same has later taken place again in the southern Finnic area, this time with compensatory lengthening, affecting *-ns- found in innovated Proto-Finnic vocabulary (as in Es. põõsas ~ Fi. pensas < PF *pënsas ‘bush’) or developing thru *-nc- from the assibilation of *-nt- (as in Es. kaas ~ Fi. kansi < *kansi < PF *kanci < *kanti < PU *kamtə ‘lid’). And the interesting fact is: in South Estonian this affects ‘goose’ as well, yielding haah’ instead of the expected ˣhahn’.

You might protest that surely the loss of a nasal should be just as natural before /h/. This is also the mechanism Setälä appeals to. Crucially though: words showing *-nh- of some other origin are not denasalized. As just mentioned in my last post, they instead metathesize, yielding e.g. *tenho > tehn ‘thank’, *vanha > vahn ‘old’ (again, just like other sonorant+h clusters, regardless of if they go back to *-Rš- or not). ‘Goose’ appears to be the only example of this denasalization development. [3] I would not brush off as a coincidence the fact that it is also the only example that can be securely traced back to *-nš-.

This situation might not be obvious, as two other Finnic words with *-nh- have still been proposed to come from *-nš-. Yet newer research appears to have shown by now that neither example holds water.

*vanha ‘old’ is the first case with alleged earlier *-nš-, traditionally compared with Udmurt /vuž/, Komi /važ/, of the same meaning. Komi /a/ would be irregular as a counterpart of Finnic *a, though, and a recent proposal from Mikhail Zhivlov [4] identifies a better etymology for the Permic words: borrowing from Baltic *wetuša- ‘old’ (cf. Lithuanian vetušas). The development *e > /u ~ a/ seems to be regular before a lost medial consonant, as in PU *wetə > Udm. /vu/ ~ K. /va/ ‘water’. [5] A different etymology for Finnic *vanha has been proposed too: borrowing from Germanic *wanhaz ‘bent, crooked, bad’. This seems uncertain due to the semantic difference, but if the Permic connection fails, it appears to be the explanation we will have to default to. LÄGLOS is of the opinion that it would be exactly the existence of Permic cognates that shows this etymology to be unviable, not any formal flaw.

The second is *inhiminen ‘human’, which has been traditionally compared with Mordvinic *inžə ‘guest’. A loan etymology by Koivulehto derives these from PIE √ǵenh₁- ‘to beget’. Disassembling this requires a bit more analysis though. Given that the usual sound substitution for Indo-European *ǵ has been Uralic *j, Koivulehto suggests that the words continue the zero-grade *ǵn̥h₁-, with the sequence *ǵn̥- substituted as *in- (rather than *jVn-). Since we still have /i-/ and not the expected **e- in Mordvinic, the word would then have to have been loaned fairly late — but my soundlaw *je- > *i- for Finnic seems to “get in the way” of this: Koivulehto’s reconstruction could be quite well amended to a common proto-form *jenšä-, derived instead from the IE full grade.

Other considerations still chafe against this analysis. Firstly, Koivulehto also assumes a sound substitution *H → *š, but as has been recently argued by Adam Hyllested, [6] this is likely mistaken, and we should instead assume *H → *h straight away. Most of Koivulehto’s alleged examples are restricted to Finnic, and thus show no direct evidence for *š at all. For a few others, with cognates in e.g. Samic that explicitly point to *š, alternative etymologies have been suggested. If I were doing a more detailed review, I would consider also the possibility that they represent “etymological misnativization”, with IE *H → Finnic *h substituted as *š either in the other Uralic languages involved, or already in an archaic mediating Finnic variety.

Secondly, in Finnic we have no evidence for a bare root **inhä, only for the longer stem *inhimV- (mostly further suffixed with the adjectival/diminutive ending *-inen, but a few forms like Ludian inahmoi could in principle be parallel rather than “suffix-switched” derivatives). This seems to not match at all with the usual patterns of Finnic nominal derivation. We would expect something ending in *-imV- to be either a nominalization (in *-mA-) from a frequentative verb (in *-i-), or a superlative. Instead the Indo-European derived noun *ǵenh₁mn̥ ‘offspring’ (> Latin genimen, Sanskrit janiman, etc.) seems to provide a better morphological match: it even provides half of the ending *-inen, whose presence in the neutral word for ‘human’ is otherwise a bit puzzling. In Mordvinic we see no signs of this though, which would seem to suggest that the ‘guest’ word has a different etymology entirely.

(Thirdly… in South Estonian only Northern-type reflexes inemine ~ inimene seem to be attested, so even if the history here had really been *ǵenh₁- > *jenšV- > *inhV-, it would not affect my analysis of ‘goose’ anyway.)

How late this reanalysis requires pushing *š > *h exactly is not clear. The terminus post quem on show is after the Southern Finnic denasalization (or perhaps concurrently with it: earlier in North Estonian vs. later in South) — but this is itself difficult to date. At minimum this would have to be later than the splitting-off of Northern Finnic, which in principle might however go quite deep into the Proto-Finnic period.

There is some weak evidence for some dialect diversity within the future Estonian area at this time as well. Another minor observation of Setälä’s is that, in a few central Estonian dialects, *Vns > *VVs postdates the diphthongization of original *aa and *ää to /ua/ and /iä/. This won’t have to mean that the entire denasalization development is this late, though: a nasal vowel stage *ṼṼs would make a very believable intermediate, with full loss of nasality only later.

The form haah’ also does not even appear to be common across the entire South Estonian dialect area, but is rather limited to its southernmost fringes. To some extent this probably means that the literary / North Estonian form hani has simply displaced the native form in some parishes… but a very similar distribution also seems to hold for tehn and vahn. In principle it would be possible that also the southwesternmost area of South Estonian had already split off by the time of *š > *h, and that the general Central Finnic soundlaw *nh > *n is the regular development elsewhere in the SE area.

This analysis may also raise a few methodological questions. Is it really legitimate to suppose a development *Vnš > *VVš for pre-South Estonian only on the basis of a single etymology? On one hand, it is clear that writing a blank check for positing single-example sound changes with highly specific conditioning would allow rewriting the historical phonology of any language completely to taste. On the other hand, in this particular case we have some very strong constraints to avoid this failure mode: aside from the bare output (haah’), we can independently establish also all three of the input (*šanši), the specific conditioning environment (loss of *n before a sibilant) and the general phonetic motivation (the articulatory complexity of a nasal-sibilant transition) of the sound change I’m assuming.

Much seems to depend on how we model sound change phonologically. Do changes target, or are they conditioned by atomic phonemes — or by the features of neighboring segments? If the former, then we will be forced to treat *Vns > *VVs and *Vnš > *VVš as two parallel changes that have only incidental similarity; if the latter, then it will become possible to treat them as the one and the same sound change *VnS > *VVS, and to proceed to infer early dialect diversity within the Finnic languages.
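
The difference between the two models can be made concrete with a toy rule applier. This is only an illustrative sketch under simplifying assumptions: forms are treated as plain strings of letters, the vowel inventory `VOWELS` and the function name `denasalize` are my own inventions for the example, and the class of sibilants is passed in as a parameter so that the "atomic" rule (only *s) and the "featural" rule (any sibilant) can be compared directly.

```python
import re
import unicodedata

# Assumed vowel letters for the toy forms used here.
VOWELS = "aeiouäöüë"


def denasalize(form, sibilants):
    """Apply V n S > V V S, where S is any segment listed in `sibilants`."""
    # Normalize so that letters like š and ë are single code points.
    form = unicodedata.normalize("NFC", form)
    vowels = unicodedata.normalize("NFC", VOWELS)
    sibs = unicodedata.normalize("NFC", sibilants)
    # A vowel followed by n, with a sibilant coming next: double the vowel.
    pattern = rf"([{vowels}])n(?=[{sibs}])"
    return re.sub(pattern, r"\1\1", form)


for word in ["kansi", "pënsas", "šanši", "vanha"]:
    atomic = denasalize(word, "s")      # rule stated over the phoneme *s only
    featural = denasalize(word, "sš")   # rule stated over the class of sibilants
    print(word, atomic, featural)
```

With the rule stated over *s alone, *šanši comes out unchanged and a second, separate rule is needed; with the rule stated over the class of sibilants, the same single rule also turns *šanši into *šaaši, while *vanha is left alone in both cases.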

[1] I am on the skeptical side though, and would expect anything showing Samic *ć ← PIE *ḱ to have been adopted from a Satem variety.
[2] The same relative dating is similarly suggested by how this sound change seems to extend to Mordvinic as well. None of the textbook examples such as PU *kuńćə ‘urine’ have known reflexes in Mordvinic; but one binary comparison, Erzya /saźi-/ ‘to gain, get’ ~ Permic *sudź- ‘to reach’ seems best reconstructed as *sëńćV-.
— It might be additionally a good idea to assume that the heterorganic clusters *-ŋs- and *-ŋš-, known in one word each (*joŋsə > PF *jousi ‘bow’; *jaŋša- > PF *jauha- ‘to grind’) had already changed to *-xs-, *-xš- in Finnic before the denasalization of *-ńć-.
[3] ‘Thank’ and ‘old’ are actually moreover the only two examples of *-nh- > -hn- that I can get together on a quick search.
[4] I do not know of a more substantial publication on this yet, but an initial release has been in the proceedings of the 2008 conference Языковые контакты в аспекте истории. (My thanks to André Nikulin for the reference.)
[5] Rather than setting up a separate marginal Proto-Permic vowel *å, I would prefer explaining this correspondence as a conditional development in Komi from Proto-Permic *o (normally > Udm. /u/ ~ K. /o/). Finding a phonetically reasonable account of the development regardless remains to be done. A few possibilities that would initially seem plausible are blocked e.g. by how both *-ej- and *-at- still yield the expected /o/ in Komi (cf. /voj/ ‘night’, /śo/ ‘100’).
[6] In a conference paper to be found in his PhD thesis Word Exchange at the Gates of Europe. Again, I do not know of a “more proper” published version.

A Phonotactic Allewrgy…?

There are, I think, several things off about the current understanding about the treatment of the consonant clusters *wr and *wj in Proto-Finnic.

There are no generally accepted instances of *-wr- in Proto-Uralic (though see below for one proposal), and examples with *-wj- are rare enough that so far none of them happens to have Finnic reflexes (probably the most reliable is *jäwjə ‘beard lichen’, with reflexes in just three branches: Samic + Khanty + Samoyedic). Within the Finnic comparative data, no direct evidence for these clusters appears either.

Cases involving these clusters in Proto-Finnic are therefore solely Indo-European loanwords. In these, two different lines of treatment have been generally accepted.

The first is metathesis to *-jw-, *-rw- > *-iv-, *-rv-. These latter clusters clearly occur in material inherited from pre-Finnic (e.g. PF *kaiva- ‘to dig’ ~ Samoyedic *kajwå ‘spade’; PF *sarvi ‘horn’ ~ Samic *ćoarvē ‘id.’). One classic example of the metathesis of *-wr- has been known for centuries: the word for ‘lake’, PF *järvi ~ PS *jāvrē. [1] Among Baltic loanwords, about three other examples can be found: *karva ‘hair’, *tarvas ‘bull’ and *torvi ‘horn (instrument)’ (~ e.g. Lithuanian gauras, tauras, ‘id.’; Latvian taure ‘id.’). ‘Lake’ has been proposed to be a loan as well, except from an earlier stage of Balto-Slavic, to account for reflexes also in Mordvinic and Mari.

Cases of metathesis of *-wj- are a newer discovery. Germanic *-wj- being continued as Finnic *-iv- was established some decades ago by Koivulehto, [2] with examples such as *laiva ‘ship’ ← Gmc. *flawją ‘id.’ (> e.g. Old Norse fley); *raivat- ‘to clear out, esp. woodland’ ← Gmc. *strawjan- ‘to strew’; *raivo ‘skull’ ← Gmc. *trawją ‘vessel’ (ONo treyja). Examples in loanwords from other sources seem to be rare so far, but one is the Estonian rivername Koiva, located in northern Latvia; and whose Latvian name is instead Gauja.

The other development is fortition to *-pj-, *-pr-. Both of these are clusters introduced in loanwords in the first place, and examples of this development are generally later loanwords from Germanic. Examples are not too numerous, but they include *hipjä ‘skin’ ← Gmc. *hiwją ‘appearance’ (ONo ); — *hapras ‘brittle, weak’ < *šapras ← Gmc. *sawraz ‘filth, dirt’ (ONo saurr); *sapra ‘a type of haystack’ ← Gmc. *sauraz ‘pole’ (ONo saurr); *äpräs ‘bank, steep shore’ ← Gmc. *awriz or *awraz ‘sandbank’ (ONo eyrr, aurr).

I do not aim to question any of these etymological correspondences. However, I find the idea that both developments would have arisen specifically as sound substitutions to avoid the “phonotactically forbidden” clusters *-wj-, *-wr- implausible.

There is one principal problem. While the Proto-Finnic period involved a hefty reduction in the total consonant inventory of the language (loss of palatalized *ć *ś *ń, the “spirants” *d₁ *d₂ *x, the postalveolar affricate *č and the velar nasal *ŋ), it on the other hand brought a clear increase in phonotactic complexity. Some new types of consonant clusters that appear to have been introduced include:

  • stop/affricate + liquid, e.g. *sëpra ‘company’, *atra ‘plough’, *ocra ‘barley’, *nakris ‘turnip’; *täplä ‘spot’, *kakla ‘neck’
    (no *tl though)
  • stop + nasal, e.g. *litna ‘town’, *sakna ‘sauna’;
  • stop + *j, e.g. *kapja ‘hoof’, *patja ‘mattress, pillow’, *acja ‘thing’, *vakja ‘wedge’;
    (also, in native vocabulary, *-tv- < *-d₂w-, in e.g. *patvi ‘tinder’;)
  • fricative + nasal, e.g. *käsnä ‘callus’, *lehmä ‘cow’, *ahnas ‘ferocious’;
  • fricative + semivowel, e.g. *rasva ‘fat’, *ohja ‘guide’, *rahvas ‘people’;
  • nasal + geminate stop, e.g. *temppu ‘trick’, *kontti ‘leg; backpack’, *lonkka ‘hip’;
  • liquid + geminate stop/affricate, e.g. *harppat- ‘to take a long stride’, *kartta- ‘to avoid’, *tarkka ‘acute, accurate’; *hëlppo ‘easy’, *hëltta ~ *helttä ‘cockscomb’, *malcca ‘Atriplex sp.’, *palkka ‘salary’;
    (found through inflection and derivation also in native vocabulary, e.g. *jält-tä, partitive sg. of *jälci ‘cambium’)
  • liquid + affricate, in at least *porcas ‘pig’;
    (found also in native vocabulary through *-Rtə >> *-Rci)
  • liquid + fricative, e.g. *varsa ‘foal’, *vërho ‘drape’, *kulha ‘bowl’;
  • *n + fricative, e.g. *pënsas ‘bush’, *vanha ‘old’;
  • geminate nasal, e.g. *konna ‘toad’;
  • geminate liquid, e.g. *villa ‘wool’.
    (from earlier *ln)

This was not a momentaneous revolution in phonotactics, of course. For a few of these, examples of rather uncertain Uralic derivation have been suggested (e.g. ‘turnip’ has been compared with Mansi *nëër, Khanty *naaɣər ‘pine nut’); others have been introduced in relatively early loanwords and have thus “non-native cognates” elsewhere in Uralic (e.g. Mordvinic *purćəs ‘pig’); others may not have yet been introduced in Proto-Finnic proper, but rather in some of the early Finnic dialects (such as *käsnä, found only in Northern Finnic). None of this rocks the overall picture, though: if loanwords were able to feed in new types of clusters, they were taken up as-is, just about as far as possible.

(The same process has also kept going later on. Even in varieties such as standard Finnish, where there has been no post-Proto-Finnic syncopë to generate new native clusters, the ongoing flow of various Indo-European loanwords still has by now introduced loads more of novel consonant clusters, such as /-stm-/ in astma, /-ŋ(k)st-/ in gangsteri, /-kstr-/ in ekstra.)

So why would *-wj- and *-wr- have been specifically and stubbornly avoided for centuries? Especially when this general type of cluster, semivowel + sonorant, was able to occur in native vocabulary all along, as is shown by e.g. Fi. läyli ‘heavy’ < PF *läüli < PU *läwlə, or the above-mentioned Fi. kaivaa ‘to dig’ < PF *kaiva- < PU *kajwa-.

I propose that the main part of the solution is that the alleged “metathesis upon substitution” did not quite occur. This was instead a regular sound change, one that merely happened to mainly operate on loanwords.

Some indirect support is provided, I think, by how other examples of continuant cluster metatheses are already known in Finnic, too. These include:

  • *-jh- > -hj- in North Estonian (lahja ‘thin, lean’ ~ Fi. laiha)
  • *-nh-, *-lh-, *-rh- > -hn-, -hl-, -hr- in South Estonian (vahn ‘old’, võhl ‘witch’, kahr ‘bear’ ~ Fi. vanha, velho, karhu; NEs. vana, võlu, karu)
  • *-wh- > -hv- in both NEs. and SEs. (kehv ‘poor’ ~ Fi. köyhä)
  • *-sn- > -ns- in Western Finnish (runsas ‘plentiful’ ~ Livvi ruznaz)

For the metathesis *-wj- > *-jw- in particular there is also an areal parallel from Ter Sami and Lule Sami (you may recall I have already mentioned the Finnic metatheses currently under discussion in that post, too).

Since these metatheses affect only a part of the Finnic (or Samic) languages, sound change seems to be the only explanation available. It is not clear to me why continuant clusters would be particularly prone to metathesis though, and it’s possible that e.g. the connecting factor in the first three changes could be the metathesis of *h specifically. Regardless, it seems rather arbitrary to instead prefer a sound substitution explanation for *-wj- and *-wr-.

A number of the individual words to have been metathesized specifically point towards a sound change rather than a sound substitution, too.

1) For ‘lake’ there are two possible arguments. The first is chronological: *jäwrä could be regularly reconstructed already for Proto-West-Uralic (or for Proto-Finno-Volgaic, if you were to subscribe to such a stage). Alternately, there may be a phonetics argument available. Research in Uralic substrate vocabulary in Western Russia has led to supposing a “Meryan” reflex *jäkr- as well, reflected in lake names with an element ягр- or яхр-. [3] Both phonetic typology, and the proposed early Balto-Slavic etymology of this whole ‘lake’ root (either from *yewH-ro- ‘body of water’ [4]; or from *eǵʰe-ro- ‘lake’? [5]) suggest that the velar element in these has not developed from *w, but is instead an archaism, pointing to *jäkrä or *jäxrä as the earliest shape of the word in Uralic, with lenition to *-wr- at least in pre-Finnic and pre-Samic.

None of this is still completely watertight though, as another possibility yet is that early BSl. *-wHr- was substituted as *-Kr- in pre-Meryan, but as *-wr- at least in pre-Samic. If so, loaning to Proto-Finnic could have happened independently as well. (Mordvinic and Mari only show simple *r and they can swing any way really.)

2) With ‘horn’, the appearance of *o in Finnic may point to *towrə as an earlier form. This ties in with a larger topic: several Baltic as well as some Germanic and Indo-Iranian loanwords in Finnic seemingly still preserve PIE *o, but in a few of the cases we are actually dealing with PIE *a instead, one of these being this particular word (East Baltic *taure is, obviously enough, a derivative of *tauros ‘bull, aurochs’). I prepared a small survey of this matter some time ago, [6] and among other results it turns out that cases with *au → *ou seem to be especially frequent. I suppose that this indicates that Proto-Baltic or Proto-Balto-Slavic had already merged short *a and *o at the time, but that the diphthong *au was at that time realized in some applicable dialect with a labialized first component, roughly [ɒu]. The loanwords with *au → *au would then have to be analyzed as later (as is already the case as well in explanations that appeal to the late retention of PIE *o), or as coming from a different Baltic dialect.

3) The above argument applies almost intact also to Koiva, for which we can likewise posit *Koiva < *Kowja ← *G[ɒu]jā. Here original pre-Balto-Slavic *ou can be suspected as well, though.

4) With ‘ship’, assuming metathesis as a sound law seems to provide a small improvement for the historical phonology of Livonian. In native vocabulary and sufficiently old loanwords, the development of *-Viv- in Livonian is initially *-Vuv-, possibly with monophthongization in modern Courland Livonian (well paralleled by known developments such as *-Vll- > -VVl-, or *-Vlj- > *-Vľľ- > -VVľ-):

However, for ‘ship’ we instead find *laija > lǭja : laij-. This could be explained by the metathesis *-wj- > *-jw- having never happened in Livonian. Thus, just as *-jw- > *-Viv- assimilates to *-Vuv-, also *-wj- > *-Vuj- assimilates to *-Vij-; and the development *lawja > *laiva only holds for the rest of Finnic. [7]

5) Finally, my earlier promised possibly inherited example of *-wr-: *korva ‘ear’.

Older research has taken comparison with Samic *koarvē, approx. ‘prop’ (NS also bealljigoarvi ‘earhole’), Permic *kʷor ‘leaf’, Hungarian dial. harap ‘dry grass’ as grounds to reconstruct PU *korwa ‘blade, leaf’. On semantic grounds, the alleged Samic cognates look like loans from Finnic though. The direct development ‘blade’ > ‘prop’ appears improbable; while the development ‘blade’ > ‘ear’ > ‘handle’ > ‘prop’ (the two last stages are verifiable as polysemic meanings of *korva and its derivatives in Finnic) seems to be etymologically blocked, since Samic still retains the original PU root for ‘ear’, *pealjē < *peljä.

A competing proposal comes from Juha Janhunen, who in ’81 has compared Finnic *korwa with Samoyedic *kåw ‘ear’. In his original opinion, the root here would be approx. *kawə, irregularly labialized in (pre-)Finnic, and extended to a derivative *kow-ra > *korva. Semantically this is clearly better.

I do not find ad hoc labialization in Finnic enticing, though. And there’s also another phonological issue: *kåw is the only Proto-Samoyedic root with a shape *CVw in Janhunen’s reconstruction, while a number of more reliable examples instead point to the regular development being *CVwə > *CV. [8] Etymologically the proposal has its problems as well. Supposing two synonyms for ‘ear’ with complementary distribution (*kawə in Finnic + Samoyedic, *peljä everywhere else in Uralic) might work under a scenario where Finnic and Samoyedic are two early offshoots of Uralic, but seems less likely if they sort into their own respective wider subgroups, West Uralic and East Uralic (as I think is the most probable).

Despite all these issues, this idea might regardless be onto something. I would instead assume that the original root here is *kow-; and that, while it is not retained as such in any Uralic language, a parallel derivative from this, formed already in Proto-Uralic with the common verbalizing suffix *-l(ə)-, is the well-attested verb for ‘to hear’. This has traditionally been reconstructed as the rather Finnocentric *kuule-, but in my opinion thus better: *kow-lə-. [9] Several reflexes seem to indicate *o; these include Mordvinic *kuľə-, Mari *kola-, Mansi *kʷaal-, and, if it has any input from here, Hungarian hall- (though Old Hungarian hadl- would seem to show that this is instead from PU *kontV-lə- ‘to listen’). Permic *kɨl- and Khanty *kɔɔL- are the only reflexes compatible with short-vocalic *kulə-, and they might simply result e.g. from a raising *ow > *u, similar to the development *ow > *uu I assume for Finnic. [10]

It also seems likely to me that the Samoyedic words for ‘ear’ are derived from this root in some fashion, even if probably not as direct inheritance. PU *o > Samoyedic *å is after all the regular development in any environment other than *CoCə. To make progress, I’d suggest that the PSmy reconstruction itself requires adjustment. Janhunen’s monosyllabic *kåw seems to be largely based on Nganasan kou, but this could just as well come from a bisyllabic proto-form such as *kåjå, through the regular loss of post-tonic *-j- and raising of *å (an exact parallel is PU *kaja > PSmy *kåjå > Ng. kou ‘sun’). Given the development *-wj- > *-j- in *jäwjə > *jüjə ‘beard lichen’, I would prefer assuming an agentive derivative *kow-ja > *kåwjå > *kåjå ‘hearer’. Or perhaps a more heavily contracted *kowlə-ja > *kol-ja > *kåljå > *kåjå, for an exact parallel with attested forms like Fi. kuulija? [11] This would even have some strange synergy with the derivation of Smy. *timä < *temä ‘tooth’ from *sewə-mä ‘bite, biting’.

Getting back on track, though. If the metatheses *-wj-, *-wr- > *-jw-, *-rw- took place as regular sound changes in Proto-Finnic times, this will naturally lead to a full absence of *-wj- and *-wr-, as the Finnic comparative data indeed suggests. So far, so good.
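To make the "regular sound change" reading concrete, here is a minimal sketch of the metathesis as a mechanical rewrite rule. The function name `metathesize` is made up for illustration, and the input forms are the pre-forms quoted earlier in this post, written without the asterisk. The output keeps the reconstruction's *w and *j, where the attested Finnic spellings have v and i.

```python
import re

def metathesize(form: str) -> str:
    """Apply *-wj-, *-wr- > *-jw-, *-rw- (hypothetical helper, plain-text forms)."""
    # Swap *w with an immediately following *j or *r; leave everything else alone.
    return re.sub(r"w([jr])", r"\1w", form)

# Pre-forms taken from the discussion above: 'ship', 'ear', Koiva, 'horn'.
for pre_form in ["lawja", "kowra", "kowja", "towrə"]:
    print(f"*{pre_form} > *{metathesize(pre_form)}")

# A cluster other than *-wj-, *-wr- is untouched, e.g. the verb stem *kowlə-.
print(metathesize("kowlə"))
```

Because the rule applies without exception, no form that passes through it can retain *-wj- or *-wr-, which is the gap the comparative data shows.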

However, what from there on? Should we not just as well expect these clusters to be recreated right away by the next few batches of loans, instead of fortition to *-pj-, *-pr-?

At this point I would like to direct attention to the fact that the (Western) Finnish reflexes of these words do not show explicit signs of such a fortition. My example words listed above surface as hauras ‘brittle’, saura ‘haystack’, äyräs ‘bank’, dialectal hiviä ‘skin’ (though Standard Finnish has adopted the fortited form hipiä). This is though indeed also the regular Finnish development of *-pj-, *-pr- (cf. *kapja > kavio ‘hoof’, *sëpra > seura ‘company’)… so, as long as we wanted to route these loans through Proto-Finnic, it will still be preferable to indeed reconstruct e.g. *hapras, *hipjä, in order to regularly account for all reflexes, including also such ones as Northern Karelian hapraš, hipie.

But consider now the possibility that these aren’t loans dating all the way back to Proto-Finnic; and rather loans acquired after its breakup, taken up in the first place in Western Finnish, and mediated from there to the other Finnic varieties. In this case, the appearance of *-pr-, *-pj- could instead be a type of “etymological nativization gone awry”: e.g. the pre-Karelian dialect would by this time still have remained without **-wr-, but it would have had *-pr- as an equivalent of West Finnish *-wr-. This could have motivated adopting the cluster not phonetically, but rather “phonologically”. [12]

This firstly allows us to get rid of the strange back-and-forth phonological development in Finnish: words like hauras would simply preserve the Germanic original’s diphthong altogether. Secondly, this allows for some variation in the reflexes elsewhere in Finnic: if different Finnic dialects had to individually deal with adopting West Finnish *-wr- somehow, some of them could have opted for different strategies in different words. And we indeed find a *-wr- ~ *-pr- vacillation in e.g. Fi. teuras ‘sacrificial animal’, teurastaa ‘to slaughter’ ~ Krl. teuraštoa ‘to slaughter’ | Es. tõbras ‘head of cattle’ ~ Votic tõbras ‘elk’. This lexeme is likely from Germanic *þeuraz ~ *steuraz ‘bull’, but no single PF form can be set up. Instead of assuming two parallel loans (*tëpras ‘head of cattle’, *tëuras ‘sacrificial animal’?), it will be possible to reckon with just a single early Finnish loan *teuras, further adopted in differing ways into Karelian and Southern Finnic.

There is one non-trivial cost as well, though. ‘Brittle’ happens to be one of the words showing the characteristic pan-Finnic sound change *š > *h. If the word regardless spread across Finnic by diffusion from dialect to dialect, it will be now fairly difficult to assume that this sound change occurred in unitary Proto-Finnic; it will instead have to be an “areal-genetic” post-Proto-Finnic development. [13]

I am prepared to defend this dating in detail. *š > *h has already been proposed by multiple researchers to date as later than the split between South Estonian and the rest of Finnic. Drawing it out further yet would not seem outrageous considering what we know of the typical expansion history of this kind of “major”, i.e. phonologically simple but innovative sound changes, while it would seem to allow the phonological fine-tuning of a handful of other known etymologies as well. But that will have to be a topic of its own.

In case my analysis here is correct (and I think it at minimum should prompt some kind of a more detailed defense for why would there ever have existed a “metathetic sound substitution”), there is a moral to be learned as well. The Finnic languages are often taken as phonologically archaic; this is undoubtedly the case with regards to several features of their inherited lexicon, most prominently the bisyllabic root structure. However, loanwords have been a consistent source of new phonotactic complexity. It is then to be expected that there have been several layers of “renormalization” — processes that have pushed these new root shapes back in line, towards the native word structure. And this may have occasionally swept a few native words along as well. Such innovations will probably be impossible to identify as long as we only look at the native component of the vocabulary, however.

[1] Although often enough people with preconceptions about the archaicity of Finnic have also assumed that the metathesis was on the Samic side instead — despite how this would have to be irregular: Samic quite well allows *-rv-, as in e.g. ‘horn’.
[2] Essentially singlehandedly in his 1970 article “Suomen laiva-sanasta”.
[3] See e.g. Pauli Rahkonen (2011), “Finno-Ugrian hydronyms of the River Volkhov and Luga catchment areas”.
[4] Most IE cognates seem to point to meanings like ‘river’ or ‘flowing’, but the derivatives in modern Baltic such as Lithuanian jūra ‘sea’, jaura ‘bog’ may have gained this more stationary meaning early on. I wonder if this semantic shift might have originally taken place near the wide and slow-flowing middle parts of the Volga.
[5] Could it be possible for this root, apparently well-attested only in Balto-Slavic, to be a backloan from Uralic…? It would have to be at least old enough to be pre-Satemization, though, and the “epenthetic” thematic vowel seems hard to explain in this fashion as well.
[6] You can find a working version over here; written in Finnish though. Maybe I will post an English summary here at some point.
[7] Dating the assimilation *-jw- > *-ww- very early in pre-Livonian would also work. In this case, newer loanwords could be still subject to the metathesis *-wj- > *-jw-, they would just be later on assimilated in the opposite direction: *-Viv- > *-Vij-. This might be indeed preferable in light of two other data points. The first is *vaiva ‘bother, trouble, ailment’, which yields Liv. vǭja; it is however a Germanic loanword, whose original seems to require reconstruction with *-jw- (given e.g. Old High German wēwa). The other is the known Livonian developments *-Vlv- > *-ll-, *-rv- > *-rr- (e.g. *sarvi > *sarro > sǭra ‘horn’) taken together, which would surely predict that at this same time *-jv- > *-jj- as well (and not > *-vv-).
[8] E.g. *śowə > *so ‘mouth’; *sewə- ‘to eat’ > *te-mä > *timä ‘tooth’.
[9] This would also then disprove the often presented Indo-Uralic comparison with the PIE root for ‘to hear’, *ḱlew-. Instead I believe that better IE comparanda might be √h₂ew- ‘to perceive’ (from which *h₂ōws ‘ear’ is derived); or perhaps *(s)kewh₁- ‘to sense’. (Are these a doublet of some sort?)
[10] For Khanty, another possibility is that this is from earlier *kʷaal-, as in Mansi; this could have come about as a distant assimilation *kVwC- > *kʷVC-. While speculative, this idea is not quite entirely ad hoc: a possible parallel is *käwd₁ə ‘rope’ > Mansi *kʷääləɣ.
[11] It would be remotely within possibility to also suggest starting from *korwa or *kowra, as required by Finnic, combined with an ad hoc loss of *r in this cluster. However, I suspect that Finnic *harva ‘sparse, rare’ may be cognate with Samoyedic *tïrå ‘dry’ (< PU *šërwa; cognates in various other branches for both “sides” of this comparison are known as well), which would allow establishing a rather more natural development: PU *rw > Smy. *r.
[12] This gets perhaps even more phonetically plausible, if we assumed the “cluster series shift” to not have happened immediately from *-wr- to *-pr-, but rather from something like more innovative Western Finnish *-wr- to slightly more conservative Western Finnish *-βr-. This latter cluster would then have had no other option than to be uptaken as *-pr- in pre-Karelian / Ingrian / Estonian / etc. — On the other hand, this would require such a fine-grained Finnish dialect distinction to have indeed existed at the time, which may prove problematic.
[13] One other technically possible but again contrived explanation would be to assume that the word was initially lost from all Finnic varieties except Western Finnish, and that it later staged a return from there.

PIE verb roots, for the people

Last fall I blogged about a possible project on charting the distribution of reconstructed Proto-Indo-European terms in the descendant languages. Some discussion on here focused on the likely unreliability of the data, sourced for my initial survey from a conveniently available but unreferenced Wiktionary appendix.

This was not a choice out of ignorance as much as out of availability. To my knowledge, no public database of reasonably up-to-date etymological Indo-European data is currently available anywhere.

There is no reason though for us to resign ourselves to unequal access to information, with easily found free data being of poor quality vs. “proper” data being locked away in exorbitantly expensive dead-tree-format publications. Data and theories, per se, are uncopyrightable, after all.

I am therefore happy to announce having digitized a list of PIE verb roots, as recorded in the LIV + in its online Addenda und Corrigenda. [1] A basic version is available at the English Wiktionary. You may also be interested in taking a look at the fully tabulated data, in spreadsheet form. The notes in my master file on word derivation and distribution are sketchy at best though, and will require further work to fill in. [2]

While this file is probably necessarily public domain, if anyone reading ends up using or referencing it somewhere, I would appreciate a shoutout or similar.

As comes to actual analysis, at this point the data mainly allows a look at root structure. I might as well note in this post some basic facts that stick out.

For starters, the usual stop phonation constraints (against **D-D, **T-Dʰ, **Dʰ-T) surface reliably. A more interesting related pattern emerges too: I’ve sometimes seen it suspected that the unusual PIE cluster *wr- could come from earlier *br-, therefore tying together with the lack of stem-initial *b-. (Not a lack altogether: at least in the preliminary data, *b still occurs often enough in stem-final position.) However, if this was assumed, we would end up with quite a large number of pre-PIE stems of the shape *b-D; 5 of the 12 roots with *wr- show a stem-final voiced stop; as in *wreg- ‘einer Spur folgen’. So either we’d need to also assume the reconstructible voicing constraints to have emerged only later; or to fine-tune this hypothesis to some kind of a chainshift like *bʰ- > *w-, *b- > *bʰ-.
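The argument against *wr- < *br- can be checked mechanically. The following is a minimal sketch, assuming a simplified inventory of the three stop series (the sets below are my own shorthand, not taken from the LIV data), that tests an initial and a stem-final stop against the three forbidden combinations.

```python
# Simplified stop series; the membership lists are an assumption for illustration.
VOICELESS = {"p", "t", "ḱ", "k", "kʷ"}            # T
VOICED = {"b", "d", "ǵ", "g", "gʷ"}               # D
ASPIRATED = {"bʰ", "dʰ", "ǵʰ", "gʰ", "gʷʰ"}       # Dʰ

# The constraints named above: **D-D, **T-Dʰ, **Dʰ-T.
FORBIDDEN = {("D", "D"), ("T", "Dʰ"), ("Dʰ", "T")}

def series(stop: str) -> str:
    """Return the series label (T, D or Dʰ) of a stop."""
    if stop in VOICELESS:
        return "T"
    if stop in VOICED:
        return "D"
    if stop in ASPIRATED:
        return "Dʰ"
    raise ValueError(f"not a stop in the simplified inventory: {stop}")

def violates(initial: str, final: str) -> bool:
    """True if the initial/final stop pair breaks a phonation constraint."""
    return (series(initial), series(final)) in FORBIDDEN

# *wreg- under the *br- hypothesis would be pre-PIE *breg-, i.e. a *b-D stem.
print(violates("b", "g"))
# Under the chain shift alternative (*bʰ- > *w-) the pre-form would be *bʰreg-.
print(violates("bʰ", "g"))
```

The first check comes out as a violation and the second does not, which is why the chain shift variant is the only form of the hypothesis that leaves the voicing constraints intact.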

I would be content to abandon the idea though and to instead assume that most cases of *wr- have rather arisen either thru the reduction of a 1st syllable of earlier roots (in PIE-internal terms ≈ as zero-grade derivatives of some root shaped *(C)wer-, *Cewr-), or thru some Schwebeablaut-ish metathesis process.

There is more interesting stuff going on with resonants. I do not recall seeing this discussed in the context of PIE root structure anywhere before (which of course could be ignorance on my behalf), but several non-trivial constraints on their distribution are apparent. Here are some quick observations on this topic:

  1. No roots — or perhaps better: “sonorant cores” of a shape **-R₁eR₁- occur. This is a fairly trivial application of the universal principle of Similar Place Avoidance, though.
  2. No cores of a shape **-ler-, **-rel- occur either. Again, this is fairly simple to understand as similar consonant avoidance.
  3. The core **-nel- is also absent: this seems less expected, but may have the same motivation as the above. It could also be an accidental gap, though, as onset *n- is relatively rare altogether, and *-len- is well attested. Perhaps it is rather the abundance of *-ney- and *-new- roots that should be questioned.
  4. *m in the onset does not appear to quite count as a sonorant. There are just about no roots beginning with a cluster *Tm-, where *T would be a stop consonant (the lone example is *dʰmeH- ‘blasen’). We do find *sm-, *Hm-, but then again, *sT- and *HT- are possible just as well.
    This also lines up well with how a few cases of *mR- occur as well. Historically, they seem likely to be mostly “zero-grade clusters” again; but this etymological explanation does not suffice to explain the absence of other sonorant-sonorant clusters such as **nR-, **lR-.
  5. Sonorant cores of a shape *-yeR- seem unexpectedly rare altogether. No examples with **-yel-, **-yer-, **-yen- occur at all, and only a single example of *-yem-.
  6. Conversely, even when looking at roots with stem-final obstruents only, onset *-y- is curiously common preceding a stem-final back consonant (velar, laryngeal or *w): 29 cases out of 33, or 88%, show this environment! I wonder if we could assume that such roots reflect some specific pre-PIE front vowel, which was diphthongized to *ye before back consonants. It would likely have to be separate from the source of PIE *-ey- though, which does not seem to have any aversion against occurring before velars and laryngeals.
  7. Initial *h₂w- appears to be more common than all other laryngeal + glide clusters altogether, and it is also quite common stem-finally (i.e. as *-h₂w-, not *-wh₂-!). I wonder if this should be assumed to represent an earlier single phoneme such as *[ħʷ], created even further back from the ancestor of *h₂ by the same processes that led to the rise of the PIE labiovelar series?
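Observations of this kind come down to tabulating which resonant pairs flank the root vowel. Here is a rough sketch of how that could be done, assuming roots are stored as plain strings with *e as the root vowel. The sample list holds only a few roots mentioned in this post, not the actual dataset, and the helper name `sonorant_core` is hypothetical.

```python
import re
from collections import Counter

RESONANTS = "ywrlmn"
# A "sonorant core" here means: resonant + e + resonant.
CORE = re.compile(f"([{RESONANTS}])e([{RESONANTS}])")

def sonorant_core(root: str):
    """Return (onset resonant, coda resonant), or None if the root has no such core."""
    match = CORE.search(root)
    return match.groups() if match else None

# Illustrative sample only; the real tabulation would run over the full root list.
sample_roots = ["ḱlew", "meyH", "wreg", "kel", "dʰmeH"]

tally = Counter(
    core for root in sample_roots if (core := sonorant_core(root)) is not None
)
# Constraint 1 predicts that no core has the same resonant on both sides.
identical = [core for core in tally if core[0] == core[1]]

print(tally)
print(identical)
```

Run over the complete list, the same tally would expose gaps such as **-ler- or **-nel- as zero counts, and the rarity of *-yeR- as unusually low ones.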

I could extend my discussion to onset and stem-final consonant clusters as well, but they do not seem to show anything especially interesting for me to raise up just yet.

[1] Two corrections on reconstruction remain mysterious to me: an alleged removal of a root **meyH- ‘lang werden’ (the two roots I’ve recorded with this shape do not seem to have such a meaning), and the adjustment of a root *kelh₁- to *k¹elh₁- (no such root occurs in the original data; although the root *kel- ‘antreiben’ is adjusted to *kelh₁- in another correction).
[2] I have at the moment no recollection what the column labeled “st” signifies, but I am leaving it in for possible further elaboration.
edit: On re-checking the data, apparently this indicates the number of branches with verbal reflexes given by LIV in the running text. However, footnotes often list nominal derivations, and closer checking also shows that some entries even list a few additional uncertain verbal reflexes in footnotes… meaning that this will be not quite an actual measure of the distribution of the reflexes. Perhaps I will remove this in later editions.

A note on the Mitian Argument

An article to have caught my attention tonight: Mikael Parkvall (2008), Which parts of language are the most stable?, Sprachtypologie und Universalienforschung 61/3.

The main thrust of the paper is to define a statistical measure of the “arealness” or “geneticness” of a particular linguistic feature. This can be accomplished with fairly elementary calculations, once given a large dataset (the author uses, not especially surprisingly, WALS). Typologists will likely find the exercise illustrative, both in its general array of eyeball-able results, and in demonstrating how even the simplest bit of math can go a long way. [1]

One result stands out to me: among the features found the most strongly genetic, at #3 stands “M-T pronouns” — i.e. the likes of Uralic *minä, *tinä, and their suggested distant relatives in Indo-European, Yukaghir, Turkic, Mongolic, etc. (families that, taken together, form a subset of the Nostratic macrofamily hypothesis known as “Mitian”). Parkvall does not fail to notice this result either.

This may still require a number of caveats. WALS does not pack a very large number of etymological data sets, and is more geared towards features that can instead illuminate areal patterns. And, perhaps as a warning, the #1 most genetic feature on the list turns out to be “presence of phonemic clicks”.

As people who dabble in linguistic classification most probably know, click consonants have traditionally been held as a defining marker of an alleged “Khoisan” language family of southern Africa, first proposed by notorious “lumper” Joe Greenberg. However, putting together more conventional evidence for this grouping has over the years proven near-impossible, and these days conservative analyses instead seem to have settled on distinguishing some 3-4 separate families (the larger units with some acceptance being Khoe, Tuu, and Ju-ǂHoan) in place of unified Khoisan.

(An additional point, if you look closely at the math behind the stats, is that the highly genetic assessment of clicks gets a slice of its homogeneity score not just from the high homogeneity of the “Khoisan” families in their presence of clicks; but also from the complete homogeneity of all non-African language families in their absence of clicks. This argument can be expected to equally apply to any other trait that is truly a single-family or single-geographical-area idiosyncrasy, rather than one found sporadically around the world.)
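The effect is easy to reproduce with a toy measure. To be clear, the sketch below is not Parkvall's actual formula, which I do not reproduce here. It simply takes the mean within-family agreement on a binary feature, and all of the data is invented for illustration.

```python
def family_homogeneity(values):
    """Share of a family's languages that agree with the family's majority value."""
    present = sum(values)
    return max(present, len(values) - present) / len(values)

def mean_homogeneity(families):
    """Toy 'geneticness' score: average homogeneity over all families."""
    return sum(family_homogeneity(v) for v in families) / len(families)

# Invented data: True = feature present in a language, five languages per family.
# A single-area trait: present throughout 3 families, absent throughout 20 others.
single_area = [[True] * 5] * 3 + [[False] * 5] * 20
# A sporadic trait: present in 2 of 5 languages in every one of 23 families.
sporadic = [[True, False, False, True, False]] * 23

print(mean_homogeneity(single_area))
print(mean_homogeneity(sporadic))
```

Under this toy measure the single-area trait gets a perfect score, and 20 of its 23 perfectly homogeneous families are ones that simply lack the feature.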

Regardless, we see “Mitianness” still squarely beating out various common tell-tale signs of established-family genetic relatedness, such as the presence of ejectives; sex-based noun gender systems; or polysynthesis.

At some point in the future, once we have an “etymological WALS” at our disposal, it would be moreover interesting to repeat this experiment with a few other lexical variables. E.g. how do numerals or body parts stack against pronouns in genetic classification? What are the stablest kinship terms? How good a job does the Swadesh list really do? Are there any interesting surprises to be found in words for abstract concepts? Do old and universal enough cultural concepts (think “pottery”, “hunting technology”) behave as if they were core vocabulary? Etc, etc, time will tell.

[1] Of course, something like 90% of the time, “the simplest bit of maths” seems to be all that we have yet in linguistics. This is surely great news for people who are not professionals, but who want to follow linguistics arguments along from home; or for the career plans of people like myself, who know enough undergrad-level maths to craft a couple other elementary mathematical tools for testing this or that hypothesis, if necessary. On the other hand, it is a less than promising sign about the overall quantitative reliability of our field in general, so far…

On *ü in Mari vs. Proto-Uralic

It is always a low note of sorts when a scientific dispute gets resolved by quietly shifting consensus (e.g. due to proponents of one side passing away) rather than by actual discussion.

One of these seems to be the status of Proto-Uralic *ü. In literature up to about the mid-1900s, various skeptical viewpoints can be found on if a contrast between *i and *ü should be reconstructed or not. They dwindle away in later times however, with the modern researcher only really encountering any trace of the issue when perusing the UEW, which still provides proto-forms with *ü only as an alternative to proto-forms with *i. So far I have regardless been unable to locate any turning point source that argues in detail in favor of establishing *ü after all.

For sure, all major overviews of comparative Uralic vocalism (Steinitz 1944, Collinder 1960, Sammallahti 1988) still reconstruct contrastive front rounded *ü (or, in the case of Steinitz, largely equivalent reduced *ö̆), and give what they see as the regular later development in most individual languages. It is thus fairly simple to reverse-engineer a rough argument for in which cases to reconstruct *ü. Altogether, especially the following three contrasts appear to be relatively robust and in etymological correspondence to each other:

  • Finnic *i : *ü
  • Hungarian ë : ö
  • Khanty *e : *ö (perhaps rather *[ɪ] : *[ʏ])

Also the *i : *ɨ contrast in Permic correlates well with this (though *ɨ can also derive from PU *u and *ä).

Numerous further conditional developments, including also indirect traces in several Uralic languages that lack front rounded vowels, have also been identified. Collating these in one place would probably amount to an almost full answer to old skeptical viewpoints, which mostly have focused on the possibility that the contrasts seemingly pointing to *i : *ü have separately developed in each language.

I think one subgroup remains an open problem though. A phonetically equivalent contrast also appears in Mari, between *ĭ (> generally /ə/, in a couple of dialects /ɪ/ or /i/) and *ü̆ (> Hill Mari /ə̈ ~ ʏ/, Meadow Mari /y/). But this particular contrast seems to do a poor job at matching with the Proto-Uralic *i : *ü contrast, as could be reconstructed on the basis of the other languages. While reflexes with “correct” labiality seem to be in the lead, an abundance of counterexamples is also apparent: [1]

  • PU *i > Ma *ĭ: 14 cases
    *ićä ‘father’ > *ĭćä ‘older brother’, *kičək > *kĭčək ‘fresh snow’, *kirä- > *kĭre- ‘to hit’, *kiśkə- > *kĭške- ‘to throw’, *minä > *mĭńə ‘I’, *ńičkä- > *jĭčke- ‘to pluck’, *pićlä > *pĭćle ‘rowan’, *pilwə > *pĭl ‘cloud’, *pištä- > *pĭšte- ‘to put’, *pitä- > *pĭće- ‘to hold’, *śikšta (← II) > *šĭštə ‘beeswax’, *śilmä > *šĭnćä ‘eye’, *tinä > *tĭńə ‘thou’, *wittə > *wĭć ‘5’
  • PU *i > Ma *ü̆: 6 cases
    *kiwə > *kü ‘stone’, *piŋə > *pü ‘tooth’, *nimə > *lü̆m ‘name’, *śixələ > *šülə ‘hedgehog’, *šikšna (← Baltic) > *šü̆štə ‘strap’, *sitV- ‘to bind’ > *šüðəš ‘bind’
  • PU *ü > Ma *ĭ: 9 cases
    *küjə > *kĭškə ‘snake’, *külmä > *kĭlmə ‘cold’, *küńärä > *kĭńer ‘elbow’, *kütkə- > *kĭćke- ‘to harness’, *mükkä > *mĭk ‘mute’, *ńüktä- > *ńĭktä- ‘to pluck’, *süjə > *šĭjä ‘year ring’, *sükəśə > *šĭžə ‘autumn’, *śüklä (← Turkic) > *šĭɣəľə ‘wart’
  • PU *ü > Ma *ü̆: 12 cases
    *d₂ümä > *lü̆mə ‘glue’, *künčə > *kü̆č ‘nail’, *künčä- > *kü̆nče- ‘to dig’, *küsV > *kü̆žɣə ‘thick’, *kütV > *kü̆ðäl ‘middle’, *sülə > *šü̆lə ‘fathom’, *süskV- > *šü̆škä- ‘to cram’, *śüd₁ə > *šü ‘coal’, *śülkə > *šüwəl ‘spit’, *türə > *tü̆rəś ‘full’, [2] *tüŋə > *tü̆ŋ ‘base’, *wülä > *wü̆l- ‘over’

PU *e also mostly yields Ma *ĭ or *ü̆, again split fairly evenly.

  • PU *e > Ma *ĭ: 15 cases
    *e- > *ĭ- ‘negative verb’, *elä- > *ĭle- ‘to live’, *eštə- ‘to be in time’ > *ĭšte- ‘to do’, *jećə > *ĭške ‘self’, *jekä > *i ‘year’, *keltä- > *kĭlðe- ‘to bind’, *kenčV- > *kĭčälä- ‘to search’, *neljä > *nĭl ‘4’, *le- > *liä- ‘to be’, *leštə > *lĭštäš ‘leaf’, *peljä > *pĭləkš ‘ear’, *penä > *pi ‘dog’, *pesä > *pĭžäkš ‘nest’, *repäś (← II) > *rĭwəž ‘fox’, *śerV > *sĭr ‘character, nature’
  • PU *e > Ma *ü̆: 12 cases
    *jetV > *jü̆t ‘night’, *kejə- > *küä- ‘to boil’, *kerə > *kü̆r ‘bast’, *pečä > *pü̆nčə ‘pine’, *pečkV- > *pü̆čkä- ‘to cut’, *sesar (← IE) > *šü̆žar ‘sister’, *śečä > *čü̆čə ‘uncle’, *śepä > *šü ‘neck’, *tejnəš (← II) > *tü̆əž ‘pregnant’, *terä (← II) > *tü̆r ‘blade’, *werə > *wü̆r ‘blood’, *wetə > *wü̆t ‘water’

I have included here cases with Proto-Mari *i and *ü only in stems of the shape CV(V-), where the appearance of “full” rather than “reduced” vowels is regular. Some other examples exist as well though, such as *ik ‘one’ (< *ü?), *üpš ‘smell’ (< *i?).

Existing literature does not seem to tackle the issue, and often I get the feeling that authors essentially try to sweep the problem under the carpet. Sammallahti leaves the history of Mari vocalism untreated. Collinder offers, for the cases with *e > *ü̆, only the slightly ad hoc rule that this development occurs “in the vicinity of *w and *r”, while he does not comment on the cases with *i > *ü̆ or *ü > *ĭ. Steinitz’ approach posits a late development *ĭ > *ü̆ again in the vicinity of labial consonants (and raises the possibility that it applies only to Meadow Mari and not even Proto-Mari), but leaves the other cases untreated.

I have not seen any specialized studies that would have fared better either. E. Itkonen in his major 1954 article on the history of Mari and Permic vocalism even explicitly notes that labiality assimilations that he posits next to *w, *p, *r cannot be considered regular. Contrast indeed e.g. ‘blood’ (*we- > *wü̆-) vs. ‘five’ (*wi- > *wĭ-), ‘tooth’ (*pi- > *pü-) vs. ‘cloud’ (*pi- > *pĭ-), ‘blade’ (*-er- > *-ü̆r) vs. ‘to hit’ (*-ir- > *-ĭr-). — Also, since when is *r a labial consonant anyway?

I suspect that already the basic assumptions underlying earlier research on this are incorrect. Instead of the developments *i > *ü̆ and *ü > *ĭ being some kind of exception cases to be explained away, the old skeptic contingent has been right this time: the contrast between Proto-Mari *ĭ and *ü̆ is unrelated to the contrast between Proto-Uralic *i and *ü. Rather, PU *i, *ü and *e merged in the early history of Mari, and this merged phoneme (I will mark it simply as *i) later secondarily split into *i > *ĭ and *ü > *ü̆ again — without regard for its PU origins.

The best single conditioning factor instead appears to be stem type:

  • *i-ä > *ĭ: 23 cases
    *elä- > *ĭle-, *ićä > *ĭćä, *jekä > *i, *külmä > *kĭlmə, *keltä- > *kĭlðe-, *küńärä > *kĭńer, *kirä- > *kĭre-, *minä > *mĭńə, *mükkä > *mĭk, *neljä > *nĭl, *ńičkä- > *jĭčke-, *ńüktä- > *ńĭktä-, *pićlä > *pĭćle, *peljä > *pĭləkš, *penä > *pi, *pesä > *pĭžäkš, *pištä- > *pĭšte-, *pitä- > *pĭće-, *repäś > *rĭwəž, *śüklä > *šĭɣəľə, *śikšta > *šĭštə, *śilmä > *šĭnćä, *tinä > *tĭńə
  • *i-ä > *ü̆: 9 cases
    *d₂ümä > *lü̆mə, *künčä- > *kü̆nče-, *pečä > *pü̆nčə, *sesar > *šü̆žar, *śečä > *čü̆čə, *śepä > *šü, *šikšna > *šü̆štə, *terä > *tü̆r, *wülä > *wü̆l-
  • *i-ə > *ĭ: 11 cases
    *eštə- > *ĭšte-, *jećə > *ĭške, *kičək > *kĭčək, *küjə > *kĭškə, *kiśkə- > *kĭške-, *kütkə- > *kĭćke-, *leštə > *lĭštäš, *pilwə > *pĭl, *süjə > *šĭjä, *sükəśə > *šĭžə, *wittə > *wĭć
  • *i-ə > *ü̆: 15 cases
    *kejə- > *küä-, *künčə > *kü̆č, *kerə > *kü̆r, *kiwə > *kü, *nimə > *lü̆m, *piŋə > *pü, *sülə > *šü̆lə, *śüd₁ə > *šü, *śülkə > *šü̆wəl, *śixələ > *šülə, *tejnəš > *tüəž, *türə > *tü̆rəś, *tüŋə > *tü̆ŋ, *werə > *wü̆r, *wetə > *wü̆t
  • unclear/inapplicable > *ĭ: 4 cases
    *e- > *ĭ-, *kenčV- > *kĭčälä-, *le- > *liä-, *śerV > *sĭr
  • unclear > *ü̆: 6 cases
    *jetV > *jü̆t, *kütV > *kü̆ðäl, *küsV > *kü̆žɣə, *pečkV- > *pü̆čkä-, *süskV- > *šü̆škä-, *sitV- > *šüðəš

The raw accuracy of the maintenance hypothesis (*i > *ĭ, *ü > *ü̆) seems to be 26 cases predicted correctly out of 41 ≈ 63.4% (worse if we also wanted to presume *e > *ĭ). Assuming the typical reflexation to be *i-ä > *ĭ, *i-ə > *ü̆ instead reaches up to 38 correctly predicted out of 58 ≈ 65.5%. Which is so far only marginally better… But there is room for fine-tuning here as well.
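
Both tallies can be re-derived mechanically from the lists above. What follows is only a minimal Python sketch, not part of the original analysis: the group labels (`A` for *i-ä stems, `E` for *i-ə stems, `?` for unclear, `I` for *ĭ, `U` for *ü̆), the layout of `DATA` and the function names are shorthand introduced here for illustration. Hyphens on verb roots are dropped, and the first-syllable vowel of each PU form is simply assumed to be the first *i*, *ü* or *e* that appears in it.

```python
import unicodedata

# PU forms from the lists above, keyed by (stem type, Proto-Mari reflex).
# Stem type: "A" = *i-ä, "E" = *i-ə, "?" = unclear/inapplicable.
# Reflex:    "I" = unrounded *ĭ, "U" = rounded *ü̆.
DATA = {
    ("A", "I"): "elä ićä jekä külmä keltä küńärä kirä minä mükkä neljä ńičkä "
                "ńüktä pićlä peljä penä pesä pištä pitä repäś śüklä śikšta "
                "śilmä tinä",
    ("A", "U"): "d₂ümä künčä pečä sesar śečä śepä šikšna terä wülä",
    ("E", "I"): "eštə jećə kičək küjə kiśkə kütkə leštə pilwə süjə sükəśə wittə",
    ("E", "U"): "kejə künčə kerə kiwə nimə piŋə sülə śüd₁ə śülkə śixələ tejnəš "
                "türə tüŋə werə wetə",
    ("?", "I"): "e kenčV le śerV",
    ("?", "U"): "jetV kütV küsV pečkV süskV sitV",
}

# Normalise so that precomposed and decomposed "ü" compare equal.
VOWELS = unicodedata.normalize("NFC", "iüe")


def first_vowel(form):
    """Return the first of i / ü / e in a PU form (assumed first-syllable vowel)."""
    form = unicodedata.normalize("NFC", form)
    return next(ch for ch in form if ch in VOWELS)


def maintenance_score():
    """Score *i > *ĭ, *ü > *ü̆ over all groups, skipping PU *e."""
    hits = total = 0
    for (_, reflex), forms in DATA.items():
        for form in forms.split():
            vowel = first_vowel(form)
            if vowel == "e":
                continue
            total += 1
            # Correct when *i pairs with "I", or *ü pairs with "U".
            hits += (vowel == "i") == (reflex == "I")
    return hits, total


def stem_type_score():
    """Score *i-ä > *ĭ, *i-ə > *ü̆ over the groups with a known stem type."""
    hits = total = 0
    for (stem, reflex), forms in DATA.items():
        if stem == "?":
            continue
        count = len(forms.split())
        total += count
        if (stem, reflex) in {("A", "I"), ("E", "U")}:
            hits += count
    return hits, total


for name, (hits, total) in [("maintenance", maintenance_score()),
                            ("stem type", stem_type_score())]:
    print(f"{name}: {hits}/{total} = {100 * hits / total:.1f}%")
```

Keeping the data keyed by group means a reassignment of stem type only requires moving a form from one string to another before re-running the tally.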

Some of the apparent exceptions in verb roots can be readily interpreted as indicating a shift of stem type in pre-Mari. *ĭšte- ‘to do’, *kĭške- ‘to throw’ and *kĭćke- ‘to harness’ (in red above) show 2nd syllable *e, which normally corresponds with PU *A-stem verbs; thus I would reconstruct pre-Mari *ist-ä-, *kiśk-ä- and *kitk-ä-. Here *-ä- is probably some kind of transitivizing suffix, well known in Mari (the classic example being /koða-/ ‘to stay’ : /koð-e-/ ‘to leave’) and likely dating to earlier times already (reconstructible in a small number of PU doublets such as *künčə ‘nail’ ~ *künč-ä- ‘to plough/dig’; *ipsə ‘smell’ ~ *ips-ä- ‘to smell’). We could also take the final *-e, rather rare in nominals, of *ĭške ‘self’ as grounds to reconstruct pre-Mari *(j)iś-kä.

Similarly, *pü̆čkä- ‘to cut’, *šü̆škä- ‘to cram’ (in blue above) show 2nd syllable *ä, which normally corresponds with PU *ə-stems; and therefore I would reconstruct pre-Mari *pičkə-, *siskə-. The former thus turns out better comparable with Mordvinic *pečkə- ‘to cut’ than with Samic *peackē- ‘to cut (off)’ (< *pečk-ä-), and the latter with Samic *sëskë- ‘to rub against’ than with Fi. sysä-, Es. süska- ‘to push into’.

(This on the other hand creates new problems for *kĭčälä- ‘to search’, *liä- ‘to be’, *ńĭktä- ‘to pluck’, which now start pointing to earlier *ə-stems…)

I would also take *kü̆žɣə ‘thick’ (also in blue) as pointing to earlier *kizəgV < *küsəkV (akin to Proto-Samic *kësëkV > Northern Sami gassat etc.), rather than the bare root *küsä that most sources report. Perhaps even *kĭškə ‘snake’ should be taken as pointing to PU *küjəwä (> Erzya /kijov/, Hung. kígyó, Smy. *kiwä) > pre-Mari *kiwä(-skV) rather than the bare root *küjə (> PF *küü, Udm. /kɨj/ [3]).

Nominal derivation phenomena could lie behind some of the other exceptions as well, though due to the non-maintenance of the PU stem vowel contrasts in Mari nominals, this will have to be more speculative. For example, Finnic *kidek ‘snowflake’ has a number of parallel derivatives etc. in the descendant languages, and the original root may well have been *kičä rather than *kičə. It would also be possible to assume PU *kičäk, and date the development *-Ak > *-Ek (as seen in cases such as Fi. jauha- ‘to grind’ ~ jauhe ‘powder’; jättä- ‘to leave behind’ ~ jäte ‘trash’) as inner-Finnic.

Consonant environment conditioning does not need to be ruled out entirely either. E.g. *šü ‘neck’ could be taken back to pre-Mari *siw(ä), and *šĭjä ‘year ring’ to pre-Mari *sijə, with the natural developments *iw > *ü̆ and *ij > *ĭ bleeding the usual stem type conditioning. (This also provides another possible line of explanation for ‘snake’.) The latter rule could even be generalized slightly to also capture *wĭć ‘5’.

The phonetics of this hypothesis do not have to be left arbitrary either: a kind of palatal umlaut mechanism seems to work. The root structure *i-ä > *ĭ(-e) remains consistently front-vocalic and illabial; while the root structure *i-ə would probably have been first retracted to something like *[ɨ]-[ə]. After this, I would suppose central *ɨ was labialized to [ʉ], and then re-fronted > [y] > [ʏ]. This development appears internally unmotivated (it could possibly be attributed to areal influence from Turkic) — but it has a good precedent in the fact that Mari is the only Uralic language with a front rounded reflex of PU *ë, for which we must then reconstruct the exactly parallel development [ɤ~ɜ] > [ɵ] > [ø] > [y].

Later vowel harmony between /a ~ ä/, as attested in Hill Mari (but not Meadow Mari), was likely not yet in effect by this stage. This appears to be shown by the straggling cases of Proto-Mari *ĭ-ä: where *ĭ is further reduced and retracted to /ə/ in Hill Mari, the stem vowel surfaces as /a/, not as /ä/. Cf. e.g. /kəčala-/ ‘to search’, /ńəkta-/ ‘to skin’, /šəja/ ‘year ring’.

[1] This selection has been datamined from both older and newer literature. Individual referencing would go beyond the purposes of this blog post. Various dubious or difficult-to-reconstruct comparisons have been omitted, including e.g. most cases where some or most other reflexes point to original *ä rather than *e.
[2] To my knowledge, this comparison has not been previously presented, though it seems self-evident. The identity of the “suffix” is unclear to me however.
[3] Even this might derive from the longer form *küjəwä: contrast *süjə > /si/ ‘year ring’. Perhaps thus: *süjə > *süj > *si, but *küjəwä > *küjə > *kɨj?

